How to build a dataset using S3 URIs

I’m getting started with Hugging Face Datasets to build an object detection model.

The dataset is in COCO format, and my attempts at building it with a loading script have failed!

The COCO dataset references each image by an S3 path (s3://bucket_name/path/to/image.jpeg), which should be used to lazily download the images. The following tutorial seemed pretty easy to adapt to this use case: Create an image dataset.

The test command fails with this error:

    self.session = aiobotocore.session.AioSession(**self.kwargs)
TypeError: AioSession.__init__() got an unexpected keyword argument 'hf'

I’ve also tried leveraging load_dataset_builder as described in the Cloud storage tutorial, but the storage_options still contain that hf key, which shouldn’t be there!
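For reference, this is roughly what I tried (a sketch; the profile name, script path, and output location are placeholders for my actual setup):

    import aiobotocore.session
    from datasets import load_dataset_builder

    # placeholder profile name and paths
    s3_session = aiobotocore.session.AioSession(profile="my_profile")
    storage_options = {"session": s3_session}

    builder = load_dataset_builder("coco_dataset.py", storage_options=storage_options)
    builder.download_and_prepare(
        "s3://bucket_name/processed", storage_options=storage_options, file_format="parquet"
    )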

Here’s the solution I came up with, which doesn’t seem to work. Using the dl_manager in the _generate_examples method seems odd, but how else can the files be lazily loaded if they are downloaded in the split generator?

from typing import List, Dict, Tuple
import json

import datasets
from PIL import Image as PILImage

class CocoDataset(datasets.GeneratorBasedBuilder):

    def __init__(self, **kwargs):
        # Parse the COCO annotation file up front because _info() needs the category names
        self.dataset_path = "data/coco_dataset.json"
        self._init_coco_dataset()
        super().__init__(version=datasets.Version("1.0.0"), **kwargs)

    def _init_coco_dataset(self):
        with open(self.dataset_path, "r") as file:
            dataset_dict = json.load(file)
        if "info" not in dataset_dict:
            self.coco_info = None
        elif isinstance(dataset_dict["info"], dict):
            self.coco_info = dataset_dict["info"]
        else:
            raise ValueError("Invalid COCO dataset info")

        self.dataset = dataset_dict
        self.id2cat = {cat["id"]: cat["name"] for cat in self.dataset["categories"]}

    def _info(self):
        category_names = [category["name"] for category in self.dataset["categories"]]
        return datasets.DatasetInfo(
            description=self.coco_info.get("description", "") if self.coco_info else "",
            homepage=self.coco_info.get("url", "") if self.coco_info else "",
            version="1.0.0",
            features=datasets.Features(
                {
                    "image": datasets.Image(),
                    "image_id": datasets.Value(dtype="int64"),
                    "file_path": datasets.Value(dtype="string"),
                    "width": datasets.Value(dtype="int64"),
                    "height": datasets.Value(dtype="int64"),
                    "objects": {
                        "bbox": datasets.Sequence(
                            feature=datasets.Sequence(feature=datasets.Value(dtype="int64"), length=4)
                        ),
                        "category": datasets.Sequence(datasets.ClassLabel(names=category_names)),
                    },
                }
            ),
        )

    def _parse_http_to_s3(self, http_url: str) -> str:
        bucket = http_url.replace("https://", "").split(".s3.")[0]
        object_key = http_url.split(".amazonaws.com/")[-1]
        return f"s3://{bucket}/{object_key}"

    def _split_generators(self, dl_manager):
        samples = dict()
        for image in self.dataset["images"]:
            metadata = {
                "s3_uri": self._parse_http_to_s3(image["coco_url"]),
                "file_name": image["file_name"],
                "width": image["width"],
                "height": image["height"],
            }
            samples[image["id"]] = metadata

        for ann in self.dataset["annotations"]:
            im_id = ann["image_id"]
            anns = samples[im_id].get("annotation", list())
            anns.append(ann)
            samples[im_id]["annotation"] = anns
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"dl_manager": dl_manager, "data": samples},
            )
        ]

    def _generate_examples(self, dl_manager, data):
        """Generate images and labels for splits."""
        for image_id, metadata in data.items():
            file_path = dl_manager.download(metadata["s3_uri"])
            labels = []
            bboxes = []
            for ann in metadata.get("annotation", list()):
                labels.append(self.id2cat[ann["category_id"]])
                bboxes.append(ann["bbox"])

            features = {
                "image_id": image_id,
                "image": PILImage.open(file_path),
                "file_path": file_path,
                "width": metadata["width"],
                "height": metadata["height"],
                "objects": {"bbox": bboxes, "category": labels},
            }
            yield image_id, features



I didn’t solve it, but I found out that:

  • if the images are downloaded in _split_generators, the download manager uses the correct storage_options. However, all the images are then downloaded up front, of course.
  • instantiating a new download manager with the right config in _generate_examples works too:
    def _generate_examples(self, dl_config, data):
        """Generate images and labels for splits."""
        # requires `import aiobotocore.session` and `import datasets` at module level
        s3_session = aiobotocore.session.AioSession()
        storage_options = {"session": s3_session}
        dl_config.storage_options = storage_options
        dl_manager = datasets.DownloadManager(download_config=dl_config)
        for image_id, metadata in data.items():
            ...

Is there a way to leverage streaming for this, somehow?

Hi! All the files must be downloaded during _split_generators(), and then you can pass a dictionary of {image_id: image_path} to _generate_examples() via the gen_kwargs.
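Something like this, for example (a rough sketch based on your builder above; the image_paths name is just an example):

    def _split_generators(self, dl_manager):
        # map image_id -> S3 URI and let the download manager fetch everything
        s3_uris = {
            image["id"]: self._parse_http_to_s3(image["coco_url"])
            for image in self.dataset["images"]
        }
        # download() accepts a dict and returns a dict with the same keys
        image_paths = dl_manager.download(s3_uris)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"image_paths": image_paths},
            )
        ]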

And in streaming mode all the downloads are lazy :wink: In that case image_path contains the URLs instead of the local paths to the images. Since we extend open() to stream remote files, in _generate_examples you can open() a remote file and pass it to PILImage.
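For example (a sketch; image.load() makes PIL read the data while the streamed file is still open):

    def _generate_examples(self, image_paths):
        for image_id, path in image_paths.items():
            # under streaming, `path` is a URL and open() is extended to stream it
            with open(path, "rb") as f:
                image = PILImage.open(f)
                image.load()  # force PIL to read the bytes before the file is closed
            # width/height/objects omitted here; fill them in from your metadata as before
            yield image_id, {"image_id": image_id, "image": image, "file_path": path}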

Thanks. I realised that _split_generators() needs to download all images. I tried playing around with iter_paths() and custom_download(), trying to get a behaviour similar to iter_archive() and its generators, but got nowhere.

Good, streaming is indeed what I’m after then! I’m not quite sure how I could iterate over my list of S3 paths though, because each one would need its own open(). I understand that iter_archive() works because it opens the archive once and reads/decompresses chunks. But what happens if you want to open a list of URLs? Would I need to implement my own iter_urls that yields a file descriptor (fd)? And have that fd as an input param to _generate_examples(fd, metadata)?

You can pass a list of URLs to download() and it will return a list :slight_smile: Then pass the list of local paths (that are actually URLs when streaming mode is enabled) to _generate_examples(). There you can open() a file when needed to yield its content with the rest of its metadata.
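For instance, something along these lines (a rough sketch with placeholder gen_kwargs names; you’d fill in the objects from your annotations as in your original script):

    def _split_generators(self, dl_manager):
        images = self.dataset["images"]
        urls = [self._parse_http_to_s3(img["coco_url"]) for img in images]
        # download() keeps the order, so paths[i] corresponds to images[i]
        paths = dl_manager.download(urls)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"paths": paths, "images": images},
            )
        ]

    def _generate_examples(self, paths, images):
        for path, image in zip(paths, images):
            # open() streams the file from S3 when streaming mode is enabled
            with open(path, "rb") as f:
                data = f.read()
            yield image["id"], {
                "image_id": image["id"],
                "image": {"path": path, "bytes": data},  # the Image() feature accepts bytes
                "file_path": path,
                "width": image["width"],
                "height": image["height"],
                "objects": {"bbox": [], "category": []},  # fill from your annotations as before
            }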