Handling large image datasets

Hey everyone,

I am currently working for an author of the “Ecoset” Dataset, trying to bring it to HF datasets.

Ecoset was created as a clean and ecologically valid alternative to Imagenet. It is a large image recognition dataset, similar to Imagenet in size and structure. The authors of ecoset claim several improvements over Imagenet, like:

  • more ecologically valid classes (e.g. not over-focussed on distinguishing overly specific categories like dog breeds)
  • less NSFW content
  • ‘pre-packed image recognition models’ that come with the dataset and can be used for validation of other models.

Ecoset was published in this paper, described on this website, and hosted in this bucket.

I see that theoretically, HF datasets supports most of the functionality we would like to have for this dataset, like:

  • loading image (.jpeg) files
  • loading datasets hosted from an Amazon S3 bucket
  • flexibly loading/streaming parts of the dataset
  • custom loading functions (e.g. allowing to pass a password in order to extract the files)

However, I am not sure how well all of this works in practice when put together for one large dataset. Ecoset is about the size of full Imagenet, and consequently probably one of the largest image datasets currently on HF. I’ve seen that @lhoestq, @mariosasko and other contributors are currently working on an imagenet-1k implementation, which has a very similar structure to Ecoset. Still, I am not sure whether our idea of Ecoset can realistically be implemented on HF datasets at the moment, and if it can, what the best overall strategy would be. More specifically, I have the following questions:

  1. The project is hosted in an AWS S3 bucket; however, the entire dataset is zipped up in one single .zip file. Therefore, I am not sure whether this dataset can even benefit from the functions datasets provides for reading from S3. Should we still load the dataset from there, or should we instead ask users to manually download the dataset, unpack it, and then load it locally?
  2. Given that the data is packed in a single .zip file, does streaming the dataset even make sense here?
  3. The data is structured similarly to, but not exactly like, Imagenet. It looks somewhat like this:
├── train
│   ├──0001_house
│   │   ├──n07731952_10097.jpg
│   │   ├──n07731952_101419.jpg
│   ├──0002_woman
│   │   ...
├── val
│   ├──0001_house
│   │   ...
├── test

Given this structure, what would be the best way to create a dataset from it? I’ve seen this thread, where a similar dataset is loaded by creating multiple small datasets and then joining them. Should I do the same, but for each split and each image category folder?

Would love to get some advice from the more advanced dataset users and devs, as this is quite a large (and hopefully useful) dataset, and we want to do this properly if possible.

cc also @sasha who mentioned recently she was working on something similar

I am definitely interested in following this conversation! We are aiming to do something similar with data from LILA soon :hugs:

This is cool! Any updates on the progress? We are trying this for a smaller subset of the data.

Hi! This dataset seems super useful!

Answers to your questions:

  1. We try to avoid manual downloads whenever possible, as they are not practical for end users (our lib’s goal is to make a dataset downloadable with a single line of code). Also, one suggestion regarding the download URL: our download manager expects a direct download URL, so use the following URL to fetch data from Ecoset’s bucket: https://s3.amazonaws.com/codeocean-datasets/0ab003f4-ff2d-4de3-b4f8-b6e349c0e5e5/ecoset.zip
  2. Yes, it does. Essentially, the goal of streaming is to avoid downloading a dataset locally to save space, and this dataset is pretty big, so being able to stream it would be very beneficial.
  3. Since the dataset doesn’t follow the exact image folder structure, I suggest writing a loading script rather than doing something similar to what I did in the linked discussion for Tiny ImageNet, to hide the “preprocessing complexity” from users (the code there is not trivial).
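
To make the loading-script suggestion a bit more concrete, here is a minimal sketch of the example-generation logic such a script could wrap. It uses only Python's standard library, not the datasets API, and all names in it (`parse_class_dir`, `generate_examples`, the `class_index` field) are hypothetical. It assumes the archive contains `train/`, `val/` and `test/` directly at its root, as in the tree posted above, and that `password` is only needed if the archive is encrypted. It reads one member at a time instead of extracting the whole archive; reading from a remote URL would additionally need a file-like object that supports range reads, which is not shown here.

```python
import zipfile
from pathlib import PurePosixPath


def parse_class_dir(dirname):
    """Split a folder name such as '0001_house' into (1, 'house')."""
    index, _, label = dirname.partition("_")
    return int(index), label


def generate_examples(zip_path, split, password=None):
    """Yield (key, example) pairs for one split, reading members one by one.

    `password` must be bytes (or None for unencrypted archives).
    """
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            parts = PurePosixPath(info.filename).parts
            # Assumed layout: <split>/<NNNN_label>/<image file>
            if info.is_dir() or len(parts) != 3 or parts[0] != split:
                continue
            if not parts[2].lower().endswith((".jpg", ".jpeg")):
                continue
            class_index, label = parse_class_dir(parts[1])
            with archive.open(info, pwd=password) as member:
                image_bytes = member.read()
            # The member path is unique within the archive, so it serves as key.
            yield info.filename, {
                "image": {"path": info.filename, "bytes": image_bytes},
                "label": label,
                "class_index": class_index,
            }
```

In a real loading script, a generator like this would be the body of the method that produces examples for each split, with the split name passed in from the split definitions.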

Thank you very much for the help!

I’ll try to implement it like this. I hope it’s okay if I raise some questions here if I encounter any problems :slight_smile:

Okay, so I managed to write a basic dataset loading script. As everything seems to work fine, I’m thinking of how to make stuff more efficient. Concerning this, I have 2 questions:

  1. Streaming the dataset is not yet implemented but sounds really cool. Are there any tutorials or other resources on how to integrate streaming into a dataset loader?
  2. Does anybody know how well this works in combination with zip files (i.e. only reading in a subset of a remotely stored dataset)?

Thanks for the help so far, I would really appreciate any further suggestions :slight_smile:

Cool!

  1. You can find an example with streaming + DataLoader here: Stream
  2. TAR format with one archive for each split is probably more efficient for images than ZIP. Additionally, it makes sense to shard the archives and pass a list of them in gen_kwargs, since we parallelize loading over shards when num_workers > 1. Let me know if you need help with that. Perhaps we can benchmark these approaches to find the best one.
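
For anyone who wants to try the sharded-TAR idea, here is a rough sketch using only Python's `tarfile` module. The function names, the `<split>-<index>.tar` naming scheme and the `shard_size` parameter are made up for illustration; nothing here is part of the datasets library. It packs `(archive_name, bytes)` pairs into fixed-size shards and then reads a shard back as a forward-only stream, which is the access pattern that makes TAR convenient for streaming.

```python
import io
import tarfile
from pathlib import Path


def write_tar_shards(files, out_dir, split, shard_size):
    """Pack (archive_name, data) pairs into TAR shards of up to `shard_size` files.

    Returns the list of shard paths in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    for start in range(0, len(files), shard_size):
        shard_path = out_dir / f"{split}-{len(shard_paths):05d}.tar"
        with tarfile.open(shard_path, "w") as tar:
            for name, data in files[start:start + shard_size]:
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
        shard_paths.append(str(shard_path))
    return shard_paths


def iter_shard(shard_path):
    """Yield (name, data) for each file in one shard, in archive order."""
    # Mode "r|" treats the archive as a non-seekable stream of tar blocks.
    with tarfile.open(shard_path, "r|") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()
```

The list returned by `write_tar_shards` (or the corresponding list of URLs once the shards are uploaded) is the kind of list that would be passed in gen_kwargs, so that each worker can be handed its own subset of shards.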