I am currently working for an author of the “Ecoset” Dataset, trying to bring it to HF datasets.
Ecoset was created as a clean and ecologically valid alternative to ImageNet. It is a large image recognition dataset, similar to ImageNet in size and structure. The authors of Ecoset claim several improvements over ImageNet, such as:
- more ecologically valid classes (e.g. not over-focused on overly specific categories like dog breeds)
- less NSFW content
- ‘pre-packed image recognition models’ that come with the dataset and can be used for validation of other models.
Ecoset was published in this paper, described on this website, and hosted in this bucket.
I see that theoretically, HF datasets supports most of the functionality we would like to have for this dataset, like:
- loading image (.jpeg) files
- loading datasets hosted from an Amazon S3 bucket
- flexibly loading/streaming parts of the dataset
- custom loading functions (e.g. allowing to pass a password in order to extract the files)
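On the password point, this is roughly what I had in mind: Python's built-in `zipfile` can read password-protected members directly, so a loading script could accept the password as a config parameter. A minimal sketch (the function name is my own, and note that `zipfile` only handles the legacy ZipCrypto scheme, not AES-encrypted archives):

```python
import zipfile

def read_zip_member(zip_path, member, password=None):
    # Read a single member from a (possibly password-protected) archive
    # without extracting anything else to disk.
    # NOTE: zipfile only supports the legacy ZipCrypto scheme; AES-encrypted
    # archives would need a third-party reader such as pyzipper.
    with zipfile.ZipFile(zip_path) as zf:
        return zf.read(member, pwd=password)
```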
However, I am not sure how well all of this works in practice when put together for one large dataset. Ecoset is about the size of full ImageNet, and consequently probably one of the largest image datasets currently on HF. I’ve seen that @lhoestq, @mariosasko and other contributors are currently working on an imagenet-1k implementation, which has a very similar structure to Ecoset. Still, I am not sure whether our idea of Ecoset can realistically be implemented on HF datasets at the moment, and if it can, what the best overall strategy would be. More specifically, I have the following questions:
- The project is hosted in an AWS S3 bucket; however, the entire dataset is zipped up in one single .zip file. I am therefore not sure whether this dataset can even benefit from the available functions for reading S3-hosted datasets. Should we still load the dataset from there, or should we instead ask users to manually download the dataset, unpack it, and then load it locally?
- Given that the data is packed in a single .zip file, does streaming the dataset even make sense here?
- The data is structured similarly to, but not exactly like, ImageNet. It looks somewhat like this:
```
├── train
│   ├── 0001_house
│   │   ├── n07731952_10097.jpg
│   │   ├── n07731952_101419.jpg
│   ├── 0002_woman
│   │   ...
├── val
│   ├── 0001_house
│   │   ...
├── test
```
Given this structure, what would be the best way to create a dataset from it? I’ve seen this thread, where a similar dataset is loaded by creating multiple small datasets and then joining them. Should I do the same, but once per split × image category folder?
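Regarding the single-zip question, my current understanding is that members can be read one at a time instead of unpacking the whole archive up front, as long as the archive is available locally (or via a filesystem with random access, which `fsspec` provides over HTTP/S3). A local-file sketch using only the standard library (function name and split layout assumed from the tree above):

```python
import zipfile

def iter_split_images(zip_path, split, ext=".jpg"):
    # Walk the archive's member list and yield matching images one at a
    # time, instead of extracting the whole archive to disk first.
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.startswith(split + "/") and name.endswith(ext):
                with zf.open(name) as member:
                    yield name, member.read()
```

Whether this counts as "real" streaming over the network, or whether the archive would have to be re-packaged (e.g. into per-split shards) for that, is exactly what I am unsure about.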
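For the split × category structure, my hope was that a single generator walking `<split>/<NNNN_label>/` and deriving the label from the folder name would be enough, rather than building and joining many small datasets. In a real loading script this logic would presumably live in `_generate_examples` of a `datasets.GeneratorBasedBuilder`; here is a dependency-free sketch (folder naming assumed from the tree above):

```python
import os

def generate_examples(split_dir):
    # Yield (key, example) pairs from a split laid out as
    # <split>/<NNNN_label>/<image>.jpg, deriving the class label
    # from the folder name (e.g. "0001_house" -> "house").
    key = 0
    for class_folder in sorted(os.listdir(split_dir)):
        label = class_folder.split("_", 1)[1]  # drop the numeric prefix
        class_path = os.path.join(split_dir, class_folder)
        for fname in sorted(os.listdir(class_path)):
            if fname.endswith(".jpg"):
                yield key, {
                    "image": os.path.join(class_path, fname),
                    "label": label,
                }
                key += 1
```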
Would love to get some advice from the more experienced dataset users and devs, as this is quite a large (and hopefully useful) dataset, and we want to do this properly if possible.