How to structure an image dataset repo using the image folder approach?

Hi fellows!

I am playing with creating an image dataset on the Hugging Face Hub, and I have some questions related to the dataset repository/workflow:

  1. Using the image folder approach, we can create and load the dataset without a dataset loading script. What are the advantages of adding one to the repo?
  2. In the above documentation link, the repository structure for the image folder approach is folder/train/image_name.ext. If we want to add other dataset subsets, such as validation and testing, do we use folder/data-subset/image_name.ext? (Answering my own question: yes, using folder/data-subset/image_name.ext it's possible to add different data subsets.)
  3. Is there any option to incorporate additional information about the images apart from using JSON Lines files? Is it possible to use a CSV file? Or more than one? (e.g. the images are from a card game, and other relevant information includes the skills, card text, artist, etc.)
  4. If the data needs some pre-processing to standardize the image dimensions, is it best practice to include this transformation in the dataset repo (I guess in the loading script)? Or is the idea that the data is ready to go in the dataset repository?

Thanks!

Hi! Responses to your questions:

  1. The main advantage is that you can load a dataset without writing a loading script for it.
  2. You answered this one correctly for yourself :slight_smile: .
  3. We don’t currently support CSV, but it shouldn’t be hard to add support for it. Yes, you can have more than one metadata file (and they can even be nested if data files are stored in nested dirs inside a repo) as long as they share the same set of features (same column names and types).
  4. One option is to add a code snippet with the preprocessing logic to the dataset README. Another one is to use the loading script approach and include this logic in the script.
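To illustrate the README-snippet option for the pre-processing question: here is a minimal sketch of a resize transform that could be documented alongside the dataset. The 224x224 target size is an arbitrary assumption for the example; with 🤗 Datasets, such a function would typically be applied via `ds.map`.

```python
from PIL import Image  # Pillow, which 🤗 Datasets uses for its Image feature

def standardize(example):
    # Resize the example's image to a fixed (hypothetical) 224x224 resolution.
    example["image"] = example["image"].resize((224, 224))
    return example

# With a loaded dataset `ds`, this could be applied per example:
# ds = ds.map(standardize)
```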
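As a sketch of the metadata point: each metadata file is a JSON Lines file with a `file_name` column matching the image file names, plus whatever extra columns you need. The card fields and file names below are made up for illustration.

```python
import json

# Hypothetical card metadata; `file_name` must match the image file names
# relative to the metadata file's directory.
rows = [
    {"file_name": "card_001.png", "artist": "A. Smith", "skills": "Flying", "card_text": "Draw a card."},
    {"file_name": "card_002.png", "artist": "B. Jones", "skills": "Haste", "card_text": "Deal 2 damage."},
]

# Every metadata file in the repo must share this same set of columns and types.
with open("metadata.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```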

Hi!

Thanks for your detailed answer and the time.

I am just wondering if there is some rule of thumb about how many images the ImageFolder approach is suitable for. Currently, I am curating a dataset with 1.5k images, and I noticed that load_dataset() takes a lot of time (~5 minutes).

From this forum discussion about image dataset best practices, I know that ImageFolder is highly inefficient for data streaming. Still, I don't know if the same holds for loading the dataset. Is it possible to tar the folder structure to speed up data loading? If so, does it require a custom loading script?

best,
Cristóbal

TAR will be supported at some point, I think; maybe @mariosasko knows better.

Hi again!

I am just wondering if there is some rule of thumb about how many images the ImageFolder approach is suitable for. Currently, I am curating a dataset with 1.5k images, and I noticed that load_dataset() takes a lot of time (~5 minutes).

What version of datasets are you using? If you can paste the stack trace you get by interrupting (CTRL + C) the loading process while waiting for it to finish, that would also be helpful.

From this forum discussion about image dataset best practices, I know that ImageFolder is highly inefficient for data streaming. Still, I don't know if the same holds for loading the dataset. Is it possible to tar the folder structure to speed up data loading? If so, does it require a custom loading script?

Loading from archives skips the globbing step that fetches all the image files, making the loading process faster. TAR archives are not currently supported (meaning it requires a custom loading script), but we are working on it.

I had a somewhat related question on ImageFolder. I know it infers integer class labels from the directory names, but I don't see where the integer-to-string mapping is stored. For example, below I use the cats-dogs example, but I cannot tell if 0 = cat or dog!

from datasets import load_dataset

url = "https://ml.machinelearningnuggets.com/train.zip" # cats dogs

#url = "https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip"

ds = load_dataset("imagefolder", data_files=url, split="train")

print(ds[0])

# {'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=121x121 at 0x7FC9CA5F7460>, 'label': 0}

Sorry, I just realized that the URL I sent points to a zip file where all the cats and dogs are in the same folder, so HF always uses label=0. After I partition the data into train/cats and train/dogs, I see labels 0 or 1, as desired. But the mapping is still ambiguous (of course, I can figure it out by visualizing the image, so this is more of a feature request than a how-to question :slight_smile:).

I just found the answer in this blog post (Image search with 🤗 datasets): use

dataset.features['label']
# ClassLabel(num_classes=2, names=['cats', 'dogs'], id=None)

which tells me that 0=cats, 1=dogs.
Maybe this could be added to the documentation?