How to structure an image dataset repo using the image folder approach?

Hi fellows!

I am playing with creating an image dataset on the Hugging Face Hub, and I have some questions about the dataset repository/workflow:

  1. Using the image folder approach, we can create and load the dataset without a dataset loading script. What are the advantages of adding one to the repo?
  2. In the documentation linked above, the repository structure for the image folder approach is folder/train/image_name.ext. If we want to add other dataset subsets, such as validation and test, do we use folder/data-subset/image_name.ext? Yes, it turns out that using folder/data-subset/image_name.ext makes it possible to add different data subsets.
  3. Is there any option to incorporate additional information about the images apart from JSON Lines files? Is it possible to use a CSV file, or more than one? (e.g. the images are from a card game, and other relevant information includes the skills, card text, artist, etc.)
  4. If the data needs some pre-processing to standardize the image dimensions, is it best practice to include this transformation in the dataset repo (I guess in the loading script)? Or is the idea that the data should be ready to go in the dataset repository?

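For concreteness, here is a sketch of the layout I have in mind for question 2; the repository name, split directory names, and file names are my own placeholders:

```shell
# Hypothetical imagefolder layout with three splits (names are illustrative).
mkdir -p card_repo/train card_repo/validation card_repo/test
touch card_repo/train/card_000.png
touch card_repo/validation/card_100.png
touch card_repo/test/card_200.png
```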

Hi! Responses to your questions:

  1. The main advantage of the image folder approach is that you can load a dataset without writing a loading script for it. A loading script, on the other hand, gives you full control over how splits are defined and how the data is parsed and processed.
  2. You answered this one correctly yourself :slight_smile: .
  3. We don’t currently support CSV metadata files, but it shouldn’t be hard to add support. Yes, you can have more than one metadata file (they can even be nested if the data files are stored in nested directories inside the repo), as long as they share the same set of features (the same column names and types).
  4. One option is to add a code snippet with the preprocessing logic to the dataset README. Another one is to use the loading script approach and include this logic in the script.