Proper way of preparing dataset with images

MKZuziak · July 31, 2024, 10:05pm

I would like to ask how to properly format and prepare an image dataset before uploading it to HuggingFace’s Datasets Hub. I’ve consulted the documentation and an image folder template, but it is still not 100% clear to me how I should structure the repository. I’ve also looked at some popular datasets like mnist, but that did not help much, since they load the data through an external script.

I have a dataset of 20k+ images in .jpeg format. There are eight classes. Image-class membership is described in a .csv file named metadata.csv, as advised in the tutorial. The csv file contains nine columns: file_name (which matches the .jpg names) and one column per class, which together form a one-hot encoding.
So far, I have tried to structure the repository as follows:

-my_dataset
--train (zip)
---IMG0001.jpg
---IMG0002.jpg
...
---IMG9999.jpg
--metadata.csv
--README.md

However, this layout raised an error when I tried to load the dataset back from the Hub using the datasets.load_dataset method.
After consulting the forum, I moved metadata.csv into the archived train file. However, this does not seem to be correct either.
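
One thing that can be ruled out locally before uploading is a mismatch between metadata.csv and the image files. Below is a minimal sketch, assuming metadata.csv sits in the same folder as the images and has a file_name column as described above. The function name find_mismatches is made up for illustration and is not part of the datasets library.

```python
import csv
from pathlib import Path


def find_mismatches(image_dir, metadata_name="metadata.csv"):
    """Compare file_name entries in the metadata CSV with the images on disk.

    Returns a tuple (missing_on_disk, missing_in_csv) of sorted file names.
    Assumes the CSV lives inside image_dir and has a 'file_name' column.
    """
    image_dir = Path(image_dir)

    # Names listed in the metadata file.
    with open(image_dir / metadata_name, newline="") as f:
        listed = {row["file_name"] for row in csv.DictReader(f)}

    # Names of the JPEG files actually present in the folder.
    on_disk = {
        p.name
        for p in image_dir.iterdir()
        if p.suffix.lower() in {".jpg", ".jpeg"}
    }

    return sorted(listed - on_disk), sorted(on_disk - listed)
```

If both returned lists are empty, every row in metadata.csv has a matching image and vice versa, so the loading error would have to come from somewhere else (for example, the repository layout itself).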

According to the template, every .jpg file should be placed in its own directory. The new structure would look like this:

-my_dataset
--train (csv)
---metadata.csv
---train
----IMG0001
-----IMG0001.jpg
----IMG0002
-----IMG0002.jpg
...
----IMG9999
-----IMG9999.jpg

However, I do not fully understand where the convention of placing each jpg inside its own nested directory comes from. Is this the proper way to upload an (archived) image dataset to the HF Datasets Hub? Secondly, should I replace the one-hot encoding in the .csv file with a single column that assigns each sample to one class by a digit?
I would be very grateful for your help with this, as the documentation (at least for me) is a little ambiguous when it comes to uploading image datasets.
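
For reference, the one-hot to single-column conversion mentioned above would be small. This is only a sketch under stated assumptions: the input has a file_name column plus one 0/1 column per class, each row has exactly one class set, and the output column is called label (a name chosen here for illustration). Whether the Hub loader actually needs this format is part of the question.

```python
import csv


def one_hot_to_label(in_path, out_path, name_column="file_name"):
    """Rewrite a one-hot metadata CSV as two columns: file_name,label.

    The label is the zero-based position of the active class column,
    in the order the class columns appear in the input header.
    """
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        class_columns = [c for c in reader.fieldnames if c != name_column]

        writer = csv.writer(dst)
        writer.writerow([name_column, "label"])

        for row in reader:
            # Indices of the class columns that are set to 1 for this image.
            hot = [
                i for i, c in enumerate(class_columns)
                if row[c].strip() in {"1", "1.0"}
            ]
            if len(hot) != 1:
                raise ValueError(
                    f"{row[name_column]}: expected exactly one class, got {len(hot)}"
                )
            writer.writerow([row[name_column], hot[0]])
```

Writing the class index keeps the file compact; writing the column name instead (class_columns[hot[0]]) would keep the labels human-readable.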
