I would like to ask how to properly format and prepare an image dataset before uploading it to HuggingFace's Datasets Hub. I've consulted the documentation and an image folder template, but it is not 100% clear to me how I should structure my repository. I've also looked at some popular datasets like MNIST, but the fact that they load their data through an external script does not make things easier.
I have a dataset of 20k+ images in .jpg format. There are eight classes. Image-class membership is described in a .csv file named metadata.csv, as advised in the tutorial. The csv file contains nine columns: file_name (matching the .jpg file names) and one column per class, together constituting a one-hot encoding.
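For illustration, metadata.csv looks roughly like this (the class column names here are made up, not my actual ones):

```
file_name,class_0,class_1,class_2,class_3,class_4,class_5,class_6,class_7
IMG0001.jpg,1,0,0,0,0,0,0,0
IMG0002.jpg,0,0,1,0,0,0,0,0
```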
Currently, I have tried to format the repository as follows:
-my_dataset
--train (zip)
---IMG0001.jpg
---IMG0002.jpg
...
---IMG9999.jpg
--metadata.csv
--README.md
However, this format threw an error when I tried to load the dataset back from the Hub using the datasets.load_dataset method.
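For reference, this is roughly the call I used (the repository id is a placeholder for my actual one):

```python
from datasets import load_dataset

# load the dataset back from the Hub; "username/my_dataset" stands in for my real repo id
dataset = load_dataset("username/my_dataset")
```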
Upon consulting the forum, I included metadata.csv inside the archived train file. However, this does not seem correct either.
According to the template, every .jpg file should be contained in a separate directory. The new structure would look like this:
-my_dataset
--train (csv)
---metadata.csv
---train
----IMG0001
-----IMG0001.jpg
----IMG0002
-----IMG0002.jpg
...
----IMG9999
-----IMG9999.jpg
However, I do not completely understand where this nested format of placing each .jpg within its own directory comes from. Is this the proper way to upload an (archived) image dataset to the HF Datasets Hub? Secondly, should I change the one-hot encoding in the .csv file to a simple one-column format where each sample is assigned to one class by an integer?
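In case it matters for the answer, this is the conversion I have in mind (a quick sketch; the "label" column name is my own choice, not something from the docs):

```python
import pandas as pd

df = pd.read_csv("metadata.csv")
class_columns = [c for c in df.columns if c != "file_name"]  # the eight one-hot columns

# collapse the one-hot columns into a single integer label per sample
df["label"] = df[class_columns].to_numpy().argmax(axis=1)
df[["file_name", "label"]].to_csv("metadata.csv", index=False)
```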
I would be very grateful for your help with this one, as the documentation (at least for me) is a little bit ambiguous when it comes to uploading image datasets.