I am experimenting with creating an image dataset on the Hugging Face Hub, and I have some questions related to the dataset repository/workflow:
- Using the image folder approach, we can create and load the dataset without a dataset loading script. What are the advantages of adding one to the repo?
- In the above documentation link, the repository structure for using the image folder approach is the following: `folder/train/image_name.ext`. If we want to add other dataset subsets, like validation and test, do we use `folder/data-subset/image_name.ext`? Yes, using `folder/data-subset/image_name.ext` it's possible to add different data subsets.
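To make the split layout concrete, here is a small sketch that builds the expected directory tree with placeholder files (the folder and file names are illustrative; real repos would contain actual image files, and loading them requires the `datasets` library, which this sketch does not use):

```python
from pathlib import Path
import tempfile

# Hypothetical on-disk layout for the image folder approach:
# one subfolder per data subset (split).
root = Path(tempfile.mkdtemp()) / "folder"
for split in ("train", "validation", "test"):
    d = root / split
    d.mkdir(parents=True)
    (d / f"{split}_image_0.png").touch()  # empty placeholders stand in for real images

layout = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.png"))
print(layout)
```

With such a layout, the loader can infer one split per subfolder name.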
- Is there any option to incorporate additional information about the images apart from using JSON Lines files? Is it possible to use a CSV file? Or more than one? (e.g. the images are from a card game, and other relevant information is the skills, card text, artist, etc.)
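As a sketch of what CSV metadata could look like, the snippet below writes a `metadata.csv` next to the images, where a `file_name` column links each row to an image and the extra columns (`skills`, `card_text`, `artist` are made-up names for this card-game example) carry the additional information:

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical train split folder with a metadata.csv alongside the images.
train_dir = Path(tempfile.mkdtemp()) / "folder" / "train"
train_dir.mkdir(parents=True)

rows = [
    {"file_name": "card_001.png", "skills": "haste", "card_text": "Draw a card.", "artist": "A. Painter"},
    {"file_name": "card_002.png", "skills": "shield", "card_text": "Block one attack.", "artist": "B. Sketcher"},
]
with open(train_dir / "metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["file_name", "skills", "card_text", "artist"])
    writer.writeheader()
    writer.writerows(rows)

text = (train_dir / "metadata.csv").read_text()
print(text)
```

The `file_name` column is the key that ties metadata rows to image files; the remaining columns are free-form.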
- If the data needs some pre-processing to standardize the image dimensions, is it best practice to include this transformation in the dataset repo (I guess in the loading script)? Or is the idea that the data should already be ready to go in the dataset repository?
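One common pattern is to keep the raw images in the repo and apply the standardization lazily at load time. The sketch below illustrates that idea with a minimal stand-in class; the names (`LazyDataset`, `standardize`) are made up for illustration and are not the actual `datasets` API, and the "resize" is stubbed rather than done with a real image library:

```python
# Sketch of the "transform at load time" pattern: store raw examples
# untouched, standardize each item only when it is accessed.

class LazyDataset:
    def __init__(self, examples, transform=None):
        self.examples = examples
        self.transform = transform  # applied per item on access

    def __getitem__(self, i):
        item = dict(self.examples[i])  # copy so the raw data stays intact
        return self.transform(item) if self.transform else item

def standardize(item, size=(224, 224)):
    # Stand-in for an image resize; real code would use PIL/torchvision here.
    item["size"] = size
    return item

raw = [{"path": "card_001.png", "size": (640, 900)}]
ds = LazyDataset(raw, transform=standardize)

resized = ds[0]["size"]   # standardized dimensions at access time
original = raw[0]["size"] # underlying stored data is unchanged
print(resized, original)
```

The design choice this illustrates: keeping raw data in the repository preserves the original resolution for other use cases, while the transform guarantees every consumer sees standardized dimensions.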