How to structure an image dataset repo using the image folder approach?

Hi again!

I am just wondering if there is a rule of thumb for how many images the ImageFolder approach is suitable for. I am currently curating a dataset with 1.5k images, and I noticed that load_dataset() takes a long time (~5 minutes).
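For reference, here is a minimal sketch of the call in question, assuming the usual class-per-subfolder layout and a placeholder path:

```python
from datasets import load_dataset

# Placeholder path; assumes images are organized as path/to/images/<class_name>/<image>.jpg
dataset = load_dataset("imagefolder", data_dir="path/to/images")
```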

What version of datasets are you using? If you can paste the stack trace you get by interrupting (CTRL + C) the loading process while waiting for it to finish, that would also be helpful.
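A quick way to check the installed version:

```python
import datasets

# Print the installed version of the datasets library
print(datasets.__version__)
```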

From this forum discussion about image dataset best practices, I know that ImageFolder is highly inefficient for data streaming. Still, I don’t know whether the same applies to loading the dataset. Is it possible to tar the folder structure to speed up data loading? If so, does it require a custom loading script?

Loading from archives skips the globbing step that fetches all the image files, making the loading process faster. TAR archives are not currently supported (meaning they require a custom loading script), but we are working on it.
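As a sketch of what loading from an archive can look like (assuming a ZIP archive and a placeholder path; check the ImageFolder docs for the archive formats supported by your datasets version):

```python
from datasets import load_dataset

# Placeholder archive; the images inside keep the class-per-subfolder layout
dataset = load_dataset("imagefolder", data_files="path/to/images.zip", split="train")
```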