How to structure an image dataset repo using the image folder approach?

Hi again!

I am just wondering if there is a rule of thumb for how many images the ImageFolder approach is suitable for. I am currently curating a dataset with 1.5k images, and I noticed that load_dataset() takes a long time (~5 minutes).
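For reference, here is a minimal sketch of the call in question, assuming the usual class-per-subfolder layout and a placeholder path:

```python
from datasets import load_dataset

# Placeholder path; assumes images are organized as path/to/images/<class_name>/<image>.jpg
dataset = load_dataset("imagefolder", data_dir="path/to/images")
```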

What version of datasets are you using? If you can paste the stack trace you get by interrupting (CTRL + C) the loading process while waiting for it to finish, that would also be helpful.
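A quick way to check the installed version:

```python
import datasets

# Print the installed version of the datasets library
print(datasets.__version__)
```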

From this forum discussion about image dataset best practices, I know that ImageFolder is highly inefficient for data streaming. Still, I don’t know whether the same applies to loading the dataset. Is it possible to tar the folder structure to speed up data loading? If so, does it require a custom loading script?

Loading from archives skips the globbing step that fetches all the image files, making the loading process faster. TAR archives are not currently supported (meaning they require a custom loading script), but we are working on it.
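As a sketch of what loading from an archive can look like (assuming a ZIP archive and a placeholder path; check the ImageFolder docs for the archive formats supported by your datasets version):

```python
from datasets import load_dataset

# Placeholder archive; the images inside keep the class-per-subfolder layout
dataset = load_dataset("imagefolder", data_files="path/to/images.zip", split="train")
```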