5M small images (~100GB)

I'm collecting and training on a dataset of 5M small image files, which take up about 100 GB in total. What's your recommended solution for a dataset of this size?

I'm trying `imagefolder`, and it looks super slow.
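For reference, a minimal `imagefolder` call looks something like this (the directory path is just a placeholder):

```python
from datasets import load_dataset

# "path/to/images" is a placeholder for the actual image directory
dataset = load_dataset("imagefolder", data_dir="path/to/images", split="train")
```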

Hi! Can you run the command `datasets-cli env` and paste the output here? Does passing `ignore_verifications=True` to `load_dataset` improve the loading speed? You can also interrupt the process while waiting for it to finish and paste the stack trace here to make it easier for us to debug this issue.
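For the `ignore_verifications` suggestion, the call would look roughly like this (the `data_dir` is a placeholder):

```python
from datasets import load_dataset

# Skipping the verification step can speed up loading of large local datasets
dataset = load_dataset(
    "imagefolder",
    data_dir="path/to/images",
    ignore_verifications=True,
)
```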

Thanks for the response. I switched to loading from the JSON metadata file and processing the images with `map` using multiple workers, and it works fine now.
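A rough sketch of one way to set this up (not necessarily the exact code used here), assuming a metadata.jsonl with illustrative field names and paths:

```python
from datasets import load_dataset, Image

# Assumed layout: one JSON record per image, e.g. {"image": "images/0001.jpg", "label": 3}
dataset = load_dataset("json", data_files="metadata.jsonl", split="train")

# Cast the path column to the Image feature so files are decoded when accessed
dataset = dataset.cast_column("image", Image())

def preprocess(example):
    # Illustrative transform; replace with the actual processing
    example["image"] = example["image"].resize((224, 224))
    return example

# num_proc spreads image decoding/processing across multiple worker processes
dataset = dataset.map(preprocess, num_proc=8)
```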