Building an imagefolder dataset takes too long

There are about 20,000 images with accompanying text files in a local folder, and building an imagefolder dataset from them took about 30 minutes. The build process appears to traverse the folders performing a series of verification checks. How should this be handled when there are billions of files?
Code:

```python
dataset = load_dataset(
    'imagefolder',
    data_dir='/home/data/ms_coco/val2017/',
    streaming=True,
    ignore_verifications=True,
    cache_dir='/home/data/ms_coco/huggingface/valid',
)
```
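For context on why streaming helps at scale, here is a minimal plain-Python sketch of the idea behind lazy loading: instead of scanning every file and verifying it up front, a generator walks the directory tree with `os.scandir` and yields (image path, caption) pairs on demand, so the first examples are available almost immediately. This is an illustration of the concept, not the `datasets` library's actual implementation; the caption-file naming convention (`.txt` next to each image) is an assumption for the example.

```python
import itertools
import os
from pathlib import Path

def iter_image_caption_pairs(root):
    """Lazily walk `root`, yielding (image_path, caption) pairs.

    os.scandir streams directory entries rather than materializing the
    full listing up front, so iteration starts immediately even for
    very large folders -- the same idea behind streaming=True.
    NOTE: the sibling-`.txt` caption convention is an assumption here.
    """
    stack = [Path(root)]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                path = Path(entry.path)
                if entry.is_dir():
                    stack.append(path)
                elif path.suffix.lower() in {".jpg", ".jpeg", ".png"}:
                    caption_file = path.with_suffix(".txt")
                    caption = (
                        caption_file.read_text() if caption_file.exists() else ""
                    )
                    yield str(path), caption

# Usage: grab just the first few examples without touching the rest
# of the tree (path below is hypothetical):
# first_three = list(
#     itertools.islice(iter_image_caption_pairs("/home/data/images/"), 3)
# )
```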

Please help me :grinning:.

Hi! Can you interrupt the process (Ctrl+C or Cmd+C) while waiting for it to finish and paste the returned error stack trace here to help us debug the issue? Also, what's the output of the `datasets-cli env` command?