5M small images (~100GB)

I'm collecting and training on a dataset of 5M small image files, which take up about 100 GB in total. What's your recommended solution for a dataset of this size?

I'm trying `imagefolder`, and it looks super slow.
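For reference, a minimal `imagefolder` call looks something like this (the directory path is just a placeholder):

```python
from datasets import load_dataset

# "path/to/images" is a placeholder for the actual image directory
dataset = load_dataset("imagefolder", data_dir="path/to/images", split="train")
```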

Hi! Can you run the command `datasets-cli env` and paste the output here? Does passing `ignore_verifications=True` to `load_dataset` improve the loading speed? You can also interrupt the process while waiting for it to finish and paste the stack trace here to make it easier for us to debug this issue.
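For the `ignore_verifications` suggestion, the call would look roughly like this (the `data_dir` is a placeholder):

```python
from datasets import load_dataset

# Skipping the verification step can speed up loading of large local datasets
dataset = load_dataset(
    "imagefolder",
    data_dir="path/to/images",
    ignore_verifications=True,
)
```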

Thanks for the response. I switched to loading from the JSON metadata file and processing the images with `map` using multiple workers, and it works fine now.
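A rough sketch of one way to set this up (not necessarily the exact code used here), assuming a metadata.jsonl with illustrative field names and paths:

```python
from datasets import load_dataset, Image

# Assumed layout: one JSON record per image, e.g. {"image": "images/0001.jpg", "label": 3}
dataset = load_dataset("json", data_files="metadata.jsonl", split="train")

# Cast the path column to the Image feature so files are decoded when accessed
dataset = dataset.cast_column("image", Image())

def preprocess(example):
    # Illustrative transform; replace with the actual processing
    example["image"] = example["image"].resize((224, 224))
    return example

# num_proc spreads image decoding/processing across multiple worker processes
dataset = dataset.map(preprocess, num_proc=8)
```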