Extremely slow data loading of imagefolder

Hi, I’m new to Hugging Face Datasets and I’m trying to train a ControlNet for Stable Diffusion on a custom dataset of approximately 300k images, each of size (768, 768).

Right now, I’m stuck on the following lines of code:

from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="path/to/the/dataset")
print(dataset['train'][0])

I have a few questions.

  1. Does imagefolder load and decode the images into memory at setup? If so, can I disable that?
  2. Is there any implicit processing that Datasets does the first time I call load_dataset that makes it take so long?
  3. What’s the best practice for loading a relatively large dataset? I’ve seen people mention saving the dataset as Arrow and then loading it, but I don’t know how to do that specifically (a rough sketch of my understanding is below). There is an urgent need for a detailed tutorial on this in the official docs.
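From what I’ve gathered, the save-as-Arrow approach looks roughly like this (the output path is just a placeholder), but I’m not sure it’s the recommended way:

from datasets import load_dataset, load_from_disk

# Build the dataset once (slow), then persist it to disk as Arrow files.
dataset = load_dataset("imagefolder", data_dir="path/to/the/dataset")
dataset.save_to_disk("path/to/arrow-dataset")  # placeholder output path

# On later runs, load the Arrow files directly instead of re-scanning the folder.
dataset = load_from_disk("path/to/arrow-dataset")
print(dataset["train"][0])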

Agree. 3 hours in, still not even a log line -_-

I have 3 million images in my directory…

Does passing streaming=True work better?

Here is more about how and when to use it.
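Roughly, that would look like this (reusing the data_dir from the original post); streaming returns an IterableDataset, so nothing is scanned or decoded up front:

from datasets import load_dataset

# streaming=True yields examples lazily as you iterate instead of
# building the full Arrow dataset first.
dataset = load_dataset("imagefolder", data_dir="path/to/the/dataset", streaming=True)

for example in dataset["train"]:
    print(example["image"])  # decoded only when accessed
    break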

I’m basically loading it to upload to HF… so it still needs to download it, right?

If you are downloading it only to upload it, then streaming will not work: push_to_hub does not support streaming datasets.

@panigrah thank you very much. Maybe you also know whether it’s possible to download a dataset in a multi-process way? For some reason, setting num_proc does not work at all… My dataset has 58 Parquet files, and I was hoping that passing num_proc to load_dataset would spawn 58 Python processes, each downloading its own Parquet file, so I could load my dataset in 1 minute instead of 50…
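For reference, this is roughly what I’m running (the repo id is a placeholder for my private dataset); my understanding is that num_proc is supposed to download and prepare the data files with multiple processes:

from datasets import load_dataset

# num_proc should parallelize downloading/preparing the data files
# (the repo id below is a placeholder for my private dataset).
dataset = load_dataset("my-username/my-private-dataset", num_proc=58)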

Where are you downloading from? I can try to replicate it and see why.

From my private dataset. The structure is very similar to this one: kopyl/833-icons-dataset-1024-blip-large · Datasets at Hugging Face

except that it has 3,000,000 rows instead of 833

Thank you :slight_smile:

To answer @sienna223’s original questions:

  • the FolderBasedBuilder doesn’t seem to be optimized for large datasets; it performs several costly operations to parse and validate the data.
  • a simple loading script sidesteps most of the costly operations, e.g.
import datasets
import glob


_IMAGES = glob.glob("my-image-folder/*.png")


class MyDataset(datasets.GeneratorBasedBuilder):
    """My image dataset"""

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "image": datasets.Image(),
                    # add labels or additional metadata as needed
                }
            ),
        )

    def _split_generators(self, dl_manager):
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "images": _IMAGES,
                },
            ),
        ]

    def _generate_examples(self, images):
        for file_path in images:
            yield file_path, {
                "image": file_path,
                # parse labels or additional metadata as needed
            }

In my case, I was trying to create a 15M image dataset: the “imagefolder” approach seemed to run out of memory after 8 hours or so; with the loading script I was able to create the dataset in minutes.
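For completeness, a loading script like the one above is used by pointing load_dataset at the script file (the path below is a placeholder for wherever you save it):

from datasets import load_dataset

# Build the dataset from the custom loading script defined above.
dataset = load_dataset("path/to/my_dataset.py")
print(dataset["train"][0])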

We just released datasets 2.16.1, which optimizes data file resolution and makes it possible to load datasets with millions of images. It also requires huggingface-hub >= 0.20.1 :slight_smile:

Older versions of datasets and huggingface-hub are slow to handle that many files.