Extremely slow data loading of imagefolder

Hi, I’m new to Hugging Face Datasets and I’m trying to train a ControlNet for Stable Diffusion on a custom dataset of approximately 300k images, each of size (768, 768).

Right now, I’m stuck on the following lines of code:

from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="path/to/the/dataset")
print(dataset['train'][0])

I have a few questions.

  1. Does imagefolder load and decode the images into memory at setup? If so, can I disable that?
  2. Is there any implicit processing that Datasets does the first time I call load_dataset that makes it take so long?
  3. What’s the best practice for loading a relatively large dataset? I’ve seen people mention saving the dataset as Arrow and then loading it, but I don’t know how to do that specifically (a rough sketch of my understanding is below). There is an urgent need for a detailed tutorial on this in the official docs.
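From what I’ve gathered, the save-as-Arrow approach looks roughly like this (the output path is just a placeholder), but I’m not sure it’s the recommended way:

from datasets import load_dataset, load_from_disk

# Build the dataset once (slow), then persist it to disk as Arrow files.
dataset = load_dataset("imagefolder", data_dir="path/to/the/dataset")
dataset.save_to_disk("path/to/arrow-dataset")  # placeholder output path

# On later runs, load the Arrow files directly instead of re-scanning the folder.
dataset = load_from_disk("path/to/arrow-dataset")
print(dataset["train"][0])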

Agree. 3 hours in, still not even a log line -_-

I have 3 million images in my directory…

Does passing streaming=True work better?

Here is more about how and when to use it.
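Roughly, that would look like this (reusing the data_dir from the original post); streaming returns an IterableDataset, so nothing is scanned or decoded up front:

from datasets import load_dataset

# streaming=True yields examples lazily as you iterate instead of
# building the full Arrow dataset first.
dataset = load_dataset("imagefolder", data_dir="path/to/the/dataset", streaming=True)

for example in dataset["train"]:
    print(example["image"])  # decoded only when accessed
    break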

I’m basically loading it to upload to HF… so it still needs to download it, right?

If you are downloading it only to upload it, then streaming will not work: push_to_hub does not support streaming datasets.

@panigrah thank you very much. Maybe you also know whether it’s possible to download a dataset in a multi-process way? For some reason, setting num_proc does not work at all… My dataset has 58 Parquet files, and I was hoping that passing num_proc to load_dataset would spawn 58 Python processes, each downloading its own Parquet file, so I could load my dataset in 1 minute instead of 50…
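For reference, this is roughly what I’m running (the repo id is a placeholder for my private dataset); my understanding is that num_proc is supposed to download and prepare the data files with multiple processes:

from datasets import load_dataset

# num_proc should parallelize downloading/preparing the data files
# (the repo id below is a placeholder for my private dataset).
dataset = load_dataset("my-username/my-private-dataset", num_proc=58)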

Where are you downloading from? I can try to replicate it and see why.

From my private dataset. The structure is very similar to this one: kopyl/833-icons-dataset-1024-blip-large · Datasets at Hugging Face

except that it has 3,000,000 rows instead of 833

Thank you :slight_smile:

To answer @sienna223’s original questions:

  • the FolderBasedBuilder doesn’t seem to be optimized for large datasets; it performs several costly operations to parse and validate the data.
  • a simple loading script sidesteps most of the costly operations, e.g.
import datasets
import glob


_IMAGES = glob.glob("my-image-folder/*.png")


class MyDataset(datasets.GeneratorBasedBuilder):
    """My image dataset"""

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "image": datasets.Image(),
                    # add labels or additional metadata as needed
                }
            ),
        )

    def _split_generators(self, dl_manager):
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "images": _IMAGES,
                },
            ),
        ]

    def _generate_examples(self, images):
        for file_path in images:
            yield file_path, {
                "image": file_path,
                # parse labels or additional metadata as needed
            }

In my case, I was trying to create a 15M image dataset: the “imagefolder” approach seemed to run out of memory after 8 hours or so; with the loading script I was able to create the dataset in minutes.
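For completeness, a loading script like the one above is used by pointing load_dataset at the script file (the path below is a placeholder for wherever you save it):

from datasets import load_dataset

# Build the dataset from the custom loading script defined above.
dataset = load_dataset("path/to/my_dataset.py")
print(dataset["train"][0])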

We just released datasets 2.16.1, which optimizes data file resolution and makes it possible to load datasets with millions of images. It also requires huggingface-hub >= 0.20.1 :slight_smile:

Older versions of datasets and huggingface-hub are slow to handle that many files.