How to load a large-scale text-image pair dataset

Hi,

I have tried to fine-tune the SDXL model on a subset of LAION-aesthetic-5+ (89M) using the example code.

I used this code for loading data:

dataset = load_dataset("imagefolder", data_dir=args.train_data_dir, split="train")

args.train_data_dir denotes the data directory containing over 89M image-text pairs.

But the data loading takes too long :frowning: and it eventually fails.

For this kind of large-scale text-to-image model training, is there a more efficient way to load the data?

Thanks in advance :slight_smile:


ImageFolder's file resolution is currently not optimized for large datasets like this one. In your case, it's best to create a dataset loading script or use Dataset.from_generator (with a generator that yields {"image": pil_image, "text": text} dictionaries) instead of load_dataset to generate the dataset.
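
For reference, here's a rough sketch of the from_generator approach. It assumes each image has a same-named .txt caption file next to it; adjust the globbing and caption logic to your actual layout:

from datasets import Dataset
from pathlib import Path
from PIL import Image

def gen(data_dir):
    # Iterate over the folder lazily instead of resolving all 89M files up front.
    for img_path in Path(data_dir).rglob("*.jpg"):
        txt_path = img_path.with_suffix(".txt")
        if not txt_path.exists():
            continue
        yield {
            "image": Image.open(img_path),
            "text": txt_path.read_text().strip(),
        }

dataset = Dataset.from_generator(gen, gen_kwargs={"data_dir": "path/to/data"})

The resulting dataset is written to an Arrow cache once, so subsequent loads are fast, and it supports random access and shuffling like any other map-style dataset.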


Hi @ywlee88 ,
Did the loading script solution work for you, or did you find another way to load your data?

Best,
Mike

Wouldn't streaming work here, i.e. setting streaming=True on the load_dataset call? My understanding is that as long as you are not trying to randomize the training data, streaming should work.
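
A rough sketch of what I mean (note that imagefolder still has to resolve the file list first, but streaming skips building the full Arrow cache):

from datasets import load_dataset

# Streaming returns an IterableDataset: samples are decoded on the fly
# instead of being materialized to disk first.
dataset = load_dataset("imagefolder", data_dir="path/to/data", split="train", streaming=True)

# Exact shuffling is not possible, but a shuffle buffer gives an approximation.
dataset = dataset.shuffle(seed=42, buffer_size=10_000)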

@mariosasko
I'm also experiencing difficulties handling a large-scale image dataset with millions of samples. I'm curious whether a dataset built with Dataset.from_generator behaves like map-style data (specifically, whether it offers fast random access and supports shuffling). Additionally, I'm wondering which method would be better: storing images in TAR archives and reading them from there, or storing image paths as strings and loading the images in the collate function (see the sketch below).
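
By the second option I mean something like this (just a sketch; the image_path and text keys and the transforms are illustrative):

import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
])

def collate_fn(batch):
    # Each dataset item carries only a path and a caption; the actual image
    # decoding happens here, inside the DataLoader workers, so the dataset
    # itself stays small and cheap to shuffle.
    images = torch.stack([
        preprocess(Image.open(item["image_path"]).convert("RGB"))
        for item in batch
    ])
    return {"pixel_values": images, "text": [item["text"] for item in batch]}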

Thank you for always putting in so much effort for the community :slight_smile:
