How to load a large-scale text-image pair dataset

Hi,

I have tried to fine-tune the SDXL model on a subset of LAION-aesthetic-5+ (89M) using the example code.

I used this code for loading data:

dataset = load_dataset("imagefolder", data_dir=args.train_data_dir, split="train")

args.train_data_dir denotes the data directory containing over 89M image-text pairs.

But the data loading takes too long :frowning: and it eventually fails.

For this kind of large-scale text-to-image model training, is there a more efficient way to load the data?

Thanks in advance :slight_smile:


ImageFolder's file resolution is currently not optimized for large datasets like this one. In your case, it's best to create a dataset loading script or use Dataset.from_generator (with a generator that yields {"image": pil_image, "text": text} dictionaries) instead of load_dataset to generate the dataset.
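
For reference, here's a rough sketch of the from_generator approach. It assumes each image has a same-named .txt caption file next to it; adjust the globbing and caption logic to your actual layout:

from datasets import Dataset
from pathlib import Path
from PIL import Image

def gen(data_dir):
    # Iterate over the folder lazily instead of resolving all 89M files up front.
    for img_path in Path(data_dir).rglob("*.jpg"):
        txt_path = img_path.with_suffix(".txt")
        if not txt_path.exists():
            continue
        yield {
            "image": Image.open(img_path),
            "text": txt_path.read_text().strip(),
        }

dataset = Dataset.from_generator(gen, gen_kwargs={"data_dir": "path/to/data"})

The resulting dataset is written to an Arrow cache once, so subsequent loads are fast, and it supports random access and shuffling like any other map-style dataset.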


Hi @ywlee88 ,
Did the loading script solution work for you, or did you find another way to load your data?

Best,
Mike

Wouldn't streaming work here, i.e. setting streaming=True on the load_dataset call? My understanding is that as long as you are not trying to randomize the training data, streaming should work.
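
A rough sketch of what I mean (note that imagefolder still has to resolve the file list first, but streaming skips building the full Arrow cache):

from datasets import load_dataset

# Streaming returns an IterableDataset: samples are decoded on the fly
# instead of being materialized to disk first.
dataset = load_dataset("imagefolder", data_dir="path/to/data", split="train", streaming=True)

# Exact shuffling is not possible, but a shuffle buffer gives an approximation.
dataset = dataset.shuffle(seed=42, buffer_size=10_000)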

@mariosasko
I'm also experiencing difficulties handling a large-scale image dataset with millions of samples. I'm curious whether a dataset built with Dataset.from_generator behaves like map-style data (specifically, whether it offers fast random access and supports shuffling). Additionally, I'm wondering which method would be better: storing images in TAR archives and reading them from there, or storing image paths as strings and loading the images in the collate function (see the sketch below).
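
By the second option I mean something like this (just a sketch; the image_path and text keys and the transforms are illustrative):

import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
])

def collate_fn(batch):
    # Each dataset item carries only a path and a caption; the actual image
    # decoding happens here, inside the DataLoader workers, so the dataset
    # itself stays small and cheap to shuffle.
    images = torch.stack([
        preprocess(Image.open(item["image_path"]).convert("RGB"))
        for item in batch
    ])
    return {"pixel_values": images, "text": [item["text"] for item in batch]}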

Thank you for always putting in so much effort for the community :slight_smile:
