How to load a large-scale text-image pair dataset

Hi,

I have tried to fine-tune the SDXL model on a subset of LAION-aesthetic-5+ (89M image-text pairs) using the example code.

I used this code for loading data:

from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir=args.train_data_dir, split="train")

args.train_data_dir denotes the data directory containing over 89M image-text pairs.

But the data loading takes too long :frowning: and eventually it fails to load the data.

For large-scale text-to-image model training like this, is there a more efficient way to load the data?

Thanks in advance :slight_smile:


ImageFolder's file resolution is currently not optimized for large datasets like this one. In your case, it's best to create a dataset loading script or use Dataset.from_generator (with a generator that yields {"image": pil_image, "text": text} dictionaries) instead of load_dataset to generate the dataset.
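
For reference, here is a minimal sketch of the Dataset.from_generator approach. It assumes the pairs are stored flat as image files with matching .txt caption files under args.train_data_dir; the layout and pairing logic are assumptions, so adapt them to your data:

import os
from PIL import Image
from datasets import Dataset

def pair_generator(data_dir):
    # Assumed layout: foo.jpg sits next to foo.txt holding its caption.
    for entry in os.scandir(data_dir):
        if not entry.name.lower().endswith(".jpg"):
            continue
        caption_path = os.path.splitext(entry.path)[0] + ".txt"
        with open(caption_path, encoding="utf-8") as f:
            text = f.read().strip()
        yield {"image": Image.open(entry.path), "text": text}

dataset = Dataset.from_generator(
    pair_generator, gen_kwargs={"data_dir": args.train_data_dir}
)

from_generator writes the examples to an Arrow cache once, so later runs can reuse it; for 89M pairs that first pass is still a big one-time cost, but it skips ImageFolder's per-file resolution.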


Hi @ywlee88,
Did the loading script solution work for you, or did you find another way to load your data?

Best,
Mike

Wouldn't streaming work here, i.e. setting streaming=True on the load_dataset call? My understanding is that as long as you are not trying to randomize the training data, streaming should work.
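
As a sketch of that idea (same imagefolder builder, just with streaming=True, which returns an IterableDataset that yields examples lazily instead of materializing everything up front):

from datasets import load_dataset

dataset = load_dataset(
    "imagefolder", data_dir=args.train_data_dir,
    split="train", streaming=True,
)

# Peek at a couple of examples; which fields appear depends on how the
# captions are stored (e.g. a metadata.jsonl with a "text" column).
for example in dataset.take(2):
    print(example)

Note that streaming datasets do support approximate shuffling via dataset.shuffle(buffer_size=...), and that the initial file listing over tens of millions of local files may still be slow.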