Hi,
I have tried to fine-tune the SDXL model on a subset of LAION-aesthetic-5+ (89M image-text pairs) using the example code.
I used this code to load the data:
dataset = load_dataset("imagefolder", data_dir=args.train_data_dir, split="train")
where args.train_data_dir denotes the data directory containing the 89M+ image-text pairs.
But the data loading takes far too long, and it eventually fails.
Is there a more efficient way to load data for large-scale text-to-image model training like this?
Thanks in advance
ImageFolder's file resolution is currently not optimized for large datasets like this one. In your case, it's best to create a dataset loading script or use Dataset.from_generator (with a generator that yields {"image": pil_image, "text": text} dictionaries) instead of load_dataset to generate the dataset.
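For illustration, here is a minimal sketch of the Dataset.from_generator approach. It assumes a hypothetical metadata.tsv file mapping image paths to captions, one tab-separated pair per line; adapt the generator to your actual data layout.

import os
from datasets import Dataset
from PIL import Image

train_data_dir = "path/to/laion_subset"  # e.g. args.train_data_dir in the training script

def gen():
    # Hypothetical layout: metadata.tsv with one "image_path<TAB>caption" pair per line.
    with open(os.path.join(train_data_dir, "metadata.tsv")) as f:
        for line in f:
            image_path, text = line.rstrip("\n").split("\t", 1)
            yield {"image": Image.open(os.path.join(train_data_dir, image_path)), "text": text}

dataset = Dataset.from_generator(gen)

Unlike ImageFolder, this skips the upfront file-resolution step; the generator is consumed once and the result is cached as an Arrow dataset.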
Hi @ywlee88,
Did the loading script solution work for you or did you find another way to load your data?
Best,
Mike
Wouldn't streaming work here, i.e. setting streaming=True on the load_dataset call? My understanding is that as long as you don't need to fully shuffle the training data, streaming should work.
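For reference, a sketch of the streaming variant (note that ImageFolder still has to resolve the file list first, so this may not fully avoid the original slowdown):

from datasets import load_dataset

train_data_dir = "path/to/laion_subset"  # e.g. args.train_data_dir

# streaming=True returns an IterableDataset: samples are yielded on the fly
# instead of being downloaded and cached up front.
dataset = load_dataset("imagefolder", data_dir=train_data_dir, split="train", streaming=True)

# Approximate shuffling is still possible with a fixed-size buffer.
dataset = dataset.shuffle(seed=42, buffer_size=10_000)

for example in dataset:
    # example["image"] is a PIL image; example["text"] comes from the metadata file, if present.
    ...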