ImageFolder's file resolution is currently not optimized for large datasets like this one. In your case, it's best to create a dataset loading script or use Dataset.from_generator (with a generator that yields {"image": pil_image, "text": text} dictionaries) instead of load_dataset to generate the dataset.
Wouldn't streaming work here, i.e. setting streaming=True on the load_dataset call? My understanding is that as long as you are not trying to randomize the training data, streaming should work.
@mariosasko
I'm also having difficulties handling a large-scale image dataset with millions of images. I'm curious whether a dataset built with Dataset.from_generator behaves like a map-style dataset (specifically, whether it offers fast random access and supports shuffling). Additionally, I'm wondering which approach would be better: storing the images in TAR archives and reading from them, or storing image paths as strings and loading the images in the collate function.
Thank you for always putting in so much effort for the community!