Loading a large parquet dataset with varying image resolutions

I have created this dataset here: SwayStar123/preprocessed_commoncatalog-cc-by at main

It contains parquet files grouped into folders by image resolution. To train on this dataset, I need the dataloader to ensure that all images within a batch are the same size, while different batches can have different resolutions.

So if I naively create a dataloader using

from datasets import load_dataset
from torch.utils.data import DataLoader
ds = load_dataset("SwayStar123/preprocessed_commoncatalog-cc-by")
dl = DataLoader(ds, batch_size=512, shuffle=True)

will this just work out of the box here? I'm guessing not.

What would be the best way for me to ensure same-resolution batches while still shuffling the dataset? My current idea is to load each resolution folder as its own dataset, make a dataloader for each, and then wrap them all in a custom aggregate dataloader that randomly samples a resolution for each batch (see the sketch below). If there's a better way, please let me know.
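For reference, here is a minimal sketch of that aggregate-loader idea. The folder names in RESOLUTION_DIRS are placeholders (the actual folder names in the repo may differ), and it assumes each resolution folder can be selected via the data_dir argument of load_dataset; treat it as an illustration rather than a drop-in solution.

import random
from datasets import load_dataset
from torch.utils.data import DataLoader

REPO = "SwayStar123/preprocessed_commoncatalog-cc-by"
RESOLUTION_DIRS = ["256x256", "512x512", "1024x1024"]  # hypothetical folder names

# One dataset + dataloader per resolution folder, so every batch is uniform in size.
loaders = {}
for res in RESOLUTION_DIRS:
    ds = load_dataset(REPO, data_dir=res, split="train")
    loaders[res] = DataLoader(ds.with_format("torch"), batch_size=512, shuffle=True)

def aggregate_batches(loaders):
    """Yield (resolution, batch) pairs, picking a random resolution each step
    until every per-resolution loader is exhausted."""
    iterators = {res: iter(dl) for res, dl in loaders.items()}
    while iterators:
        res = random.choice(list(iterators))
        try:
            yield res, next(iterators[res])
        except StopIteration:
            del iterators[res]  # this resolution is finished for the epoch

for res, batch in aggregate_batches(loaders):
    ...  # training step; all images in `batch` share the same resolution

Note that sampling resolutions uniformly like this ignores how many batches remain per resolution; weighting the choice by remaining batch counts would keep the mix closer to the dataset's true proportions.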


If you are loading a dataset to train a model, it is common practice to preprocess the images first. There are reports that the image preprocessor in the current version of the transformers library has some accuracy bugs, but it basically works. If you are concerned about those bugs, process the images manually with torchvision or similar.
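As a rough illustration of doing that manually, something like the following works with torchvision; the target size and normalization constants here are placeholders, not values from the dataset.

from torchvision import transforms

# Manual preprocessing pipeline: resize/crop to a fixed size, then normalize.
preprocess = transforms.Compose([
    transforms.Resize(512),
    transforms.CenterCrop(512),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])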

I think your approach of organizing the data into subdivided per-resolution datasets is the right one.

The dataset is already preprocessed and divided into folders by resolution. I am asking about the best way to load this dataset into a PyTorch DataLoader.