Loading a large parquet dataset with varying image resolutions

I have created this dataset here: SwayStar123/preprocessed_commoncatalog-cc-by at main

It contains parquet files grouped into folders by image resolution. To train on this dataset, I need the dataloader to ensure that all images within a batch are the same size, while different batches can have different resolutions.

So if I naively create a dataloader using

from datasets import load_dataset
from torch.utils.data import DataLoader
ds = load_dataset("SwayStar123/preprocessed_commoncatalog-cc-by")
dl = DataLoader(ds, batch_size=512, shuffle=True)

will this just work out of the box here? I'm guessing not.

What would be the best way for me to ensure same-resolution batches while still shuffling the dataset? My current idea is to load each resolution folder as its own dataset, make a dataloader for each, and then wrap them all in a custom aggregate dataloader that randomly samples a resolution for each batch (see the sketch below). If there's a better way, please let me know.
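For reference, here is a minimal sketch of that aggregate-loader idea. The folder names in RESOLUTION_DIRS are placeholders (the actual folder names in the repo may differ), and it assumes each resolution folder can be selected via the data_dir argument of load_dataset; treat it as an illustration rather than a drop-in solution.

import random
from datasets import load_dataset
from torch.utils.data import DataLoader

REPO = "SwayStar123/preprocessed_commoncatalog-cc-by"
RESOLUTION_DIRS = ["256x256", "512x512", "1024x1024"]  # hypothetical folder names

# One dataset + dataloader per resolution folder, so every batch is uniform in size.
loaders = {}
for res in RESOLUTION_DIRS:
    ds = load_dataset(REPO, data_dir=res, split="train")
    loaders[res] = DataLoader(ds.with_format("torch"), batch_size=512, shuffle=True)

def aggregate_batches(loaders):
    """Yield (resolution, batch) pairs, picking a random resolution each step
    until every per-resolution loader is exhausted."""
    iterators = {res: iter(dl) for res, dl in loaders.items()}
    while iterators:
        res = random.choice(list(iterators))
        try:
            yield res, next(iterators[res])
        except StopIteration:
            del iterators[res]  # this resolution is finished for the epoch

for res, batch in aggregate_batches(loaders):
    ...  # training step; all images in `batch` share the same resolution

Note that sampling resolutions uniformly like this ignores how many batches remain per resolution; weighting the choice by remaining batch counts would keep the mix closer to the dataset's true proportions.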


If you are loading a dataset to train a model, it is common practice to preprocess the images first. There are reports that the image preprocessor in the current version of the transformers library has some accuracy bugs, but it basically works. If you are concerned about those bugs, process the images manually with torchvision or similar.
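As a rough illustration of doing that manually, something like the following works with torchvision; the target size and normalization constants here are placeholders, not values from the dataset.

from torchvision import transforms

# Manual preprocessing pipeline: resize/crop to a fixed size, then normalize.
preprocess = transforms.Compose([
    transforms.Resize(512),
    transforms.CenterCrop(512),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])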

I think your approach of organizing the data into subdivided per-resolution datasets is the right one.

The dataset is already preprocessed and divided into folders by resolution. I am asking about the best way to load this dataset into a PyTorch DataLoader.