I have created this dataset here: SwayStar123/preprocessed_commoncatalog-cc-by
It contains parquet files grouped into folders by image resolution. To train with this dataset, I need the dataloader to yield batches in which all images are the same size, while the size can vary between batches.
So if I naively create a dataloader like this:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

ds = load_dataset("SwayStar123/preprocessed_commoncatalog-cc-by")
dl = DataLoader(ds, batch_size=512, shuffle=True)
```
will this just work out of the box? I'm guessing not.
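My reasoning (please correct me if I'm wrong): the default collate function tries to `torch.stack` the samples in a batch, and stacking tensors of different shapes raises an error. A minimal illustration, with made-up shapes:

```python
import torch
from torch.utils.data import default_collate

# Two "images" of different resolutions, stand-ins for real samples.
a = torch.zeros(3, 256, 256)
b = torch.zeros(3, 512, 512)

# default_collate stacks the samples, so mixed shapes fail:
default_collate([a, b])  # RuntimeError: stack expects each tensor to be equal size
```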
What would be the best way to get same-resolution batches while still shuffling the dataset? My current idea is to load each resolution folder as its own dataset, create a DataLoader for each, and then wrap them all in an aggregate custom dataloader that picks a random resolution for each batch (sketched below). If there's a better way, please let me know.
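Here is a rough sketch of what I mean. The folder names are placeholders for the actual resolution directories in the repo, and I'm assuming each folder's parquet files can be loaded on their own via `data_dir`:

```python
import random

from datasets import load_dataset
from torch.utils.data import DataLoader

# Placeholder folder names; the real ones are whatever resolution
# directories exist in the repo.
RESOLUTIONS = ["256x256", "512x512", "1024x1024"]

loaders = {}
for res in RESOLUTIONS:
    ds = load_dataset(
        "SwayStar123/preprocessed_commoncatalog-cc-by",
        data_dir=res,  # load only this resolution's parquet files
        split="train",
    )
    loaders[res] = DataLoader(ds.with_format("torch"), batch_size=512, shuffle=True)

def aggregate_batches(loaders):
    """Yield batches, choosing a random resolution each step, until
    every per-resolution loader is exhausted."""
    iterators = {res: iter(dl) for res, dl in loaders.items()}
    while iterators:
        res = random.choice(list(iterators))
        try:
            yield next(iterators[res])
        except StopIteration:
            del iterators[res]  # this resolution is used up

for batch in aggregate_batches(loaders):
    ...  # training step; every image in `batch` has the same resolution
```

One thing I'm unsure about with this approach: picking uniformly among the remaining loaders over-samples the small resolution buckets toward the end of an epoch, so maybe the choice should be weighted by how many batches each loader has left.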