When training the model on datasets with diverse image sizes (e.g. from 256 up to 1024), I typically resize every image to a specific size (e.g. 512) and then train the model on 512x512.
After training, the model tends to generate pixelated images when CFG is set to a high value. (I think it’s because of the 256x256 images being upscaled to 512.)
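For reference, my current preprocessing is roughly the following (just a sketch; the exact crop and normalization details vary per run):

```python
from torchvision import transforms

# Every image, whether it started at 256 or 1024, ends up as a 512x512 tensor.
preprocess = transforms.Compose([
    transforms.Resize(512, interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.CenterCrop(512),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])
```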
- Is “resizing everything to the same size” how people typically train the model?
- Should we add extra tags (e.g. “pixelated” or “low quality”) to the captions of small images and use them as negative prompts at inference?
- Any best practices for training the model on diverse image sizes using huggingface datasets? How do I batch several images with similar sizes together?
I believe resizing everything to the same size is usually what we do in our training scripts. However, I wouldn’t say it’s always recommended; I do see potential issues with it.
I don’t know about number 2.
You can post-process the images since they’re loadable through standard torch datasets; we have examples in some of the training scripts (controlnet iirc).
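Roughly something like this (a sketch rather than a copy of any particular script; the imagefolder path and the 512 target size are placeholders):

```python
import torch
from datasets import load_dataset
from torchvision import transforms

# "imagefolder" yields an "image" column of PIL images.
dataset = load_dataset("imagefolder", data_dir="path/to/images", split="train")

train_transforms = transforms.Compose([
    transforms.Resize(512, interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.CenterCrop(512),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

def preprocess(examples):
    # Applied lazily, per accessed batch, by set_transform.
    examples["pixel_values"] = [
        train_transforms(img.convert("RGB")) for img in examples["image"]
    ]
    return examples

dataset.set_transform(preprocess)

def collate_fn(rows):
    return {"pixel_values": torch.stack([row["pixel_values"] for row in rows])}

loader = torch.utils.data.DataLoader(
    dataset, batch_size=4, shuffle=True, collate_fn=collate_fn
)
```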
Regarding number 3, I’m also referring to images with different aspect ratios, which people usually handle with Aspect Ratio Bucketing (ARB). How do I do ARB with huggingface datasets? Is it possible?
Yes, that’s a good point; aspect ratio bucketing makes a lot of sense. I don’t think there’s a generic way to load only images of a particular aspect ratio, as it depends on how the dataset is stored. Assuming there’s some set of index files that point to URLs of the images, I’d maybe recommend forking the dataset into multiple datasets such that the index files are filtered on resolution. That seems like the most straightforward way.
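If you do want proper bucketing rather than separate datasets, here’s a minimal sketch of the batching side with a custom batch sampler (the bucket list and the way you gather image sizes are placeholders; this isn’t something the training scripts ship):

```python
import random
from collections import defaultdict

import torch

# Illustrative bucket resolutions, all with roughly the same pixel count.
BUCKETS = [(512, 512), (448, 576), (576, 448), (384, 640), (640, 384)]

def nearest_bucket(width, height):
    ar = width / height
    return min(BUCKETS, key=lambda b: abs(b[0] / b[1] - ar))

class BucketBatchSampler(torch.utils.data.Sampler):
    """Yields index batches whose images all share the same aspect-ratio bucket."""

    def __init__(self, sizes, batch_size):
        # `sizes` is a list of (width, height) per example, gathered once from
        # the dataset's metadata (or by reading each image header).
        self.batch_size = batch_size
        self.buckets = defaultdict(list)
        for idx, (w, h) in enumerate(sizes):
            self.buckets[nearest_bucket(w, h)].append(idx)

    def __iter__(self):
        batches = []
        for indices in self.buckets.values():
            random.shuffle(indices)
            for i in range(0, len(indices), self.batch_size):
                batches.append(indices[i : i + self.batch_size])
        random.shuffle(batches)
        yield from batches

    def __len__(self):
        return sum(-(-len(v) // self.batch_size) for v in self.buckets.values())

# Usage: pass it as batch_sampler so the DataLoader never mixes buckets.
# loader = torch.utils.data.DataLoader(dataset, batch_sampler=BucketBatchSampler(sizes, 4), collate_fn=collate_fn)
```

Each yielded batch can then be resized/cropped to its bucket’s resolution inside the collate_fn, so tensors within a batch always share a shape.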