When training the model on datasets with diverse image sizes, e.g. from 256 up to 1024, I typically resize every image to a specific size, e.g. 512, and then train the model on 512x512.
After training, the model tends to generate pixelated images when CFG is set to a high value. (I think it’s because the 256x256 images are being upscaled to 512.)
- Is “converting everything into the same size” how people typically train the model?
- Should we add extra tags, e.g. “pixelated” or “low quality”, to the captions of small images and use them as negative prompts at inference?
- Are there any best practices for training the model on diverse image sizes using Hugging Face datasets? How do I batch several images with similar sizes together?
I believe converting everything to the same size is what our training scripts usually do. However, I do see potential issues with recommending that in every case.
I don’t know about number 2.
You can post-process the images, since they’re loadable through standard torch datasets. We have examples in some of the training scripts (controlnet, iirc).
Regarding number 3, I was also referring to images with different aspect ratios, which people usually handle with Aspect Ratio Bucketing (ARB). How do I do ARB with Hugging Face datasets? Is it possible?
Yes, that’s a good point, aspect ratio bucketing makes a lot of sense. I don’t think there’s a generic way to load only images of a particular aspect ratio, since that depends on how the dataset is stored. Assuming there’s some set of index files that point to the URLs of the images, I’d maybe recommend forking the dataset into multiple datasets such that the index files are filtered on resolution. That seems like the most straightforward way.
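To make the bucketing idea a bit more concrete, here is a minimal sketch in plain Python of how image indices could be grouped by aspect ratio once the widths and heights are known (for example, from the index files). It is not taken from any of the training scripts: the `BUCKETS` list and the names `nearest_bucket` and `make_batches` are hypothetical, and the bucket resolutions are just an illustrative choice.

```python
import math
import random
from collections import defaultdict

# Hypothetical bucket resolutions (width, height); an illustrative choice only.
BUCKETS = [(512, 512), (576, 448), (448, 576), (640, 384), (384, 640)]


def nearest_bucket(width, height, buckets=BUCKETS):
    """Return the bucket whose aspect ratio is closest (in log space) to the image's."""
    log_ratio = math.log(width / height)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - log_ratio))


def make_batches(sizes, batch_size, seed=0, drop_last=False):
    """Group image indices by bucket, then cut each group into batches.

    `sizes` is a list of (width, height) tuples, one per image.
    Returns a shuffled list of (bucket, [indices]) pairs; every batch
    contains only images assigned to the same bucket.
    """
    groups = defaultdict(list)
    for idx, (w, h) in enumerate(sizes):
        groups[nearest_bucket(w, h)].append(idx)

    rng = random.Random(seed)
    batches = []
    for bucket, indices in groups.items():
        rng.shuffle(indices)
        for i in range(0, len(indices), batch_size):
            batch = indices[i:i + batch_size]
            if drop_last and len(batch) < batch_size:
                continue  # skip the incomplete final batch of this bucket
            batches.append((bucket, batch))
    rng.shuffle(batches)  # mix buckets so training does not see them in order
    return batches
```

Each image in a batch would then be resized and cropped to that batch's bucket resolution before stacking. The lists of indices could presumably be passed to a torch `DataLoader` through its `batch_sampler` argument, but I haven't verified that against the training scripts.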
Hey @offchan, could you tell me how you tackled the problem of training controlnet with different image sizes?
I am trying to train on laion2B-en, which has various aspect ratios. Did you group similarly sized images together, or do something else? I would really appreciate your answer!
I crop and resize them to be square images of the same size, e.g. 512x512.
I only train on images which are bigger than 512x512. If they are not bigger, I drop them from the training set. The reason is that I don’t want to upscale small images, as they will become pixelated and might hurt model quality. I’m not sure if this is how most people do it, but it seems to work fine for now.
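Roughly, the filtering and cropping described above could look like the sketch below. The function names are made up for illustration, the crop is assumed to be a center crop (a random crop would work the same way), and images whose shorter side is exactly 512 are kept here since they need no upscaling.

```python
def keep_image(width, height, target=512):
    """Keep only images that never need upscaling to reach target x target.

    Assumption: an image whose shorter side equals `target` is kept.
    """
    return min(width, height) >= target


def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def crop_and_resize(img, target=512):
    """Center-crop a PIL image to a square, then resize it to target x target.

    `img` is expected to be a PIL.Image.Image; the default resampling
    filter of `resize` is used.
    """
    box = center_crop_box(*img.size)
    return img.crop(box).resize((target, target))
```

With width and height columns available, `keep_image` could be used as the predicate when filtering the dataset, and `crop_and_resize` applied to each remaining image.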
I haven’t done Aspect Ratio Bucketing yet, so I can’t train the model with varying aspect ratios. The images are currently all cropped to be square.