Aspect ratio bucketing when fine-tuning SDXL

I am trying to understand how to use aspect ratio bucketing with the diffusers library, so that I can fine-tune SDXL 1.0 on training images with a variety of resolutions and aspect ratios without having to crop them heavily.

I have been researching how bucketing works and how it is implemented, and I'm hoping people here can clarify a few technical concepts I'm finding difficult, and help me work out how to implement bucketing when training SDXL with diffusers.

In the original SDXL paper, the authors describe conditioning the model on the original image size and then fine-tuning on multiple aspect ratios using bucketing:

“… we propose to condition the U-Net model on the original image resolution, which is trivially available during training. In particular, we provide the original (i.e., before any rescaling) height and width of the images as an additional conditioning to the model C_size=(h,w). Each component is independently embedded using a Fourier feature encoding, and these encodings are concatenated into a single vector that we feed into the model by adding it to the timestep embedding. At inference time, a user can then set the desired apparent resolution of the image via this size-conditioning. Evidently, the model has learned to associate the conditioning C_size with the resolution.”
[…]
“Real-world datasets include images of widely varying sizes and aspect-ratios. While the common output resolutions for text-to-image models are square images of 512 x 512 or 1024 x 1024 pixels, we argue that this is a rather unnatural choice, given the widespread distribution and use of landscape (e.g., 16:9) or portrait format screens. Motivated by this, we finetune our model to handle multiple aspect-ratios simultaneously: We follow common practice and partition the data into buckets of different aspect ratios, where we keep the pixel count as close to 1024² pixels as possible, varying height and width accordingly in multiples of 64.”
[…]
“During optimization, a training batch is composed of images from the same bucket, and we alternate between bucket sizes for each training step. Additionally, the model receives the bucket size (or, target size) as a conditioning, represented as a tuple of integers C_ar=(h,w) which are embedded into a Fourier space in analogy to the size- and crop-conditionings described above. In practice, we apply multi-aspect training as a finetuning stage after pretraining the model at a fixed aspect-ratio and resolution and combine it with the conditioning techniques introduced in Sec. 2.2 via concatenation along the channel axis.”
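
To make sure I'm parsing that correctly, here is a rough Python sketch of how I currently understand the bucketing part. None of this comes from the paper or from diffusers; the helper names, the max_ratio cutoff, and the remainder handling are my own assumptions, so please correct me if I've misread it:

```python
import math
import random

# My reading of the multi-aspect bucketing described above (not an official
# implementation; the generation details are assumptions on my part).

def make_buckets(target_pixels=1024 * 1024, step=64, max_ratio=2.0):
    """Enumerate (h, w) pairs in multiples of `step` whose pixel count is
    as close to `target_pixels` as possible, one per aspect ratio."""
    buckets = set()
    for w in range(step, int(max_ratio * math.sqrt(target_pixels)) + 1, step):
        h = max(step, round(target_pixels / w / step) * step)
        if 1.0 / max_ratio <= h / w <= max_ratio:
            buckets.add((h, w))
    return sorted(buckets)

def assign_bucket(orig_h, orig_w, buckets):
    """Put an image in the bucket whose aspect ratio is closest to its own."""
    ar = orig_h / orig_w
    return min(buckets, key=lambda hw: abs(hw[0] / hw[1] - ar))

def bucketed_batches(samples, buckets, batch_size):
    """samples: [(dataset_index, (orig_h, orig_w)), ...]
    Returns (bucket, [indices]) batches where every image shares one bucket,
    shuffled so training alternates between bucket sizes step to step."""
    per_bucket = {b: [] for b in buckets}
    for idx, (h, w) in samples:
        per_bucket[assign_bucket(h, w, buckets)].append(idx)
    batches = []
    for bucket, indices in per_bucket.items():
        random.shuffle(indices)
        # Drop the remainder that doesn't fill a whole batch (an assumption).
        for i in range(0, len(indices) - batch_size + 1, batch_size):
            batches.append((bucket, indices[i:i + batch_size]))
    random.shuffle(batches)
    return batches
```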

So my main questions are:

  • Why do images for each batch need to come from the same bucket? Given that dimension information is being embedded along with the image, I don’t understand why this is necessary. Why is it important that all images within a batch are the same size/shape?
  • What is the rationale behind “applying multi-aspect training as a finetuning stage after pretraining the model at a fixed aspect ratio and resolution”? Why not just use bucketing for the whole training process?
  • What is meant by “concatenation along the channel axis”?
  • Can someone explain the mechanics of how the C_size = (h, w) information is embedded, and how this relates to what happens when I specify a resolution at inference/generation time? How do I pass this information about the training image dimensions to the model when training with the diffusers library? (My current guess is sketched right after this list; corrections welcome.)
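
For that last question, here is roughly how I think the conditioning is wired up in diffusers, pieced together from the StableDiffusionXLPipeline and the train_text_to_image_sdxl.py example. This is a sketch of my understanding rather than verified working code, and the batch key names (original_sizes, crop_top_lefts, bucket_sizes) are placeholders I made up:

```python
import torch

def compute_time_ids(original_size, crop_top_left, target_size, device, dtype):
    # As I understand it, SDXL's UNet takes six integers per image: the
    # original (pre-resize) height/width, the top-left corner of the crop,
    # and the target (bucket) height/width. Each integer is embedded with
    # sinusoidal ("Fourier") features inside the UNet, the embeddings are
    # concatenated, projected, and added to the timestep embedding.
    add_time_ids = list(original_size) + list(crop_top_left) + list(target_size)
    return torch.tensor([add_time_ids], device=device, dtype=dtype)

# Inside the training loop (unet, noisy_latents, timesteps, prompt_embeds and
# pooled_prompt_embeds assumed to already exist, as in the diffusers example):
#
# add_time_ids = torch.cat([
#     compute_time_ids(orig, crop, bucket,
#                      noisy_latents.device, noisy_latents.dtype)
#     for orig, crop, bucket in zip(batch["original_sizes"],
#                                   batch["crop_top_lefts"],
#                                   batch["bucket_sizes"])
# ])
# model_pred = unet(
#     noisy_latents,
#     timesteps,
#     encoder_hidden_states=prompt_embeds,
#     added_cond_kwargs={"text_embeds": pooled_prompt_embeds,
#                        "time_ids": add_time_ids},
# ).sample
```

If that is right, then at inference time the pipeline builds the same six-number vector from its original_size / crops_coords_top_left / target_size arguments, which would explain how the size conditioning seen during training maps onto the resolution you request at generation time. Is that correct?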

Thanks!

Bump. I would also like to see some answers on these topics.