Meaning of vector fields in Flux and SD3 loss function

lacrimosist · September 13, 2024, 1:13pm

Hi, I’m trying to match the theoretical loss function to its implementation in diffusers for the Flux or SD3 model. (I guess the loss functions for both are the same.)

In the research paper of SD3 (Scaling Rectified Flow Transformers for High-Resolution Image Synthesis), they represent the loss function as Loss_CFM in eq.(8) and eq.(12).

My questions are two-fold:
[1] According to eq.(8), the loss minimizes the difference between the ground-truth vector field (u_t) and the predicted vector field (v_theta). But in the loss code in diffusers/examples/dreambooth/train_dreambooth_lora_flux.py,

according to the following two lines,

https://github.com/huggingface/diffusers/blob/6dc6486565ea1d8d1be567eefc1094e9185560a1/examples/dreambooth/train_dreambooth_lora_flux.py#L1688

              height=int(model_input.shape[2] * vae_scale_factor / 2),
              width=int(model_input.shape[3] * vae_scale_factor / 2),
              vae_scale_factor=vae_scale_factor,
          )
          
          # these weighting schemes use a uniform timestep sampling
          # and instead post-weight the loss
          weighting = compute_loss_weighting_for_sd3(weighting_scheme=args.weighting_scheme, sigmas=sigmas)
          
          # flow matching loss
          target = noise - model_input
          
          if args.with_prior_preservation:
              # Chunk the noise and model_pred into two parts and compute the loss on each part separately.
              model_pred, model_pred_prior = torch.chunk(model_pred, 2, dim=0)
              target, target_prior = torch.chunk(target, 2, dim=0)
          
              # Compute prior loss
              prior_loss = torch.mean(
                  (weighting.float() * (model_pred_prior.float() - target_prior.float()) ** 2).reshape(
                      target_prior.shape[0], -1

https://github.com/huggingface/diffusers/blob/6dc6486565ea1d8d1be567eefc1094e9185560a1/examples/dreambooth/train_dreambooth_lora_flux.py#L1706

              prior_loss = torch.mean(
                  (weighting.float() * (model_pred_prior.float() - target_prior.float()) ** 2).reshape(
                      target_prior.shape[0], -1
                  ),
                  1,
              )
              prior_loss = prior_loss.mean()
          
          # Compute regular loss.
          loss = torch.mean(
              (weighting.float() * (model_pred.float() - target.float()) ** 2).reshape(target.shape[0], -1),
              1,
          )
          loss = loss.mean()
          
          if args.with_prior_preservation:
              # Add the prior loss to the instance loss.
              loss = loss + args.prior_loss_weight * prior_loss
          
          accelerator.backward(loss)
          if accelerator.sync_gradients:

it seems that ‘the vector field’ is represented as the difference between noise (epsilon) and model_input (x_0). So I’m a bit confused by the gap. What am I missing here?

[2] The second question is related to the first. According to eq.(12), the model output (model_pred) seems to be the noise, the same as in the Stable Diffusion XL case. But as far as I can infer from the lines above, the model output is ‘noise - model_input’.

Can someone help me understand these points?
Thank you

John6666 · September 13, 2024, 11:03pm

Ugh, I have no idea about math…
You might want to send a direct mention to those who seem to know more.
(@+username)
I don’t have a complete grasp of the geography of HF either, but people around here might have a clue.

Gezhiwa · November 29, 2024, 3:02am

Assume that the MMDiT/UNet predicts the noise added in going from x_{t-1} to x_t.
In Flux or SD3:
x_t = (1-t) * x0 + t * noise
x_{t-1} = (1-(t-1)) * x0 + (t-1) * noise
noise_target = x_t - x_{t-1} = (1-t) * x0 + t * noise - x0 - (1-t) * x0 - t * noise + noise = noise - x0
noise_pred = MMDiT(x_t) <-> noise_target = noise - x0
so the target will be noise - x0.
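
The same target also falls out without the unit-step assumption: differentiating x_t = (1-t) * x0 + t * noise with respect to t gives noise - x0 at every t, which is the velocity of the straight path. A minimal numerical check in plain Python, with made-up values (the helper name `interpolate` is hypothetical and not part of diffusers):

```python
import random


def interpolate(x0, noise, t):
    """Linear path x_t = (1 - t) * x0 + t * noise, applied elementwise."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]


random.seed(0)
x0 = [random.gauss(0.0, 1.0) for _ in range(4)]     # stands in for model_input
noise = [random.gauss(0.0, 1.0) for _ in range(4)]  # stands in for the sampled noise

# Target used in the training script: noise - model_input.
target = [n - a for a, n in zip(x0, noise)]

# Central finite difference of x_t in t, at an arbitrary time.
t, h = 0.3, 1e-3
x_plus = interpolate(x0, noise, t + h)
x_minus = interpolate(x0, noise, t - h)
velocity = [(p - m) / (2.0 * h) for p, m in zip(x_plus, x_minus)]

# The path is linear in t, so only floating-point error remains.
max_err = max(abs(v - u) for v, u in zip(velocity, target))
print(max_err)
```

Changing `t` to any other value in (0, 1) should leave `max_err` negligible, since the derivative of a straight line does not depend on where it is evaluated.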
In SD1.5:
x_t = x_{t-1} + noise = x_{t-1} + noise_target = DDPM.add_noise(x0, noise, t) <PS: something like alpha_t * x0 + beta_t * noise>
noise_pred = UNet(x_t) <-> noise_target = noise = x_t - x_{t-1}
The noise here is exactly the noise added from t-1 to t, that is, the target.
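
On question [2]: under the linear path x_t = (1-t) * x0 + t * noise, a velocity prediction and a noise prediction carry the same information, because noise = x_t + (1-t) * v and x0 = x_t - t * v when v = noise - x0. A squared error on the noise is then the squared error on the velocity scaled by (1-t)^2, which is one way to read a noise-style loss and the ‘noise - model_input’ target as the same objective up to a time-dependent weight. A small sketch with scalar stand-ins (the helper names are hypothetical, and this is not how diffusers structures its code):

```python
def velocity_to_noise(x_t, v, t):
    """Recover noise from a velocity, assuming x_t = (1 - t) * x0 + t * noise."""
    return x_t + (1.0 - t) * v


def velocity_to_x0(x_t, v, t):
    """Recover x0 from a velocity under the same assumption."""
    return x_t - t * v


x0, noise, t = 0.8, -1.5, 0.25
x_t = (1.0 - t) * x0 + t * noise
v_true = noise - x0  # the flow matching target

# With the exact velocity, both noise and x0 are recovered.
noise_rec = velocity_to_noise(x_t, v_true, t)
x0_rec = velocity_to_x0(x_t, v_true, t)

# With an imperfect prediction, the noise error is the velocity error times (1 - t).
v_pred = v_true + 0.1
noise_err = velocity_to_noise(x_t, v_pred, t) - noise
velocity_err = v_pred - v_true
print(noise_rec, x0_rec, noise_err, (1.0 - t) * velocity_err)
```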
