Meaning of vector fields in Flux and SD3 loss function

Hi, I’m trying to match the theoretical loss function to its implementation in diffusers for the Flux and SD3 models. (I assume the loss functions for both are the same.)

In the SD3 paper (Scaling Rectified Flow Transformers for High-Resolution Image Synthesis), the loss function is given as Loss_CFM in eq. (8) and eq. (12).

My questions are two-fold:
[1] According to eq. (8), the loss optimizes the difference between the ground-truth vector field (u_t) and the predicted vector field (v_theta). But in the loss code in diffusers/examples/dreambooth/train_dreambooth_lora_flux.py,

judging by the following two lines,

the vector field seems to be represented as the difference between the noise (epsilon) and model_input (x_0). So I’m a bit confused by the gap. What am I missing here?
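For context, here is a minimal NumPy sketch of the computation being described. The variable names (model_input, sigmas, model_pred) and shapes are my assumptions for illustration, not the actual script:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the tensors in the training script
# (names and shapes are assumptions, just for illustration).
model_input = rng.standard_normal((1, 4, 8, 8))  # x_0, the clean latents
noise = rng.standard_normal((1, 4, 8, 8))        # epsilon
sigmas = 0.7                                     # sampled timestep t in [0, 1]

# Rectified-flow forward (noising) interpolation:
noisy_model_input = (1.0 - sigmas) * model_input + sigmas * noise

# The regression target -- this is the ground-truth vector field u_t:
target = noise - model_input

# The loss is then an MSE between the model output and that target.
model_pred = rng.standard_normal(model_input.shape)  # placeholder output
loss = np.mean((model_pred - target) ** 2)
```

So the "two lines" in question build the noised input and the target, and the target is the difference noise − x_0 rather than the noise alone.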

[2] The second question is related to the above. According to eq. (12), the model output (model_pred) seems to be the noise, the same as in the Stable Diffusion XL case. But as I infer from the lines above, the model output is ‘noise - model_input’.

Can someone help me understand this?
Thank you :slight_smile:


Ugh, I have no idea about the math…
You might want to send a direct mention to those who seem to know more.
(@+username)
I don’t have a complete grasp of the landscape of HF either, but people around here might have a clue.

Assume that the MMDiT/UNet predicts the vector field that moves x_t along the noising path.
In Flux or SD3 (rectified flow), with t = 0 at the data and t = 1 at pure noise:
x_t = (1 - t) * x0 + t * noise
Taking a small step dt along the path:
x_{t+dt} - x_t = [(1 - t - dt) * x0 + (t + dt) * noise] - [(1 - t) * x0 + t * noise] = dt * (noise - x0)
so the velocity d(x_t)/dt = noise - x0, constant along the whole straight line.
noise_pred = MMDiT(x_t) <—> target = noise - x0
so the target will be noise - x0, which is exactly u_t from eq. (8).
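A quick numeric sanity check of this, assuming the rectified-flow interpolation between data (t = 0) and noise (t = 1):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)     # clean data point
noise = rng.standard_normal(16)  # Gaussian noise sample

def x(t):
    # Rectified-flow interpolation between data (t=0) and noise (t=1).
    return (1.0 - t) * x0 + t * noise

# Finite-difference estimate of the velocity d(x_t)/dt at t = 0.3:
t, dt = 0.3, 1e-6
velocity = (x(t + dt) - x(t)) / dt

# It matches noise - x0 for any t, because the path is a straight line.
assert np.allclose(velocity, noise - x0, atol=1e-4)
```

The same check passes at every t, which is why the target in the training script has no t-dependence at all.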
In SD1.5 (DDPM, epsilon-prediction):
x_t = DDPM.add_noise(x0, noise, t), i.e. x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
noise_pred = UNet(x_t) <—> target = noise
Here the target is epsilon itself: the model predicts the total noise mixed into x_t. That is why the SD3/Flux target (noise - x0, a velocity) looks different from the SD1.5/SDXL target (noise), even though both are plain regression losses.
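A sketch of the DDPM side, with an assumed toy beta schedule (the real schedule lives in the diffusers scheduler config):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)     # clean data
noise = rng.standard_normal(16)  # epsilon

# Toy linear beta schedule (assumed values, just for illustration).
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

def add_noise(x0, noise, t):
    # DDPM forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise
    abar = alphas_cumprod[t]
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * noise

x_t = add_noise(x0, noise, t=500)

# Epsilon-prediction target: the total noise mixed into x_t.
target = noise

# Given a perfect noise prediction, x0 is recoverable in closed form:
abar = alphas_cumprod[500]
x0_recovered = (x_t - np.sqrt(1.0 - abar) * target) / np.sqrt(abar)
assert np.allclose(x0_recovered, x0)
```

Note the epsilon target is the same tensor at every timestep, just as the velocity target is in rectified flow; the two parameterizations differ in what that tensor means, not in the shape of the loss.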
