I'm studying the training process. It looks like model_pred is the result of denoising the latent with the UNet.
I don't understand why the loss compares this result to noise. Shouldn't it be compared with the latent encoded by the VAE encoder?
And what is the role of noise_scheduler.get_velocity in the elif branch?
# Get the target for loss depending on the prediction type
if noise_scheduler.config.prediction_type == "epsilon":
    target = noise
elif noise_scheduler.config.prediction_type == "v_prediction":
    target = noise_scheduler.get_velocity(latents, noise, timesteps)
else:
    raise ValueError(f"Unknown prediction type {noise_scheduler.config.prediction_type}")
# Predict the noise residual and compute loss
# (my note) If this is denoising, shouldn't the target be the VAE-encoded edit_image from before the noise was added?
model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
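
For context, here is a minimal scalar sketch of how the two targets relate to the clean latent, assuming the standard DDPM forward process x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise. The function names and the sample values are made up for illustration and are not part of the training script. Under that assumption, the UNet's output is the predicted noise (or velocity) rather than the denoised latent, and the clean latent can be recovered from either prediction, which would explain why the loss does not compare against the VAE-encoded latent directly.

```python
import math

# Hypothetical helpers; plain floats stand in for latent tensors.

def add_noise(x0, noise, alpha_bar):
    # Forward process: mix the clean latent with noise at a given timestep.
    return math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * noise

def velocity(x0, noise, alpha_bar):
    # v-prediction target: sqrt(alpha_bar) * noise - sqrt(1 - alpha_bar) * x0.
    return math.sqrt(alpha_bar) * noise - math.sqrt(1.0 - alpha_bar) * x0

def x0_from_epsilon(x_t, eps, alpha_bar):
    # Invert the forward process given a (perfect) noise prediction.
    return (x_t - math.sqrt(1.0 - alpha_bar) * eps) / math.sqrt(alpha_bar)

def x0_from_velocity(x_t, v, alpha_bar):
    # Recover the clean latent given a (perfect) velocity prediction.
    return math.sqrt(alpha_bar) * x_t - math.sqrt(1.0 - alpha_bar) * v

# Made-up example values, not taken from any real run.
x0, noise, alpha_bar = 0.8, -0.3, 0.6
x_t = add_noise(x0, noise, alpha_bar)
v = velocity(x0, noise, alpha_bar)

recovered_from_eps = x0_from_epsilon(x_t, noise, alpha_bar)
recovered_from_v = x0_from_velocity(x_t, v, alpha_bar)
```

If this reading is right, predicting the noise, the velocity, or the clean latent are interchangeable parameterizations of the same information, and prediction_type only selects which one the UNet is trained to output.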