Why is the target the noise when calculating the loss?

I’m studying the training process. It looks like model_pred is the result of denoising the latent with the UNet.

I don’t understand why the loss compares this result to the noise. Shouldn’t we compare it with the latent encoded by the VAE encoder?

And what is the role of noise_scheduler.get_velocity in the elif branch?

# Get the target for loss depending on the prediction type
if noise_scheduler.config.prediction_type == "epsilon":
    target = noise
elif noise_scheduler.config.prediction_type == "v_prediction":
    target = noise_scheduler.get_velocity(latents, noise, timesteps)
else:
    raise ValueError(f"Unknown prediction type {noise_scheduler.config.prediction_type}")

# Predict the noise residual and compute loss
# If this is denoising, shouldn't the target be edit_image encoded by the VAE, before the noise is added?
model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")

That noise is based on the latent image.

Is the noise really based on the latent image?
Because a few lines above, noise is set to random noise:
noise = torch.randn_like(latents)

The random noise is added to the image in img2img.
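
To make the relationship between the three quantities concrete, here is a minimal sketch using plain Python floats instead of tensors, so it runs without torch or diffusers. It assumes the standard DDPM forward process, x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, and the velocity definition v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0. The helper names and the alpha_bar value are made up for illustration and are not taken from the training script.

```python
import math
import random

def add_noise(x0, eps, alpha_bar):
    """Forward process: mix the clean latent with random noise."""
    return math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps

def velocity(x0, eps, alpha_bar):
    """Target used for the "v_prediction" case (assumed definition, see above)."""
    return math.sqrt(alpha_bar) * eps - math.sqrt(1.0 - alpha_bar) * x0

random.seed(0)
alpha_bar = 0.6                 # hypothetical cumulative alpha for one timestep
latent = 1.5                    # stands in for one element of the VAE-encoded latent
noise = random.gauss(0.0, 1.0)  # drawn independently of `latent`, like torch.randn_like

noisy_latent = add_noise(latent, noise, alpha_bar)

# "epsilon" target: if the model predicts `noise` exactly, the clean latent
# follows by inverting the forward process.
recovered_from_eps = (
    noisy_latent - math.sqrt(1.0 - alpha_bar) * noise
) / math.sqrt(alpha_bar)

# "v_prediction" target: the clean latent also follows from the velocity.
v = velocity(latent, noise, alpha_bar)
recovered_from_v = (
    math.sqrt(alpha_bar) * noisy_latent - math.sqrt(1.0 - alpha_bar) * v
)

print(abs(recovered_from_eps - latent) < 1e-9)
print(abs(recovered_from_v - latent) < 1e-9)
```

So the noise itself is random and does not depend on the latent, but once the noisy latent and the timestep are fixed, knowing the noise (or the velocity) is equivalent to knowing the clean latent. That is why either one can serve as the regression target instead of the VAE latent.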