Why is the loss of a diffusion model calculated between the "RANDOM noise" and the "model-predicted noise", and not between the "actually added noise" and the "model-predicted noise"?

Each training step is like this:

  1. we have a clean image from the training data.
  2. we generate random Gaussian noise with std = 1. Let's call it the big noise.
  3. we scale the big noise down based on the timestep. For timesteps near the noisy end of the schedule, we scale it down only a little; for timesteps near the clean end, we scale it down a lot. Either way, the std is now less than 1. Let's call the scaled result the small noise.
  4. we add the small noise to the clean image (which is itself scaled down slightly), resulting in a noisy image.
  5. we train the UNet to predict the big noise, given the noisy image and the timestep as input.

Notice that in the last step we ask the model to predict the big noise we drew in step 2, i.e. the very noise that was (in scaled form) added to the image. We are not asking it to predict some new, unrelated random noise.
And the timestep is indeed relevant: it determines the scaling in step 3. One crucial detail is that the timestep is randomly sampled for each training example; it is not stepped through sequentially in a for loop, as intuition might suggest. You can read more about the reasoning behind random timestep sampling in my other answer here
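
To make steps 2 to 5 concrete, here is a minimal sketch of one training step in plain PyTorch. The alpha_bar schedule, the unet callable, and the variable names are placeholders made up for illustration; they are not taken from any particular library.

```python
import torch
import torch.nn.functional as F

T = 1000
# Made-up schedule for illustration: alpha_bar shrinks from ~1 (clean end)
# towards 0 (noisy end). Real schedules (linear, cosine, ...) differ in detail.
alpha_bar = torch.linspace(0.9999, 0.0001, T)

def training_step(unet, clean_images):
    """One training step for a batch of images shaped (B, C, H, W)."""
    b = clean_images.shape[0]

    # Step 2: big noise, a standard Gaussian with std = 1.
    big_noise = torch.randn_like(clean_images)

    # The timestep is sampled uniformly at random per example, not looped over.
    t = torch.randint(0, T, (b,), device=clean_images.device)
    a = alpha_bar.to(clean_images.device)[t].view(b, 1, 1, 1)

    # Step 3: small noise = big noise scaled down by sqrt(1 - alpha_bar[t]).
    # Near the clean end of the schedule this factor is tiny; near the noisy
    # end it is close to 1.
    small_noise = (1.0 - a).sqrt() * big_noise

    # Step 4: noisy image (the clean image is also scaled by sqrt(alpha_bar[t])).
    noisy_images = a.sqrt() * clean_images + small_noise

    # Step 5: the model predicts the BIG noise. The loss target is big_noise,
    # i.e. the very noise that was added (before scaling), not a fresh sample.
    pred = unet(noisy_images, t)
    return F.mse_loss(pred, big_noise)
```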

I urge you to look at the training code to gain a deeper understanding.
Here is the relevant part of the text-to-image training script from diffusers (condensed and lightly paraphrased; the exact lines vary between library versions):
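
```python
# Condensed from diffusers' examples/text_to_image/train_text_to_image.py.
# torch, F (torch.nn.functional), latents, noise_scheduler, unet and
# encoder_hidden_states are all set up earlier in the script.
noise = torch.randn_like(latents)  # the "big" noise, std = 1
bsz = latents.shape[0]

# Random timestep per example.
timesteps = torch.randint(
    0, noise_scheduler.config.num_train_timesteps, (bsz,), device=latents.device
).long()

# Forward diffusion: scale the noise (and the latents) according to the
# timestep and combine them into the noisy input.
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

# The UNet is trained to recover the unscaled noise from the noisy latents.
model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample

# With the default "epsilon" prediction type, the target is the big noise itself.
target = noise
loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
```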

The clean image is named latents and the big noise is named noise in the code above, but the small noise never appears as an explicit variable. The call to add_noise() computes the small noise internally and produces the noisy image, which is named noisy_latents above.
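
For reference, DDPMScheduler's add_noise() boils down to roughly the following (simplified; the real implementation also handles dtype, device, and shape broadcasting):

```python
# alphas_cumprod is the scheduler's precomputed cumulative product of alphas.
sqrt_alpha_prod = alphas_cumprod[timesteps] ** 0.5                    # image scaling factor
sqrt_one_minus_alpha_prod = (1 - alphas_cumprod[timesteps]) ** 0.5    # noise scaling factor

# sqrt_one_minus_alpha_prod * noise is the "small noise" from step 3;
# note that the clean latents are also scaled down by sqrt_alpha_prod.
noisy_latents = sqrt_alpha_prod * latents + sqrt_one_minus_alpha_prod * noise
```

In other words, the noise used in the loss is the same tensor that was added to create noisy_latents, just taken before the timestep-dependent scaling.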