Why is the loss of a diffusion model calculated between the "RANDOM noise" and the "model-predicted noise", and not between the "actually added noise" and the "model-predicted noise"?

Each training step is like this:

  1. we have a clean image from the training data.
  2. we generate random Gaussian noise with std = 1. Let's call it the big noise.
  3. we scale the big noise down based on the timestep. For timesteps near the noisy end of the schedule, we scale it down only a little; for timesteps near the clean end, we scale it down a lot. Either way, the std is now less than 1. Let's call the scaled result the small noise.
  4. we add the small noise to the clean image (which is itself scaled down slightly), resulting in a noisy image.
  5. we train the UNet to predict the big noise, given the noisy image and the timestep as input.

Notice that in the last step we ask the model to predict the big noise we drew in step 2, i.e. the very noise that was (in scaled form) added to the image. We are not asking it to predict some new, unrelated random noise.
And the timestep is indeed relevant: it determines the scaling in step 3. One crucial detail is that the timestep is randomly sampled for each training example; it is not stepped through sequentially in a for loop, as intuition might suggest. You can read more about the reasoning behind random timestep sampling in my other answer here
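
To make steps 2 to 5 concrete, here is a minimal sketch of one training step in plain PyTorch. The alpha_bar schedule, the unet callable, and the variable names are placeholders made up for illustration; they are not taken from any particular library.

```python
import torch
import torch.nn.functional as F

T = 1000
# Made-up schedule for illustration: alpha_bar shrinks from ~1 (clean end)
# towards 0 (noisy end). Real schedules (linear, cosine, ...) differ in detail.
alpha_bar = torch.linspace(0.9999, 0.0001, T)

def training_step(unet, clean_images):
    """One training step for a batch of images shaped (B, C, H, W)."""
    b = clean_images.shape[0]

    # Step 2: big noise, a standard Gaussian with std = 1.
    big_noise = torch.randn_like(clean_images)

    # The timestep is sampled uniformly at random per example, not looped over.
    t = torch.randint(0, T, (b,), device=clean_images.device)
    a = alpha_bar.to(clean_images.device)[t].view(b, 1, 1, 1)

    # Step 3: small noise = big noise scaled down by sqrt(1 - alpha_bar[t]).
    # Near the clean end of the schedule this factor is tiny; near the noisy
    # end it is close to 1.
    small_noise = (1.0 - a).sqrt() * big_noise

    # Step 4: noisy image (the clean image is also scaled by sqrt(alpha_bar[t])).
    noisy_images = a.sqrt() * clean_images + small_noise

    # Step 5: the model predicts the BIG noise. The loss target is big_noise,
    # i.e. the very noise that was added (before scaling), not a fresh sample.
    pred = unet(noisy_images, t)
    return F.mse_loss(pred, big_noise)
```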

I urge you to look at the training code to gain a deeper understanding.
Here is the relevant part of the text-to-image training script from diffusers (condensed and lightly paraphrased; the exact lines vary between library versions):
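
```python
# Condensed from diffusers' examples/text_to_image/train_text_to_image.py.
# torch, F (torch.nn.functional), latents, noise_scheduler, unet and
# encoder_hidden_states are all set up earlier in the script.
noise = torch.randn_like(latents)  # the "big" noise, std = 1
bsz = latents.shape[0]

# Random timestep per example.
timesteps = torch.randint(
    0, noise_scheduler.config.num_train_timesteps, (bsz,), device=latents.device
).long()

# Forward diffusion: scale the noise (and the latents) according to the
# timestep and combine them into the noisy input.
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

# The UNet is trained to recover the unscaled noise from the noisy latents.
model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample

# With the default "epsilon" prediction type, the target is the big noise itself.
target = noise
loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
```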

The clean image is named latents and the big noise is named noise in the code above, but the small noise never appears as an explicit variable. The call to add_noise() computes the small noise internally and produces the noisy image, which is named noisy_latents above.
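
For reference, DDPMScheduler's add_noise() boils down to roughly the following (simplified; the real implementation also handles dtype, device, and shape broadcasting):

```python
# alphas_cumprod is the scheduler's precomputed cumulative product of alphas.
sqrt_alpha_prod = alphas_cumprod[timesteps] ** 0.5                    # image scaling factor
sqrt_one_minus_alpha_prod = (1 - alphas_cumprod[timesteps]) ** 0.5    # noise scaling factor

# sqrt_one_minus_alpha_prod * noise is the "small noise" from step 3;
# note that the clean latents are also scaled down by sqrt_alpha_prod.
noisy_latents = sqrt_alpha_prod * latents + sqrt_one_minus_alpha_prod * noise
```

In other words, the noise used in the loss is the same tensor that was added to create noisy_latents, just taken before the timestep-dependent scaling.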