Why is the loss of a diffusion model calculated between "RANDOM noise" and "model predicted noise", and not between "actual added noise" and "model predicted noise"?

When I trained the U-Net with the loss between “the actual added noise” and “the model predicted noise”,
the model seemed to optimize much, much faster on my training dataset.
May I use this loss?

Does anybody have insight?

Sorry, I’m not sure I understand. What is the alternative noise you’re talking about?

In the above picture, “noise” is noise sampled randomly by “noise = torch.randn(sample_image.shape)”, and it is used for the loss calculation.

But I think the “actually added noise” should be used for the loss calculation,
i.e. the noise added between the “t-1” step and the “t” step.

Why are we using random noise for the loss calculation?

I trained the U-Net using the loss with “the actual added noise”, that is, the noise added between the “t-1” step and the “t” step, NOT “random noise”.
The U-Net then seemed to optimize faster.

Why are we using “random noise” for the diffusion loss calculation, and how is this possible?

The training process is like this (a rough sketch in code follows the list):

  1. we generate a random noise vector with std=1
  2. we scale it down to have std<1
  3. we add the scaled down noise to the image
  4. we predict the noise with std=1
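Here’s a minimal sketch of that training step in plain PyTorch, assuming a DDPM-style schedule given as a precomputed alphas_cumprod tensor and some model that takes (noisy_image, timestep); the names are just for illustration, not from any particular library:

```python
import torch
import torch.nn.functional as F

def training_step(model, clean_image, alphas_cumprod):
    alphas_cumprod = alphas_cumprod.to(clean_image.device)
    # 1. generate random noise with std = 1
    noise = torch.randn_like(clean_image)
    # pick a random timestep for each image in the batch
    t = torch.randint(0, len(alphas_cumprod), (clean_image.shape[0],), device=clean_image.device)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)  # assumes (B, C, H, W) images
    # 2.-3. scale the noise down (its std becomes < 1) and mix it into the image
    noisy_image = a_bar.sqrt() * clean_image + (1 - a_bar).sqrt() * noise
    # 4. the model is trained to predict the std=1 noise, not the scaled-down one
    pred = model(noisy_image, t)
    return F.mse_loss(pred, noise)
```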

If I understand correctly, you are asking why we are predicting the noise with std=1 instead of the one with std<1, right?

There are many ways to formulate the output of the model:

  1. predict the noise with std=1 (normalized noise)
  2. predict the noise with std<1 (actual added noise)
  3. predict the original clean image itself

You can actually try all of these formulations. In fact, someone has tried the 3rd formulation on a toy project and it also works. Check the 2nd notebook in this repo: diffusion-models-class/unit1 at main · huggingface/diffusion-models-class · GitHub

But I think the 1st formulation has the advantage that it makes the low-noise predictions more important than the 2nd formulation does.
Even if the actual added noise has a std of only 0.01, the 1st approach would still predict it as std=1, therefore making it 100x more important in the loss. This should result in a model that cares a lot about denoising correctly in the last few inference steps.
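A quick numeric illustration of that point (the 0.9 factor below is just a made-up prediction with the same relative error under both formulations):

```python
import torch

sigma = 0.01                   # noise scale at a low-noise timestep (example value)
eps = torch.randn(100_000)     # the std=1 "normalized" noise
scaled = sigma * eps           # the noise that was actually added to the image

pred_eps = 0.9 * eps           # formulation 1: the target is eps (std = 1)
pred_scaled = 0.9 * scaled     # formulation 2: the target is the scaled noise (std = 0.01)

print(torch.mean((pred_eps - eps) ** 2).item())        # ~1e-2
print(torch.mean((pred_scaled - scaled) ** 2).item())  # ~1e-6
# with the std=1 target the loss (and its gradient) does not shrink away at
# low-noise timesteps, so those timesteps still get a strong training signal
```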

Another advantage is probably about forcing the model to always predict noise that has std=1; this might help the model stabilize? I’m not sure about this reasoning, though.

In short, it’s all about the trade-offs in model performance. Research is still ongoing into which formulation is best.


You may have no idea how much you’ve helped me,
and I bet there will be many who wonder about this as well.
I hope they find the repo you mentioned.
Thanks a lot, @offchan
Good luck!


Another advantage is probably about forcing the model to always predict noise that has std=1; this might help the model stabilize? I’m not sure about this reasoning, though.

Intuitively, this makes a lot of sense. It makes the model output scaling independent of the timestep, right?

Yeah, that’s what I think.

Hi, I have read the answers and I am still confused. Why do we want to use the U-Net to predict the random noise, which is independent of the timestep? If we need a prediction of random noise, why can’t we directly generate random noise in the reverse process?


Same question. Have you found the explanation yet?

@offchan Hi, sorry to bother you; I appreciate your answer in this topic. However, I am still confused. Why do we want to use the U-Net to predict the random noise, which is independent of the timestep? If we need a prediction of random noise, why can’t we directly generate random noise in the reverse process?

Each training step is like this:

  1. we have a clean image from the training data.
  2. we generate random Gaussian noise with std=1. Let’s call it the big noise
  3. we scale the big noise down based on the timestep. For timesteps early in the denoising schedule (high noise levels), we scale the big noise down just a little bit; for late timesteps (low noise levels, close to the clean image), we scale it down a lot. Now the std will be less than 1. Let’s call the scaled noise the small noise
  4. we add the small noise to the clean image, resulting in a noisy image
  5. we train the U-Net to predict the big noise, given the noisy image and the timestep as input.

Notice that in the last step we are asking the model to predict the big noise we generated in step 2. We are not asking it to predict a new random noise.
And the timestep is indeed relevant in step 3; the quick check below shows how much the noise actually gets scaled at different timesteps. One crucial detail here is that the timestep is randomly sampled. It’s not sampled sequentially in a for loop as our intuition might suggest. You can read more about the logic behind the random timestep in my other answer here
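For example, you can inspect those scales with diffusers’ DDPMScheduler (assuming the default betas; the exact numbers depend on the scheduler config):

```python
import math
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
for t in [10, 500, 990]:
    a_bar = scheduler.alphas_cumprod[t].item()
    print(f"t={t:4d}: image scale = {math.sqrt(a_bar):.3f}, noise scale = {math.sqrt(1 - a_bar):.3f}")
# large t (where denoising starts) -> noise scale close to 1: almost pure noise
# small t (where denoising ends)   -> noise scale close to 0: almost clean image
```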

I urge you to look at the training code to gain a deeper understanding.
Here’s roughly the code from the text-to-image training script by diffusers (paraphrased key lines; the exact script may differ between versions):
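```python
# paraphrased from diffusers' train_text_to_image.py; variable names follow the script
# sample the "big noise" that we'll add to the clean latents
noise = torch.randn_like(latents)
bsz = latents.shape[0]
# sample a random timestep for each image in the batch
timesteps = torch.randint(
    0, noise_scheduler.config.num_train_timesteps, (bsz,), device=latents.device
).long()

# add the (internally scaled-down) noise to the latents according to the
# noise magnitude at each timestep -- this is the forward diffusion process
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

# get the text embedding for conditioning
encoder_hidden_states = text_encoder(batch["input_ids"])[0]

# predict the noise residual and compute the loss against the std=1 noise
model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
loss = F.mse_loss(model_pred.float(), noise.float(), reduction="mean")
```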

The clean image is named latents and the big noise is named noise in the code above. The small noise is not shown explicitly: the line that calls the add_noise() function internally computes the small noise and produces the noisy image, which is named noisy_latents above.

Check the answer above.