Does ControlNet (and other diffusers) only include 1 noise injection per iteration in training loop?

Yes, you are understanding it correctly.
Why is it done like this? Here is my reasoning:

  1. Randomizing the noise level (the `timesteps` value) forces the model to handle every noise level. In theory, `timesteps` doesn’t even need to be an int; it can be a float. So if you tried to cram all noise levels into one batch, you would fail, since you can’t exhaustively sample all floating point values. In the current implementation it’s an int between 0 and 1000, but even with only 1000 levels, you can’t put 1000 samples into one batch, can you?
  2. Because you can’t put all noise levels into one batch, you have to pick some. Say your batch holds 24 samples: would you choose the first 24 timesteps, or randomize them? Uniform randomization is clearly better, because it yields a gradient estimate that is more representative of the whole noise range between 0 and 1000. It’s the same reason we shuffle training samples during training.
  3. Why include 24 different images (each with one noise level) in a training batch, instead of 24 noise levels of a single image? Because the first option gives a better gradient estimate over the whole dataset.

In short, it’s all about estimating the gradient better and weighing the trade-offs of what to put into the batch. Remember that training on every noise level is costly and you don’t have unlimited batch space. We want gradients that estimate the whole dataset, not a single image.
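To make the idea concrete, here is a minimal sketch of the single noise injection per sample: one uniformly random timestep per image in the batch, and one noising step via the closed-form forward process x_t = sqrt(ᾱ_t)·x₀ + sqrt(1 − ᾱ_t)·ε. The beta schedule values and scalar "images" are illustrative placeholders, not the actual diffusers implementation (which does the same thing with `torch.randint` and the scheduler's `add_noise`):

```python
import math
import random

NUM_TRAIN_TIMESTEPS = 1000  # int noise levels in [0, 1000), as in the post

def sample_batch_timesteps(batch_size, rng=random):
    # One independently sampled noise level per image in the batch --
    # the "uniform randomization" argued for above.
    return [rng.randrange(NUM_TRAIN_TIMESTEPS) for _ in range(batch_size)]

# Toy linear beta schedule -> cumulative alpha_bar (illustrative values).
betas = [1e-4 + (0.02 - 1e-4) * t / (NUM_TRAIN_TIMESTEPS - 1)
         for t in range(NUM_TRAIN_TIMESTEPS)]
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

def add_noise(x0, t, eps):
    # Single noise injection: x_t = sqrt(abar_t)*x0 + sqrt(1 - abar_t)*eps
    a = alpha_bars[t]
    return math.sqrt(a) * x0 + math.sqrt(1.0 - a) * eps

# Pretend these scalars are 3 images; each gets its own timestep and noise.
batch = [0.5, -0.2, 0.8]
timesteps = sample_batch_timesteps(len(batch))
noisy = [add_noise(x, t, random.gauss(0.0, 1.0))
         for x, t in zip(batch, timesteps)]
```

At `t = 0` the sample is almost unchanged; near `t = 999`, `alpha_bars[t]` is tiny and the output is almost pure noise, so across many batches the random timesteps cover the full range of corruption levels.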
