A few questions about how (vanilla) diffusion works

Hello, I’ve run a few experiments in Hugging Face’s Google Colab notebook, and some questions have arisen.

If I add noise to an image (from the distribution the model was trained on) until it becomes an isotropic Gaussian (concretely, I add T=1000 steps of noise to the image), I’d expect the model to give me back that same image, since I believed the Gaussian sample X_T should represent the original image, like an embedding. Is that not so? When I run this through the model, it produces a realistic image, but not the same one I used as input. Maybe I’ve done something wrong, but is this behaviour expected?

Then I tried adding only 200–300 steps of noise and denoising that, and this setup works as intended: it reverses the noise almost exactly back to the original image.
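One way to see why 200–300 steps behave so differently from 1000 is to look at how much of the original image survives the forward process. In standard DDPM, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1−ᾱ_t)·ε, so sqrt(ᾱ_t) measures the remaining signal. A minimal sketch, assuming the linear beta schedule from the original DDPM paper (β from 1e-4 to 0.02 over T=1000 steps; the Colab notebook may use a different schedule):

```python
import math

# Assumed linear beta schedule: beta from 1e-4 to 0.02 over T = 1000 steps.
# In the DDPM forward process, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps,
# so sqrt(abar_t) tells us how much of the original image is left at step t.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

abar = 1.0
signal = []  # signal[t-1] = sqrt(abar_t), the surviving fraction of x_0
for beta in betas:
    abar *= 1.0 - beta
    signal.append(math.sqrt(abar))

print(f"signal left after 300 steps:  {signal[299]:.4f}")   # still substantial
print(f"signal left after 1000 steps: {signal[999]:.6f}")  # essentially zero
```

Under this schedule, a few hundred steps still leave a large fraction of x_0 in the noisy sample, so the model has something to recover; after the full 1000 steps, the signal is vanishingly small and the model has almost nothing of the original image to work from.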

My question is: what is happening here? Why does adding all T steps of noise make the model produce an “arbitrary” image, while adding fewer noise steps recovers the original? Intuitively it makes sense, but I thought the point was that even after all T steps of noise we should be able to get back exactly the original image, not an arbitrary one.

Is there some big gap in my understanding of diffusion models?

Also, if I add noise to the image with two different random seeds, I know both of them will produce isotropic Gaussians in the end (after all 1000 forward diffusion steps), but they will be different samples, since they came from different seeds, correct? That means one image can have many different “embeddings”? Is that the case? I mean, I’m still not convinced those serve as embeddings at all, since they don’t reproduce the original, as mentioned above.

Thanks in advance, these questions have been really bugging me.

Right. Adding noise is a lossy operation, and not just destructive but also a random one, as you pointed out.

After 999 steps of noise, there are many possible states, and since the process is random, two distinct inputs could even arrive at the same noisy output.

When we do inference with a model trained this way and ask it about some noisy data (with a timestep up near 1000), it can’t perfectly deduce what the earlier state was, like some kind of lossless decompression. It can only give us some slightly-less-noisy data that it thinks probably could have led to that state.
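And the reverse process is itself stochastic, which is the other reason the original image doesn’t come back: in DDPM ancestral sampling, each step takes the model’s guess, forms a mean, and then adds fresh Gaussian noise back in (except at t=0). A minimal scalar sketch, assuming the standard DDPM posterior mean formula; `predict_eps` stands in for the trained network and is purely hypothetical here:

```python
import math
import random

def ddpm_step(x_t, t, beta_t, abar_t, predict_eps, rng):
    """One DDPM ancestral sampling step (scalar sketch).

    mean = (x_t - beta_t / sqrt(1 - abar_t) * eps_hat) / sqrt(1 - beta_t),
    then fresh noise z is injected for every step except the last.
    """
    eps_hat = predict_eps(x_t, t)          # model's guess at the added noise
    alpha_t = 1.0 - beta_t
    mean = (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps_hat) / math.sqrt(alpha_t)
    if t == 0:
        return mean                        # final step: deterministic
    z = rng.gauss(0.0, 1.0)                # fresh randomness at every step
    return mean + math.sqrt(beta_t) * z
```

Even if two samplers started from the same x_T and the network made identical predictions, different random draws of z at each of the ~1000 steps would push them toward different final images.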

The curious thing to me about how we use this trained model for image synthesis is that even though this inference is all guesswork and probability, we proceed from 1000 back to 0 in a monotonic fashion.

There’s never any “It was probably this, but could also have been that,” nor any distinction between “I’m 99% confident about this prediction” and “it could have been a lot of things, but I guess this seems most likely.”

At least that’s how I understand it from what I’ve learned so far. It’s a bewildering phenomenon.