In Stable Diffusion and other diffusion models, the input to the UNet is (noisy_image, text_embedding, time_embedding). Why is its output the noise rather than a denoised image?
In the traditional UNet for image
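
To make the question concrete, here is a minimal sketch of how a predicted noise and a predicted clean image relate. It assumes the standard DDPM-style forward process, x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps. The function names, the toy 8-element "image", and the value of `alpha_bar_t` are all made up for illustration and are not taken from Stable Diffusion's code. Under that assumption, knowing the noise lets you solve for the clean image, so the two outputs carry the same information at a given timestep.

```python
import math
import random


def add_noise(x0, eps, alpha_bar_t):
    """Assumed forward process: x_t = sqrt(abar) * x0 + sqrt(1 - abar) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]


def x0_from_noise(x_t, eps_pred, alpha_bar_t):
    """Invert the forward process: recover x0 from x_t and a noise estimate."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [(xt - b * e) / a for xt, e in zip(x_t, eps_pred)]


random.seed(0)
x0 = [random.uniform(-1.0, 1.0) for _ in range(8)]   # toy "clean image"
eps = [random.gauss(0.0, 1.0) for _ in range(8)]     # the noise that was added
alpha_bar_t = 0.5                                    # hypothetical schedule value

x_t = add_noise(x0, eps, alpha_bar_t)

# Pretend the UNet predicted the noise perfectly; a real model only approximates it.
x0_rec = x0_from_noise(x_t, eps, alpha_bar_t)
max_err = max(abs(u - v) for u, v in zip(x0, x0_rec))
print(max_err)
```

With a perfect noise estimate the recovered `x0_rec` matches `x0` up to floating-point error, so the question is really about why noise is the preferred training target, not about whether the denoised image can be obtained from it.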