Why is the output of the UNet noise?

In Stable Diffusion and other diffusion models, the input of the UNet is (noisy_image, text_embedding, time_embedding). Why is its output the noise rather than a denoised image?
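
To make the question concrete, here is a minimal NumPy sketch of how the two possible outputs relate. It assumes the standard DDPM forward process x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The function names and the value of `alpha_bar_t` are hypothetical, and the true noise stands in for the UNet's prediction, so this is not any particular library's API.

```python
import numpy as np

def add_noise(x0, eps, alpha_bar_t):
    # Forward process (assumed DDPM form): mix the clean image with Gaussian noise.
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def x0_from_eps(x_t, eps_pred, alpha_bar_t):
    # Invert the forward process: turn a noise prediction into a clean-image estimate.
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # toy "clean image"
eps = rng.standard_normal(x0.shape)          # noise a UNet would be trained to predict
alpha_bar_t = 0.5                            # hypothetical cumulative noise level at step t

x_t = add_noise(x0, eps, alpha_bar_t)

# Here the true noise stands in for a perfect UNet prediction.
x0_rec = x0_from_eps(x_t, eps, alpha_bar_t)
```

Under this assumption, a perfect noise prediction gives back the clean image exactly, so the two outputs seem to carry the same information. My question is why the noise is the one chosen as the training target.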

In the traditional UNet for image