Why does UNet single-step prediction for text-to-image generation produce the following result (output-s1-cguided_pred.png), when the correct result should look like output-s1-dnoised.png? What could be the reason for this discrepancy?
This is one-step inference: I do not apply the denoising prediction method (such as v_prediction) or call scheduler.step().
I just take the raw UNet output and send it to the VAE to get the picture above.
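For context, here is a minimal sketch of this single-step check, assuming standard diffusers Stable Diffusion 1.x components (the model id, prompt, latent size, and guidance scale of 7.5 are my assumptions, not from the original setup):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
unet, vae, scheduler = pipe.unet, pipe.vae, pipe.scheduler

def encode(prompt):
    ids = pipe.tokenizer(prompt, padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         truncation=True, return_tensors="pt").input_ids
    return pipe.text_encoder(ids)[0]          # [1, max_length, hidden_size]

with torch.no_grad():
    # Classifier-free-guidance batch: uncond first, cond second -> [2, 77, 768]
    text_embeddings = torch.cat([encode(""), encode("a photo of a cat")])

    scheduler.set_timesteps(50)
    t = scheduler.timesteps[0]
    latents = torch.randn(1, unet.config.in_channels, 64, 64)
    latents = latents * scheduler.init_noise_sigma

    # One UNet call only -- no prediction-type conversion, no scheduler.step()
    latent_in = scheduler.scale_model_input(torch.cat([latents] * 2), t)
    noise_pred = unet(latent_in, t, encoder_hidden_states=text_embeddings).sample
    uncond, cond = noise_pred.chunk(2)
    guided_pred = uncond + 7.5 * (cond - uncond)

    # Decode the raw guided prediction directly with the VAE
    image = vae.decode(guided_pred / vae.config.scaling_factor).sample
```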
I've found out that this is caused by a wrong unconditional embedding.
For prompts, the pipeline goes tokenizer → encoder → embedding. At the embedding stage, I mixed the conditional hidden_state with the unconditional one (the negative-prompt CLIP output) by summing and averaging them. That produces the wrong shape [1, max_length, hidden_size], whereas the UNet input for classifier-free guidance should be [2, max_length, hidden_size], with the two embeddings concatenated along the batch dimension. With a batch of 1, the uncond/cond split breaks and part of the computation is effectively dead.
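A sketch of the broken vs. fixed embedding stage (same assumed SD 1.x components as above; the prompt is a placeholder):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

def encode(prompt):
    ids = pipe.tokenizer(prompt, padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         truncation=True, return_tensors="pt").input_ids
    return pipe.text_encoder(ids)[0]          # [1, max_length, hidden_size]

cond = encode("a photo of a cat")             # conditional embedding
uncond = encode("")                           # unconditional / negative embedding

# WRONG (what I did): summing/averaging collapses the batch to [1, 77, 768]
# text_embeddings = (cond + uncond) / 2

# RIGHT: concatenate along the batch dimension -> [2, 77, 768], so the UNet
# produces separate uncond and cond predictions that chunk(2) can split
text_embeddings = torch.cat([uncond, cond])
```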