Why are my UNet-generated latents so messy?

Why does single-step UNet prediction for text-to-image generation produce the following result (output-s1-cguided_pred.png)? The correct result should be output-s1-dnoised.png. What could be the reason for this discrepancy? :joy:

output-s1-cguided_pred.png:

This is one-step inference: I don't apply the denoising prediction method (such as v_prediction) or scheduler.step. I'm just taking the raw UNet output and sending it to the VAE, which produces the picture above. :sweat_smile:
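For reference, a minimal sketch of the difference between the two previews, assuming a diffusers-style setup with a v-prediction model (`unet`, `vae`, `scheduler`, `latents`, and `text_embeddings` are placeholders from an existing pipeline, and guidance is omitted for brevity): decoding the raw UNet output yields the messy first image, while decoding the reconstructed pred_original_sample yields a clean preview.

```python
import torch

# Assumptions: diffusers-style components taken from an existing pipeline
# (unet, vae, scheduler, latents, text_embeddings), a v-prediction model,
# and scheduler.set_timesteps(...) already called.
t = scheduler.timesteps[0]  # a single denoising step

with torch.no_grad():
    v = unet(latents, t, encoder_hidden_states=text_embeddings).sample

# Decoding `v` directly gives the messy picture: a v-prediction UNet
# predicts v, not a clean latent. The denoised preview needs
# pred_original_sample = sqrt(alpha_t) * x_t - sqrt(1 - alpha_t) * v.
alpha_prod_t = scheduler.alphas_cumprod[t]
pred_x0 = (alpha_prod_t ** 0.5) * latents - ((1 - alpha_prod_t) ** 0.5) * v

with torch.no_grad():
    image = vae.decode(pred_x0 / vae.config.scaling_factor).sample
```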

The correct output should look like this:

output-s1-dnoised.png:

I've found out that this was caused by a wrong unconditional embedding.

For prompts, the pipeline is tokenizer → text encoder → embedding. In the embedding stage, I was combining the conditional hidden_states with the unconditional embedding (the negative/empty-prompt CLIP output) by summing and averaging them instead of concatenating along the batch dimension. That produced a UNet input of shape [1, max_length, hidden_dim], which should be [2, max_length, hidden_dim] (unconditional and conditional stacked for classifier-free guidance). With a batch of 1, the guidance split no longer has two halves to work with, so the guided prediction comes out as garbage.
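For anyone hitting the same thing, here is a minimal sketch of the standard classifier-free-guidance wiring, assuming diffusers-style components (`tokenizer`, `text_encoder`, `unet`) with placeholder `prompt`, `latents`, `t`, and `guidance_scale`; the key point is concatenating the embeddings along the batch dimension rather than summing or averaging them:

```python
import torch

# Hypothetical sketch, assuming tokenizer, text_encoder, unet, latents,
# t, and guidance_scale come from an existing Stable Diffusion setup.
text_ids = tokenizer(prompt, padding="max_length",
                     max_length=tokenizer.model_max_length,
                     truncation=True, return_tensors="pt").input_ids
uncond_ids = tokenizer("", padding="max_length",
                       max_length=tokenizer.model_max_length,
                       return_tensors="pt").input_ids

with torch.no_grad():
    cond_emb = text_encoder(text_ids)[0]      # [1, max_length, hidden_dim]
    uncond_emb = text_encoder(uncond_ids)[0]  # [1, max_length, hidden_dim]

# Concatenate along the batch dimension -- do NOT sum or average.
# The UNet conditioning must be [2, max_length, hidden_dim].
text_embeddings = torch.cat([uncond_emb, cond_emb], dim=0)

# Duplicate the latents, run the UNet once, then split and guide.
latent_model_input = torch.cat([latents] * 2)
noise_pred = unet(latent_model_input, t,
                  encoder_hidden_states=text_embeddings).sample
noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
```

With the concatenation in place, noise_pred.chunk(2) can split the output back into its unconditional and conditional halves, which is exactly the step that breaks when the batch dimension is 1.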
