Why does UNet single-step prediction for text-to-image generation produce the following result (output-s1-cguided_pred.png), when the correct result should look like output-s1-dnoised.png? What could be the reason for this discrepancy?
This is one-step inference: I do not apply the denoising prediction method (such as v_prediction) or call scheduler.step().
I just take the raw UNet output and send it to the VAE to get the picture above.
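For context, here is a minimal sketch of this single-step check, assuming standard diffusers Stable Diffusion 1.x components (the model id, prompt, latent size, and guidance scale of 7.5 are my assumptions, not from the original setup):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
unet, vae, scheduler = pipe.unet, pipe.vae, pipe.scheduler

def encode(prompt):
    ids = pipe.tokenizer(prompt, padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         truncation=True, return_tensors="pt").input_ids
    return pipe.text_encoder(ids)[0]          # [1, max_length, hidden_size]

with torch.no_grad():
    # Classifier-free-guidance batch: uncond first, cond second -> [2, 77, 768]
    text_embeddings = torch.cat([encode(""), encode("a photo of a cat")])

    scheduler.set_timesteps(50)
    t = scheduler.timesteps[0]
    latents = torch.randn(1, unet.config.in_channels, 64, 64)
    latents = latents * scheduler.init_noise_sigma

    # One UNet call only -- no prediction-type conversion, no scheduler.step()
    latent_in = scheduler.scale_model_input(torch.cat([latents] * 2), t)
    noise_pred = unet(latent_in, t, encoder_hidden_states=text_embeddings).sample
    uncond, cond = noise_pred.chunk(2)
    guided_pred = uncond + 7.5 * (cond - uncond)

    # Decode the raw guided prediction directly with the VAE
    image = vae.decode(guided_pred / vae.config.scaling_factor).sample
```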
I've found out that this is caused by a wrong unconditional embedding.
For prompts, the pipeline goes tokenizer → encoder → embedding. At the embedding stage, I mixed the conditional hidden_state with the unconditional one (the negative-prompt CLIP output) by summing and averaging them. That produces the wrong shape [1, max_length, hidden_size], whereas the UNet input for classifier-free guidance should be [2, max_length, hidden_size], with the two embeddings concatenated along the batch dimension. With a batch of 1, the uncond/cond split breaks and part of the computation is effectively dead.
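A sketch of the broken vs. fixed embedding stage (same assumed SD 1.x components as above; the prompt is a placeholder):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

def encode(prompt):
    ids = pipe.tokenizer(prompt, padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         truncation=True, return_tensors="pt").input_ids
    return pipe.text_encoder(ids)[0]          # [1, max_length, hidden_size]

cond = encode("a photo of a cat")             # conditional embedding
uncond = encode("")                           # unconditional / negative embedding

# WRONG (what I did): summing/averaging collapses the batch to [1, 77, 768]
# text_embeddings = (cond + uncond) / 2

# RIGHT: concatenate along the batch dimension -> [2, 77, 768], so the UNet
# produces separate uncond and cond predictions that chunk(2) can split
text_embeddings = torch.cat([uncond, cond])
```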