SDXL custom pipeline - Input to unet? - Why 2 text encoders?

pcuenq · October 24, 2023, 11:31am

Hello!

You can see how the two text encoders are used here; this is the function that converts prompt(s) to embeddings for the UNet. In particular:

The “pooled_output” of the second text encoder is kept here.
The outputs from the two text encoders are concatenated here.

I would recommend stepping through that function with your debugger using the standard SDXL pipeline. There are many options that make it appear long and complicated, but you’ll see that most of the work is just calling the text encoders (each one requires its own tokenizer) and combining the outputs.
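To make the shapes concrete, here is a minimal sketch of the combining step only. It does not load any model: `fake_text_encoder` and `encode_prompt_sketch` are hypothetical stand-ins that return random arrays, and the sizes (77 tokens, hidden sizes 768 and 1280) are what I'd expect from the standard SDXL base checkpoints, so please verify them against your own pipeline in the debugger.

```python
import numpy as np

SEQ_LEN = 77               # CLIP tokenizers pad/truncate prompts to 77 tokens
DIM_1, DIM_2 = 768, 1280   # assumed hidden sizes of text_encoder and text_encoder_2


def fake_text_encoder(batch, dim, rng):
    """Hypothetical stand-in for a text encoder.

    Returns (per-token hidden states, pooled output) as random arrays
    that only have the right shapes, not meaningful values.
    """
    hidden = rng.standard_normal((batch, SEQ_LEN, dim)).astype(np.float32)
    pooled = rng.standard_normal((batch, dim)).astype(np.float32)
    return hidden, pooled


def encode_prompt_sketch(batch=1, seed=0):
    """Illustrative only: mimics how the two encoder outputs are combined."""
    rng = np.random.default_rng(seed)

    # Each encoder is called on the tokens produced by its own tokenizer.
    hidden_1, _ = fake_text_encoder(batch, DIM_1, rng)
    hidden_2, pooled_2 = fake_text_encoder(batch, DIM_2, rng)

    # Per-token embeddings from both encoders are concatenated on the feature axis.
    prompt_embeds = np.concatenate([hidden_1, hidden_2], axis=-1)

    # Only the pooled output of the second encoder is kept.
    return prompt_embeds, pooled_2


prompt_embeds, pooled_prompt_embeds = encode_prompt_sketch(batch=2)
print(prompt_embeds.shape)         # (2, 77, 2048)
print(pooled_prompt_embeds.shape)  # (2, 1280)
```

In the real pipeline the concatenated embeddings are what the UNet cross-attends to, while the pooled embedding is passed separately as additional conditioning.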

Several people have tried using different prompts, as mentioned in this thread, but I’m not sure what the current best practices are. Feel free to update with your results!

Happy hacking!
