Hello!
You can see how the two text encoders are used here; this is the function that converts the prompt(s) into embeddings for the UNet. In particular:
- The “pooled_output” of the second text encoder is kept here.
- The outputs from the two text encoders are concatenated here.
I would recommend stepping through that function with your debugger using the standard SDXL pipeline. The many options make it look long and complicated, but you'll see that most of the work is just calling the text encoders (each one requires its own tokenizer) and combining their outputs.
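If it helps before you open the debugger, here is a rough, dependency-free sketch of the data flow described above. It is not the pipeline's actual code: `fake_encoder` and `encode_prompt_sketch` are hypothetical stand-ins I made up, with plain lists instead of tensors, so only the shapes are meaningful. The sizes (77 tokens, 768 and 1280 features) are those of the standard SDXL base text encoders, and the optional second prompt mirrors the pipeline's `prompt_2` argument.

```python
# Toy stand-ins: no real tokenizers or CLIP weights, only the shapes matter.
SEQ_LEN = 77   # context length used by both SDXL tokenizers
DIM_1 = 768    # hidden size of the first text encoder
DIM_2 = 1280   # hidden size of the second text encoder


def fake_encoder(prompt, dim):
    """Hypothetical encoder: returns (per-token states, pooled vector) as lists."""
    seed = float(len(prompt))  # dummy value so different prompts give different outputs
    hidden = [[seed] * dim for _ in range(SEQ_LEN)]  # shape (SEQ_LEN, dim)
    pooled = [seed] * dim                            # shape (dim,)
    return hidden, pooled


def encode_prompt_sketch(prompt, prompt_2=None):
    # Without a second prompt, the same text goes to both encoders.
    prompt_2 = prompt_2 or prompt

    hidden_1, _ = fake_encoder(prompt, DIM_1)           # pooled output of encoder 1 is dropped
    hidden_2, pooled_2 = fake_encoder(prompt_2, DIM_2)  # pooled output of encoder 2 is kept

    # Concatenate token by token along the feature axis: 768 + 1280 = 2048.
    prompt_embeds = [tok_1 + tok_2 for tok_1, tok_2 in zip(hidden_1, hidden_2)]
    return prompt_embeds, pooled_2


prompt_embeds, pooled_prompt_embeds = encode_prompt_sketch("a photo of a cat")
print(len(prompt_embeds), len(prompt_embeds[0]), len(pooled_prompt_embeds))
# prints: 77 2048 1280
```

In the real function the encoders return tensors with a batch dimension and the concatenation is along the last axis, but the resulting shapes match what you should see in the debugger.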
Several people have tried using a different prompt for each text encoder, as mentioned in this thread, but I'm not sure what the current best practices are. Feel free to update the thread with your results!
Happy hacking!