SDXL custom pipeline - Input to unet? - Why 2 text encoders?

According to the paper ([2307.01952] SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis), instead of using one big text encoder, they use two (CLIP ViT-L and OpenCLIP ViT-bigG) and concatenate the embedding vectors.

> Specifically, we use OpenCLIP ViT-bigG [19] in combination with CLIP ViT-L [34], where we concatenate the penultimate text encoder outputs along the channel-axis
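To make the concatenation concrete, here is a minimal sketch with dummy tensors (assuming the standard hidden sizes: 768 for CLIP ViT-L and 1280 for OpenCLIP ViT-bigG, so 768 + 1280 = 2048 is the context dimension the UNet cross-attends to):

```python
import torch

# Dummy penultimate hidden states for a 77-token prompt (batch of 1):
# CLIP ViT-L/14 produces 768-dim token embeddings,
# OpenCLIP ViT-bigG/14 produces 1280-dim token embeddings.
clip_l_hidden = torch.randn(1, 77, 768)
clip_bigg_hidden = torch.randn(1, 77, 1280)

# SDXL concatenates the two along the channel (feature) axis,
# giving the 2048-dim conditioning passed to the UNet's cross-attention.
context = torch.cat([clip_l_hidden, clip_bigg_hidden], dim=-1)
print(context.shape)  # torch.Size([1, 77, 2048])
```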

It’s not clear whether training used a different prompt for each encoder, but the HF docs say that although the default is to pass the same prompt to both encoders, you can supply two different prompts.
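For reference, this is roughly how the two-prompt option looks with diffusers' StableDiffusionXLPipeline (an untested sketch based on the docs; as I understand it, `prompt` goes to CLIP ViT-L and `prompt_2` to OpenCLIP ViT-bigG):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Default: the same prompt is fed to both encoders when prompt_2 is omitted.
image = pipe(prompt="a photo of an astronaut riding a horse").images[0]

# Two different prompts, one per encoder.
image = pipe(
    prompt="a photo of an astronaut riding a horse",   # CLIP ViT-L
    prompt_2="oil painting, impressionist style",      # OpenCLIP ViT-bigG
).images[0]
```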

I suppose a natural intuition would be to use one text encoder for the semantic content and the other for the style information, but I have no idea whether one encoder is more specialized in style than the other (or vice versa), or even whether using two different prompts is a good idea at all.

Has anyone run tests on what to expect from these dual-encoder prompt variations?
