Yep. So from what I understood, it works like this.
SDXL uses two tokenizers and two text encoders, paired up like [tokenizer, tokenizer_2] and [text_encoder, text_encoder_2], and their outputs get concatenated together.
- text_encoder output: (batch size, sequence length, embedding dim) = (1, 77, 768)
- text_encoder_2 output: (batch size, sequence length, embedding dim) = (1, 77, 1280)
It's a bit complicated, but you're supposed to tokenize the same prompt using both tokenizers; you can find the code in the Stable Diffusion XL pipeline source code around line 373. You also do the same for the negative prompt and concatenate those two together as well.
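Here's a minimal sketch of that dual encoding step, assuming the tokenizer/text_encoder attributes of diffusers' StableDiffusionXLPipeline (it mirrors the logic in encode_prompt rather than copying it verbatim, and the prompt string is just a made-up example):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0"
)

prompt = "a photo of an astronaut riding a horse"  # example prompt

prompt_embeds_list = []
for tokenizer, text_encoder in [(pipe.tokenizer, pipe.text_encoder),
                                (pipe.tokenizer_2, pipe.text_encoder_2)]:
    # tokenize the same prompt with each tokenizer (both pad to 77 tokens)
    text_inputs = tokenizer(
        prompt,
        padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        output = text_encoder(text_inputs.input_ids.to(pipe.device),
                              output_hidden_states=True)
    # the pipeline takes the penultimate hidden state as the prompt embedding
    prompt_embeds_list.append(output.hidden_states[-2])

# (1, 77, 768) and (1, 77, 1280) concatenated on the last dim -> (1, 77, 2048)
prompt_embeds = torch.concat(prompt_embeds_list, dim=-1)
```

The two hidden states only differ in their last dimension, which is why concatenating on dim=-1 gives the (1, 77, 2048) tensor the UNet expects.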
SDXL also takes in text_embeds and time_ids, and these were a pain to understand. Essentially, the text_embeds are the pooled embeddings extracted from the encoder output when you encode the prompt.
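For example (reusing pipe and prompt from the sketch above, and assuming text_encoder_2 is a CLIPTextModelWithProjection, whose output exposes the projected pooled vector as .text_embeds):

```python
# pooled text_embeds come from text_encoder_2 only -> shape (1, 1280)
text_inputs_2 = pipe.tokenizer_2(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer_2.model_max_length,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    pooled_prompt_embeds = pipe.text_encoder_2(
        text_inputs_2.input_ids.to(pipe.device)
    ).text_embeds
```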
For the time_ids, they're built from the image settings (see the sketch after this list):
- original_size=(1024,1024),
- target_size= (1024,1024),
- crops_coords_top_left=(0,0),
- text_encoder_projection_dim=self.text_encoder_2.config.projection_dim
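Here's a sketch of how those turn into the actual tensor, following the same recipe as the pipeline's _get_add_time_ids (the order matters: original_size, then crop coords, then target_size):

```python
import torch

original_size = (1024, 1024)
target_size = (1024, 1024)
crops_coords_top_left = (0, 0)

# flatten original_size + crops_coords_top_left + target_size into one row
add_time_ids = list(original_size + crops_coords_top_left + target_size)
add_time_ids = torch.tensor([add_time_ids], dtype=torch.float32)
# -> tensor([[1024., 1024., 0., 0., 1024., 1024.]]), shape (1, 6)
```

As far as I can tell, text_encoder_projection_dim is only used there to sanity-check that the UNet's addition embedding dimension matches; it doesn't change the tensor values.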
The encoder_hidden_states are just the prompt_embeds you get from encoding the prompts, positive and negative concatenated together… well, I hope that steers you in the right direction.
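To put it all together, here's a rough sketch of a single denoising step with classifier-free guidance; negative_prompt_embeds, negative_pooled_embeds, latents, and t are assumed to come from the usual encode/denoise loop (negative first, matching the pipeline's convention):

```python
# concatenate negative and positive along the batch dim for CFG
prompt_embeds_cfg = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)       # (2, 77, 2048)
add_text_embeds = torch.cat([negative_pooled_embeds, pooled_prompt_embeds], dim=0)  # (2, 1280)
add_time_ids_cfg = torch.cat([add_time_ids, add_time_ids], dim=0)                   # (2, 6)

noise_pred = pipe.unet(
    torch.cat([latents, latents], dim=0),  # duplicate latents to match the batch
    t,
    encoder_hidden_states=prompt_embeds_cfg,
    added_cond_kwargs={"text_embeds": add_text_embeds, "time_ids": add_time_ids_cfg},
).sample
```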
The most important variable is the time_ids, so it should be something like tensor([[1024., 1024., 0., 0., 1024., 1024.]]), with shape [1, 6].
If you don't get the time_ids correct (I mean the exact numbers from the example above, because I got bad results when they didn't match), then even if everything else is correct, the quality will be horrible. By horrible I mean look at the following images. This is only for 1024x1024 images; I didn't try other dimensions. The image on the left gets the time_ids wrong and the one on the right gets it correct:
There is a clear difference in image detail.
I wish we had more information available explaining this kind of stuff. Anyway, all the best @ingo-m @potsu-potsu