SDXL custom pipeline - Input to unet? - Why 2 text encoders?

With stable-diffusion-v1-4 it was possible to use the components of the pipeline independently, as explained in this very helpful tutorial: Stable Diffusion with 🧨 Diffusers

In other words, one could write a custom pipeline by using the tokenizer, text encoder, unet, and vae one after another.

I’m struggling to figure out how to write a custom pipeline for stable-diffusion-xl-base-1.0. One can check the pipeline components like this:

from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")

for item in pipe.components:
	print(item)

# Returns:
# vae
# text_encoder
# text_encoder_2
# tokenizer
# tokenizer_2
# unet
# scheduler

Why are there two tokenizers and two text encoders? And how can the output of the two text encoders be passed into the unet?

The unet forward pass should look like this (right?):

unet_output = unet.forward(
    sample=sample,
    timestep=timestep,
    encoder_hidden_states=encoder_hidden_states,
    added_cond_kwargs={
        "text_embeds": text_embeds,
        "time_ids": time_ids,
    },
)

But how does the output of the text encoder(s) relate to the inputs to the unet, i.e. sample, encoder_hidden_states, text_embeds, and time_ids?

Maybe I’m missing something obvious :thinking:

+1 I am also trying to figure out what each component of SDXL means and how to utilize them. Have you found a good tutorial or documentation for it?

@asrielh no, unfortunately I haven’t made any progress on this. Conceptually, my understanding for stable-diffusion-v1-4 is that the components are connected like this: tokenizer → text_encoder → unet → vae. I can’t make sense of the two text encoders & tokenizers in SDXL.

According to their paper ([2307.01952] SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis), it’s just that instead of using 1 big encoder, they use 2 and concatenate the embedding vector. (CLIP ViT-L & OpenCLIP ViT-bigG)

Specifically: “we use OpenCLIP ViT-bigG [19] in combination with CLIP ViT-L [34], where we concatenate the penultimate text encoder outputs along the channel-axis”.
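To make the “concatenate along the channel-axis” part concrete, here is a minimal sketch with dummy tensors (assuming the usual 77-token sequence length and hidden sizes of 768 for CLIP ViT-L and 1280 for OpenCLIP ViT-bigG):

import torch

# Dummy penultimate hidden states with SDXL's text encoder sizes.
hidden_vit_l = torch.randn(1, 77, 768)      # text_encoder (CLIP ViT-L)
hidden_vit_bigg = torch.randn(1, 77, 1280)  # text_encoder_2 (OpenCLIP ViT-bigG)

# Concatenate along the channel axis; the result is what the UNet
# receives as encoder_hidden_states.
prompt_embeds = torch.cat([hidden_vit_l, hidden_vit_bigg], dim=-1)
print(prompt_embeds.shape)  # torch.Size([1, 77, 2048])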

It’s not clear if the training was done using different prompts for each encoder, but the HF docs say that even though the default usage is to pass the same prompt to both encoders, you can use two different prompts.

I suppose a natural intuition would be to use one text encoder for the semantic information and the other encoder for the style information, but I have no idea whether one encoder is more specialized in style than the other (or vice versa), or even whether it’s a good idea to use two different prompts.

Has anyone run tests on what to expect from these double-encoder prompt variations?

@ClementP-XXII thanks for this information, I will try to use a custom pipeline by concatenating the encoder outputs.

@ingo-m Hi, were you able to create a custom pipeline?

@potsu-potsu No unfortunately not yet :confused:

Hello!

The use of the two text encoders can be observed here; this is the function that converts the prompt(s) into embeddings for the UNet. In particular:

  • The “pooled_output” of the second text encoder is kept here.
  • The outputs from the two text encoders are concatenated here.

I would recommend you step through that function with your debugger using the standard SDXL pipeline. There are many options that make it appear long and complicated, but you’ll see that most of the work is just calling the text encoders (each one requires its own tokenizer) and combining the outputs.
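For what it’s worth, here is a rough sketch of the core of that flow for a single positive prompt (no negative prompt / classifier-free guidance and no batching; the variable names are mine, not the pipeline’s):

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
prompt = "an astronaut riding a horse"

# First encoder (CLIP ViT-L): keep the penultimate hidden state.
tokens_1 = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
out_1 = pipe.text_encoder(tokens_1.input_ids, output_hidden_states=True)
embeds_1 = out_1.hidden_states[-2]  # (1, 77, 768)

# Second encoder (OpenCLIP ViT-bigG): penultimate hidden state, plus the
# pooled/projected output, which later becomes "text_embeds".
tokens_2 = pipe.tokenizer_2(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer_2.model_max_length,
    truncation=True,
    return_tensors="pt",
)
out_2 = pipe.text_encoder_2(tokens_2.input_ids, output_hidden_states=True)
embeds_2 = out_2.hidden_states[-2]        # (1, 77, 1280)
pooled_prompt_embeds = out_2.text_embeds  # (1, 1280)

# Concatenate along the channel axis -> encoder_hidden_states for the UNet.
prompt_embeds = torch.cat([embeds_1, embeds_2], dim=-1)  # (1, 77, 2048)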

Several people have tried using different prompts, as mentioned in this thread, but I’m not sure what the current best practices are; feel free to update the thread with your results!

Happy hacking! :slight_smile:

Did you figure it out with the two text encoders?

Yep. From what I understood, it works like this.

SDXL uses two tokenizers and two text encoders, and their outputs are concatenated together: [tokenizer, tokenizer_2] and [text_encoder, text_encoder_2].

text_encoder output = (batch size, sequence length, embedding dim) = (1, 77, 768)
text_encoder_2 output = (batch size, sequence length, embedding dim) = (1, 77, 1280)

It’s a bit complicated, but you’re supposed to tokenize the same prompt using both tokenizers; you can find the code in the Stable Diffusion XL pipeline source code around line 373. You also do the same for the negative prompt and concatenate those two together as well.
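As a rough sketch of that last part (dummy tensors with the concatenated shapes from above; in the real pipeline the negative embeddings are built exactly like the positive ones, just from the negative prompt), the negative and positive embeddings are stacked along the batch dimension for classifier-free guidance:

import torch

# Dummy embeddings standing in for the real encoder outputs.
prompt_embeds = torch.randn(1, 77, 2048)           # positive prompt
negative_prompt_embeds = torch.randn(1, 77, 2048)  # negative prompt
pooled_prompt_embeds = torch.randn(1, 1280)
negative_pooled_prompt_embeds = torch.randn(1, 1280)

# For classifier-free guidance, the negative embeddings come first and the
# positive ones second, so the UNet sees a batch of 2 per image.
prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)  # (2, 77, 2048)
add_text_embeds = torch.cat(
    [negative_pooled_prompt_embeds, pooled_prompt_embeds], dim=0
)  # (2, 1280)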

SDXL also takes in text_embeds and time_ids. These were a pain to understand; essentially, the text_embeds are the pooled prompt embeddings, which come from the pooled output of the second text encoder.

As for the time_ids, they are built from the image conditioning parameters:

  • original_size=(1024,1024),
  • target_size= (1024,1024),
  • crops_coords_top_left=(0,0),
  • text_encoder_projection_dim=self.text_encoder_2.config.projection_dim

The encoder_hidden_states are just the prompt_embeds, which you get from encoding the prompt (positive and negative concatenated together)… well, I hope that steers you in the right direction.

The most important variable is the time_ids; it should be something like tensor([[1024., 1024., 0., 0., 1024., 1024.]]), with shape [1, 6].
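To tie it together, here is a small sketch (assuming a 1024x1024 generation with no cropping, as in the values above) of how that tensor can be built and then fed to the unet call from the first post (shown commented out, since latents, t and the embeddings come from the earlier steps):

import torch

original_size = (1024, 1024)
crops_coords_top_left = (0, 0)
target_size = (1024, 1024)

# original size + crop coordinates + target size, as a single row.
add_time_ids = torch.tensor(
    [list(original_size) + list(crops_coords_top_left) + list(target_size)],
    dtype=torch.float32,
)
print(add_time_ids)        # tensor([[1024., 1024., 0., 0., 1024., 1024.]])
print(add_time_ids.shape)  # torch.Size([1, 6])

# The UNet call then looks like the snippet at the top of the thread
# (double everything along the batch dimension if you use classifier-free
# guidance):
# noise_pred = unet(
#     sample=latents,                           # (1, 4, 128, 128) for 1024x1024
#     timestep=t,
#     encoder_hidden_states=prompt_embeds,      # (1, 77, 2048)
#     added_cond_kwargs={
#         "text_embeds": pooled_prompt_embeds,  # (1, 1280)
#         "time_ids": add_time_ids,             # (1, 6)
#     },
# ).sample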

If you don’t get the time_ids correct (like the example I showed; I got bad results when the numbers were different), then even if everything else is correct, the quality will be horrible. By horrible, I mean look at the following images. This is only for 1024x1024 images; I didn’t try other dimensions. The image on the left has the time_ids wrong and the one on the right has them correct:

There is a clear difference in image detail.

I wish we had more information available explaining this kind of stuff. All the best @ingo-m @potsu-potsu
