With stable-diffusion-v1-4 it was possible to use the components of the pipeline independently, as explained in this very helpful tutorial: Stable Diffusion with 🧨 Diffusers
In other words, one could write a custom pipeline by using the tokenizer, text encoder, unet, and vae one after another.
I’m struggling to figure out how to write a custom pipeline for stable-diffusion-xl-base-1.0. One can check the pipeline components like this:
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
for item in pipe.components:
    print(item)
# Returns:
# vae
# text_encoder
# text_encoder_2
# tokenizer
# tokenizer_2
# unet
# scheduler
Why are there two tokenizers and two text encoders? And how can the output of the two text encoders be passed into the unet?
The unet forward pass should look like this (right?):
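Here is a minimal sketch of what I assume it to be, with dummy tensors whose shapes I took from the SDXL base-1.0 configs; the `added_cond_kwargs` keys are what diffusers' `UNet2DConditionModel` appears to expect, so please correct me if I have them wrong:

```python
import torch

# dummy inputs, shapes per the SDXL base-1.0 configs (assumption, not verified end to end)
latents = torch.randn(1, 4, 128, 128)       # 1024x1024 image -> 128x128 latents
timestep = torch.tensor(999)
prompt_embeds = torch.randn(1, 77, 2048)    # both encoders' hidden states concatenated (768 + 1280)
pooled_embeds = torch.randn(1, 1280)        # pooled output of text_encoder_2
time_ids = torch.tensor([[1024., 1024., 0., 0., 1024., 1024.]])  # (orig_h, orig_w, crop_top, crop_left, target_h, target_w)

noise_pred = pipe.unet(
    latents,
    timestep,
    encoder_hidden_states=prompt_embeds,
    added_cond_kwargs={"text_embeds": pooled_embeds, "time_ids": time_ids},
).sample
```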
@asrielh no, unfortunately I haven’t made any progress on this. Conceptually, my understanding for stable-diffusion-v1-4 is that the components are chained like this: tokenizer → text_encoder → unet → vae. I can’t make sense of the two text encoders & tokenizers in SDXL.
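For reference, this is roughly the v1-4 chain I have in mind (a rough sketch that skips the denoising loop and classifier-free guidance, just to show how the pieces plug together):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# tokenizer -> text_encoder
tokens = pipe.tokenizer("an astronaut riding a horse", padding="max_length",
                        max_length=pipe.tokenizer.model_max_length, return_tensors="pt")
text_embeds = pipe.text_encoder(tokens.input_ids)[0]   # (1, 77, 768)

# text_encoder -> unet (one denoising step shown; the real loop runs ~50 steps)
latents = torch.randn(1, 4, 64, 64)                    # 512x512 image -> 64x64 latents
noise_pred = pipe.unet(latents, 999, encoder_hidden_states=text_embeds).sample

# unet -> vae (after the loop, decode the final latents)
image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
```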
Specifically,
“we use OpenCLIP ViT-bigG [19] in combination with CLIP ViT-L [34], where we concatenate the penultimate text encoder outputs along the channel-axis”
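If I read diffusers' `encode_prompt` correctly, that concatenation looks roughly like this, reusing the `pipe` from the first post (a sketch of what I understand the pipeline to do internally, not a verified copy of it):

```python
import torch

prompt = "an astronaut riding a horse"

tok_1 = pipe.tokenizer(prompt, padding="max_length",
                       max_length=pipe.tokenizer.model_max_length, return_tensors="pt")
tok_2 = pipe.tokenizer_2(prompt, padding="max_length",
                         max_length=pipe.tokenizer_2.model_max_length, return_tensors="pt")

out_1 = pipe.text_encoder(tok_1.input_ids, output_hidden_states=True)     # CLIP ViT-L
out_2 = pipe.text_encoder_2(tok_2.input_ids, output_hidden_states=True)   # OpenCLIP ViT-bigG

embeds_1 = out_1.hidden_states[-2]      # penultimate layer, (1, 77, 768)
embeds_2 = out_2.hidden_states[-2]      # penultimate layer, (1, 77, 1280)
pooled = out_2[0]                       # pooled/projected output of the big encoder, (1, 1280)

# "concatenate the penultimate text encoder outputs along the channel-axis"
prompt_embeds = torch.cat([embeds_1, embeds_2], dim=-1)   # (1, 77, 2048), fed to the unet
```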
It’s not clear whether training used a different prompt for each encoder, but the HF docs say that although the default is to pass the same prompt to both encoders, you can pass two different prompts.
I suppose a natural intuition would be to use one text encoder for the semantic content and the other for the style information, but I have no idea whether one encoder is more specialized in style than the other, or vice versa, or even whether using two different prompts is a good idea at all.
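At the API level the experiment seems easy to run: `StableDiffusionXLPipeline` exposes `prompt` and `prompt_2`, and as far as I can tell `prompt` feeds the CLIP ViT-L branch while `prompt_2` feeds the OpenCLIP ViT-bigG branch. The content/style split below is just my guess at how one might probe it:

```python
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")

image = pipe(
    prompt="an astronaut riding a horse on the moon",              # "semantic" prompt -> ViT-L
    prompt_2="oil painting, thick brush strokes, muted colors",    # "style" prompt -> ViT-bigG
).images[0]
```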
Has anyone run tests on what to expect from these dual-encoder prompt variations?