With stable-diffusion-v1-4
it was possible to use the components of the pipeline independently, as explained in this very helpful tutorial: Stable Diffusion with 🧨 Diffusers
In other words, one could write a custom pipeline by using the tokenizer, text encoder, unet, and vae one after another.
I’m struggling to figure out how to write a custom pipeline for stable-diffusion-xl-base-1.0. One can check the pipeline components like this:
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
for item in pipe.components:
    print(item)
# Returns:
# vae
# text_encoder
# text_encoder_2
# tokenizer
# tokenizer_2
# unet
# scheduler
Why are there two tokenizers and two text encoders? And how can the output of the two text encoders be passed into the unet?
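For context, my working assumption (possibly wrong) is that each prompt is run through its own tokenizer/encoder pair, and the per-token hidden states of the two encoders are concatenated along the feature axis before being handed to the unet. With dummy tensors standing in for the encoder outputs (the hidden sizes 768 and 1280, and the 77-token context length, are my guesses, not verified against the model configs):

```python
import torch

batch, max_len = 1, 77  # CLIP-style 77-token context -- my assumption

# Stand-ins for the penultimate hidden states of text_encoder (768-dim)
# and text_encoder_2 (1280-dim); shapes are assumptions, not verified
hidden_states_1 = torch.randn(batch, max_len, 768)
hidden_states_2 = torch.randn(batch, max_len, 1280)

# Concatenate along the feature axis to form the unet's
# cross-attention input
encoder_hidden_states = torch.cat([hidden_states_1, hidden_states_2], dim=-1)
print(encoder_hidden_states.shape)  # torch.Size([1, 77, 2048])
```

If that’s roughly right, it would explain why the unet’s cross-attention dimension doesn’t match either encoder on its own.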
The unet forward pass should look like this (right?):
unet_output = unet.forward(
    sample=sample,
    timestep=timestep,
    encoder_hidden_states=encoder_hidden_states,
    added_cond_kwargs={
        "text_embeds": text_embeds,
        "time_ids": time_ids,
    },
)
But how does the output of the text encoder(s) relate to the inputs of the unet, i.e. sample, encoder_hidden_states, text_embeds, and time_ids?
Maybe I’m missing something obvious.