CLIP Embedding Order For Stable Diffusion

brschroe · January 21, 2023, 11:27pm

I am trying to manipulate the CLIP embeddings that are generated from a prompt. Does the order of the embeddings fed to Stable Diffusion matter, or is a positional encoding already built into the embeddings?

nielsr · January 23, 2023, 11:21am

Hi,

Yes, positional encodings are included in the embedding. Basically, each text token first gets embedded, after which a positional embedding is added. The results are then fed to the Transformer encoder.
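To make the "token embedding plus positional embedding" step concrete, here is a minimal toy sketch in plain Python. It is not CLIP's actual code: the vocabulary, the table sizes, and the names `token_table`, `position_table`, and `embed` are all made up for illustration, and the tables are random rather than learned. It only shows that the same word ends up with a different input vector depending on its position.

```python
import random

random.seed(0)

# Hypothetical toy vocabulary and sizes (assumptions for illustration only).
VOCAB = {"brown": 0, "dog": 1, "on": 2, "green": 3, "grass": 4}
DIM = 4       # toy embedding width; the real model is much wider
MAX_LEN = 8   # toy context length; Stable Diffusion's CLIP uses 77


def random_table(rows, cols):
    """Stand-in for a learned embedding table."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


token_table = random_table(len(VOCAB), DIM)
position_table = random_table(MAX_LEN, DIM)


def embed(words):
    """Return one vector per token: token embedding + positional embedding."""
    out = []
    for pos, word in enumerate(words):
        tok = token_table[VOCAB[word]]
        p = position_table[pos]
        out.append([t + q for t, q in zip(tok, p)])
    return out


a = embed(["brown", "dog"])   # "dog" at position 1
b = embed(["dog", "brown"])   # "dog" at position 0

# Same word, different position, different input vector.
print(a[1] != b[0])
```

These summed vectors are what the Transformer encoder sees, so position information is mixed in before any attention layer runs.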

brschroe · January 23, 2023, 6:12pm

Thank you. When I say ‘embeddings’ I am referring to the CLIP embeddings that are produced when the prompt is run through the CLIP model, as in the example below. I believe text_features are the embeddings, generated something like this:

text = clip.tokenize(["brown dog on green grass"]).to(device)
text_features = model.encode_text(text)

What I wanted to know is whether the post-Transformer CLIP embeddings need to be fed to Stable Diffusion in the order they were produced (assuming this is the process for passing them to Stable Diffusion to generate an image). Assume text_features gives me one CLIP embedding per word, so I have a set of CLIP embeddings (e.g. set = {CLIP_embed(“brown”), CLIP_embed(“dog”), …}). If I change the order of this set (e.g. new_set = {CLIP_embed(“dog”), CLIP_embed(“grass”), …}) and feed this ‘new_set’ to Stable Diffusion, will I get the same type of image (e.g. an image describing the original prompt [“brown dog on green grass”]) as with the original ‘set’?

The TL;DR question is: does the order of the set of CLIP embeddings corresponding to the input prompt, as passed to Stable Diffusion, matter?

vmedea · April 10, 2023, 8:46am

The 77 vectors are passed to the U-Net without any specific ordering. You can shuffle them without any effect (the image will be the same). It isn’t even strictly necessary to pass 77; the cross-attention mechanism can cope with any number. That said, things can get weird if you leave some of them out.
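Here is a rough way to see why shuffling has no effect. It is a single-head cross-attention toy in plain Python, not the actual U-Net code. It assumes the text vectors are used only as keys and values and that the U-Net adds no positional information to them; all names and sizes (`cross_attention`, `w_k`, `w_v`, `D_TEXT`, and so on) are made up for illustration, and the weights are random.

```python
import math
import random

random.seed(0)


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context, w_k, w_v):
    """Queries come from the image side; keys and values come from the text."""
    keys = matmul(context, w_k)
    values = matmul(context, w_v)
    scale = 1.0 / math.sqrt(len(keys[0]))
    scores = [
        [scale * sum(q * k for q, k in zip(q_row, k_row)) for k_row in keys]
        for q_row in queries
    ]
    weights = [softmax(r) for r in scores]   # softmax over the text tokens
    return matmul(weights, values)           # weighted sum of value vectors


# Toy sizes (assumptions): 5 text vectors instead of 77, 3 image positions.
D_TEXT, D_ATTN, N_TOKENS, N_PIXELS = 6, 4, 5, 3
w_k = rand_matrix(D_TEXT, D_ATTN)
w_v = rand_matrix(D_TEXT, D_ATTN)
queries = rand_matrix(N_PIXELS, D_ATTN)
context = rand_matrix(N_TOKENS, D_TEXT)

permuted = context[::-1]  # same text vectors, reversed order

out_original = cross_attention(queries, context, w_k, w_v)
out_permuted = cross_attention(queries, permuted, w_k, w_v)

max_diff = max(
    abs(x - y)
    for r1, r2 in zip(out_original, out_permuted)
    for x, y in zip(r1, r2)
)
print(max_diff)  # differs only by floating-point rounding, if at all
```

Each output row is a softmax-weighted sum over the text vectors, and reordering the terms of a sum doesn't change it. The same function also accepts a `context` with more or fewer rows, which is why the count of 77 isn't strictly required.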

(So to answer your original question: yes, there must be an implicit positional embedding from CLIP. At the least, there’s cross-pollination between tokens that merges information from earlier tokens, like adjectives, into nouns, and that somehow communicates the structure.)

FWIW, DAAM is kind of useful for studying which vectors of the input affect which part of the image. It stores the attention scores of the cross-attention layers and accumulates them over the generation process to generate heatmaps.
