CLIP Embedding Order For Stable Diffusion

I am trying to manipulate the CLIP embeddings that are generated from a prompt. However, I wanted to know whether the order of the embeddings fed to Stable Diffusion matters. Or is a positional encoding already built into the embedding?


Yes, positional encodings are included in the embedding. Basically, each text token first gets embedded, after which a positional embedding is added. The results are then fed to the Transformer encoder.
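To make the "token embedding plus positional embedding" step concrete, here is a toy sketch in plain Python. The vocabulary, the sizes (DIM, MAX_LEN) and the random tables are all made up for illustration; the real CLIP text encoder uses learned tables and much larger dimensions.

```python
import random

random.seed(0)
DIM = 4      # toy embedding width (assumption, not CLIP's real size)
MAX_LEN = 8  # toy context length (assumption)

# Hypothetical mini-vocabulary for the example prompt.
vocab = {"brown": 0, "dog": 1, "on": 2, "green": 3, "grass": 4}

def random_table(rows, cols):
    """Stand-in for a learned embedding table."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

token_table = random_table(len(vocab), DIM)
position_table = random_table(MAX_LEN, DIM)

def embed(words):
    """Look up each token vector and add the vector for its position."""
    out = []
    for pos, word in enumerate(words):
        tok = token_table[vocab[word]]
        p = position_table[pos]
        out.append([t + q for t, q in zip(tok, p)])
    return out

a = embed(["brown", "dog"])
b = embed(["dog", "brown"])

# The same word gets a different input vector depending on where it sits.
print(a[1] != b[0])
```

So the Transformer encoder sees position from the very first layer on, which is how word order ends up baked into its outputs.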


Thank you. When I say ‘embeddings’ I am referring to the CLIP embeddings that are produced as a result of the prompt being run through the CLIP model, such as below. I believe text_features are the embeddings, generated something like this:

text = clip.tokenize(["brown dog on green grass"]).to(device)
text_features = model.encode_text(text)

What I wanted to know is whether the post-Transformer CLIP embeddings need to be fed to Stable Diffusion in the order they were produced (assuming this is the process for passing them to Stable Diffusion to generate an image). Assume I have one CLIP embedding per word produced by text_features (e.g. set = {CLIP_embed(“brown”), CLIP_embed(“dog”), …}), so a set of CLIP embeddings. If I change the order of this set (e.g. new_set = {CLIP_embed(“dog”), CLIP_embed(“grass”), …}) and feed this ‘new_set’ to Stable Diffusion, will I get the same type of image (e.g. an image describing the original prompt [“brown dog on green grass”]) as with the original ‘set’?

The TL;DR question is: does the order of the set of CLIP embeddings corresponding to the input prompt, as passed to Stable Diffusion, matter?

The 77 vectors are passed to the U-Net without any specific ordering: the cross-attention layers add no positional information to them. You can shuffle them without any effect (the image will be the same). It isn’t even strictly necessary to pass 77; the cross-attention mechanism can cope with any number. That said, things can get weird if you leave some of them out.
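A quick way to convince yourself of the shuffle claim is a toy single-head cross-attention in plain Python. This is only a sketch: the sizes and random weights are made up, and it is not the actual Stable Diffusion code. The point is just that the keys and values come from the text vectors with nothing position-dependent added, so reordering them only reorders the terms of a sum.

```python
import math
import random

random.seed(0)
D = 4  # toy feature width (assumption)

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def project(vectors, weight):
    """Multiply each row vector by a (d_in x d_out) weight matrix."""
    return [[sum(v[i] * weight[i][j] for i in range(len(v)))
             for j in range(len(weight[0]))] for v in vectors]

def cross_attention(queries, context, w_q, w_k, w_v):
    """Queries come from image features; keys and values from text vectors."""
    q = project(queries, w_q)
    k = project(context, w_k)
    v = project(context, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(x * y for x, y in zip(qi, kj)) for kj in k]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

w_q, w_k, w_v = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)
image_queries = rand_matrix(3, D)  # stand-in for U-Net feature vectors
text_context = rand_matrix(5, D)   # stand-in for the text encoder outputs

shuffled = text_context[:]
random.shuffle(shuffled)

original = cross_attention(image_queries, text_context, w_q, w_k, w_v)
permuted = cross_attention(image_queries, shuffled, w_q, w_k, w_v)

# Largest element-wise difference between the two outputs.
max_diff = max(abs(x - y) for ra, rb in zip(original, permuted)
               for x, y in zip(ra, rb))
print(max_diff < 1e-9)
```

The same function also accepts a shorter context (for example text_context[:2]) without any change, which is the "any number of vectors" point. The output then differs, of course, because information has been dropped.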

(So, to answer your original question: yes, there must be an implicit positional embedding from CLIP. At the least, there’s cross-pollination between tokens that merges information from earlier tokens, like adjectives, into nouns, and that somehow communicates the structure.)

FWIW DAAM is kind of useful for studying which vectors of the input affect which part of the image. It stores the attention scores of the cross-attention layers and accumulates them over the generation process to produce heatmaps.