CLIP Embedding Order For Stable Diffusion

I am trying to manipulate the CLIP embeddings that are generated from a prompt. Does the order of the embeddings fed to Stable Diffusion matter, or is a positional encoding already built into the embeddings?


Yes, a positional encoding is included in the embedding. Each text token is first embedded, and a positional embedding is then added to it. The resulting vectors are fed to the Transformer encoder.
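
To make that concrete, here is a toy sketch in plain Python. It is not CLIP's actual code: the vocabulary, the table sizes, the random values, and the `embed` helper are all made up for illustration. It only shows the "token embedding plus positional embedding" step, and that the same word ends up with a different input vector depending on where it sits in the prompt.

```python
import random

random.seed(0)

# Hypothetical toy vocabulary and sizes (real CLIP uses a BPE vocabulary
# and much wider embeddings).
VOCAB = {"brown": 0, "dog": 1, "on": 2, "green": 3, "grass": 4}
DIM = 4       # toy embedding width
MAX_LEN = 8   # toy context length


def random_table(rows, cols):
    """Stand-in for a learned embedding table."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


token_table = random_table(len(VOCAB), DIM)   # one row per vocabulary entry
position_table = random_table(MAX_LEN, DIM)   # one row per sequence position


def embed(words):
    """Return token embedding + positional embedding for each word."""
    out = []
    for pos, word in enumerate(words):
        tok = token_table[VOCAB[word]]
        out.append([t + p for t, p in zip(tok, position_table[pos])])
    return out


a = embed(["brown", "dog"])
b = embed(["dog", "brown"])

# "dog" at position 1 versus "dog" at position 0: same token row,
# different position row, so the encoder input differs.
print(a[1] != b[0])
```

So what the Transformer encoder sees already carries position information, which is why reordering the words of the prompt changes its input.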

Thank you. When I say ‘embeddings’, I am referring to the CLIP embeddings produced when the prompt is run through the CLIP model. I believe text_features are the embeddings, generated something like this:

text = clip.tokenize(["brown dog on green grass"]).to(device)
text_features = model.encode_text(text)

What I wanted to know is whether the post-Transformer CLIP embeddings need to be fed to Stable Diffusion in the order they were produced (assuming this is how they are passed to Stable Diffusion to generate an image). Assume I have one CLIP embedding per word produced by text_features, so a set of CLIP embeddings (e.g. set = {CLIP_embed(“brown”), CLIP_embed(“dog”), …}). If I change the order of this set (e.g. new_set = {CLIP_embed(“dog”), CLIP_embed(“grass”), …}) and feed this ‘new_set’ to Stable Diffusion, will I get the same type of image as with the original ‘set’ (i.e. an image matching the original prompt “brown dog on green grass”)?

The TL;DR question is: does the order of the CLIP embeddings corresponding to the input prompt, as passed to Stable Diffusion, matter?