Cross Attention Probabilities in SD 1.4 of start of sentence token

Maximal · May 25, 2023, 5:51pm

Hi,

I was reimplementing the prompt-to-prompt paper with SD 1.4, which led me to take a deep dive into the cross-attention (XA) layers in the U-Net, and I found something REALLY weird. The CLIP tokenizer adds a “start of sentence” token before the tokens from the actual text prompt. Then it pads the token sequence with “end of sentence” tokens until it reaches the maximum length of 77 used in Stable Diffusion 1.4.
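To make the layout described above concrete, here is a minimal, dependency-free sketch. It does not call the real tokenizer: `layout_prompt` is a hypothetical helper, the prompt ids `101, 102, 103` are made up, and the values of `BOS_ID` and `EOS_ID` are assumptions that should be checked against the tokenizer you actually load.

```python
BOS_ID = 49406     # assumed id of the start-of-sentence token
EOS_ID = 49407     # assumed id of the end-of-sentence token, also used as padding
MAX_LENGTH = 77    # sequence length used by Stable Diffusion 1.4


def layout_prompt(prompt_ids, max_length=MAX_LENGTH):
    """Arrange ids as [BOS, prompt..., EOS, EOS, ...] with exactly max_length entries."""
    body = list(prompt_ids)[: max_length - 2]   # leave room for BOS and at least one EOS
    ids = [BOS_ID] + body + [EOS_ID]
    ids += [EOS_ID] * (max_length - len(ids))   # fill the remainder with EOS
    return ids


# Hypothetical ids standing in for a three-token prompt.
example_ids = layout_prompt([101, 102, 103])
```

So for a short prompt, position 0 holds the start token, a handful of positions hold real words, and the remaining positions are all end-of-sentence padding.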

Now, intuitively, I would have expected barely any probability mass on the start/end of sentence tokens, with the cross-attention layer attending to the words from the prompt far more than to placeholder tokens. In reality, the vast majority of the softmax output is concentrated on the “start of sentence” token. I checked this at different attention layers inside the U-Net and for different heads in each of them, and it holds for all of them. For a single head and prompt, the attention_probs variable in the attention processor is a matrix of size (spatial_dim x number_of_tokens), where the number of tokens is always 77 and the spatial dimension shrinks several times inside the U-Net. If you average the column for token 0 over all spatial positions, you get values somewhere between 0.9 and 0.99 in most cases. This means that almost all locations mostly attend to the start of sentence token.
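The measurement above can be sketched as follows. This is only an illustration under stated assumptions: the matrix is toy data, not real Stable Diffusion output, and the `+ 8.0` bias on token 0 is artificial, added purely so the toy matrix imitates the observed concentration. `softmax` and `mean_mass_per_token` are hypothetical helper names, written with plain lists so the snippet runs without any dependencies; on a real tensor the same thing is a mean over the spatial axis.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def mean_mass_per_token(attention_probs):
    """Average each token column over all spatial positions (rows)."""
    n_rows = len(attention_probs)
    n_tokens = len(attention_probs[0])
    return [sum(row[t] for row in attention_probs) / n_rows for t in range(n_tokens)]


# Toy stand-in for one head: (spatial_dim x number_of_tokens) logits.
random.seed(0)
spatial_dim, num_tokens = 64, 77
logits = [[random.gauss(0.0, 1.0) for _ in range(num_tokens)] for _ in range(spatial_dim)]
for row in logits:
    row[0] += 8.0  # artificial bias towards token 0, NOT a measured value

attention_probs = [softmax(row) for row in logits]
mass = mean_mass_per_token(attention_probs)
bos_mass = mass[0]  # average attention that lands on the start of sentence token
```

Because every row of attention_probs sums to 1, the per-token averages in `mass` also sum to 1, so `bos_mass` can be read directly as the share of attention taken by token 0.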

I tested this with PyTorch 1.13, without flash attention, in 32-bit precision, and with an unmodified Stable Diffusion pipeline. Am I the only one who finds this super unintuitive? Can somebody shed some light on it?

Cheers,
Max

williamberman · May 29, 2023, 5:22pm

Hey @Maximal, that is quite interesting and counterintuitive. Unfortunately I don’t have time to look into it myself, but please do let me know what else you find there.
