I was reimplementing the prompt-to-prompt paper with SD 1.4, which led me to take a deep dive into the cross-attention (XA) layers in the U-Net, and I found something REALLY weird. The CLIP tokenizer adds a “start of sentence” token before the tokens from the actual text prompt. Then it pads the token sequence with “end of sentence” tokens until it reaches the maximum length of 77 used in Stable Diffusion 1.4.
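To make the padded layout concrete, here is a minimal sketch in plain Python. It is not the real CLIP tokenizer: `pad_like_clip` is a hypothetical helper, the prompt ids are made up, and the special-token ids (49406 and 49407) are what I believe the CLIP vocabulary uses, so treat them as an assumption.

```python
# Illustrative only: mimics the padded layout described above,
# it does not tokenize text and is not the real CLIP tokenizer.
BOS_ID = 49406  # assumed id of the "start of sentence" token
EOS_ID = 49407  # assumed id of the "end of sentence" token, reused as padding
MAX_LENGTH = 77


def pad_like_clip(prompt_ids, max_length=MAX_LENGTH):
    """Wrap prompt token ids in BOS/EOS and pad with EOS up to max_length."""
    # Reserve two slots for BOS and the first EOS; truncate longer prompts.
    body = list(prompt_ids)[: max_length - 2]
    ids = [BOS_ID] + body + [EOS_ID]
    # Fill every remaining position with the EOS token.
    ids += [EOS_ID] * (max_length - len(ids))
    return ids


# Made-up ids standing in for a four-token prompt.
example_ids = pad_like_clip([320, 1125, 539, 320])
print(len(example_ids), example_ids[:6])
```

So for a short prompt, position 0 is the start token, a handful of positions hold the actual words, and the other ~70 positions are all the same end token.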
Intuitively, I would have expected barely any probability mass on the start/end of sentence tokens, with the cross-attention layers attending to the words from the prompt rather than to placeholder tokens. In reality, the vast majority of the softmax output is concentrated on the “start of sentence” token. I checked this at different attention layers inside the U-Net, and at different heads in each of them, and it held for all of them. For a single head and prompt, the attention_probs variable in the attention processor is a matrix of size (spatial_dim x number_of_tokens), where the number of tokens is always 77 and spatial_dim shrinks several times inside the U-Net. If you take the column for token 0 and average it over all spatial positions, you get values somewhere between 0.9 and 0.99 in most cases. In other words, almost every spatial location puts most of its attention on the start of sentence token.
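For reference, the measurement I mean can be sketched like this. It is a toy stand-in in plain Python, not the actual hook into the pipeline: `softmax` and `mean_attention_per_token` are hypothetical helpers, and the logits are invented (8.0 for token 0, 0.0 for everything else) just to imitate the effect.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_attention_per_token(attention_probs):
    """Average a (spatial_dim x number_of_tokens) matrix over the spatial axis."""
    num_rows = len(attention_probs)
    num_tokens = len(attention_probs[0])
    return [
        sum(row[t] for row in attention_probs) / num_rows
        for t in range(num_tokens)
    ]


# Toy stand-in for one head: 4 spatial locations, 77 tokens.
# The logits are made up; only token 0 gets a larger score.
toy_logits = [[8.0] + [0.0] * 76 for _ in range(4)]
toy_probs = [softmax(row) for row in toy_logits]
per_token = mean_attention_per_token(toy_probs)
print(f"mean attention on token 0: {per_token[0]:.3f}")
```

With a 2-D PyTorch tensor of that shape, `attention_probs.mean(dim=0)` should give the same per-token average. The toy numbers also show that token 0 only needs a moderately larger logit than the other 76 tokens to end up in the 0.9 to 0.99 range.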
I tested this with PyTorch 1.13, without flash attention, in 32-bit precision, and with an unmodified Stable Diffusion pipeline. Am I the only one who thinks this is super unintuitive? Can somebody shed some light on this?