Hi everyone. I want to introduce our new research paper, "More Expressive Attention with Negative Weights" (arXiv:2411.07176).
We propose a novel attention mechanism, named Cog Attention, that allows attention weights to be negative for enhanced expressiveness. This expressiveness stems from two key factors:
(1) Cog Attention shifts the token-deletion and token-copying function from a static OV matrix to dynamic QK inner products, letting the OV matrix focus on refinement or modification. A single attention head can simultaneously delete, copy, or retain tokens by assigning them negative, positive, or minimal attention weights, respectively, which makes it more flexible and expressive (see the sketch after this list).
(2) Cog Attention improves the model's robustness against representational collapse, which can occur when earlier tokens are over-squashed into later positions, producing homogeneous representations. Negative weights reduce the number of effective information paths from earlier to later tokens, which helps mitigate this issue.
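To make the delete/copy/retain idea concrete, here is a minimal sketch of a signed-attention head in PyTorch. The exact normalization Cog Attention uses is defined in the paper; the formulation below (keeping each score's sign while normalizing magnitudes with a softmax) is just one illustrative way to get signed weights whose magnitudes sum to 1, and `cog_attention_sketch` is a hypothetical name, not code from our release.

```python
import torch


def cog_attention_sketch(q, k, v, causal=True):
    """Illustrative signed attention (not necessarily the paper's exact form).

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5  # raw QK^T scores, any sign
    mag = scores.abs()
    if causal:
        t = scores.size(-1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=scores.device), 1)
        mag = mag.masked_fill(mask, float("-inf"))  # future keys get weight 0
    # Normalize magnitudes with a softmax, then restore each score's sign:
    # dominant negative scores become negative weights ("delete"), dominant
    # positive scores become positive weights ("copy"), and weak scores get
    # weights near zero ("retain" the token via the residual stream).
    weights = torch.sign(scores) * torch.softmax(mag, dim=-1)
    return weights @ v


q = k = v = torch.randn(1, 2, 6, 16)  # toy shapes: batch=1, heads=2, len=6, dim=16
out = cog_attention_sketch(q, k, v)   # -> (1, 2, 6, 16)
```

The point of the sketch is only that the sign of the QK inner product, rather than the OV matrix, decides whether a token's contribution is added or subtracted.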
We develop Transformer-like models that use Cog Attention as their attention module, including decoder-only models for language modeling and U-ViT diffusion models for image generation. Experiments show that models using Cog Attention outperform those with traditional softmax attention modules.
We want to challenge the common belief that attention weights should naturally be non-negative. Along the way, we addressed several difficulties, including training instability, numerical overflow, and attention normalization breaking down (e.g., division by zero when positive and negative scores cancel out).
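As a toy demonstration of that last pitfall (this shows why naive normalization fails, not the paper's actual remedy; the epsilon-stabilized sum-of-absolute-values normalization is just one common workaround):

```python
import torch

s = torch.tensor([2.0, -2.0, 1.0, -1.0])  # signed scores that happen to sum to 0

naive = s / s.sum()                  # divides by 0: yields inf, unusable
stable = s / (s.abs().sum() + 1e-6)  # finite; sum of |weights| is ~1

print(naive)   # tensor([inf, -inf, inf, -inf])
print(stable)  # tensor([ 0.3333, -0.3333,  0.1667, -0.1667])
```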
Our approach suggests a promising research direction: rethinking and breaking the entrenched constraints of traditional softmax attention, such as the requirement that weights be non-negative.
Below is an attention pattern figure obtained from our pretrained models. Details can be found in the paper, and we hope you find it interesting.