Is attention of different encoder layers comparable?

frap · December 6, 2022, 8:41am

Hello, I’m new to this community. I have studied some of the theory behind the attention mechanism, and now I would like to put it into practice. At the moment, I’m looking at the attention weights extracted from a BertModel. I’m wondering whether the attention value between two tokens is comparable across different encoder layers (and/or heads).

As an example, take the sentence: The cat sat on the mat. Is the attention between ‘cat’ and ‘sat’ at layer 3, head 3 (attn_3_3) comparable to the attention between the same words at layer 10, head 10 (attn_10_10)? Suppose attn_3_3 is greater than attn_10_10: does this mean that BERT pays more attention to the pair <cat, sat> at layer 3, head 3 than at layer 10, head 10? Or does it mean nothing of the sort?
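
For concreteness, here is a rough sketch of how the two values could be pulled out of a BertModel. It only extracts the numbers; it does not settle whether they are comparable. The checkpoint name `bert-base-uncased`, the helper names `pair_attention`, `bert_attentions` and `demo`, and the reading of "layer 3, head 3" as 1-based are all assumptions made for illustration. Each attention row is a softmax over the key tokens, so both values lie between 0 and 1.

```python
def pair_attention(attentions, layer, head, query_idx, key_idx):
    """Attention weight from token `query_idx` to token `key_idx`.

    `attentions` is indexed as attentions[layer][batch][head][query][key],
    the layout of the per-layer tuple that Hugging Face models return when
    called with output_attentions=True. All indices are 0-based, and only
    the first item of the batch is read.
    """
    return float(attentions[layer][0][head][query_idx][key_idx])


def bert_attentions(sentence, model_name="bert-base-uncased"):
    """Run BERT on `sentence` and return (tokens, attentions).

    The checkpoint name is an assumption; any BERT checkpoint should work.
    """
    # Imported here so pair_attention can be used without these packages.
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)

    # Tokens include the special [CLS] and [SEP] markers.
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return tokens, outputs.attentions


def demo():
    """Print the two values from the question (downloads the checkpoint)."""
    tokens, attentions = bert_attentions("The cat sat on the mat.")
    cat, sat = tokens.index("cat"), tokens.index("sat")

    # Assumption: "layer 3, head 3" is 1-based, so it maps to index 2.
    attn_3_3 = pair_attention(attentions, 2, 2, cat, sat)
    attn_10_10 = pair_attention(attentions, 9, 9, cat, sat)
    print("attn_3_3  :", attn_3_3)
    print("attn_10_10:", attn_10_10)
```

Calling `demo()` needs `torch` and `transformers` installed and fetches the checkpoint on first use. Note that the weight from ‘cat’ to ‘sat’ is generally different from the weight from ‘sat’ to ‘cat’, so the order of the two indices matters.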
