Hello, I’m new to this community. I’ve studied a bit of theory on the attention mechanism, and now I’d like to jump into practice. At the moment, I’m looking at the attention weights extracted from a BertModel. I’m wondering whether the attention value between two tokens is comparable across different encoder layers (and/or heads).
As an example, let’s say we have this sentence: The cat sat on the mat. Is the attention between ‘cat’ and ‘sat’ at layer 3, head 3 (attn_3_3) comparable to the attention between the same words at layer 10, head 10 (attn_10_10)? Suppose attn_3_3 is greater than attn_10_10: does this mean that BERT pays more attention to the pair <cat, sat> at layer 3, head 3 than at layer 10, head 10? Or is it nothing like that?
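For context, here is roughly how I’m extracting the values I’m asking about (a minimal sketch using the Hugging Face transformers library; I use a tiny randomly-initialized BertModel and made-up token positions just to show the tensor shapes, rather than a pretrained checkpoint):

```python
import torch
from transformers import BertConfig, BertModel

# Tiny randomly-initialized BERT, only to illustrate shapes;
# in practice you would load a pretrained checkpoint instead.
config = BertConfig(
    vocab_size=100,
    hidden_size=64,
    num_hidden_layers=12,
    num_attention_heads=4,
    intermediate_size=128,
)
model = BertModel(config)
model.eval()

# Pretend this is the tokenized sentence "The cat sat on the mat."
input_ids = torch.randint(0, 100, (1, 8))

with torch.no_grad():
    out = model(input_ids, output_attentions=True)

# out.attentions is a tuple with one tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len); each row is a softmax, so it sums to 1.
attn = out.attentions

# Hypothetical positions of 'cat' and 'sat' in the token sequence.
layer, head, cat_idx, sat_idx = 3, 3, 1, 2
attn_3_3 = attn[layer][0, head, cat_idx, sat_idx].item()
print(attn_3_3)
```

So by attn_3_3 I mean a single scalar like the one above, and my question is whether two such scalars taken from different layers/heads can be meaningfully compared.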