Hello, I’m new to this community. I’ve studied a bit of theory on the attention mechanism, and now I’d like to jump into practice. At the moment, I’m looking at the attention weights extracted from a BertModel. I’m wondering whether the attention value between two tokens is comparable across different encoder layers (and/or heads).
As an example, let’s say we have this sentence: The cat sat on the mat. Is the attention between ‘cat’ and ‘sat’ at layer 3, head 3 (attn_3_3) comparable to the attention between the same words at layer 10, head 10 (attn_10_10)? Suppose attn_3_3 is greater than attn_10_10: does this mean that BERT pays more attention to the pair <cat, sat> at layer 3, head 3 than at layer 10, head 10? Or is it nothing like that?
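For context, here is roughly how I’m extracting the values I’m asking about (a minimal sketch using the Hugging Face transformers library; I use a tiny randomly-initialized BertModel and made-up token positions just to show the tensor shapes, rather than a pretrained checkpoint):

```python
import torch
from transformers import BertConfig, BertModel

# Tiny randomly-initialized BERT, only to illustrate shapes;
# in practice you would load a pretrained checkpoint instead.
config = BertConfig(
    vocab_size=100,
    hidden_size=64,
    num_hidden_layers=12,
    num_attention_heads=4,
    intermediate_size=128,
)
model = BertModel(config)
model.eval()

# Pretend this is the tokenized sentence "The cat sat on the mat."
input_ids = torch.randint(0, 100, (1, 8))

with torch.no_grad():
    out = model(input_ids, output_attentions=True)

# out.attentions is a tuple with one tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len); each row is a softmax, so it sums to 1.
attn = out.attentions

# Hypothetical positions of 'cat' and 'sat' in the token sequence.
layer, head, cat_idx, sat_idx = 3, 3, 1, 2
attn_3_3 = attn[layer][0, head, cat_idx, sat_idx].item()
print(attn_3_3)
```

So by attn_3_3 I mean a single scalar like the one above, and my question is whether two such scalars taken from different layers/heads can be meaningfully compared.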