Self-attention extraction from LongT5

Hi there, I’m trying to extract the encoder self-attention weights from LongT5 (using the LongT5EncoderModel class).
I can get the encoder attention weights by passing output_attentions=True to the forward method, but I’m having trouble understanding the output, which is a tensor of shape (batch_size, num_layers, num_heads, 128, 3*128 + K). I think K depends on the sequence length, while 128 should be the relative_attention_max_distance.
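For concreteness, here is a minimal sketch of what I’m doing (the checkpoint name and the input text are just placeholders for my actual setup):

```python
import torch
from transformers import AutoTokenizer, LongT5EncoderModel

# placeholder checkpoint, I load my own fine-tuned model in practice
tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
model = LongT5EncoderModel.from_pretrained("google/long-t5-tglobal-base")
model.eval()

inputs = tokenizer("a long input document ...", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions holds the encoder self-attention weights, one tensor per layer
for layer_idx, attn in enumerate(outputs.attentions):
    print(layer_idx, attn.shape)  # last two dims are 128 and 3*128 + K, as described above
```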
When I do this for BERT-like models, the last two dimensions form a square seq_len x seq_len matrix: each row, i.e. each token, holds the attention scores with respect to all the other tokens, which is exactly what I want (see the sketch below). How can I achieve the same with LongT5?
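For comparison, a minimal sketch of the BERT-like case I’m referring to (again, the checkpoint and the input sentence are just examples):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("a short example sentence", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

attn = outputs.attentions[0]  # self-attention weights of the first layer
print(attn.shape)             # (batch_size, num_heads, seq_len, seq_len)
# attn[0, h, i, j] is how much token i attends to token j in head h
```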
Thanks in advance!