Hi everyone! I am trying to build an attention heat map for the Google T5-small model (for a translation task), but I find the structure of cross_attentions quite misleading:
cross_attentions = outputs.cross_attentions
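For context, this is roughly how I obtain the output (a minimal sketch; the exact prompt and generation arguments are just an example and probably not the issue):

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer(
    "translate English to German: The house is wonderful.",
    return_tensors="pt",
)

# ask generate() to return the attention weights alongside the sequences
outputs = model.generate(
    **inputs,
    output_attentions=True,
    return_dict_in_generate=True,
)
cross_attentions = outputs.cross_attentions
```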
Moreover, it doesn’t seem to match the official documentation either:
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
After my investigation, the actual output of cross_attentions is a tuple of length num_output_tokens - 1. Each cross_attentions[i] consists of [ num_heads ] (6 in the case of the small model) elements, where each element is a tensor of size (batch_size, 8, 1, num_of_input_tokens), and I’m having a hard time figuring out what the 8 actually is.
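Printing the shapes is what led me to the description above (again just a quick sketch, assuming cross_attentions comes from the generate() call in the first snippet):

```python
# one entry per decoding step (num_output_tokens - 1 in my runs)
print(len(cross_attentions))
# 6 entries per step for t5-small
print(len(cross_attentions[0]))
# torch.Size([1, 8, 1, num_of_input_tokens]) -- this is the 8 I don't understand
print(cross_attentions[0][0].shape)
```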
I would highly appreciate any explanation of how I should deal with this nested output to build heat maps similar to these:
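For completeness, this is roughly what I have tried so far: taking the last element of each per-step tuple, stacking the per-step tensors, and averaging over the 8-sized dimension. I am not confident the indexing is correct, which is why I’m asking.

```python
import matplotlib.pyplot as plt

layer = -1  # last element of each per-step tuple

# for each decoding step: tensor (1, 8, 1, num_of_input_tokens) -> (8, num_of_input_tokens)
step_maps = [step[layer][0, :, 0, :] for step in cross_attentions]
attn = torch.stack(step_maps)   # (num_steps, 8, num_of_input_tokens)
attn = attn.mean(dim=1)         # average over the 8-sized dimension

input_tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
# drop the decoder start token so the rows line up with the decoding steps
output_tokens = tokenizer.convert_ids_to_tokens(outputs.sequences[0].tolist())[1:]

plt.imshow(attn.detach().numpy(), aspect="auto")
plt.xticks(range(len(input_tokens)), input_tokens, rotation=90)
plt.yticks(range(len(output_tokens)), output_tokens)
plt.xlabel("input tokens")
plt.ylabel("generated tokens")
plt.colorbar()
plt.show()
```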