T5: why do we have more tokens expressed via cross attentions than the decoded sequence?

veritas2019 · February 17, 2023, 9:15am

Hi all,

I’m trying to obtain cross attention weights for each generated token over all input tokens. The model loaded is T5ForConditionalGeneration.from_pretrained(*).

After setting output_attentions=True and return_dict_in_generate=True when calling model.generate(*), I get a result of type BeamSearchEncoderDecoderOutput. Its member cross_attentions is, as I understand from the documentation, a tuple containing one element per generated token.

Question: why is len(cross_attentions) different from the length of the generated sequence? I’m using batch size = 1, # of beams = 4.

My guess is that cross_attentions actually records all tokens that have been predicted on all beams? If so, how do I interpret cross_attentions so that I get just the weights for the selected generated sequence?

Thanks a lot!

joaogante · February 21, 2023, 3:52pm

Hey @veritas2019, auxiliary outputs in .generate() with beam search (i.e. num_beams>1) can be longer than the output sequence. This is because the output is the highest scoring sequence out of the candidate sequences, and the discarded candidate sequences can be longer. In other words, .generate() ran for more iterations than the number of generated tokens in the output sequence, which results in longer auxiliary outputs.
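To make the length mismatch concrete, here is a minimal sketch that trims per-step auxiliary outputs down to the steps that belong to the returned sequence. The helper name trim_to_returned_sequence and the toy data are hypothetical. It assumes a single returned sequence with no padding, and that the sequence begins with one decoder start token that no generation step produced (as is typically the case for encoder-decoder models such as T5).

```python
def trim_to_returned_sequence(per_step_outputs, sequence_length, num_prefix_tokens=1):
    """Keep only the per-step outputs that belong to the returned sequence.

    per_step_outputs: tuple/list with one entry per generation step
        (e.g. the cross_attentions tuple).
    sequence_length: number of tokens in the returned sequence.
    num_prefix_tokens: tokens at the start of the sequence that were not
        generated (assumed to be 1: the decoder start token).
    """
    n_generated = sequence_length - num_prefix_tokens
    return per_step_outputs[:n_generated]


# Toy data: generation ran for 6 steps, but the best hypothesis ended after 4.
toy_steps = ["step0", "step1", "step2", "step3", "step4", "step5"]
toy_sequence = ["<start>", "tok1", "tok2", "tok3", "<eos>"]
trimmed = trim_to_returned_sequence(toy_steps, len(toy_sequence))
print(len(toy_steps), len(trimmed))
```

The extra trailing entries come from iterations that only extended candidates which were later discarded, so dropping them loses nothing about the returned sequence.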

Finally, a word of caution: if you want to access the values of e.g. cross attention that correspond to a given token, you have to descramble the beam index. In other words, the output sequence may correspond to beam 0 for the 1st token, beam 3 for the 2nd token, and so on, so at each step you have to fetch the values at the beam index that the generated token came from. See this issue for more information and examples.
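A rough sketch of the descrambling step, written against plain nested sequences so it runs without a model. The helper name select_beam_attentions is hypothetical. It assumes that cross_attentions is indexed as [step][layer][beam row], that you have a per-step list of beam indices for the returned sequence (in recent transformers versions the generate output exposes a beam_indices field when output_scores=True, but please check this against your version), and that negative entries mark steps after the sequence ended.

```python
def select_beam_attentions(cross_attentions, beam_indices_row):
    """For each generation step, pick the attention of the beam that produced
    the token in the returned sequence.

    cross_attentions: indexed as [step][layer][beam_row] (assumed layout).
    beam_indices_row: one beam index per step for a single returned sequence;
        negative values are treated as padding (assumption).
    """
    selected = []
    for step, beam in enumerate(beam_indices_row):
        beam = int(beam)  # also works for 0-d tensors
        if beam < 0:
            break  # the returned sequence ended before this step
        # Keep every decoder layer, but only the row of the selected beam.
        selected.append([layer[beam] for layer in cross_attentions[step]])
    return selected


# Toy data: 3 steps were run, 2 beams, 1 decoder layer.
toy_cross_attentions = [
    [["s0b0", "s0b1"]],
    [["s1b0", "s1b1"]],
    [["s2b0", "s2b1"]],
]
toy_beam_indices = [0, 1, -1]  # token 1 from beam 0, token 2 from beam 1
picked = select_beam_attentions(toy_cross_attentions, toy_beam_indices)
print(picked)
```

With real tensors, layer[beam] would be the attention of that beam over the input tokens (one slice per head), so the result has one entry per token of the returned sequence rather than one per generation iteration.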
