T5: why do we have more tokens expressed via cross attentions than the decoded sequence?

Hi all,

I’m trying to obtain the cross attention weights for each generated token over all input tokens. The model is loaded with T5ForConditionalGeneration.from_pretrained(*).

After setting output_attentions=True and return_dict_in_generate=True when calling model.generate(*), I get a result of type BeamSearchEncoderDecoderOutput, whose cross_attentions member is, as I understand from the documentation, a tuple containing one element per generated token.
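
Roughly, my setup looks like this (simplified sketch; the checkpoint and input text are just placeholders):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # placeholder checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")

outputs = model.generate(
    **inputs,
    num_beams=4,
    output_attentions=True,
    return_dict_in_generate=True,
)

# cross_attentions: one entry per decoding step; each entry is a tuple with one tensor
# per decoder layer, each of shape (batch_size * num_beams, num_heads, 1, input_seq_len)
# (the query dimension is 1 per step because caching is enabled by default)
print(len(outputs.cross_attentions))
print(outputs.sequences.shape)  # (1, output_len), including the decoder start token
```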

Question: why is len(cross_attentions) different from the length of generated sequence? I’m using batch size = 1, # of beams = 4.

My guess is that cross_attentions actually records the predictions made on all beams? If so, how do I interpret cross_attentions so that I can get just the weights for the selected generated sequence?

Thanks a lot!

Hey @veritas2019 :wave: Auxiliary outputs in .generate() with beam search (i.e. num_beams>1) can be longer than the output sequence. This is because the output is the highest-scoring sequence out of the candidate sequences, and the discarded candidates can be longer – i.e. .generate() may have run for more iterations than the number of tokens in the returned sequence, which results in longer auxiliary outputs.
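
You can check this directly (a quick sketch, assuming batch size 1 and the setup above):

```python
num_steps = len(outputs.cross_attentions)       # decoding steps actually run
num_generated = outputs.sequences.shape[1] - 1  # returned tokens, minus the decoder start token
print(num_steps, num_generated)                 # num_steps can exceed num_generated under beam search
```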

Finally, a word of caution :warning: If you want to access the values of e.g. cross attention that correspond to a given token, you have to descramble the beam indices. In other words, the returned sequence may correspond to beam 0 for the 1st token, beam 3 for the 2nd token, and so on… so for each generated token you have to fetch the values from the beam that actually produced it. See this issue for more information and examples :slight_smile:
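
Here’s a rough sketch of that descrambling (assuming batch size 1 and a recent transformers version – passing output_scores=True makes .generate() also return beam_indices, which maps each token of the returned sequence to the beam it came from):

```python
outputs = model.generate(
    **inputs,
    num_beams=4,
    output_attentions=True,
    output_scores=True,            # needed so that beam_indices is returned
    return_dict_in_generate=True,
)

layer = -1  # pick a decoder layer, e.g. the last one
per_token_cross_attn = []
for step, beam_idx in enumerate(outputs.beam_indices[0]):
    if beam_idx < 0:               # -1 marks positions past the end of the returned sequence
        break
    # cross_attentions[step][layer]: (num_beams, num_heads, 1, input_seq_len);
    # keep only the beam that produced this token -> (num_heads, input_seq_len)
    per_token_cross_attn.append(outputs.cross_attentions[step][layer][beam_idx, :, 0, :])
```

per_token_cross_attn then has one entry per token of the returned sequence, each holding that token’s cross attention over the input tokens.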