Why is attn_weights reshaped when outputting attentions?

kexunz · April 13, 2021, 2:25am

In the BART source code, when the BartAttention class outputs attention weights, the weights are reshaped twice to “keep its gradient”. I wonder why this operation is necessary, since attn_weights has the same shape before and after it.

    if output_attentions:
        # this operation is a bit akward, but it's required to
        # make sure that attn_weights keeps its gradient.
        # In order to do so, attn_weights have to reshaped
        # twice and have to be reused in the following
        attn_weights_reshaped = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
        attn_weights = attn_weights_reshaped.view(bsz * self.num_heads, tgt_len, src_len)
    else:
        attn_weights_reshaped = None
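
One way to probe this is a small standalone experiment. This is a minimal sketch, not the library's code: the function name `attention_grad`, the random `value` tensor, and the tensor sizes are made up for illustration, and it assumes the comment's concern is whether the returned tensor lies on the autograd path to the layer output. It compares the gradient recorded on `attn_weights_reshaped` with and without the second `view`.

```python
import torch

# Hypothetical sizes, chosen only for this illustration.
bsz, num_heads, tgt_len, src_len, head_dim = 2, 3, 4, 5, 8


def attention_grad(reuse_reshaped):
    """Return the gradient recorded on the tensor that would be output."""
    # Stand-in for the softmaxed attention scores inside the layer.
    scores = torch.randn(bsz * num_heads, tgt_len, src_len, requires_grad=True)
    attn_weights = scores.softmax(dim=-1)

    # First view: the 4-D tensor that is handed back to the caller.
    attn_weights_reshaped = attn_weights.view(bsz, num_heads, tgt_len, src_len)
    # Ask autograd to keep .grad on this non-leaf tensor.
    attn_weights_reshaped.retain_grad()

    if reuse_reshaped:
        # Second view: the downstream computation now passes through
        # attn_weights_reshaped instead of bypassing it.
        attn_weights = attn_weights_reshaped.view(bsz * num_heads, tgt_len, src_len)

    # Stand-in for the value states; the output depends on attn_weights.
    value = torch.randn(bsz * num_heads, src_len, head_dim)
    out = torch.bmm(attn_weights, value)
    out.sum().backward()
    return attn_weights_reshaped.grad


grad_with_second_view = attention_grad(reuse_reshaped=True)
grad_without_second_view = attention_grad(reuse_reshaped=False)
print(grad_with_second_view is not None)   # gradient reached the returned tensor
print(grad_without_second_view is None)    # returned tensor was a side branch
```

In this sketch the shape is indeed unchanged by the round trip, but the graph is not: without the second `view`, `attn_weights_reshaped` is a side branch that the output never uses, so no gradient flows back to it.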
