Forward-looking or left-context attention mask (left-to-right) generation with BertGeneration and RobertaForCausalLM

claartje-barkhof · October 26, 2020, 11:30am

Hi,

I am trying to build an altered version of the models proposed in “Leveraging Pre-trained Checkpoints for Sequence Generation Tasks” by Rothe et al. (2020).

In the paper they say that for the BERT-like architectures that are used for generation:

“If not stated otherwise, the implementation of the decoder layers are also identical to the BERT implementation with two adjustments. First the self-attention mechanism is masked to look only at the left context.”

I am using the RobertaForCausalLM class as a basis, but the same would hold for the BertGeneration class. I do not see where this left-context (forward-looking) attention mask is implemented. I see that I could provide it myself by passing it to the forward function, but I find it strange that it is not noted anywhere in the code, as if I am missing something.

If someone could point out what I am missing or where I can find this attention mask, that would be very helpful.

Thanks in advance,

Claartje Barkhof

claartje-barkhof · October 26, 2020, 12:31pm

In other words, I am not sure where the ‘causal mask’ is implemented.

claartje-barkhof · October 26, 2020, 12:46pm

Oh, and it seems that @patrickvonplaten implemented or is involved with these models. Maybe you could point me to where this is happening? That would be very helpful. Thanks in advance.

claartje-barkhof · October 27, 2020, 6:49am

I have found it. It happens in get_extended_attention_mask in modeling_utils.py. Consider this question solved.
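
For anyone who finds this later: as far as I can tell, the causal branch in get_extended_attention_mask is only taken when the model config has is_decoder set to True, which would explain why nothing about it shows up in the model classes themselves. Below is a minimal, dependency-free sketch of the idea, not the library's actual code. The function names causal_mask and extend_mask are made up for illustration, and the value -10000.0 for blocked positions is an assumption.

```python
# Dependency-free sketch of a left-context (causal) mask combined with a
# padding mask. The function names are hypothetical, not the transformers API.

def causal_mask(seq_len):
    """Row i is the query position, column j the key position; 1 = may attend."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def extend_mask(padding_mask, neg=-10000.0):
    """Combine a 1D padding mask (1 = real token, 0 = padding) with the causal
    mask and return it in additive form: 0.0 where attention is allowed, a
    large negative number (assumed value) where it is blocked. The result is
    meant to be added to the attention scores before the softmax."""
    n = len(padding_mask)
    causal = causal_mask(n)
    return [
        [0.0 if causal[i][j] and padding_mask[j] else neg for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Lower-triangular pattern: each token sees itself and the tokens to its left.
    for row in causal_mask(4):
        print(row)
    # The last position is padding, so no query may attend to it.
    for row in extend_mask([1, 1, 1, 0]):
        print(row)
```

So a user-supplied attention_mask only needs to mark padding; the left-to-right restriction is layered on top of it for decoder configs.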
