Forward-looking or left-context attention mask (left-to-right) generation with BertGeneration and RobertaForCausalLM


I am trying to build an altered version of the models proposed in “Leveraging Pre-trained Checkpoints for Sequence Generation Tasks” by Rothe et al. (2020).

In the paper they say that for the BERT-like architectures that are used for generation:

“If not stated otherwise, the implementation of the decoder layers are also identical to the BERT implementation with two adjustments. First the self-attention mechanism is masked to look only at the left context.”

I am using the RobertaForCausalLM class as a basis, but the same would hold for the BertGeneration class. I do not see where this left-context (causal) attention mask is implemented. I see that I could build it myself and pass it to the forward function as the attention mask, but I find it strange that this is not noted anywhere in the code, as if I am missing something.

If someone could point out what I am missing or where I can find this attention mask, that would be very helpful.

Thanks in advance,

Claartje Barkhof

In other words, I am not sure where the ‘causal mask’ is implemented?

Oh, and it seems that @patrickvonplaten implemented / is involved with these models, maybe you could point me to where this is happening? That would be very helpful :pray: Thanks in advance.

I have found it. It happens in get_extended_attention_mask in modeling_utils.py, which builds the causal mask when the model is configured as a decoder. Consider this question solved :slight_smile:
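For anyone landing here later, here is a minimal sketch of the idea behind that function, written in plain Python so it runs without any dependencies. It is not the library's code: the helper names (causal_mask, combine_with_padding, to_additive) are made up for illustration, and the fill value of -10000.0 is an assumption about how masked positions are pushed towards zero probability before the softmax. As far as I can tell, the library only takes the causal branch when the config has is_decoder set to True.

```python
def causal_mask(seq_len):
    """Lower-triangular 1/0 mask: query position i may attend to key j only if j <= i."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def combine_with_padding(causal, padding_mask):
    """Multiply each row of the causal mask by the 1/0 padding mask over key positions,
    so padded tokens are hidden from every query position."""
    return [[c * p for c, p in zip(row, padding_mask)] for row in causal]


def to_additive(mask, neg=-10000.0):
    """Turn the 1/0 mask into a bias that is added to the raw attention scores:
    0.0 where attention is allowed, a large negative value (assumed) where it is not."""
    return [[0.0 if m else neg for m in row] for row in mask]


# Hypothetical example: 4 tokens, the last one is padding.
padding = [1, 1, 1, 0]
mask = combine_with_padding(causal_mask(4), padding)
bias = to_additive(mask)

for row in mask:
    print(row)
```

Each row of the printed mask is one query position; the ones mark the keys it is allowed to see. The additive form is what ends up being summed with the attention scores, so masked positions get close to zero weight after the softmax.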