I am trying to build an altered version of the models proposed in “Leveraging Pre-trained Checkpoints for Sequence Generation Tasks” by Rothe et al. (2020).
In the paper they say that for the BERT-like architectures that are used for generation:
“If not stated otherwise, the implementation of the decoder layers are also identical to the BERT implementation with two adjustments. First the self-attention mechanism is masked to look only at the left context.”
I am using the RobertaForCausalLM class as a basis, but the same would hold for the BertGeneration class. I do not see where this left-context (causal) attention mask is implemented. I see that I could build it myself and pass it in as the attention mask, but it seems strange that it is not mentioned anywhere in the code, as if I am missing something.
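To make clear what kind of mask I mean, here is a minimal sketch in plain Python. The function name causal_mask is my own and is not taken from the transformers library; it only illustrates the left-context pattern described in the paper, where each position may attend to itself and to earlier positions.

```python
def causal_mask(seq_len):
    """Build a seq_len x seq_len left-context mask (hypothetical helper).

    Entry [i][j] is 1 if position i may attend to position j,
    i.e. only when j is at or before i, and 0 otherwise.
    """
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]


# Print the mask for a 4-token sequence: a lower-triangular matrix.
for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

I would have expected to find something equivalent to this lower-triangular pattern somewhere in the model code.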
If someone could point out what I am missing or where I can find this attention mask, that would be very helpful.
Thanks in advance,