Causal masks in BERT vs. GPT2

Hi all - just a quick clarification about causal masks in GPT2 and BertLMHeadModel. It's clear that GPT2 automatically adds causal masks (lines 115 and 151 of its source). I believe that this should be the case for BertLMHeadModel as well - my understanding is that it is mainly meant to be used as a decoder in the EncoderDecoder construct (please correct me if this is not the case). However, I am having a hard time finding where this occurs in the source. Are causal masks automatically added? I suspect I am just missing where this happens in the code, as the attention_mask input (meant for masking padding tokens) wouldn't take care of this. Thanks!


Hi @dilawn, have you figured this out yet by any chance? I have just posted a similar question about not finding causal masks in the RobertaForCausalLM and BertGeneration classes. I am wondering whether I misunderstand the concept or am simply missing the line of code that applies these causal masks.


For me the answer was in the function get_extended_attention_mask in modeling_utils.py.


Yes, I believe that @claartje-barkhof is correct: get_extended_attention_mask checks config.is_decoder and, if it is set, creates the causal mask (see lines 236-255).
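For anyone landing here later, here is a rough sketch of what that check amounts to, as far as I understand it. It is plain Python rather than the tensor code in the library, and the function names (other than the idea of an is_decoder flag) are made up for illustration. The padding mask alone only hides padded key positions; the causal part is added only when the model is configured as a decoder.

```python
# Illustrative sketch only: a plain-Python stand-in for the tensor logic,
# NOT the actual transformers implementation. Function names are hypothetical.

def causal_mask(seq_len):
    """Lower-triangular 0/1 mask: query i may attend to keys j <= i."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def extended_attention_mask(attention_mask, is_decoder):
    """Combine a padding mask (1 = real token, 0 = padding) with an
    optional causal mask into a seq_len x seq_len 0/1 matrix."""
    seq_len = len(attention_mask)
    if is_decoder:
        base = causal_mask(seq_len)
    else:
        base = [[1] * seq_len for _ in range(seq_len)]
    # Key j is visible to query i only if base allows it AND j is not padding.
    return [
        [base[i][j] * attention_mask[j] for j in range(seq_len)]
        for i in range(seq_len)
    ]


def additive_bias(mask, neg=-10000.0):
    """Turn a 0/1 mask into a bias added to attention scores before softmax.
    The value of `neg` is an assumption; any large negative number works here."""
    return [[0.0 if m else neg for m in row] for row in mask]


padding = [1, 1, 1, 0]  # last position is padding
decoder_mask = extended_attention_mask(padding, is_decoder=True)
encoder_mask = extended_attention_mask(padding, is_decoder=False)

for row in decoder_mask:
    print(row)
```

With is_decoder=False every row is just the padding mask, so each token can still see later tokens, which is why attention_mask by itself does not give causal behaviour.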

I think GPT2 is autoregressive by design, hence the need for causal masks everywhere in its architecture, whereas BERT only needs them when it is configured as a decoder (config.is_decoder).