Hi all - just a quick clarification about causal masks in GPT2 and BertLMHeadModel. It's clear that GPT2 automatically adds causal masks (lines 115 and 151 in modeling_gpt2.py). I believe this should be the case for BertLMHeadModel as well - my understanding is that it is mainly meant to be used as a decoder in the EncoderDecoder construct (please correct me if this is not the case). However, I am having a hard time finding where this happens in the source. Are causal masks automatically added? I suspect I am just missing where this happens in the code, since the attention_mask input (meant for masking padding tokens) wouldn't take care of this. Thanks!
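For reference, this is roughly what I mean by using BertLMHeadModel as a decoder (bert-base-uncased is just an example checkpoint):

```python
from transformers import BertConfig, BertLMHeadModel

# is_decoder=True is what should (I assume) trigger any automatic causal masking.
config = BertConfig.from_pretrained("bert-base-uncased", is_decoder=True)
model = BertLMHeadModel.from_pretrained("bert-base-uncased", config=config)
```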
Hi @dilawn, have you figured this out yet by any chance? I have just posted a similar question about not finding causal masks in the RobertaForCausalLM and BertGeneration classes. I am wondering whether I am misunderstanding the concept or just missing the line of code that applies these causal masks.
For me the answer was in the function get_extended_attention_mask in modeling_utils.py.
Yes, I believe @claartje-barkhof is correct: in get_extended_attention_mask there is a check on config.is_decoder followed by the causal mask creation (see lines 236 - 255).
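To make that concrete, here is a minimal sketch of what happens for a decoder (simplified, not the actual library code): a lower-triangular causal mask is built, combined with the padding attention_mask, and converted into an additive mask that gets added to the attention scores.

```python
import torch

def extended_attention_mask_sketch(attention_mask: torch.Tensor, is_decoder: bool, dtype=torch.float32):
    """Sketch: turn a (batch, seq_len) padding mask into a (batch, 1, seq_len, seq_len) additive mask."""
    batch_size, seq_length = attention_mask.shape
    if is_decoder:
        # Causal part: position i may only attend to positions <= i (lower-triangular).
        seq_ids = torch.arange(seq_length, device=attention_mask.device)
        causal_mask = seq_ids[None, None, :].repeat(batch_size, seq_length, 1) <= seq_ids[None, :, None]
        causal_mask = causal_mask.to(attention_mask.dtype)
        # Combine the causal mask with the user-supplied padding mask.
        extended_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]
    else:
        # Encoder case: only the padding mask is broadcast over the attention scores.
        extended_mask = attention_mask[:, None, None, :]
    # Convert 1/0 keep/mask values into an additive mask: 0.0 where attended,
    # a large negative value where masked.
    extended_mask = extended_mask.to(dtype)
    return (1.0 - extended_mask) * torch.finfo(dtype).min

# Example: batch of 1, length 4, last token is padding.
mask = torch.tensor([[1, 1, 1, 0]])
print(extended_attention_mask_sketch(mask, is_decoder=True).shape)  # torch.Size([1, 1, 4, 4])
```

So the causal mask is only applied when config.is_decoder is set, and the attention_mask you pass in is still only responsible for padding.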
I think only GPT2 is autoregressive by design, hence the causal masks are (I believe) built into its architecture everywhere, rather than being applied conditionally as in the BERT-based decoders.