Causal masks in BERT vs. GPT2

Hi all - just a quick clarification about causal masks in GPT2 and BertLMHeadModel. It's clear that GPT2 automatically adds causal masks (lines 115 and 151 in modeling_gpt2.py). I believe this should be the case for BertLMHeadModel as well - my understanding is that it is mainly meant to be used as a decoder in the EncoderDecoder construct (please correct me if this is not the case). However, I am having a hard time finding where this happens in the source. Are causal masks added automatically? I suspect I am just missing where this happens in the code, since the attention_mask input (meant for masking padding tokens) wouldn't take care of this. Thanks!
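To make what I mean concrete, here is a rough sketch of the pattern I see in GPT2 (not the exact library code): a lower-triangular buffer is built once and applied to the attention scores, so the model never needs a causal mask passed in by the caller.

```python
import torch

# Sketch of the GPT2-style pattern (not the exact HF implementation):
# a lower-triangular "bias" buffer is created once and applied to the
# attention scores, so causal masking happens without any user input.
n_ctx = 8
causal_bias = torch.tril(torch.ones(n_ctx, n_ctx)).view(1, 1, n_ctx, n_ctx)

def masked_attn_scores(q, k):
    # q, k: (batch, heads, seq_len, head_dim)
    w = q @ k.transpose(-1, -2) / (k.size(-1) ** 0.5)
    seq_len = w.size(-1)
    mask = causal_bias[:, :, :seq_len, :seq_len].bool()
    return w.masked_fill(~mask, float("-inf"))  # future positions get -inf

q = k = torch.randn(1, 2, 5, 4)
print(masked_attn_scores(q, k).softmax(dim=-1))  # upper triangle is ~0 after softmax
```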


Hi @dilawn, have you figured this out yet by any chance? I have just posted a similar question about not finding causal masks in the RobertaForCausalLM and BertGeneration classes. I am wondering whether I misunderstand the concept or am missing the line of code that applies these causal masks.

Link to my post.

For me the answer was in the get_extended_attention_mask function in modeling_utils.py.


Yes, I believe @claartje-barkhof is correct: in get_extended_attention_mask there is a check on config.is_decoder followed by the causal mask creation (see lines 236 - 255).
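For anyone who finds this later, here is a rough paraphrase (not the exact library code) of what that is_decoder branch does: the 2D padding mask is combined with a lower-triangular causal mask and converted into an additive mask that the attention layers add to their scores.

```python
import torch

def extended_decoder_mask(attention_mask):
    # attention_mask: (batch, seq_len) with 1 = real token, 0 = padding.
    # Rough sketch of the is_decoder branch of get_extended_attention_mask
    # (paraphrased, not copied line-for-line from modeling_utils.py).
    batch_size, seq_length = attention_mask.shape
    seq_ids = torch.arange(seq_length)
    # causal_mask[b, i, j] is True iff j <= i (no attending to future positions)
    causal_mask = seq_ids[None, None, :].repeat(batch_size, seq_length, 1) <= seq_ids[None, :, None]
    # combine the causal mask with the padding mask, add a broadcast head dim
    combined = causal_mask[:, None, :, :].float() * attention_mask[:, None, None, :].float()
    # turn 1/0 into additive 0 / large-negative values for the attention scores
    return (1.0 - combined) * -10000.0

mask = torch.tensor([[1, 1, 1, 0]])  # one padded position
print(extended_decoder_mask(mask))   # shape (1, 1, 4, 4)
```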

I think GPT2 is the only one of these that is inherently autoregressive, hence the causal masks are (I guess) built in everywhere in the architecture, whereas BERT only applies them when it is configured as a decoder.
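If it helps, my understanding is that you opt BERT into the decoder behaviour via the config, so the is_decoder branch above kicks in. A minimal sketch (checkpoint name and exact output attributes may vary by version):

```python
from transformers import BertConfig, BertLMHeadModel, BertTokenizer

# Sketch: load BERT as an autoregressive decoder so get_extended_attention_mask
# takes the is_decoder branch and applies the causal mask automatically.
config = BertConfig.from_pretrained("bert-base-uncased")
config.is_decoder = True
model = BertLMHeadModel.from_pretrained("bert-base-uncased", config=config)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])  # left-to-right LM loss
print(outputs.loss)
```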