The Hugging Face documentation states:
> GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned using a masked language modeling (MLM) loss.
I have two questions regarding this statement:
- Is this a common distinction you’d find in the NLP literature (any literature on this distinction)?
- Is it a sensible distinction in your opinion? While I totally agree with the CLM terminology, I don’t understand why you would call BERT & co. “masked language models”, since it’s the causal language models that do the actual masking in next-token prediction (see the toy sketch after this list)?
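To make the question concrete, here is a toy sketch of the two kinds of “masking” I have in mind. The token ids and the `MASK_ID` value are made up for illustration; this is not the actual Hugging Face data-collation code, just my understanding of the two setups:

```python
import torch

# --- "Masking" in a causal LM (GPT-style) ---
# The inputs are left untouched; the "mask" is a causal attention mask that
# prevents each position from attending to future positions. The loss is
# next-token prediction on the unmodified sequence.
seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask.int())
# row i has 1s only up to column i: position i cannot "see" positions > i

# --- "Masking" in a masked LM (BERT-style) ---
# Here the inputs themselves are corrupted: a random subset of token ids is
# replaced by a [MASK] id, and the loss is computed only at those positions.
MASK_ID = 103                                   # hypothetical [MASK] id, for illustration only
input_ids = torch.tensor([101, 2023, 2003, 1037, 7099, 102])  # made-up token ids
mlm_positions = torch.rand(input_ids.shape) < 0.15             # ~15% of positions
corrupted_ids = input_ids.clone()
corrupted_ids[mlm_positions] = MASK_ID
labels = torch.where(mlm_positions, input_ids, torch.tensor(-100))  # -100 = ignored by the loss
print(corrupted_ids)
print(labels)
```

So my confusion is mainly about the naming: both setups involve a mask of some kind, just in different places (the attention pattern vs. the input tokens).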
Thanks!