The Hugging Face documentation states:
> GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned using a masked language modeling (MLM) loss.
I have two questions regarding this statement:
- Is this a common distinction you’d find in the NLP literature (any literature on this distinction)?
- Is it a sensible distinction in your opinion? While I totally agree with the CLM terminology, I don’t understand why you would call BERT & co. “masked language models”, since it’s the causal language models that do the actual masking in next-token prediction (see the toy sketch after this list)?
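To make the question concrete, here is a toy sketch of the two kinds of “masking” I have in mind. The token ids and the `MASK_ID` value are made up for illustration; this is not the actual Hugging Face data-collation code, just my understanding of the two setups:

```python
import torch

# --- "Masking" in a causal LM (GPT-style) ---
# The inputs are left untouched; the "mask" is a causal attention mask that
# prevents each position from attending to future positions. The loss is
# next-token prediction on the unmodified sequence.
seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask.int())
# row i has 1s only up to column i: position i cannot "see" positions > i

# --- "Masking" in a masked LM (BERT-style) ---
# Here the inputs themselves are corrupted: a random subset of token ids is
# replaced by a [MASK] id, and the loss is computed only at those positions.
MASK_ID = 103                                   # hypothetical [MASK] id, for illustration only
input_ids = torch.tensor([101, 2023, 2003, 1037, 7099, 102])  # made-up token ids
mlm_positions = torch.rand(input_ids.shape) < 0.15             # ~15% of positions
corrupted_ids = input_ids.clone()
corrupted_ids[mlm_positions] = MASK_ID
labels = torch.where(mlm_positions, input_ids, torch.tensor(-100))  # -100 = ignored by the loss
print(corrupted_ids)
print(labels)
```

So my confusion is mainly about the naming: both setups involve a mask of some kind, just in different places (the attention pattern vs. the input tokens).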
Thanks!