Is causal language modeling (CLM) vs masked language modeling (MLM) a common distinction in NLP research?

The Hugging Face documentation states:

GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned using a masked language modeling (MLM) loss.

I have two questions regarding this statement:

  • Is this a common distinction you’d find in the NLP literature (is there any literature on this distinction)?
  • Is it a sensible distinction in your opinion? While I fully agree with the CLM part, I don’t understand why you would call BERT & co. “masked language models”, since causal language models are the ones doing the actual masking (of future tokens) during next-token prediction?
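To make my confusion concrete, here is a toy sketch in plain Python (not the actual Hugging Face implementation) of how I understand the two kinds of “masking”: CLM uses a causal *attention* mask so position i only sees positions ≤ i, while MLM replaces a subset of *input* tokens with a `[MASK]` symbol and predicts them from both sides:

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# CLM-style masking: a causal (lower-triangular) attention mask.
# Position i may attend only to positions j <= i, i.e. the left context,
# and the model is trained to predict the NEXT token at each position.
causal_mask = [
    [1 if j <= i else 0 for j in range(len(tokens))]
    for i in range(len(tokens))
]

# MLM-style masking: corrupt a subset of the INPUT tokens themselves
# (here a fixed position for determinism; BERT samples ~15% at random)
# and train the model to recover them using context on BOTH sides.
masked_input = list(tokens)
masked_input[2] = "[MASK]"

print(causal_mask[0])    # first position sees only itself
print(causal_mask[-1])   # last position sees the full left context
print(masked_input)
```

So both objectives involve “masking” something, just at different places (the attention pattern vs. the input tokens), which is why the naming seems ambiguous to me.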