Questions on the `BertModelLMHeadModel`

I have a few questions about the BertModelLMHeadModel:

  1. Is BertModelLMHeadModel used for regular language modeling (next-token prediction), as is the case for GPT2LMHeadModel?

  2. For GPT2LMHeadModel, I can just specify labels = input_ids for convenience. Can I specify the labels in the same way for BertModelLMHeadModel as well?


Hi @h56cho
do you mean the BertLMHeadModel?

If so, it’s intended to be used with the EncoderDecoder model, which allows you to use a pre-trained encoder as both the encoder and the decoder for seq2seq tasks. It’s not intended for language modeling.

While you can use that class as a standalone decoder by passing is_decoder=True to the config, it might not give you good results, as it was trained as an encoder.

The HuggingFace Transformers documentation seems to indicate that BertLMHeadModel can be used for causal language modeling: if you look at the values returned by this model, they include a CausalLMOutput. Doesn’t the term “causal language modeling” refer to regular language modeling, as in the case of GPT-2? I am not so interested in the accuracy of the results; my intention is to examine the distribution of the attention weights.

Also, when providing “labels” for causal language modeling with BertLMHeadModel, can I just use labels = input_ids, as in the case of GPT-2, for convenience?

Thank you,

That’s what I said in the last comment,
It can be used as a standalone decoder (standalone decoder = causal LM).

Yes, you can pass labels = input_ids
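
For illustration, here is a minimal sketch of the standalone-decoder setup described above. It assumes a tiny, randomly initialised BertConfig (the sizes are made up so that nothing needs to be downloaded) and random token ids in place of tokenizer output; `attn_implementation="eager"` is an assumption so that attention weights can be returned on recent library versions. It passes labels = input_ids, requests the attention weights, and checks that the first position's logits do not depend on later tokens, which is what a causal mask should guarantee.

```python
import torch
from transformers import BertConfig, BertLMHeadModel

torch.manual_seed(0)

# Tiny random config (illustrative sizes, NOT the bert-base-uncased values).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    is_decoder=True,              # switches on the causal (left-to-right) mask
    attn_implementation="eager",  # assumption: eager attention returns the weights
)
model = BertLMHeadModel(config).eval()

# Stand-in for tokenizer output: one sequence of 6 random token ids.
input_ids = torch.randint(5, 99, (1, 6))

with torch.no_grad():
    outputs = model(
        input_ids,
        labels=input_ids,         # the model shifts the labels internally
        output_attentions=True,
        return_dict=True,
    )

loss = outputs.loss               # scalar next-token prediction loss
logits = outputs.logits           # (batch, seq_len, vocab_size)
attentions = outputs.attentions   # one (batch, heads, seq_len, seq_len) tensor per layer

# Causality check: changing the last token must not change the first position's logits.
changed = input_ids.clone()
changed[0, -1] = changed[0, -1] + 1
with torch.no_grad():
    logits_changed = model(changed, return_dict=True).logits
first_token_unchanged = torch.allclose(logits[0, 0], logits_changed[0, 0], atol=1e-5)
```

With real weights you would load the config and model with from_pretrained instead; the results may still be poor, for the reason given above, but the attention tensors can be inspected in the same way.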

Thank you!

Sorry, I have an additional question. This one is about the BertForMaskedLM model.
The documentation for BertForMaskedLM provides the following example to illustrate the model’s usage:

>>> from transformers import BertTokenizer, BertForMaskedLM
>>> import torch

>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict=True)
>>> input_ids = tokenizer("Hello, my dog is cute", return_tensors="pt")["input_ids"]

>>> outputs = model(input_ids, labels=input_ids)
>>> loss = outputs.loss
>>> prediction_logits = outputs.logits

In the example above, I don’t see any [MASK] token in the input; can the BertForMaskedLM model really be used with an input string that does not include a [MASK] token? If I provide the BertForMaskedLM model with an input string that does not include the [MASK] token, from which token will the model’s output be produced? In this case, would BertForMaskedLM automatically insert a [MASK] token at the beginning of the input sequence?

Thank you again,

That’s probably a mistake,

This might help.

Yes it’s definitely a mistake. Will fix this morning.
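
For reference, a minimal sketch of how the masked-LM call is usually set up, since the model does not insert [MASK] by itself. It assumes a tiny, randomly initialised config and a hypothetical MASK_ID for a toy vocabulary (with a real tokenizer you would use tokenizer.mask_token_id). One position is replaced with the mask id, and the labels are set to -100 everywhere else so that only the masked position contributes to the loss.

```python
import torch
from transformers import BertConfig, BertForMaskedLM

torch.manual_seed(0)

# Tiny random config (illustrative sizes only, no download needed).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertForMaskedLM(config).eval()

MASK_ID = 4          # hypothetical [MASK] id for this toy vocabulary
masked_position = 3  # position we choose to mask in this example

# Stand-in for tokenizer output: one sequence of 6 random token ids.
original_ids = torch.randint(5, 100, (1, 6))

# Replace the chosen token with the mask id in the model input.
input_ids = original_ids.clone()
input_ids[0, masked_position] = MASK_ID

# Labels: -100 is ignored by the loss; keep the true id only at the masked position.
labels = torch.full_like(original_ids, -100)
labels[0, masked_position] = original_ids[0, masked_position]

with torch.no_grad():
    outputs = model(input_ids, labels=labels, return_dict=True)

loss = outputs.loss  # computed from the masked position only
predicted_id = outputs.logits[0, masked_position].argmax().item()
```

If labels = input_ids is passed on an unmasked input instead, the loss is computed over every position, with each token visible to the model, so it does not measure masked-token prediction.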