Hi, I’d like to know which class I should use to extract word embeddings.
Given a sentence, I just want to extract contextualized word embeddings.
However, I don’t really understand how to do it. In particular, I want to sum up the last four hidden layers of BERT. What’s the difference between hidden_states from BertForMaskedLM and hidden_states from BertModel? Thank you in advance.
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
inputs = tokenizer("The capital of France is Paris.", return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)
outputs.hidden_states
or
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("The capital of France is Paris.", return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)
outputs.hidden_states
Both will return the same thing. BertForMaskedLM just adds a language modeling head on top of BertModel; the hidden states (the output of each layer of the Transformer encoder, plus the initial embedding output) are identical. As you don’t need to perform masked language modeling, you can just use BertModel, like so:
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("The capital of France is Paris.", return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)
contextualized_vector = outputs.last_hidden_state
This contextualized vector contains the last hidden state of every token in your input sentence. It’s a tensor of shape (batch_size, sequence_length, hidden_size).
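Since you want to sum the last four hidden layers, you can stack them from outputs.hidden_states and sum over the layer dimension. A minimal sketch, continuing from the snippet above (for bert-base-uncased, hidden_states is a tuple of 13 tensors: the embedding output plus one per encoder layer):

import torch

# Each element of hidden_states has shape (batch_size, sequence_length, hidden_size).
# Stack the last four layers -> (4, batch_size, sequence_length, hidden_size),
# then sum over the layer dimension.
summed_last_four = torch.stack(outputs.hidden_states[-4:], dim=0).sum(dim=0)
# summed_last_four has shape (batch_size, sequence_length, hidden_size)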
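And if you want to convince yourself that the hidden states from the two classes really match, here’s a quick sanity check (both models load the same checkpoint and are put in eval mode by from_pretrained, so this should print True):

import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("The capital of France is Paris.", return_tensors="pt")

base = BertModel.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")

with torch.no_grad():
    base_out = base(**inputs, output_hidden_states=True)
    mlm_out = mlm(**inputs, output_hidden_states=True)

# Compare every layer's hidden states between the two models.
print(all(torch.allclose(a, b)
          for a, b in zip(base_out.hidden_states, mlm_out.hidden_states)))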