Hi, I’d like to know which class I should use to extract word embeddings.
Given a sentence, I just want to extract contextualized word embeddings.
However, I don’t really understand how to do it. In particular, I want to sum up the last four hidden layers of BERT. What’s the difference between hidden_states from BertForMaskedLM and hidden_states from BertModel? Thank you in advance.
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
inputs = tokenizer("The capital of France is Paris.", return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)
outputs.hidden_states
or
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("The capital of France is Paris.", return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)
outputs.hidden_states
Both will return the same thing. BertForMaskedLM just adds a language modeling head on top of BertModel; the hidden states (the output of each layer of the Transformer encoder, plus the initial embedding output) are identical. As you don’t need to perform masked language modeling, you can just use BertModel, like so:
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("The capital of France is Paris.", return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)
contextualized_vector = outputs.last_hidden_state
This contextualized vector contains the last hidden state of every token in your input sentence. It’s a tensor of shape (batch_size, sequence_length, hidden_size).
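Since you want to sum the last four hidden layers, you can stack them from outputs.hidden_states and sum over the layer dimension. A minimal sketch, continuing from the snippet above (for bert-base-uncased, hidden_states is a tuple of 13 tensors: the embedding output plus one per encoder layer):

import torch

# Each element of hidden_states has shape (batch_size, sequence_length, hidden_size).
# Stack the last four layers -> (4, batch_size, sequence_length, hidden_size),
# then sum over the layer dimension.
summed_last_four = torch.stack(outputs.hidden_states[-4:], dim=0).sum(dim=0)
# summed_last_four has shape (batch_size, sequence_length, hidden_size)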
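And if you want to convince yourself that the hidden states from the two classes really match, here’s a quick sanity check (both models load the same checkpoint and are put in eval mode by from_pretrained, so this should print True):

import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("The capital of France is Paris.", return_tensors="pt")

base = BertModel.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")

with torch.no_grad():
    base_out = base(**inputs, output_hidden_states=True)
    mlm_out = mlm(**inputs, output_hidden_states=True)

# Compare every layer's hidden states between the two models.
print(all(torch.allclose(a, b)
          for a, b in zip(base_out.hidden_states, mlm_out.hidden_states)))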