How to get the embedding matrix of BERT in Hugging Face

I am trying to build sentence pooling with the BERT model provided by Hugging Face:

import torch
from transformers import BertModel, BertTokenizer
model_name = 'bert-base-uncased'

tokenizer = BertTokenizer.from_pretrained(model_name)
# load
model = BertModel.from_pretrained(model_name)
input_text = "Here is some text to encode"
# tokenizer-> token_id
input_ids = tokenizer.encode(input_text, add_special_tokens=True)
# input_ids: [101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 102]
input_ids = torch.tensor([input_ids])

with torch.no_grad():
    last_hidden_states = model(input_ids)[0]  # first output: last hidden states, shape [1, 9, 768]
last_hidden_states = last_hidden_states.mean(1)  # mean over the token dimension
# size of last_hidden_states is now [1, 768]

Now I want to know which word in the dictionary this vector refers to.
How can I get the embedding matrix, whose size is [sequence_length, embedding_length], and then compute last_hidden_states @ matrix to find the word this vector refers to in the dictionary?
Please help me.


The last_hidden_states returned by the model is a tensor of shape (batch_size, sequence_length, hidden_size). In your example, the text “Here is some text to encode” gets tokenized into 9 tokens (the input_ids): 7 tokens from the text itself plus 2 special tokens, namely [CLS] at the start and [SEP] at the end. So the sequence length is 9. The batch size is 1, as we only forward a single sentence through the model. And the hidden_size of a BERT-base-sized model is 768. Hence, before you take the mean, the last hidden states have shape (1, 9, 768). You can then get the last hidden state vector of each token: for the first token, use last_hidden_states[:,0,:]; for the second token, use last_hidden_states[:,1,:]; and so on.
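
To make the indexing concrete, here is a minimal sketch that uses a random stand-in tensor with the shape described above instead of a real model output. The values are meaningless; only the shapes matter, and the variable names other than last_hidden_states are illustrative.

```python
import torch

# Stand-in for the model output: random values with the shape discussed above
# (batch_size=1, sequence_length=9, hidden_size=768). Not real BERT activations.
last_hidden_states = torch.randn(1, 9, 768)

first_token_vector = last_hidden_states[:, 0, :]   # position 0, i.e. [CLS]
second_token_vector = last_hidden_states[:, 1, :]  # position 1, first real token
sentence_vector = last_hidden_states.mean(dim=1)   # average over the 9 positions

print(first_token_vector.shape)   # torch.Size([1, 768])
print(second_token_vector.shape)  # torch.Size([1, 768])
print(sentence_vector.shape)      # torch.Size([1, 768])
```

Note that taking the mean collapses the sequence dimension, so the per-token vectors can no longer be recovered from the [1, 768] result.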

Also, the code example you refer to seems a bit outdated. Where did you get it from? We’ll update it.

Thank you so much for your help!
I am a student from China and I got this code from a Chinese coding website. You don’t need to update it :)
But I still have a question: I want to get the word that my last_hidden_state refers to. There are 7 words in the input sentence, and I take the mean of their vectors, so the size is [1,768]. I want to “decode” it to the word it refers to in the dictionary.
Usually in BERT, we first convert words to one-hot codes using the provided dictionary, then embed them and feed the embedding sequence into the encoder. I want to “de-embed” the tensor that comes out of BERT, that is, multiply this tensor by the transpose of the embedding matrix. But how can I get the transpose of that matrix?
My second question is about the documentation: it does not provide enough example code to let us understand the structure of the model (maybe I am just too inexperienced).

Actually, that’s not directly possible. The closest you can get is to compute the cosine similarity between the mean of the last hidden state and the embedding vector of each token in BERT’s vocabulary, and look at the most similar tokens. You can do that easily using sklearn.

The embedding matrix of BERT can be obtained as follows:

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
embedding_matrix = model.embeddings.word_embeddings.weight

However, I’m not sure it is useful to compare the vector of an entire sentence with each of the rows of the embedding matrix, as the sentence vector is a “summary” of the entire sentence.
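
The cosine-similarity comparison described above could be sketched as follows. This is only an illustration: it uses torch instead of sklearn, the helper name nearest_token_ids is hypothetical, and the toy matrix stands in for the real embedding matrix so the snippet runs without downloading a model.

```python
import torch
import torch.nn.functional as F

def nearest_token_ids(query, embedding_matrix, k=5):
    """Hypothetical helper: return (ids, similarities) of the k rows of
    embedding_matrix with the highest cosine similarity to query.

    query: tensor of shape (hidden_size,)
    embedding_matrix: tensor of shape (vocab_size, hidden_size)
    """
    # Broadcast the query against every row and compare along the hidden dimension.
    sims = F.cosine_similarity(query.unsqueeze(0), embedding_matrix, dim=1)
    top = torch.topk(sims, k)
    return top.indices.tolist(), top.values.tolist()

# Toy stand-in data (an assumption for illustration): 10 "tokens", 8 dimensions.
torch.manual_seed(0)
toy_embeddings = torch.randn(10, 8)
toy_query = toy_embeddings[3] * 2.0  # same direction as row 3, different length

ids, scores = nearest_token_ids(toy_query, toy_embeddings, k=3)
print(ids, scores)  # row 3 ranks first because cosine similarity ignores length
```

With the real model, you would pass last_hidden_states.mean(1)[0] as the query and model.embeddings.word_embeddings.weight as the matrix (inside torch.no_grad()), then map the ids back to strings with tokenizer.convert_ids_to_tokens(ids). As noted above, the result may not be meaningful, because the contextual sentence vector and the input token embeddings are not guaranteed to be comparable.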


If I modify this embedding matrix, how can I forward it to the BERT encoder layers?

Do these embeddings include the position and segment embeddings? I mean, are they obtained by summing the token embeddings, segment embeddings, and positional embeddings?
Also, these are the embeddings before they enter the encoder layers. Am I right?
Thanks in advance.


These only include the token embeddings. The position embeddings and token type (segment) embeddings are contained in separate matrices.

And yes, the token, position and token type embeddings all get summed before being fed to the Transformer encoder.
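
As a rough sketch of the summation described above, the snippet below builds a tiny, randomly initialised BERT (so nothing needs to be downloaded) and reproduces the encoder input by hand. The config sizes and the input ids are arbitrary assumptions; for bert-base-uncased the three matrices have shapes (30522, 768), (512, 768) and (2, 768). It assumes a recent transformers version in which the model returns an output object with a hidden_states attribute. BERT also applies LayerNorm (and dropout during training) after the sum.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny random model for illustration only; these sizes are assumptions,
# not the values used by bert-base-uncased.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=4, intermediate_size=64,
                    max_position_embeddings=16)
model = BertModel(config)
model.eval()  # disables dropout so the comparison below is deterministic

emb = model.embeddings
print(emb.word_embeddings.weight.shape)        # (vocab_size, hidden_size)
print(emb.position_embeddings.weight.shape)    # (max_position_embeddings, hidden_size)
print(emb.token_type_embeddings.weight.shape)  # (type_vocab_size, hidden_size)

input_ids = torch.tensor([[1, 5, 7, 9, 2]])  # arbitrary ids within the toy vocabulary
position_ids = torch.arange(input_ids.shape[1]).unsqueeze(0)
token_type_ids = torch.zeros_like(input_ids)  # single segment

with torch.no_grad():
    # Sum the three embeddings, then apply the embedding LayerNorm.
    summed = (emb.word_embeddings(input_ids)
              + emb.token_type_embeddings(token_type_ids)
              + emb.position_embeddings(position_ids))
    manual = emb.LayerNorm(summed)
    # hidden_states[0] is the output of the embedding layer, i.e. the encoder input.
    outputs = model(input_ids, output_hidden_states=True)
    encoder_input = outputs.hidden_states[0]

print(torch.allclose(manual, encoder_input, atol=1e-5))
```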

Thank you so much.