Generate raw word embeddings using transformer models like BERT for downstream processing

I am new to transformer-based models and have a few basic questions. Hopefully someone can shed some light on them.

I’ve been training GloVe and word2vec on my corpus to generate word embeddings, where each unique word has a single vector to use in a downstream process. Now, my questions are:

  1. Can we generate a similar embedding using the BERT model on the same corpus?
  2. Can we have one vector per unique word? BERT is contextual, so I am not sure what the vector will look like for the same word when it is repeated in different sentences.
  3. If a word is repeated and its vector is not unique, I am not sure how I can use these vectors in the downstream process.

Appreciate your valuable inputs. I tried to look over the internet but was not able to find a clear answer. If someone can help with the above it will be really helpful.


  1. Yes, you can get a word embedding for a specific word in a sentence. You have to take care, though, because language models usually use a subword tokenizer, which chops words into smaller pieces. That means you do not necessarily get one output per word in a sentence, but possibly several: one for each of its subword components. What we typically do then is average the outputs of the tokens that belong to the word, to get one representation for that word. I’m on mobile now, but this is a modified script that I have used in the past to get the output of a specific word.
 import numpy as np
 import torch
 from transformers import AutoTokenizer, AutoModel


 def get_word_idx(sent: str, word: str):
     """Return the position of `word` in the whitespace-split sentence."""
     return sent.split(" ").index(word)


 def get_hidden_states(encoded, token_ids_word, model, layers):
     """Push input IDs through model. Stack and sum `layers` (last four by default).
        Select only those subword token outputs that belong to our word of interest
        and average them."""
     with torch.no_grad():
         output = model(**encoded)
     # Get all hidden states (one tensor per layer, plus the embedding layer)
     states = output.hidden_states
     # Stack and sum all requested layers, then drop the batch dimension
     output = torch.stack([states[i] for i in layers]).sum(0).squeeze()
     # Only select the tokens that constitute the requested word
     word_tokens_output = output[token_ids_word]
     return word_tokens_output.mean(dim=0)


 def get_word_vector(sent, idx, tokenizer, model, layers):
     """Get a word vector by first tokenizing the input sentence, getting all token idxs
        that make up the word of interest, and then `get_hidden_states`."""
     encoded = tokenizer(sent, return_tensors="pt")
     # Get all token idxs that belong to the word of interest
     # (word_ids() requires a fast tokenizer)
     token_ids_word = np.where(np.array(encoded.word_ids()) == idx)
     return get_hidden_states(encoded, token_ids_word, model, layers)


 def main(layers=None):
     # Use last four layers by default
     layers = [-4, -3, -2, -1] if layers is None else layers
     tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
     model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)
     sent = "I like cookies ."
     idx = get_word_idx(sent, "cookies")

     word_embedding = get_word_vector(sent, idx, tokenizer, model, layers)
     return word_embedding


 if __name__ == '__main__':
     main()
  2. Word embeddings from BERT are always contextual. You can extract values from the embedding layer only, but that seems counterintuitive and will probably not work well. The whole point of (bidirectional) language models is to include context.
  3. Not sure what you mean here. Unique in that sentence, or unique in what sense?
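
If a single GloVe-style vector per word is needed, one common workaround is to collect the contextual vector of every occurrence of a word across the corpus and average them. This is not part of the script above, only a minimal sketch: the helper name `average_by_word` and the toy 3-dimensional vectors are made up for illustration, and in practice each vector would come from something like `get_word_vector`.

```python
from collections import defaultdict

import numpy as np


def average_by_word(occurrences):
    """Collapse (word, contextual_vector) pairs into one mean vector per word."""
    buckets = defaultdict(list)
    for word, vector in occurrences:
        buckets[word].append(np.asarray(vector, dtype=np.float32))
    # Mean over all occurrences of the same word
    return {word: np.mean(vectors, axis=0) for word, vectors in buckets.items()}


# Toy vectors standing in for BERT outputs (hypothetical values)
occurrences = [
    ("bank", [1.0, 0.0, 2.0]),      # e.g. from "river bank"
    ("bank", [3.0, 2.0, 0.0]),      # e.g. from "bank account"
    ("cookies", [0.5, 0.5, 0.5]),
]
static_vectors = average_by_word(occurrences)
```

Averaging blurs the different senses of a word into one vector, so whether this is good enough depends on the downstream task.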

Thanks! This is really helpful.

By unique word, I was trying to relate to how GloVe has one vector per word. The reason I mention this is that I wanted to combine GloVe and Transformer embeddings for a word in the sentence in the initial layer, similar to this paper, and was not sure how to do this. I guess I would somehow use the transformer dynamically to get the word embedding and combine it with GloVe on the fly.

Hi, Bram. Is it possible to generate these contextual embeddings for a span of several tokens instead of for a single token? For example, in “three days ago I ate meat”, I would like to get contextual embeddings for “three days ago”. If possible, how should I do it?

Thank you so much.

Basically the same as above for each word in the span, and then sum/average those word vectors so you get one representation for the whole thing. Depending on your downstream task, this may not work very well, though.
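
A minimal sketch of that span averaging, assuming the per-token hidden states have already been extracted as a 2-D array and that `word_ids` comes from a fast tokenizer. The function name `span_vector`, the subword split of "days", and the toy numbers are hypothetical.

```python
import numpy as np


def span_vector(token_vectors, word_ids, first_word, last_word):
    """Average the word vectors of words first_word..last_word (inclusive).

    token_vectors: array of shape (num_tokens, hidden_size)
    word_ids: one entry per token, None for special tokens
    """
    word_vectors = []
    for w in range(first_word, last_word + 1):
        rows = [i for i, wid in enumerate(word_ids) if wid == w]
        if not rows:
            raise ValueError(f"no tokens found for word index {w}")
        # First average the subword tokens of this word ...
        word_vectors.append(token_vectors[rows].mean(axis=0))
    # ... then average the words of the span
    return np.mean(word_vectors, axis=0)


# "three days ago I ate meat", pretending "days" was split into two subwords
word_ids = [None, 0, 1, 1, 2, 3, 4, 5, None]
token_vectors = np.arange(18, dtype=np.float32).reshape(9, 2)  # toy hidden states
span = span_vector(token_vectors, word_ids, 0, 2)  # "three days ago"
```

Averaging per word first keeps a word with many subword pieces from dominating the span; averaging all tokens of the span directly is a simpler alternative.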

Hi Bram,
I am trying to use these BERT embedding outputs (from vblagoje’s fine-tuned model) for downstream POS tagging, using them together with their tags as input to a logistic regression layer. Would you have any idea why the performance is poor? It performs no better than random.

Thanks in advance!

Please create a new post. That’s a completely separate subject from this topic.

Ah, sorry. My intention in posting here was to ask whether the contextualised word embeddings generated above are actually appropriate to use for this, but I will create a separate post.

Hi Bram Vanroy, I was trying the above-mentioned code with BertTokenizer instead of AutoTokenizer, but I get the error below.

ValueError: word_ids() is not available when using Python-based tokenizers

Can you please let me know what changes need to be made in the code to get a list indicating the word corresponding to each token, where special tokens added by the tokenizer are mapped to None and other tokens are mapped to the index of their corresponding word?

You need to use the fast version of the tokenizer (BertTokenizerFast) if you want to use that functionality.
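
If switching to the fast tokenizer is not an option, a comparable list can be rebuilt by hand by tokenizing the sentence one word at a time. The sketch below is only an approximation of `word_ids()`: `build_word_ids` and `toy_tokenize` are hypothetical names, the `tokenize` argument stands in for a slow tokenizer's `tokenize` method, and it assumes a single sentence wrapped in one `[CLS]` and one `[SEP]` token. Words are defined here by splitting on spaces, whereas the fast tokenizer also splits off punctuation, so the indices can differ for text like "cookies.".

```python
def build_word_ids(words, tokenize):
    """Return one word index per subword token, with None for [CLS] and [SEP]."""
    word_ids = [None]  # [CLS]
    for idx, word in enumerate(words):
        # Every subword piece of this word points back to the same word index
        word_ids.extend([idx] * len(tokenize(word)))
    word_ids.append(None)  # [SEP]
    return word_ids


def toy_tokenize(word):
    """Hypothetical stand-in for a subword tokenizer: splits off a plural 's'."""
    if len(word) > 3 and word.endswith("s"):
        return [word[:-1], "##s"]
    return [word]


words = "I like cookies .".split(" ")
word_ids = build_word_ids(words, toy_tokenize)
```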
