- Yes you can get a word embedding for a specific word in a sentence. You have to take care though, because in language models we often use a subword tokenizer. It chops words into smaller pieces. That means that you do not necessarily get one output for every word in a sentence, but probably more than one, namely one for all its subword components. What we then typically do is average the outputs of those tokens of the right word, to get one representation for that word. I’m on mobile now, but this is a modified script that I have used in the past to get the output of a specific word.
import numpy as np
from transformers import AutoTokenizer, AutoModel
def get_word_idx(sent: str, word: str):
return sent.split(" ").index(word)
def get_hidden_states(encoded, token_ids_word, model, layers):
"""Push input IDs through model. Stack and sum `layers` (last four by default).
Select only those subword token outputs that belong to our word of interest
and average them."""
output = model(**encoded)
# Get all hidden states
states = output.hidden_states
# Stack and sum all requested layers
output = torch.stack([states[i] for i in layers]).sum(0).squeeze()
# Only select the tokens that constitute the requested word
word_tokens_output = output[token_ids_word]
def get_word_vector(sent, idx, tokenizer, model, layers):
"""Get a word vector by first tokenizing the input sentence, getting all token idxs
that make up the word of interest, and then `get_hidden_states`."""
encoded = tokenizer.encode_plus(sent, return_tensors="pt")
# get all token idxs that belong to the word of interest
token_ids_word = np.where(np.array(encoded.word_ids()) == idx)
return get_hidden_states(encoded, token_ids_word, model, layers)
# Use last four layers by default
layers = [-4, -3, -2, -1] if layers is None else layers
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)
sent = "I like cookies ."
idx = get_word_idx(sent, "cookies")
word_embedding = get_word_vector(sent, idx, tokenizer, model, layers)
if __name__ == '__main__':
- Word embeddings are always contextual. You can extract values from the embedding layer only but that seems counter intuitive and will probably not work well. The whole point of (bidirectional) is to include context.
- Not sure what you mean here. Unique in that sentence or unique in what sense?