I am interested in extracting feature embeddings from famous and recent language models such as GPT-2, XLNet, or Transformer-XL.
Is there any sample code to learn how to do that?
Thanks in advance
Hello!
You can use the feature-extraction pipeline for this.
from transformers import pipeline

# Build a feature-extraction pipeline around XLNet
pipe = pipeline('feature-extraction', model='xlnet-base-cased')
data = pipe("this is a test")
print(data)
You can also do this through the Inference API.
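For the Inference API route, a rough sketch along these lines should work (hf_xxx is a placeholder access token, and the exact response shape depends on the model's pipeline):
import requests

API_URL = "https://api-inference.huggingface.co/models/xlnet-base-cased"
headers = {"Authorization": "Bearer hf_xxx"}  # your Hugging Face access token

response = requests.post(API_URL, headers=headers, json={"inputs": "this is a test"})
data = response.json()  # nested lists of floats, one vector per token
print(data)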
Thank you very much.
However, when I run this code, I get the following error:
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (e.g. tokenizer.pad_token = tokenizer.eos_token) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).
Would you please let me know how to solve this problem?
Thank you!
You can specify the tokenizer with the tokenizer argument and do what is suggested in the error message.
Here is an example based on the documentation.
from transformers import AutoTokenizer, AutoModel, pipeline

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

# GPT-2 has no padding token, so add one and resize the embedding matrix to match
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))

pipe = pipeline('feature-extraction', model=model, tokenizer=tokenizer)
This works for me. Let me know if you have any other questions
Thank you very much. It is working.
Final question: I think the result of your code gives me the embedding of the whole sentence. Can I also get the word embeddings within a sentence (embeddings of each word in the sentence separately)?
It should already give the embedding for each token. For example, “This is a test” has 4 tokens with the gpt2 tokenizer
tokenizer("This is a test")
And if you see the output of the pipeline, you get a list with 4 lists, each containing the embedding for each individual token.
data = pipe("This is a test")
print(len(data[0]))
>>> 4
print(len(data[0][0]))
>>> 768
Thank you very much for your great help.
I understood how to get the values for each token, but there is one thing I am confused about. I ran all the code you kindly wrote here, and for the sentence “this is a test”, when I print len(data[0]), I get the value 6 instead of 4.
I have attached my code and outputs as a screenshot. Do you know what could be wrong?
When initializing the pipeline, you’re specifying model='xlnet-base-cased'. You should be specifying the model and tokenizer you defined above; the xlnet-base-cased tokenizer also appends special tokens such as <sep> and <cls>, which is why you see 6 tokens instead of 4.
pipe = pipeline('feature-extraction', model=model, tokenizer=tokenizer)
Thank you for the very clean answer, Omar!
May I ask you to elaborate on the differences between your answer and the following ones?
If I understood correctly, in practice all the questions relate to the same request: extracting word embeddings from pre-trained models. If so, what is the best practice among all the reported solutions?
Sorry if I am missing something, and thank you for your clarification.
You can also extract the embeddings from any LLM, as they are held in the penultimate layer on output.
First you would export the vocabulary, then send each vocab token through the model to get its respective embedding, keeping them all to save at the end. I personally did this for my models as well. As you will notice, they are tokenizer-related: the number of tokens in the vocabulary determines the number of embeddings you get back, and their size is tied to the width of the model.
So, if you need to use these embeddings later, separately from the model, you will also need the associated tokenizer to tokenize your document first and present those tokens to your extracted embeddings (a lookup sketch follows the extraction loop below).
As you know, these are open-source models, so a disclaimer: the extracted embeddings may not be the same as the professional embeddings offered by the model's origin (e.g. the Mistral API). They may even differ from model to model (a 4x7b or a 13b may have different embeddings), and they can change again after fine-tuning, so embeddings can also be volatile!
# Get the vocabulary tokens
vocab_tokens = tokenizer.get_vocab().keys()
# Convert vocabulary tokens to a list
vocab_tokens_list = list(vocab_tokens)
# Get the embeddings for each vocabulary token
embeddings_list = []
for token in vocab_tokens_list:
    tokens = tokenizer(token, return_tensors="pt")
    embeddings = model(**tokens).past_key_values[0][0].squeeze().tolist()
    embeddings_list.append(embeddings)
something like that …
It may take an hour or so: 32,000 tokens in a basic Mistral vocabulary, and Llama possibly even more?
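For the later lookup step mentioned above, here is a rough sketch, assuming the vocab_tokens_list and embeddings_list from the loop above and the same tokenizer (the dict name and the sample document are just illustrative):
# Hypothetical lookup table: map each vocab token string to its saved embedding
token_to_embedding = dict(zip(vocab_tokens_list, embeddings_list))

# Tokenize a new document with the same tokenizer and look up the saved vectors
doc_tokens = tokenizer.tokenize("this is a test")
doc_embeddings = [token_to_embedding[t] for t in doc_tokens]
print(len(doc_tokens), len(doc_embeddings[0]))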
The question is: which is better? More tokens or fewer? Word tokens, sentence tokens, word-gram tokens, or BPE?
If you are initializing a model from scratch, would it be prudent to begin with a custom tokenizer? That is, you could use your personalized vocabulary and corpus to train your tokenizer first, before adding it to the newly instantiated model, so that when you train the model you are using your own tokenizer (multi-lingual, for instance). Mistral uses the Llama tokenizer; why not the more (taught) BERT ones? Is there something to gain by using custom tokenizers, or should they in the end converge to the same thing, especially if you use BPE? And where are the embeddings, if we use the tokenizer only to produce tokens? But if the token embeddings have already been given some kind of boost through training, so that the output of the tokenizer is actually a meaningful embedding, then would the model essentially have two layers of embeddings?
Since embeddings are word-to-word matrices, the final layer of the model is actually the last embedding table in the model, as the embeddings take a new shape at each layer? Hence taking the last layer and not the input embeddings layer?
Not sure if I am confused or not here?
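To make that last distinction concrete, here is a minimal sketch (assuming the gpt2 model and tokenizer from earlier in the thread) contrasting the static input-embedding table with the contextual vectors from the last hidden layer:
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

# Static, context-free embedding table: one row per vocabulary token
input_embeddings = model.get_input_embeddings().weight  # shape (vocab_size, hidden_size)
print(input_embeddings.shape)

# Contextual embeddings: one vector per token of the sentence, taken from the last layer
inputs = tokenizer("This is a test", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
print(outputs.last_hidden_state.shape)  # (1, 4, 768) for gpt2
print(len(outputs.hidden_states))       # embedding output plus one entry per layer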