Extracting token embeddings from pretrained language models

I am interested in extracting feature embeddings from well-known recent language models such as GPT-2, XLNet, or Transformer-XL.

Is there any sample code to learn how to do that?

Thanks in advance


Hello!
You can use the feature-extraction pipeline for this.

from transformers import pipeline

# Avoid reusing the name `pipeline` for the instance, so the factory stays callable
extractor = pipeline('feature-extraction', model='xlnet-base-cased')
data = extractor("this is a test")
print(data)

You can also do this through the Inference API.
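For reference, here is a minimal sketch of what an Inference API call looks like over HTTP. The endpoint URL shape and token placeholder are assumptions based on the hosted API's usual conventions; you need your own access token and network access to actually send the request:

```python
import json

# Hosted Inference API endpoint for this model (assumed URL shape).
API_URL = "https://api-inference.huggingface.co/models/xlnet-base-cased"

# Replace <your-token> with a real Hugging Face access token.
headers = {"Authorization": "Bearer <your-token>"}

# The request body is a JSON object with an "inputs" field.
body = json.dumps({"inputs": "this is a test"})

# Uncomment to perform the actual call:
# import requests
# response = requests.post(API_URL, headers=headers, data=body)
# embeddings = response.json()  # nested list of per-token embeddings
print(body)
```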


Thank you very much.
However, when I run this code, I get the following error:

ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).

would you please let me know how to solve this problem?

Thank you!

You can specify the tokenizer with the tokenizer argument and do what is suggested in the error message.

Here is an example based on the documentation.

from transformers import AutoModel, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

# GPT-2 has no padding token by default, so add one and resize the embeddings
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))

pipe = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

This works for me. Let me know if you have any other questions :slight_smile:



Thank you very much. It is working.

Final question: I think your code gives me an embedding for the whole sentence. Can I also get the word embeddings within a sentence (an embedding for each word in the sentence separately)?

It should already give the embedding for each token. For example, "This is a test" has 4 tokens with the gpt2 tokenizer:

tokenizer("This is a test")

And if you look at the output of the pipeline, data[0] is a list of 4 lists, each containing the embedding of one token.

data = pipe("This is a test")
print(len(data[0]))
>>> 4
print(len(data[0][0]))
>>> 768

Thank you very much for your great help.

I understood how to get the values for each token, but there is one thing I am confused about. I ran all the code you kindly wrote here, and for the sentence "this is a test", when I print len(data[0]), I get 6 instead of 4.

I have attached my code and outputs as a screenshot. Do you know what could be wrong?


When initializing the pipeline, you’re specifying model='xlnet-base-cased', so the pipeline loads the xlnet tokenizer, which appends extra special tokens to the sequence. You should be specifying the model and tokenizer you defined above.

pipe = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

Thank you for the very clean answer, Omar!

May I ask you to elaborate on the differences between your answer and the following ones?

If I understood well, in practice, all the questions are related to the same request: the word embedding extraction from pre-trained models. If so, what is the best practice among all the reported solutions?

Sorry if I’m missing something, and thank you for your clarification. :slightly_smiling_face: