I am interested in extracting feature embeddings from famous and recent language models such as GPT-2, XLNet, or Transformer-XL.
Is there any sample code to learn how to do that?
Thanks in advance
Hello!
You can use the feature-extraction pipeline for this.
from transformers import pipeline

# Build a feature-extraction pipeline around XLNet
pipe = pipeline('feature-extraction', model='xlnet-base-cased')
data = pipe("this is a test")
print(data)
You can also do this through the Inference API.
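For the Inference API route, a rough sketch along these lines should work (hf_xxx is a placeholder access token, and the exact response shape depends on the model's pipeline):
import requests

API_URL = "https://api-inference.huggingface.co/models/xlnet-base-cased"
headers = {"Authorization": "Bearer hf_xxx"}  # your Hugging Face access token

response = requests.post(API_URL, headers=headers, json={"inputs": "this is a test"})
data = response.json()  # nested lists of floats, one vector per token
print(data)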
Thank you very much.
However, when I run this code, I get the following error:
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (e.g. tokenizer.pad_token = tokenizer.eos_token) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).
Would you please let me know how to solve this problem?
Thank you!
You can specify the tokenizer with the tokenizer argument and do what is suggested in the error message.
Here is an example based on the documentation.
from transformers import AutoTokenizer, AutoModel, pipeline

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

# GPT-2 has no padding token, so add one and resize the embedding matrix to match
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))

pipe = pipeline('feature-extraction', model=model, tokenizer=tokenizer)
This works for me. Let me know if you have any other questions
Thank you very much. It is working.
Final question: I think the result of your code gives me the embedding of the whole sentence. Can I also get the word embeddings within a sentence (embeddings of each word in the sentence separately)?
It should already give the embedding for each token. For example, “This is a test” has 4 tokens with the gpt2 tokenizer
tokenizer("This is a test")
And if you see the output of the pipeline, you get a list with 4 lists, each containing the embedding for each individual token.
data = pipe("This is a test")
print(len(data[0]))
>>> 4
print(len(data[0][0]))
>>> 768
Thank you very much for your great help.
I understood how to get the values for each token, but there is one thing I am confused about. I ran all the code you kindly wrote here, and for the sentence “this is a test”, when I print len(data[0]), I get the value 6 instead of 4.
I have attached my code and outputs as a screenshot. Do you know what could be wrong?
When initializing the pipeline, you’re specifying model='xlnet-base-cased'. You should be specifying the model and tokenizer you defined above; the xlnet-base-cased tokenizer also appends special tokens such as <sep> and <cls>, which is why you see 6 tokens instead of 4.
pipe = pipeline('feature-extraction', model=model, tokenizer=tokenizer)
Thank you for the very clean answer, Omar!
May I ask you to elaborate on the differences between your answer and the following ones?
If I understood correctly, in practice all the questions relate to the same request: extracting word embeddings from pre-trained models. If so, what is the best practice among all the reported solutions?
Sorry if I am missing something, and thank you for your clarification.
You can also extract the embeddings from any LLM, as they are held in the penultimate layer on output.
First you would export the vocabulary, then send each vocab token through the model to get its respective embedding, keeping them all to save at the end. I personally did this for my models as well. As you will notice, they are tokenizer-related: the number of tokens in the vocabulary determines the number of embeddings you get back, and their size is tied to the width of the model.
So, if you need to use these embeddings later, separately from the model, you will also need the associated tokenizer to tokenize your document first and present those tokens to your extracted embeddings (a lookup sketch follows the extraction loop below).
As you know, these are open-source models, so a disclaimer: the extracted embeddings may not be the same as the professional embeddings offered by the model's origin (e.g. the Mistral API). They may even differ from model to model (a 4x7b or a 13b may have different embeddings), and they can change again after fine-tuning, so embeddings can also be volatile!
# Get the vocabulary tokens
vocab_tokens = tokenizer.get_vocab().keys()
# Convert vocabulary tokens to a list
vocab_tokens_list = list(vocab_tokens)
# Get the embeddings for each vocabulary token
embeddings_list = []
for token in vocab_tokens_list:
    tokens = tokenizer(token, return_tensors="pt")
    embeddings = model(**tokens).past_key_values[0][0].squeeze().tolist()
    embeddings_list.append(embeddings)
something like that …
It may take an hour or so: 32,000 tokens in a basic Mistral vocabulary, and Llama possibly even more?
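For the later lookup step mentioned above, here is a rough sketch, assuming the vocab_tokens_list and embeddings_list from the loop above and the same tokenizer (the dict name and the sample document are just illustrative):
# Hypothetical lookup table: map each vocab token string to its saved embedding
token_to_embedding = dict(zip(vocab_tokens_list, embeddings_list))

# Tokenize a new document with the same tokenizer and look up the saved vectors
doc_tokens = tokenizer.tokenize("this is a test")
doc_embeddings = [token_to_embedding[t] for t in doc_tokens]
print(len(doc_tokens), len(doc_embeddings[0]))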
The question is: which is better? More tokens or fewer? Word tokens, sentence tokens, word-gram tokens, or BPE?
If you are initializing a model from scratch, would it be prudent to begin with a custom tokenizer? That is, you could use your personalized vocabulary and corpus to train your tokenizer first, before adding it to the newly instantiated model, so that when you train the model you are using your own tokenizer (multi-lingual, for instance). Mistral uses the Llama tokenizer; why not the more (taught) BERT ones? Is there something to gain by using custom tokenizers, or should they in the end converge to the same thing, especially if you use BPE? And where are the embeddings, if we use the tokenizer only to produce tokens? But if the token embeddings have already been given some kind of boost through training, so that the output of the tokenizer is actually a meaningful embedding, then would the model essentially have two layers of embeddings?
Since embeddings are word-to-word matrices, the final layer of the model is actually the last embedding table in the model, as the embeddings take a new shape at each layer? Hence taking the last layer and not the input embeddings layer?
Not sure if I am confused or not here?
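To make that last distinction concrete, here is a minimal sketch (assuming the gpt2 model and tokenizer from earlier in the thread) contrasting the static input-embedding table with the contextual vectors from the last hidden layer:
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

# Static, context-free embedding table: one row per vocabulary token
input_embeddings = model.get_input_embeddings().weight  # shape (vocab_size, hidden_size)
print(input_embeddings.shape)

# Contextual embeddings: one vector per token of the sentence, taken from the last layer
inputs = tokenizer("This is a test", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
print(outputs.last_hidden_state.shape)  # (1, 4, 768) for gpt2
print(len(outputs.hidden_states))       # embedding output plus one entry per layer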