Extracting embedding values of pretrained NLP models from tokenized strings

I am using Hugging Face's pipeline to extract the embeddings of the words in a sentence. As far as I know, the sentence is first turned into a sequence of tokens, and I think the number of tokens might not be equal to the number of words in the original sentence. I need to retrieve the embedding of a particular word in a sentence.

For example, here is my code:

#https://discuss.huggingface.co/t/extracting-token-embeddings-from-pretrained-language-models/6834/6

from transformers import pipeline, AutoTokenizer, AutoModel
import numpy as np
import re

model_name = "xlnet-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))

# Pass the resized model object (not just its name) so the pipeline uses the enlarged embedding matrix
model_pipeline = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

def find_wordNo_sentence(word, sentence):
    """Return the 0-based position of word in the space-split sentence, or None."""
    print(sentence)
    splitted_sen = sentence.split(" ")
    print(splitted_sen)

    for i, w in enumerate(splitted_sen):
        if word == w:
            return i

    print("not found")
    return None

def return_xlnet_embedding(word, sentence):
    # Replace punctuation with spaces and collapse repeated whitespace
    word = re.sub(r'[^\w]', " ", word)
    word = " ".join(word.split())

    sentence = re.sub(r'[^\w]', ' ', sentence)
    sentence = " ".join(sentence.split())

    id_word = find_wordNo_sentence(word, sentence)

    try:
        data = model_pipeline(sentence)

        n_words = len(sentence.split(" "))
        n_embs = len(data[0])  # one vector per token, special tokens included
        print(n_embs, n_words)
        print(len(data[0]))

        # n_embs can exceed n_words (sub-word pieces, special tokens), so
        # id_word is a word position and not necessarily the right token position
        results = data[0][id_word]
        return np.array(results)

    except (IndexError, TypeError):
        # id_word is None or out of range
        return "word not found"

return_xlnet_embedding('your', "what is your name?")

Then the output is:

what is your name
['what', 'is', 'your', 'name']
6 4
6

So the tokenized sequence that is fed to the pipeline is two items longer than the number of words in my sentence. How can I find which of these 6 vectors is the embedding of my word?

More specifically, when I call model_pipeline(sentence), how can I find out how the sentence was tokenized? I think some words in the sentence might be split into several tokens, so I need to know which tokens belong to which word.

Hi @kadaj13

You can check which token corresponds to which word with the tokenizer and its word_ids() function:

inputs = tokenizer('This is a loooong word')
print(f"Word IDs: {inputs.word_ids()}")
print(f"Tokens: {inputs.tokens()}")

>>> Word IDs: [0, 1, 2, 3, 3, 3, 3, 4, None, None]
>>> Tokens: ['▁This', '▁is', '▁a', '▁', 'loo', 'o', 'ong', '▁word', '<sep>', '<cls>']

You can see that the tokens ▁, loo, o, and ong all belong to the word with ID 3 (in other words, the 4th word).
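To go from a word position to the token positions whose embeddings you need, you can invert that list. This is only a minimal sketch: the helper name token_positions is made up for illustration, and the word IDs are copied from the output above.

```python
def token_positions(word_ids, word_index):
    """Return the indices of all tokens that belong to the given word."""
    return [pos for pos, wid in enumerate(word_ids) if wid == word_index]

# Word IDs copied from the output above
word_ids = [0, 1, 2, 3, 3, 3, 3, 4, None, None]
print(token_positions(word_ids, 3))  # [3, 4, 5, 6]
print(token_positions(word_ids, 4))  # [7]
```

Those positions should index into the per-token output of the pipeline (data[0] in your code), assuming it was produced with the same tokenizer and the same special tokens. Note that word_ids() is only available with a fast tokenizer.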

This also helps you spot the positions of the special tokens (here <sep> and <cls>), which you probably don't want to embed. They are the entries whose word ID is None.
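If you want a single vector per word, one common option (my suggestion, not something the pipeline does for you) is to average the vectors of the tokens that share a word ID and skip the None entries. A rough sketch with made-up 2-dimensional vectors standing in for the real embeddings; pool_word_embeddings is a hypothetical helper name, and it works on plain nested lists like the ones the feature-extraction pipeline returns:

```python
from collections import defaultdict

def pool_word_embeddings(word_ids, token_embeddings):
    """Average the token vectors of each word; skip special tokens (None)."""
    if len(word_ids) != len(token_embeddings):
        raise ValueError("expected exactly one embedding per token")
    groups = defaultdict(list)
    for word_id, vector in zip(word_ids, token_embeddings):
        if word_id is not None:
            groups[word_id].append(vector)
    # Column-wise mean over the vectors collected for each word
    return {
        word_id: [sum(column) / len(vectors) for column in zip(*vectors)]
        for word_id, vectors in groups.items()
    }

# Word IDs copied from the example above; the vectors are dummy data
word_ids = [0, 1, 2, 3, 3, 3, 3, 4, None, None]
token_embeddings = [[float(i), 1.0] for i in range(len(word_ids))]

word_vectors = pool_word_embeddings(word_ids, token_embeddings)
print(word_vectors[3])  # mean of token rows 3..6 -> [4.5, 1.0]
```

With real data you would pass tokenizer(sentence).word_ids() and data[0] from the pipeline instead of the dummy values, assuming both come from the same tokenizer so the lengths match. Averaging is only one choice; taking the first sub-token of each word is another.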

Hope this helps!


Thank you very much