Extracting embedding values of NLP pretrained models from tokenized strings

kadaj13 · August 17, 2021, 6:58am

I am using Hugging Face's pipeline to extract embeddings of the words in a sentence. As far as I know, the sentence is first converted into a sequence of tokens. I think the number of tokens might not be equal to the number of words in the original sentence. I need to retrieve the embedding of a particular word in a sentence.

For example, here is my code:

#https://discuss.huggingface.co/t/extracting-token-embeddings-from-pretrained-language-models/6834/6

from transformers import pipeline, AutoTokenizer, AutoModel
import numpy as np
import re

model_name = "xlnet-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))

# Pass the resized model object; model=model_name would reload weights without the added pad token
model_pipeline = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

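def find_wordNo_sentence(word, sentence):
    """Return the 0-based index of word in the space-split sentence, or None."""
    print(sentence)
    splitted_sen = sentence.split(" ")
    print(splitted_sen)

    # list.index() would raise ValueError for a missing word, so search manually
    for i, w in enumerate(splitted_sen):
        if word == w:
            return i

    print("not found")
    return None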

def return_xlnet_embedding(word, sentence):
        
    word = re.sub(r'[^\w]', " ", word)
    word = " ".join(word.split())
    
    sentence = re.sub(r'[^\w]', ' ', sentence)
    sentence = " ".join(sentence.split())
    
    id_word = find_wordNo_sentence(word, sentence)
    
   
        
    try:
        data = model_pipeline(sentence)
        
        n_words = len(sentence.split(" "))
        #print(sentence_emb.shape)
        n_embs  = len(data[0])
        print(n_embs, n_words)
        print(len(data[0]))
    
        if n_words != n_embs:
            # The tokenizer produced extra tokens (sub-word pieces or special tokens)
            pass
            
            
        results = data[0][id_word]  
        return np.array(results)
    
    except Exception:  # e.g. id_word is None or out of range
        return "word not found"

return_xlnet_embedding('your', "what is your name?")

Then the output is:

what is your name
['what', 'is', 'your', 'name']
6 4
6

So the number of token embeddings returned by the pipeline is two more than the number of words in my sentence. How can I find which of these 6 vectors is the embedding of my word?

kadaj13 · August 17, 2021, 7:07am

More specifically, when I call model_pipeline(sentence), how can I find out how the sentence was tokenized? I think some words in the sentence might be split into several tokens, so I need to know which tokens belong to which word.

lvwerra · August 17, 2021, 8:37am

Hi @kadaj13

You can check which words correspond to which tokens with the tokenizer and the word_ids function:

inputs = tokenizer('This is a loooong word')
print(f"Word IDs: {inputs.word_ids()}")
print(f"Tokens: {inputs.tokens()}")

>>> Word IDs: [0, 1, 2, 3, 3, 3, 3, 4, None, None]
>>> Tokens: ['▁This', '▁is', '▁a', '▁', 'loo', 'o', 'ong', '▁word', '<sep>', '<cls>']

You can see that the tokens ▁, loo, o, and ong all belong to the word with ID 3 (in other words, the 4th word).

This also helps you spot the positions of the special tokens, which you probably don't want to embed; they are indicated with None.
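
To go from word IDs to a single vector per word, one option is to collect the token positions that share a word ID and combine their embeddings. The sketch below is only an illustration: the helper names group_tokens_by_word and pool_word_embedding are hypothetical, mean pooling is an assumed choice (nothing in this thread prescribes how to combine sub-word vectors), and the toy 2-dimensional vectors stand in for data[0] from the feature-extraction pipeline, assuming the pipeline returns one vector per token in the same order, special tokens included.

```python
from collections import defaultdict


def group_tokens_by_word(word_ids, tokens):
    """Map each word ID to the (position, token) pairs that belong to it."""
    groups = defaultdict(list)
    for position, (word_id, token) in enumerate(zip(word_ids, tokens)):
        if word_id is None:  # special tokens such as <sep> and <cls>
            continue
        groups[word_id].append((position, token))
    return dict(groups)


def pool_word_embedding(word_ids, token_embeddings, word_index):
    """Average the vectors of all tokens whose word ID equals word_index.

    Mean pooling is an assumption for illustration; using only the first
    sub-word vector would be another reasonable choice.
    """
    rows = [vec for wid, vec in zip(word_ids, token_embeddings) if wid == word_index]
    if not rows:
        raise ValueError(f"no tokens found for word index {word_index}")
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


# Values copied from the tokenizer output above
word_ids = [0, 1, 2, 3, 3, 3, 3, 4, None, None]
tokens = ['▁This', '▁is', '▁a', '▁', 'loo', 'o', 'ong', '▁word', '<sep>', '<cls>']

# Toy vectors standing in for the per-token embeddings in data[0]
toy_embeddings = [[float(i), float(2 * i)] for i in range(len(tokens))]

groups = group_tokens_by_word(word_ids, tokens)
print(groups[3])  # the four pieces that make up the 4th word

loooong_vector = pool_word_embedding(word_ids, toy_embeddings, 3)
print(loooong_vector)
```

With real data you would pass inputs.word_ids() and the pipeline output for the same sentence instead of the toy values. Note that word_ids() is only available on the output of fast tokenizers.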

Hope this helps!

kadaj13 · August 18, 2021, 6:28am

Thank you very much
