Following up on my previous post, I am trying to extract NLP features from well-known models such as BERT or T5.
As you may know, these models consist of many layers. Here is my code:
from transformers import pipeline, AutoTokenizer, AutoModel
import numpy as np
import re
# In[7]:
# --- Model setup ---
# Pretrained checkpoint to load.
model_name = "xlnet-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# XLNet's tokenizer ships without a pad token; add one and resize the
# model's embedding matrix so the new token id has an embedding row.
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))

# BUG FIX: pass the resized `model` object, not the `model_name` string.
# Passing the string made the pipeline reload a fresh checkpoint from the
# hub, discarding the resize_token_embeddings() call above and leaving the
# model out of sync with the padded tokenizer.
model_pipeline = pipeline('feature-extraction', model=model, tokenizer=tokenizer)
# In[8]:
def find_wordNo_sentence(word, sentence):
    """Return the 0-based position of `word` in the whitespace-split `sentence`.

    Prints the sentence and its split form (debug output kept from the
    original), and prints "not found" and returns None when the word is
    not present.
    """
    print(sentence)
    splitted_sen = sentence.split(" ")
    print(splitted_sen)
    # BUG FIX: the original called splitted_sen.index(word) before this loop,
    # which raised ValueError for a missing word and made the "not found"
    # branch below unreachable. The loop alone already finds the first match.
    for i, w in enumerate(splitted_sen):
        if word == w:
            return i
    print("not found")  # 0 base
    return None
# In[13]:
def return_xlnet_embedding(word, sentence):
    """Return the XLNet embedding of `word` within `sentence`.

    The word may be split into several sub-word tokens by the tokenizer;
    the returned vector is the mean of the embeddings at those token
    positions. Returns a 1-D numpy array of the model's hidden size, or
    None (after printing "word is wrong") when the word cannot be located.
    """
    # Normalize both strings the same way (strip punctuation, collapse
    # whitespace) so the whitespace-split index of the word lines up with
    # the tokenizer's word_ids() numbering.
    word = re.sub(r'[^\w]', " ", word)
    word = " ".join(word.split())
    sentence = re.sub(r'[^\w]', ' ', sentence)
    sentence = " ".join(sentence.split())

    id_word = find_wordNo_sentence(word, sentence)

    try:
        # data has shape [1][n_tokens][hidden_size] (last hidden layer).
        data = model_pipeline(sentence)

        inputs = tokenizer(sentence)
        # Token positions whose word id matches the target word's position.
        list_of_word_ids = [i for i, j in enumerate(inputs.word_ids()) if j == id_word]
        if not list_of_word_ids:
            # Guard: the original divided by len(list_of_word_ids) and would
            # have raised ZeroDivisionError here (swallowed by the bare except).
            print("word is wrong")
            return None

        # BUG FIX: the original indexed data[0][i] with the *loop counter*,
        # averaging the first k token embeddings of the sentence instead of
        # the embeddings at the word's actual token positions.
        results = np.zeros(len(data[0][0]))
        for token_pos in list_of_word_ids:
            results += np.array(data[0][token_pos])
        results /= len(list_of_word_ids)
        return results
    except Exception:
        # Narrowed from a bare `except:` (which also caught KeyboardInterrupt
        # and SystemExit); keep the original best-effort behavior of printing
        # a message and implicitly returning None.
        print("word is wrong")
        return None
How can I specify which layer to extract the features from?