Extracting embeddings with DistilBERT? (in TensorFlow)

Hello,

I am trying to understand the Transformer architecture better, and in particular how to extract the contextual embeddings for a given sentence.

I know I can use the feature-extraction pipeline, but I would like to extract them manually. Consider the small example below. Unfortunately, the last hidden states cannot be the contextual embeddings: I get a 2-dimensional vector, whereas the embeddings should have hundreds of dimensions.

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = TFAutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

input_ids = tf.constant(tokenizer.encode("Hello I am a dog."))[None, :]
outputs = model(input_ids)
last_hidden_states = outputs[0]

last_hidden_states.numpy()
Out[22]: array([[-1.651872 ,  1.6822953]], dtype=float32)

What is the issue here? Thanks!

You should use the model without a head for this, given by the class TFAutoModel. Here you are using a model for sequence classification.
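For instance, reusing the input_ids from your snippet, a quick sketch of the difference (the shapes are what I'd expect for this checkpoint, with a 768-dimensional hidden size):

from transformers import TFAutoModel, TFAutoModelForSequenceClassification

# Model with a classification head: the first output is a pair of class logits.
clf_model = TFAutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased-finetuned-sst-2-english')

# Bare model without a head: the first output is the per-token hidden states.
base_model = TFAutoModel.from_pretrained(
    'distilbert-base-uncased-finetuned-sst-2-english')

print(clf_model(input_ids)[0].shape)   # (1, 2) -> two class logits, not embeddings
print(base_model(input_ids)[0].shape)  # (1, 8, 768) -> one 768-dim vector per token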

Ha! Thanks @sgugger. The issue then is that I get a different embedding when I use the pipeline… see below.

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = TFAutoModel.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')


input_ids = tf.constant(tokenizer.encode("Hello I am a dog."))[None, :]
outputs = model(input_ids)
last_hidden_states = outputs[0]

last_hidden_states.numpy()

Out[50]: 
array([[[ 0.01984931,  0.44084704,  0.8077417 , ...,  0.21650855,
          0.85567355,  0.5611263 ],
        [-0.03800384,  0.68151647,  0.64441675, ...,  0.14872764,
          0.95094967,  0.39576268],
        [ 0.2189587 ,  0.5272005 ,  0.59539056, ...,  0.02541553,
          0.72854155,  0.70450056],
        ...,
        [ 0.46494445,  0.5847002 ,  0.71188813, ..., -0.09583103,
          0.91368306,  0.67914045],
        [-0.14020433,  0.03367072,  0.83420205, ...,  0.45702922,
          0.94741714, -0.24155208],
        [ 0.55558103,  0.3518911 ,  0.16493745, ...,  0.73744303,
          0.1932145 , -0.14626735]]], dtype=float32)

Whereas using the pipeline returns a different embedding:

mypipe = pipeline('feature-extraction', 'distilbert-base-uncased-finetuned-sst-2-english')
mypipe("Hello I am a dog.")[0][0]

...
 -0.42439621686935425,
 0.44787469506263733,
 -0.559832751750946,
 -1.4749068021774292,
 -1.0559613704681396,
 -0.29952752590179443,
 -0.6128035187721252,
 -0.016618162393569946,
 -0.6082541942596436,
 0.31112414598464966,
 -0.3840394616127014,
 -0.2776432931423187,
 0.21650901436805725,
 0.855672299861908,
 0.5611256957054138]

Do you see what is happening here?

mypipe("Hello I am a dog.")[0][0] prints the feature of the first token. I can see the end matches perfectly the end of the first row of the matrix. You can look at the other elements of mypipe("Hello I am a dog.")[0] to check the other features.

Got it, thanks for the clarification @sgugger. This is very useful.
The hidden state has shape:

last_hidden_states.numpy().shape
Out[53]: (1, 8, 768)

Which means the very first embedding (at position [0][0]) corresponds to the [CLS] token, right? The one people usually use as a vector representation of the whole sentence?

Thanks!

Yes, that must be the case:

tokenizer.decode(tokenizer.encode("Hello I am a dog."))
Out[62]: '[CLS] hello i am a dog. [SEP]'
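So, as a small sketch, pulling out that [CLS] vector from the last_hidden_states above would just be:

# The [CLS] token sits at position 0, so its contextual embedding is the first row
# of the last hidden state.
cls_embedding = last_hidden_states.numpy()[0, 0]
print(cls_embedding.shape)  # (768,)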