Hi, I am trying to extract the CLS token embedding for each of my encoded inputs in a loop, but the CLS embeddings come out identical for every sentence. Can someone please tell me what I am missing here, or is this expected?
Code snippet:

```python
import torch
from transformers import AutoModelForSequenceClassification

model_chkpt = 'roberta-large'
model = AutoModelForSequenceClassification.from_pretrained(model_chkpt, output_hidden_states=True).to('cuda')

embeddings = []
with torch.no_grad():
    # ds_enc is my tokenized/encoded dataset
    for record in ds_enc.shard(index=1, num_shards=1000):
        inputs = {k: torch.unsqueeze(v, dim=1).to('cuda') for k, v in record.items()}
        outputs = model(**inputs)
        # extract the CLS token from the embeddings layer
        cls_embedding = torch.squeeze(outputs.hidden_states[0][0])
        embeddings.append(cls_embedding.detach().cpu().numpy())

# Prints True for every iteration
for emb in embeddings:
    print((embeddings[0] == emb).all())
```
From what I have read, the CLS token embedding should be different for each text input. I am not sure if my understanding is correct; if it is, then I am clearly doing something wrong that I am not able to figure out.
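For reference, this is roughly what I understand should happen, as a minimal, self-contained sketch (the two example sentences are made up, and I am assuming the last entry of hidden_states is the final layer and that the CLS / `<s>` token sits at position 0):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('roberta-large')
model = AutoModelForSequenceClassification.from_pretrained(
    'roberta-large', output_hidden_states=True)

sentences = ["The movie was great.", "I dislike rainy Mondays."]
with torch.no_grad():
    enc = tokenizer(sentences, padding=True, return_tensors='pt')
    out = model(**enc)
    # last layer of hidden_states, position 0 (<s>/CLS) for each sentence in the batch
    cls = out.hidden_states[-1][:, 0, :]

# I would expect this to print False, i.e. the two CLS vectors differ
print(torch.equal(cls[0], cls[1]))
```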
Can someone please let me know where I am going wrong?