Identical CLS token embeddings for all different sentences?

Hi, I am trying to extract the CLS token embedding for each of my encoded inputs in a loop, but the CLS embeddings for all of my sentences come out identical. Can someone please tell me what I am missing here, or is this expected?

Code snippet

import torch
from transformers import AutoModelForSequenceClassification

model_chkpt = 'roberta-large'
model = AutoModelForSequenceClassification.from_pretrained(model_chkpt, output_hidden_states=True).to('cuda')

embeddings = []
with torch.no_grad():
    for _, record in enumerate(ds_enc.shard(index=1, num_shards=1000)):
        inputs = {k: torch.unsqueeze(v, dim=1).to('cuda') for k, v in record.items()}
        outputs = model(**inputs)
        # extract the CLS token from the embeddings layer
        cls_embedding = torch.squeeze(outputs.hidden_states[0][0])
        embeddings.append(cls_embedding.detach().cpu().numpy())

# Prints True for every iteration
for emb in embeddings:
    print((embeddings[0] == emb).all())

From what I have read, the CLS token embedding should be different for each text input. I am not sure whether my understanding is correct; if it is, then I am doing something wrong that I cannot figure out.

Can someone please let me know where I am going wrong?
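
For reference, this is roughly how I understood per-sentence CLS embeddings are supposed to be extracted (just a sketch of my understanding, not something I am sure is correct; it assumes my ds_enc dataset yields tensors with input_ids and attention_mask columns):

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    'roberta-large', output_hidden_states=True).to('cuda')
model.eval()

cls_embeddings = []
with torch.no_grad():
    for record in ds_enc:
        # add a batch dimension of size 1: shape (1, seq_len)
        inputs = {k: torch.unsqueeze(v, dim=0).to('cuda')
                  for k, v in record.items()
                  if k in ('input_ids', 'attention_mask')}
        outputs = model(**inputs)
        # hidden_states[-1] is the last encoder layer: shape (1, seq_len, hidden)
        # position 0 in the sequence is the <s>/CLS token
        cls_embeddings.append(outputs.hidden_states[-1][0, 0].cpu().numpy())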


I’m having the same issue. Did you figure it out? Thanks!