Hi, I am trying to extract the CLS token embedding for each of my encoded inputs in a loop, but the CLS embeddings come out identical for every sentence. Can someone please tell me what I am missing here, or is this expected?
Code snippet:

```python
import torch
from transformers import AutoModelForSequenceClassification

model_chkpt = 'roberta-large'
model = AutoModelForSequenceClassification.from_pretrained(model_chkpt, output_hidden_states=True).to('cuda')

embeddings = []
with torch.no_grad():
    # ds_enc is my tokenized/encoded dataset
    for record in ds_enc.shard(index=1, num_shards=1000):
        inputs = {k: torch.unsqueeze(v, dim=1).to('cuda') for k, v in record.items()}
        outputs = model(**inputs)
        # extract the CLS token from the embeddings layer
        cls_embedding = torch.squeeze(outputs.hidden_states[0][0])
        embeddings.append(cls_embedding.detach().cpu().numpy())

# Prints True for every iteration
for emb in embeddings:
    print((embeddings[0] == emb).all())
```
From what I have read, the CLS token embedding should be different for each text input. I am not sure if my understanding is correct; if it is, then I am clearly doing something wrong that I am not able to figure out.
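For reference, this is roughly what I understand should happen, as a minimal, self-contained sketch (the two example sentences are made up, and I am assuming the last entry of hidden_states is the final layer and that the CLS / `<s>` token sits at position 0):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('roberta-large')
model = AutoModelForSequenceClassification.from_pretrained(
    'roberta-large', output_hidden_states=True)

sentences = ["The movie was great.", "I dislike rainy Mondays."]
with torch.no_grad():
    enc = tokenizer(sentences, padding=True, return_tensors='pt')
    out = model(**enc)
    # last layer of hidden_states, position 0 (<s>/CLS) for each sentence in the batch
    cls = out.hidden_states[-1][:, 0, :]

# I would expect this to print False, i.e. the two CLS vectors differ
print(torch.equal(cls[0], cls[1]))
```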
Can someone please let me know where I am going wrong?