Mistral model generates the same embeddings for different input texts

Hi, I am using a pre-trained LLM to get a representative embedding for an input text, but the results are weird: the embeddings are identical regardless of the input text.

Code:

from transformers import AutoTokenizer, AutoModel
import numpy as np
import torch

PRETRAIN_MODEL = 'mistralai/Mistral-7B-Instruct-v0.2'
tokenizer = AutoTokenizer.from_pretrained(PRETRAIN_MODEL)
model = AutoModel.from_pretrained(PRETRAIN_MODEL)

def generate_embedding(document):
    inputs = tokenizer(document, return_tensors='pt')
    print("Tokenized inputs:", inputs)
    with torch.no_grad():
        outputs = model(**inputs)
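    # Take the hidden state of the first token (the BOS token) as the embedding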
    embedding = outputs.last_hidden_state[0, 0, :].numpy()
    print("Generated embedding:", embedding)
    return embedding

text1 = "this is a test"
text2 = "this is another test"
text3 = "there are other tests"

embedding1 = generate_embedding(text1)
embedding2 = generate_embedding(text2)
embedding3 = generate_embedding(text3)

are_equal = np.array_equal(embedding1, embedding2) and np.array_equal(embedding2, embedding3)
if are_equal:
    print("The embeddings are the same.")
else:
    print("The embeddings are not the same.")

The tokenized inputs are different, but the generated embeddings are identical.

Detailed output:

Tokenized inputs: {'input_ids': tensor([[   1,  456,  349,  264, 1369]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
Generated embedding: [-1.7762679  1.9293272 -2.2413437 ...  2.6379988 -3.104867   4.806004 ]
Tokenized inputs: {'input_ids': tensor([[   1,  456,  349, 1698, 1369]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
Generated embedding: [-1.7762679  1.9293272 -2.2413437 ...  2.6379988 -3.104867   4.806004 ]
Tokenized inputs: {'input_ids': tensor([[   1,  736,  460,  799, 8079]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
Generated embedding: [-1.7762679  1.9293272 -2.2413437 ...  2.6379988 -3.104867   4.806004 ]
The embeddings are the same.

Does anyone know where the problem is? Many thanks!

It turns out that the hidden state of the special beginning-of-sequence (BOS) token in this model stays almost the same regardless of the input text, which is why every call returns the same vector. We can't use the BOS token's embedding to represent the whole sequence in this model. Got the answer from python - Mistral model generates the same embeddings for different input texts - Stack Overflow
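For anyone hitting the same issue: one common workaround is to pool over all token hidden states instead of reading only the first position. Below is a minimal mean-pooling sketch (the function name generate_pooled_embedding is my own; mean pooling is just one reasonable choice, and taking the last token's hidden state is another option):

import torch
from transformers import AutoTokenizer, AutoModel

PRETRAIN_MODEL = 'mistralai/Mistral-7B-Instruct-v0.2'
tokenizer = AutoTokenizer.from_pretrained(PRETRAIN_MODEL)
model = AutoModel.from_pretrained(PRETRAIN_MODEL)

def generate_pooled_embedding(document):
    inputs = tokenizer(document, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    # Average the hidden states of all tokens, weighted by the attention mask,
    # instead of reading only the first (BOS) position.
    hidden = outputs.last_hidden_state                      # (1, seq_len, hidden_size)
    mask = inputs['attention_mask'].unsqueeze(-1).float()   # (1, seq_len, 1)
    embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return embedding[0].numpy()

With this pooling, the three test sentences above should produce clearly different vectors.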
