Mistral model generates the same embeddings for different input texts

Hi, I am using a pre-trained LLM to get a representative embedding for an input text, but the results are weird: the embeddings are identical regardless of the input text.

Code:

from transformers import AutoTokenizer, AutoModel
import numpy as np
import torch

PRETRAIN_MODEL = 'mistralai/Mistral-7B-Instruct-v0.2'
tokenizer = AutoTokenizer.from_pretrained(PRETRAIN_MODEL)
model = AutoModel.from_pretrained(PRETRAIN_MODEL)

def generate_embedding(document):
    inputs = tokenizer(document, return_tensors='pt')
    print("Tokenized inputs:", inputs)
    with torch.no_grad():
        outputs = model(**inputs)
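    # Take the hidden state of the first token (the BOS token) as the embedding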
    embedding = outputs.last_hidden_state[0, 0, :].numpy()
    print("Generated embedding:", embedding)
    return embedding

text1 = "this is a test"
text2 = "this is another test"
text3 = "there are other tests"

embedding1 = generate_embedding(text1)
embedding2 = generate_embedding(text2)
embedding3 = generate_embedding(text3)

are_equal = np.array_equal(embedding1, embedding2) and np.array_equal(embedding2, embedding3)
if are_equal:
    print("The embeddings are the same.")
else:
    print("The embeddings are not the same.")

The tokenized inputs are different, but the generated embeddings are identical.

Detailed output:

Tokenized inputs: {'input_ids': tensor([[   1,  456,  349,  264, 1369]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
Generated embedding: [-1.7762679  1.9293272 -2.2413437 ...  2.6379988 -3.104867   4.806004 ]
Tokenized inputs: {'input_ids': tensor([[   1,  456,  349, 1698, 1369]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
Generated embedding: [-1.7762679  1.9293272 -2.2413437 ...  2.6379988 -3.104867   4.806004 ]
Tokenized inputs: {'input_ids': tensor([[   1,  736,  460,  799, 8079]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
Generated embedding: [-1.7762679  1.9293272 -2.2413437 ...  2.6379988 -3.104867   4.806004 ]
The embeddings are the same.

Does anyone know where the problem is? Many thanks!

It turns out that the hidden state of the special beginning-of-sequence (BOS) token in this model stays almost the same regardless of the input text, which is why every call returns the same vector. We can't use the BOS token's embedding to represent the whole sequence in this model. Got the answer from python - Mistral model generates the same embeddings for different input texts - Stack Overflow
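For anyone hitting the same issue: one common workaround is to pool over all token hidden states instead of reading only the first position. Below is a minimal mean-pooling sketch (the function name generate_pooled_embedding is my own; mean pooling is just one reasonable choice, and taking the last token's hidden state is another option):

import torch
from transformers import AutoTokenizer, AutoModel

PRETRAIN_MODEL = 'mistralai/Mistral-7B-Instruct-v0.2'
tokenizer = AutoTokenizer.from_pretrained(PRETRAIN_MODEL)
model = AutoModel.from_pretrained(PRETRAIN_MODEL)

def generate_pooled_embedding(document):
    inputs = tokenizer(document, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    # Average the hidden states of all tokens, weighted by the attention mask,
    # instead of reading only the first (BOS) position.
    hidden = outputs.last_hidden_state                      # (1, seq_len, hidden_size)
    mask = inputs['attention_mask'].unsqueeze(-1).float()   # (1, seq_len, 1)
    embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return embedding[0].numpy()

With this pooling, the three test sentences above should produce clearly different vectors.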
