Hello, I am building a multilabel classifier that uses the embeddings from sentence-transformers/all-MiniLM-L6-v2 as input. After training a model that produces good enough results, I would like to run it in the browser using Transformers.js and the Xenova/all-MiniLM-L6-v2 model. However, I am getting different embeddings for the same text.
Here is my Python code:
from sentence_transformers import SentenceTransformer

model_name = "sentence-transformers/all-MiniLM-L6-v2"
mdl = SentenceTransformer(model_name)
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
se = mdl.encode(raw_inputs)
# the first 4 dimensions...
# [[-0.0635541 0.00168205 0.08878317 0.01061784]
# [-0.0278877 0.02493023 0.01891949 0.03274209]]
The JavaScript code:
import { pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@latest';
let extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
let result = await extractor([
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
], { pooling: 'mean', normalize: true });
console.log(result.data.slice(0, 4));     // first 4 dims of sentence 1
console.log(result.data.slice(384, 388)); // first 4 dims of sentence 2 (vectors are 384-dim, stored flat)
// [-0.0713, 0.0169, 0.0940, 0.00842]
// [-0.0041, 0.0070, 0.0365, 0.0422]
I would like to reproduce the sentence-transformers embeddings exactly. If that is not possible, I just need the Python and JavaScript embeddings to match each other, and I will try to retrain. My specific questions are:
1. Am I doing this correctly?
2. If so, can I get the JavaScript embeddings to match the Python ones?
Thank you
Hi,
I noticed the same thing with sentence-transformers versus transformers.
Why is this the case? Am I making a mistake, or is it fine as long as I use the same method for all the embeddings I create?
My goal is to build a local RAG application.
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
import torch
import pandas as pd
# sentence-transformer version
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = [
    "This framework generates embeddings for each input sentence",
    "Sentences are passed as a list of strings.",
    "The quick brown fox jumps over the lazy dog.",
]
embeddings = model.encode(sentences)
df = pd.DataFrame(embeddings, index=sentences)
print("sentence-transformer version")
print(df)
# ---------------------------------------------------------------
# transformer version
# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    # Clamp to avoid dividing by zero when a mask row sums to zero
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask
# Load AutoModel from huggingface model repository
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
# Tokenize sentences
encoded_input = tokenizer(
    sentences, padding=True, truncation=True, max_length=128, return_tensors="pt"
)
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
df2 = pd.DataFrame(sentence_embeddings.numpy(), index=sentences)
print("transformer version")
print(df2)
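As a quick diagnostic (a sketch reusing the variables from the script above): compare the two versions per sentence with cosine similarity. If the similarity is ~1.0 while the raw values differ, the two vectors point in the same direction and differ only by a scale factor, i.e. a normalization step.

import torch.nn.functional as F

# ~1.0 per sentence means same direction; any remaining difference is scaling.
print(F.cosine_similarity(torch.tensor(embeddings), sentence_embeddings, dim=1))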
@Stefan-LTB I was able to get the embeddings to match. I believe the difference was caused by Transformers.js preferring quantized model weights by default. After changing my JS code to pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', { quantized: false }), the values matched.
I am not sure why your transformers and sentence_transformers runs do not produce the same results; I'm pretty sure I looked at this case as well. You may want to double-check your masking.
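For what it's worth, one more thing to rule out (my assumption about the likely cause, not something I've re-run against your exact code): sentence-transformers applies L2 normalization as a final step for this model, while the plain transformers version above stops at mean pooling. A minimal sketch of the extra step:

import torch.nn.functional as F

# L2-normalize the mean-pooled vectors, mirroring SentenceTransformer's final
# Normalize module for this model.
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)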