I’m trying to populate Pinecone with sentence embeddings for a large dataset using Spark, and then look up the nearest matches using an embedding generated by SentenceTransformers with the same all-mpnet-base-v2 model.
I would expect the vector generated for a given sentence to be identical whether it was produced through Spark-NLP or SentenceTransformers, but the two differ considerably.
With SentenceTransformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
model.encode("boring")
array([-3.42088640e-02, 1.28042186e-02, 1.47349751e-02, 1.06020477e-02, …
With Spark-NLP:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import element_at

spark = sparknlp.start()  # Spark session with Spark NLP on the classpath
document = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
mpnet = MPNetEmbeddings.pretrained("all_mpnet_base_v2", "en") \
.setInputCols(["document"]) \
.setOutputCol("embeddings")
embeddings_finisher = (
EmbeddingsFinisher()
.setInputCols("embeddings")
.setOutputCols("vector")
.setOutputAsVector(True)
)
pipeline = Pipeline(stages=[document, mpnet, embeddings_finisher])
data = spark.createDataFrame([["boring"]], ["text"])
result = pipeline.fit(data).transform(data).withColumn("vector", element_at("vector", 1))
result.collect()[0].vector
DenseVector([-0.0125, 0.0614, -0.0067, 0.0252, 0.0148, 0.0332, …
As you can see, the values are very different even though both use the same Hugging Face model. What am I missing? Both methods presumably load their vocabulary from the model, so even though they use different tokenizer implementations I would expect them to produce the same results.
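To quantify the gap, I compared the two vectors directly and also checked their norms, since one possible source of divergence is that one pipeline L2-normalizes its output and the other doesn't. This is just a sketch, assuming the two full 768-dimensional vectors have been pulled into plain Python lists (the values shown are only the truncated components quoted above, for illustration):

import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# First four components quoted above -- illustration only; in practice
# compare the full 768-dimensional vectors.
st_vec = [-3.42088640e-02, 1.28042186e-02, 1.47349751e-02, 1.06020477e-02]
spark_vec = [-0.0125, 0.0614, -0.0067, 0.0252]

print("cosine:", cosine_similarity(st_vec, spark_vec))
print("norms:", np.linalg.norm(st_vec), np.linalg.norm(spark_vec))

On the full vectors, a cosine similarity far from 1.0 would confirm the embeddings genuinely differ rather than just being scaled differently, and unequal norms would suggest a normalization mismatch between the two pipelines.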