All-mpnet-base-v2 gives different results in Spark NLP vs SentenceTransformers

I’m trying to populate Pinecone with sentence embeddings for a large dataset using Spark, and then look up the nearest matches using an embedding generated by SentenceTransformers with the same all-mpnet-base-v2 model.

I would expect the vector generated for a given sentence to be the same whether it was produced through Spark NLP or SentenceTransformers, but the two differ considerably.
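
For context, the lookup side looks roughly like this (a minimal sketch; the index name "my-index" and the API key handling are placeholders, not from my actual setup):

from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

# Hypothetical index name and API key, shown only to illustrate the flow
pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("my-index")

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
query_vector = model.encode("boring").tolist()

# Nearest-neighbour lookup against vectors written to Pinecone by the Spark job
results = index.query(vector=query_vector, top_k=5, include_metadata=True)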

With SentenceTransformers:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
model.encode("boring")

array([-3.42088640e-02, 1.28042186e-02, 1.47349751e-02, 1.06020477e-02, …

With Spark NLP:

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import element_at

# Start a Spark session with Spark NLP on the classpath
spark = sparknlp.start()

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

mpnet = MPNetEmbeddings.pretrained("all_mpnet_base_v2", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings")

embeddings_finisher = (
        EmbeddingsFinisher()
        .setInputCols("embeddings")
        .setOutputCols("vector")
        .setOutputAsVector(True)
    )

pipeline = Pipeline(stages=[document, mpnet, embeddings_finisher])

data = spark.createDataFrame([["boring"]], ["text"])
result = pipeline.fit(data).transform(data).withColumn("vector", element_at("vector", 1))
result.collect()[0].vector

DenseVector([-0.0125, 0.0614, -0.0067, 0.0252, 0.0148, 0.0332, …

As you can see, the values are very different even though both use the same Hugging Face model. What am I missing? Both methods are presumably getting their vocabulary from the model, so even though they use different tokenizer implementations I would expect them to produce the same results.
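
To quantify how far apart the two implementations are (rather than eyeballing truncated arrays), you can reuse model and result from the snippets above and compare cosine similarity; a minimal sketch:

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

st_vec = model.encode("boring")                    # SentenceTransformers output
spark_vec = result.collect()[0].vector.toArray()   # Spark NLP DenseVector -> numpy array

# For matching implementations this should be very close to 1.0
print(cosine_similarity(st_vec, spark_vec))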


For future readers: it’s a bug in Spark NLP 5.5.2; 5.5.1 works fine.
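
If you hit this, pinning the version is the workaround (a sketch, assuming a pip-based setup; sparknlp.start() should resolve the matching Spark NLP Maven artifact for the installed Python package):

pip install spark-nlp==5.5.1

import sparknlp

# Starts a Spark session with the Spark NLP jar matching the installed Python package version
spark = sparknlp.start()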

