I’m trying to populate Pinecone with sentence embeddings for a large dataset using Spark, and then look up the nearest matches using an embedding generated by SentenceTransformers with the same all-mpnet-base-v2 model.
I would expect the vector generated for a given sentence to be identical whether it was produced through Spark-NLP or SentenceTransformers, but the two differ considerably.
With SentenceTransformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
model.encode("boring")
array([-3.42088640e-02, 1.28042186e-02, 1.47349751e-02, 1.06020477e-02, …
With Spark-NLP:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import element_at

spark = sparknlp.start()  # Spark session with Spark NLP on the classpath
document = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
mpnet = MPNetEmbeddings.pretrained("all_mpnet_base_v2", "en") \
.setInputCols(["document"]) \
.setOutputCol("embeddings")
embeddings_finisher = (
EmbeddingsFinisher()
.setInputCols("embeddings")
.setOutputCols("vector")
.setOutputAsVector(True)
)
pipeline = Pipeline(stages=[document, mpnet, embeddings_finisher])
data = spark.createDataFrame([["boring"]], ["text"])
result = pipeline.fit(data).transform(data).withColumn("vector", element_at("vector", 1))
result.collect()[0].vector
DenseVector([-0.0125, 0.0614, -0.0067, 0.0252, 0.0148, 0.0332, …
As you can see, the values are very different even though both use the same Hugging Face model. What am I missing? Both methods presumably load their vocabulary from the model, so even though they use different tokenizer implementations I would expect them to produce the same results.
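To quantify the gap, I compared the two vectors directly and also checked their norms, since one possible source of divergence is that one pipeline L2-normalizes its output and the other doesn't. This is just a sketch, assuming the two full 768-dimensional vectors have been pulled into plain Python lists (the values shown are only the truncated components quoted above, for illustration):

import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# First four components quoted above -- illustration only; in practice
# compare the full 768-dimensional vectors.
st_vec = [-3.42088640e-02, 1.28042186e-02, 1.47349751e-02, 1.06020477e-02]
spark_vec = [-0.0125, 0.0614, -0.0067, 0.0252]

print("cosine:", cosine_similarity(st_vec, spark_vec))
print("norms:", np.linalg.norm(st_vec), np.linalg.norm(spark_vec))

On the full vectors, a cosine similarity far from 1.0 would confirm the embeddings genuinely differ rather than just being scaled differently, and unequal norms would suggest a normalization mismatch between the two pipelines.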