Hello,
I found that the embeddings produced by the sentence transformer 'distiluse-base-multilingual-cased-v1' are slightly different depending on how you calculate them for a Pandas series.
Specifically, the result differs depending on whether I encode the whole series as one argument or encode it element by element with the 'apply' method:
import pandas as pd
from sentence_transformers import SentenceTransformer

# load the sentence-transformer model and keep a reference to its encode method
sentence_transformer_path = 'distiluse-base-multilingual-cased-v1'
encoder = SentenceTransformer(sentence_transformer_path).encode

# create a dataframe with texts
s = [str(i**2) for i in range(10)]
df = pd.DataFrame()
df['num'] = s

# first method: encode the whole series at once
embed1 = encoder(df['num'])

# compare embeddings for a given row
row = 1
difference = encoder(df.loc[row, 'num']) - embed1[row, :]

# total length of the difference vector
print('Method 1 difference: ', (sum(difference**2))**0.5)

# second method: encode row by row with .apply
embed2 = df['num'].apply(encoder)

# compare embeddings for a given row
row = 1
difference = encoder(df.loc[row, 'num']) - embed2[row]

# total length of the difference vector
print('Method 2 difference: ', (sum(difference**2))**0.5)
The difference seems minor, but I wonder what the reason is. Also, I don't know whether it is specific to my machine.
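For what it's worth, one plausible source of tiny discrepancies like this (not necessarily the cause here) is float32 non-associativity: when a whole batch is encoded at once, the underlying matrix operations may accumulate sums in a different order than when rows are processed one at a time. A minimal NumPy sketch of the effect, using made-up shapes and unrelated to sentence-transformers itself:

```python
import numpy as np

# hypothetical weight matrix and batch of inputs, purely illustrative
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32)
x = rng.standard_normal((8, 512)).astype(np.float32)

# batched path: one matrix-matrix product for all rows at once
batched = x @ W

# per-row path: a separate matrix-vector product per row,
# mimicking what .apply() does one element at a time
per_row = np.stack([row @ W for row in x])

# the two results agree only up to float32 rounding; the reduction
# order inside the two code paths is not guaranteed to match
print('L2 difference:', np.linalg.norm(batched - per_row))
```

If the difference you see is on the order of float32 rounding error, this kind of accumulation-order effect would explain why it shows up on any machine, not just yours.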
Thank you