Consider the simple example below. I take three simple sentences, compute their embeddings, and compute their pairwise cosine similarities.

I am puzzled by the results. I would have expected `this is a cat` to be very similar to `this is a dog`, and both `this is a cat` and `this is a dog` to be dissimilar to `this is a banana`. However, taken at face value, `this is a banana` is more similar to `this is a dog` than the two animal sentences are to each other…

Is this expected?! What do you think?
import tensorflow as tf  # makes the TensorFlow backend available to transformers
from transformers import pipeline
from numpy import dot
from numpy.linalg import norm

def mycos(x, y):
    # cosine similarity between two vectors
    return dot(x, y) / (norm(x) * norm(y))

mypipe = pipeline('feature-extraction', 'distilbert-base-uncased-finetuned-sst-2-english')

# [0][0] selects the embedding of the first token of each sentence
one = mypipe('this is a cat')[0][0]
two = mypipe('this is a dog')[0][0]
three = mypipe('this is a banana')[0][0]

mycos(one, two)    # Out[55]: 0.5795413454711928
mycos(one, three)  # Out[56]: 0.19475422728604236
mycos(two, three)  # Out[57]: 0.5881860164213862
Thanks!