Consider the simple example below.
I take three simple sentences, compute their embeddings, and compute their pairwise cosine similarity.
I am puzzled by the results. I would have expected "this is a cat" to be very similar to "this is a dog", and both of them to be dissimilar to "this is a banana". However, taken at face value, "this is a banana" is more similar to "this is a dog" than the two animal sentences are to each other…
Is this expected?! What do you think?
import tensorflow as tf
from transformers import pipeline
from numpy import dot
from numpy.linalg import norm

def mycos(x, y):
    return dot(x, y) / (norm(x) * norm(y))

mypipe = pipeline('feature-extraction',
                  'distilbert-base-uncased-finetuned-sst-2-english')

one = mypipe('this is a cat')
two = mypipe('this is a dog')
three = mypipe('this is a banana')

mycos(one, two)
Out: 0.5795413454711928

mycos(one, three)
Out: 0.19475422728604236

mycos(two, three)
Out: 0.5881860164213862
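For context on what is being compared: the feature-extraction pipeline returns one embedding per token (a nested list of shape [1, num_tokens, hidden_size]), not a single sentence vector, so a common preprocessing step before cosine similarity is to mean-pool the token embeddings. The sketch below shows that pooling step in isolation, using a small hand-made array in place of the real pipeline output (the helper names `sentence_vector` and `cosine` are my own, not from transformers):

```python
import numpy as np

def sentence_vector(features):
    # features: feature-extraction output, shape [1, num_tokens, hidden_size];
    # average over the token axis to get one vector per sentence
    return np.array(features)[0].mean(axis=0)

def cosine(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# toy stand-in for a pipeline output: 1 sentence, 3 tokens, 4 hidden dims
fake_output = [[[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [1.0, 1.0, 0.0, 0.0]]]

v = sentence_vector(fake_output)
print(v.shape)        # (4,) — one pooled vector
print(cosine(v, v))   # 1.0 — identical vectors have cosine similarity 1
```

With the real pipeline you would call `sentence_vector(mypipe('this is a cat'))` and compare the pooled vectors; whether mean pooling changes the surprising ranking above would need to be checked empirically.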