Interesting (but puzzling) cosine-similarity comparison with distilbert

Consider the simple example below.

I consider three simple sentences, find their embeddings, and compute their pairwise cosine similarities.

I am puzzled by the results. I would have expected "this is a cat" to be very similar to "this is a dog", and both of those to be dissimilar to "this is a banana".

However, taken at face value, "this is a banana" is more similar to "this is a dog" than the two animal sentences are to each other… :man_facepalming: :exploding_head:

Is this expected?! What do you think?

from transformers import pipeline
from numpy import dot
from numpy.linalg import norm

def mycos(x, y):
    # cosine similarity between two 1-D vectors
    return dot(x, y) / (norm(x) * norm(y))

mypipe = pipeline('feature-extraction', 'distilbert-base-uncased-finetuned-sst-2-english')

# the pipeline returns [batch][tokens][hidden]; [0][0] keeps only
# the first token's (i.e. the [CLS] token's) embedding
one = mypipe('this is a cat')[0][0]
two = mypipe('this is a dog')[0][0]
three = mypipe('this is a banana')[0][0]

In [55]: mycos(one, two)
Out[55]: 0.5795413454711928

In [56]: mycos(one, three)
Out[56]: 0.19475422728604236

In [57]: mycos(two, three)
Out[57]: 0.5881860164213862
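
For what it's worth, one thing I noticed while writing this up: `[0][0]` keeps only the first token's vector (the [CLS] token), and this checkpoint was fine-tuned for sentiment, so its [CLS] space may not track semantic similarity at all. A common alternative is to average all token vectors into a single sentence embedding. A minimal sketch of that pooling step (the helper names `mean_pool` and `cos` are my own, not from any library):

```python
import numpy as np

def mean_pool(features):
    # features is the raw output of a 'feature-extraction' pipeline,
    # a nested list of shape [batch][num_tokens][hidden];
    # average over the token axis to get one sentence vector
    return np.mean(features[0], axis=0)

def cos(x, y):
    # cosine similarity between two 1-D vectors
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# hypothetical usage with the pipeline above:
# one = mean_pool(mypipe('this is a cat'))
# two = mean_pool(mypipe('this is a dog'))
# cos(one, two)
```

I don't know whether mean pooling would flip the ranking here, but it would at least use the whole sentence rather than a single token.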