Text similarity: alternatives to cosine similarity

Hi all,
I have a question.
I have a dataset containing questions and answers from a specific domain. My goal is to find the X most similar questions to a query.
for example:
user: “What is python?”
dataset questions: [“What is python?”, “What does python means?”, “Is it python?”, “Is it a python snake?”, “Is it a python?”]
I tried encoding the questions into embeddings and computing cosine similarity, but the problem is that for the query “What is python?” it gives a high similarity score to “Is it python?”, which clearly does not have the same meaning, while “What does python means?” gets a very low score compared to “Is it python?”.
Any suggestions on how I can overcome this problem? Maybe new approaches…
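
For reference, here is a minimal sketch of what I’m doing, assuming the sentence-transformers library (the model name is just an example):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model, swap in your own

query = "What is python?"
questions = ["What is python?", "What does python means?", "Is it python?",
             "Is it a python snake?", "Is it a python?"]

# Encode the query and every dataset question into embeddings.
query_emb = model.encode(query, convert_to_tensor=True)
question_embs = model.encode(questions, convert_to_tensor=True)

# Cosine similarity against each question, then take the top X.
scores = util.cos_sim(query_emb, question_embs)[0]
for idx in scores.argsort(descending=True)[:3]:
    i = int(idx)
    print(questions[i], float(scores[i]))
```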

If cosine similarity is not giving you the results you want, you could try a different metric such as Euclidean, Manhattan, or Minkowski distance, or Jaccard similarity.

Alternatively, you could try changing the embedding model to see if that improves the comparisons.
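
For example, a quick sketch of swapping in other metrics with scipy (assuming dense numpy vectors from your encoder; note that Jaccard similarity is usually computed on token sets rather than on dense embeddings):

```python
import numpy as np
from scipy.spatial import distance

# Stand-ins for two question embeddings from your encoder.
a = np.random.rand(384)
b = np.random.rand(384)

print(distance.euclidean(a, b))       # Euclidean (L2) distance
print(distance.cityblock(a, b))       # Manhattan (L1) distance
print(distance.minkowski(a, b, p=3))  # Minkowski distance with p=3

# Jaccard works on sets, so compute it on word overlap instead of embeddings.
tokens_a = set("what is python".split())
tokens_b = set("is it python".split())
print(len(tokens_a & tokens_b) / len(tokens_a | tokens_b))  # 0.5
```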


What you are trying to do is clearly one of the GLUE tasks:

3.2 SIMILARITY AND PARAPHRASE TASKS
MRPC The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent. Because the classes are imbalanced (68% positive), we follow common practice and report both accuracy and F1 score.

QQP The Quora Question Pairs dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether a pair of questions are semantically equivalent. As in MRPC, the class distribution in QQP is unbalanced (63% negative), so we report both accuracy and F1 score. We use the standard test set, for which we obtained private labels from the authors. We observe that the test set has a different label distribution than the training set.

STS-B The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is human-annotated with a similarity score from 1 to 5; the task is to predict these scores. Following common practice, we evaluate using Pearson and Spearman correlation coefficients.

What I suggest you do is follow the following tutorial to fine-tune your model on the dataset that is most similar to what you are trying to do (e.g., GLUE QQP instead of GLUE MRPC as in the tutorial).

There is even a leaderboard available where you can find which models perform best on QQP.
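
For instance, a minimal sketch of fine-tuning a paraphrase classifier on GLUE QQP with the Hugging Face Trainer (the model name and hyperparameters here are assumptions, not the tutorial’s exact settings):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# GLUE QQP: pairs of Quora questions labeled duplicate / not duplicate.
dataset = load_dataset("glue", "qqp")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example model

def tokenize(batch):
    return tokenizer(batch["question1"], batch["question2"],
                     truncation=True, padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="qqp-finetuned",
                         per_device_train_batch_size=32,
                         num_train_epochs=3,
                         learning_rate=2e-5)  # example hyperparameters

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"])
trainer.train()
```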

These are not definitive solutions, but experiments I’ve tried with vectorized representations where I’ve had some success:

  1. Definitely try the dot product. In my limited experience the dot product has consistently given better results than other metrics. There are reasons why metrics like Euclidean distance might fail: things get freaky and weird when we extend our 3-dimensional intuition to hundreds of dimensions. However, experimentation is going to make you wiser. (See the first sketch after this list.)

  2. Refer to the first word2vec paper, where they run experiments adding and subtracting vectors as if they were concepts. For example, v(king) - v(man) + v(woman) is close to v(queen). These experiments are not perfect, and I remember reading a paper arguing that this kind of adding and subtracting is flawed, which might have some merit. However, they’ve worked in a limited capacity for me. So, experiments like the following (see the second sketch after this list):

v(What is python?) - v(What) + v(How) might land you near Python questions phrased with “How”.

  • v(x) refers to the vector of x
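
Regarding point 1, a tiny sketch of the difference between the dot product and cosine similarity (plain numpy, no particular embedding model assumed):

```python
import numpy as np

def dot_score(a, b):
    # Raw dot product: sensitive to magnitude as well as direction.
    return np.dot(a, b)

def cosine_score(a, b):
    # Cosine similarity: dot product of normalized vectors (direction only).
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

print(dot_score(a, b))     # 28.0 -- magnitude matters
print(cosine_score(a, b))  # 1.0  -- direction is identical
```

Because the dot product keeps the magnitude information that cosine similarity normalizes away, the two can rank the same candidates quite differently.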
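
And for point 2, a toy illustration of the vector-arithmetic experiment. The 2-d vectors are made up purely to show the mechanics; in practice v(x) comes from your embedding model:

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 2-d vectors standing in for real embeddings.
v = {
    "What is python?": np.array([1.0, 1.0]),
    "What":            np.array([1.0, 0.0]),
    "How":             np.array([0.0, 1.0]),
}

# Shift the question vector away from "What" and toward "How".
shifted = v["What is python?"] - v["What"] + v["How"]

# Rank candidate questions by similarity to the shifted vector.
candidates = {
    "How does python work?": np.array([0.1, 1.9]),
    "What is python?":       np.array([1.0, 1.0]),
}
ranked = sorted(candidates, key=lambda q: cos(shifted, candidates[q]), reverse=True)
print(ranked)  # "How does python work?" ranks first
```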