Sentence similarity

Hi all,
I have a question.
I have a dataset containing questions and answers from a specific domain. My goal is to find the X questions most similar to a query.
For example:
user: “What is python?”
dataset questions: [“What is python?”, “What does python means?”, “Is it python?”, “Is it a python snake?”, “Is it a python?”]
I tried encoding the questions as embeddings and computing the cosine similarity. The problem is that “Is it python?” gets a high similarity score for the query “What is python?”, even though it clearly does not mean the same thing, while “What does python means?” gets a very low score compared to “Is it python?”.
Any suggestions on how I can overcome this problem? Maybe new approaches…
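
For reference, the approach described above (encode every question, then rank by cosine similarity to the query) can be sketched roughly as follows. This is a minimal sketch, not the actual setup: `toy_encode` is a hypothetical bag-of-words placeholder standing in for whatever embedding model is used, and `top_k` is an illustrative helper name.

```python
import math
from collections import Counter


def toy_encode(text):
    # Hypothetical placeholder for a real sentence-embedding model:
    # a bag-of-words count vector, which only captures word overlap.
    return Counter(text.lower().replace("?", " ").split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, questions, k, encode=toy_encode):
    # Score every dataset question against the query and keep the k best.
    query_vec = encode(query)
    scored = [(cosine(query_vec, encode(q)), q) for q in questions]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


questions = [
    "What is python?",
    "What does python means?",
    "Is it python?",
    "Is it a python snake?",
    "Is it a python?",
]

for score, question in top_k("What is python?", questions, k=3):
    print(f"{score:.3f}  {question}")
```

Swapping `toy_encode` for a real model's encoding function leaves the ranking logic unchanged, so the scores produced by different encoders can be compared on the same examples.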


I would suggest trying 3-4 models from the Sentence similarity task filter.

There is an easy way to do it: use accelerated inference for each model from a Colab notebook. It may help you see whether any of them really gives a high score to the “What does python means?” question from your example.
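
A small harness for that comparison might look like the sketch below. It does not call the hosted inference service; it assumes each model is exposed as a function that maps a list of sentences to a list of vectors. The `hashed_bow` encoders are hypothetical offline placeholders so the sketch runs as-is, and `compare` is an illustrative helper name. With the `sentence-transformers` package installed, `SentenceTransformer(name).encode` has the expected shape and could be plugged in instead.

```python
import math
import zlib


def cosine(a, b):
    # Cosine similarity between two dense vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hashed_bow(dim):
    # Hypothetical placeholder encoder (NOT a real model): hashed word counts.
    def encode(sentences):
        vectors = []
        for sentence in sentences:
            vec = [0.0] * dim
            for word in sentence.lower().replace("?", " ").split():
                vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
            vectors.append(vec)
        return vectors
    return encode


def compare(encoders, query, paraphrase, distractor):
    # For each encoder, score the paraphrase and the distractor against the query.
    report = {}
    for name, encode in encoders.items():
        q, p, d = encode([query, paraphrase, distractor])
        report[name] = (cosine(q, p), cosine(q, d))
    return report


# Placeholders so the sketch runs offline. To test real models, one option
# (an assumption, requires sentence-transformers) is:
#   from sentence_transformers import SentenceTransformer
#   encoders = {name: SentenceTransformer(name).encode for name in model_names}
# where model_names are the models picked from the task filter.
encoders = {"toy-16": hashed_bow(16), "toy-256": hashed_bow(256)}

report = compare(encoders, "What is python?", "What does python means?", "Is it python?")
for name, (paraphrase_score, distractor_score) in report.items():
    verdict = "ok" if paraphrase_score > distractor_score else "distractor ranked higher"
    print(f"{name}: paraphrase={paraphrase_score:.3f} distractor={distractor_score:.3f} ({verdict})")
```

A model is worth keeping for this dataset if it scores the paraphrase above the distractor on examples like this one; checking a handful of such pairs gives a quick signal before running the full dataset.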