Sentence similarity

yanagar25 · June 28, 2021, 9:25am

Hi all,
I have a question.
I have a dataset containing questions and answers from a specific domain. My goal is to find the X most similar questions to a query.
For example:
user: “What is python?”
dataset questions: [“What is python?”, “What does python means?”, “Is it python?”, “Is it a python snake?”, “Is it a python?”]
I tried encoding the questions as embeddings and calculating the cosine similarity. The problem is that “Is it python?” gets a high similarity score for the query “What is python?”, even though it is clearly not the same question in meaning, while “What does python means?” gets a very low score compared to “Is it python?”.
Any suggestions on how I can overcome this problem? Maybe new approaches…
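
For readers following along, the setup described above (embed every question, then rank by cosine similarity to the query) might look roughly like the sketch below. It assumes the embeddings have already been computed by some model and are available as plain lists of floats; the names `cosine_similarity` and `top_k_similar` are hypothetical helpers, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)

def top_k_similar(query_vec, question_vecs, questions, k=3):
    """Return the k (score, question) pairs with the highest cosine similarity."""
    scored = [
        (cosine_similarity(query_vec, vec), question)
        for vec, question in zip(question_vecs, questions)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The ranking is only as good as the embeddings fed into it, so a surprising order such as the one described here usually points at the encoder rather than at the cosine step.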

pavel-nesterov · September 16, 2021, 12:47pm

Hi,

I would suggest trying 3-4 models from the Sentence Similarity task filter.

There is an easy way to do this: call accelerated inference for each model from a Colab notebook. That may help you see whether any of them really gives a high weight to the “What does python means?” question from your example.
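
One way to organize such a side-by-side comparison is sketched below. It is only an illustration: `rank_with_encoders` is a hypothetical helper, each "encoder" is assumed to be any callable that maps a list of strings to a list of vectors, and `toy_bag_of_words` is a deliberately crude stand-in rather than a real model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_with_encoders(encoders, query, candidates):
    """For each named encoder, rank the candidates by similarity to the query.

    encoders: dict mapping a label to a callable(list[str]) -> list of vectors.
    Returns a dict mapping each label to (score, candidate) pairs, best first.
    """
    rankings = {}
    for name, encode in encoders.items():
        vectors = encode([query] + candidates)
        query_vec, candidate_vecs = vectors[0], vectors[1:]
        scored = [(cosine(query_vec, v), c) for v, c in zip(candidate_vecs, candidates)]
        rankings[name] = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return rankings

def toy_bag_of_words(texts):
    """Toy stand-in encoder (word counts over a shared vocabulary), NOT a real model."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    return [[float(tokens.count(word)) for word in vocab] for tokens in tokenized]
```

In practice each callable would wrap a real model, for instance the `encode` method of a sentence-transformers model if that library is installed, so that the rankings for the example questions can be compared model by model.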
