I have a question.
I have a dataset containing questions and answers from a specific domain. My goal is to find the X most similar questions to a query.
user: “What is python?”
dataset questions: [“What is python?”, “What does python means?”, “Is it python?”, “Is it a python snake?”, “Is it a python?”]
I tried encoding the questions as embeddings and calculating the cosine similarity, but the problem is that it gives a high similarity score to “Is it python?” for the query “What is python?”, which is clearly not the same question, while “What does python means?” gets a very low score compared to “Is it python?”.
Any suggestions on how I can overcome this problem? Maybe new approaches…
If cosine similarity is not giving you the results you want, you could try a different metric, such as Euclidean, Manhattan, or Minkowski distance, or Jaccard similarity (which compares the sets of words in the two questions rather than their embedding vectors).
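In case it helps with comparing them, here is a minimal sketch of those metrics in plain Python. The function names and the toy vectors are made up for illustration and do not come from any particular library; in practice you would pass in the vectors produced by your embedding model.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def minkowski_distance(a, b, p):
    # p=1 gives Manhattan distance, p=2 gives Euclidean distance.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def jaccard_similarity(text_a, text_b):
    # Works on sets of lowercased words, not on embedding vectors.
    tokens_a = set(text_a.lower().replace("?", "").split())
    tokens_b = set(text_b.lower().replace("?", "").split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)

# Toy vectors standing in for real embeddings (illustrative values only).
query_vec = [1.0, 0.0, 1.0]
candidate_vec = [1.0, 1.0, 0.0]

print("cosine:   ", cosine_similarity(query_vec, candidate_vec))
print("manhattan:", minkowski_distance(query_vec, candidate_vec, 1))
print("euclidean:", minkowski_distance(query_vec, candidate_vec, 2))
print("jaccard:  ", jaccard_similarity("What is python?", "Is it python?"))
```

One thing to keep in mind: if your embeddings are normalized to unit length, Euclidean distance produces the same ranking as cosine similarity, so switching between just those two may not change your results much.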
Alternatively, you could try changing the embedding model to see if that improves the comparisons.
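To make swapping models easy, one option is to pass the embedding function in as a parameter, roughly like the sketch below. Everything here is hypothetical: `bag_of_words_embed` is only a word-count placeholder standing in for a real embedding model, and `top_x_similar` is a name I made up. It just shows the shape of the retrieval loop, assuming cosine similarity as the score.

```python
import math
from collections import Counter

def bag_of_words_embed(text):
    # Placeholder "model": plain word counts. Replace with a real embedding model.
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_x_similar(query, questions, embed, x):
    # Score every dataset question against the query and keep the best x.
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(q)), q) for q in questions]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:x]

questions = [
    "What is python?",
    "What does python means?",
    "Is it python?",
    "Is it a python snake?",
    "Is it a python?",
]

for score, question in top_x_similar("What is python?", questions, bag_of_words_embed, 3):
    print(f"{score:.3f}  {question}")
```

With this layout you can try a different model by writing another `embed` function and passing it to `top_x_similar`, then comparing the rankings side by side on a handful of queries where you already know which questions should match.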