I don't understand the difference between asymmetric retrieval, sentence similarity, and semantic search

So I’ve been learning a lot about language models lately, and thought “Hey, AI could be used to search documents and websites for concepts instead of keywords; that would be really useful”. Then I learned that you don’t really need a full AI for that; you can just use the embedding layer. So now I’m learning about embeddings, and there are two things I don’t understand:

I get the idea of mapping a token into a multi-dimensional embedding space (so that tokens with similar meanings end up near each other in that space), but I’m unclear on how sentences or documents made up of multiple tokens get mapped into such a space. ChatGPT says that you map each token and then aggregate the token vectors into a single vector by averaging, max pooling, or a weighted sum. I’m unclear on how you choose among those, and on how the meaning of a sentence can be encoded from the meanings of its individual words, especially when word order changes the meaning and none of those aggregations sound like they keep order information. Do we know how text-embedding-ada-002 aggregates multiple tokens, for instance? I couldn’t find it in a search.
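To make the word-order concern concrete, here is a minimal sketch of mean and max pooling over static per-token vectors. The three-dimensional vectors and the names `TOKEN_VECTORS`, `mean_pool`, and `max_pool` are made up for illustration; this is not how any particular model (including text-embedding-ada-002) is known to work.

```python
# Toy static token embeddings (made-up values, for illustration only).
TOKEN_VECTORS = {
    "dog": [1.0, 0.0, 0.5],
    "bites": [0.0, 1.0, 0.25],
    "man": [0.5, 0.25, 1.0],
}


def mean_pool(tokens):
    """Average the token vectors dimension by dimension."""
    vectors = [TOKEN_VECTORS[t] for t in tokens]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def max_pool(tokens):
    """Take the largest value in each dimension across the token vectors."""
    vectors = [TOKEN_VECTORS[t] for t in tokens]
    dim = len(vectors[0])
    return [max(v[i] for v in vectors) for i in range(dim)]


a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

# With static vectors, both poolings ignore word order entirely.
print(mean_pool(a) == mean_pool(b))
print(max_pool(a) == max_pool(b))
```

With static vectors like these, the two sentences pool to the same point, which is exactly the problem described above. Transformer-based sentence encoders typically pool contextual token vectors instead: each token's vector has already passed through attention layers with positional information, so order can influence the result before pooling happens.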

The second gap in my understanding is that different models are apparently better at different tasks:

The sentence-transformers/sentence-t5-xxl model card on Hugging Face says:

The model works well for sentence similarity tasks, but doesn’t perform that well for semantic search tasks.

while the hkunlp/instructor-large model card on Hugging Face says:

can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.)

But retrieval, sentence similarity, and semantic search all seem like the same thing to me. Can anyone clarify the differences? ChatGPT describes them as:

Retrieval:
Task: Retrieve relevant documents from a large corpus based on a query.
Length of text: Can involve entire documents or chunks of text.
Example: You have a large database of scientific articles, and you want to find articles related to “neural networks.” A retrieval model will help you find those articles in the database.

Sentence similarity:
Task: Measure the semantic similarity between two sentences or text chunks.
Length of text: Usually involves comparing two short text segments (e.g., sentences or phrases).
Example: You have two sentences: “The weather today is sunny” and “It’s a bright day outside.” A sentence similarity model will measure the semantic similarity between these two sentences, indicating that they are highly similar in meaning.

Semantic search:
Task: Find relevant documents based on a query by considering the underlying meaning, even if the exact words in the query are not present in the documents.
Length of text: Can involve entire documents or chunks of text.
Example: You have a large database of news articles, and you want to find articles that discuss “artificial intelligence.” A semantic search model will not only find articles that contain the exact phrase “artificial intelligence” but also those that discuss the topic using different terms (e.g., “machine learning” or “AI”).

Its descriptions of retrieval and semantic search sound identical to me, and it seems like both would be implemented by using sentence similarity of the search text with chunks (sentences?) of each document. So I don’t understand how a model can be good at sentence similarity while being bad at semantic search. Doesn’t one depend on the other?
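Here is a minimal sketch of the implementation I have in mind: embed the query, embed each chunk, and rank chunks by cosine similarity. The vectors and chunk names are hypothetical, hand-picked to show one way the two tasks could diverge: an encoder tuned for symmetric similarity may place a query closest to a paraphrase of itself rather than to the passage that answers it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def search(query_vec, chunk_vecs, top_k=2):
    """Return the ids of the top_k chunks most similar to the query."""
    scored = [(cosine(query_vec, vec), chunk_id)
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_k]]


# Hypothetical embeddings; dimensions loosely mean (topic, question-form, other).
query = [0.6, 0.8, 0.0]
chunks = {
    "similar_question": [0.5, 0.8, 0.1],  # paraphrase of the query
    "answer_passage": [0.9, 0.0, 0.3],    # same topic, but a statement
    "unrelated": [0.0, 0.1, 1.0],
}

print(search(query, chunks, top_k=3))
```

In this toy setup the paraphrase outranks the passage that actually answers the query. That would count as good sentence similarity but poor search, which may be the kind of gap the sentence-t5-xxl card is describing.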

I have the same question as you. I haven’t looked into the details of the paper.
I looked at the model card of instructor-large and realized that, unlike other embedding models, you have to pass an instruction in addition to the sentence you want to encode. Here is my interpretation:

Let’s say we want to know “What is Joe Biden’s age”. The answer to the retrieval query should be a paragraph that contains something like “Joe Biden’s age is 80”. However, you can argue that “What is Joe Biden’s age” and “Joe Biden’s age is 80” are two different sentences: one is a question, the other is a statement. One can also argue that a long paragraph containing that sentence (e.g. Joe Biden’s entire wiki entry) and the sentence on its own are not semantically the same. It depends on which features you care about, which you can specify in the instruction when you compute the embedding.
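As a rough sketch of that idea, the query and the document can each be paired with a different instruction before encoding. This assumes the [instruction, text] pair input format that the instructor-large model card describes; the instruction wording and the helper `build_inputs` are illustrative, not taken from the model's training data.

```python
def build_inputs(instruction, texts):
    """Pair every text with the same instruction, one [instruction, text] per item."""
    return [[instruction, text] for text in texts]


# Illustrative instruction strings (assumed wording, not authoritative).
QUERY_INSTRUCTION = "Represent the question for retrieving supporting documents:"
DOC_INSTRUCTION = "Represent the document for retrieval:"

query_inputs = build_inputs(QUERY_INSTRUCTION, ["What is Joe Biden's age?"])
doc_inputs = build_inputs(DOC_INSTRUCTION, ["Joe Biden's age is 80."])

# These lists would then be handed to the model's encode step; the question and
# the statement get different instructions, so the encoder can treat the two
# sides of the search asymmetrically.
print(query_inputs)
print(doc_inputs)
```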


I think you can check this:
