So I’ve been learning a lot about language models lately, and thought “Hey, AI could be used to search documents and websites for concepts instead of keywords; that would be really useful”. Then I learned that you don’t really need a full AI for that, you can just use the embedding layer, so now I’m learning about embeddings. Two things I don’t understand:
I get the idea of mapping a token into a multi-dimensional embedding space (tokens with similar meanings end up near each other in that space), but I’m unclear on how sentences or documents made up of multiple tokens get mapped into the same space. ChatGPT says that you map each token and then aggregate the token vectors into a single vector by averaging, max pooling, or a weighted sum. I’m unclear on how you choose among those, and on how the meaning of a sentence can be encoded from the meanings of individual words, especially since word order changes meaning and none of those operations sound like they preserve order information. Do we know how text-embedding-ada-002 aggregates multiple tokens, for instance? I couldn’t find it in a search.
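To make sure I understand the aggregation options ChatGPT described, here’s my mental model sketched in plain numpy. The token vectors below are made up for illustration, not from any real model, and real embeddings have hundreds or thousands of dimensions:

```python
import numpy as np

# Hypothetical 4-dimensional embeddings for the tokens of one sentence.
token_embeddings = np.array([
    [0.1, 0.3, -0.2, 0.5],   # "the"
    [0.7, -0.1, 0.4, 0.2],   # "weather"
    [0.2, 0.6, 0.1, -0.3],   # "is"
    [0.5, 0.0, -0.4, 0.8],   # "sunny"
])

# Mean pooling: average each dimension over all tokens.
mean_pooled = token_embeddings.mean(axis=0)

# Max pooling: element-wise maximum over all tokens.
max_pooled = token_embeddings.max(axis=0)

print(mean_pooled)  # one fixed-size vector for the whole sentence
print(max_pooled)
```

As far as I can tell, neither pooling step preserves order by itself; any order sensitivity would have to come from the token vectors already being contextual (i.e., the model mixes position into each token’s vector before pooling), which is part of what I’m asking about.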
The second hole in my understanding: different models seem to be better at different tasks.
The sentence-transformers/sentence-t5-xxl model card on Hugging Face says:
The model works well for sentence similarity tasks, but doesn’t perform that well for semantic search tasks.
while the hkunlp/instructor-large model card says it:
can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.)
But retrieval, sentence similarity, and semantic search all seem like the same thing to me. Can anyone clarify the differences? ChatGPT describes them as:
Retrieval:
Task: Retrieve relevant documents from a large corpus based on a query.
Length of text: Can involve entire documents or chunks of text.
Example: You have a large database of scientific articles, and you want to find articles related to “neural networks.” A retrieval model will help you find those articles in the database.

Sentence similarity:
Task: Measure the semantic similarity between two sentences or text chunks.
Length of text: Usually involves comparing two short text segments (e.g., sentences or phrases).
Example: You have two sentences: “The weather today is sunny” and “It’s a bright day outside.” A sentence similarity model will measure the semantic similarity between these two sentences, indicating that they are highly similar in meaning.

Semantic search:
Task: Find relevant documents based on a query by considering the underlying meaning, even if the exact words in the query are not present in the documents.
Length of text: Can involve entire documents or chunks of text.
Example: You have a large database of news articles, and you want to find articles that discuss “artificial intelligence.” A semantic search model will not only find articles that contain the exact phrase “artificial intelligence” but also those that discuss the topic using different terms (e.g., “machine learning” or “AI”).
Its descriptions of retrieval and semantic search sound identical to me, and it seems like both would be implemented by computing the sentence similarity between the query text and chunks (sentences?) of each document. So I don’t understand how a model can be good at sentence similarity while being bad at semantic search. Doesn’t one depend on the other?
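To make my assumption concrete, here’s roughly how I’d imagine building semantic search on top of sentence similarity. The embedding vectors and chunk names are made up for illustration; in a real system they would come from an embedding model:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these came from an embedding model: one vector per document chunk.
chunk_embeddings = {
    "chunk about machine learning": np.array([0.9, 0.1, 0.2]),
    "chunk about cooking":          np.array([0.1, 0.8, 0.3]),
}
query_embedding = np.array([0.85, 0.15, 0.25])  # e.g. the query "artificial intelligence"

# Rank chunks by similarity to the query -- this ranking IS the search result.
ranked = sorted(chunk_embeddings.items(),
                key=lambda kv: cosine_similarity(query_embedding, kv[1]),
                reverse=True)
print(ranked[0][0])  # most relevant chunk
```

If semantic search really is just “sentence similarity applied to every chunk” like this, I don’t see where a model could do well at one and poorly at the other, which is the crux of my question.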