Language model to search for an answer in a huge collection of (unrelated) paragraphs

I want to build a question-answering language model that searches a large collection of paragraphs (say, 10k) and finds relevant answers in them.

There are 2 issues I don’t know how to solve.

  1. Existing solutions usually extract an answer from a single short paragraph. I don’t know how to deal with a lot of paragraphs. A naive approach would be to go through every paragraph and try to identify an answer in each of them.

  2. Existing solutions will generate an answer even when fed an unrelated paragraph, and they don’t give a confidence number. If I have 10k paragraphs to search and only 3 of them contain an answer, existing solutions won’t let me rule out the unrelated paragraphs.

Is there a way to generate an embedding first (using both the question and a paragraph), use that embedding to find candidate paragraphs, and only then do the actual answer search? And when there is no answer, I’d like to get a confidence number that’s below my answer threshold.

Are there any papers dealing with this problem?

DPR (Dense Passage Retrieval) and RAG (Retrieval-Augmented Generation) may be the references you want.

Here is how DPR addresses your two questions:

  1. DPR (retriever module) selects the top-k paragraphs from 20 million candidate Wikipedia paragraphs (not just 10k, and you can also build your own corpus) using very fast MIPS (maximum inner product search), implemented with FAISS.

  2. DPR (reader module) produces a relevance score for each of the top-k passages, so this is the confidence number that you mentioned.
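
To make the two steps above concrete, here is a minimal sketch in plain Python. It is not DPR itself: the toy 3-dimensional vectors stand in for encoder outputs, the brute-force inner product stands in for a FAISS index, and the function names (`top_k_passages`, `keep_confident`) and the 0.85 threshold are made up for illustration. In DPR the threshold would more naturally be applied to the reader's relevance score; here the retrieval score is used as a stand-in.

```python
import heapq
from typing import List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_passages(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Brute-force maximum inner product search.

    Returns (passage_index, score) pairs, highest score first.
    A real system would replace this loop with an index such as FAISS.
    """
    scored = ((inner_product(query_vec, p), i) for i, p in enumerate(passage_vecs))
    return [(i, s) for s, i in heapq.nlargest(k, scored)]


def keep_confident(
    candidates: Sequence[Tuple[int, float]], threshold: float
) -> List[Tuple[int, float]]:
    """Drop candidates whose score is below the answer threshold."""
    return [(i, s) for i, s in candidates if s >= threshold]


# Toy embeddings (hypothetical values, not produced by a real encoder).
passages = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.0, 0.0]

candidates = top_k_passages(query, passages, k=2)     # step 1: retrieve
answers = keep_confident(candidates, threshold=0.85)  # step 2: filter by score
print(candidates, answers)
```

If `answers` comes back empty, nothing cleared the threshold, which is the "no answer" case from the question.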

Finally, RAG is an improvement on DPR where (1) you can combine different passages directly (both relevant and irrelevant ones) to produce the final answer by “marginalization”, and (2) the final answer is generated in free form and is not necessarily contained in any of the passages.
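
As a rough illustration of the “marginalization” idea, the probability of an answer can be written as a sum over retrieved passages: p(answer | question) ≈ Σ_z p(z | question) · p(answer | question, z). The sketch below assumes the passage weights come from a softmax over retrieval scores and that each passage yields its own answer distribution; the scores, the answer strings, and the `marginalize` helper are invented for the example and are not the library's API.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw retrieval scores into passage probabilities p(z | question)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def marginalize(
    retrieval_scores: Sequence[float],
    answer_probs_per_passage: Sequence[Dict[str, float]],
) -> Dict[str, float]:
    """Sum p(z | question) * p(answer | question, z) over all passages z."""
    passage_probs = softmax(retrieval_scores)
    combined: Dict[str, float] = defaultdict(float)
    for p_z, answer_probs in zip(passage_probs, answer_probs_per_passage):
        for answer, p_y in answer_probs.items():
            combined[answer] += p_z * p_y
    return dict(combined)


# Hypothetical numbers: two relevant passages and one weakly retrieved one.
retrieval_scores = [2.0, 2.0, 0.0]
answer_probs = [
    {"Paris": 0.9, "Lyon": 0.1},
    {"Paris": 0.7, "Lyon": 0.3},
    {"Paris": 0.2, "Lyon": 0.8},
]

marginal = marginalize(retrieval_scores, answer_probs)
best_answer = max(marginal, key=marginal.get)
print(best_answer, marginal)
```

The weakly retrieved passage still contributes, but only in proportion to its small retrieval probability, so an irrelevant passage cannot dominate the final answer.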

(Please see the paper for details :smiley:
https://huggingface.co/transformers/model_doc/rag.html)