Language model to search for an answer in a huge collection of (unrelated) paragraphs

I want to build a question/answer language model to search a large collection of paragraphs.

Say 10k paragraphs, and I want to find relevant answers in them.

There are 2 issues I don’t know how to solve.

  1. Existing solutions often identify an answer from a single short paragraph. I don’t know how to deal with a large number of paragraphs; a naive approach would be going through each paragraph and identifying an answer in each of them.

  2. Existing solutions will generate an answer even when fed an unrelated paragraph, and they don’t give a confidence number. If I have 10k paragraphs to search for an answer in, and only 3 paragraphs contain an answer, existing solutions won’t let me rule out the unrelated paragraphs.

Is there a way to generate a document embedding first (using both a question and a paragraph), so I can use the embedding to find candidate paragraphs and then do the actual answer search on those? And when there is no answer, I’d like to get a confidence number that’s below my answer threshold.

Are there any papers dealing with this problem?


DPR & RAG may be the references you want.

Regarding your questions, here is how DPR addresses them:

  1. DPR (retriever module) selects the top-k paragraphs from about 20 million possible Wikipedia paragraphs (not just 10k, and you can also build your own corpus) using very fast MIPS (maximum inner product search) implemented with FAISS.

  2. DPR (reader module) produces a relevance score for each of the top-k passages, so this is the confidence number you mentioned; see the sketch below.
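Here is a minimal sketch of the two stages with the transformers and faiss libraries, assuming the public facebook/dpr-*-single-nq-base checkpoints; the paragraphs, the question, and k are placeholders:

```python
import faiss
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
    DPRReader, DPRReaderTokenizer,
)

paragraphs = ["First paragraph ...", "Second paragraph ...", "Third paragraph ..."]
question = "Your question here?"

# Stage 1 (retriever): embed every paragraph once and put the vectors
# in an exact inner-product (MIPS) FAISS index.
ctx_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
with torch.no_grad():
    ctx_emb = ctx_enc(**ctx_tok(paragraphs, padding=True, truncation=True,
                                return_tensors="pt")).pooler_output

index = faiss.IndexFlatIP(ctx_emb.shape[1])
index.add(ctx_emb.numpy())

q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
with torch.no_grad():
    q_emb = q_enc(**q_tok(question, return_tensors="pt")).pooler_output

_, top_k = index.search(q_emb.numpy(), 2)          # ids of candidate paragraphs only
candidates = [paragraphs[i] for i in top_k[0]]

# Stage 2 (reader): score each candidate passage and extract answer spans.
# relevance_logits is the per-passage confidence you can threshold to rule
# out unrelated paragraphs.
r_tok = DPRReaderTokenizer.from_pretrained("facebook/dpr-reader-single-nq-base")
reader = DPRReader.from_pretrained("facebook/dpr-reader-single-nq-base")
inputs = r_tok(questions=question, titles=[""] * len(candidates), texts=candidates,
               padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = reader(**inputs)

print(outputs.relevance_logits)                    # one confidence score per candidate
for span in r_tok.decode_best_spans(inputs, outputs, num_spans=3):
    print(span.text, float(span.relevance_score))
```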

Finally, RAG is an improvement over DPR where (1) you can combine different passages directly (both relevant and irrelevant) to produce the final answer by “marginalization”, and (2) the final answer is generated in free form, not necessarily contained in any of the passages.

(Please see the paper for details :smiley:
https://huggingface.co/transformers/model_doc/rag.html )
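For completeness, a minimal generation sketch adapted from that doc page, using the small dummy wiki index that ships with the retriever; swap in your own indexed dataset for real use (this needs datasets and faiss installed):

```python
from transformers import RagRetriever, RagSequenceForGeneration, RagTokenizer

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
# use_dummy_dataset keeps the download small; replace it with your own corpus/index later
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

inputs = tokenizer("who holds the record in 100m freestyle", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```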

Hi Jung & HF Community.
I am implementing a RAG process… with a daily update.
I can easily merge the dataset objects using datasets.concatenate_datasets()
but I have two questions:

  1. I cannot merge the indices… even if I call .load_faiss_index() on each part, the concatenated object has no index.
  2. Is this the best way to search a large corpus, or would it be best to load each dataset onto a separate node and scan across a cluster?

I am following transformers/use_own_knowledge_dataset.py at master · huggingface/transformers · GitHub, creating a new folder for each daily dataset.

Hi @Berowne, it’s a very interesting question.
Daily-updated datasets should be an important use case.
Unfortunately, I have no answer. Maybe @lhoestq could help us here?

Hi ! If you concatenate two datasets, you will need to build a new FAISS index for the new dataset.
Depending on the number of documents you have and the type of index you use, you can either:

  • rebuild a new index from scratch (easy, but slow for big datasets and advanced index types; see the sketch after this list)
  • or update one of the existing indexes with new vectors (useful if you only need to add a few new documents to an already existing big dataset, for example)
  • or merge the two indexes together (possible only for certain index types; here is an example for IVF)
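For the first option, a minimal sketch with the datasets library, assuming each daily shard was built with use_own_knowledge_dataset.py and already has an "embeddings" column (the paths are hypothetical):

```python
from datasets import concatenate_datasets, load_from_disk

# hypothetical paths to two daily shards, each with an "embeddings" column
old_ds = load_from_disk("knowledge_base/day_1")
new_ds = load_from_disk("knowledge_base/day_2")

full_ds = concatenate_datasets([old_ds, new_ds])  # the result has no FAISS index yet
full_ds.add_faiss_index(column="embeddings")      # rebuild the index from scratch

# persist the index so a later job can reload it instead of rebuilding
full_ds.save_faiss_index("embeddings", "knowledge_base/full_index.faiss")
```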

Regarding your second question: it is definitely a reasonable way to search a large corpus, though it may also depend on your needs in terms of speed and accuracy, and on the size of your dataset.