Language model to search for an answer in a huge collection of (unrelated) paragraphs

I want to build a question/answer language model to search a large collection of paragraphs.

Say 10k paragraphs, and I want to find relevant answers in them.

There are 2 issues I don’t know how to solve.

  1. Existing solutions often identify an answer from a single short paragraph. I don’t know how to deal with a large number of paragraphs; a naive approach would be going through each paragraph and identifying an answer in each of them.

  2. Existing solutions will generate an answer even when fed an unrelated paragraph, and they don’t give a confidence number. If I have 10k paragraphs to search for an answer in, and only 3 paragraphs contain an answer, existing solutions won’t let me rule out the unrelated paragraphs.

Is there a way to generate a document embedding first (using both a question and a paragraph), so I can use the embedding to find candidate paragraphs and then do the actual answer search on those? And when there is no answer, I’d like to get a confidence number that’s below my answer threshold.

Are there any papers dealing with this problem?


DPR & RAG may be the references you want.

Regarding your questions, here is how DPR addresses them:

  1. DPR (retriever module) selects the top-k paragraphs from about 20 million possible Wikipedia paragraphs (not just 10k, and you can also build your own corpus) using very fast MIPS (maximum inner product search) implemented with FAISS.

  2. DPR (reader module) produces a relevance score for each of the top-k passages, so this is the confidence number you mentioned; see the sketch below.
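Here is a minimal sketch of the two stages with the transformers and faiss libraries, assuming the public facebook/dpr-*-single-nq-base checkpoints; the paragraphs, the question, and k are placeholders:

```python
import faiss
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
    DPRReader, DPRReaderTokenizer,
)

paragraphs = ["First paragraph ...", "Second paragraph ...", "Third paragraph ..."]
question = "Your question here?"

# Stage 1 (retriever): embed every paragraph once and put the vectors
# in an exact inner-product (MIPS) FAISS index.
ctx_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
with torch.no_grad():
    ctx_emb = ctx_enc(**ctx_tok(paragraphs, padding=True, truncation=True,
                                return_tensors="pt")).pooler_output

index = faiss.IndexFlatIP(ctx_emb.shape[1])
index.add(ctx_emb.numpy())

q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
with torch.no_grad():
    q_emb = q_enc(**q_tok(question, return_tensors="pt")).pooler_output

_, top_k = index.search(q_emb.numpy(), 2)          # ids of candidate paragraphs only
candidates = [paragraphs[i] for i in top_k[0]]

# Stage 2 (reader): score each candidate passage and extract answer spans.
# relevance_logits is the per-passage confidence you can threshold to rule
# out unrelated paragraphs.
r_tok = DPRReaderTokenizer.from_pretrained("facebook/dpr-reader-single-nq-base")
reader = DPRReader.from_pretrained("facebook/dpr-reader-single-nq-base")
inputs = r_tok(questions=question, titles=[""] * len(candidates), texts=candidates,
               padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = reader(**inputs)

print(outputs.relevance_logits)                    # one confidence score per candidate
for span in r_tok.decode_best_spans(inputs, outputs, num_spans=3):
    print(span.text, float(span.relevance_score))
```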

Finally, RAG is an improvement over DPR where (1) you can combine different passages directly (both relevant and irrelevant) to produce the final answer by “marginalization”, and (2) the final answer is generated in free form, not necessarily contained in any of the passages.

(Please see the paper for details :smiley:
https://huggingface.co/transformers/model_doc/rag.html )
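For completeness, a minimal generation sketch adapted from that doc page, using the small dummy wiki index that ships with the retriever; swap in your own indexed dataset for real use (this needs datasets and faiss installed):

```python
from transformers import RagRetriever, RagSequenceForGeneration, RagTokenizer

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
# use_dummy_dataset keeps the download small; replace it with your own corpus/index later
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

inputs = tokenizer("who holds the record in 100m freestyle", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```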

Hi Jung & HF Community.
I am implementing a RAG process… with a daily update.
I can easily merge the dataset objects using datasets.concatenate_datasets()
but I have two questions:

  1. I cannot merge the indices… even if I call .load_faiss_index() on each part, the concatenated object has no index.
  2. Is this the best way to search a large corpus, or would it be best to load each dataset onto a separate node and scan across a cluster?

I am following transformers/use_own_knowledge_dataset.py at master · huggingface/transformers · GitHub, creating a new folder for each daily dataset.

Hi @Berowne, it’s a very interesting question.
Daily-updated datasets should be an important use case.
Unfortunately, I have no answer. Maybe @lhoestq could help us here?

Hi ! If you concatenate two datasets, you will need to build a new FAISS index for the new dataset.
Depending on the number of documents you have and the type of index you use, you can either:

  • rebuild a new index from scratch (easy, but slow for big datasets and advanced index types; see the sketch after this list)
  • or update one of the existing indexes with new vectors (useful if you only need to add a few new documents to an already existing big dataset, for example)
  • or merge the two indexes together (possible only for certain index types; here is an example for IVF)
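For the first option, a minimal sketch with the datasets library, assuming each daily shard was built with use_own_knowledge_dataset.py and already has an "embeddings" column (the paths are hypothetical):

```python
from datasets import concatenate_datasets, load_from_disk

# hypothetical paths to two daily shards, each with an "embeddings" column
old_ds = load_from_disk("knowledge_base/day_1")
new_ds = load_from_disk("knowledge_base/day_2")

full_ds = concatenate_datasets([old_ds, new_ds])  # the result has no FAISS index yet
full_ds.add_faiss_index(column="embeddings")      # rebuild the index from scratch

# persist the index so a later job can reload it instead of rebuilding
full_ds.save_faiss_index("embeddings", "knowledge_base/full_index.faiss")
```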

Regarding your second question: it is definitely a reasonable way to search a large corpus, though it may also depend on your needs in terms of speed and accuracy, and on the size of your dataset.