I want to build a question-answering model that searches a large collection of paragraphs (say, 10k) and finds relevant answers in them.
There are two issues I don't know how to solve:
1. Existing solutions usually extract an answer from a single short paragraph. I don't know how to scale this to a large number of paragraphs. A naive approach would be to go through every paragraph and extract an answer from each one.
2. Existing solutions will produce an answer even when given an unrelated paragraph, and they don't report a confidence score. If I have 10k paragraphs to search and only 3 of them contain an answer, existing solutions give me no way to rule out the unrelated paragraphs.
Is there a way to first generate an embedding (using both the question and a paragraph), use that embedding to find candidate paragraphs, and only then run the actual answer search on those candidates? And when there is no answer, I'd like to get a confidence number that's below my answer threshold.
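To make the two-stage idea concrete, here is a minimal sketch of the first stage only (candidate retrieval with a score cutoff). It rests on several assumptions: plain TF-IDF vectors stand in for a learned embedding, the question and each paragraph are embedded separately (so paragraph vectors can be computed once up front) rather than jointly, and the function names, the example paragraphs, and the `threshold=0.35` value are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(paragraphs):
    """Smoothed inverse document frequency for each term in the collection."""
    n = len(paragraphs)
    df = Counter()
    for p in paragraphs:
        df.update(set(tokenize(p)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def embed(text, idf):
    """L2-normalised sparse TF-IDF vector; terms unseen in the collection are dropped."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(a, b):
    """Dot product of two L2-normalised sparse vectors (equals cosine similarity)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(t, 0.0) for t, v in a.items())


def retrieve(question, paragraphs, para_vecs, idf, top_k=3, threshold=0.35):
    """Return up to top_k (score, paragraph) pairs whose score reaches the threshold.

    An empty list means "no candidate is similar enough", i.e. no answer.
    The threshold is an assumed value and would need tuning on real data.
    """
    q = embed(question, idf)
    scored = sorted(((cosine(q, v), i) for i, v in enumerate(para_vecs)), reverse=True)
    return [(score, paragraphs[i]) for score, i in scored[:top_k] if score >= threshold]


# Toy collection standing in for the 10k paragraphs.
paragraphs = [
    "The Eiffel Tower was completed in 1889 and stands in Paris.",
    "Photosynthesis converts sunlight, water and carbon dioxide into glucose.",
    "Python is a programming language created by Guido van Rossum.",
]
idf = build_idf(paragraphs)
para_vecs = [embed(p, idf) for p in paragraphs]  # computed once, reused per question

hits = retrieve("When was the Eiffel Tower completed?", paragraphs, para_vecs, idf)
no_hits = retrieve("Who won the football match yesterday?", paragraphs, para_vecs, idf)

print(hits)     # one candidate: the Eiffel Tower paragraph
print(no_hits)  # empty list: nothing clears the threshold
```

Only the paragraphs returned by `retrieve` would then be passed to the actual answer-extraction model. The second question shares just the word "the" with one paragraph, so its best score is non-zero but falls below the cutoff, which is the "confidence below my threshold" behaviour I am after. With a learned encoder the same structure should apply, but the scores and a sensible threshold would be different.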
Are there any papers dealing with this problem?