How to build an NLP querying system for text documents that returns correct facts

Hi all,

I’m trying to add an option to our knowledge base software to query articles using natural language (Dutch) and have the system return factually correct answers. The document structure is similar to that of Wikipedia articles.

I’ve tried an LLM + RAG tutorial here on Hugging Face and one on the LangChain site. First problem: the vector database does not return articles matching a query. I have tried Chroma with different embeddings, tokenizers and chunk sizes.

To proceed I hardcoded the text of some of the articles into the prompt and asked for information that is present in that text. Most of the time the answers from the LLMs (Llama 3.2, Mistral and Gemma 3) contain errors: they mention facts that are simply not in the text, or facts that do exist but in separate parts of it. Also, queries that don’t use the exact same wording as the text often aren’t recognized.

I guess LLM + RAG is not the way to implement such a system, since it’s not just about text generation but also about finding the facts and verifying that every answer is correct.

Can anyone point me in a direction of a solution?

thanks,
rob


You could try several things:

Before calling the LLM, use a retriever to grab the passages that contain the facts. You could use BM25 and/or dense search. Also experiment with chunk size: too big loses focus, too small loses context. Then try a Dutch/multilingual “reader” model that scores each retrieved chunk against the question. Set a score threshold: if a chunk passes, return the answer from it; if none does, return that the system doesn’t know instead of a wrong answer.
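As a minimal sketch of the retrieve-then-threshold idea (a toy BM25 over whitespace tokens; the `retrieve` helper, the threshold value, and the example documents are my own illustration, not from any particular library — in practice you’d use a proper Dutch tokenizer and a library like `rank_bm25` or Elasticsearch):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    df = Counter()  # document frequency of each term
    for d in docs_tokens:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def retrieve(query, docs, threshold=0.1):
    """Return (best_doc, score), or (None, score) when nothing
    scores above the threshold -- i.e. answer 'I don't know'
    instead of handing the LLM an irrelevant chunk."""
    q = query.lower().split()
    toks = [d.lower().split() for d in docs]
    scores = bm25_scores(q, toks)
    best = max(range(len(docs)), key=lambda i: scores[i])
    if scores[best] < threshold:
        return None, scores[best]
    return docs[best], scores[best]

docs = [
    "Amsterdam is de hoofdstad van Nederland",
    "Parijs is de hoofdstad van Frankrijk",
    "De kat zit op de mat",
]
hit, _ = retrieve("wat is de hoofdstad van Nederland", docs)
miss, _ = retrieve("quantum computing", docs)  # no overlap -> None
```

Only the chunk that clears the threshold would then be put into the LLM prompt; everything below it is rejected up front, which is where most of the hallucinated “facts” get filtered out.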

I hope this helps 🙂


Thanks for the suggestions
