How to build an NLP querying system for text documents that returns correct facts

Hi all,

I’m trying to add an option to our knowledge base software to query articles using natural language (Dutch) and have the system return factually correct answers. The document structure is similar to that of Wikipedia articles.

I’ve tried an LLM + RAG tutorial here on Hugging Face and one on the LangChain site. First problem: the vector database does not return articles matching a query. I have tried Chroma with different embeddings, tokenizers and chunk sizes.

To proceed I hardcoded the text of some of the articles into the prompt and asked for information that is present in that text. Most of the time the answers from the LLMs (Llama 3.2, Mistral and Gemma 3) contain errors: they mention facts that are simply not in the text, or facts that do exist but in separate parts of it. Also, queries that don’t use the exact same wording as the text often aren’t recognized.

I guess LLM + RAG is not the way to implement such a system, since it’s not just about text generation but also about finding the facts and verifying that every answer is correct.

Can anyone point me in a direction of a solution?

thanks,
rob


You could try several things:

Before calling the LLM, use a retriever to grab the passages that contain the facts. You could use BM25 and/or dense search. Also experiment with chunk size: too big loses focus, too small loses context. Then try a Dutch/multilingual “reader” model that scores each retrieved chunk against the question. Set a score threshold: if a chunk passes, return the answer from it; if none does, return that the system doesn’t know instead of a wrong answer.
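As a minimal sketch of the retrieve-then-threshold idea (a toy BM25 over whitespace tokens; the `retrieve` helper, the threshold value, and the example documents are my own illustration, not from any particular library — in practice you’d use a proper Dutch tokenizer and a library like `rank_bm25` or Elasticsearch):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    df = Counter()  # document frequency of each term
    for d in docs_tokens:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def retrieve(query, docs, threshold=0.1):
    """Return (best_doc, score), or (None, score) when nothing
    scores above the threshold -- i.e. answer 'I don't know'
    instead of handing the LLM an irrelevant chunk."""
    q = query.lower().split()
    toks = [d.lower().split() for d in docs]
    scores = bm25_scores(q, toks)
    best = max(range(len(docs)), key=lambda i: scores[i])
    if scores[best] < threshold:
        return None, scores[best]
    return docs[best], scores[best]

docs = [
    "Amsterdam is de hoofdstad van Nederland",
    "Parijs is de hoofdstad van Frankrijk",
    "De kat zit op de mat",
]
hit, _ = retrieve("wat is de hoofdstad van Nederland", docs)
miss, _ = retrieve("quantum computing", docs)  # no overlap -> None
```

Only the chunk that clears the threshold would then be put into the LLM prompt; everything below it is rejected up front, which is where most of the hallucinated “facts” get filtered out.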

I hope this helps 🙂


Thanks for the suggestions
