In RAG systems, who's really responsible for hallucination... the model, the retriever, or the data?

I’ve been thinking a lot about how we define and evaluate hallucinations in Retrieval-Augmented Generation (RAG) setups.
Let’s say a model “hallucinates”, but it turns out the retrieved context, although semantically similar, was factually wrong or irrelevant. Is that really the model’s fault?
Or is the failure in:

  1. The retriever, for selecting misleading context?
  2. The documents themselves, which may be poorly structured or outdated?

Almost every hallucination detection effort I’ve come across focuses on the generation step, but in RAG, the damage may already be done by the time the model gets the context.

I’m also building a lightweight playground tool to inspect what dense embedding models (like OpenAI’s text-embedding-3-small) actually retrieve in a RAG pipeline. The idea is to help developers explore whether good-seeming results are actually relevant, or just semantically close.
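
At its core, the check would look something like this (a minimal sketch, not the tool itself, assuming the official openai Python SDK >= 1.0, numpy, and an OPENAI_API_KEY in the environment; the helper names are just illustrative):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts with text-embedding-3-small."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

def inspect_retrieval(query: str, chunks: list[str], top_k: int = 5) -> None:
    """Print chunks ranked by cosine similarity to the query, so a human can judge relevance."""
    vectors = embed([query] + chunks)
    q, docs = vectors[0], vectors[1:]
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    for idx in np.argsort(sims)[::-1][:top_k]:
        print(f"{sims[idx]:.3f}  {chunks[idx][:80]}")

# Usage (my_chunks is a hypothetical list of chunk strings):
# inspect_retrieval("When was the refund policy last updated?", my_chunks)
```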

1 Like

The fun answer is that it can be everywhere, lol.

First, I would build a gold-standard set so you can debug your workflow. For example, for each query: a reference answer plus the doc passages that really support it.
Then run retrieval against it and calculate the metrics overall or per k (see the sketch below).
To debug the model side, I would run the same queries across multiple models while keeping the retrieval fixed, to check that the failures are consistent.
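
Roughly like this; a sketch assuming `gold` maps each query to the set of passage IDs that really support its reference answer, and `retrieved` maps each query to the ranked list of passage IDs your retriever returned (both structures are assumptions about how you store the gold set):

```python
def recall_at_k(gold: dict[str, set[str]], retrieved: dict[str, list[str]], k: int) -> float:
    """Fraction of queries where at least one supporting passage appears in the top k."""
    hits = sum(1 for q, relevant in gold.items() if set(retrieved.get(q, [])[:k]) & relevant)
    return hits / len(gold)

def mrr(gold: dict[str, set[str]], retrieved: dict[str, list[str]]) -> float:
    """Mean reciprocal rank of the first supporting passage per query."""
    total = 0.0
    for q, relevant in gold.items():
        for rank, passage_id in enumerate(retrieved.get(q, []), start=1):
            if passage_id in relevant:
                total += 1.0 / rank
                break
    return total / len(gold)

# Usage: for k in (1, 3, 5, 10): print(k, recall_at_k(gold, retrieved, k))
```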

A retrieval hallucination can look like this: the generation is plausible, but it cites or uses the wrong passage, or the knowledge store itself is outdated.
The model’s answers, on the other hand, could contradict the given facts or make up new things entirely.

2 Likes

I think the models are always hallucinating; sometimes it makes sense, and other times it doesn’t. From the model’s point of view, it doesn’t even know what words are. The model knows a finite number of routes through its embeddings that input numbers take to reach output numbers. I would assume the RAG database is the most accurate part of the system.

A RAG upgrade to improve accuracy could be to keep all of the source documents and data used to build the RAG store completely intact in a separate database, where the RAG can reference document, page, and line number. That way the user is able to confirm validity and check accuracy using the surrounding context in the source data.
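
Something along these lines; a minimal sketch where the chunk fields and helper names are illustrative assumptions, not a specific library’s API:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    doc_id: str      # key into the untouched source-document store
    page: int
    line_start: int
    line_end: int

def cite(chunk: Chunk) -> str:
    """Human-readable citation the user can follow back to the source."""
    return f"{chunk.doc_id}, p.{chunk.page}, lines {chunk.line_start}-{chunk.line_end}"

def surrounding_context(chunk: Chunk, source_store: dict[str, list[str]], margin: int = 3) -> str:
    """Pull the cited lines plus a few neighbours from the intact source for verification."""
    lines = source_store[chunk.doc_id]
    start = max(0, chunk.line_start - 1 - margin)
    end = min(len(lines), chunk.line_end + margin)
    return "\n".join(lines[start:end])
```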

1 Like

Yes, you are just following a next-token distribution using vectors.

Yes, that would be best. However, there are some limitations. For example, you could store the documents, but they could be mis-scanned, mis-recognized, or mis-tokenized, out of date, or non-factual (Reddit posts or tweets, for example).

You could squeeze out more performance by telling the LLM to answer and cite the page/line number. Additionally, when you chunk the documents, I’d advise an overlapping chunking method so that any answer sentence is covered by at least one chunk (see the sketch below). Then you have both the numerical representation and the text representation to validate against later.
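
A rough sketch of that overlapping chunking, assuming line-based windows so the line numbers needed for citation are preserved (the window and overlap sizes are arbitrary examples):

```python
def overlapping_chunks(lines: list[str], window: int = 20, overlap: int = 5) -> list[dict]:
    """Slide a window over the document's lines so every sentence lands in at least one chunk."""
    step = window - overlap
    chunks = []
    for start in range(0, max(len(lines) - overlap, 1), step):
        end = min(start + window, len(lines))
        chunks.append({
            "text": "\n".join(lines[start:end]),
            "line_start": start + 1,   # 1-based, to match the page/line citation format
            "line_end": end,
        })
        if end == len(lines):
            break
    return chunks
```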

2 Likes

In my experience, a lot of these issues stem from junk data being pulled during the retrieval process. Even with the best generation models, if the context the model is working with is irrelevant, outdated, or poorly filtered, it can easily lead to hallucinated outputs. The key issue often isn’t the model itself, but rather the quality of the data being retrieved. If we improve the retrieval process by ensuring that the retriever pulls only relevant, accurate, and up-to-date information, we can significantly reduce the chances of hallucinations in the generation stage. Filtering and preprocessing the data before it reaches the model is a crucial step in improving output quality.
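
As a rough sketch of that pre-generation filtering step (the thresholds and field names like `score` and `updated_at` are illustrative assumptions, not a specific library’s schema):

```python
from datetime import datetime, timedelta

def filter_context(results: list[dict], min_score: float = 0.75, max_age_days: int = 365) -> list[dict]:
    """Drop retrieved chunks that are too dissimilar or too old before they reach the prompt."""
    cutoff = datetime.now() - timedelta(days=max_age_days)
    kept = []
    for r in results:
        if r["score"] < min_score:
            continue  # semantically too far from the query
        if r.get("updated_at") and r["updated_at"] < cutoff:
            continue  # stale document
        kept.append(r)
    return kept
```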

3 Likes