In my experience, a lot of these issues stem from junk data being pulled in during retrieval. Even with the best generation model, if the context it works with is irrelevant, outdated, or poorly filtered, the output can easily drift into hallucination. The key issue often isn’t the model itself but the quality of the data being retrieved. If we improve retrieval so that the retriever pulls only relevant, accurate, and up-to-date information, we can significantly reduce the chance of hallucinations in the generation stage. Filtering and preprocessing the data before it reaches the model is a crucial step in improving output quality.
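To make that concrete, here’s a minimal sketch of the kind of pre-generation filter I have in mind, dropping low-relevance or stale chunks before they ever reach the prompt. The `RetrievedChunk` fields, the thresholds, and the `filter_chunks` helper are illustrative assumptions, not the API of any particular RAG framework:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class RetrievedChunk:
    text: str                # chunk content returned by the retriever
    score: float             # similarity score from the retriever
    last_updated: datetime   # freshness metadata attached at indexing time

def filter_chunks(chunks, min_score=0.75, max_age_days=365):
    """Keep only relevant, reasonably fresh chunks before generation."""
    cutoff = datetime.now() - timedelta(days=max_age_days)
    kept = [
        c for c in chunks
        if c.score >= min_score and c.last_updated >= cutoff
    ]
    # Put the most relevant surviving context first in the prompt.
    return sorted(kept, key=lambda c: c.score, reverse=True)
```

Even a simple gate like this (score threshold plus a freshness cutoff) catches a lot of the junk that otherwise ends up as "context" the model then confidently paraphrases.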
In RAG systems, who's really responsible for hallucination... the model, the retriever, or the data?