AI Accuracy Issues When Analyzing Financial Reports: Seeking Solutions for Persistent Hallucinations

Hello,

I’m implementing an AI chat feature for our application’s HTML reports and facing accuracy issues. I’d appreciate your insights on the following problem:

Current Scenario

We’ve integrated an AI chat panel into our application’s reporting system so that users can get summaries of, and ask questions about, financial reports (such as income statements and balance sheets).

The Issue

The AI consistently fails to extract accurate data from these reports, even for simple queries. For example, when asked “How many operating expenses are there?” or “Give a breakdown of the top 5 operating expenses for October” against an income statement, the AI cannot produce the correct answer even after multiple attempts.
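
Queries like these are deterministic aggregations rather than language tasks, which is part of why a plain LLM struggles with them. As a point of reference, here is a minimal plain-Python sketch of the computation a correct answer requires (the field names and figures are illustrative assumptions, not our real report schema):

```python
import heapq

# Hypothetical expense rows; field names and amounts are illustrative only.
expenses = [
    {"name": "Payroll",   "month": "Oct", "amount": 42000},
    {"name": "Marketing", "month": "Oct", "amount": 8000},
    {"name": "Rent",      "month": "Oct", "amount": 5000},
    {"name": "Software",  "month": "Oct", "amount": 3000},
    {"name": "Insurance", "month": "Oct", "amount": 2500},
    {"name": "Utilities", "month": "Oct", "amount": 1200},
    {"name": "Travel",    "month": "Sep", "amount": 7000},
]

def top_expenses(rows, month, n=5):
    """Return the n largest expenses for the given month, sorted descending."""
    in_month = [r for r in rows if r["month"] == month]
    return heapq.nlargest(n, in_month, key=lambda r: r["amount"])

for row in top_expenses(expenses, "Oct"):
    print(row["name"], row["amount"])
```

Any approach that has the model emit or call code like this, instead of counting tokens itself, sidesteps the hallucination problem for this class of query.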

Solutions We’ve Tried

  1. HTML Solution:
  • Added metadata attributes to our HTML to help the AI understand the document hierarchy
  • Included specific instructions in our system prompts about parsing the HTML structure
  • Result: Still inaccurate responses
  2. PDF Solution:
  • Converted the reports to PDF format
  • Result: the AI continues to hallucinate and returns incorrect counts and statistics
  3. Image Solution:
  • Tried image-based reports
  • Result: same accuracy and hallucination issues
  4. JSON Solution:
  • Converted the data to structured JSON, expecting better results
  • Even with a simple flat dataset, the AI struggles with basic counting queries
  • Even ChatGPT on the web ultimately resorted to running code to answer the question
  • Example: as a test, we gave it an array of flat user records and asked “How many males are there?”, and it couldn’t answer correctly

Dummy Data:

[{
  "id": 1,
  "first_name": "Broderic",
  "last_name": "Kidder",
  "email": "bkidder0@cdbaby.com",
  "gender": "Male",
  "ip_address": "94.205.115.55"
},{},{},{}]
  5. New Responses API with GPT-4o:
  • Built a proof of concept using GPT-4o with the new Responses API
  • Result: still experienced the same hallucination issues with the flat JSON dataset
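
For what it’s worth, the counting query that tripped up the models is trivial when done deterministically. A minimal sketch against records shaped like the dummy data above (only the `gender` field matters here; the extra records are made-up filler):

```python
import json

def count_by_field(records, field, value):
    """Count records whose given field equals the given value (case-insensitive)."""
    return sum(1 for r in records if str(r.get(field, "")).casefold() == value.casefold())

# A few records shaped like the dummy data above (illustrative only).
raw = """[
  {"id": 1, "first_name": "Broderic", "last_name": "Kidder",   "gender": "Male"},
  {"id": 2, "first_name": "Jane",     "last_name": "Example",  "gender": "Female"},
  {"id": 3, "first_name": "John",     "last_name": "Example",  "gender": "Male"}
]"""
records = json.loads(raw)
print(count_by_field(records, "gender", "Male"))  # 2
```

The LLM can be asked to generate or invoke this kind of code (as ChatGPT did on the web), rather than to count items in its context window directly.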

Questions

  1. Are we taking the wrong approach to this problem?
  2. Could our system prompts be insufficient? (We’ve experimented with multiple variations)
  3. Is AI technology simply not mature enough for this kind of data extraction yet?
  4. Are there best practices or alternative approaches we should consider?

Models Tried:

  1. GPT-4o

GPT-4o used Beautiful Soup and pandas in Python via the code interpreter, but it still failed in some scenarios. We also tested Claude with the same dataset, and its answers were inaccurate in some cases as well.

Any help or guidance would be appreciated.

Thanks,


Since a standalone LLM or VLM is often not suitable for tasks that require a high level of accuracy, why not take a RAG-like approach?

Does it work with financial data as well, or only with natural-language datasets?
My understanding is that RAG is best suited to natural-language datasets.

In my case, we need a solution where users will be asking financial questions from our reports and expect accurate results.


I don’t know whether it will work in the end, but RAG has many uses beyond the “chatbot that answers by referring to a database” it is usually associated with, so I think it would suit this purpose as well.

The basic idea of RAG is to have the LLM do only the things it is best at, and to combine it with other programs and services.
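
One way to make that split concrete: the model’s only job is to translate the user’s question into a small structured query (e.g. via function/tool calling), and ordinary code executes it against the report data. A minimal sketch, where the query schema and field names are my own assumptions rather than any particular API:

```python
# The LLM translates "How many males are there?" into a structured query;
# execution is fully deterministic.
def execute_query(query, records):
    """Run a tiny count/sum query dict against a list of record dicts."""
    matching = [
        r for r in records
        if all(r.get(k) == v for k, v in query.get("filter", {}).items())
    ]
    if query["op"] == "count":
        return len(matching)
    if query["op"] == "sum":
        return sum(r[query["field"]] for r in matching)
    raise ValueError(f"unsupported op: {query['op']}")

records = [
    {"id": 1, "gender": "Male"},
    {"id": 2, "gender": "Female"},
    {"id": 3, "gender": "Male"},
]
# What the model might emit for "How many males are there?"
query = {"op": "count", "filter": {"gender": "Male"}}
print(execute_query(query, records))  # 2
```

The model then only phrases the returned number for the user; it never counts anything itself.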

Of course, if there were a huge LLM or VLM that could memorize everything, this would be easier, but the cost…