AI Accuracy Issues When Analyzing Financial Reports: Seeking Solutions for Persistent Hallucinations

Hello,

I’m implementing an AI chat feature for our application’s HTML reports and facing accuracy issues. I’d appreciate your insights on the following problem:

Current Scenario

We’ve integrated an AI chat panel into our application’s reporting system so that users can get summaries of, and ask questions about, financial reports (such as income statements and balance sheets).

The Issue

The AI consistently fails to extract accurate data from these reports, even for simple queries. For example, when asked “How many operating expenses are there?” or “Give a breakdown of the top 5 operating expenses for October” against an income statement, the AI cannot produce the correct answer even after multiple attempts.
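
Queries like these are deterministic aggregations rather than language tasks, which is part of why a plain LLM struggles with them. As a point of reference, here is a minimal plain-Python sketch of the computation a correct answer requires (the field names and figures are illustrative assumptions, not our real report schema):

```python
import heapq

# Hypothetical expense rows; field names and amounts are illustrative only.
expenses = [
    {"name": "Payroll",   "month": "Oct", "amount": 42000},
    {"name": "Marketing", "month": "Oct", "amount": 8000},
    {"name": "Rent",      "month": "Oct", "amount": 5000},
    {"name": "Software",  "month": "Oct", "amount": 3000},
    {"name": "Insurance", "month": "Oct", "amount": 2500},
    {"name": "Utilities", "month": "Oct", "amount": 1200},
    {"name": "Travel",    "month": "Sep", "amount": 7000},
]

def top_expenses(rows, month, n=5):
    """Return the n largest expenses for the given month, sorted descending."""
    in_month = [r for r in rows if r["month"] == month]
    return heapq.nlargest(n, in_month, key=lambda r: r["amount"])

for row in top_expenses(expenses, "Oct"):
    print(row["name"], row["amount"])
```

Any approach that has the model emit or call code like this, instead of counting tokens itself, sidesteps the hallucination problem for this class of query.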

Solutions We’ve Tried

  1. HTML Solution:
  • Added metadata attributes to our HTML to help the AI understand the document hierarchy
  • Included specific instructions in our system prompts about parsing the HTML structure
  • Result: Still inaccurate responses
  2. PDF Solution:
  • Converted the reports to PDF format
  • Result: the AI continues to hallucinate and returns incorrect counts and statistics
  3. Image Solution:
  • Tried image-based reports
  • Result: same accuracy and hallucination issues
  4. JSON Solution:
  • Converted the data to structured JSON, expecting better results
  • Even with a simple flat dataset, the AI struggles with basic counting queries
  • Even ChatGPT on the web ultimately resorted to running code to answer the question
  • Example: as a test, we gave it an array of flat user records and asked “How many males are there?”, and it couldn’t answer correctly

Dummy Data:

[{
  "id": 1,
  "first_name": "Broderic",
  "last_name": "Kidder",
  "email": "bkidder0@cdbaby.com",
  "gender": "Male",
  "ip_address": "94.205.115.55"
},{},{},{}]
  5. New Responses API with GPT-4o:
  • Built a proof of concept using GPT-4o with the new Responses API
  • Result: still experienced the same hallucination issues with the flat JSON dataset
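
For what it’s worth, the counting query that tripped up the models is trivial when done deterministically. A minimal sketch against records shaped like the dummy data above (only the `gender` field matters here; the extra records are made-up filler):

```python
import json

def count_by_field(records, field, value):
    """Count records whose given field equals the given value (case-insensitive)."""
    return sum(1 for r in records if str(r.get(field, "")).casefold() == value.casefold())

# A few records shaped like the dummy data above (illustrative only).
raw = """[
  {"id": 1, "first_name": "Broderic", "last_name": "Kidder",   "gender": "Male"},
  {"id": 2, "first_name": "Jane",     "last_name": "Example",  "gender": "Female"},
  {"id": 3, "first_name": "John",     "last_name": "Example",  "gender": "Male"}
]"""
records = json.loads(raw)
print(count_by_field(records, "gender", "Male"))  # 2
```

The LLM can be asked to generate or invoke this kind of code (as ChatGPT did on the web), rather than to count items in its context window directly.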

Questions

  1. Are we taking the wrong approach to this problem?
  2. Could our system prompts be insufficient? (We’ve experimented with multiple variations)
  3. Is AI technology simply not mature enough for this kind of data extraction yet?
  4. Are there best practices or alternative approaches we should consider?

Models Tried:

  1. GPT-4o

GPT-4o used Beautiful Soup and pandas in Python via the code interpreter, but it still failed in some scenarios. We also tested Claude with the same dataset, and its answers were inaccurate in some cases as well.

Any help or guidance would be appreciated.

Thanks,


Since a standalone LLM or VLM is often not suitable for tasks that require a high level of accuracy, why not take a RAG-like approach?

Does it work with financial data as well, or only with natural-language datasets?
My understanding is that RAG is best suited to natural-language datasets.

In my case, we need a solution where users will be asking financial questions from our reports and expect accurate results.


I don’t know whether it will work in the end, but RAG has many uses beyond the “chatbot that answers by referring to a database” it is usually associated with, so I think it would suit this purpose as well.

The basic idea of RAG is to have the LLM do only the things it is best at, and to combine it with other programs and services.
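
One way to make that split concrete: the model’s only job is to translate the user’s question into a small structured query (e.g. via function/tool calling), and ordinary code executes it against the report data. A minimal sketch, where the query schema and field names are my own assumptions rather than any particular API:

```python
# The LLM translates "How many males are there?" into a structured query;
# execution is fully deterministic.
def execute_query(query, records):
    """Run a tiny count/sum query dict against a list of record dicts."""
    matching = [
        r for r in records
        if all(r.get(k) == v for k, v in query.get("filter", {}).items())
    ]
    if query["op"] == "count":
        return len(matching)
    if query["op"] == "sum":
        return sum(r[query["field"]] for r in matching)
    raise ValueError(f"unsupported op: {query['op']}")

records = [
    {"id": 1, "gender": "Male"},
    {"id": 2, "gender": "Female"},
    {"id": 3, "gender": "Male"},
]
# What the model might emit for "How many males are there?"
query = {"op": "count", "filter": {"gender": "Male"}}
print(execute_query(query, records))  # 2
```

The model then only phrases the returned number for the user; it never counts anything itself.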

Of course, if there were a huge LLM or VLM that could memorize everything, this would be easier, but the cost…