Hello,
I’m implementing an AI chat feature for our application’s HTML reports and facing accuracy issues. I’d appreciate your insights on the following problem:
Current Scenario
We’ve integrated an AI chat panel into our application’s reporting system so users can get summaries and ask questions about financial reports (such as income statements and balance sheets).
The Issue
The AI consistently fails to extract accurate data from these reports, even for simple queries. For example, when asked “How many operating expenses are there?” or “Give a breakdown of the top 5 operating expenses for the month of October” against an income statement, the AI cannot provide the correct answer even after multiple attempts.
Solutions We’ve Tried
- HTML Solution:
  - Added metadata attributes to our HTML to help the AI understand the document hierarchy
  - Included specific instructions in our system prompts about parsing the HTML structure
  - Result: Still inaccurate responses
- PDF Solution:
  - Converted reports to PDF format
  - Result: The AI continues to hallucinate and provide incorrect counts and statistics
- Image Solution:
  - Tried image-based reports
  - Result: Same issues with accuracy and hallucination
- JSON Solution:
  - Converted data to structured JSON format, expecting better results
  - Even with simple flat datasets, the AI struggles with basic counting queries
  - Even ChatGPT on the web ultimately resorted to running code to answer the question
  - Example: As a test, we gave it a flat array of user records and asked “How many males are there?”, and it couldn’t answer correctly
Dummy Data:
[{
"id": 1,
"first_name": "Broderic",
"last_name": "Kidder",
"email": "bkidder0@cdbaby.com",
"gender": "Male",
"ip_address": "94.205.115.55"
},{},{},{}]
- New Responses API with GPT-4o:
  - Built a proof of concept using GPT-4o with the new Responses API
  - Result: Still experienced the same hallucination issues with a flat JSON dataset
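The expected answer for the flat JSON test can be computed deterministically, which gives a ground truth to compare the model's replies against. Below is a minimal sketch, assuming records shaped like the dummy data above. Only the first record comes from the dummy data; the other sample records and the `count_by_field` helper are hypothetical and only for illustration.

```python
import json
from collections import Counter

# Hypothetical sample shaped like the dummy data above. Only the first
# record comes from the post; the rest are made up for illustration.
RAW = """
[
  {"id": 1, "first_name": "Broderic", "last_name": "Kidder",
   "email": "bkidder0@cdbaby.com", "gender": "Male",
   "ip_address": "94.205.115.55"},
  {"id": 2, "first_name": "Ana", "gender": "Female"},
  {"id": 3, "first_name": "Lee", "gender": "Male"},
  {}
]
"""


def count_by_field(records, field):
    """Count records per value of `field`, skipping records that lack it."""
    return Counter(r[field] for r in records if field in r)


records = json.loads(RAW)
gender_counts = count_by_field(records, "gender")

# Number of records whose "gender" is exactly "Male".
print(gender_counts["Male"])
```

The count comes from code rather than from the model reading the array, so it does not vary between attempts.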
Questions
- Are we taking the wrong approach to this problem?
- Could our system prompts be insufficient? (We’ve experimented with multiple variations)
- Is AI technology simply not mature enough for this kind of data extraction yet?
- Are there best practices or alternative approaches we should consider?
Models Tried:
- GPT-4o
It used Beautiful Soup and pandas in Python via the code interpreter, but it still failed in some scenarios. We also tested Claude with the dataset, and its answers were also inaccurate in some cases.
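The code-interpreter approach amounts to extracting rows deterministically and only then counting or ranking them. Here is a minimal sketch of that idea, using Python's standard-library `html.parser` in place of Beautiful Soup and pandas so it runs without extra packages. The sample table, the `data-section` attribute, and the `RowCollector` and `top_expenses` names are all hypothetical and do not reflect our actual report markup.

```python
from html.parser import HTMLParser

# Hypothetical income-statement fragment. The labels, amounts, and the
# data-section attribute are invented for illustration only.
SAMPLE_HTML = """
<table>
  <tr data-section="operating-expense"><td>Salaries</td><td>12,000.00</td></tr>
  <tr data-section="operating-expense"><td>Rent</td><td>3,500.00</td></tr>
  <tr data-section="revenue"><td>Sales</td><td>40,000.00</td></tr>
  <tr data-section="operating-expense"><td>Software</td><td>1,250.50</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect (section, [cell texts]) for every <tr> in the document."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._section = None
        self._cells = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._section = dict(attrs).get("data-section")
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._in_cell = True
            self._cells.append("")

    def handle_data(self, data):
        # Only text inside a <td> is kept; whitespace between tags is ignored.
        if self._in_cell:
            self._cells[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            self.rows.append((self._section, self._cells))
            self._cells = None


def top_expenses(html, n=5):
    """Return (count, top-n [(label, amount)]) for operating-expense rows."""
    parser = RowCollector()
    parser.feed(html)
    expenses = [
        (cells[0], float(cells[1].replace(",", "")))
        for section, cells in parser.rows
        if section == "operating-expense" and len(cells) >= 2
    ]
    expenses.sort(key=lambda item: item[1], reverse=True)
    return len(expenses), expenses[:n]


count, top = top_expenses(SAMPLE_HTML)
print(count, top)
```

With this split, the model would only need to phrase the computed count and ranking, not derive them from raw markup.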
Any help or guidance would be appreciated.
Thanks,