Extraction of tabular data from a PDF

Hello everyone,
I’m working on a RAG system, and many of my documents contain tabular data. I tried extractors such as PyPDF to load these tables, but when I tested my LLM on the extracted tables, the responses often contained incorrect information. (I’m using GPT-4o, and the questions were factual and simple to answer from the tables, so I don’t think the issue is hallucination.)
Has anyone found a creative or reliable approach for extracting tabular data from PDFs efficiently?
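
For reference, here is a minimal sketch of one structure-preserving alternative that could be compared against plain text extraction. It assumes the third-party pdfplumber library is installed and that the tables are detectable by its default settings, neither of which has been verified on these documents. The helper names `rows_to_markdown` and `extract_tables_as_markdown` are hypothetical. The idea is to keep rows and columns explicit (here as Markdown tables) before the text reaches the LLM, rather than passing a flattened text stream.

```python
from typing import List, Optional

Row = List[Optional[str]]


def rows_to_markdown(rows: List[Row]) -> str:
    """Serialize table rows as a Markdown table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)

    def clean(cell: Optional[str]) -> str:
        # Empty cells may be None; newlines and pipes inside a cell would break the row.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    def fmt(row: Row) -> str:
        # Pad short rows so every line has the same number of columns.
        cells = [clean(c) for c in row] + [""] * (width - len(row))
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(r) for r in rows[1:])
    return "\n".join(lines)


def extract_tables_as_markdown(pdf_path: str) -> List[str]:
    """Return one Markdown table per table that pdfplumber detects in the PDF."""
    import pdfplumber  # assumption: third-party dependency, imported lazily

    tables_md = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_tables() yields each table as a list of rows of cell strings (or None).
            for table in page.extract_tables():
                tables_md.append(rows_to_markdown(table))
    return tables_md
```

Each returned string could then be indexed as its own chunk so that a row's values stay next to their column headers. Whether this improves answers depends on how well table detection works for a given PDF layout.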
