Hi,
I’m currently working on building a question-answering model using an LLM (LLaMA). My data source is PDFs: I have 200 PDF files and I use PyPDF2 to extract the text. The tables inside the PDFs get extracted along with everything else, but their structure comes out mangled, and when I tested the model with that mangled table data, it wasn’t able to answer my questions.
So, can anyone recommend a better method to tackle this scenario?
Or is there a large language model that can query table data?
Please find the sample table data from inside a PDF file below.
Sample questions will be like: what is the specification for network data?
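For context, my extraction loop is roughly the following (a minimal sketch; the filename is a placeholder):

```python
from PyPDF2 import PdfReader  # pip install PyPDF2

reader = PdfReader("spec_sheet.pdf")  # placeholder filename
text = ""
for page in reader.pages:
    # extract_text() flattens the page layout: table cells come back as a
    # run of words with no row/column boundaries, which is why the table
    # structure is lost by the time the text reaches the model.
    text += page.extract_text() + "\n"
```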
I think your problem is two-fold: (1) extracting the table data correctly, and then (2) querying it using an LLM.
For the latter, there are models specifically trained to convert table data to text (such as RUCAIBox/mtl-data-to-text · Hugging Face) that could be used to generate natural-language text from your tables before querying them.
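Something along these lines should work (a minimal sketch following the pattern on the MVP model cards; the linearized "head | relation | tail" input and the table values here are made up for illustration):

```python
from transformers import MvpTokenizer, MvpForConditionalGeneration

tokenizer = MvpTokenizer.from_pretrained("RUCAIBox/mvp")
model = MvpForConditionalGeneration.from_pretrained("RUCAIBox/mtl-data-to-text")

# Linearize a table row as "head | relation | tail" triples joined by [SEP];
# the values below are placeholders standing in for your extracted cells.
data = (
    "Describe the following data: "
    "Network | interface | Gigabit Ethernet [SEP] "
    "Network | bandwidth | 100 Mbps"
)
inputs = tokenizer(data, return_tensors="pt")
generated_ids = model.generate(**inputs, max_length=64)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

The generated sentence can then be dropped into your retrieval corpus alongside the ordinary prose from the PDFs.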
I am encountering the same problem. Do you have any thoughts on how to extract the tables correctly? I tried unstructured.io, but it wasn’t able to detect or extract the tables.
Thanks!
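For reference, what I tried looked roughly like this (a minimal sketch; the filename is a placeholder, and the "hi_res" strategy needs unstructured’s extra PDF dependencies installed):

```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="spec_sheet.pdf",     # placeholder filename
    strategy="hi_res",             # layout-model strategy needed for tables
    infer_table_structure=True,    # ask for an HTML rendering of each table
)
for el in elements:
    if el.category == "Table":
        # Populated only when table detection actually succeeds.
        print(el.metadata.text_as_html)
```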
Have you tried img2table? https://betterprogramming.pub/extracting-tables-from-images-in-python-made-easy-ier-3be959555f6f
It performs well: even if you have many tables on one page, it extracts them all and saves them to a single Excel workbook, on separate sheets.
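A minimal sketch of that workflow (filenames are placeholders; Tesseract must be installed for the OCR backend):

```python
from img2table.document import PDF
from img2table.ocr import TesseractOCR

# OCR backend; img2table supports several, Tesseract is the simplest to set up.
ocr = TesseractOCR(lang="eng")

pdf = PDF(src="spec_sheet.pdf")  # placeholder filename

# Detects the tables on every page and writes them all to one workbook,
# one sheet per extracted table.
pdf.to_xlsx("tables.xlsx", ocr=ocr)
```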
I recently came across pandoc (https://pandoc.org), a tool that can convert documents between many formats. The common format connecting its readers and writers is a JSON structure (the pandoc AST). I’m now exploring the quality of this approach and how it works on various documents.
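One caveat for this thread: pandoc has no PDF *reader*, so you’d first need a separate PDF-to-HTML/DOCX step. With that in mind, the JSON AST can be inspected from Python like this (a minimal sketch; the input filename is a placeholder):

```python
import json
import subprocess

# Dump the document to pandoc's JSON AST (pandoc must be on PATH).
result = subprocess.run(
    ["pandoc", "sample.docx", "--to", "json"],  # placeholder input file
    capture_output=True, text=True, check=True,
)
ast = json.loads(result.stdout)

# Tables show up as "Table" blocks in the AST's top-level "blocks" list.
table_blocks = [b for b in ast["blocks"] if b["t"] == "Table"]
print(f"Found {len(table_blocks)} table(s)")
```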
Came across this from LlamaIndex and found it helpful.