Hi,
I’m currently working on building a question-answering model using an LLM (LLaMA). My data source is PDFs: I have 200 PDF files and I use PyPDF2 to extract the text. The tables inside the PDFs get extracted along with everything else, but their structure comes out mangled, and when I tested the model with that mangled table data, it wasn’t able to answer my questions.
So, can anyone recommend a better method to tackle this scenario?
Or is there a large language model that can query table data?
Please find the sample table data from inside a PDF file below.
Sample questions will be like: what is the specification for network data?
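For context, my extraction loop is roughly the following (a minimal sketch; the filename is a placeholder):

```python
from PyPDF2 import PdfReader  # pip install PyPDF2

reader = PdfReader("spec_sheet.pdf")  # placeholder filename
text = ""
for page in reader.pages:
    # extract_text() flattens the page layout: table cells come back as a
    # run of words with no row/column boundaries, which is why the table
    # structure is lost by the time the text reaches the model.
    text += page.extract_text() + "\n"
```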
I think your problem is two-fold: (1) extracting the table data correctly, and then (2) querying it using an LLM.
For the latter, there are models specifically trained to convert table data to text (such as RUCAIBox/mtl-data-to-text · Hugging Face) that could be used to generate natural-language text from your tables before querying them.
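Something along these lines should work (a minimal sketch following the pattern on the MVP model cards; the linearized "head | relation | tail" input and the table values here are made up for illustration):

```python
from transformers import MvpTokenizer, MvpForConditionalGeneration

tokenizer = MvpTokenizer.from_pretrained("RUCAIBox/mvp")
model = MvpForConditionalGeneration.from_pretrained("RUCAIBox/mtl-data-to-text")

# Linearize a table row as "head | relation | tail" triples joined by [SEP];
# the values below are placeholders standing in for your extracted cells.
data = (
    "Describe the following data: "
    "Network | interface | Gigabit Ethernet [SEP] "
    "Network | bandwidth | 100 Mbps"
)
inputs = tokenizer(data, return_tensors="pt")
generated_ids = model.generate(**inputs, max_length=64)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

The generated sentence can then be dropped into your retrieval corpus alongside the ordinary prose from the PDFs.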
I am encountering the same problem. Do you have any thoughts on how to extract the tables correctly? I tried unstructured.io, but it wasn’t able to detect or extract the tables.
Thanks!
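For reference, what I tried looked roughly like this (a minimal sketch; the filename is a placeholder, and the "hi_res" strategy needs unstructured’s extra PDF dependencies installed):

```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="spec_sheet.pdf",     # placeholder filename
    strategy="hi_res",             # layout-model strategy needed for tables
    infer_table_structure=True,    # ask for an HTML rendering of each table
)
for el in elements:
    if el.category == "Table":
        # Populated only when table detection actually succeeds.
        print(el.metadata.text_as_html)
```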
Have you tried img2table? https://betterprogramming.pub/extracting-tables-from-images-in-python-made-easy-ier-3be959555f6f
It performs well: even if you have many tables on one page, it extracts them all and saves them to a single Excel workbook, on separate sheets.
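A minimal sketch of that workflow (filenames are placeholders; Tesseract must be installed for the OCR backend):

```python
from img2table.document import PDF
from img2table.ocr import TesseractOCR

# OCR backend; img2table supports several, Tesseract is the simplest to set up.
ocr = TesseractOCR(lang="eng")

pdf = PDF(src="spec_sheet.pdf")  # placeholder filename

# Detects the tables on every page and writes them all to one workbook,
# one sheet per extracted table.
pdf.to_xlsx("tables.xlsx", ocr=ocr)
```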
I recently came across pandoc (https://pandoc.org), a tool that can convert documents between many formats. The common format connecting its readers and writers is a JSON structure (the pandoc AST). I’m now exploring the quality of this approach and how it works on various documents.
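One caveat for this thread: pandoc has no PDF *reader*, so you’d first need a separate PDF-to-HTML/DOCX step. With that in mind, the JSON AST can be inspected from Python like this (a minimal sketch; the input filename is a placeholder):

```python
import json
import subprocess

# Dump the document to pandoc's JSON AST (pandoc must be on PATH).
result = subprocess.run(
    ["pandoc", "sample.docx", "--to", "json"],  # placeholder input file
    capture_output=True, text=True, check=True,
)
ast = json.loads(result.stdout)

# Tables show up as "Table" blocks in the AST's top-level "blocks" list.
table_blocks = [b for b in ast["blocks"] if b["t"] == "Table"]
print(f"Found {len(table_blocks)} table(s)")
```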
Came across this from LlamaIndex and found it helpful.