LLM model for table data

Hi,

I’m building a question-answering model with an LLM (Llama). My data source is 200 PDF files, and I use PyPDF2 to extract the text. The tables inside the PDFs get extracted too, but their structure comes out scrambled. When I tested the model on that scrambled table data, it could not answer my questions.

So, can anyone recommend a better method for handling this scenario?

Or is there a large language model designed for querying table data?

Please find the sample table data from inside the PDF file.

Sample questions will look like: what is the specification for network data?

3 Likes

I think your problem is two-fold: (1) extracting the table data correctly and (2) querying it with an LLM.

For the latter, there are models specifically trained to convert table data to text (such as RUCAIBox/mtl-data-to-text on Hugging Face). You could use one to generate natural-language text from your tables before querying them.
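To make the "table to text" idea concrete, here is a minimal rule-based sketch. It is not the RUCAIBox model, just a plain-Python stand-in that turns each row into a "column: value" sentence an LLM can read. The function name and the table contents are made up for illustration.

```python
from typing import List


def linearize_table(header: List[str], rows: List[List[str]]) -> List[str]:
    """Turn each table row into one 'column: value' sentence."""
    sentences = []
    for row in rows:
        # Pair every cell with its column name; skip empty cells.
        pairs = [f"{col}: {cell}" for col, cell in zip(header, row) if cell.strip()]
        sentences.append("; ".join(pairs) + ".")
    return sentences


# Hypothetical table contents, invented for illustration only.
header = ["Parameter", "Specification"]
rows = [["Network data", "10 Gbps"], ["Storage", ""]]
for sentence in linearize_table(header, rows):
    print(sentence)
```

A trained data-to-text model would produce more fluent sentences, but even this flat form keeps each value attached to its column name, which is what gets lost in raw PyPDF2 output.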

7 Likes

I am encountering the same problem. Do you have any thoughts on how to extract the tables correctly? I tried unstructured.io, but it could neither detect nor extract the tables.

Thanks!

1 Like

Have you tried img2table? https://betterprogramming.pub/extracting-tables-from-images-in-python-made-easy-ier-3be959555f6f

It performs well. Even if a page contains many tables, it extracts them all and saves them in a single Excel file, on separate sheets.
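A rough sketch of how that could look for a folder of PDFs, assuming img2table and a local Tesseract install. The helper names and folder layout are hypothetical; the `PDF` class, `TesseractOCR`, and `to_xlsx` come from img2table, but check the parameters against the version you install.

```python
from pathlib import Path
from typing import List


def xlsx_path_for(pdf_path: Path, out_dir: Path) -> Path:
    """Map <anything>/report.pdf to <out_dir>/report.xlsx."""
    return out_dir / (pdf_path.stem + ".xlsx")


def pdf_tables_to_xlsx(pdf_path: Path, out_dir: Path) -> Path:
    """Extract the tables of one PDF into one Excel workbook."""
    # Imported lazily so the path helper works without img2table installed.
    from img2table.document import PDF
    from img2table.ocr import TesseractOCR

    out_dir.mkdir(parents=True, exist_ok=True)
    dest = xlsx_path_for(pdf_path, out_dir)
    ocr = TesseractOCR(lang="eng")  # assumes Tesseract is installed locally
    PDF(str(pdf_path)).to_xlsx(dest=str(dest), ocr=ocr)
    return dest


def convert_folder(in_dir: Path, out_dir: Path) -> List[Path]:
    """Run the extraction over every PDF in a folder."""
    return [pdf_tables_to_xlsx(p, out_dir) for p in sorted(in_dir.glob("*.pdf"))]
```

Calling `convert_folder(Path("pdfs"), Path("tables"))` would then write one workbook per PDF, with the folder names swapped for your own.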

3 Likes

I recently came across pandoc (https://pandoc.org), a tool that converts documents between many formats. Its readers and writers are connected through a common intermediate representation, which can be serialized as JSON. I am now evaluating how well this approach works on various documents.

Came across this from LlamaIndex and found it helpful.

I read what you shared, but when I install the libraries I get an error about conflicting versions. Could you tell me which versions of the required libraries to use?

Your problem is quite simple to solve.

  1. Use the Python module Camelot to extract only the tables from the PDFs (Camelot: PDF Table Extraction for Humans, Camelot 0.11.0 documentation).

  2. Export that data to CSV, XLSX, etc.

  3. Use PandasAI to query the data. This requires the OpenAI API or another LLM, such as Llama 3 or Mistral, or a local model served through Ollama.
    (link - Introduction to PandasAI - PandasAI)
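The steps above can be sketched roughly as follows. Steps 1 and 2 use Camelot's `read_pdf` (installed as `camelot-py`). Step 3 is replaced here by a plain prompt builder, since PandasAI's interface has changed between releases, so treat that part as a stand-in rather than PandasAI itself. All function names and file paths are hypothetical.

```python
import csv
from pathlib import Path
from typing import List


def extract_tables_to_csv(pdf_path: str, out_dir: str) -> List[Path]:
    """Steps 1 and 2: pull tables out of a PDF and write one CSV per table."""
    import camelot  # lazy import; installed as the camelot-py package

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # flavor="lattice" expects ruled tables; "stream" is the alternative.
    tables = camelot.read_pdf(pdf_path, pages="all", flavor="lattice")
    paths = []
    for i, table in enumerate(tables):
        path = out / f"{Path(pdf_path).stem}_table{i}.csv"
        # table.df is a pandas DataFrame; drop its numeric index and header.
        table.df.to_csv(path, index=False, header=False)
        paths.append(path)
    return paths


def csv_to_markdown(csv_path: Path) -> str:
    """Render a CSV file as a Markdown table, treating the first row as header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = [[c.replace("\n", " ").strip() for c in row]
                for row in csv.reader(f) if row]
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def build_prompt(csv_path: Path, question: str) -> str:
    """Stand-in for step 3: combine the table and the question into one prompt."""
    table = csv_to_markdown(csv_path)
    return f"Answer using only this table:\n\n{table}\n\nQuestion: {question}"
```

The string from `build_prompt` can be sent to whichever LLM you use. Rendering the table as Markdown keeps rows and columns aligned, which is the structure that raw text extraction loses.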

Also, don’t forget that you can use ChatGPT for code generation and modification.

---------------- Jai Shree Ram ------------------

1 Like