Reading PDF tables in PDFs with different languages and layouts

andreaswt · February 8, 2024, 7:41am

I’m looking for an approach to extract table data from PDF files. My case is the following.

Input: A PDF file with an order. It contains a table with order lines. The PDFs may be in different languages and have different layouts. Common to all PDFs is that the order lines contain product IDs and their quantities.

Output: Structured data such as a JSON object or a list of order lines. The order lines should contain the product ID and quantity.
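
To make the target format concrete, here is a minimal sketch of the output shape described above. The field names `product_id` and `quantity` and the helper `order_lines_to_json` are illustrative assumptions, not a fixed schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class OrderLine:
    # Hypothetical field names; the real schema may differ.
    product_id: str
    quantity: int


def order_lines_to_json(lines):
    """Serialize a list of OrderLine objects to a JSON array string."""
    # ensure_ascii=False keeps non-English characters readable in the output.
    return json.dumps([asdict(line) for line in lines], ensure_ascii=False)


if __name__ == "__main__":
    example = [OrderLine("A-100", 2), OrderLine("B-205", 10)]
    print(order_lines_to_json(example))
```

Whatever extraction approach is used, its result would need to be mapped into a structure like this so it can be compared against the existing outputs.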

I have a few thousand old PDFs with their respective outputs, which I can use for training or fine-tuning a model. I have thought about parsing the PDFs as strings and training a simple reinforcement model. My hope is that there is an existing model made for this case, which I can fine-tune with the existing data.

An edge case in the PDFs is that the order lines table can be spread across two pages if there are too many order lines for one page, i.e. the first half is on one page and the second half is on another.
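
One possible way to handle the split-table case, sketched under the assumption that some extraction step has already produced one list of rows per page (each row being a list of cell strings). The function `merge_page_tables` is hypothetical. It treats the first row of the first non-empty page as the header and drops an identical header row if it is repeated on a continuation page.

```python
def merge_page_tables(pages):
    """Concatenate per-page table rows into one table.

    `pages` is a list of tables, one per page; each table is a list of rows,
    and each row is a list of cell strings. This input format is an assumption.
    """
    merged = []
    header = None
    for rows in pages:
        if not rows:
            continue  # Page had no table rows.
        if header is None:
            # First non-empty page: keep its first row as the header.
            header = rows[0]
            merged.extend(rows)
        else:
            # Continuation page: skip the header only if it is repeated verbatim.
            start = 1 if rows[0] == header else 0
            merged.extend(rows[start:])
    return merged


if __name__ == "__main__":
    page_1 = [["Product ID", "Quantity"], ["A-100", "2"]]
    page_2 = [["Product ID", "Quantity"], ["B-205", "10"]]
    print(merge_page_tables([page_1, page_2]))
```

This only covers the case where a continuation page repeats the header exactly or omits it. Layouts where a single order line is broken across the page boundary would need extra handling.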

What would be my best bet?
