I’m looking for an approach to extract table data from PDF files. My case is the following.
Input: A PDF file containing an order. It contains a table with order lines. The PDFs may be in different languages and have different layouts. What all the PDFs have in common is that the order lines contain product IDs and their quantities.
Output: Structured data such as a JSON object or a list of order lines. The order lines should contain the product ID and quantity.
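To make the target concrete, here is a minimal sketch of the kind of output I have in mind. The names `OrderLine`, `product_id`, `quantity`, and `order_lines` are placeholders I made up for illustration, not an established schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class OrderLine:
    # Hypothetical field names; any equivalent schema would do.
    product_id: str
    quantity: int


def order_to_json(lines):
    """Serialize a list of OrderLine objects to a JSON string."""
    return json.dumps({"order_lines": [asdict(line) for line in lines]})


# Made-up example values, only to show the shape of the output.
example = [OrderLine("A-100", 2), OrderLine("B-200", 10)]
print(order_to_json(example))
```

Whatever model or parser is used, it would need to produce something equivalent to this for each PDF.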
I have a few thousand old PDFs with their respective outputs, which I could use for training or fine-tuning a model. I have thought about extracting each PDF’s text as a string and training a simple reinforcement model. My hope is that there is an existing model made for this case, which I can fine-tune with the existing data.
An edge case in the PDFs is that the order lines table can be spread across two pages if there are too many order lines for one page, i.e. the first half is on one page and the second half is on another.
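One way I could imagine handling the split table, as a rough sketch only: assume some extraction step (for example a library such as pdfplumber) has already produced one list of rows per page, and that the first row on the first page is the header. The function name `merge_page_tables` and the input shape are my own assumptions. The pages are concatenated, and a header row repeated on a continuation page is dropped.

```python
def merge_page_tables(page_tables):
    """Concatenate per-page row lists into one table.

    page_tables: list of pages, each a list of rows (lists of strings).
    Assumes the first row of the first non-empty page is the header.
    """
    merged = []
    header = None
    for rows in page_tables:
        if not rows:
            continue  # page without table rows
        if header is None:
            header = rows[0]
            merged.append(header)
            body = rows[1:]
        elif rows[0] == header:
            body = rows[1:]  # header repeated on the continuation page
        else:
            body = rows  # continuation page without a header
        merged.extend(body)
    return merged


# Made-up rows, only to illustrate the two-page case.
page_1 = [["Product ID", "Qty"], ["A-100", "2"], ["B-200", "10"]]
page_2 = [["Product ID", "Qty"], ["C-300", "1"]]
print(merge_page_tables([page_1, page_2]))
```

This only works when the repeated header matches exactly; with differing layouts and languages, a more tolerant comparison would probably be needed.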
What would be my best bet?