I am trying to gather as much information as I can before deciding which route to take for this task.
I need to extract some standard information from invoice documents, such as the invoice number, product names and descriptions, invoice dates, etc. I have around 20,000 invoices from multiple vendors (around 200), and they therefore have different structures. This, of course, means the data I am after is positioned in different places from invoice to invoice.
What would be the most efficient approach for extracting this data with high accuracy?
a. Use the LayoutLMv3 model and train it on my custom data. This means first creating that custom data in the appropriate format: OCR-ing the invoice images with Tesseract or a similar library and using the resulting dataset as training data for LayoutLMv3. The problem is that even though my invoices are of fairly good, readable quality, Tesseract still has a lot of trouble identifying some of the text/symbols/punctuation/spaces. That would push me further into fine-tuning the Tesseract model itself by cropping small images containing text out of the original invoices and manually annotating them with the actual text they contain (ground truth). That is a lot of manual work; with 200 vendors, it would probably mean thousands of small images to annotate. (For reference, a minimal version of my current OCR step is shown after these options.)
b. Directly train an image-to-text model like Pix2Struct and hope that it produces accurate enough results on its own. I am not sure which architecture would even be the best fit here (CNNs?). Would that be anywhere near as accurate as the approach above?
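For reference, a minimal version of the OCR step I am running now, using pytesseract (the path is a placeholder):

```python
import pytesseract
from PIL import Image

# Placeholder path; each invoice is a scanned image.
image = Image.open("invoices/vendor_042/inv_0001.png")

# image_to_data returns word-level text with bounding boxes and confidences,
# which is the input LayoutLMv3 would need downstream.
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

for text, conf in zip(data["text"], data["conf"]):
    # Low-confidence words are exactly where Tesseract breaks down for me.
    if text.strip() and float(conf) < 60:
        print(f"uncertain: {text!r} (conf={conf})")
```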
Your task of extracting structured data from invoices is indeed challenging due to the diversity in vendor formats. Let me address your options and provide additional advice:
Option A: Fine-tuning LayoutLMv3
LayoutLMv3 is highly effective for structured document understanding, especially when documents have varied layouts. However, the bottleneck in your case is the OCR quality.
Tesseract Challenges: Fine-tuning Tesseract to improve OCR results across 200 vendors is indeed a monumental task. Instead, consider leveraging pre-trained OCR services like Google’s Vision API, AWS Textract, or Microsoft Azure’s OCR offerings. They generally outperform Tesseract on accuracy and can save significant time.
Custom Dataset Creation: After OCR, you can focus on converting the extracted text and layout information into the format required by LayoutLMv3. Tools like docTR might help streamline the OCR-to-dataset process.
Accuracy Potential: Once you have high-quality OCR output, LayoutLMv3 fine-tuned on a labeled dataset of your invoice fields can achieve high accuracy. A sketch of the conversion and fine-tuning steps follows.
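A minimal end-to-end sketch of this path, assuming you already have word-level text and pixel bounding boxes from your OCR tool (the label set, file path, and example inputs below are placeholders; real training would wrap this in a Dataset and Trainer loop):

```python
from PIL import Image
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification

# Hypothetical field labels for token classification.
LABELS = ["O", "INVOICE_NUMBER", "INVOICE_DATE", "PRODUCT_DESC", "TOTAL"]

# apply_ocr=False because we supply words/boxes from our own OCR step.
processor = LayoutLMv3Processor.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=False
)
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=len(LABELS)
)

def encode_example(image_path, words, boxes_px, word_labels):
    """words: OCR tokens; boxes_px: pixel boxes (x0, y0, x1, y1) from the OCR tool."""
    image = Image.open(image_path).convert("RGB")
    w, h = image.size
    # LayoutLMv3 expects boxes normalized to a 0-1000 coordinate space.
    boxes = [
        [int(1000 * x0 / w), int(1000 * y0 / h),
         int(1000 * x1 / w), int(1000 * y1 / h)]
        for (x0, y0, x1, y1) in boxes_px
    ]
    return processor(image, words, boxes=boxes, word_labels=word_labels,
                     truncation=True, padding="max_length", return_tensors="pt")

# One illustrative training step on a fabricated example.
enc = encode_example(
    "invoices/inv_0001.png",
    words=["Invoice", "#", "12345"],
    boxes_px=[(40, 30, 120, 55), (125, 30, 140, 55), (150, 30, 230, 55)],
    word_labels=[0, 0, 1],  # indices into LABELS
)
outputs = model(**enc)      # loss is computed because word_labels were supplied
outputs.loss.backward()
```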
Option B: Training an Image-to-Text Model
Using an image-to-text model like Pix2Struct is a viable option but comes with its own challenges:
Model Selection: Models like Pix2Struct and Donut are transformer-based vision models designed to process document images directly. They do not rely on a separate OCR step at all, which makes them attractive precisely where OCR struggles (see the sketch after these points).
Dataset Requirements: Training such a model from scratch requires a significant dataset of annotated invoice images. Fine-tuning a pre-trained model may reduce this burden but still requires labeled data.
Accuracy Considerations: While image-to-text models are improving, they might not yet match the precision of LayoutLMv3 when it comes to structured data extraction from varied layouts.
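To make the OCR-free interface concrete, here is a minimal inference sketch using a public Donut checkpoint fine-tuned on receipts (CORD); for invoices you would fine-tune on your own annotations, and the image path is a placeholder:

```python
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

checkpoint = "naver-clova-ix/donut-base-finetuned-cord-v2"  # receipts, not invoices
processor = DonutProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

image = Image.open("invoices/inv_0001.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Donut is prompted with a task token and decodes straight to a structured sequence.
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
)

sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "")
sequence = sequence.replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task token

print(processor.token2json(sequence))  # nested dict of extracted fields
```

Note that no OCR step appears anywhere in this pipeline, which is the appeal of Option B.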
Option C: Recommended Approach
Here’s a strategy that balances accuracy and effort:
Hybrid Approach: Use a robust OCR service (e.g., Azure, AWS Textract, or Google Vision) for text extraction and bounding boxes. This reduces manual effort compared to fine-tuning Tesseract.
Pre-trained Model Fine-Tuning: Fine-tune LayoutLMv3 on the OCR-processed dataset (or Donut directly on the page images). This takes advantage of these models’ strengths in understanding document structure.
Active Learning for Annotation: Instead of manually labeling thousands of samples, adopt active learning. Focus annotation on the cases where the model performs poorly, reducing annotation overhead; a sketch of one selection strategy follows this list.
Iterative Improvement: Start with a subset of data (e.g., 5-10 vendors) to validate your pipeline before scaling to all 200 vendors. Use tools like Label Studio to streamline the labeling process.
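For step 3, one simple selection strategy is to score each unlabeled invoice by the fine-tuned model’s token-level confidence and send only the least confident ones to annotators. A sketch, assuming the LayoutLMv3 `model` from the Option A snippet and a hypothetical `unlabeled_pool` of (path, encoding) pairs:

```python
import torch

@torch.no_grad()
def uncertainty_score(model, encoding):
    """Mean (1 - max softmax probability) over real tokens; higher = less confident."""
    logits = model(**encoding).logits               # (1, seq_len, num_labels)
    confidence = torch.softmax(logits, dim=-1).max(dim=-1).values
    mask = encoding["attention_mask"].bool()        # ignore padding tokens
    return (1.0 - confidence[mask]).mean().item()

# unlabeled_pool is hypothetical: an iterable of (image_path, encoding) pairs
# built with the same processor as in Option A (no word_labels needed for scoring).
scored = sorted(
    ((uncertainty_score(model, enc), path) for path, enc in unlabeled_pool),
    reverse=True,
)
to_annotate = [path for _, path in scored[:200]]  # label only the worst 200 first
```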
Alternative Tools to Explore
Tabula and Camelot: For born-digital PDFs (those with an embedded text layer, as opposed to scans) that contain structured tables, these can extract tabular data efficiently (first sketch below).
LangChain with OCR and LLMs: If you’re open to experimenting, use a pipeline that combines OCR with a general-purpose language model (e.g., a GPT-style model) to post-process the extracted text into structured fields (second sketch below).
Document Understanding AI: Pre-built solutions like Google’s Document AI are designed for tasks like invoice parsing and can be configured for high accuracy with minimal effort.
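If some of your invoices are born-digital PDFs rather than scans, a minimal Camelot sketch for pulling line-item tables (path, page, and flavor are guesses you would tune per vendor):

```python
import camelot

# Camelot works on text-based PDFs only, not scanned images.
tables = camelot.read_pdf("invoices/inv_0001.pdf", pages="1", flavor="lattice")

for table in tables:
    print(table.parsing_report)  # per-table accuracy/whitespace diagnostics
    df = table.df                # pandas DataFrame of the extracted line items
    print(df.head())
```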
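And a sketch of the OCR-plus-LLM post-processing idea; the model name, prompt, and field schema are illustrative, and any LLM that can return JSON would work:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_fields(ocr_text: str) -> dict:
    """Ask a general-purpose LLM to map raw OCR text onto a fixed schema."""
    prompt = (
        "Extract the following fields from this invoice text and answer with "
        "JSON only: invoice_number, invoice_date, vendor_name, line_items "
        "(a list of {description, quantity, unit_price}).\n\n" + ocr_text
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; use whatever model you have access to
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

This works best as a second pass over OCR output rather than a replacement for it, since the LLM never sees the page layout.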
Let me know if you’d like further help with any specific part of this process!