I am trying to gather as much information as I can before deciding which route to take for this task.
I need to extract some standard information from invoice documents, such as the invoice number, product names and descriptions, invoice dates, etc. I have around 20,000 invoices from multiple vendors (around 200), and they therefore have different structures. This, of course, means the data I am after is positioned in different places from invoice to invoice.
What would be the most efficient approach for extracting this data with high accuracy?
a. Use the LayoutLMv3 model and train it on my custom data. This means first creating that custom data in the appropriate format: OCR-ing the invoice images with Tesseract or a similar library and using the resulting dataset as training data for LayoutLMv3. The problem is that even though my invoices are of fairly good, readable quality, Tesseract still has a lot of trouble identifying some of the text/symbols/punctuation/spaces. That would push me further into fine-tuning the Tesseract model itself by cropping small images containing text out of the original invoices and manually annotating them with the actual text they contain (ground truth). That is a lot of manual work; with 200 vendors, it would probably mean thousands of small images to annotate. (For reference, a minimal version of my current OCR step is shown after these options.)
b. Directly train an image-to-text model like Pix2Struct and hope that it produces accurate enough results on its own. I am not sure which architecture would even be the best fit here (CNNs?). Would that be anywhere near as accurate as the approach above?
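For reference, a minimal version of the OCR step I am running now, using pytesseract (the path is a placeholder):

```python
import pytesseract
from PIL import Image

# Placeholder path; each invoice is a scanned image.
image = Image.open("invoices/vendor_042/inv_0001.png")

# image_to_data returns word-level text with bounding boxes and confidences,
# which is the input LayoutLMv3 would need downstream.
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

for text, conf in zip(data["text"], data["conf"]):
    # Low-confidence words are exactly where Tesseract breaks down for me.
    if text.strip() and float(conf) < 60:
        print(f"uncertain: {text!r} (conf={conf})")
```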
Your task of extracting structured data from invoices is indeed challenging due to the diversity in vendor formats. Let me address your options and provide additional advice:
Option A: Fine-tuning LayoutLMv3
LayoutLMv3 is highly effective for structured document understanding, especially when documents have varied layouts. However, the bottleneck in your case is the OCR quality.
Tesseract Challenges: Fine-tuning Tesseract to improve OCR results across 200 vendors is indeed a monumental task. Instead, consider leveraging pre-trained OCR services like Google’s Vision API, AWS Textract, or Microsoft Azure’s OCR offerings. They generally outperform Tesseract on accuracy and can save significant time.
Custom Dataset Creation: After OCR, you can focus on converting the extracted text and layout information into the format required by LayoutLMv3. Tools like docTR might help streamline the OCR-to-dataset process.
Accuracy Potential: Once you have high-quality OCR output, LayoutLMv3 fine-tuned on a labeled dataset of your invoice fields can achieve high accuracy. A sketch of the conversion and fine-tuning steps follows.
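A minimal end-to-end sketch of this path, assuming you already have word-level text and pixel bounding boxes from your OCR tool (the label set, file path, and example inputs below are placeholders; real training would wrap this in a Dataset and Trainer loop):

```python
from PIL import Image
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification

# Hypothetical field labels for token classification.
LABELS = ["O", "INVOICE_NUMBER", "INVOICE_DATE", "PRODUCT_DESC", "TOTAL"]

# apply_ocr=False because we supply words/boxes from our own OCR step.
processor = LayoutLMv3Processor.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=False
)
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=len(LABELS)
)

def encode_example(image_path, words, boxes_px, word_labels):
    """words: OCR tokens; boxes_px: pixel boxes (x0, y0, x1, y1) from the OCR tool."""
    image = Image.open(image_path).convert("RGB")
    w, h = image.size
    # LayoutLMv3 expects boxes normalized to a 0-1000 coordinate space.
    boxes = [
        [int(1000 * x0 / w), int(1000 * y0 / h),
         int(1000 * x1 / w), int(1000 * y1 / h)]
        for (x0, y0, x1, y1) in boxes_px
    ]
    return processor(image, words, boxes=boxes, word_labels=word_labels,
                     truncation=True, padding="max_length", return_tensors="pt")

# One illustrative training step on a fabricated example.
enc = encode_example(
    "invoices/inv_0001.png",
    words=["Invoice", "#", "12345"],
    boxes_px=[(40, 30, 120, 55), (125, 30, 140, 55), (150, 30, 230, 55)],
    word_labels=[0, 0, 1],  # indices into LABELS
)
outputs = model(**enc)      # loss is computed because word_labels were supplied
outputs.loss.backward()
```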
Option B: Training an Image-to-Text Model
Using an image-to-text model like Pix2Struct is a viable option but comes with its own challenges:
Model Selection: Models like Pix2Struct and Donut are transformer-based vision models designed to process document images directly. They do not rely on a separate OCR step at all, which makes them attractive precisely where OCR struggles (see the sketch after these points).
Dataset Requirements: Training such a model from scratch requires a significant dataset of annotated invoice images. Fine-tuning a pre-trained model may reduce this burden but still requires labeled data.
Accuracy Considerations: While image-to-text models are improving, they might not yet match the precision of LayoutLMv3 when it comes to structured data extraction from varied layouts.
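To make the OCR-free interface concrete, here is a minimal inference sketch using a public Donut checkpoint fine-tuned on receipts (CORD); for invoices you would fine-tune on your own annotations, and the image path is a placeholder:

```python
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

checkpoint = "naver-clova-ix/donut-base-finetuned-cord-v2"  # receipts, not invoices
processor = DonutProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

image = Image.open("invoices/inv_0001.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Donut is prompted with a task token and decodes straight to a structured sequence.
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
)

sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "")
sequence = sequence.replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task token

print(processor.token2json(sequence))  # nested dict of extracted fields
```

Note that no OCR step appears anywhere in this pipeline, which is the appeal of Option B.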
Option C: Recommended Approach
Here’s a strategy that balances accuracy and effort:
Hybrid Approach: Use a robust OCR service (e.g., Azure, AWS Textract, or Google Vision) for text extraction and bounding boxes. This reduces manual effort compared to fine-tuning Tesseract.
Pre-trained Model Fine-Tuning: Fine-tune LayoutLMv3 on the OCR-processed dataset (or Donut directly on the page images). This takes advantage of these models’ strengths in understanding document structure.
Active Learning for Annotation: Instead of manually labeling thousands of samples, adopt active learning. Focus annotation on the cases where the model performs poorly, reducing annotation overhead; a sketch of one selection strategy follows this list.
Iterative Improvement: Start with a subset of data (e.g., 5-10 vendors) to validate your pipeline before scaling to all 200 vendors. Use tools like Label Studio to streamline the labeling process.
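For step 3, one simple selection strategy is to score each unlabeled invoice by the fine-tuned model’s token-level confidence and send only the least confident ones to annotators. A sketch, assuming the LayoutLMv3 `model` from the Option A snippet and a hypothetical `unlabeled_pool` of (path, encoding) pairs:

```python
import torch

@torch.no_grad()
def uncertainty_score(model, encoding):
    """Mean (1 - max softmax probability) over real tokens; higher = less confident."""
    logits = model(**encoding).logits               # (1, seq_len, num_labels)
    confidence = torch.softmax(logits, dim=-1).max(dim=-1).values
    mask = encoding["attention_mask"].bool()        # ignore padding tokens
    return (1.0 - confidence[mask]).mean().item()

# unlabeled_pool is hypothetical: an iterable of (image_path, encoding) pairs
# built with the same processor as in Option A (no word_labels needed for scoring).
scored = sorted(
    ((uncertainty_score(model, enc), path) for path, enc in unlabeled_pool),
    reverse=True,
)
to_annotate = [path for _, path in scored[:200]]  # label only the worst 200 first
```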
Alternative Tools to Explore
Tabula and Camelot: For born-digital PDFs (those with an embedded text layer, as opposed to scans) that contain structured tables, these can extract tabular data efficiently (first sketch below).
LangChain with OCR and LLMs: If you’re open to experimenting, use a pipeline that combines OCR with a general-purpose language model (e.g., a GPT-style model) to post-process the extracted text into structured fields (second sketch below).
Document Understanding AI: Pre-built solutions like Google’s Document AI are designed for tasks like invoice parsing and can be configured for high accuracy with minimal effort.
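If some of your invoices are born-digital PDFs rather than scans, a minimal Camelot sketch for pulling line-item tables (path, page, and flavor are guesses you would tune per vendor):

```python
import camelot

# Camelot works on text-based PDFs only, not scanned images.
tables = camelot.read_pdf("invoices/inv_0001.pdf", pages="1", flavor="lattice")

for table in tables:
    print(table.parsing_report)  # per-table accuracy/whitespace diagnostics
    df = table.df                # pandas DataFrame of the extracted line items
    print(df.head())
```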
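And a sketch of the OCR-plus-LLM post-processing idea; the model name, prompt, and field schema are illustrative, and any LLM that can return JSON would work:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_fields(ocr_text: str) -> dict:
    """Ask a general-purpose LLM to map raw OCR text onto a fixed schema."""
    prompt = (
        "Extract the following fields from this invoice text and answer with "
        "JSON only: invoice_number, invoice_date, vendor_name, line_items "
        "(a list of {description, quantity, unit_price}).\n\n" + ocr_text
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; use whatever model you have access to
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

This works best as a second pass over OCR output rather than a replacement for it, since the LLM never sees the page layout.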
Let me know if you’d like further help with any specific part of this process!