Improving Key-Value Pair Extraction with LayoutLM and LiLT on Custom OCR Dataset

I am working on an OCR project to perform document understanding on a specific type of document with a fixed layout (e.g., mandate forms). The layout has minor variations across images. My goal is to accurately extract key-value pairs and output them in JSON format.

What I Have Done

  1. Dataset Preparation:
  • I created a dataset of 231 annotated images in FUNSD format using the UBIAI tool. The annotations include bounding boxes and labels (KEY, VALUE).
  2. Models Used:
  • I fine-tuned both LayoutLMv3 and LiLT on my dataset.
  • Training metrics look good (e.g., F1-score), but at test time the models fail to generalize and do not mark key-value pairs accurately.
  3. References:
  • I followed the LiLT fine-tuning tutorial.

Problems Faced

  1. The models work well during training but produce poor results on the test set, particularly when identifying key-value relationships.
  2. Some keys and values are either missed or mismatched in the output JSON.

Expected vs Actual Outcome

  • Expected: Accurate key-value extraction with proper bounding boxes and JSON output.
  • Actual: The models fail to generalize, often missing keys/values or mismatching pairs during testing.

What I Need Help With

  1. Suggestions on improving the architecture or training workflow to enhance model accuracy.
  2. Debugging why models trained on the custom FUNSD dataset are underperforming during testing.
  3. Recommendations for optimizing the dataset, preprocessing, or hyperparameters.

Code and Logs

Sample JSON Annotation

{
  "form": [
    {
      "box": [
        368,
        43,
        1205,
        84
      ],
      "text": "UMRNAx1S0014321011781012",
      "label": "VALUE",
      "words": [
        {
          "box": [
            368,
            43,
            1205,
            84
          ],
          "text": "UMRNAx1S0014321011781012",
          "iob_tag": "B-VALUE"
        }
      ]
    },
    {
      "box": [
        1336,
        42,
        1380,
        65
      ],
      "text": "Date",
      "label": "KEY",
      "words": [
        {
          "box": [
            1336,
            42,
            1380,
            65
          ],
          "text": "Date",
          "iob_tag": "B-KEY"
        }
      ]
    }
  ]
}
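
One thing worth double-checking with annotations like this is coordinate scaling: LayoutLMv3 and LiLT expect word boxes normalized to a 0-1000 range, while the annotation above stores raw pixel coordinates. Below is a minimal sketch (with hypothetical file paths and the label set implied by the annotation) of converting one FUNSD-style file into LayoutLMv3 inputs; LiLT works the same way with its own checkpoint and tokenizer.

import json
from PIL import Image
from transformers import AutoProcessor

# Hypothetical paths and label map -- adjust to your dataset.
ANNOTATION_PATH = "annotations/sample.json"
IMAGE_PATH = "images/sample.png"
LABEL2ID = {"O": 0, "B-KEY": 1, "I-KEY": 2, "B-VALUE": 3, "I-VALUE": 4}

def normalize_box(box, width, height):
    # LayoutLMv3/LiLT expect boxes scaled to a 0-1000 coordinate space.
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]

image = Image.open(IMAGE_PATH).convert("RGB")
width, height = image.size

with open(ANNOTATION_PATH) as f:
    annotation = json.load(f)

words, boxes, labels = [], [], []
for entity in annotation["form"]:
    for word in entity["words"]:
        words.append(word["text"])
        boxes.append(normalize_box(word["box"], width, height))
        labels.append(LABEL2ID[word["iob_tag"]])

# apply_ocr=False because the words and boxes come from the annotation, not Tesseract.
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
encoding = processor(
    image,
    words,
    boxes=boxes,
    word_labels=labels,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)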

Training Code
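
The training script itself did not come through in the post. As a stand-in, here is a rough sketch of the token-classification fine-tuning setup the LiLT tutorial uses with the Hugging Face Trainer; the checkpoint name, hyperparameters, and the train_dataset / eval_dataset objects are assumptions, not the code that produced the logs below.

import numpy as np
from transformers import (
    AutoModelForTokenClassification,
    Trainer,
    TrainingArguments,
)

LABEL2ID = {"O": 0, "B-KEY": 1, "I-KEY": 2, "B-VALUE": 3, "I-VALUE": 4}
ID2LABEL = {v: k for k, v in LABEL2ID.items()}

model = AutoModelForTokenClassification.from_pretrained(
    "SCUT-DLVCLab/lilt-roberta-en-base",  # or "microsoft/layoutlmv3-base"
    num_labels=len(LABEL2ID),
    id2label=ID2LABEL,
    label2id=LABEL2ID,
)

args = TrainingArguments(
    output_dir="lilt-mandate-forms",      # hypothetical run name
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    num_train_epochs=20,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

def compute_metrics(eval_pred):
    # Token-level accuracy over non-ignored positions (use seqeval for entity-level F1).
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    mask = labels != -100
    return {"accuracy": float(((preds == labels) & mask).sum() / max(mask.sum(), 1))}

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # assumed: dataset of encodings like the one above
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()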

Sample Output

  • Expected JSON:

{"UMRN": "UMRNAx1S0014321011781012", "Date": "17102024"}

  • Actual JSON:

{"UMRN": null, "Date": "17102024"}
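
How the predicted B-KEY / B-VALUE spans are turned into this JSON is not shown in the post, but in a pure token-classification setup that pairing step is a separate heuristic, and it is often where mismatches like the null above creep in. One simple geometric heuristic (a sketch, not the post's actual logic) pairs each VALUE with the nearest KEY on the same line to its left, or above it:

def pair_keys_and_values(entities):
    # entities: list of {"label": "KEY"|"VALUE", "text": str, "box": [x0, y0, x1, y1]}
    # Pair each VALUE with the nearest KEY on the same line to its left, or above it.
    keys = [e for e in entities if e["label"] == "KEY"]
    values = [e for e in entities if e["label"] == "VALUE"]
    result = {}
    for value in values:
        vx0, vy0 = value["box"][0], value["box"][1]
        best_key, best_dist = None, float("inf")
        for key in keys:
            kx1, ky0 = key["box"][2], key["box"][1]
            same_line = abs(ky0 - vy0) < 20             # rough same-row tolerance, in pixels
            if (same_line and kx1 <= vx0) or ky0 < vy0:
                dist = abs(vx0 - kx1) + 2 * abs(vy0 - ky0)  # penalize vertical offset more
                if dist < best_dist:
                    best_key, best_dist = key, dist
        if best_key is not None:
            result[best_key["text"]] = value["text"]
    return result

# Hypothetical usage with a key/value pair sitting on the same line:
entities = [
    {"label": "KEY", "text": "Date", "box": [1336, 42, 1380, 65]},
    {"label": "VALUE", "text": "17102024", "box": [1390, 42, 1500, 65]},
]
print(pair_keys_and_values(entities))   # {'Date': '17102024'}

When pairs come out mismatched, inspecting the raw token predictions first helps separate tagger errors from pairing-heuristic errors.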

Logs

Train Loss: 0.12, F1-Score: 0.92
Test Loss: 0.35, F1-Score: 0.68

Additional Information

  • The layout of the document has minor variations, which might be affecting model performance.
  • Is there a better preprocessing strategy or data augmentation technique that can help?
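
For a fixed layout with only minor variations, one lightweight augmentation option (a sketch, not something from the original post) is to apply small translations and brightness jitter while shifting the annotated boxes by the same amount, so the model sees each field in slightly different positions:

import random
from PIL import Image, ImageEnhance

def augment(image, boxes, max_shift=15):
    # Small random translation plus brightness jitter, keeping the word boxes
    # consistent with the shifted image. Boxes are [x0, y0, x1, y1] in pixels;
    # apply this BEFORE normalizing boxes to the 0-1000 range.
    dx = random.randint(-max_shift, max_shift)
    dy = random.randint(-max_shift, max_shift)

    # PIL's AFFINE transform maps output pixels back to input pixels,
    # so shifting the content by (+dx, +dy) uses (-dx, -dy) in the matrix.
    shifted = image.transform(
        image.size, Image.AFFINE, (1, 0, -dx, 0, 1, -dy), fillcolor="white"
    )
    shifted = ImageEnhance.Brightness(shifted).enhance(random.uniform(0.85, 1.15))

    w, h = image.size
    new_boxes = [
        [
            max(0, min(w, x0 + dx)),
            max(0, min(h, y0 + dy)),
            max(0, min(w, x1 + dx)),
            max(0, min(h, y1 + dy)),
        ]
        for x0, y0, x1, y1 in boxes
    ]
    return shifted, new_boxes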

I am working on a similar project; I use a VLM (InternVL2.5 MPO).
My current issue is that I am not able to extract a confidence score for the OCR output. Were you able to?
