Improving Key-Value Pair Extraction with LayoutLM and LiLT on Custom OCR Dataset

I am working on an OCR project to perform document understanding on a specific type of document with a fixed layout (e.g., mandate forms). The layout has minor variations across images. My goal is to accurately extract key-value pairs and output them in JSON format.

What I Have Done

  1. Dataset Preparation:
  • I created a dataset of 231 annotated images in FUNSD format using the UBIAI tool. The annotations include bounding boxes and labels (KEY, VALUE).
  2. Models Used:
  • I fine-tuned both LayoutLMv3 and LiLT on my dataset.
  • Training metrics look good (e.g., F1-score), but at test time the models fail to generalize and do not mark key-value pairs accurately.
  3. References:
  • I followed the LiLT fine-tuning tutorial.

Problems Faced

  1. The models work well during training but produce poor results on the test set, particularly when identifying key-value relationships.
  2. Some keys and values are either missed or mismatched in the output JSON.

Expected vs Actual Outcome

  • Expected: Accurate key-value extraction with proper bounding boxes and JSON output.
  • Actual: The models fail to generalize, often missing keys/values or mismatching pairs during testing.

What I Need Help With

  1. Suggestions on improving the architecture or training workflow to enhance model accuracy.
  2. Debugging why models trained on the custom FUNSD dataset are underperforming during testing.
  3. Recommendations for optimizing the dataset, preprocessing, or hyperparameters.

Code and Logs

Sample JSON Annotation

{
  "form": [
    {
      "box": [
        368,
        43,
        1205,
        84
      ],
      "text": "UMRNAx1S0014321011781012",
      "label": "VALUE",
      "words": [
        {
          "box": [
            368,
            43,
            1205,
            84
          ],
          "text": "UMRNAx1S0014321011781012",
          "iob_tag": "B-VALUE"
        }
      ]
    },
    {
      "box": [
        1336,
        42,
        1380,
        65
      ],
      "text": "Date",
      "label": "KEY",
      "words": [
        {
          "box": [
            1336,
            42,
            1380,
            65
          ],
          "text": "Date",
          "iob_tag": "B-KEY"
        }
      ]
    }
  ]
}
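
One thing worth double-checking with annotations like this is coordinate scaling: LayoutLMv3 and LiLT expect word boxes normalized to a 0-1000 range, while the annotation above stores raw pixel coordinates. Below is a minimal sketch (with hypothetical file paths and the label set implied by the annotation) of converting one FUNSD-style file into LayoutLMv3 inputs; LiLT works the same way with its own checkpoint and tokenizer.

import json
from PIL import Image
from transformers import AutoProcessor

# Hypothetical paths and label map -- adjust to your dataset.
ANNOTATION_PATH = "annotations/sample.json"
IMAGE_PATH = "images/sample.png"
LABEL2ID = {"O": 0, "B-KEY": 1, "I-KEY": 2, "B-VALUE": 3, "I-VALUE": 4}

def normalize_box(box, width, height):
    # LayoutLMv3/LiLT expect boxes scaled to a 0-1000 coordinate space.
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]

image = Image.open(IMAGE_PATH).convert("RGB")
width, height = image.size

with open(ANNOTATION_PATH) as f:
    annotation = json.load(f)

words, boxes, labels = [], [], []
for entity in annotation["form"]:
    for word in entity["words"]:
        words.append(word["text"])
        boxes.append(normalize_box(word["box"], width, height))
        labels.append(LABEL2ID[word["iob_tag"]])

# apply_ocr=False because the words and boxes come from the annotation, not Tesseract.
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
encoding = processor(
    image,
    words,
    boxes=boxes,
    word_labels=labels,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)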

Training Code
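
The training script itself did not come through in the post. As a stand-in, here is a rough sketch of the token-classification fine-tuning setup the LiLT tutorial uses with the Hugging Face Trainer; the checkpoint name, hyperparameters, and the train_dataset / eval_dataset objects are assumptions, not the code that produced the logs below.

import numpy as np
from transformers import (
    AutoModelForTokenClassification,
    Trainer,
    TrainingArguments,
)

LABEL2ID = {"O": 0, "B-KEY": 1, "I-KEY": 2, "B-VALUE": 3, "I-VALUE": 4}
ID2LABEL = {v: k for k, v in LABEL2ID.items()}

model = AutoModelForTokenClassification.from_pretrained(
    "SCUT-DLVCLab/lilt-roberta-en-base",  # or "microsoft/layoutlmv3-base"
    num_labels=len(LABEL2ID),
    id2label=ID2LABEL,
    label2id=LABEL2ID,
)

args = TrainingArguments(
    output_dir="lilt-mandate-forms",      # hypothetical run name
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    num_train_epochs=20,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

def compute_metrics(eval_pred):
    # Token-level accuracy over non-ignored positions (use seqeval for entity-level F1).
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    mask = labels != -100
    return {"accuracy": float(((preds == labels) & mask).sum() / max(mask.sum(), 1))}

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # assumed: dataset of encodings like the one above
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()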

Sample Output

  • Expected JSON:

{"UMRN": "UMRNAx1S0014321011781012", "Date": "17102024"}

  • Actual JSON:

{"UMRN": null, "Date": "17102024"}
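
How the predicted B-KEY / B-VALUE spans are turned into this JSON is not shown in the post, but in a pure token-classification setup that pairing step is a separate heuristic, and it is often where mismatches like the null above creep in. One simple geometric heuristic (a sketch, not the post's actual logic) pairs each VALUE with the nearest KEY on the same line to its left, or above it:

def pair_keys_and_values(entities):
    # entities: list of {"label": "KEY"|"VALUE", "text": str, "box": [x0, y0, x1, y1]}
    # Pair each VALUE with the nearest KEY on the same line to its left, or above it.
    keys = [e for e in entities if e["label"] == "KEY"]
    values = [e for e in entities if e["label"] == "VALUE"]
    result = {}
    for value in values:
        vx0, vy0 = value["box"][0], value["box"][1]
        best_key, best_dist = None, float("inf")
        for key in keys:
            kx1, ky0 = key["box"][2], key["box"][1]
            same_line = abs(ky0 - vy0) < 20             # rough same-row tolerance, in pixels
            if (same_line and kx1 <= vx0) or ky0 < vy0:
                dist = abs(vx0 - kx1) + 2 * abs(vy0 - ky0)  # penalize vertical offset more
                if dist < best_dist:
                    best_key, best_dist = key, dist
        if best_key is not None:
            result[best_key["text"]] = value["text"]
    return result

# Hypothetical usage with a key/value pair sitting on the same line:
entities = [
    {"label": "KEY", "text": "Date", "box": [1336, 42, 1380, 65]},
    {"label": "VALUE", "text": "17102024", "box": [1390, 42, 1500, 65]},
]
print(pair_keys_and_values(entities))   # {'Date': '17102024'}

When pairs come out mismatched, inspecting the raw token predictions first helps separate tagger errors from pairing-heuristic errors.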

Logs

Train Loss: 0.12, F1-Score: 0.92
Test Loss: 0.35, F1-Score: 0.68

Additional Information

  • The layout of the document has minor variations, which might be affecting model performance.
  • Is there a better preprocessing strategy or data augmentation technique that can help?
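
For a fixed layout with only minor variations, one lightweight augmentation option (a sketch, not something from the original post) is to apply small translations and brightness jitter while shifting the annotated boxes by the same amount, so the model sees each field in slightly different positions:

import random
from PIL import Image, ImageEnhance

def augment(image, boxes, max_shift=15):
    # Small random translation plus brightness jitter, keeping the word boxes
    # consistent with the shifted image. Boxes are [x0, y0, x1, y1] in pixels;
    # apply this BEFORE normalizing boxes to the 0-1000 range.
    dx = random.randint(-max_shift, max_shift)
    dy = random.randint(-max_shift, max_shift)

    # PIL's AFFINE transform maps output pixels back to input pixels,
    # so shifting the content by (+dx, +dy) uses (-dx, -dy) in the matrix.
    shifted = image.transform(
        image.size, Image.AFFINE, (1, 0, -dx, 0, 1, -dy), fillcolor="white"
    )
    shifted = ImageEnhance.Brightness(shifted).enhance(random.uniform(0.85, 1.15))

    w, h = image.size
    new_boxes = [
        [
            max(0, min(w, x0 + dx)),
            max(0, min(h, y0 + dy)),
            max(0, min(w, x1 + dx)),
            max(0, min(h, y1 + dy)),
        ]
        for x0, y0, x1, y1 in boxes
    ]
    return shifted, new_boxes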

I am working on a similar project; I use a VLM (InternVL2.5 MPO).
My current issue is that I am not able to extract a confidence score for the OCR output. Were you able to?
