I am working on an OCR project to perform document understanding on a specific type of document with a fixed layout (e.g., mandate forms). The layout has minor variations across images. My goal is to accurately extract key-value pairs and output them in JSON format.
What I Have Done
- Dataset Preparation:
- I created a dataset of 231 annotated images in FUNSD format using the UBIAI tool. The annotations include bounding boxes and labels (KEY, VALUE).
- Models Used:
- I fine-tuned both LayoutLMv3 and LiLT using my dataset.
- Training shows good performance (e.g., a high entity-level F1-score; the evaluation setup is sketched after this list), but at test time the models fail to generalize and do not accurately mark key-value pairs.
- References:
- I followed the LiLT fine-tuning tutorial.
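For reference, the F1-score I report is an entity-level score computed with seqeval. The snippet below is a simplified sketch of my compute_metrics (the label list mirrors my annotation tags; I- tags are included in case multi-word entities get split, and the rest of the Trainer wiring is trimmed):

```python
import numpy as np
from seqeval.metrics import classification_report, f1_score

# Illustrative label set matching the KEY/VALUE tags in my annotations
label_list = ["O", "B-KEY", "I-KEY", "B-VALUE", "I-VALUE"]

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Drop special/padded tokens (label == -100) before scoring
    true_labels = [
        [label_list[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_predictions = [
        [label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]

    print(classification_report(true_labels, true_predictions))
    return {"f1": f1_score(true_labels, true_predictions)}
```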
Problems Faced
- The models work well during training but produce poor results on the test set, particularly when identifying key-value relationships.
- Some keys and values are either missed or mismatched in the output JSON.
Expected vs Actual Outcome
- Expected: Accurate key-value extraction with proper bounding boxes and JSON output.
- Actual: The models fail to generalize, often missing keys/values or mismatching pairs during testing.
What I Need Help With
- Suggestions on improving the architecture or training workflow to enhance model accuracy.
- Debugging why models trained on the custom FUNSD dataset are underperforming during testing.
- Recommendations for optimizing the dataset, preprocessing, or hyperparameters.
Code and Logs
Sample JSON Annotation
{
  "form": [
    {
      "box": [368, 43, 1205, 84],
      "text": "UMRNAx1S0014321011781012",
      "label": "VALUE",
      "words": [
        { "box": [368, 43, 1205, 84], "text": "UMRNAx1S0014321011781012", "iob_tag": "B-VALUE" }
      ]
    },
    {
      "box": [1336, 42, 1380, 65],
      "text": "Date",
      "label": "KEY",
      "words": [
        { "box": [1336, 42, 1380, 65], "text": "Date", "iob_tag": "B-KEY" }
      ]
    }
  ]
}
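For context, this is roughly how I convert each FUNSD-style annotation into model inputs (a simplified sketch: paths and the label map are placeholders, boxes are normalized to the 0-1000 scale the models expect, and the LiLT run uses the same words/boxes with its own processor):

```python
import json
from PIL import Image
from transformers import LayoutLMv3Processor

# Placeholder label map covering the tags used in my annotations
label2id = {"O": 0, "B-KEY": 1, "I-KEY": 2, "B-VALUE": 3, "I-VALUE": 4}

processor = LayoutLMv3Processor.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=False  # I supply my own words/boxes
)

def normalize_box(box, width, height):
    # LayoutLMv3/LiLT expect boxes on a 0-1000 scale
    return [
        int(1000 * box[0] / width),
        int(1000 * box[1] / height),
        int(1000 * box[2] / width),
        int(1000 * box[3] / height),
    ]

def encode_example(json_path, image_path):
    image = Image.open(image_path).convert("RGB")
    width, height = image.size
    annotation = json.load(open(json_path))

    words, boxes, word_labels = [], [], []
    for entity in annotation["form"]:
        for word in entity["words"]:
            words.append(word["text"])
            boxes.append(normalize_box(word["box"], width, height))
            word_labels.append(label2id[word["iob_tag"]])

    return processor(
        image, words, boxes=boxes, word_labels=word_labels,
        truncation=True, padding="max_length", return_tensors="pt",
    )
```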
Training Code
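My full training script follows the LiLT tutorial fairly closely; this is a stripped-down sketch of the LayoutLMv3 variant (the hyperparameters are approximate placeholders, and the LiLT run swaps in LiltForTokenClassification with the SCUT-DLVCLab/lilt-roberta-en-base checkpoint):

```python
from transformers import (
    LayoutLMv3ForTokenClassification,
    TrainingArguments,
    Trainer,
)

model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base",
    num_labels=len(label2id),               # label2id from the encoding sketch above
    id2label={v: k for k, v in label2id.items()},
    label2id=label2id,
)

training_args = TrainingArguments(
    output_dir="layoutlmv3-mandate-forms",  # placeholder
    learning_rate=1e-5,
    per_device_train_batch_size=2,
    num_train_epochs=20,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,            # examples encoded with the processor above
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,        # defined in the evaluation sketch earlier
)
trainer.train()
```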
Sample Output
- Expected JSON:
{"UMRN": "UMRNAx1S0014321011781012", "Date": "17102024"}
- Actual JSON:
{"UMRN": null, "Date": "17102024"}
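The final JSON comes from a heuristic post-processing step on top of the token predictions. The sketch below approximates my pairing logic (each KEY matched to the nearest VALUE to its right on roughly the same line); this is likely where mismatched pairs originate when boxes shift between layouts:

```python
def pair_keys_and_values(entities):
    """entities: dicts like {"label": "KEY"/"VALUE", "text": str, "box": [x0, y0, x1, y1]}.
    Simplified pairing heuristic, not my exact production code."""
    keys = [e for e in entities if e["label"] == "KEY"]
    values = [e for e in entities if e["label"] == "VALUE"]

    result = {}
    for key in keys:
        kx0, ky0, _, ky1 = key["box"]
        best, best_dist = None, float("inf")
        for val in values:
            vx0, vy0, _, _ = val["box"]
            same_line = abs(vy0 - ky0) < (ky1 - ky0)   # vertical-overlap heuristic
            to_the_right = vx0 >= kx0
            if same_line and to_the_right and (vx0 - kx0) < best_dist:
                best, best_dist = val, vx0 - kx0
        result[key["text"]] = best["text"] if best else None
    return result
```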
Logs
Train Loss: 0.12, F1-Score: 0.92
Test Loss: 0.35, F1-Score: 0.68
Additional Information
- The layout of the document has minor variations, which might be affecting model performance.
- Is there a better preprocessing strategy or data augmentation technique that can help?