Looking for OCR post-processing techniques for Visual Document Understanding

Hi, I’m looking into models for feature and relation extraction tasks on documents, such as LayoutLMv3, LiLT, DocTr, etc.
Many of them take an image together with text and bounding boxes as input, with the text and boxes typically coming from an OCR engine.
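If I understand the Hugging Face processor for LayoutLMv3 correctly, the expected input looks roughly like the sketch below, with one word per OCR box (the checkpoint name is the public base model; the words, boxes, and blank image are just dummy values, and I believe `apply_ocr=False` makes the processor use my own OCR results instead of running its built-in Tesseract step):

```python
from PIL import Image
from transformers import AutoProcessor

# apply_ocr=False: use the words/boxes I pass in rather than the processor's own OCR.
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)

image = Image.new("RGB", (1000, 1000), "white")   # placeholder page image
words = ["Name:", "John", "Smith"]                # one entry per OCR box
boxes = [                                         # (x0, y0, x1, y1), normalized to 0-1000
    [50, 40, 120, 60],
    [130, 40, 180, 60],
    [190, 40, 260, 60],
]

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
print(encoding["input_ids"].shape)  # tokenized words, each aligned with its box
```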

My problem is that these models usually seem to assume that a relevant item is located in exactly one text box.

In carefully annotated datasets such as FUNSD this may be the case. However, common OCR outputs tend to oversegment the text into many small boxes, so that, for instance, a person's first and last name end up in separate boxes even though they belong together as a single value to be extracted.
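To make it concrete, the only fix I can think of myself is a naive heuristic that merges word boxes lying on roughly the same line when the horizontal gap between them is small. A rough sketch (the `(x0, y0, x1, y1)` box format and the pixel thresholds are just assumptions about the OCR output):

```python
# Naive post-processing sketch: merge OCR word boxes that sit on roughly the
# same line and are separated by a small horizontal gap.
# Box format (x0, y0, x1, y1) and the thresholds are assumptions, not tied to
# any particular OCR engine.

def merge_adjacent_boxes(words, y_tol=5, max_gap=15):
    """words: list of dicts like {"text": str, "box": (x0, y0, x1, y1)}.
    Returns merged groups with concatenated text and a union box."""
    # Sort top-to-bottom, then left-to-right.
    words = sorted(words, key=lambda w: (w["box"][1], w["box"][0]))
    merged = []
    for w in words:
        x0, y0, x1, y1 = w["box"]
        for g in merged:
            gx0, gy0, gx1, gy1 = g["box"]
            same_line = abs(y0 - gy0) <= y_tol
            small_gap = 0 <= x0 - gx1 <= max_gap
            if same_line and small_gap:
                g["text"] += " " + w["text"]
                g["box"] = (gx0, min(gy0, y0), max(gx1, x1), max(gy1, y1))
                break
        else:
            merged.append({"text": w["text"], "box": (x0, y0, x1, y1)})
    return merged


if __name__ == "__main__":
    ocr_words = [
        {"text": "John", "box": (100, 50, 140, 62)},
        {"text": "Smith", "box": (145, 50, 190, 62)},  # split off by the OCR engine
        {"text": "Date:", "box": (100, 80, 135, 92)},
    ]
    for group in merge_adjacent_boxes(ocr_words):
        print(group["text"], group["box"])
```

This obviously breaks down for multi-column layouts or when a key and its value sit close together on the same line, which is why I suspect there must be something more principled.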
I am fairly new to this subfield of ML and may be missing some common post-processing techniques people apply to OCR output to deal with this problem. I have not found it discussed in any papers yet. Does anyone here have experience with this kind of problem? I would greatly appreciate it if someone could give me some advice or point me to tutorials/discussions/papers on the matter.