Looking for OCR post-processing techniques for Visual Document Understanding

Hi, I’m looking into models for feature and relation extraction tasks on documents, such as LayoutLMv3, LiLT, DocTr, etc.
Many of them take an image together with text and bounding boxes as input, with the text and boxes typically coming from an OCR engine.
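If I understand the Hugging Face processor for LayoutLMv3 correctly, the expected input looks roughly like the sketch below, with one word per OCR box (the checkpoint name is the public base model; the words, boxes, and blank image are just dummy values, and I believe `apply_ocr=False` makes the processor use my own OCR results instead of running its built-in Tesseract step):

```python
from PIL import Image
from transformers import AutoProcessor

# apply_ocr=False: use the words/boxes I pass in rather than the processor's own OCR.
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)

image = Image.new("RGB", (1000, 1000), "white")   # placeholder page image
words = ["Name:", "John", "Smith"]                # one entry per OCR box
boxes = [                                         # (x0, y0, x1, y1), normalized to 0-1000
    [50, 40, 120, 60],
    [130, 40, 180, 60],
    [190, 40, 260, 60],
]

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
print(encoding["input_ids"].shape)  # tokenized words, each aligned with its box
```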

My problem is that these models usually seem to assume that a relevant item is located in exactly one text box.

In carefully annotated datasets such as FUNSD this may be the case. However, common OCR outputs tend to oversegment the text into many small boxes, so that, for instance, a person's first and last name end up in separate boxes even though they belong together as a single value to be extracted.
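To make it concrete, the only fix I can think of myself is a naive heuristic that merges word boxes lying on roughly the same line when the horizontal gap between them is small. A rough sketch (the `(x0, y0, x1, y1)` box format and the pixel thresholds are just assumptions about the OCR output):

```python
# Naive post-processing sketch: merge OCR word boxes that sit on roughly the
# same line and are separated by a small horizontal gap.
# Box format (x0, y0, x1, y1) and the thresholds are assumptions, not tied to
# any particular OCR engine.

def merge_adjacent_boxes(words, y_tol=5, max_gap=15):
    """words: list of dicts like {"text": str, "box": (x0, y0, x1, y1)}.
    Returns merged groups with concatenated text and a union box."""
    # Sort top-to-bottom, then left-to-right.
    words = sorted(words, key=lambda w: (w["box"][1], w["box"][0]))
    merged = []
    for w in words:
        x0, y0, x1, y1 = w["box"]
        for g in merged:
            gx0, gy0, gx1, gy1 = g["box"]
            same_line = abs(y0 - gy0) <= y_tol
            small_gap = 0 <= x0 - gx1 <= max_gap
            if same_line and small_gap:
                g["text"] += " " + w["text"]
                g["box"] = (gx0, min(gy0, y0), max(gx1, x1), max(gy1, y1))
                break
        else:
            merged.append({"text": w["text"], "box": (x0, y0, x1, y1)})
    return merged


if __name__ == "__main__":
    ocr_words = [
        {"text": "John", "box": (100, 50, 140, 62)},
        {"text": "Smith", "box": (145, 50, 190, 62)},  # split off by the OCR engine
        {"text": "Date:", "box": (100, 80, 135, 92)},
    ]
    for group in merge_adjacent_boxes(ocr_words):
        print(group["text"], group["box"])
```

This obviously breaks down for multi-column layouts or when a key and its value sit close together on the same line, which is why I suspect there must be something more principled.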
I am fairly new to this subfield of ML and may be missing some common post-processing techniques people apply to OCR output to deal with this problem. I have not found it discussed in any papers yet. Does anyone here have experience with this kind of problem? I would greatly appreciate it if someone could give me some advice or point me to tutorials/discussions/papers on the matter.