NER for 2D text

I’m looking for a method for NER on semi-structured text (i.e., text with bounding boxes). The challenge is that, because the text is laid out in 2D, we cannot rely on the usual IOB tagging scheme to retrieve entities.

Here’s an example where we want to extract the two addresses as LOC entities:
[Image: Screen Shot 2021-03-16 at 11.55.40 AM]
In this setup, we have the following labels (disregarding the B-/I- prefixes, since they no longer make sense):
[Image: Screen Shot 2021-03-16 at 11.55.46 AM]
Now, if we were to treat this as plain text by reading it sequentially, line by line, this would give us:
[Image: Screen Shot 2021-03-16 at 11.55.52 AM]

Here, the entities get mixed together because each one spans multiple lines, so recovering entities from the per-token labels is not trivial.
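
To make the problem concrete, here is a minimal sketch with made-up data (the addresses, coordinates, and tuple layout are all hypothetical, not taken from the screenshots). It puts two addresses side by side, reads the tokens in plain-text order, and merges consecutive tokens that share a label:

```python
from itertools import groupby

# Hypothetical toy layout: two addresses printed side by side.
# Each token is (text, x, y, label); x/y are the top-left corner of its box.
tokens = [
    ("12", 0, 0, "LOC"), ("Main", 30, 0, "LOC"), ("St", 80, 0, "LOC"),
    ("98", 300, 0, "LOC"), ("Oak", 330, 0, "LOC"), ("Ave", 380, 0, "LOC"),
    ("Boston", 0, 20, "LOC"), ("MA", 70, 20, "LOC"),
    ("Denver", 300, 20, "LOC"), ("CO", 370, 20, "LOC"),
]

# Plain-text reading order: top to bottom, then left to right.
reading_order = sorted(tokens, key=lambda t: (t[2], t[1]))

# Naive decoding without B-/I-: merge consecutive tokens sharing a label.
entities = [
    (label, " ".join(t[0] for t in group))
    for label, group in groupby(reading_order, key=lambda t: t[3])
]
print(entities)
# [('LOC', '12 Main St 98 Oak Ave Boston MA Denver CO')]
```

Both addresses collapse into a single LOC span, with their lines interleaved.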
The only solution I’ve seen is to add a subtask that groups tokens into entities, essentially treating the problem as relation extraction.
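
As a rough illustration of what such a grouping step does, here is a sketch that links same-label tokens and takes connected components as entities. Note that it swaps the learned pairwise (relation) classifier for a simple geometric heuristic; the `Token` class, the `linked` rule, the thresholds, and the toy data are all assumptions made for illustration:

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    x0: float
    y0: float
    x1: float
    y1: float
    label: str

def linked(a, b, max_dx=25.0, max_dy=10.0):
    """Heuristic stand-in for a learned link classifier (hypothetical rule)."""
    if a.label != b.label:
        return False
    # Gaps between the two boxes along each axis (0 if they overlap).
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    same_line = dy == 0.0 and dx <= max_dx   # neighbours on one line
    stacked = dx == 0.0 and dy <= max_dy     # vertically adjacent lines
    return same_line or stacked

def group_entities(tokens):
    """Union-find over predicted links; each component is one entity."""
    parent = list(range(len(tokens)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if linked(tokens[i], tokens[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i, tok in enumerate(tokens):
        groups.setdefault(find(i), []).append(tok)

    result = []
    for members in groups.values():
        members.sort(key=lambda t: (t.y0, t.x0))  # reading order inside entity
        result.append((members[0].label, " ".join(t.text for t in members)))
    return sorted(result, key=lambda e: e[1])

# Hypothetical toy layout: two addresses side by side, two lines each.
tokens = [
    Token("12", 0, 0, 20, 12, "LOC"), Token("Main", 30, 0, 70, 12, "LOC"),
    Token("St", 80, 0, 100, 12, "LOC"), Token("98", 300, 0, 320, 12, "LOC"),
    Token("Oak", 330, 0, 365, 12, "LOC"), Token("Ave", 375, 0, 410, 12, "LOC"),
    Token("Boston", 0, 20, 60, 32, "LOC"), Token("MA", 70, 20, 95, 32, "LOC"),
    Token("Denver", 300, 20, 360, 32, "LOC"), Token("CO", 370, 20, 395, 32, "LOC"),
]
print(group_entities(tokens))
# [('LOC', '12 Main St Boston MA'), ('LOC', '98 Oak Ave Denver CO')]
```

In a real model, `linked` would presumably be replaced by a score predicted from token-pair features, and the thresholds here would not transfer across document layouts.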