How to cluster words into semantic entities, when performing information extraction?

Hi everyone,

I’ve got a question regarding information extraction from forms. I’m not sure this is the right sub-forum for it, but since I’ve been using the “transformers” library for this work and my question touches on it, I might as well try here. Don’t hesitate to tell me if you think this post belongs elsewhere.

I’ve been searching the web about this for some time now, but I have not found a satisfactory answer yet.
My objective is to automatically extract the content of a form as a list of (QUESTION; ANSWER) pairs.

There are tutorials and notebooks showing how to perform information extraction with LayoutLMForTokenClassification (e.g. Transformers-Tutorials/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb at master · NielsRogge/Transformers-Tutorials · GitHub).

However, with this, all we achieve is the labeling of words.
To actually achieve my desired result, we need to do two more things:

A] First, we need to group together words into semantic entities; i.e. to group together the words which make up the name of a field / make up a QUESTION; or the words which constitute an ANSWER (for instance: first the word with the value and then the word for the unit, e.g. “3.5 kg”).

B] Then we need to determine / extract the relations between those entities. For instance, we need to find the relation linking one semantic entity which is a QUESTION, with one which is the associated ANSWER (which may also not exist).

There already exist some resources dedicated to the relation extraction part.
E.g. by the people at Microsoft working on unilm, who tried to implement the LayoutLMv2ForRelationExtraction model (one tutorial notebook is available here: Google Colab).
In fact, a PR was opened on the “transformers” repo in order to add this model class.

However, in order to work, this model assumes that we already know the semantic entities, hence the need for the A] step.
But then, how do we achieve this step? That’s what I’m struggling with.
The output of LayoutLMForTokenClassification will look something like this:

However, to use the RelationExtraction model, we need something like this instead.

Sure, we have this info for labelled data, but not at inference time, for a whole new document.
So, how do we do that for new documents? The order in which the words are output by the OCR may not be consistent with the order in which we actually need to consider them, if we just rely on the label values to decode / build the semantic entities.
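To make the problem concrete, here is a minimal sketch of the naive decoding I mean: scan the predicted labels in OCR order and start a new entity at every “B-” label. The function name `group_bio` and the label strings are just illustrative assumptions (FUNSD-style BIO tags), not something provided by the library.

```python
from typing import List, Optional, Tuple


def group_bio(labels: List[str]) -> List[Tuple[str, List[int]]]:
    """Group word indices into entities by scanning BIO labels in the given order.

    Hypothetical helper: `labels` are per-word predictions such as
    "B-QUESTION", "I-QUESTION", "B-ANSWER", "I-ANSWER" or "O".
    """
    entities: List[Tuple[str, List[int]]] = []
    current_type: Optional[str] = None
    current_idx: List[int] = []
    for i, label in enumerate(labels):
        if label == "O":
            prefix, etype = "O", None
        else:
            prefix, _, etype = label.partition("-")
        # An "I-" label whose type differs from the open entity also starts a new one.
        starts_new = prefix == "B" or (prefix == "I" and etype != current_type)
        if prefix == "O" or starts_new:
            if current_idx:
                entities.append((current_type, current_idx))
            current_type, current_idx = (etype, [i]) if prefix != "O" else (None, [])
        else:
            current_idx.append(i)
    if current_idx:
        entities.append((current_type, current_idx))
    return entities


# Words arrive in the "right" order: two clean entities.
good = group_bio(["B-QUESTION", "I-QUESTION", "B-ANSWER", "I-ANSWER"])

# Same labels, but the OCR interleaved the two entities: decoding breaks apart.
bad = group_bio(["B-QUESTION", "B-ANSWER", "I-QUESTION", "I-ANSWER"])
```

With the interleaved order, each word ends up in its own entity, which is exactly the failure mode I am worried about.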

We can model the problem in various ways: notably, it can be thought of as finding the edges between the nodes of a directed graph, where the indegree and outdegree of each vertex are at most 1, and there is no cycle.
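Under that graph view, once the “word i is followed by word j” edges are known, recovering the entities is the easy part: each entity is just a path. A rough sketch, assuming the edges are given as a dict (the name `chains_from_successors` and the input format are my own choices for illustration):

```python
from typing import Dict, List


def chains_from_successors(n_words: int, successor: Dict[int, int]) -> List[List[int]]:
    """Turn 'word i is followed by word j' edges into ordered chains of word indices.

    Outdegree <= 1 is guaranteed by the dict; indegree <= 1 and the
    absence of cycles are checked explicitly.
    """
    targets = list(successor.values())
    if len(set(targets)) != len(targets):
        raise ValueError("a word has more than one predecessor")
    has_predecessor = set(targets)

    chains: List[List[int]] = []
    visited = set()
    for start in range(n_words):
        if start in has_predecessor:
            continue  # not the head of a chain
        chain, node = [], start
        while node is not None:
            chain.append(node)
            visited.add(node)
            node = successor.get(node)
        chains.append(chain)

    # Words on a cycle all have a predecessor, so they are never reached from a head.
    if len(visited) != n_words:
        raise ValueError("the edges contain a cycle")
    return chains


# 5 words; word 0 is followed by word 1, word 2 by word 4, word 3 stands alone.
entities = chains_from_successors(5, {0: 1, 2: 4})
```

The hard part, of course, is predicting the edges themselves; this only shows how they would be consumed.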

I have crafted a simple algo based notably on computing the distance between the bounding boxes of the words, to account for the info contained in the spatial locations of the words relative to one another, but it’s far from perfect, and there are cases where it fails to produce the correct result.
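To give an idea of the kind of heuristic I mean (this is a simplified illustration, not my exact code): link two words when they have the same predicted entity type and the gap between their boxes is below a threshold, then take the connected groups as entities. The `(x0, y0, x1, y1)` box format, the `max_gap` threshold and the function names are all assumptions made for the example.

```python
from math import hypot
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed format


def box_gap(a: Box, b: Box) -> float:
    """Shortest distance between two axis-aligned boxes (0 if they overlap)."""
    dx = max(0.0, b[0] - a[2], a[0] - b[2])
    dy = max(0.0, b[1] - a[3], a[1] - b[3])
    return hypot(dx, dy)


def cluster_words(boxes: List[Box], types: List[str], max_gap: float) -> List[List[int]]:
    """Single-linkage grouping of same-type words whose boxes are close.

    `types` holds the entity type per word with the B-/I- prefix stripped;
    words typed "O" are ignored.
    """
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    keep = [i for i, t in enumerate(types) if t != "O"]
    for pos, i in enumerate(keep):
        for j in keep[pos + 1:]:
            if types[i] == types[j] and box_gap(boxes[i], boxes[j]) <= max_gap:
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in keep:
        groups.setdefault(find(i), []).append(i)
    # Order words inside a group top-to-bottom, then left-to-right.
    clusters = [sorted(g, key=lambda i: (boxes[i][1], boxes[i][0])) for g in groups.values()]
    clusters.sort(key=min)
    return clusters


boxes = [(10, 10, 60, 20), (62, 10, 65, 20), (100, 10, 120, 20),
         (123, 10, 135, 20), (10, 40, 60, 50)]
types = ["QUESTION", "QUESTION", "ANSWER", "ANSWER", "QUESTION"]
clusters = cluster_words(boxes, types, max_gap=5.0)
```

It works on clean layouts like this toy one, but as soon as two different entities of the same type sit close together (e.g. tightly stacked field names), they get merged, and a single threshold rarely fits a whole document.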

Seems to me like it’s the job of the LayoutLM-based model in the first place to consider both the semantic information and the spatial information of the words, in order to perform the proper labeling within the LayoutLMForTokenClassification model (notably, to distinguish the “B-” labels from the “I-” labels), so it’s a shame that this info (“to which semantic entity does each token / each word belong”) is not output by it.

Does anyone have any idea regarding how to carry out the step A]?
Notably, do we need to craft a dedicated model, like LayoutLMForTokenClassification and LayoutLMv2ForRelationExtraction, in order to achieve this? Or is it possible to somehow re-use / upgrade the LayoutLMForTokenClassification model so that it produces outputs that allow us to build the semantic entities?

Sorry, could not share the second image the first time around, here it is:
However, to use the RelationExtraction model, we need something like this instead.