LayoutLMv3 token classification on repeated values

aghenghiu · February 5, 2024, 10:06am

I’m using my own version of the SROIE dataset for a token classification problem with LayoutLMv3. I ran into a scenario where some information is repeated throughout the document. Let’s say the seller name is “abc def ghi” and it appears three times in the document.

Character recognition is a big part of this problem, and sometimes the OCR is not accurate enough. So, in this specific scenario, I get this output from inference:

…
“SellerName”: [“abc”, “def”, “ghi”, “abo”, “def”, “obc”, “dof”, “ghi”]
…

So, basically, the inference is correct: the document contains the seller name three times. But I end up with several values that all mean the same thing.

In the first case, the OCR got it perfectly [“abc”, “def”, “ghi”…]
In the second case, the OCR missed the last part of the seller name and misread one letter […“abo”, “def”…]
In the third case, the OCR recognized the full name but misread some letters […“obc”, “dof”, “ghi”]

“abc” has the same meaning as “abo” and “obc”.
“def” has the same meaning as “def” and “dof”.
“ghi” has the same meaning as “ghi”.

My question is the following: how can I keep just one value for each group of tokens with the same meaning?
I would like to keep only “SellerName”: [“abc”, “def”, “ghi”].
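
To make the desired result concrete, here is a minimal sketch of one possible post-processing step. It is not a LayoutLMv3 feature. It assumes that tokens with the same meaning differ by only a character or two, so they can be grouped by string similarity using Python's standard `difflib`. The function names `similar` and `dedupe_tokens` and the 0.6 threshold are hypothetical choices for this example and would likely need tuning on real OCR output.

```python
from collections import Counter
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.6) -> bool:
    """Return True when two tokens are close enough to count as the same word."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def dedupe_tokens(tokens, threshold=0.6):
    """Group near-identical tokens and keep one representative per group.

    Each token is compared with the first token (the anchor) of every
    existing group. The representative is the most frequent spelling in
    the group; ties go to the spelling that was seen first.
    """
    clusters = []  # each cluster is a list of spellings; cluster[0] is the anchor
    for token in tokens:
        for cluster in clusters:
            if similar(token, cluster[0], threshold):
                cluster.append(token)
                break
        else:
            # No existing group matched, so this token starts a new one.
            clusters.append([token])
    return [Counter(cluster).most_common(1)[0][0] for cluster in clusters]


predicted = ["abc", "def", "ghi", "abo", "def", "obc", "dof", "ghi"]
seller_name = dedupe_tokens(predicted)
```

With the example above, `seller_name` should contain only the three distinct words, in the order they were first seen. A sketch like this has clear limits: very short tokens can be merged by accident, and a name that legitimately contains two similar words would lose one of them, so grouping the predictions per occurrence (for example by bounding box) before comparing might be more robust.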
