LayoutLMv3 token classification on repeated values

I’m using my own version of the SROIE dataset for a token classification problem with LayoutLMv3. I ran into a scenario where some information is repeated throughout the document. Let’s say the seller name is “abc def ghi” and it appears three times in the document.

Character recognition is a big part of this problem, and sometimes it is not accurate enough. In this specific scenario, I get the following output at inference:


“SellerName”: [“abc”, “def”, “ghi”, “abo”, “def”, “obc”, “dof”, “ghi”]

So the inference is basically correct: the document does contain the seller name three times. But I end up with several values that all mean the same thing.

  • In the first case, the OCR got it perfectly [“abc”, “def”, “ghi”…]
  • In the second case, the OCR missed the last part of the seller name and got one letter wrong […“abo”, “def”…]
  • In the third case, the OCR recognized the full name but got some letters wrong […“obc”, “dof”, “ghi”]

“abc” has the same meaning as “abo” and “obc”.
“def” has the same meaning as “dof”.
“ghi” was read the same way both times it was recognized.

My question is the following: how can I keep just one value for each group of tokens with the same meaning?
I would like to keep only “SellerName”: [“abc”, “def”, “ghi”].
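
To make the goal concrete, here is a rough sketch of the kind of post-processing I have in mind. It is not something I have validated on real data. It groups tokens by string similarity using `difflib.SequenceMatcher` from the Python standard library and keeps the most frequent spelling in each group. The helper names `similarity` and `dedupe_tokens` and the `0.6` threshold are my own assumptions, not part of LayoutLMv3 or any OCR library.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two tokens (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dedupe_tokens(tokens: list[str], threshold: float = 0.6) -> list[str]:
    """Group near-identical tokens and keep one representative per group.

    Hypothetical helper: each token joins the existing group whose first
    member it resembles most (if the score reaches `threshold`); otherwise
    it starts a new group. The threshold is an assumed value to tune.
    """
    groups: list[list[str]] = []
    for token in tokens:
        best_group, best_score = None, 0.0
        for group in groups:
            score = similarity(token, group[0])
            if score > best_score:
                best_group, best_score = group, score
        if best_group is not None and best_score >= threshold:
            best_group.append(token)
        else:
            groups.append([token])

    # Keep the most frequent spelling of each group; on a tie, max() returns
    # the spelling that was seen first.
    result = []
    for group in groups:
        counts: dict[str, int] = {}
        for token in group:
            counts[token] = counts.get(token, 0) + 1
        result.append(max(counts, key=counts.get))
    return result


seller_name = ["abc", "def", "ghi", "abo", "def", "obc", "dof", "ghi"]
print(dedupe_tokens(seller_name))  # ['abc', 'def', 'ghi']
```

On this example the tie between “abc”, “abo” and “obc” is resolved in favor of “abc” only because it was seen first, so I am not sure this generalizes when the first occurrence is the misread one, or when two different words happen to look alike.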