I’m using my own version of the SROIE dataset for a token classification problem with LayoutLMv3. I ran into a scenario where some information is repeated throughout the document. Let’s say the seller name is “abc def ghi” and it appears three times in the document.
Character recognition is a big part of this problem, and sometimes the OCR is not accurate enough. So, in this specific scenario, I get this output from inference:
…
“SellerName”: [“abc”, “def”, “ghi”, “abo”, “def”, “obc”, “dof”, “ghi”]
…
So, basically, the inference is correct: the document contains the seller name three times. However, I end up with several values that all mean the same thing.
- In the first case, the OCR got it perfectly [“abc”, “def”, “ghi”…]
- In the second case, the OCR missed the last part of the seller name and misread one letter […“abo”, “def”…]
- In the third case, the OCR recognized the full name but misread some letters […“obc”, “dof”, “ghi”]
“abc” has the same meaning as “abo” and “obc”.
“def” has the same meaning as “dof”.
“ghi” has the same meaning as “ghi”
My question is the following: how can I keep just one value for each group of tokens that have the same meaning?
I would like to keep only “SellerName”: [“abc”, “def”, “ghi”].
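To make the goal concrete, here is a minimal sketch of the kind of post-processing I have in mind, not a tested solution. It assumes that tokens with the same meaning differ by only a few characters, so they can be grouped by string similarity with `difflib.SequenceMatcher` from the Python standard library. The function name `dedupe_tokens` and the `threshold` value of 0.6 are hypothetical choices for illustration; the threshold would need tuning on real OCR output, especially for very short tokens.

```python
from collections import Counter
from difflib import SequenceMatcher


def dedupe_tokens(tokens, threshold=0.6):
    """Greedily group near-identical tokens and keep one spelling per group.

    `threshold` is an assumed similarity cutoff (0.0 to 1.0), not a tuned value.
    """
    clusters = []  # each cluster is a list of tokens; cluster[0] is its anchor
    for token in tokens:
        for cluster in clusters:
            # Compare the new token with the first token seen for this cluster.
            if SequenceMatcher(None, cluster[0], token).ratio() >= threshold:
                cluster.append(token)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append([token])

    result = []
    for cluster in clusters:
        counts = Counter(cluster)
        # Keep the most frequent spelling; ties go to the first one seen.
        result.append(max(counts, key=counts.get))
    return result


seller_name = ["abc", "def", "ghi", "abo", "def", "obc", "dof", "ghi"]
print(dedupe_tokens(seller_name))  # ['abc', 'def', 'ghi']
```

Picking the most frequent spelling in each group relies on the assumption that the OCR reads a word correctly more often than it misreads it in the same way. This sketch also ignores token positions, so two genuinely different words that happen to look alike would be merged.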