LayoutLMv3 outputs multiple consecutive B- tokens within the same word with transformers 4.28.1 vs dev

Hi,

Definitely a beginner here. I used Niels Rogge’s LayoutLMv3 code to fine-tune the model for my own labels (thank you for that very detailed tutorial, btw).
Initially, I’d been installing the transformers library from the git repo, thus pulling whatever the latest dev version was. However, I ran into this issue: NameError: name 'PartialState' is not defined · Issue #22816 · huggingface/transformers · GitHub, where even after installing the accelerate library I still got the “PartialState is not defined” error, so I downgraded to transformers 4.28.1.
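In case anyone hits the same wall: a quick sanity check that helped me was printing the versions actually loaded in the environment, since the error seems to come from a mismatch between a dev build of transformers and the installed accelerate. A minimal snippet:

```python
import transformers
import accelerate

# Print the versions actually loaded in this environment; the
# "PartialState is not defined" error seems to appear when a dev
# build of transformers expects a newer accelerate than is installed.
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
```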

Now I am seeing some weird behaviour when doing named entity recognition. The test image had the following characters on it: “Model : 5432AB8” (handwritten, if that matters).
After processing Google’s OCR output, I get: words=['Model', ':', '5432AB8']
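For reference, the preprocessing looks roughly like this (a minimal sketch, not my exact code: the image path and bounding boxes below are placeholders, and the real boxes come from Google’s OCR):

```python
from PIL import Image
from transformers import LayoutLMv3Processor

# apply_ocr=False because the words and boxes come from Google's OCR,
# not from the processor's built-in Tesseract.
processor = LayoutLMv3Processor.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=False
)

image = Image.open("test_image.png").convert("RGB")  # placeholder path
words = ["Model", ":", "5432AB8"]
# Placeholder boxes, normalized to the 0-1000 range the model expects.
boxes = [[50, 50, 150, 80], [160, 50, 175, 80], [185, 50, 320, 80]]

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
# The BPE tokenizer splits "5432AB8" into several subword tokens,
# e.g. "54", "32", "AB", "8" (the exact split depends on the vocab).
print(processor.tokenizer.convert_ids_to_tokens(encoding["input_ids"][0]))
```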

Here is the problem:

  • with the dev transformers library (4.29.x, if I remember correctly), the labels were assigned correctly:
debug_value= Model, raw_predictions[idx]=1
debug_value= :, raw_predictions[idx]=2
debug_value= 54, raw_predictions[idx]=3
debug_value=32, raw_predictions[idx]=4
debug_value=AB, raw_predictions[idx]=4
debug_value=8, raw_predictions[idx]=4

It’s not important what the labels represent; the important part is that 3 is a B- token, while 4 is an I- token (in other words, 3 marks the beginning of an entity and 4 its inside, as per the IOB tagging convention).
This makes sense: within the word “5432AB8”, “54” is indeed the first subword token, and the remaining three subword tokens are interior ones, so they get the I- label. All good.

  • with the 4.28.1 transformers library (latest stable), I get:
debug_value= Model, raw_predictions[idx]=1
debug_value= :, raw_predictions[idx]=3
debug_value= 54, raw_predictions[idx]=3
**debug_value=32, raw_predictions[idx]=3**
debug_value=AB, raw_predictions[idx]=4
debug_value=8, raw_predictions[idx]=4

Note that now “32” is also marked as a B- token, even though it follows right after another B- token, inside the same word (as segmented by the OCR).
I’d like to understand what might cause this, and whether there is anything I can do about it.
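One workaround I’m considering, in case it helps someone else: keep only the prediction of each word’s first subword token, so it stops mattering whether interior subwords come back as B- or I-. A rough sketch, continuing from the preprocessing above (the fine-tuned checkpoint name is a placeholder, and raw_predictions here is just the per-token argmax of the logits):

```python
import torch
from transformers import LayoutLMv3ForTokenClassification

model = LayoutLMv3ForTokenClassification.from_pretrained(
    "my-finetuned-layoutlmv3"  # placeholder checkpoint name
)

with torch.no_grad():
    logits = model(**encoding).logits
raw_predictions = logits.argmax(-1).squeeze().tolist()

# word_ids() maps every token to the index of the OCR word it came
# from (None for special tokens), so we can keep only the FIRST
# subword's prediction per word and ignore B-/I- flips on the rest.
word_ids = encoding.word_ids(batch_index=0)
word_preds = {}
for idx, word_id in enumerate(word_ids):
    if word_id is None:
        continue  # special tokens and padding
    if word_id not in word_preds:
        word_preds[word_id] = raw_predictions[idx]

# One label per OCR word, in word order.
for word, label in zip(words, word_preds.values()):
    print(word, label)
```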

I realize that without a way to reproduce this it may be impossible for anyone to help, and I apologize for not being able to share my actual code and data. Thank you for reading!