LayoutLMv3 outputs multiple consecutive B- tokens within the same word with transformers 4.28.1 vs dev

Hi,

Definitely a beginner here. I used Niels Rogge’s LayoutLMv3 code to fine-tune the model for my own labels (thank you for that very detailed tutorial, btw).
Initially, I’d been installing the transformers library from the git repo, thus pulling whatever the latest dev version was. However, I ran into this issue: NameError: name 'PartialState' is not defined · Issue #22816 · huggingface/transformers · GitHub, where even after installing the accelerate library I still got the “PartialState is not defined” error, so I downgraded to transformers 4.28.1.
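In case anyone hits the same wall: a quick sanity check that helped me was printing the versions actually loaded in the environment, since the error seems to come from a mismatch between a dev build of transformers and the installed accelerate. A minimal snippet:

```python
import transformers
import accelerate

# Print the versions actually loaded in this environment; the
# "PartialState is not defined" error seems to appear when a dev
# build of transformers expects a newer accelerate than is installed.
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
```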

Now I am seeing some weird behaviour when doing named entity recognition. The test image had the following characters on it: “Model : 5432AB8” (handwritten, if that matters).
After processing Google’s OCR output, I get: words=['Model', ':', '5432AB8']
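For reference, the preprocessing looks roughly like this (a minimal sketch, not my exact code: the image path and bounding boxes below are placeholders, and the real boxes come from Google’s OCR):

```python
from PIL import Image
from transformers import LayoutLMv3Processor

# apply_ocr=False because the words and boxes come from Google's OCR,
# not from the processor's built-in Tesseract.
processor = LayoutLMv3Processor.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=False
)

image = Image.open("test_image.png").convert("RGB")  # placeholder path
words = ["Model", ":", "5432AB8"]
# Placeholder boxes, normalized to the 0-1000 range the model expects.
boxes = [[50, 50, 150, 80], [160, 50, 175, 80], [185, 50, 320, 80]]

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
# The BPE tokenizer splits "5432AB8" into several subword tokens,
# e.g. "54", "32", "AB", "8" (the exact split depends on the vocab).
print(processor.tokenizer.convert_ids_to_tokens(encoding["input_ids"][0]))
```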

Here is the problem:

  • with the dev transformers library (4.29.x, if I remember correctly), the labels were assigned correctly:
debug_value= Model, raw_predictions[idx]=1
debug_value= :, raw_predictions[idx]=2
debug_value= 54, raw_predictions[idx]=3
debug_value=32, raw_predictions[idx]=4
debug_value=AB, raw_predictions[idx]=4
debug_value=8, raw_predictions[idx]=4

It’s not important what the labels represent; the important part is that 3 is a B- token, while 4 is an I- token (in other words, 3 marks the beginning of an entity and 4 its inside, as per the IOB tagging convention).
This makes sense: within the word “5432AB8”, “54” is indeed the first subword token, and the remaining three subword tokens are interior ones, so they get the I- label. All good.

  • with the 4.28.1 transformers library (latest stable), I get:
debug_value= Model, raw_predictions[idx]=1
debug_value= :, raw_predictions[idx]=3
debug_value= 54, raw_predictions[idx]=3
**debug_value=32, raw_predictions[idx]=3**
debug_value=AB, raw_predictions[idx]=4
debug_value=8, raw_predictions[idx]=4

Note that now “32” is also marked as a B- token, even though it follows right after another B- token, inside the same word (as segmented by the OCR).
I’d like to understand what might cause this, and whether there is anything I can do about it.
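One workaround I’m considering, in case it helps someone else: keep only the prediction of each word’s first subword token, so it stops mattering whether interior subwords come back as B- or I-. A rough sketch, continuing from the preprocessing above (the fine-tuned checkpoint name is a placeholder, and raw_predictions here is just the per-token argmax of the logits):

```python
import torch
from transformers import LayoutLMv3ForTokenClassification

model = LayoutLMv3ForTokenClassification.from_pretrained(
    "my-finetuned-layoutlmv3"  # placeholder checkpoint name
)

with torch.no_grad():
    logits = model(**encoding).logits
raw_predictions = logits.argmax(-1).squeeze().tolist()

# word_ids() maps every token to the index of the OCR word it came
# from (None for special tokens), so we can keep only the FIRST
# subword's prediction per word and ignore B-/I- flips on the rest.
word_ids = encoding.word_ids(batch_index=0)
word_preds = {}
for idx, word_id in enumerate(word_ids):
    if word_id is None:
        continue  # special tokens and padding
    if word_id not in word_preds:
        word_preds[word_id] = raw_predictions[idx]

# One label per OCR word, in word order.
for word, label in zip(words, word_preds.values()):
    print(word, label)
```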

I realize that without a way to reproduce this it may be impossible for anyone to help, and I apologize for not being able to share my actual code and data. Thank you for reading!