LayoutLMv2Processor uses pad tokens for non-first subword tokens on NER task

LayoutLMv2Processor uses pad tokens for non-first subword tokens on NER tasks. Since the bbox is provided at the word level, all the subword tokens of a word share the same bbox, so why does only the first one get the label?

As an example, here I am showing (<iob_label_id>, <input_id>, <token>, <bbox>):

(0, 17403, '157', [112, 199, 145, 212]),
(-100, 1010, ',', [112, 199, 145, 212]),
(1, 6583, 'na', [152, 199, 254, 213]),
(-100, 4648, '##ura', [152, 199, 254, 213]),
(-100, 3070, '##ng', [152, 199, 254, 213]),
(-100, 5311, '##pur', [152, 199, 254, 213]),
(-100, 1010, ',', [152, 199, 254, 213]),
(1, 4753, 'sector', [259, 200, 314, 211]),
(1, 6275, '78', [317, 200, 338, 211]),

To me, it would make sense to keep label 1 instead of -100, since these subword tokens come from words that were labelled 1, and subword tokens of the same word have exactly the same bbox values.

Any ideas?

Hi,

Fine-tuning Transformer-based models for NER typically uses one of two strategies: either label all subword tokens of a given word, or label only the first subword token and set all others to -100 (the ignore_index of PyTorch's CrossEntropyLoss function, meaning those subword tokens won't contribute to the loss). With the second strategy, the model only needs to learn to appropriately label the first subword token of each word.
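To make the second strategy concrete, here is a minimal sketch of the label-alignment logic, written against the `word_ids()` mapping that fast tokenizers return. `align_labels` is a hypothetical helper for illustration, not a function from the library:

```python
def align_labels(word_labels, word_ids, only_label_first_subword=True):
    """Map word-level labels to token-level labels.

    word_ids: for each token, the index of the word it came from
              (None for special tokens), as returned by a fast
              tokenizer's encoding.word_ids().
    """
    labels = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            labels.append(-100)  # special tokens never contribute to the loss
        elif word_id != previous_word_id:
            labels.append(word_labels[word_id])  # first subword keeps the label
        elif only_label_first_subword:
            labels.append(-100)  # remaining subwords are ignored by the loss
        else:
            labels.append(word_labels[word_id])  # label every subword
        previous_word_id = word_id
    return labels


# Reproducing the example above: words "157,", "naurangpur,", "sector", "78"
word_labels = [0, 1, 1, 1]
word_ids = [0, 0, 1, 1, 1, 1, 1, 2, 3]  # '157' ',' 'na' '##ura' '##ng' '##pur' ',' 'sector' '78'
print(align_labels(word_labels, word_ids))
# → [0, -100, 1, -100, -100, -100, -100, 1, 1]
```

With `only_label_first_subword=False` the same call instead labels every subword, yielding `[0, 0, 1, 1, 1, 1, 1, 1, 1]`.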

You can change this behaviour with processor.tokenizer.only_label_first_subword = False (or by instantiating a LayoutLMv2Tokenizer/LayoutLMv2TokenizerFast with only_label_first_subword=False, and then instantiating a LayoutLMv2Processor from this tokenizer and a LayoutLMv2FeatureExtractor).


Thanks for the help!
Just one more question here: Do you have any information on which approach performs better?

In practice, both seem to work equally well.
