LayoutLMv2Processor uses pad tokens for non-first subword tokens on NER task

LayoutLMv2Processor uses pad tokens for non-first subword tokens on NER tasks. Since the bbox is provided at the word level, all the subword tokens of a word share the same bbox, so why does only the first one get the label?

As an example, here I am showing (<iob_label_id>, <input_id>, <token>, <bbox>):

(0, 17403, '157', [112, 199, 145, 212]),
(-100, 1010, ',', [112, 199, 145, 212]),
(1, 6583, 'na', [152, 199, 254, 213]),
(-100, 4648, '##ura', [152, 199, 254, 213]),
(-100, 3070, '##ng', [152, 199, 254, 213]),
(-100, 5311, '##pur', [152, 199, 254, 213]),
(-100, 1010, ',', [152, 199, 254, 213]),
(1, 4753, 'sector', [259, 200, 314, 211]),
(1, 6275, '78', [317, 200, 338, 211]),

To me, it would make sense to keep label 1 instead of -100, since these subword tokens come from words that were labelled 1, and subword tokens of the same word have exactly the same bbox values.

Any ideas?

Hi,

Fine-tuning Transformer-based models for NER typically uses one of two strategies: either label all subword tokens of a given word, or label only the first subword token and set all others to -100 (the ignore_index of PyTorch's CrossEntropyLoss function, meaning those subword tokens won't contribute to the loss). With the second strategy, the model only needs to learn to appropriately label the first subword token of each word.
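To make the second strategy concrete, here is a minimal sketch of the label-alignment logic, written against the `word_ids()` mapping that fast tokenizers return. `align_labels` is a hypothetical helper for illustration, not a function from the library:

```python
def align_labels(word_labels, word_ids, only_label_first_subword=True):
    """Map word-level labels to token-level labels.

    word_ids: for each token, the index of the word it came from
              (None for special tokens), as returned by a fast
              tokenizer's encoding.word_ids().
    """
    labels = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            labels.append(-100)  # special tokens never contribute to the loss
        elif word_id != previous_word_id:
            labels.append(word_labels[word_id])  # first subword keeps the label
        elif only_label_first_subword:
            labels.append(-100)  # remaining subwords are ignored by the loss
        else:
            labels.append(word_labels[word_id])  # label every subword
        previous_word_id = word_id
    return labels


# Reproducing the example above: words "157,", "naurangpur,", "sector", "78"
word_labels = [0, 1, 1, 1]
word_ids = [0, 0, 1, 1, 1, 1, 1, 2, 3]  # '157' ',' 'na' '##ura' '##ng' '##pur' ',' 'sector' '78'
print(align_labels(word_labels, word_ids))
# → [0, -100, 1, -100, -100, -100, -100, 1, 1]
```

With `only_label_first_subword=False` the same call instead labels every subword, yielding `[0, 0, 1, 1, 1, 1, 1, 1, 1]`.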

You can change this behaviour with processor.tokenizer.only_label_first_subword = False (or by instantiating a LayoutLMv2Tokenizer/LayoutLMv2TokenizerFast with only_label_first_subword=False, and then instantiating a LayoutLMv2Processor from this tokenizer and a LayoutLMv2FeatureExtractor).


Thanks for the help!
Just one more question here: Do you have any information on which approach performs better?

In practice, both seem to work equally well.
