LayoutLMv2Processor assigns the padding label (-100) to every subword token after the first one on NER tasks. Since the bbox is provided at the word level, all subword tokens of a word share the same bbox, so why does only the first one get the label?
As an example, here is each token shown as (<iob_label_id>, <input_id>, <token>, <bbox>):
(0, 17403, '157', [112, 199, 145, 212]),
(-100, 1010, ',', [112, 199, 145, 212]),
(1, 6583, 'na', [152, 199, 254, 213]),
(-100, 4648, '##ura', [152, 199, 254, 213]),
(-100, 3070, '##ng', [152, 199, 254, 213]),
(-100, 5311, '##pur', [152, 199, 254, 213]),
(-100, 1010, ',', [152, 199, 254, 213]),
(1, 4753, 'sector', [259, 200, 314, 211]),
(1, 6275, '78', [317, 200, 338, 211]),
To me, it would make more sense for these subword tokens to inherit their word's label (1 here) instead of -100, since the original words carried that label before they were subword tokenized. Subword tokens of the same word have exactly the same bbox values.
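To make the comparison concrete, here is a minimal sketch in plain Python of the two labeling schemes. It does not call the processor: the helper `propagate_labels`, its `only_first_subword` flag, and the `word_ids` list are hypothetical, and the word grouping is my assumption, written by hand from the shared bboxes in the example above.

```python
PAD_LABEL = -100  # PyTorch's cross-entropy loss ignores this index by default

def propagate_labels(word_ids, word_labels, only_first_subword=True):
    """Expand word-level labels to token-level labels.

    word_ids: for each token, the index of the word it came from,
              or None for special tokens.
    word_labels: one label id per word.
    """
    token_labels = []
    previous = None
    for wid in word_ids:
        if wid is None:
            # Special tokens never carry a label.
            token_labels.append(PAD_LABEL)
        elif wid != previous or not only_first_subword:
            # First subword of a word, or every subword when propagating.
            token_labels.append(word_labels[wid])
        else:
            # Later subwords are masked out of the loss.
            token_labels.append(PAD_LABEL)
        previous = wid
    return token_labels

# Assumed grouping: tokens that share a bbox belong to the same word.
word_ids = [0, 0, 1, 1, 1, 1, 1, 2, 3]
word_labels = [0, 1, 1, 1]

first_only = propagate_labels(word_ids, word_labels)
all_subwords = propagate_labels(word_ids, word_labels, only_first_subword=False)
print(first_only)    # what I currently get
print(all_subwords)  # what I would have expected
```

With `only_first_subword=True` the output matches the labels I listed above; with `False` every subword takes its word's label, which is the behavior I am asking about.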
Any ideas?