Hi @Zack, the reasons why we might not want to assign NER tags to every subword are nicely summarised by facehugger2020 in the other thread you linked to: Converting Word-level labels to WordPiece-level for Token Classification
Besides efficiency, tagging every subword also technically breaks the IOB2 format, since consecutive tokens end up associated with, say, B-ORG. Personally, I find this confusing during debugging, so I prefer the convention of only tagging the first subword of each word in an entity. Of course, this is just a convention, and facehugger2020 links to a tutorial showing an alternative along the lines you suggested.
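To make the "label only the first subword" convention concrete, here is a minimal sketch. It assumes a fast tokenizer (needed for `word_ids()`) such as `bert-base-cased`, and the label ids are hypothetical:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

words = ["HuggingFace", "is", "in", "NYC"]
word_labels = [3, 0, 0, 5]  # hypothetical ids, e.g. 3 = B-ORG, 0 = O, 5 = B-LOC

encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

labels = []
previous_word_id = None
for word_id in encoding.word_ids():
    if word_id is None:
        # special tokens like [CLS] / [SEP] get no label
        labels.append(-100)
    elif word_id != previous_word_id:
        # first subword of a word keeps the word-level label
        labels.append(word_labels[word_id])
    else:
        # remaining subwords are ignored by the loss
        labels.append(-100)
    previous_word_id = word_id

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"][0]))
print(labels)
```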
Now why the -100 value? I believe this comes from the PyTorch convention of ignoring targets with the value -100 in the cross-entropy loss: https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
You can override this value if you want, but then you’ll need to tinker with the loss function of your Transformer model.
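Here's a small sketch of what that looks like: `ignore_index=-100` is the default for `nn.CrossEntropyLoss`, and if you want a different sentinel you have to pass it explicitly (which in practice means overriding the loss computation of your model or `Trainer` rather than relying on the built-in behaviour). The numbers below are just illustrative:

```python
import torch
from torch import nn

logits = torch.randn(4, 3)                 # (num_tokens, num_labels)
labels = torch.tensor([1, -100, 2, -100])  # -100 positions don't contribute to the loss

# Default: positions labelled -100 are ignored
default_loss = nn.CrossEntropyLoss()(logits, labels)

# Custom sentinel (say -1) requires passing ignore_index yourself
custom_labels = torch.tensor([1, -1, 2, -1])
custom_loss = nn.CrossEntropyLoss(ignore_index=-1)(logits, custom_labels)
```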
You should be careful about picking an index like 0, because the tokenizer already reserves the low natural numbers as predefined IDs, so you might end up clashing with special tokens like <s> this way.