How does an NER model learn from the way labels are processed during training?

Hi guys, I am just getting started with NER using transformers, but I am running into trouble understanding how an NER model is trained and how it is expected to behave during inference.
During training, tokenization and labeling are done so that, for each word, the label is assigned only to the first sub-token, and -100 (the ignore_index) is assigned to all the other sub-tokens of that word. But during inference, the model is expected to predict the same label for all sub-tokens of a word.
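To make the question concrete, here is a minimal sketch of the alignment step I am describing (a toy example, not the exact script; `words`, `ner_labels`, and `label_map` are made up, and I am assuming `bert-base-cased` just for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Made-up inputs purely for illustration.
words = ["Hugging", "Face", "is", "in", "Brooklyn"]
ner_labels = ["B-ORG", "I-ORG", "O", "O", "B-LOC"]
label_map = {"B-ORG": 0, "I-ORG": 1, "O": 2, "B-LOC": 3}
pad_token_label_id = -100  # the default ignore_index of nn.CrossEntropyLoss

tokens, label_ids = [], []
for word, label in zip(words, ner_labels):
    word_tokens = tokenizer.tokenize(word)
    tokens.extend(word_tokens)
    # Real label on the first sub-token, -100 on all remaining
    # sub-tokens of the word, mirroring the linked utils_ner.py.
    label_ids.extend(
        [label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1)
    )

print(list(zip(tokens, label_ids)))
# A word split into several sub-tokens, e.g. "Hugging" -> ["Hu", "##gging"],
# would come out as something like ('Hu', 0), ('##gging', -100).
```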

Training: https://github.com/huggingface/transformers/blob/6b4c617666fd26646d44d54f0c45dfe1332b12ca/examples/token-classification/utils_ner.py#L110-L117
Inference: https://huggingface.co/transformers/usage.html#named-entity-recognition
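My understanding so far is that the -100 positions simply drop out of the loss, so the model gets no gradient signal for them. Here is a plain PyTorch sketch of that (logits and labels are made-up values; `ignore_index` defaults to -100 in `nn.CrossEntropyLoss`):

```python
import torch
import torch.nn as nn

loss_fct = nn.CrossEntropyLoss()  # ignore_index defaults to -100

# Fake logits for 4 token positions over 3 labels; values are random.
logits = torch.randn(4, 3)
labels = torch.tensor([0, -100, 2, 1])

# The position labeled -100 is masked out of the loss entirely,
# so the model is never told what to predict there.
loss = loss_fct(logits, labels)
print(loss)
```

If that is right, then at inference the model still emits a prediction for every sub-token, even though it was never supervised on the non-first ones, which is exactly the part I find confusing.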

I am just trying to understand how the model learns under this scheme. Thanks :v:

Great ecosystem BTW :slight_smile:

ping @stefan-it @vblagoje