How does an NER model learn from the way labels are assigned during training?

Hi guys, I am just getting started with NER using transformers, but I am having trouble understanding how the NER model is trained and how it is expected to behave during inference.
During training, tokenization and labeling are done in such a way that, for each word, the label is assigned to the first token only, and -100 (ignore_index) is assigned to all other tokens of that word. But during inference, the model is expected to predict the same label for all the tokens of a word.
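
To make the setup concrete, here is a rough sketch of the labeling scheme I mean. It is not the actual library code: `align_labels` is a made-up helper, the `word_ids` list is written by hand to mimic what a fast tokenizer's `word_ids()` returns, and the sub-token split and label ids are assumed for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_labels, ignore_index=IGNORE_INDEX):
    """Label the first sub-token of each word; mask everything else."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] / [SEP] belong to no word.
            aligned.append(ignore_index)
        elif word_id != previous_word:
            # First sub-token of a word gets the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Continuation sub-token: ignored by the loss.
            aligned.append(ignore_index)
        previous_word = word_id
    return aligned


# Hypothetical example sentence: "Wolfgang lives in Berlin"
# Assumed tokens: [CLS] Wolf ##gang lives in Berlin [SEP]
word_ids = [None, 0, 0, 1, 2, 3, None]
word_labels = [1, 0, 0, 5]  # assumed ids, e.g. 1 = B-PER, 0 = O, 5 = B-LOC

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 1, -100, 0, 0, 5, -100]
```

So the `##gang` position never contributes to the loss, which is the part that confuses me.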


Just trying to understand how the model learns this way. Thanks :v:

Great ecosystem BTW :slight_smile:

ping @stefan-it @vblagoje