Token Classification Model making mistakes outside of the training dataset

I have fine-tuned both BERT and ALBERT on a token classification task, a kind of key-phrase extraction where the model receives a paragraph and has to select all the key phrases in it. I use a beginning-middle annotation scheme: the first token of a key phrase is labeled "1" and the remaining tokens of the phrase are labeled "2". However, both models fine-tuned on my dataset (more than 10,000 training samples) make mistakes where they never place a "1" on the first token of a key phrase and only place "2"s. This error occurred roughly 1,000 times on a 2,000-sample validation set over the course of training (I checked for it every 500 optimization steps). Why is this happening, given that my training dataset contains no such labeling errors?
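
For reference, this is roughly how I count that error type during evaluation (a minimal sketch, assuming label ids 0 = outside, 1 = begin, 2 = inside; the helper name is just illustrative):

```python
def count_orphan_inside_tags(predicted_labels):
    """Count positions where an inside tag ('2') starts a phrase
    instead of following a begin tag ('1') or another inside tag."""
    errors = 0
    for sequence in predicted_labels:
        previous = 0  # treat the start of the sequence as "outside"
        for label in sequence:
            # a '2' is only valid right after a '1' or another '2'
            if label == 2 and previous not in (1, 2):
                errors += 1
            previous = label
    return errors

# Example: the second sequence starts a phrase with '2', so it counts as one error.
predictions = [
    [0, 1, 2, 2, 0],   # valid: phrase begins with 1
    [0, 2, 2, 0, 1],   # invalid: phrase begins with 2
]
print(count_orphan_inside_tags(predictions))  # -> 1
```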

Could you please take a look at this? @valhalla or @lysandre