Token classification model makes a mistake that never appears in the training dataset

I have fine-tuned both BERT and ALBERT on a token classification task: a type of key-phrase extraction in which the model receives a paragraph and has to select all the key phrases. I used a beginning-middle annotation scheme, where I label the first token of a key phrase with "1" and the remaining tokens of the phrase with "2". My dataset has more than 10,000 training samples. However, both fine-tuned models make a mistake in which they do not place a "1" on the first token of a key phrase and only place "2"s. I tested the model for this error every 500 optimization steps, and it occurred roughly 1,000 times on a 2,000-sample validation set throughout training. Why is this happening, given that my training dataset has no errors?
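
To make the error concrete, here is a minimal sketch of how it could be counted on one predicted label sequence. It assumes tokens outside any key phrase are labeled 0 (an assumption, since only labels "1" and "2" are described above), and the function name `count_missing_starts` is hypothetical, not part of any library.

```python
def count_missing_starts(labels, outside=0, begin=1, inside=2):
    """Count key phrases that start with `inside` instead of `begin`.

    Assumes `outside` (0) marks tokens that are not part of a key phrase;
    this label value is an assumption for illustration.
    """
    count = 0
    prev = outside  # treat the position before the sequence as "outside"
    for lab in labels:
        # A "2" is only valid directly after a "1" or another "2".
        if lab == inside and prev not in (begin, inside):
            count += 1
        prev = lab
    return count


# One correct phrase (1, 2, 2) and one phrase missing its "1" (2, 2).
predictions = [0, 1, 2, 2, 0, 2, 2, 0]
print(count_missing_starts(predictions))  # prints 1
```

Running the same check over the training labels would confirm whether the pattern really never occurs there.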

