Unbalanced training with BERT

I have been looking further into this, and it seems like BERT (the Hugging Face implementation) learns in an imbalanced way for some reason.

I have a very small vocabulary since I’m training on genetic data, but you can basically think of it as if I were just using the letters of the alphabet as my vocabulary.
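
For context, the model is set up along these lines (a minimal sketch, not my exact code; the hyperparameters and the four-letter vocabulary are just placeholders):

```python
from transformers import BertConfig, BertForMaskedLM

# Tiny character-level vocabulary: special tokens plus the four bases
# (placeholder values -- the real vocab is built the same way, just small).
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "A", "C", "G", "T"]

config = BertConfig(
    vocab_size=len(vocab),
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=512,
    max_position_embeddings=512,
)
model = BertForMaskedLM(config)
```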

I’m plotting a confusion matrix as I train, and what I see is that two of these letters get predicted more than 90% of the time, even though they are only slightly more common than the others in their natural occurrence.
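
The confusion matrix is computed over the masked positions only, roughly like this (again a simplified sketch; `logits` and `labels` come out of the masked-LM batch, where unmasked positions are labelled -100):

```python
import torch
from sklearn.metrics import confusion_matrix

def masked_lm_confusion_matrix(logits: torch.Tensor,
                               labels: torch.Tensor,
                               vocab_size: int):
    """Confusion matrix over the masked positions of one batch.

    logits: (batch, seq_len, vocab_size) output of BertForMaskedLM
    labels: (batch, seq_len) with -100 at positions that were not masked
    """
    preds = logits.argmax(dim=-1)   # predicted token id per position
    mask = labels != -100           # only score the masked positions
    y_true = labels[mask].cpu().numpy()
    y_pred = preds[mask].cpu().numpy()
    return confusion_matrix(y_true, y_pred, labels=list(range(vocab_size)))
```

Summing the columns of that matrix is what shows the skew: two token ids soak up more than 90% of the predictions.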

Have any of you seen this kind of imbalance before when training a BERT model? Or do you have any advice as to why this might be happening?