I built a BERT model from scratch on domain-specific sequences. To prepare each input, I prepended [CLS] to the front, appended [SEP] to the end, and added [PAD] tokens after [SEP] when necessary. I trained the model with cross-entropy loss and stopped training at about 90% accuracy, but I found that the model only ever predicts the non-special tokens. What did I do wrong? Should I add constraints to the loss so the model has a stronger inductive bias toward producing the output I expect?
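
To make it concrete, here is a simplified sketch of my preprocessing and loss. The token ids, sizes, and the tiny stand-in model below are placeholders for illustration, not my actual domain-specific vocabulary or from-scratch BERT:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder ids/sizes; my real vocabulary and model are domain specific
CLS_ID, SEP_ID, PAD_ID = 0, 1, 2
VOCAB_SIZE, MAX_LEN = 32, 16

def prepare_input(token_ids):
    """Prepend [CLS], append [SEP], then pad with [PAD] up to MAX_LEN."""
    ids = [CLS_ID] + token_ids + [SEP_ID]
    ids += [PAD_ID] * (MAX_LEN - len(ids))
    return torch.tensor(ids)

# Tiny stand-in for my from-scratch BERT: returns per-token logits (B, T, V)
model = nn.Sequential(nn.Embedding(VOCAB_SIZE, 64), nn.Linear(64, VOCAB_SIZE))

inputs = torch.stack([prepare_input([5, 7, 9]), prepare_input([4, 6, 8, 10])])
labels = inputs.clone()          # targets; my masking strategy is omitted here
logits = model(inputs)           # (batch, seq_len, vocab)

# Plain cross entropy over every position, special tokens included
loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), labels.reshape(-1))
print(loss.item())
```

Is this the right way to set up the targets and the loss, or is the problem somewhere in how I handle the special tokens?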