Why does BertForSequenceClassification perform worse than BertModel + Linear?

In my experiments, I trained a simple sentiment classification model on the SST dataset. Interestingly, the model hardly converges with BertForSequenceClassification, but it converges easily when I take the [CLS] hidden state from a plain BertModel and put a Linear layer on top. Has anyone else run into this, and can anyone explain which part of the pooler causes it: the tanh or the pretrained linear layer?
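
To make the comparison concrete, here is a minimal sketch of the two heads as I understand them. It is plain Python with toy sizes and random weights, and every name in it (`pooled_head`, `raw_cls_head`, `rand_matrix`, and so on) is made up for illustration rather than taken from the library. The pooled path applies a dense layer and a tanh to the [CLS] hidden state before the classifier, while the direct path feeds the raw [CLS] hidden state straight to the classifier. Dropout is left out for brevity.

```python
import math
import random


def linear(x, weight, bias):
    """Compute weight @ x + bias using plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def pooled_head(cls_hidden, pool_w, pool_b, clf_w, clf_b):
    """Pooler-style head: dense -> tanh -> classifier (dropout omitted)."""
    pooled = [math.tanh(z) for z in linear(cls_hidden, pool_w, pool_b)]
    return linear(pooled, clf_w, clf_b), pooled


def raw_cls_head(cls_hidden, clf_w, clf_b):
    """Direct head: classifier applied to the raw [CLS] hidden state."""
    return linear(cls_hidden, clf_w, clf_b)


def rand_matrix(rows, cols, scale=0.5):
    """Hypothetical helper: random weights standing in for real parameters."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
hidden_size, num_labels = 8, 2  # toy sizes, not the real BERT dimensions

# Stand-in for the encoder's [CLS] hidden state (assumed values, not real output).
cls_hidden = [random.gauss(0.0, 3.0) for _ in range(hidden_size)]

pool_w = rand_matrix(hidden_size, hidden_size)
pool_b = [0.0] * hidden_size
clf_w = rand_matrix(num_labels, hidden_size)
clf_b = [0.0] * num_labels

logits_pooled, pooled = pooled_head(cls_hidden, pool_w, pool_b, clf_w, clf_b)
logits_raw = raw_cls_head(cls_hidden, clf_w, clf_b)

# The tanh bounds every pooled feature, while the raw [CLS] features are unbounded.
print("max |pooled feature|:", max(abs(p) for p in pooled))
print("max |raw feature|:   ", max(abs(h) for h in cls_hidden))
print("logits via pooler:   ", logits_pooled)
print("logits via raw [CLS]:", logits_raw)
```

The only structural difference between the two paths is the extra dense layer and tanh, so the classifier in the pooled path sees features squashed into the range -1 to 1. Whether that is what actually hurts convergence in my setup is exactly what I am unsure about.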