Fine-tuned model outputs different predictions on each run. Why?

I fine-tuned a BERT model on a classification task, but when I save the best model and try it on new data, the model keeps predicting the same class for all the data.

First run: all data are predicted as class 0
Second run: class 1
Third run: class 2

Can you explain why? I use the best model from the checkpoint. When testing after training, I get ~0.90 accuracy, but when I use the model to predict on new unannotated data, I get the same class for the whole dataset (~400 sentences). If I run it again, I get a different class, but again the same one for the whole dataset.

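from transformers import FlaubertForSequenceClassification, Trainer

# `checkpoint` is the best model saved during training
model = FlaubertForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
# print info about model's parameters
total_params = sum(p.numel() for p in model.parameters())
model_parameters = filter(lambda p: p.requires_grad, model.parameters())
trainable_params = sum(p.numel() for p in model_parameters)
print(f"total params: {total_params}, trainable params: {trainable_params}")
test_trainer = Trainer(model)
# predict() returns (predictions, label_ids, metrics); only the raw predictions are used here
raw_pred, _, _ = test_trainer.predict(emb_feature)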

I get the raw predictions (logits) and then convert them to probabilities. Do you know why this might happen?
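
A minimal sketch of that conversion step, using only the standard library. It assumes `raw_pred` is a list of per-sentence logit rows (for a NumPy array, `raw_pred.tolist()` gives this form). The helper names `softmax` and `predicted_classes` and the example logits are hypothetical, not taken from the actual run. Counting the predicted classes at the end makes the "same class for everything" symptom easy to see.

```python
import math
from collections import Counter


def softmax(logits):
    """Convert one row of raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_classes(raw_pred):
    """Return the index of the largest logit for every row."""
    return [max(range(len(row)), key=row.__getitem__) for row in raw_pred]


# Hypothetical logits for three sentences and three classes (made-up values).
example_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0], [1.5, 1.4, 0.0]]

probs = [softmax(row) for row in example_logits]
classes = predicted_classes(example_logits)
class_counts = Counter(classes)  # how many sentences fall into each class
print(class_counts)
```

If `class_counts` contains a single key for all ~400 sentences, and that key changes between runs, the problem lies upstream of this conversion, in how the model is loaded or fed.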