Evaluating a Hugging Face transformer with Trainer gives different results than calling the model directly

I am using a pre-trained Transformer for sequence classification (distilbert-base-cased), which I fine-tuned on my dataset with the Trainer class. When I evaluate the fine-tuned model through the Trainer, I get an accuracy of 94%:

import numpy as np
import evaluate
from transformers import Trainer

trainer = Trainer(model=model)
preds = trainer.predict(validation_dataset)
predictions = np.argmax(preds.predictions, axis=-1)

metric = evaluate.load("accuracy")
metric.compute(predictions=predictions, references=preds.label_ids)
# prints: {'accuracy': 0.9435554514341591}

However, when I tried to get the predictions directly from the model, the accuracy was only around 86%:

import torch

predictions, labels = [], []
model.eval()
with torch.no_grad():  # disable gradient tracking during inference
    for row in validation_dataset:
        text_ids = row['input_ids'].unsqueeze(0)  # add a batch dimension
        predicted = torch.argmax(model(text_ids)[0])  # argmax over the logits
        predictions.append(predicted.item())
        labels.append(row['label'])

metric.compute(predictions=predictions, references=labels)
# prints: {'accuracy': 0.8639942552151239}
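To rule out the metric computation itself, I checked the argmax-and-accuracy step in isolation on toy logits (made-up numbers, not from the actual model), and it behaves as expected:

```python
import numpy as np

# Toy logits for 4 examples and 2 classes (hypothetical values)
logits = np.array([[2.0, -1.0],
                   [0.5, 1.5],
                   [-0.3, 0.7],
                   [1.2, 0.1]])
labels = np.array([0, 1, 0, 0])

# Same argmax-over-last-axis step used above
preds = np.argmax(logits, axis=-1)   # -> [0, 1, 1, 0]

# Accuracy is just the fraction of matching predictions
accuracy = (preds == labels).mean()  # 3 of 4 correct -> 0.75
print(preds, accuracy)
```

So the discrepancy does not seem to come from how accuracy is computed from the predicted class indices.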

I wonder why the predictions from the Trainer and from the model differ. Additionally, why is the accuracy from the Trainer so much higher? Am I missing something, or is this an indication of a bad implementation?
