Different eval output when using Trainer.evaluate() vs own inference loop

Hi, I am getting very different evaluation results when using Trainer.evaluate(val_data) vs my own inference loop (i.e. I feed the val_data into the model directly and evaluate the logits): 56% accuracy with Trainer.evaluate(val_data) vs 18% with my own inference loop.

I’m using the same scikit-learn metric (balanced_accuracy_score) in both cases. I also investigated on a small test set and confirmed that the predictions from the Hugging Face Trainer and from my own model outputs are quite different.

################
Code for Trainer.evaluate:
val_output = trainer.evaluate(val_data)
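
For context, the metric on the Trainer side comes from a compute_metrics callback along these lines (a simplified sketch, not my exact code; compute_balanced_accuracy and training_args are just illustrative names):

import numpy as np
from sklearn.metrics import balanced_accuracy_score
from transformers import Trainer

def compute_balanced_accuracy(eval_pred):
    # eval_pred holds the model's logits and the true labels
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"balanced_accuracy": balanced_accuracy_score(labels, preds)}

trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=val_data,
    compute_metrics=compute_balanced_accuracy,
)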

Code for my own inference loop:
import torch
from torch.utils.data import DataLoader
from sklearn.metrics import balanced_accuracy_score

test_dataloader = DataLoader(val_data, batch_size=8, shuffle=False, generator=gen, num_workers=2)

with torch.no_grad():
    for idx, data in enumerate(test_dataloader):
        data_input = data["input_ids"].to(device)
        outputs = model(input_ids=data_input, output_hidden_states=True)
        y_pred = torch.argmax(outputs.logits.cpu(), dim=1)
        data_labels = data["labels"].cpu()
        # accumulate predictions and labels across batches
        if idx == 0:
            overall_y_pred = y_pred
            overall_data_labels = data_labels
        else:
            overall_y_pred = torch.hstack((overall_y_pred, y_pred))
            overall_data_labels = torch.hstack((overall_data_labels, data_labels))

overall_y_pred = overall_y_pred.to(device="cpu", dtype=torch.float)
overall_data_labels = overall_data_labels.to(device="cpu", dtype=torch.float)
test_accuracy = balanced_accuracy_score(overall_data_labels, overall_y_pred)

##############
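
For the small-test-set comparison mentioned above, I checked agreement between the two pipelines with something along these lines (a sketch, not my exact code; it relies on overall_y_pred from the loop above and on the DataLoader not shuffling, so the ordering matches trainer.predict):

import numpy as np

# Trainer's own forward pass over the same validation data
pred_output = trainer.predict(val_data)
trainer_preds = np.argmax(pred_output.predictions, axis=-1)

# predictions from my inference loop above
loop_preds = overall_y_pred.numpy().astype(int)

# fraction of examples where the two pipelines predict the same class
agreement = (trainer_preds == loop_preds).mean()
print(f"prediction agreement: {agreement:.3f}")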
This happens only with the RoBERTa and DeBERTa models. I am not seeing similar issues with BEiT / Wav2Vec 2.0 / HuBERT, even though I use almost the same code for my own inference loop.

I am not able to figure out what’s wrong. Hoping someone is able to help, please. Thank you!