Trainer predict or evaluate returns zero for metrics

There’s some weird behaviour where trainer.evaluate and trainer.predict return wrong or zero values for several metrics, while manually calling model.generate and then running compute_metrics on the outputs gives the expected numbers.

Here’s an example:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(output_dir='./dummy_dir')
trainer = Trainer(
    model=my_model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["valid"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics
)
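
These metrics come from running prediction on the test split, roughly like this (a sketch; the split name may differ from my actual code):

metrics = trainer.predict(tokenized_data["test"]).metrics

which gives: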
metrics = {
    'test_loss': 0.05446154996752739,
    'test_Accuracy': 0.0,
    'test_Seq-Accuracy': 0.0,
    'test_F1': 0,
    'test_Precision': 0.0,
    'test_Recall': 0.0,
    'test_rouge1': 0.9656326034063261,
    'test_rouge2': 0.9542443903757749,
    'test_rougeL': 0.9656326034063261,
    'test_rougeLsum': 0.965683292781833,
    'test_runtime': 120.7627,
    'test_samples_per_second': 40.84,
    'test_steps_per_second': 5.109
}
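
For reference, compute_metrics and preprocess_logits_for_metrics follow the usual decode-and-score pattern; this is a simplified sketch rather than my exact code (the classification-style metrics are left out here):

import numpy as np
import evaluate

rouge = evaluate.load("rouge")

def preprocess_logits_for_metrics(logits, labels):
    # keep only the predicted token ids so the full logits are not accumulated
    if isinstance(logits, tuple):
        logits = logits[0]
    return logits.argmax(dim=-1)

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # -100 marks ignored label positions; replace before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels)
    # exact-match accuracy over whole decoded sequences
    result["Seq-Accuracy"] = float(np.mean(
        [p == l for p, l in zip(decoded_preds, decoded_labels)]
    ))
    return result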

Then doing things manually:

preds, labels = test(my_model, test_loader, device)  # custom loop over model.generate (sketched below)
metrics = compute_metrics((preds, labels))
metrics = {
    'Accuracy': 85.0,
    'Seq-Accuracy': 84.0,
    'F1': 83,
    'Precision': 83.54,
    'Recall': 84.75,
    'rouge1': 0.9656326034063261,
    'rouge2': 0.9542443903757749,
    'rougeL': 0.9656326034063261,
    'rougeLsum': 0.965683292781833
}
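
For completeness, test is just a plain loop over model.generate; a simplified sketch (padding/length handling and generation arguments here are illustrative, not my exact code):

import torch

@torch.no_grad()
def test(model, loader, device):
    model.eval()
    all_preds, all_labels = [], []
    for batch in loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        generated = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=128,  # illustrative value
        )
        # collect token ids; padding/length alignment omitted for brevity
        all_preds.extend(generated.cpu().tolist())
        all_labels.extend(batch["labels"].tolist())
    return all_preds, all_labels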

Any ideas what’s going on here?