I’m seeing some weird behaviour: trainer.evaluate and trainer.predict return wrong or zero metrics compared to manually calling model.generate and then compute_metrics.
Here’s an example:
training_args = TrainingArguments(output_dir='./dummy_dir')
trainer = Trainer(
    model=my_model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["valid"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)
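For context, the two callbacks are shaped roughly like this (a stripped-down sketch, not the exact code; the real compute_metrics also computes token-level Accuracy, F1, Precision and Recall):

import evaluate
import numpy as np

rouge = evaluate.load("rouge")

def preprocess_logits_for_metrics(logits, labels):
    # keep only the predicted token ids so the Trainer doesn't accumulate
    # the full vocab-sized logits tensor during evaluation
    if isinstance(logits, tuple):
        logits = logits[0]
    return logits.argmax(dim=-1)

def compute_metrics(eval_preds):
    # `tokenizer` is the same tokenizer passed to the Trainer
    preds, labels = eval_preds
    decoded_preds, decoded_labels = [], []
    for pred, label in zip(preds, labels):
        label = [int(tok) for tok in label if tok != -100]  # drop ignored positions
        decoded_preds.append(tokenizer.decode(pred, skip_special_tokens=True))
        decoded_labels.append(tokenizer.decode(label, skip_special_tokens=True))

    # whole-sequence exact match ("Seq-Accuracy"); ROUGE via the evaluate library
    seq_acc = np.mean([p == l for p, l in zip(decoded_preds, decoded_labels)])
    scores = rouge.compute(predictions=decoded_preds, references=decoded_labels)
    return {"Seq-Accuracy": 100.0 * seq_acc, **scores}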
Calling trainer.predict gives:
metrics = {
'test_loss': 0.05446154996752739,
'test_Accuracy': 0.0,
'test_Seq-Accuracy': 0.0,
'test_F1': 0,
'test_Precision': 0.0,
'test_Recall': 0.0,
'test_rouge1': 0.9656326034063261,
'test_rouge2': 0.9542443903757749,
'test_rougeL': 0.9656326034063261,
'test_rougeLsum': 0.965683292781833,
'test_runtime': 120.7627,
'test_samples_per_second': 40.84,
'test_steps_per_second': 5.109
}
Then doing things manually:
preds, labels = test(my_model, test_loader, device)
metrics = compute_metrics((preds, labels))
metrics = {
'Accuracy': 85.0,
'Seq-Accuracy': 84.0,
'F1': 83,
'Precision': 83.54,
'Recall': 84.75,
'rouge1': 0.9656326034063261,
'rouge2': 0.9542443903757749,
'rougeL': 0.9656326034063261,
'rougeLsum': 0.965683292781833
}
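For completeness, the test helper above is essentially a loop that calls model.generate on each batch and collects the token ids (simplified sketch; the real version passes my actual generation arguments):

import torch

def test(model, loader, device):
    # minimal version of the manual loop: generate for every batch and
    # collect the prediction / label token ids for compute_metrics
    model.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for batch in loader:
            generated = model.generate(
                input_ids=batch["input_ids"].to(device),
                attention_mask=batch["attention_mask"].to(device),
                max_new_tokens=128,  # placeholder; the real call uses my generation settings
            )
            all_preds.extend(generated.cpu().tolist())
            all_labels.extend(batch["labels"].tolist())
    return all_preds, all_labels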
Any ideas what’s going on here?