trainer.evaluate() vs trainer.predict()

I am following the multilabel text classification tutorial from @nielsr located here: Transformers-Tutorials/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb at master · NielsRogge/Transformers-Tutorials · GitHub

I currently have my dataset split into train, test, and validation sets. After training, trainer.evaluate() is called, which I believe runs on the validation dataset. My question is: how do I use the model I trained to predict the labels on my test dataset? Do I just call trainer.predict() immediately after trainer.evaluate(), like so?

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.evaluate()
trainer.predict(encoded_dataset["test"])

Or can I just skip trainer.evaluate() and go straight to trainer.predict(), like so?

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.predict(encoded_dataset["test"])

Any help would be greatly appreciated. Thank you!


It depends on what you'd like to do: trainer.evaluate() will predict and compute metrics on the dataset you pass it, while trainer.predict() will only predict labels on your test set. However, if the test set also contains ground-truth labels, trainer.predict() will compute metrics as well.
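
For concreteness, here is a minimal sketch of what the two calls return, assuming encoded_dataset["test"] contains a labels column (the variable names are only illustrative):

# evaluate() returns a dict of metrics computed on the dataset you pass it
metrics = trainer.evaluate(eval_dataset=encoded_dataset["test"])
print(metrics)  # e.g. {"eval_loss": ..., "eval_f1": ...}

# predict() returns a PredictionOutput named tuple
output = trainer.predict(encoded_dataset["test"])
print(output.predictions.shape)  # raw logits, shape (num_examples, num_labels)
print(output.label_ids.shape)    # ground-truth labels, if the dataset has them
print(output.metrics)            # metrics are included when labels are present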


Thanks for getting back to me. Maybe my question is more about what's happening inside trainer.train() and the difference between validation and prediction.

After every training epoch (at least the way it is set up in the tutorial notebook), isn't the model being evaluated against the validation dataset? So why is trainer.evaluate() also being run on the validation dataset? Wouldn't you want that to be the test dataset?
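
For reference, this is roughly how the per-epoch evaluation is wired up (a sketch; the argument values here are illustrative, not copied from the tutorial):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bert-finetuned-multilabel",  # illustrative name
    evaluation_strategy="epoch",             # evaluate on eval_dataset after every epoch
    save_strategy="epoch",
    num_train_epochs=5,
)
# During trainer.train(), the eval_dataset (the validation split) is used for
# these per-epoch evaluations; the test split is never touched during training.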

Hi! I have encountered the same problem when running the same notebook. Did you manage to find the answer?

I solved the problem by passing the test dataset as the evaluation dataset: trainer.evaluate(eval_dataset=encoded_dataset["test"])
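
If you want the reported metric names to make it obvious they come from the test split, evaluate() also accepts a metric_key_prefix argument (a small add-on to the line above):

test_metrics = trainer.evaluate(
    eval_dataset=encoded_dataset["test"],
    metric_key_prefix="test",  # metrics are reported as test_loss, test_f1, ...
)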


You can also set your test dataset as the eval dataset on the fly after training completes:
trainer.eval_dataset = encoded_dataset["test"]
then run
trainer.evaluate()

I might have a clue for generative tasks. I've had the same problem with summarization, but it seems the generation length used in evaluation mode is longer than the one used during validation, which explains why the two accuracies differed. You should check the evaluation kwargs to see the differences!
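
For example, with Seq2SeqTrainer the generation settings can come from several places, so it is worth pinning them down explicitly. A sketch, with illustrative values and dataset names:

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="summarization-model",  # illustrative name
    predict_with_generate=True,        # generate sequences when evaluating
    generation_max_length=128,         # used by evaluate()/predict() unless overridden
    generation_num_beams=4,
)

# Seq2SeqTrainer.evaluate()/predict() also accept generation kwargs directly,
# so make sure both calls use the same settings, e.g.:
# trainer.evaluate(max_length=128, num_beams=4)
# trainer.predict(tokenized_dataset["test"], max_length=128, num_beams=4)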