Trainer.evaluate() with text generation

Hi everyone, I’m fine-tuning XLNet for generation. For training, I’ve edited the permutation_mask to predict the target sequence one word at a time. I’m evaluating my trained model and am trying to decide between trainer.evaluate() and model.generate(). Running the same input/model with both methods yields different predicted tokens. Is it correct that trainer.evaluate() is not set up for sequential generation? I’ll switch my evaluation code to use model.generate() if that’s the case. Thanks for the help!
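For anyone wondering why the two methods disagree: `trainer.evaluate()` does a single forward pass over the gold sequence and returns logits, so taking the argmax at each position gives teacher-forced predictions, while `model.generate()` decodes autoregressively, conditioning each step on the model's own previous output. Here is a toy illustration of that difference (pure NumPy, not XLNet; the bigram score table and all names are made up):

```python
# Toy "model": a bigram table mapping previous token id -> scores over the vocab.
import numpy as np

VOCAB = ["<s>", "a", "b", "c"]
next_scores = np.array([
    [0.0, 0.9, 0.1, 0.0],   # after <s>: prefers "a"
    [0.0, 0.0, 0.2, 0.8],   # after "a":  prefers "c"
    [0.0, 0.7, 0.0, 0.3],   # after "b":  prefers "a"
    [0.0, 0.1, 0.9, 0.0],   # after "c":  prefers "b"
])

def teacher_forced_argmax(gold_ids):
    # What argmax over evaluate() logits gives you:
    # each step conditions on the *gold* prefix.
    return [int(np.argmax(next_scores[prev])) for prev in gold_ids[:-1]]

def greedy_generate(start_id, steps):
    # What generate() does: each step conditions on the
    # model's *own* previous prediction.
    out, prev = [], start_id
    for _ in range(steps):
        prev = int(np.argmax(next_scores[prev]))
        out.append(prev)
    return out

gold = [0, 2, 1, 3]  # <s> b a c
print(teacher_forced_argmax(gold))  # → [1, 1, 3]
print(greedy_generate(0, 3))        # → [1, 3, 2]
```

The two outputs diverge as soon as the model's own prediction differs from the gold token, which is why sequential generation needs `model.generate()`.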


Hi, I encountered a similar problem when trying to use EncoderDecoderModel for seq2seq tasks. It seems that Trainer does not currently support text-generation tasks, as the documentation indicates.

There’s a PR for that; you could try using it.


Is there any update regarding this topic?
I would like to train a VisionEncoderDecoderModel for image captioning and measure the BLEU metric during evaluation. The EvalPrediction object I get in compute_metrics contains only the logits, not the generated texts or tokens (i.e. the result of a beam search). I would assume that computing metrics on the output of generate is not uncommon.

The PR mentioned in this thread seems to be stale, and there have been quite a few changes to Trainer since it was proposed.

Hi @cgawron, you can take a look at my TrOCR notebooks here: Transformers-Tutorials/TrOCR at master · NielsRogge/Transformers-Tutorials · GitHub.

They include several example notebooks on fine-tuning TrOCR (which is an instance of VisionEncoderDecoderModel). There is a notebook that uses the Seq2SeqTrainer, as well as one using native PyTorch. In both cases, I illustrate how to compute metrics using generate (the notebooks use CER, but you can easily replace it with something like ROUGE or BLEU).


Thank you, @nielsr, for providing these examples!

Just for reference for other readers:
The Seq2SeqTrainingArguments now include a predict_with_generate flag for exactly this purpose.
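To spell that out: with `Seq2SeqTrainingArguments(..., predict_with_generate=True)` and a `Seq2SeqTrainer`, the `predictions` field of the `EvalPrediction` passed to `compute_metrics` holds generated token ids rather than logits. A minimal sketch of a matching `compute_metrics` follows; the helper name `build_compute_metrics` and the injected `metric` object (e.g. something loaded with the `evaluate` library) are illustrative, not part of the Trainer API. Note that the Trainer pads labels with -100 (the value ignored by the loss), which must be replaced with the pad token id before decoding:

```python
import numpy as np

def build_compute_metrics(tokenizer, metric):
    # tokenizer: any tokenizer with pad_token_id and batch_decode.
    # metric: any object with a compute(predictions=..., references=...) method.
    def compute_metrics(eval_pred):
        preds, labels = eval_pred
        # Labels are padded with -100 for the loss; swap in the pad id
        # so batch_decode does not choke on negative ids.
        labels = np.where(np.asarray(labels) != -100,
                          labels, tokenizer.pad_token_id)
        pred_texts = tokenizer.batch_decode(preds, skip_special_tokens=True)
        label_texts = tokenizer.batch_decode(labels, skip_special_tokens=True)
        # Many text metrics (e.g. BLEU) expect a list of references per sample.
        return metric.compute(predictions=pred_texts,
                              references=[[t] for t in label_texts])
    return compute_metrics
```

The result of `build_compute_metrics(tokenizer, metric)` is what you pass as `compute_metrics=` to the `Seq2SeqTrainer`.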