Using Trainer class with T5 - what is returned in EvalPrediction dict?

Hi,

I am trying to finetune T5 using the Trainer class. I understand that Trainer doesn’t work out-of-the-box for seq2seq tasks and saw @patrickvonplaten’s https://github.com/huggingface/transformers/pull/5840 which extends the Trainer class to work for Bert2Bert. Is my understanding correct that the Trainer class appropriately handles training for seq2seq models since the loss is calculated by the model itself, and that the only problem is when returning EvalPredictions for calculating and logging custom validation metrics?

If so, I would really appreciate it if someone could help me understand what’s being returned in the EvalPrediction dict for T5. It seems like EvalPrediction.predictions is of size batch_size * max_output_len * model_size (31218). Is this the generated prediction in embedding form? If so, what is the best way to convert this to prediction ids? I tried naively calling model.lm_head() on it, but that didn’t seem to be the correct approach. @valhalla perhaps you can weigh in. I also took a look at your notebook finetuning T5 with PyTorch Lightning, but would really like to use the HF Trainer class.

Thanks for all help.

Hi @melody-ju
Not sure what you mean by "T5 doesn’t work out-of-the-box for seq2seq tasks". T5 is a seq2seq model and it does work for seq2seq tasks.

You can use Trainer for seq2seq tasks as it is. Patrick’s PR extends it so that generative metrics (ROUGE, BLEU, etc.) can be calculated during evaluation. It should be okay if you calculate them after the training is finished.

To use Trainer for T5, the dataset or collator (if you are using one) should at least return input_ids, attention_mask and labels (set pad tokens to -100 in labels so that the loss ignores them). The rest will be handled by Trainer.
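As a rough illustration of the -100 convention, here is a minimal sketch in plain Python. The helper name `mask_pad_in_labels` and the pad id `0` in the usage line are assumptions for illustration only; in practice you would take the value from your tokenizer's `pad_token_id` and do the same replacement on tensors.

```python
def mask_pad_in_labels(label_ids, pad_token_id, ignore_index=-100):
    """Return a copy of `label_ids` with every pad token replaced by `ignore_index`.

    `label_ids` is a list of already padded token-id rows (one row per example).
    """
    return [
        [ignore_index if token == pad_token_id else token for token in row]
        for row in label_ids
    ]


# Usage: the pad id 0 here is only an example value.
masked = mask_pad_in_labels([[5, 6, 1, 0, 0]], pad_token_id=0)
print(masked)  # [[5, 6, 1, -100, -100]]
```

Only the labels are masked this way; input_ids keep their pad tokens and rely on attention_mask instead.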

This notebook uses Trainer for fine-tuning T5.

A few things to note about that notebook:
I wrote it before v3.0.0, and a few things have changed since then.

  1. DataCollator is not a class anymore, so you won’t need to inherit from DataCollator when creating T2TDataCollator. Also, collate_batch should be renamed to __call__.
  2. lm_labels is now deprecated; use labels instead.
  3. No need to manually add </s> anymore; the tokenizer now does that itself.
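To make the first two points concrete, a collator after these changes might look roughly like the sketch below. This is not the notebook's actual code: the per-example field name `target_ids` and the `stack` hook are assumptions for illustration, and the examples are assumed to be already padded to equal length with pad positions in the targets already set to -100.

```python
class T2TDataCollator:
    """Plain callable collator: no DataCollator base class, __call__ instead of collate_batch."""

    def __init__(self, stack=list):
        # `stack` turns a list of per-example rows into one batch object.
        # Hypothetical hook: the default keeps plain lists so this sketch runs
        # without torch; in real use a tensor-stacking function would go here.
        self.stack = stack

    def __call__(self, batch):
        return {
            "input_ids": self.stack([ex["input_ids"] for ex in batch]),
            "attention_mask": self.stack([ex["attention_mask"] for ex in batch]),
            # The key is `labels`, not the deprecated `lm_labels`.
            "labels": self.stack([ex["target_ids"] for ex in batch]),
        }


# Usage with two toy examples (token ids are made up).
examples = [
    {"input_ids": [11, 12], "attention_mask": [1, 1], "target_ids": [3, -100]},
    {"input_ids": [13, 14], "attention_mask": [1, 1], "target_ids": [4, 5]},
]
collated = T2TDataCollator()(examples)
```

An instance of such a class can then be passed as data_collator to Trainer.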

Also, you can use the prepare_seq2seq_batch method on the tokenizer, which takes the source and target text and returns input_ids, attention_mask and labels.

You can also use the finetune.py script from here to finetune T5 and other seq2seq models. It’s using PL, and there’s a WIP version of Seq2SeqTrainer in this PR.

For more T5-related tips, see T5 Finetuning Tips.

Oops sorry, I meant that Trainer does not work out of the box for seq2seq

It works :). It cannot calculate generative metrics right now, but that’s being added in Seq2SeqTrainer.

Gotcha! Looking forward to using Seq2SeqTrainer.

In the meantime I would like to calculate validation metrics during training, but I don’t understand how to manipulate the output that Trainer returns in EvalPrediction as the “prediction” in order to retrieve the ids of the generated prediction. What is it actually returning here?

To calculate generative metrics during training, either clone Patrick’s branch or the Seq2SeqTrainer PR branch.

The default Trainer returns the output of the final LM head layer (the logits), which is why the shape is batch_size * max_output_len * vocab_size. The above branches instead call the generate method inside the eval loop and return the generated ids as EvalPrediction.predictions and the actual labels as EvalPrediction.label_ids. You can then pass these to the tokenizer’s decode or batch_decode method in the compute_metrics function to get the text and calculate metrics like ROUGE, BLEU, etc.
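For anyone who wants to see what that looks like in a compute_metrics function, here is a hedged sketch using NumPy. It handles both cases: raw logits (3-D, reduced with argmax over the vocab axis) and already generated ids (2-D). The names `logits_to_ids`, `make_compute_metrics` and the toy `exact_match` metric are my own placeholders, not part of the library; a real setup would plug ROUGE or BLEU in at that point. Keep in mind that argmax over the default logits gives per-position predictions conditioned on the gold prefix, which is not the same thing as calling generate.

```python
import numpy as np


def logits_to_ids(logits):
    """Pick the highest-scoring token at every position: (N, L, V) -> (N, L)."""
    return np.argmax(logits, axis=-1)


def make_compute_metrics(tokenizer, pad_token_id):
    """Build a compute_metrics function that decodes ids to text before scoring.

    `tokenizer` only needs a `batch_decode(ids, skip_special_tokens=...)` method.
    """

    def compute_metrics(eval_pred):
        preds = np.asarray(eval_pred.predictions)
        labels = np.asarray(eval_pred.label_ids)
        if preds.ndim == 3:
            # Default Trainer output: logits, so reduce to ids and blank out
            # the positions that the labels mark as ignored (-100).
            preds = logits_to_ids(preds)
            preds = np.where(labels != -100, preds, pad_token_id)
        # -100 is not a valid token id, so swap it for the pad id before decoding.
        labels = np.where(labels != -100, labels, pad_token_id)
        pred_text = tokenizer.batch_decode(preds, skip_special_tokens=True)
        label_text = tokenizer.batch_decode(labels, skip_special_tokens=True)
        # Placeholder metric: fraction of predictions that match the reference exactly.
        matches = [p.strip() == l.strip() for p, l in zip(pred_text, label_text)]
        return {"exact_match": float(np.mean(matches))}

    return compute_metrics
```

The result of `make_compute_metrics(tokenizer, tokenizer.pad_token_id)` would be what you pass as compute_metrics to the trainer.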

@melody-ju did you manage to compute metrics? I am stuck exactly at the same point as you.

from transformers import AutoModelForMaskedLM, Trainer

model_checkpoint = "distilroberta-base"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint).to('cuda')

def compute_custom_metric(eval_pred):
    print(eval_pred.predictions.shape) # ==> (3387, 32, 50265) (batch_size * max_output_len * vocab_size)
    print(eval_pred.label_ids.shape) # ==> (3387, 32) (batch_size * max_output_len)
    return {'custom_metric': 0}

trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train,
    eval_dataset = validation,
    tokenizer = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_custom_metric
)

trainer.evaluate()

I would like to calculate perplexity, not quite sure how to go about it.
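One possible way to get there from those two arrays, as a sketch rather than a definitive recipe: perplexity is the exponential of the mean cross-entropy over the positions whose label is not -100 (for a masked LM that means only the masked tokens). The function name `perplexity_from_logits` is an assumption of mine. Note that this builds temporaries as large as the logits array, so it is memory hungry at vocab size 50265; if the evaluation output already reports the mean cross-entropy loss, taking the exponential of that value should be much cheaper.

```python
import numpy as np


def perplexity_from_logits(logits, labels, ignore_index=-100):
    """exp(mean cross-entropy) over positions where labels != ignore_index."""
    logits = np.asarray(logits)
    labels = np.asarray(labels)
    mask = labels != ignore_index
    if not mask.any():
        raise ValueError("no positions to score: every label is ignore_index")
    # Numerically stable log-softmax over the vocab axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Use a dummy index 0 at ignored positions; they are masked out below.
    safe_labels = np.where(mask, labels, 0)
    token_log_probs = np.take_along_axis(
        log_probs, safe_labels[..., None], axis=-1
    ).squeeze(-1)
    mean_nll = -(token_log_probs * mask).sum() / mask.sum()
    return float(np.exp(mean_nll))


def compute_custom_metric(eval_pred):
    return {
        "perplexity": perplexity_from_logits(eval_pred.predictions, eval_pred.label_ids)
    }
```

As a sanity check, uniform logits over a vocabulary of size V should give a perplexity of V.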

@valhalla Can you please share some thoughts?

@valhalla How can I use this script for T5-large?
I have a machine with 4x4 P100 GPUs and am still fairly new to PL. From what I have tried, I think T5-large is too large to fit on a single GPU. To fine-tune it, do I need to implement sharded training, or is there something else I can try?