I am trying to finetune T5 using the Trainer class. I understand that Trainer doesn’t work out-of-the-box for seq2seq tasks and saw @patrickvonplaten’s https://github.com/huggingface/transformers/pull/5840 which extends the Trainer class to work for Bert2Bert. Is my understanding correct that the Trainer class appropriately handles training for seq2seq models since the loss is calculated by the model itself, and that the only problem is when returning EvalPredictions for calculating and logging custom validation metrics?
If so, I would really appreciate it if someone could help me understand what’s being returned in the EvalPrediction dict for T5. It seems like EvalPrediction.predictions is of size batch_size * max_output_len * model_size (31218). Is this the generated prediction in embedding form? If so, what is the best way to convert it to prediction IDs? I tried naively calling model.lm_head() on it, but that didn’t seem to be the correct approach. @valhalla perhaps you can weigh in; I also took a look at your notebook finetuning T5 with PyTorch Lightning, but I would really like to use the HF Trainer class.
Not sure what you mean by "T5 doesn’t work out-of-the-box for seq2seq tasks". T5 is a seq2seq model and it does work for seq2seq tasks.
You can use Trainer for seq2seq tasks as it is. Patrick’s PR extends it so that generative metrics (ROUGE, BLEU, etc.) can be calculated; it should be okay if you calculate them after the training is finished.
To use Trainer for T5, the dataset or collator (if you are using one) should at least return input_ids, attention_mask and labels (with pad tokens set to -100 in labels). The rest will be handled by Trainer.
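The pad-token masking can be sketched in plain Python (the helper name and the hard-coded pad id are illustrative; in practice you would read pad_token_id from your tokenizer):

```python
# Sketch: replace pad token ids in the labels with -100 so the loss
# function ignores them. T5's pad token id happens to be 0, but you
# should take it from tokenizer.pad_token_id in real code.
PAD_TOKEN_ID = 0

def mask_pad_tokens(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Return a copy of the label batch with pad tokens set to -100."""
    return [[tok if tok != pad_token_id else -100 for tok in seq]
            for seq in label_ids]

batch_labels = [[37, 10, 5, 1, 0, 0], [42, 7, 1, 0, 0, 0]]
print(mask_pad_tokens(batch_labels))
# [[37, 10, 5, 1, -100, -100], [42, 7, 1, -100, -100, -100]]
```

A collator would apply this to the tokenized targets before handing the batch to Trainer.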
In the meantime I would like to calculate validation metrics during training, but I don’t understand how to manipulate the output given by Trainer in EvalPrediction as the “prediction” in order to retrieve the IDs of the generated prediction. What is it actually returning here?
To calculate generative metrics during training, either clone Patrick’s branch or the Seq2SeqTrainer PR branch.
The default Trainer returns the output of the final LM head layer, which is why the shape is batch_size * max_output_len * vocab_size. The above branches instead call the generate method inside the eval loop and return the generated ids as EvalPrediction.predictions and the actual labels as EvalPrediction.label_ids, which you can then pass to the tokenizer’s decode or batch_decode method in the compute_metrics function to get the text and calculate metrics like ROUGE, BLEU, etc.
@valhalla How can I use this script for t5-large?
I have a machine with 4x4 P100 GPUs. I am still fairly new to PL. From what I have tried so far, t5-large seems too large to fit on a single GPU. Do I need to implement sharded training to fine-tune t5-large, or is there something else I can try?