- Training with AdaFactor works quite well for me so far. I use the constant LR of 0.001 recommended throughout Colin Raffel's finetuning paper, plus the other AdaFactor settings from Noam Shazeer's original paper.
- fairseq's AdaFactor implementation is good, except that you need to turn its auto-scaling options off; no idea why they are on by default in the init:
- https://github.com/pytorch/fairseq/blob/775122950d145382146e9120308432a9faf9a9b8/fairseq/optim/adafactor.py
- lr=0.001, scale_parameter=False, relative_step=False (see the setup sketch below)
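For reference, wiring this up looks roughly like the following. This is a minimal sketch assuming fairseq and transformers are installed; the t5-base checkpoint is just an example:

```python
# Minimal sketch: fairseq's Adafactor with the auto-scaling options turned off.
from fairseq.optim.adafactor import Adafactor
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-base")  # example checkpoint

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,                 # constant learning rate, no schedule
    scale_parameter=False,   # disable parameter-scale-relative updates
    relative_step=False,     # disable the built-in inverse-sqrt LR schedule
)
```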
This even “works” in FP16 – but don’t get me started on Native AMP quite yet…
But in summary: I would strongly recommend using AdaFactor and not Adam for T5 training and finetuning.
- this is what the T5 authors use themselves
- AdaFactor was developed specifically with Transformers/T5 in mind (the authors say so in the paper)
- Adam is a massive waste of memory in general: it keeps two extra full-size states (first and second moment) per parameter, whereas AdaFactor factorizes the second moment into row and column statistics. It's not surprising that something more memory-efficient works just as well, unless you have custom additions to your model. A quick way to see the difference is sketched below.
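To make the memory point concrete, here is a rough sketch that compares the optimizer-state size of Adam and AdaFactor on the same model. It assumes fairseq and transformers are installed; t5-small and the tiny dummy batch are purely illustrative:

```python
# Rough comparison of optimizer-state memory: Adam vs. AdaFactor.
import torch
from fairseq.optim.adafactor import Adafactor
from transformers import T5ForConditionalGeneration

def state_megabytes(optimizer):
    """Sum the size of every tensor the optimizer keeps as per-parameter state."""
    total = 0
    for per_param_state in optimizer.state.values():
        for value in per_param_state.values():
            if torch.is_tensor(value):
                total += value.numel() * value.element_size()
    return total / 2**20

model = T5ForConditionalGeneration.from_pretrained("t5-small")  # example checkpoint
dummy = torch.tensor([[0, 1, 2]])
model(input_ids=dummy, labels=dummy).loss.backward()  # populate .grad

for make_opt in [
    lambda p: torch.optim.Adam(p, lr=1e-3),
    lambda p: Adafactor(p, lr=1e-3, scale_parameter=False, relative_step=False),
]:
    opt = make_opt(model.parameters())
    opt.step()  # state tensors are allocated lazily on the first step
    print(type(opt).__name__, f"optimizer state: {state_megabytes(opt):.0f} MiB")
```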