- Training with AdaFactor works quite well for me so far. I use the constant LR of 0.001 recommended throughout Colin Raffel's finetuning paper, plus the other AdaFactor settings from Noam Shazeer's original paper.
- fairseq's AdaFactor implementation is good, except that you need to turn its auto-scaling options off; no idea why they are on by default in the init:
- https://github.com/pytorch/fairseq/blob/775122950d145382146e9120308432a9faf9a9b8/fairseq/optim/adafactor.py
- lr=0.001, scale_parameter=False, relative_step=False (see the setup sketch below)
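For reference, wiring this up looks roughly like the following. This is a minimal sketch assuming fairseq and transformers are installed; the t5-base checkpoint is just an example:

```python
# Minimal sketch: fairseq's Adafactor with the auto-scaling options turned off.
from fairseq.optim.adafactor import Adafactor
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-base")  # example checkpoint

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,                 # constant learning rate, no schedule
    scale_parameter=False,   # disable parameter-scale-relative updates
    relative_step=False,     # disable the built-in inverse-sqrt LR schedule
)
```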
This even “works” in FP16 – but don’t get me started on Native AMP quite yet…
But in summary: I would strongly recommend using AdaFactor and not Adam for T5 training and finetuning.
- this is what the T5 authors use themselves
- AdaFactor was developed specifically with Transformers/T5 in mind (the authors say so in the paper)
- Adam is a massive waste of memory in general: it keeps two extra full-size states (first and second moment) per parameter, whereas AdaFactor factorizes the second moment into row and column statistics. It's not surprising that something more memory-efficient works just as well, unless you have custom additions to your model. A quick way to see the difference is sketched below.
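To make the memory point concrete, here is a rough sketch that compares the optimizer-state size of Adam and AdaFactor on the same model. It assumes fairseq and transformers are installed; t5-small and the tiny dummy batch are purely illustrative:

```python
# Rough comparison of optimizer-state memory: Adam vs. AdaFactor.
import torch
from fairseq.optim.adafactor import Adafactor
from transformers import T5ForConditionalGeneration

def state_megabytes(optimizer):
    """Sum the size of every tensor the optimizer keeps as per-parameter state."""
    total = 0
    for per_param_state in optimizer.state.values():
        for value in per_param_state.values():
            if torch.is_tensor(value):
                total += value.numel() * value.element_size()
    return total / 2**20

model = T5ForConditionalGeneration.from_pretrained("t5-small")  # example checkpoint
dummy = torch.tensor([[0, 1, 2]])
model(input_ids=dummy, labels=dummy).loss.backward()  # populate .grad

for make_opt in [
    lambda p: torch.optim.Adam(p, lr=1e-3),
    lambda p: Adafactor(p, lr=1e-3, scale_parameter=False, relative_step=False),
]:
    opt = make_opt(model.parameters())
    opt.step()  # state tensors are allocated lazily on the first step
    print(type(opt).__name__, f"optimizer state: {state_megabytes(opt):.0f} MiB")
```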