T5 Finetuning Tips

This even “works” in FP16 – but don’t get me started on Native AMP quite yet…

But in summary – I would strongly recommend using Adafactor and not Adam for T5 training and finetuning (a minimal setup sketch follows the list below).

  • this is what the T5 authors use themselves
  • Adafactor was developed specifically with Transformers/T5 in mind (they say so in the paper)
  • Adam is a massive waste of memory in general; it’s not surprising that something more memory-efficient works just as well, unless you have custom additions to your model
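
Here is a minimal sketch of swapping Adam for Adafactor when finetuning T5, assuming the Hugging Face transformers library's Adafactor implementation; the model name and hyperparameters are illustrative, not the exact settings anyone in this thread used:

```python
# Minimal sketch: Adafactor instead of Adam for T5 finetuning.
# Assumes the transformers library; "t5-small" and the learning rate
# below are illustrative placeholders, tune them for your task.
from transformers import T5ForConditionalGeneration
from transformers.optimization import Adafactor

model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Option 1: external, constant learning rate (common for finetuning).
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,                 # illustrative value
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
    weight_decay=0.0,
)

# Option 2: let Adafactor manage the step size itself
# (closer to the original T5 pre-training setup).
# optimizer = Adafactor(
#     model.parameters(),
#     lr=None,
#     scale_parameter=True,
#     relative_step=True,
#     warmup_init=True,
# )
```

Either optimizer can then be dropped into your existing training loop (or passed to a Trainer) in place of Adam, with no other changes.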