T5 Finetuning Tips

Hi guys!
I just finished training T5-large on ELI5 (270,000 examples) using a TPU V2-8 on Colab, modified from @valhalla's notebook! These are not really finetuning tips, but rather some tips to make T5-large trainable on a TPU V2-8.

T5-large is challenging to train on TPU V2-8 with PyTorch (for me)

  • I faced a lot of memory problems (even on a Colab High-RAM instance); this notebook by Davide Libenzi, one of the XLA authors, suggests declaring the large model outside _mp_fn (see his mx variable) — see the sketch after this list
  • with T5-base, there is around 7 minutes of overhead before training can start; for T5-large, this overhead was about 1 hour for me
  • with max_length = 128 (for both input and target), I am able to set per_device_train_batch_size = 4 (so global_batch_size = 4*8 = 32)
  • there is an issue where xm.save() causes memory errors with large models like XLM-RoBERTa; it happens to T5-large too, so I have to bypass the default save_steps of Trainer by setting it to 1000000
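For reference, here is a minimal sketch of the "declare the model outside _mp_fn" pattern. It is not the exact code from Davide's notebook (his mx variable may differ in details); I use xmp.MpModelWrapper here as one way to keep a single host copy of the weights that the forked processes can share:

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
from transformers import T5ForConditionalGeneration

# Build the large model ONCE in the parent process, outside _mp_fn.
# MpModelWrapper keeps a single host copy of the weights and moves them to
# each TPU core on demand, instead of loading 8 separate copies into RAM.
WRAPPED_MODEL = xmp.MpModelWrapper(
    T5ForConditionalGeneration.from_pretrained("t5-large")
)

def _mp_fn(index):
    device = xm.xla_device()
    model = WRAPPED_MODEL.to(device)  # weights are sent to this core's device
    # ... build the dataset, optimizer and training loop (or Trainer) here ...

if __name__ == "__main__":
    # start_method="fork" is what lets the children share the parent's memory
    xmp.spawn(_mp_fn, nprocs=8, start_method="fork")
```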

Combining all of these took me around a day to get a trainable notebook, so hopefully these tricks can be useful to some of you too! The key settings boil down to something like the sketch below.
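This is an illustration rather than my exact notebook: the dataset field names and output_dir are placeholders, and the Trainer would be run inside _mp_fn as in the sketch above.

```python
from transformers import T5Tokenizer, TrainingArguments

tokenizer = T5Tokenizer.from_pretrained("t5-large")

def encode(example):
    # Clip both the input question and the target answer to 128 tokens.
    # "question" / "answer" are placeholder field names for the ELI5 examples.
    model_inputs = tokenizer(example["question"], max_length=128,
                             truncation=True, padding="max_length")
    labels = tokenizer(example["answer"], max_length=128,
                       truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

training_args = TrainingArguments(
    output_dir="t5-large-eli5",          # placeholder path
    per_device_train_batch_size=4,       # 4 per core x 8 cores = global batch of 32
    save_steps=1_000_000,                # effectively skips xm.save() checkpoints
    num_train_epochs=1,
)
```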

I would like to find time to make a TF2 version, which should be more stable on TPU :slight_smile:

More notes

  • As @valhalla mentioned in his notebook, a High-RAM instance is a must. Recently Kaggle notebooks increased RAM to 16GB for TPU V3-8, but I could not get the training to succeed (sadly, since V3-8 should be 2x faster than V2-8)