Hi guys!
I just finished training T5-large on ELI5 (270,000 examples) using a TPU V2-8 on Colab, modified from @valhalla's notebook! These are not really fine-tuning tips, but some tips to make T5-large trainable on a TPU V2-8.
T5-large is challenging (for me) to train on a TPU V2-8 with PyTorch:
- I faced a lot of memory problems (even on a Colab High-RAM instance). This notebook by Davide Libenzi, one of the XLA authors, suggests declaring the large model outside `_mp_fn` (see his `mx` variable); there is a sketch of this pattern after this list.
- With T5-base there is around 7 minutes of overhead before training can start; with T5-large this overhead takes 1 hour for me.
- With `max_length = 128` (both input and target), I am able to set `per_device_train_batch_size = 4` (so `global_batch_size = 4 * 8 = 32`).
- There is an issue where `xm.save()` causes memory errors with large models like XLM-Roberta, and it happens with T5-large too, so I have to ignore the default `save_steps` of `Trainer` by setting it to 1000000. (The second sketch below shows these settings.)
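Here is a rough sketch of the "declare the model outside `_mp_fn`" trick, not the exact code from either notebook. I use `xmp.MpModelWrapper`, which is the torch_xla helper for this pattern; a plain global variable (like the `mx` variable mentioned above) works the same way, the point is that the model is built once, outside the per-core function.

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
from transformers import T5ForConditionalGeneration

# Built once in the global scope, so the 8 TPU processes share one host copy
# of the weights instead of each loading its own (which blows up host RAM).
WRAPPED_MODEL = xmp.MpModelWrapper(
    T5ForConditionalGeneration.from_pretrained("t5-large")
)

def _mp_fn(index):
    device = xm.xla_device()
    model = WRAPPED_MODEL.to(device)   # move the shared copy onto this TPU core
    # ... build the dataset, optimizer / Trainer and run training here ...

# start_method='fork' lets the child processes share the parent's memory
xmp.spawn(_mp_fn, nprocs=8, start_method='fork')
```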
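And a minimal sketch of the last two points, assuming the standard `transformers` `Trainer` setup from the notebook (`output_dir` and the example strings are just placeholders):

```python
from transformers import T5Tokenizer, TrainingArguments

tokenizer = T5Tokenizer.from_pretrained("t5-large")

# both the ELI5 question and the answer are truncated/padded to 128 tokens
enc = tokenizer("Why is the sky blue?", max_length=128,
                truncation=True, padding="max_length", return_tensors="pt")
dec = tokenizer("Because of Rayleigh scattering ...", max_length=128,
                truncation=True, padding="max_length", return_tensors="pt")

training_args = TrainingArguments(
    output_dir="t5-large-eli5",
    per_device_train_batch_size=4,   # 4 per core * 8 cores = global batch of 32
    save_steps=1_000_000,            # effectively never call xm.save() mid-training
)
```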
Combining all of these took me around 1 day before I could make a trainable notebook, so hopefully these tricks can be useful to some of you guys too!
I would like to find time to make a TF2 version, which should be more stable on TPU.
More notes
- As @valhalla mentioned in his notebook, a High-RAM instance is a must. Lately Kaggle notebooks increased RAM to 16GB for the TPU V3-8, but I could not get the training to succeed (sadly, since a V3-8 should be 2x faster than a V2-8).