TF32 Training on Ampere?
There seems to be an ongoing discussion (across various threads and GitHub issues) about whether the T5 architecture is just inherently unstable in true FP16 mode, i.e. that the frequent FP16 NaNs are not a bug in the transformers implementation or in users' training arguments, but may simply be unavoidable.
I would like to add to this discussion that when I run the Mesh TensorFlow version of T5 from the research repo (https://github.com/google-research/text-to-text-transfer-transformer) on TPU on my dataset, 16-bit training is rock solid (I assume because of BF16's wider dynamic range). On the same dataset I essentially can never get FP16 working on anything larger than t5-small with Hugging Face (with Adafactor, with and without LR warmup, native AMP, Apex O1/O2/O3, etc.).
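For concreteness, here is a minimal, illustrative sketch of the kind of setup that fails for me (the dummy data and hyperparameters are just for the example, not my actual script):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer, Trainer, TrainingArguments
from transformers.optimization import Adafactor

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")  # anything above t5-small NaNs for me

# Tiny dummy dataset just so the snippet is self-contained; the real runs use my own corpus.
enc = tokenizer(["translate English to German: Hello world"], return_tensors="pt", padding=True)
lab = tokenizer(["Hallo Welt"], return_tensors="pt", padding=True)
train_dataset = [{"input_ids": enc.input_ids[0],
                  "attention_mask": enc.attention_mask[0],
                  "labels": lab.input_ids[0]}]

# Adafactor roughly as in the T5 recipe: fixed LR, no relative step.
optimizer = Adafactor(model.parameters(), lr=1e-3,
                      scale_parameter=False, relative_step=False, warmup_init=False)

args = TrainingArguments(
    output_dir="t5-fp16-test",
    per_device_train_batch_size=1,
    fp16=True,  # native AMP; this is where loss/logits go to NaN on the larger checkpoints
    # fp16_backend="apex", fp16_opt_level="O2",  # the Apex O1/O2/O3 variants behave the same for me
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset,
                  optimizers=(optimizer, None))
trainer.train()
```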
For workflow reasons the research Mesh TensorFlow code is not an option, and I need to get the 3B model training on GPUs, which will require ~16-bit compute in order to fit in 32-48 GB of GPU memory.
Ampere GPUs support TF32, which is similar in size (19 bits vs. 16 bits for BF16 on TPU) and keeps the same 8-bit exponent range as FP32/BF16. Has anyone tried (or even have access to) an A100 GPU to see whether TF32 solves the issue here?
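If someone does have an A100, my understanding is that TF32 only needs the standard PyTorch global switches (nothing T5-specific), roughly:

```python
import torch

# On Ampere, route FP32 matmuls/convolutions through TF32 tensor cores.
# TF32 keeps FP32's 8-bit exponent range but truncates the mantissa to 10 bits.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Model and inputs stay in FP32; only the internal matmul math uses TF32,
# so memory usage is unchanged -- this tests numerical stability, not memory fit.
```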
EDIT: it looks like Ampere also natively supports BF16, so that seems like a good way to compare T5 Mesh on TPU with T5 HF on Ampere, both using BF16.
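A hedged sketch of what that comparison run could look like on the HF side (assuming a transformers version that exposes the bf16 flag; the commented-out autocast route is the lower-level PyTorch alternative):

```python
import torch
from transformers import TrainingArguments

# Trainer-level BF16 mixed precision on Ampere or newer.
args = TrainingArguments(
    output_dir="t5-bf16-test",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    bf16=True,  # BF16 autocast: FP32-sized exponent, so no loss-scaling gymnastics
)

# Alternative, if driving the training loop manually (PyTorch >= 1.10 on Ampere):
#   with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
#       loss = model(**batch).loss
#   loss.backward()
```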