Mixed precision for bfloat16-pretrained models

As bfloat16 hardware support is becoming more available there is an emerging trend of training in bfloat16, which leads to the issue of not being able to finetune such models in mixed precision (or eval in fp16) - be it amp, apex or deepspeed/fairscale.

Last week I spent some time sitting with the NaN issues reported in t5/mt5 (and pegasus apparently too), and I have been watching the activation values: [T5/MT5] resolve inf/nan under amp (mixed precision) by stas00 路 Pull Request #10956 路 huggingface/transformers 路 GitHub

and studying the numerical qualities of bfloat16 vs bloat16: ml-ways/bfloat16-vs-float16-study.ipynb at master 路 stas00/ml-ways 路 GitHub

So my conclusion/understanding is this: since bfloat16 has no access to precision it basically compensates and trains itself to use huge numbers, so rather than having small activation values it operates in the 1e5 - 1e10+ range which is beyond the 64k limit float16 can handle and thus overflows (inf) which then immediately leads to nan (see my nb for how inf/nan comes about).

To make things worse bfloat16 huge number range has huge gaps with no numbers in it:

torch.tensor(283, dtype=torch.bfloat16)*10 # 2848 instead of 2830!

so it trains to compensate for that handicap as well. And so when float16 comes around which has much smaller gaps it obviously won鈥檛 produce the same results. See my notebook to see the gaps demo鈥檈d.

Ideally there should be some plane transform that could take the weights trained in bfloat16 and convert those to the numerical domain of float16. A naive approach could be to divide everything by ~100000 to shift to a different effective range . But because the training is non-linear I can鈥檛 see how this would be possible, other than via some DNN that was trained for such transform.

As you can see from the PR some workarounds may work, but it鈥檚 hard to keep the numbers in check when the model wants to constantly operate in the range float16 wasn鈥檛 designed for. A user already reported NaNs after a 3h training with this PR, but hasn鈥檛 shared a way to reproduce yet.

@sshleifer suggested here that perhaps finetuning with a penalty for large activations could do the trick. It鈥檚 unclear how much of such finetuning it鈥檇 take, since the need is to lower the weights by several orders of magnitude, so that the activations and accumulative math operations don鈥檛 break the 64K barrier.

So currently t5/mt5/pegasus models are affected, but I鈥檓 sure there will be more emerging as new hardware supporting bfloat16 is quickly emerging so we will have to deal with that a lot more very soon I believe.

Of course, if we wait long enough, the mixed precision will be moved to fp32/bf16 or even not be needed anymore.

If perhaps some of you have experimented with such bf16 to fp16 finetuning and had good results please do share. It鈥檚 possible that if a solid approach is found then we will need to make a 2nd set of these models whose weights are finetuned for fp16.

Thank you.


And it looks like GPT-Neo just added itself to the group of bfloat16-pretrained models.

We started compiling a wiki of how different models were pre-trained, please add your knowledge there - thanks!

1 Like