Mixed precision for bfloat16-pretrained models

stas · April 5, 2021, 8:06pm

As bfloat16 hardware support becomes more widely available, there is an emerging trend of pretraining in bfloat16. This leads to the problem of not being able to finetune such models in fp16 mixed precision (or run eval in fp16), be it with amp, apex or deepspeed/fairscale.

Last week I spent some time on the NaN issues reported for t5/mt5 (and apparently pegasus too), watching the activation values: [T5/MT5] resolve inf/nan under amp (mixed precision), Pull Request #10956 by stas00 in huggingface/transformers

and studying the numerical properties of bfloat16 vs float16: ml-ways/bfloat16-vs-float16-study.ipynb at master in stas00/ml-ways

So my conclusion/understanding is this: since bfloat16 has very little precision (8 significant bits vs 11 in float16), the model basically compensates by training itself to use huge numbers. Rather than having small activation values, it operates in the 1e5 - 1e10+ range, which is beyond the 64K limit (65504) that float16 can handle. The values thus overflow to inf, which then immediately leads to nan (see my notebook for how inf/nan comes about).
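To make the overflow concrete, here is a minimal sketch that uses only the Python standard library. The helper `to_fp16` is a hypothetical name; it emulates a float16 cast with `struct` and maps out-of-range values to inf, which is an assumption about what a hardware cast does (the same effect can be observed with `torch.float16` tensors).

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite float16 value

def to_fp16(x: float) -> float:
    """Round x to the nearest float16 value (hypothetical helper).

    struct refuses to pack out-of-range values, so overflow is mapped
    to +/-inf here to mimic what a cast on the GPU would produce.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)

activation = 1e5                 # typical magnitude claimed for bf16-trained models
as_fp16 = to_fp16(activation)    # overflows: inf
diff = as_fp16 - as_fp16         # inf - inf is nan

print(to_fp16(FP16_MAX), as_fp16, diff)
```

Once a single inf appears, any subtraction of two infs or multiplication of inf by zero turns it into nan, and the nan then propagates through the rest of the forward pass.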

To make things worse, bfloat16’s huge numerical range has large gaps between representable numbers:

torch.tensor(283, dtype=torch.bfloat16)*10 # 2848 instead of 2830!

So the model trains to compensate for that handicap as well. When float16 comes around, which has much smaller gaps, it obviously won’t produce the same results. See my notebook for a demo of the gaps.
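The gaps can be reproduced without a GPU. The sketch below emulates bfloat16 by keeping the top 16 bits of the float32 bit pattern with round-to-nearest-even; `to_bf16` and `to_fp16` are hypothetical helper names, and the emulation is only meant for ordinary finite values.

```python
import struct

def to_bf16(x: float) -> float:
    """Emulate a bfloat16 cast for finite x (hypothetical helper)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # round to nearest, ties to even, on the 16 bits that get dropped
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_fp16(x: float) -> float:
    """Round x to the nearest float16 value (in-range inputs only)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

a = to_bf16(283.0)                   # 284.0: 283 is not representable in bf16
b = to_bf16(a * 10)                  # 2848.0: neighbors near 2840 are 16 apart
c = to_fp16(to_fp16(283.0) * 10)     # 2830.0: float16 gaps here are only 2 wide

print(a, b, c)
```

At this magnitude bfloat16 steps in units of 16 while float16 steps in units of 2, which is why the same arithmetic lands on different values in the two formats.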

Ideally there should be some transform that could take the weights trained in bfloat16 and convert them to the numerical domain of float16. A naive approach could be to divide everything by ~100000 to shift to a different effective range. But because the training is non-linear, I can’t see how this would be possible, other than via some DNN trained to perform such a transform.
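One way to see why a plain rescale is not enough: any non-linear step downstream reacts differently to the scaled values. The sketch below is only an illustration under the assumption that a softmax (as in attention) consumes the activations; the logits and the 1e-5 factor are made-up numbers, not measurements from any model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits in the huge range a bf16-trained model might use.
logits = [3e5, 2e5, 1e5]
original = softmax(logits)                        # effectively one-hot
scaled = softmax([v * 1e-5 for v in logits])      # much flatter distribution

print(original)
print(scaled)
```

The ranking of the entries survives the rescale, but the distribution itself changes from a hard selection to a soft mixture, so the layers that follow see different inputs than the ones they were trained on.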

As you can see from the PR, some workarounds may work, but it’s hard to keep the numbers in check when the model constantly wants to operate in a range float16 wasn’t designed for. A user already reported NaNs after 3 hours of training with this PR, but hasn’t shared a way to reproduce it yet.

@sshleifer suggested here that perhaps finetuning with a penalty for large activations could do the trick. It’s unclear how much of such finetuning it would take, since the weights need to come down by several orders of magnitude so that the activations and accumulating math operations don’t break the 64K barrier.
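A penalty of this kind could look roughly like the sketch below. This is not the specific formulation @sshleifer had in mind; the function name `activation_penalty`, the `threshold` and `weight` values, and the sample numbers are all hypothetical, and a real implementation would operate on tensors collected from the model's layers rather than on Python lists.

```python
def activation_penalty(activations, threshold=1e3, weight=1e-4):
    """Mean squared excess of |activation| over a threshold (hypothetical).

    Values inside [-threshold, threshold] cost nothing; anything larger is
    penalized quadratically, pushing the model toward a float16-safe range.
    """
    excess = [max(abs(a) - threshold, 0.0) for a in activations]
    return weight * sum(e * e for e in excess) / len(excess)

task_loss = 2.5                            # made-up task loss
acts = [12.0, -350.0, 4.0e4, -7.5e4]       # made-up activations
total_loss = task_loss + activation_penalty(acts)

print(total_loss)
```

The threshold would need to sit well below 65504 to leave headroom for sums and products of activations, which can overflow even when each individual value is in range.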

So currently t5/mt5/pegasus models are affected, but I’m sure more will follow as new hardware supporting bfloat16 quickly becomes available, so I believe we will have to deal with this a lot more very soon.

Of course, if we wait long enough, mixed precision will move to fp32/bf16, or it may not be needed at all anymore.

If any of you have experimented with such bf16 to fp16 finetuning and had good results, please do share. It’s possible that if a solid approach is found, we will need to make a second set of these models whose weights are finetuned for fp16.

Thank you.

stas · April 6, 2021, 7:44pm

And it looks like GPT-Neo just added itself to the group of bfloat16-pretrained models.

github.com/huggingface/transformers

FP16 overflow with GPT-Neo when using sequence lengths of 2048.

opened 02:28AM - 06 Apr 21 UTC

closed 03:10PM - 27 May 21 UTC

LouisCastricato

## Environment info

- `transformers` version: 4.5.0.dev0
- Platform: Linux-5….4.0-54-generic-x86_64-with-glibc2.29
- Python version: 3.8.5
- PyTorch version (GPU?): 1.8.0+cu111
- Tensorflow version (GPU?): N/A
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No

### Who can help

@stas00

Models:
- GPT-Neo 1.3b

Library:
- deepspeed: @stas00

## Information

Model I am using (Bert, XLNet ...):

The problem arises when using:
* [ ] the official example scripts: (give details below)
* [x] my own modified scripts: (give details below)

The tasks I am working on is:
* [ ] an official GLUE/SQUaD task: (give the name)
* [x] my own task or dataset: (give details below)

## To reproduce

Steps to reproduce the behavior:

1. Use GPT-Neo 1.3b with The Pile dataset and built in trainer. Artificial data also suffices. It does not matter what the data is, as long as the attention mask spans all 2048 tokens.
2. Enable FP16 and set max_length to 2048
3. Observe that all loses reported are NaN

Also reproducible using AMP or DeepSpeed. It seems like there is code to circumvent this outlined in the GPT-Neo implementation where q,k,v are casted to fp32 in the attention block.

When the max_length is shorter (512) this overflow does not occur.

## Expected behavior

I expected no overflows.

## Aside

I'm reaching out on behalf of EleutherAI, Lysandre told us to create an issue about this.

stas · April 21, 2021, 11:37pm

We started compiling a wiki of how different models were pre-trained, please add your knowledge there - thanks!
