Finetuning for fp16 compatibility

T5 and Pegasus don’t really work in fp16 because they create activations that overflow the fp16 range (they were trained in bfloat16, which has a much larger dynamic range). Has anyone read/seen/heard anything about finetuning/scaling models so that their activations fit in fp16, or more generally about encouraging smaller-magnitude activations?
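For context, the range gap is easy to see in PyTorch; this is just an illustration of the overflow, not part of any fix:

```python
import torch

# fp16 tops out around 65504, while bfloat16 keeps the fp32 exponent range,
# so activations that were fine during bf16 pretraining can become inf in fp16.
print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38

x = torch.tensor([70000.0])
print(x.to(torch.bfloat16))  # stays finite (rounded to a nearby bf16 value)
print(x.to(torch.float16))   # tensor([inf], dtype=torch.float16) -- overflows
```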

I tried one experiment on google/pegasus-xsum where I finetune with the summarization LM loss and add some auxiliary losses based on the magnitude of the hidden states, but I haven’t gotten the weighting right (the model instantly forgets how to summarize), so I’m looking around.
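For concreteness, here is a rough sketch of the kind of auxiliary magnitude loss I mean; `penalty_weight` is a made-up hyperparameter and the exact form of the penalty (mean absolute hidden state) is just one option, not a recipe that is known to work:

```python
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")

penalty_weight = 1e-4  # assumption: needs tuning; too large and the model forgets how to summarize

batch = tokenizer(["some long article ..."], return_tensors="pt", truncation=True)
labels = tokenizer(["a short summary"], return_tensors="pt").input_ids

# Standard summarization LM loss, plus hidden states for the penalty term.
outputs = model(**batch, labels=labels, output_hidden_states=True)
lm_loss = outputs.loss

# Penalize the average magnitude of encoder and decoder hidden states.
hidden_states = outputs.encoder_hidden_states + outputs.decoder_hidden_states
penalty = torch.stack([h.abs().mean() for h in hidden_states]).sum()

loss = lm_loss + penalty_weight * penalty
loss.backward()
```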

It’s been a long time since this post, but maybe you remember whether the fp16 problem also appears when training these models from scratch (pretraining)?

I’ve already seen some NaNs while training with fp16 enabled, but after lowering the learning rate the beginning of training looks reasonable.
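In case it helps anyone debugging something similar, here is a rough sketch (not from the issue below) of how one could use forward hooks to spot which layer first produces non-finite activations; the function name and printout are just illustrative:

```python
import torch

def register_overflow_hooks(model: torch.nn.Module):
    """Attach a forward hook to every submodule that reports inf/NaN outputs."""
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"non-finite activation in {name}")
                    break
        return hook

    handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]
    return handles  # call .remove() on each handle when done
```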

After 3 days of training with fp16 enabled, the loss went to NaN. I created an issue, Pegasus pretraining in fp16 results in NaN loss · Issue #12225 · huggingface/transformers · GitHub, in case someone knows how it can be fixed.