Finetuning for fp16 compatibility

T5 and Pegasus don't really work in fp16 because they produce activations that overflow the fp16 range (they were trained in bfloat16, which has a much larger dynamic range). Has anyone read/seen/heard anything about finetuning/scaling models so that their activations fit in fp16 (or, more generally, about encouraging smaller-magnitude activations)?
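For reference, here's a minimal sketch of how you could confirm the overflow (hypothetical script, model/tokenizer names are just what I'd reach for): run the model in fp32 with `output_hidden_states=True` and compare the largest hidden-state magnitudes against the fp16 max (~65504).

```python
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

name = "google/pegasus-xsum"
tok = PegasusTokenizer.from_pretrained(name)
model = PegasusForConditionalGeneration.from_pretrained(name).eval()

batch = tok(["Some long article text ..."], return_tensors="pt", truncation=True)
# Run a single decoder step just to get hidden states out of both stacks.
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

with torch.no_grad():
    out = model(**batch, decoder_input_ids=decoder_input_ids,
                output_hidden_states=True)

fp16_max = torch.finfo(torch.float16).max  # 65504
for i, h in enumerate(out.encoder_hidden_states):
    m = h.abs().max().item()
    print(f"encoder layer {i}: max |activation| = {m:.1f} "
          f"(overflows fp16: {m > fp16_max})")
```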

I tried one experiment on google/pegasus-xsum where I finetune with the summarization LM loss plus some additional losses based on the magnitude of the hidden states, but I haven't gotten the weighting right (the model instantly forgets how to summarize), so I'm looking around.
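Roughly the kind of auxiliary term I mean (a sketch, not my actual training script; the threshold and weight are placeholders I haven't tuned): keep the usual seq2seq LM loss and add a penalty only on the part of each activation above a threshold well inside fp16 range, so the largest activations get pushed down rather than everything.

```python
import torch

def loss_with_activation_penalty(model, batch, clamp_value=1e4, penalty_weight=1e-4):
    # Standard summarization LM loss, with hidden states returned for the penalty.
    out = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["labels"],
        output_hidden_states=True,
    )
    lm_loss = out.loss

    # Penalize only the magnitude above clamp_value, averaged over all
    # encoder and decoder hidden states.
    hidden = out.encoder_hidden_states + out.decoder_hidden_states
    overflow_penalty = sum(
        torch.relu(h.abs() - clamp_value).mean() for h in hidden
    )
    return lm_loss + penalty_weight * overflow_penalty
```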