T5 and Pegasus don’t really work in fp16 because they produce activations that overflow fp16’s range (they were trained in bfloat16, which has a much larger range). Has anyone read/seen/heard anything about finetuning/scaling models so that their activations fit in fp16 (or, more generally, to encourage smaller-magnitude activations)?
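For reference, the dynamic range gap is big:

```python
import torch

# fp16 tops out around 65504, while bfloat16 covers roughly the same range as
# fp32 (~3.4e38), so activations that were fine during bfloat16 pretraining can
# overflow straight to inf once you run the model in fp16.
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38
```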
I tried one experiment on google/pegasus-xsum where I finetune with the summarization LM loss and add some additional losses based on the magnitude of the hidden states, but I haven’t weighted them well (the model instantly forgets how to summarize), so I’m looking around.
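Roughly the kind of thing I mean (just a sketch, not my actual training loop — the exact penalty form and `penalty_weight` are placeholders I haven’t tuned):

```python
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

penalty_weight = 1e-4  # untuned; too large and the model forgets how to summarize

def loss_with_activation_penalty(src_texts, tgt_texts):
    inputs = tokenizer(src_texts, return_tensors="pt", truncation=True, padding=True)
    labels = tokenizer(tgt_texts, return_tensors="pt", truncation=True, padding=True).input_ids
    outputs = model(**inputs, labels=labels, output_hidden_states=True)

    # Auxiliary loss: mean squared magnitude of every encoder/decoder hidden state,
    # to push activations toward smaller values that fit comfortably in fp16.
    magnitude = torch.zeros((), dtype=outputs.loss.dtype)
    for h in outputs.encoder_hidden_states + outputs.decoder_hidden_states:
        magnitude = magnitude + h.pow(2).mean()

    return outputs.loss + penalty_weight * magnitude
```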