Does it ever make sense to finetune w fp32 if the base model was trained w fp16?

Is it possible that the model can make use of the added precision during finetuning? Or is it the case that if a model was initially trained with mixed precision, then all downstream training should use the same (or lower) precision?

Hi @nadahlberg, transformer models are often sensitive to FP16 training because of the layer norms involved. The model can definitely benefit from the added precision, but that is not because of the precision the base model happened to be trained in; it comes from the transformer architecture itself.
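
If it helps, here is a minimal sketch of what fp32 finetuning of an fp16 checkpoint can look like with the Trainer API. The model name and output directories are placeholders, not anything specific to your setup:

```python
import torch
from transformers import AutoModelForSequenceClassification, TrainingArguments

# Load the fp16 checkpoint and upcast its weights to fp32 for finetuning.
# "your-fp16-base-model" is a placeholder for whatever base model you use.
model = AutoModelForSequenceClassification.from_pretrained(
    "your-fp16-base-model",
    torch_dtype=torch.float32,  # weights stored in fp16 are cast up to fp32
)

# Plain fp32 finetuning: leave mixed precision off.
args_fp32 = TrainingArguments(output_dir="out-fp32", fp16=False)

# For comparison, mixed-precision finetuning keeps fp32 master weights
# but runs most ops in fp16.
args_fp16 = TrainingArguments(output_dir="out-fp16", fp16=True)
```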
