Is it possible that the model can make use of the added precision during finetuning? Or is it the case that if a model was initially trained with mixed precision, then all downstream training should use the same (or less) precision?
Hi @nadahlberg, transformer models are often sensitive to FP16 training because of the Layer Norms involved. The model can definitely benefit from the added precision, but that is not because it was originally trained in FP32; it is because of the transformer architecture itself.
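If you do want to fine-tune in mixed precision despite that sensitivity, one common workaround is to keep the LayerNorm modules in FP32 while the rest of the model runs in FP16. A minimal sketch (the checkpoint name below is just a placeholder, not something from this thread):

```python
import torch
from transformers import AutoModel

# Placeholder checkpoint used only for illustration.
model = AutoModel.from_pretrained("bert-base-uncased")

# Cast the whole model to FP16, then restore LayerNorm modules to FP32:
# the mean/variance reductions inside LayerNorm are where half precision
# most often loses accuracy or overflows.
model.half()
for module in model.modules():
    if isinstance(module, torch.nn.LayerNorm):
        module.float()
```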