This was an issue a while back but seems to have resurfaced (see the earlier "T5 fp16 issue is fixed" thread).
I have tested the exact code below with `t5-small` and `t5-base`, and they work fine. However, with `t5-large` and/or `flan-t5-xl`, the model produces NaN outputs. This is solely a result of using half precision (ignore the multiple GPUs, strategy, etc.; I have tested every other variation):
```python
import lightning.pytorch as pl

trainer = pl.Trainer(
    precision="16",      # half precision is the culprit
    accelerator="gpu",
    strategy="auto",
    devices=4,
)
```
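To take Lightning out of the picture, here is the kind of minimal forward-pass check I'd use (a sketch, not a verified repro; assumes a single CUDA GPU and that `t5-large` fits in memory in fp16):

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load t5-large directly in fp16, no Lightning involved (sketch).
tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained(
    "t5-large", torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer(
    "translate English to German: Hello there", return_tensors="pt"
).to("cuda")
labels = tokenizer("Hallo", return_tensors="pt").input_ids.to("cuda")

with torch.no_grad():
    out = model(**inputs, labels=labels)

# If the fp16 overflow occurs, the logits (and hence the loss) come back NaN.
print(torch.isnan(out.logits).any().item(), out.loss)
```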
I am using `transformers==4.28.1` and `lightning==2.0.0`.
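For what it's worth, since T5 was pretrained in bfloat16, I'd expect a bf16 run to sidestep the fp16 overflow; a sketch, assuming Ampere-or-newer GPUs:

```python
import lightning.pytorch as pl

# Same Trainer, but bfloat16 mixed precision instead of fp16
# (assumes bf16-capable hardware, e.g. A100 or RTX 30xx+).
trainer = pl.Trainer(
    precision="bf16-mixed",
    accelerator="gpu",
    strategy="auto",
    devices=4,
)
```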
Any ideas/help appreciated
Thanks!