FP16 training producing NaNs on t5-large/flan-t5-xl

This was an issue a while back but seems to have resurfaced. I am aware of the earlier thread "T5 fp16 issue is fixed", but the problem is back for me.

I have tested the exact code below on t5-small and t5-base and they work fine. However, with t5-large and/or flan-t5-xl, the model produces NaN outputs. This is solely a result of using half precision (ignore the multiple GPUs, strategy, etc.; I have tested every other variation):

import pytorch_lightning as pl

trainer = pl.Trainer(
    precision="16",
    accelerator="gpu",
    strategy="auto",
    devices=4,
)
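
For context, here is roughly how the trainer is being used. MyT5Module, the checkpoint name, the learning rate, and the dataloader details below are simplified placeholders for my actual script, not the exact training code:

import pytorch_lightning as pl
import torch
from transformers import AutoModelForSeq2SeqLM


class MyT5Module(pl.LightningModule):
    # Placeholder LightningModule wrapping a T5 checkpoint
    def __init__(self, model_name="google/flan-t5-xl"):
        super().__init__()
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    def training_step(self, batch, batch_idx):
        # batch contains input_ids, attention_mask, labels
        outputs = self.model(**batch)
        # With precision="16" this loss comes back as nan on
        # t5-large / flan-t5-xl, but is fine on t5-small / t5-base
        self.log("train_loss", outputs.loss)
        return outputs.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)


# trainer.fit(MyT5Module(), train_dataloaders=train_loader)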

I am using transformers == 4.28.1 and lightning == 2.0.0

Any ideas/help appreciated
Thanks!
