I am using `transformers==4.29.2`. I may have found a logical flaw in `T5Block`, as shown below:
```python
# Apply Feed Forward layer
hidden_states = self.layer[-1](hidden_states)

# clamp inf values to enable fp16 training
if hidden_states.dtype == torch.float16:
    clamp_value = torch.where(
        torch.isinf(hidden_states).any(),
        torch.finfo(hidden_states.dtype).max - 1000,
        torch.finfo(hidden_states.dtype).max,
    )
    hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
```
Apparently this fix, which clamps `hidden_states` to the max value, only works when `hidden_states` is fp16. It may work fine when the whole model is fp16, but I am trying to do mixed precision training with `torch.amp`, so the `hidden_states` returned by `T5LayerFF` is fp32: the last operation in `T5LayerFF` is the addition op, a.k.a. the residual connection, which can't autocast to `float16` according to https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float16.
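For what it's worth, here is a minimal sketch (assuming a CUDA device) that reproduces the dtype behaviour I mean. `ToyFF` is a made-up stand-in for `T5LayerFF`, not the real implementation: under `torch.autocast` the linear layer runs in fp16, but since the residual addition is not on the list of ops that autocast to `float16`, ordinary type promotion applies and the fp32 input wins, so the `dtype == torch.float16` check in `T5Block` is never true.

```python
# Minimal sketch, not from transformers: ToyFF is a hypothetical stand-in for
# T5LayerFF (layer norm -> linear -> residual add), used only to inspect dtypes.
import torch
import torch.nn as nn

class ToyFF(nn.Module):
    def __init__(self, d_model=8):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)
        self.wi = nn.Linear(d_model, d_model)

    def forward(self, hidden_states):
        # The linear layer runs in fp16 under autocast ...
        forwarded = self.wi(self.layer_norm(hidden_states))
        # ... but adding the fp32 residual promotes the result back to fp32.
        return hidden_states + forwarded

ff = ToyFF().cuda()
x = torch.randn(2, 4, 8, device="cuda")  # fp32 activations, as in AMP training

with torch.autocast(device_type="cuda", dtype=torch.float16):
    out = ff(x)
    print(out.dtype)  # torch.float32 -> the fp16 clamp in T5Block never triggers
```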
The screenshot below also supports this theory: