T5 fp16 issue is fixed

I am using transformers==4.29.2.
I may have found a logical flaw in T5Block, as shown below:

```python
# Apply Feed Forward layer
hidden_states = self.layer[-1](hidden_states)

# clamp inf values to enable fp16 training
if hidden_states.dtype == torch.float16:
    clamp_value = torch.where(
        torch.isinf(hidden_states).any(),
        torch.finfo(hidden_states.dtype).max - 1000,
        torch.finfo(hidden_states.dtype).max,
    )
    hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
```
Apparently this fix, which clamps hidden_states to just below the fp16 max value, only runs when hidden_states is already fp16. That may work fine when the whole model is cast to fp16. But I am doing mixed-precision training with torch.amp, and there the hidden_states returned by T5LayerFF is fp32: the last operation in T5LayerFF is the addition for the residual connection, which is not among the CUDA ops that autocast to float16 (https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float16), so its output stays in fp32 and the dtype check above never triggers.
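A minimal sketch of the dtype promotion (the standalone nn.Linear and the tensor shapes are placeholders for the actual T5LayerFF, and it assumes a CUDA device is available):

```python
import torch

linear = torch.nn.Linear(8, 8).cuda()
hidden_states = torch.randn(2, 8, device="cuda")  # fp32, like the residual input

with torch.autocast(device_type="cuda", dtype=torch.float16):
    forwarded_states = linear(hidden_states)           # Linear autocasts to float16
    print(forwarded_states.dtype)                      # torch.float16
    hidden_states = hidden_states + forwarded_states   # residual add promotes to the wider type
    print(hidden_states.dtype)                         # torch.float32
```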
The screenshot below also supports this theory:
