I am using `transformers==4.29.2`. I may have found a logical flaw in `T5Block`, as shown below:
```python
# Apply Feed Forward layer
hidden_states = self.layer[-1](hidden_states)

# clamp inf values to enable fp16 training
if hidden_states.dtype == torch.float16:
    clamp_value = torch.where(
        torch.isinf(hidden_states).any(),
        torch.finfo(hidden_states.dtype).max - 1000,
        torch.finfo(hidden_states.dtype).max,
    )
    hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
```
Apparently this fix, which clamps `hidden_states` to the max value, only works when `hidden_states` is fp16. It may work fine when the whole model is fp16, but I am trying to do mixed precision training with `torch.amp`, so the `hidden_states` returned by `T5LayerFF` is fp32: the last operation in `T5LayerFF` is the addition op, a.k.a. the residual connection, which can't autocast to `float16` according to https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float16.
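For what it's worth, here is a minimal sketch (assuming a CUDA device) that reproduces the dtype behaviour I mean. `ToyFF` is a made-up stand-in for `T5LayerFF`, not the real implementation: under `torch.autocast` the linear layer runs in fp16, but since the residual addition is not on the list of ops that autocast to `float16`, ordinary type promotion applies and the fp32 input wins, so the `dtype == torch.float16` check in `T5Block` is never true.

```python
# Minimal sketch, not from transformers: ToyFF is a hypothetical stand-in for
# T5LayerFF (layer norm -> linear -> residual add), used only to inspect dtypes.
import torch
import torch.nn as nn

class ToyFF(nn.Module):
    def __init__(self, d_model=8):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)
        self.wi = nn.Linear(d_model, d_model)

    def forward(self, hidden_states):
        # The linear layer runs in fp16 under autocast ...
        forwarded = self.wi(self.layer_norm(hidden_states))
        # ... but adding the fp32 residual promotes the result back to fp32.
        return hidden_states + forwarded

ff = ToyFF().cuda()
x = torch.randn(2, 4, 8, device="cuda")  # fp32 activations, as in AMP training

with torch.autocast(device_type="cuda", dtype=torch.float16):
    out = ff(x)
    print(out.dtype)  # torch.float32 -> the fp16 clamp in T5Block never triggers
```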
The screenshot below also supports this theory: