We have just fixed the T5 fp16 issue for some of the T5 models!
(Announcing it here, since lots of users were facing this issue and T5 is one of the most widely used models in the library.)
TL;DR:
Previously, there was an issue when using T5 models in fp16: it was producing `nan` loss and logits. This is now fixed on master for the following T5 models and versions, so you should be able to train and run inference with these models in fp16 and see a decent speed-up!
- T5v1: `t5-small`, `t5-base`, `t5-large`
- T5v1_1: `google/t5-v1_1-small`, `google/t5-v1_1-base`
- MT5: `google/mt5-small`, `google/mt5-base`
For those of you who are interested, here's a description of what was causing the `nan` loss and how it was fixed.

`t5-small` was the only T5 model that was working in fp16; the rest of the models produced `nan` loss/logits.
For all the models and versions (v1, v1.1, mT5), at some point we get `inf` values in `hidden_states` after applying the final linear layer (`wo`) in `T5DenseReluDense` and `T5DenseGatedGeluDense`, which results in `nan` values in `T5LayerNorm`.
Also, for `t5-large`, `t5-v1_1-base`, and `t5-v1_1-large`, there are `inf` values in the output of `T5LayerSelfAttention` and `T5LayerCrossAttention`, specifically where we add the attention output to the `hidden_states`.

This happens during both training and inference.
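To see concretely why an overflow turns into `nan`: fp16 can only represent values up to about 65504, so larger activations become `inf`, and once an `inf` goes through the RMS normalization the result is `nan`. Here is a tiny illustration (not the actual T5 code, just a simplified version of the layer-norm math with made-up values):

```python
import torch

print(torch.finfo(torch.float16).max)  # 65504.0; anything bigger overflows to inf

# A made-up hidden state where one activation exceeds the fp16 range.
hidden_states = torch.tensor([1.0, 70000.0, -3.0]).half()
print(hidden_states)  # the 70000.0 entry is now inf

# Simplified T5-style RMS norm: the variance becomes inf, rsqrt(inf) is 0,
# and inf * 0 produces nan in the normalized output.
variance = hidden_states.pow(2).mean(-1, keepdim=True)
normed = hidden_states * torch.rsqrt(variance + 1e-6)
print(normed)  # the normalized output contains nan
```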
Fix:
To avoid `inf` values, we can clamp the `hidden_states` to the maximum value of the current data type whenever there are `inf` values in it, i.e.

```python
if torch.isinf(hidden_states).any():
    clamp_value = torch.finfo(hidden_states.dtype).max - 1000
    hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
```
We need to add this after the self-attention, cross-attention, and feed-forward layers, which is where the `inf` values occur. This works for both `apex` and `amp`.
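For illustration, here is roughly where the clamp sits in the feed-forward path. This is a simplified sketch of a T5-style feed-forward block, not the exact code that was merged upstream:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedT5FeedForward(nn.Module):
    """Simplified T5-style feed-forward block (wi -> relu -> wo) with the fp16 clamp."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.wi = nn.Linear(d_model, d_ff, bias=False)
        self.wo = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, hidden_states):
        hidden_states = F.relu(self.wi(hidden_states))
        hidden_states = self.wo(hidden_states)  # the final projection where inf shows up in fp16
        # Clamp only when an overflow actually happened, to just under the dtype's max.
        if torch.isinf(hidden_states).any():
            clamp_value = torch.finfo(hidden_states.dtype).max - 1000
            hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
        return hidden_states
```

The same clamping pattern goes after the self-attention and cross-attention outputs, as described above.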
To verify this fix, I trained `t5-base`, `t5-v1_1-base`, and `t5-v1_1-small` on CNN/DM for 10k steps (1.11 epochs).

Here's the training command. To run it, navigate to the `examples/seq2seq` dir, follow the instructions in the README to download the `cnn_dm` dataset, and then run the following:
```bash
export M=google/t5-v1_1-base
export OUT_DIR=t5-v1_1-base-cnn-fp16
export DATA_DIR=cnn_dm

python finetune_trainer.py \
    --model_name_or_path $M \
    --data_dir $DATA_DIR \
    --output_dir $OUT_DIR --overwrite_output_dir \
    --max_steps=10000 \
    --gradient_accumulation_steps=8 \
    --learning_rate=1e-4 \
    --per_device_train_batch_size=4 \
    --n_val 500 \
    --max_target_length=56 --val_max_target_length=128 \
    --fp16 --fp16_backend apex \
    --do_train --do_eval --evaluation_strategy steps \
    --logging_steps=100 --logging_first_step --eval_steps=2500 --save_steps=2500 --save_total_limit=2 \
    --sortish_sampler
```
For evaluation:
```bash
python run_eval.py \
    t5-v1_1-base-cnn-fp16 cnn_dm/test.source hypothesis.txt \
    --reference_path cnn_dm/test.target \
    --score_path metrics.json \
    --device cuda:0 \
    --prefix summarize: \
    --bs 16 \
    --fp16
```
and got the following metrics (ROUGE-2):

- `t5-base`: 19.2804
- `t5-v1_1-base`: 18.4316

(Note that the score for `t5-base` is higher because it's already pre-trained on CNN/DM.)
For comparison, I evaluated the pre-trained `t5-base` in both fp32 and fp16, which gave the following results:

- fp16: 18.3681
- fp32: 18.394

So the results are close enough.
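As a side note, if you want to spot-check a fine-tuned checkpoint interactively rather than going through `run_eval.py`, something along these lines should work (the checkpoint path and the article text below are placeholders):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the fine-tuned checkpoint in fp16 on GPU (the path is the OUT_DIR from the training command).
model = T5ForConditionalGeneration.from_pretrained("t5-v1_1-base-cnn-fp16").half().cuda().eval()
tokenizer = T5Tokenizer.from_pretrained("t5-v1_1-base-cnn-fp16")

article = "..."  # any CNN/DM article text
inputs = tokenizer("summarize: " + article, return_tensors="pt", truncation=True).to("cuda")

with torch.no_grad():
    summary_ids = model.generate(inputs["input_ids"], max_length=128, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```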
To verify the fix for `t5-large`, I evaluated the pre-trained `t5-large` in fp32 and fp16 (using the same command as above) and got the following results:

- fp16: 19.2734
- fp32: 19.2342

Surprisingly, ROUGE-2 is slightly better in fp16.
So with the above fix, the following model types now work in fp16 (opt level `O1`) and give a decent speed-up:
- T5v1: `t5-small`, `t5-base`, `t5-large`
- T5v1_1: `google/t5-v1_1-small`, `google/t5-v1_1-base`
- MT5: `google/mt5-small`, `google/mt5-base`
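If you want to quickly verify one of these models on your own setup, a minimal fp16 sanity check could look like the sketch below (the model name and the example texts are arbitrary):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-base"  # try any of the models listed above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).half().cuda().eval()

enc = tokenizer("summarize: The quick brown fox jumps over the lazy dog.", return_tensors="pt").to("cuda")
labels = tokenizer("A fox jumps over a dog.", return_tensors="pt").input_ids.to("cuda")

with torch.no_grad():
    out = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"], labels=labels)

# With the fix, neither the loss nor the logits should contain nan/inf in fp16.
print("loss finite:", torch.isfinite(out.loss).item())
print("logits finite:", torch.isfinite(out.logits).all().item())
```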
One interesting observation: for inference, the `t5-base` fine-tuned with fp16 and evaluated in fp32 is faster (~1.31x) than the pre-trained `t5-base` evaluated in fp16. See this colab.
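If you'd like to reproduce that kind of timing comparison yourself, a rough sketch along these lines (not the exact benchmark notebook) could work:

```python
import time
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def avg_generate_time(model, inputs, n_runs=10):
    # Warm up once, then average the latency of n_runs generate() calls.
    model.generate(**inputs, max_length=56)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(**inputs, max_length=56)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

tokenizer = AutoTokenizer.from_pretrained("t5-base")
inputs = tokenizer("summarize: " + "some long article text " * 40, return_tensors="pt").to("cuda")

fp32_model = AutoModelForSeq2SeqLM.from_pretrained("t5-base").cuda().eval()
print("fp32:", avg_generate_time(fp32_model, inputs))

fp16_model = AutoModelForSeq2SeqLM.from_pretrained("t5-base").half().cuda().eval()
print("fp16:", avg_generate_time(fp16_model, inputs))
```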