T5 fp16 issue is fixed

We have just fixed the T5 fp16 issue for some of the T5 models!

(Announcing it here, since lots of users were facing this issue and T5 is one of the most widely used models in the library.)

TL;DR:

Previously, using T5 models in fp16 produced nan loss and logits. This issue is now fixed on master for the following T5 models and versions. You should now be able to train and run inference with these models in fp16 and see a decent speed-up!

  • T5v1: t5-small, t5-base, t5-large
  • T5v1_1: google/t5-v1_1-small, google/t5-v1_1-base
  • MT5: google/mt5-small, google/mt5-base

For those of you who are interested, here’s a description of what was causing nan loss and how it is fixed.

t5-small was the only T5 model that worked in fp16; the rest of the models produced nan loss/logits.

For all the models and versions (v1, v1.1, mT5), at some point we get inf values in hidden_states after applying the final linear layer (wo) in T5DenseReluDense and T5DenseGatedGeluDense, which then results in nan values in T5LayerNorm.
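To see why an inf activation turns into nan here: T5LayerNorm is an RMS-style norm that divides by the root mean square of the hidden states, so once any element is inf the variance is inf, its reciprocal square root is 0, and inf * 0 gives nan. A minimal sketch (simplified, without the learned weight of the real T5LayerNorm):

import torch

def t5_style_layer_norm(hidden_states, eps=1e-6):
    # RMS norm as in T5: no mean subtraction, scale by 1/sqrt(mean of squares)
    variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
    return hidden_states * torch.rsqrt(variance + eps)

x = torch.tensor([1.0, 2.0, float("inf")], dtype=torch.float16)
print(t5_style_layer_norm(x))  # tensor([0., 0., nan]) -- the inf element becomes nan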

Also, for t5-large, t5-v1_1-base, and t5-v1_1-large, there are inf values in the output of T5LayerSelfAttention and T5LayerCrossAttention, specifically where we add the attention output to the hidden_states.

This happens during both training and inference.
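To reproduce it, something along these lines works (a minimal sketch rather than the original reproduction script; it assumes a recent transformers version and a CUDA GPU):

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").half().cuda().eval()

inputs = tokenizer(
    "summarize: studies have shown that owning a dog is good for you",
    return_tensors="pt",
).to("cuda")
labels = tokenizer("owning a dog is good for you", return_tensors="pt").input_ids.to("cuda")

with torch.no_grad():
    out = model(**inputs, labels=labels)

# Without the clamping fix, the affected models produce nan here in fp16
print(torch.isnan(out.loss).item(), torch.isnan(out.logits).any().item())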

Fix

To avoid the inf values, we can clamp the hidden_states to the maximum value of the current data type whenever inf values appear, i.e.

if torch.isinf(hidden_states).any():
    clamp_value = torch.finfo(hidden_states.dtype).max - 1000
    hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)

We need to add this after the self-attention, cross-attention, and feed-forward layers, which is where the inf values occur. This works for both apex and amp.
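To make the placement concrete, here is a simplified sketch of a feed-forward sub-layer with the clamp applied after the residual add; this is only an illustration, not the actual code in modeling_t5.py:

import torch
from torch import nn

class FeedForwardWithClamp(nn.Module):
    # Simplified stand-in for T5's feed-forward sub-layer (illustration only)
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)  # the real model uses T5LayerNorm
        self.wi = nn.Linear(d_model, d_ff, bias=False)
        self.wo = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, hidden_states):
        forwarded = self.wo(torch.relu(self.wi(self.layer_norm(hidden_states))))
        hidden_states = hidden_states + forwarded
        # Clamp fp16 overflows so the next layer norm doesn't turn them into nan;
        # the same clamp goes after the self-attention and cross-attention sub-layers
        if torch.isinf(hidden_states).any():
            clamp_value = torch.finfo(hidden_states.dtype).max - 1000
            hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
        return hidden_states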

To verify this fix, I trained t5-base, t5-v1_1-base, and t5-v1_1-small on CNN/DM for 10k steps (1.11 epochs).
Here's the training command. To run it, navigate to the examples/seq2seq directory, follow the instructions in the README to download the cnn_dm dataset, and then run:

export M=google/t5-v1_1-base
export OUT_DIR=t5-v1_1-base-cnn-fp16
export DATA_DIR=cnn_dm

python finetune_trainer.py \
    --model_name_or_path $M \
    --data_dir $DATA_DIR \
    --output_dir $OUT_DIR --overwrite_output_dir \
    --max_steps=10000 \
    --gradient_accumulation_steps=8 \
    --learning_rate=1e-4 \
    --per_device_train_batch_size=4 \
    --n_val 500 \
    --max_target_length=56 --val_max_target_length=128 \
    --fp16 --fp16_backend apex \
    --do_train --do_eval --evaluation_strategy steps \
    --logging_steps=100 --logging_first_step --eval_steps=2500 --save_steps=2500 --save_total_limit=2 \
    --sortish_sampler

For evaluation:

python run_eval.py \
    t5-v1_1-base-cnn-fp16  cnn_dm/test.source hypothesis.txt \
    --reference_path cnn_dm/test.target \
    --score_path metrics.json \
    --device cuda:0 \
    --prefix summarize: \
    --bs 16 \
    --fp16

This gave the following metrics (ROUGE-2):

  1. for t5-base: 19.2804
  2. for t5-v1_1-base: 18.4316
    (note that the score for t5-base is higher because it was already pre-trained on CNN/DM)

For comparison, I evaluated the pre-trained t5-base in both fp32 and fp16, which gave the following results:

  1. fp16: 18.3681
  2. fp32: 18.394

So the results are close enough.

To verify the fix for t5-large, I evaluated the pre-trained t5-large in fp32 and fp16 (using the same evaluation command as above) and got the following results:

  1. fp16: 19.2734
  2. fp32: 19.2342

Surprisingly, ROUGE-2 is slightly better in fp16.

So with the above fix, the following model types now work in fp16 (opt level O1) and give a decent speed-up :slight_smile:

  • T5v1: t5-small, t5-base, t5-large
  • T5v1_1: google/t5-v1_1-small, google/t5-v1_1-base
  • MT5: google/mt5-small, google/mt5-base

One interesting observation: for inference, the t5-base fine-tuned with fp16 and evaluated in fp32 is faster (~1.31x) than the pre-trained t5-base evaluated in fp16. See this colab.
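If you want to reproduce the timing comparison yourself, a rough sketch like the following can be used (the model names, inputs, and max_length are placeholders; the measured speed also depends on how long each model's generations are):

import time
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

def time_generate(model_name, fp16, texts, device="cuda"):
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name).to(device).eval()
    if fp16:
        model = model.half()
    batch = tokenizer(
        ["summarize: " + t for t in texts],
        return_tensors="pt", padding=True, truncation=True,
    ).to(device)
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        generated = model.generate(**batch, max_length=128)
    torch.cuda.synchronize()
    # Also return the generated sequence length, since longer outputs take longer
    return time.time() - start, generated.shape[1]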

Nice fix!
The speed discrepancy might be due to different generation lengths.