Training Loss = 0.0, Validation Loss = nan

Hello, I am training a model, but the training loss is zero and the validation loss is NaN. This only started happening when I switched the pretrained model from T5 to mT5.

I don’t know what’s wrong, because everything was working with T5.

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

args = Seq2SeqTrainingArguments(
    model_dir,
    evaluation_strategy="steps",
    eval_steps=100,
    logging_strategy="steps", 
    logging_steps=100,
    save_strategy="steps",
    save_steps=200,
    learning_rate=4e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01, 
    save_total_limit=3, 
    num_train_epochs=6, 
    predict_with_generate=True, 
    fp16=True,
    load_best_model_at_end=True,
)

trainer = Seq2SeqTrainer(
    model_init=model_init,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

2 Likes

Hi there,

I’m not very familiar with mT5, but your issue could be the fp16=True part. What happens if you switch that to False or try using bf16 instead?
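
For reference, the change would look something like this (just a minimal sketch keeping your other arguments the same; bf16=True needs a reasonably recent transformers version and a GPU that actually supports bf16):

from transformers import Seq2SeqTrainingArguments

# Option 1: drop mixed precision entirely and train in plain fp32
args = Seq2SeqTrainingArguments(model_dir, fp16=False)

# Option 2: use bf16 mixed precision instead of fp16
args = Seq2SeqTrainingArguments(model_dir, bf16=True)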

6 Likes

Thank you for your answer! I totally forgot about this thread. And yes, you are right: switching off fp16 fixed it for me 🙂

1 Like

Can anyone help me understand why switching off fp16 would fix this issue? Thanks

1 Like

The root of the issue is that most T5 and T5-like models were pretrained by Google on TPUs, not GPUs. For TPU training, Google created its own half-precision floating-point format, bfloat16 (bf16).

Like fp16, bf16 uses 16 bits (instead of the 32 bits used in full precision). However, the two formats split those bits differently: fp16 tops out at roughly ±65,504, whereas bf16 keeps the same 8 exponent bits as fp32 and therefore covers roughly the same dynamic range (up to about 3.4e38), trading away mantissa bits so individual values are represented less precisely.
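
You can see the difference directly in PyTorch (a small illustrative check, assuming torch is installed):

import torch

# fp16 tops out around 65,504; bf16 covers roughly the same range as fp32
print(torch.finfo(torch.float16).max)     # 65504.0
print(torch.finfo(torch.bfloat16).max)    # ~3.39e+38

# A value that easily shows up in activations or accumulated sums...
x = torch.tensor(70000.0)
print(x.to(torch.float16))     # tensor(inf, dtype=torch.float16)  -> overflow
print(x.to(torch.bfloat16))    # tensor(70144., dtype=torch.bfloat16) -> finite, just rounded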

Ultimately, when you try to use fp16 to train a model that was pretrained with bf16, you frequently end up with a lot of overflow issues which cause inf/NaN values for the loss.

The best solution is to use bf16 when fine-tuning T5 models. However, not all GPUs support bf16 (NVIDIA Ampere GPUs and newer, such as the A100 and the RTX 30-series, support it, but older GPUs don’t). If you can’t get access to a bf16-capable GPU, your best bet is probably to just train in fp32.
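
If you want to pick automatically, here is a rough sketch (model_dir is the output directory from the original post, and torch.cuda.is_bf16_supported() requires a fairly recent PyTorch):

import torch
from transformers import Seq2SeqTrainingArguments

# Prefer bf16 where the hardware supports it, otherwise fall back to plain fp32
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

args = Seq2SeqTrainingArguments(
    model_dir,
    bf16=use_bf16,   # bf16 mixed precision on Ampere or newer GPUs
    fp16=False,      # avoid fp16 with T5/mT5 checkpoints
)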

16 Likes

Just want to say thanks. I have been searching for this information for a while now, and this is great information!

1 Like

Thank you for sharing the insights! I was facing a similar issue while training my T5 summarization model. Now I have a better understanding of the infrastructure and architecture requirements.

1 Like