Training Loss = 0.0, Validation Loss = nan

Hello, I am training a model, but the training loss is zero and the validation loss is NaN. This only started happening when I switched the pretrained model from T5 to mT5.

I don’t know what’s wrong, because everything was working with T5.

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

args = Seq2SeqTrainingArguments(
    model_dir,
    evaluation_strategy="steps",
    eval_steps=100,
    logging_strategy="steps", 
    logging_steps=100,
    save_strategy="steps",
    save_steps=200,
    learning_rate=4e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01, 
    save_total_limit=3, 
    num_train_epochs=6, 
    predict_with_generate=True, 
    fp16=True,
    load_best_model_at_end=True,
)

trainer = Seq2SeqTrainer(
    model_init=model_init,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

2 Likes

Hi there,

I’m not very familiar with mT5, but your issue could be the fp16=True part. What happens if you switch that to False or try using bf16 instead?
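
For reference, the change would look something like this (just a minimal sketch keeping your other arguments the same; bf16=True needs a reasonably recent transformers version and a GPU that actually supports bf16):

from transformers import Seq2SeqTrainingArguments

# Option 1: drop mixed precision entirely and train in plain fp32
args = Seq2SeqTrainingArguments(model_dir, fp16=False)

# Option 2: use bf16 mixed precision instead of fp16
args = Seq2SeqTrainingArguments(model_dir, bf16=True)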

6 Likes

Thank you for your answer! I totally forgot about this thread. And yes, you are right: switching off fp16 fixed it for me 🙂

1 Like

Can anyone help me understand why switching off fp16 would fix this issue? Thanks

1 Like

The root of the issue is that most T5 and T5-like models were pretrained by Google on TPUs, not GPUs. For TPU training, Google created its own half-precision floating-point format, bfloat16 (bf16).

Like fp16, bf16 uses 16 bits (instead of the 32 bits used in full precision). However, the two formats split those bits differently: fp16 tops out at roughly ±65,504, whereas bf16 keeps the same 8 exponent bits as fp32 and therefore covers roughly the same dynamic range (up to about 3.4e38), trading away mantissa bits so individual values are represented less precisely.
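
You can see the difference directly in PyTorch (a small illustrative check, assuming torch is installed):

import torch

# fp16 tops out around 65,504; bf16 covers roughly the same range as fp32
print(torch.finfo(torch.float16).max)     # 65504.0
print(torch.finfo(torch.bfloat16).max)    # ~3.39e+38

# A value that easily shows up in activations or accumulated sums...
x = torch.tensor(70000.0)
print(x.to(torch.float16))     # tensor(inf, dtype=torch.float16)  -> overflow
print(x.to(torch.bfloat16))    # tensor(70144., dtype=torch.bfloat16) -> finite, just rounded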

Ultimately, when you try to use fp16 to train a model that was pretrained with bf16, you frequently end up with a lot of overflow issues which cause inf/NaN values for the loss.

The best solution is to use bf16 when fine-tuning T5 models. However, not all GPUs support bf16 (NVIDIA Ampere GPUs and newer, such as the A100 and the RTX 30-series, support it, but older GPUs don’t). If you can’t get access to a bf16-capable GPU, your best bet is probably to just train in fp32.
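
If you want to pick automatically, here is a rough sketch (model_dir is the output directory from the original post, and torch.cuda.is_bf16_supported() requires a fairly recent PyTorch):

import torch
from transformers import Seq2SeqTrainingArguments

# Prefer bf16 where the hardware supports it, otherwise fall back to plain fp32
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

args = Seq2SeqTrainingArguments(
    model_dir,
    bf16=use_bf16,   # bf16 mixed precision on Ampere or newer GPUs
    fp16=False,      # avoid fp16 with T5/mT5 checkpoints
)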

16 Likes

Just want to say thanks. I have been searching for this information for a while now, and this is great information!

1 Like

Thank you for sharing the insights! I was facing a similar issue while training my T5 summarization model. Now I have a better understanding of the infrastructure and architecture requirements.

1 Like