Training Loss = 0.0, Validation Loss = nan

Hello, I am training a model, but the training loss is zero and the validation loss is nan. This only happened when I switched the pretrained model from t5 to mt5.

I don’t know what’s wrong because it was working with t5.

args = Seq2SeqTrainingArguments(
    model_dir,
    evaluation_strategy="steps",
    eval_steps=100,
    logging_strategy="steps", 
    logging_steps=100,
    save_strategy="steps",
    save_steps=200,
    learning_rate=4e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01, 
    save_total_limit=3, 
    num_train_epochs=6, 
    predict_with_generate=True, 
    fp16=True,
    load_best_model_at_end=True,
)

trainer = Seq2SeqTrainer(
    model_init=model_init,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

Hi there,

I’m not very familiar with mT5, but your issue could be the fp16=True part. What happens if you switch that to False or try using bf16 instead?

Thank you for your answer! I totally forgot about this thread. And yes, you are right - this one fixed it for me :slight_smile:

1 Like