T5 variants return training loss 0 and validation loss NaN while fine-tuning

Hi everyone,

I am fine-tuning T5 for a question-generation task.
If I use any of the pretrained checkpoints (mt5-small, T5-small, T5-base) for fine-tuning with the Trainer API, I get a training loss of zero and a validation loss of NaN. If I instead use any of these models already fine-tuned on a task, I get correct training and validation losses.

Has anyone encountered this problem? Any solution or fix?


Hi there,

Are you using fp16 by any chance? That’s a common source of this issue with T5 models.
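If you want to confirm that quickly before changing your training setup, here is a minimal sketch (my own toy example, not your data): it runs one forward pass in fp32 and one under fp16 autocast. If the fp16 loss comes out as NaN/inf while the fp32 loss looks normal, you are hitting the known T5 fp16 overflow.

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "google/mt5-small"  # swap in the variant you are fine-tuning
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).cuda().eval()

# one toy batch with labels so the model returns a loss
batch = tokenizer(
    text=["generate question: The Eiffel Tower is in Paris."],
    text_target=["Where is the Eiffel Tower?"],
    return_tensors="pt",
).to("cuda")

with torch.no_grad():
    loss_fp32 = model(**batch).loss
    with torch.autocast("cuda", dtype=torch.float16):
        loss_fp16 = model(**batch).loss

print(f"fp32 loss: {loss_fp32.item():.4f}")
print(f"fp16 loss: {loss_fp16.item():.4f}")  # NaN/inf here points at fp16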


Yes, I do use fp16. Here is my setup:

import torch
from datasets import Dataset
from transformers import (T5Tokenizer, T5ForConditionalGeneration,
                          TrainingArguments, Trainer, DataCollatorForSeq2Seq)

model_name = "google/mt5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
default_model = T5ForConditionalGeneration.from_pretrained(model_name)

training_inputs = tokenizer(text = train_questions.input_text.tolist(), text_target= train_questions.text.tolist(), padding="longest", return_tensors="pt")
eval_inputs = tokenizer(text = eval_questions.input_text.tolist(), text_target= eval_questions.text.tolist(), padding="longest", return_tensors="pt")

train_dataset = Dataset.from_dict(training_inputs)
train_dataset.set_format("torch")

eval_dataset = Dataset.from_dict(eval_inputs)
eval_dataset.set_format("torch")


training_args = TrainingArguments(output_dir="/home/jovyan/work/data/fine-tuned-T5-small",
                                  evaluation_strategy="steps",
                                  gradient_accumulation_steps=1,
                                  gradient_checkpointing=False,
                                  per_device_train_batch_size=16,
                                  per_device_eval_batch_size=16, fp16=True,
                                  optim="adamw_torch",
                                  report_to = "wandb",
                                  log_level = "debug",
                                  label_names= ["labels"],
                                  learning_rate=1e-5,
                                  do_train = True,
                                  do_eval = True,
                                  weight_decay=0.01, 
                                  logging_steps=100,  # expects an integer step count; True is treated as 1
                                  save_strategy="epoch",
                                  resume_from_checkpoint=True,
                                  eval_steps=100,  # expects an integer step count; True is treated as 1
                                  num_train_epochs=2 )

# finetuned_model.gradient_checkpointing_enable()
default_model.config.use_cache = False  # use_cache lives on the model config
data_collator = DataCollatorForSeq2Seq(tokenizer)

trainer = Trainer(
    model=default_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)
torch.cuda.empty_cache()
# wandb.watch(default_model, log="all")
trainer.train()

Problem solved by changing fp16 to False. Apparently there is a discrepancy between T5 variants regarding fp16 support. If you see a training loss of zero and a validation loss of NaN, it is worth trying both fp16=True and fp16=False.
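For reference, a stripped-down sketch of the precision-related arguments that work for me (bf16 is an untested alternative on my side; it is only useful on GPUs that support bfloat16, e.g. Ampere and newer):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/home/jovyan/work/data/fine-tuned-T5-small",
    evaluation_strategy="steps",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    fp16=False,    # the fix: plain fp32 training removes the 0 / NaN losses
    # bf16=True,   # possible alternative; bfloat16 matches T5's pretraining dtype
    learning_rate=1e-5,
    weight_decay=0.01,
    num_train_epochs=2,
)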


For some variants fp16=True works fine during fine-tuning, but apparently not for all T5 variants.


Any idea when they’re going to fix this fp16 issue? It currently works with t5-small/base, but t5-large/xl produces NaN outputs…
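In the meantime, one possible workaround (just a sketch, not an official fix) is to load the larger checkpoints in bfloat16, or keep them in fp32, instead of casting to fp16, since T5 was pretrained in bfloat16 and its activations overflow the fp16 range:

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# t5-large is just an example checkpoint; the same idea applies to the 3b/xl sizes
model = T5ForConditionalGeneration.from_pretrained(
    "t5-large",
    torch_dtype=torch.bfloat16,  # avoid torch.float16 for these checkpoints
).cuda()
tokenizer = T5Tokenizer.from_pretrained("t5-large")

inputs = tokenizer("translate English to German: Hello, world!", return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))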