T5 variants return training loss 0 and validation loss nan while fine-tuning

Hi everyone,

I am fine-tuning T5 for a question generation task.
If I use any of the reference models (mt5-small, T5-small, T5-base) with the Trainer API, I get a training loss of zero and a validation loss of nan. If I use any of these models that has already been fine-tuned on a task, I get correct training and validation losses.

Has anyone encountered this problem? Any solution or fix?

Hi there,

Are you using fp16 by any chance? That’s a common source of this issue with T5 models.
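
You can check this before doing any training with a single forward pass under fp16 autocast. This is only a rough sketch (made-up example sentence, assumes a CUDA GPU), but with t5/mt5 checkpoints the logits can overflow to inf in fp16, which turns the loss into nan:

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "google/mt5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).cuda().eval()

# Hypothetical example pair, just to get one batch through the model
batch = tokenizer(
    text=["generate question: The Eiffel Tower is in Paris."],
    text_target=["Where is the Eiffel Tower?"],
    return_tensors="pt",
).to("cuda")

with torch.no_grad(), torch.autocast("cuda", dtype=torch.float16):
    out = model(**batch)

print("loss:", out.loss.item())
print("inf in logits:", torch.isinf(out.logits).any().item())
print("nan in logits:", torch.isnan(out.logits).any().item())

If this already reports inf/nan, the problem is the fp16 numerics rather than your data or labels.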

Yes, I do use fp16:

import torch
from datasets import Dataset
from transformers import (
    DataCollatorForSeq2Seq,
    T5ForConditionalGeneration,
    T5Tokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "google/mt5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
default_model = T5ForConditionalGeneration.from_pretrained(model_name)

# Tokenize inputs and targets; `text_target` produces the `labels` field
training_inputs = tokenizer(
    text=train_questions.input_text.tolist(),
    text_target=train_questions.text.tolist(),
    padding="longest",
    return_tensors="pt",
)
eval_inputs = tokenizer(
    text=eval_questions.input_text.tolist(),
    text_target=eval_questions.text.tolist(),
    padding="longest",
    return_tensors="pt",
)

train_dataset = Dataset.from_dict(training_inputs)
train_dataset.set_format("torch")

eval_dataset = Dataset.from_dict(eval_inputs)
eval_dataset.set_format("torch")


training_args = TrainingArguments(
    output_dir="/home/jovyan/work/data/fine-tuned-T5-small",
    evaluation_strategy="steps",
    gradient_accumulation_steps=1,
    gradient_checkpointing=False,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    fp16=True,
    optim="adamw_torch",
    report_to="wandb",
    log_level="debug",
    label_names=["labels"],
    learning_rate=1e-5,
    do_train=True,
    do_eval=True,
    weight_decay=0.01,
    logging_steps=1,   # expects an int; True is silently treated as 1 (log every step)
    save_strategy="epoch",
    resume_from_checkpoint=True,  # not used by Trainer here; pass it to trainer.train() to resume
    eval_steps=1,      # expects an int; True is silently treated as 1 (evaluate every step)
    num_train_epochs=2,
)

# finetuned_model.gradient_checkpointing_enable()
default_model.config.use_cache = False  # use_cache lives on the config, not the model object
data_collator = DataCollatorForSeq2Seq(tokenizer)

trainer = Trainer(
    model=default_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)
torch.cuda.empty_cache()
# wandb.watch(default_model, log="all")
trainer.train()

Problem solved by setting fp16 to False. Apparently some T5 variants do not train correctly in fp16: the training loss collapses to zero and the validation loss becomes nan. If you see this, it is worth trying both settings.
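
For the record, this was the only change relative to the arguments above (just a sketch of the relevant part; everything else stays the same):

training_args = TrainingArguments(
    output_dir="/home/jovyan/work/data/fine-tuned-T5-small",
    fp16=False,  # fp16 can make T5/mT5 activations overflow to inf, which turns the loss into nan
    # ... all other arguments unchanged ...
)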

For some variants fp16=True works fine during fine-tuning, but apparently not for all T5 variants.

Any idea when they’re going to fix this fp16 issue? It currently works with t5-small/base, but large/xl produces nan outputs…

Yeah, same for me: fine-tuning the large model gives a 0.00000 training loss at every step.

They can’t really fix it; the issue is that the pretrained model itself doesn’t work well with fp16 training. A few years ago there were some heroic efforts to add workarounds, but they were complex and didn’t work very well. You should use bf16 or make do with fp32.
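
Something like this works as a drop-in replacement (a sketch; assumes a CUDA setup and falls back to fp32 where bf16 isn’t available):

import torch
from transformers import TrainingArguments

# Prefer bf16 on hardware that supports it (Ampere or newer GPUs),
# otherwise fall back to plain fp32.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

training_args = TrainingArguments(
    output_dir="/home/jovyan/work/data/fine-tuned-T5-small",
    bf16=use_bf16,  # same memory savings as fp16, but with the fp32 exponent range
    fp16=False,
    # ... other arguments as before ...
)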

Yeah, same for me: fine-tuning the large model gives a 0.00000 training loss at every step.

Is this for fp16? What about bf16 or fp32?