I am training gemma3-12b-it on a standard preference dataset with DPO. When I `accelerate launch train.py` in full precision, the training curve looks reasonable. However, as soon as I switch from full precision to fp16, the logging shows `loss=0`, `grad_norm=0`, `reward=nan`, and so on. Are multimodal models restricted to full-precision training?
```
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer
from peft import LoraConfig, TaskType
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gemma-3-12b-it"
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
tokenizer = AutoTokenizer.from_pretrained(model_name)
train_dataset = load_dataset("json", data_files="training_data.json", split="train")
tokenizer.pad_token = tokenizer.eos_token

def process_training_data(example):
    # Rename "input" to "prompt" and keep only the first rejected completion,
    # matching the (prompt, chosen, rejected) format DPOTrainer expects.
    example["prompt"] = example.pop("input")
    example["rejected"] = example["rejected"][0]
    return example

train_dataset = train_dataset.map(process_training_data)

training_args = DPOConfig(
    dataloader_pin_memory=False,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    logging_steps=10,
    # fp16=True  # enabling this yields loss=0, grad_norm=0, reward=nan
)
training_args.optimize_cuda_cache = True

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
trainer.train()
```
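For context, the precision toggle in the config above is just the fp16 flag; `DPOConfig` (via `TrainingArguments`) also exposes a `bf16` flag, and bfloat16 keeps float32's exponent range, so it behaves very differently from fp16 numerically. A minimal sketch of the same config with the alternative flag, assuming bf16-capable hardware (the other values are unchanged from the script above):

```
# Same DPOConfig as above, but with bf16 mixed precision instead of fp16
# (assumes the GPU supports bfloat16, e.g. Ampere or newer).
training_args = DPOConfig(
    dataloader_pin_memory=False,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    logging_steps=10,
    bf16=True,  # instead of fp16=True
)
```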
Perhaps a mixed-precision training issue? This older GitHub issue looks related:
(GitHub issue: opened 23 Jul 2023, 07:58 UTC; closed 31 Aug 2023, 08:03 UTC)
### System Info
pytorch 1.13.1
transformers==4.31.0
### Who can help?
… Hi @sgugger,
I used transformers 4.31.0 to train a Llama model with LoRA. I observe some problems with --fp16 training and I'm not sure whether it is a bug in Trainer.py.
My model is like:
```
class MyModel(nn.Module):
    def __init__(self, model_name):
        super().__init__()
        self.model_name = model_name
        self.base_model = LlamaForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
        self.base_model = get_peft_model(self.base_model, lora_config)
        self.other_modules = nn.Linear(4096, 4096)
```
I used the Trainer to train the model with the following command line:
`torchrun --nproc_per_node=4 main.py --max_steps 100000 --fp16`
I find that the model's gradients (in self.optimizer in the Trainer) are not fp16 but fp32. Is that correct?
Also, I find that no gradient scaling is performed during training, because self.do_grad_scaling is always False (self.sharded_ddp is None and args.half_precision_backend stays "auto"). The current trainer.py does not set up args.half_precision_backend and the scaler correctly when self.sharded_ddp is None. Are these observations expected? I'm confused about why setting up args.half_precision_backend and the scaler should require sharded_ddp. As a result, I often see the loss become NaN during training, and I'm not sure whether that is because no gradient scaling is performed and half_precision_backend is not set up correctly.
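For reference, this is roughly what fp16 gradient scaling looks like in plain PyTorch AMP; a generic sketch with placeholder `model`, `optimizer`, and `dataloader`, not the Trainer's internal code:

```
import torch

# Generic fp16 training loop with gradient scaling
# (model, optimizer, dataloader assumed to be defined elsewhere).
scaler = torch.cuda.amp.GradScaler()  # scales the loss so fp16 gradients don't underflow

for batch in dataloader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(**batch).loss
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.unscale_(optimizer)     # unscale before clipping or inspecting grad norms
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)         # skips the update if inf/nan gradients are detected
    scaler.update()                # adjusts the scale factor for the next step
```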
Below are my grad_norm values (before gradient clipping) with and without --fp16 (the base model here is "JackFram/llama-160m" for debugging). **The results are significantly different.**

| Step | grad_norm without --fp16 | grad_norm with --fp16 |
|------|--------------------------|-----------------------|
| 1    | 0.059                    | nan                   |
| 5    | 0.054                    | 129.88                |
| 10   | 0.048                    | 126.98                |
| 15   | 0.050                    | 149.58                |
| 20   | 0.050                    | 80.7                  |
```
def compute_grad_norm(optimizer):  # the function used to compute grad_norm
    total_norm = 0.0
    for group in optimizer.param_groups:
        for param in group['params']:
            if param.grad is not None:
                param_norm = param.grad.data.norm(2)
                total_norm += param_norm.item() ** 2
    total_norm = torch.sqrt(torch.tensor(total_norm))
    return total_norm
```
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
### Expected behavior
do_grad_scaling should be True when --fp16 is enabled, and the loss should only rarely become NaN.
Could you check the dtype of the LoRA parameters after model initialization? Specifically, are they float16 or float32?
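For example, something along these lines would show it; a minimal sketch assuming it runs right after `DPOTrainer` has applied `peft_config`, so the LoRA-wrapped model is reachable as `trainer.model`:

```
# Print the dtype of each LoRA parameter on the wrapped model;
# `trainer` is the DPOTrainer instance from the script above.
for name, param in trainer.model.named_parameters():
    if "lora" in name.lower():
        print(name, param.dtype, "trainable" if param.requires_grad else "frozen")
```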