I am training gemma3-12b-it on a standard preference dataset with DPO. When I `accelerate launch train.py` in full precision, the training curve looks reasonable. However, as soon as I switch from full precision to fp16, the logging shows `loss=0`, `grad_norm=0`, `reward=nan`, and so on. Are multimodal models restricted to full-precision training?
```
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer
from peft import LoraConfig, TaskType
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gemma-3-12b-it"
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
tokenizer = AutoTokenizer.from_pretrained(model_name)
train_dataset = load_dataset("json", data_files="training_data.json", split="train")
tokenizer.pad_token = tokenizer.eos_token

def process_training_data(example):
    # Rename "input" to "prompt" and keep only the first rejected completion,
    # matching the (prompt, chosen, rejected) format DPOTrainer expects.
    example["prompt"] = example.pop("input")
    example["rejected"] = example["rejected"][0]
    return example

train_dataset = train_dataset.map(process_training_data)

training_args = DPOConfig(
    dataloader_pin_memory=False,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    logging_steps=10,
    # fp16=True  # enabling this yields loss=0, grad_norm=0, reward=nan
)
training_args.optimize_cuda_cache = True

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
trainer.train()
```
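For context, the precision toggle in the config above is just the fp16 flag; `DPOConfig` (via `TrainingArguments`) also exposes a `bf16` flag, and bfloat16 keeps float32's exponent range, so it behaves very differently from fp16 numerically. A minimal sketch of the same config with the alternative flag, assuming bf16-capable hardware (the other values are unchanged from the script above):

```
# Same DPOConfig as above, but with bf16 mixed precision instead of fp16
# (assumes the GPU supports bfloat16, e.g. Ampere or newer).
training_args = DPOConfig(
    dataloader_pin_memory=False,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    logging_steps=10,
    bf16=True,  # instead of fp16=True
)
```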
Perhaps a mixed-precision training issue? This older GitHub issue looks related:
(GitHub issue: opened 23 Jul 2023, 07:58 UTC; closed 31 Aug 2023, 08:03 UTC)
### System Info
pytorch 1.13.1
transformers==4.31.0
### Who can help?
… Hi @sgugger,
I used transformers 4.31.0 to train a Llama model with LoRA. I observe some problems with --fp16 training and I'm not sure whether it is a bug in Trainer.py.
My model is like:
```
class MyModel(nn.Module):
    def __init__(self, model_name):
        super().__init__()
        self.model_name = model_name
        self.base_model = LlamaForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
        self.base_model = get_peft_model(self.base_model, lora_config)
        self.other_modules = nn.Linear(4096, 4096)
```
I used the Trainer to train the model with the following command line:
`torchrun --nproc_per_node=4 main.py --max_steps 100000 --fp16`
I find that the model's gradients (in self.optimizer in the Trainer) are not fp16 but fp32. Is that correct?
Also, I find that no gradient scaling is performed during training, because self.do_grad_scaling is always False (self.sharded_ddp is None and args.half_precision_backend stays "auto"). The current trainer.py does not set up args.half_precision_backend and the scaler correctly when self.sharded_ddp is None. Are these observations expected? I'm confused about why setting up args.half_precision_backend and the scaler should require sharded_ddp. As a result, I often see the loss become NaN during training, and I'm not sure whether that is because no gradient scaling is performed and half_precision_backend is not set up correctly.
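For reference, this is roughly what fp16 gradient scaling looks like in plain PyTorch AMP; a generic sketch with placeholder `model`, `optimizer`, and `dataloader`, not the Trainer's internal code:

```
import torch

# Generic fp16 training loop with gradient scaling
# (model, optimizer, dataloader assumed to be defined elsewhere).
scaler = torch.cuda.amp.GradScaler()  # scales the loss so fp16 gradients don't underflow

for batch in dataloader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(**batch).loss
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.unscale_(optimizer)     # unscale before clipping or inspecting grad norms
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)         # skips the update if inf/nan gradients are detected
    scaler.update()                # adjusts the scale factor for the next step
```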
Below are my grad_norm values (before gradient clipping) with and without --fp16 (the base model here is "JackFram/llama-160m" for debugging). **The results are significantly different.**

| Step | grad_norm without --fp16 | grad_norm with --fp16 |
|------|--------------------------|-----------------------|
| 1    | 0.059                    | nan                   |
| 5    | 0.054                    | 129.88                |
| 10   | 0.048                    | 126.98                |
| 15   | 0.050                    | 149.58                |
| 20   | 0.050                    | 80.7                  |
```
def compute_grad_norm(optimizer):  # the function used to compute grad_norm
    total_norm = 0.0
    for group in optimizer.param_groups:
        for param in group['params']:
            if param.grad is not None:
                param_norm = param.grad.data.norm(2)
                total_norm += param_norm.item() ** 2
    total_norm = torch.sqrt(torch.tensor(total_norm))
    return total_norm
```
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
### Expected behavior
do_grad_scaling should be True when --fp16 is enabled, and the loss should only rarely become NaN.
Could you check the dtype of the LoRA parameters after model initialization? Specifically, are they float16 or float32?
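For example, something along these lines would show it; a minimal sketch assuming it runs right after `DPOTrainer` has applied `peft_config`, so the LoRA-wrapped model is reachable as `trainer.model`:

```
# Print the dtype of each LoRA parameter on the wrapped model;
# `trainer` is the DPOTrainer instance from the script above.
for name, param in trainer.model.named_parameters():
    if "lora" in name.lower():
        print(name, param.dtype, "trainable" if param.requires_grad else "frozen")
```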