[RuntimeError] DPOTrainer - "element 0 of tensors does not require grad and does not have a grad_fn" on 8x A100 GPUs

Hi all, I’m encountering a critical issue when running DPOTrainer on a multi-GPU A100 server (8x A100 40GB) with trl==0.17.0 and transformers==4.51.3.
Training fails on all ranks with the same RuntimeError during the .backward() call in FP16 mode.


:white_check_mark: Setup Summary

  • Base model: SeaLLMs-v3-7B, loaded in 4-bit using BitsAndBytesConfig
  • LoRA adapter: loaded from ../sft/sft_output_SeaLLMs-v3-7B
  • DPOTrainer config: fp16=True, disable_dropout=True
  • Device setup: 8 GPUs (CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7)
  • Launcher: accelerate launch --mixed_precision="fp16" train_dpo.py (full launch command below)
  • Libraries used:
transformers==4.51.3
trl==0.17.0
peft==0.15.2
accelerate==1.7.0
torch==2.7.0+cu118
bitsandbytes==0.45.5
datasets==3.6.0
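
For completeness, the full launch looks like this (the same CUDA_VISIBLE_DEVICES is also set inside the script, as shown in the snippet below):

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
accelerate launch --mixed_precision="fp16" train_dpo.py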

:fire: The Error

Once training begins, all ranks fail with this same error:

  0%|                                                                                                        | 0/120 [00:00<?, ?it/s]
[rank3]: Traceback (most recent call last):
[rank3]:   File "/raid/home/llmsosmed/test-amriz/TA/dpo/train_dpo.py", line 106, in <module>
[rank3]:     trainer.train()
[rank3]:   File "/raid/home/llmsosmed/rlaif/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
[rank3]:     return inner_training_loop(
[rank3]:   File "/raid/home/llmsosmed/rlaif/lib/python3.10/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
[rank3]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank3]:   File "/raid/home/llmsosmed/rlaif/lib/python3.10/site-packages/transformers/trainer.py", line 3782, in training_step
[rank3]:     self.accelerator.backward(loss, **kwargs)
[rank3]:   File "/raid/home/llmsosmed/rlaif/lib/python3.10/site-packages/accelerate/accelerator.py", line 2469, in backward
[rank3]:     self.scaler.scale(loss).backward(**kwargs)
[rank3]:   File "/raid/home/llmsosmed/rlaif/lib/python3.10/site-packages/torch/_tensor.py", line 648, in backward
[rank3]:     torch.autograd.backward(
[rank3]:   File "/raid/home/llmsosmed/rlaif/lib/python3.10/site-packages/torch/autograd/__init__.py", line 353, in backward
[rank3]:     _engine_run_backward(
[rank3]:   File "/raid/home/llmsosmed/rlaif/lib/python3.10/site-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank3]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank3]: RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

:puzzle_piece: Minimal Code Snippet

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
import torch, os
from accelerate import PartialState

# Setup
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
model_name = "SeaLLMs-v3-7B"
sft_output_dir = f"../sft/sft_output_{model_name}"
base_cache_dir = f"../model_cache/{model_name}"
output_dir = f"dpo_output_{model_name}"
preference_dataset = f"dpo_preference_dataset_{model_name}_clean.jsonl"

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(sft_output_dir, local_files_only=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# 4-bit quant config
quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Base model
device_string = PartialState().process_index
base_model = AutoModelForCausalLM.from_pretrained(
    base_cache_dir,
    device_map={"": device_string},
    quantization_config=quant_cfg,
    local_files_only=True,
)
base_model.config.use_cache = False

# LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    sft_output_dir,
    torch_dtype=torch.float16,
    device_map={"": device_string},
)
model.eval()

# Dataset
train_dataset = load_dataset("json", data_files={"train": preference_dataset}, split="train")

# DPO Config
dpo_args = DPOConfig(
    output_dir=output_dir,
    num_train_epochs=3.0,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    learning_rate=1e-6,
    logging_steps=100,
    save_steps=500,
    fp16=True,
    save_safetensors=True,
    disable_dropout=True,
)

# Trainer
trainer = DPOTrainer(
    model=model,
    args=dpo_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()

:paperclip: Dataset Format (Preference JSONL)

{
  "prompt": "...\\nQuestion: {question}\\nAnswer:",
  "chosen": "...", 
  "rejected": "..."
}
...

Where:

  • "prompt" contains the instruction, context, and question.
  • "chosen" is the preferred model response.
  • "rejected" is the less preferred alternative response.

All examples follow this pairwise preference format for DPO training.
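
For anyone who wants to double-check the data loading step, here is a minimal sketch (the resolved file name matches the f-string in the snippet above) that confirms the columns DPOTrainer expects:

from datasets import load_dataset

# Load the pairwise preference file and check that the expected DPO columns are present
ds = load_dataset(
    "json",
    data_files={"train": "dpo_preference_dataset_SeaLLMs-v3-7B_clean.jsonl"},
    split="train",
)
print(ds.column_names)        # should print ['prompt', 'chosen', 'rejected']
print(ds[0]["prompt"][:200])  # peek at the first prompt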

:folded_hands: Any guidance?

Any insight into why this might be happening (especially in the backward pass with LoRA + 4-bit quantization + DPO) would be really appreciated.

Thank you in advance!


Hmm… This case?

Can you call model.enable_input_require_grads() right after the call model = LlamaForCausalLM.from_pretrained(model_args.model_name_or_path, config=config)? That method should make sure the inputs have requires_grad set to True and thus avoid your issue, I believe.
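
Applied to the snippet above, that would look roughly like this (untested on my side, just adapting the suggestion; enable_input_require_grads is a method on transformers PreTrainedModel):

base_model = AutoModelForCausalLM.from_pretrained(
    base_cache_dir,
    device_map={"": device_string},
    quantization_config=quant_cfg,
    local_files_only=True,
)
base_model.config.use_cache = False

# Register a forward hook on the input embeddings so their outputs
# have requires_grad=True (this is what the suggestion above refers to).
base_model.enable_input_require_grads()

I believe this is also what peft’s prepare_model_for_kbit_training does internally.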