DPO Training Gemma2 2B CUDA out of memory error

I created a dataset for preference alignment using Gemma2 models. Now I want to train the Gemma2 2B model with DPO. However, I keep running into a CUDA out-of-memory error on an A100 GPU at about 50% of training, even though I am training for only 100 steps.
I have already tried a few things:

  1. Why eval_accumulation_steps takes so much memory - #4 by nbroad → the error initially appeared during the first evaluation, which led me to this thread (see the memory-tracking sketch after this list).
  2. model.config.use_cache = True → I had initially set it to False; changing it helped a bit.
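
To see whether the spike comes from the training steps or from evaluation, I also attach a small logging callback (a hypothetical helper I wrote for debugging, not something from the linked thread):

import torch
from transformers import TrainerCallback

class MemoryCallback(TrainerCallback):
    # Print peak CUDA memory at every log step, then reset the counter,
    # so training steps and eval steps can be compared directly.
    def on_log(self, args, state, control, **kwargs):
        peak_gib = torch.cuda.max_memory_allocated() / 1024**3
        print(f"step {state.global_step}: peak CUDA memory {peak_gib:.2f} GiB")
        torch.cuda.reset_peak_memory_stats()

# Attached after the trainer is built (see below):
# dpo_trainer.add_callback(MemoryCallback())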

The arguments I use are below:

from trl import DPOConfig, DPOTrainer

dpo_config = DPOConfig(
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    eval_accumulation_steps=8,
    gradient_checkpointing=True,
    learning_rate=8e-5,
    lr_scheduler_type="cosine",
    max_steps=100,
    save_strategy="no",
    logging_steps=1,
    output_dir=output_dir_name,
    optim="paged_adamw_32bit",
    warmup_steps=10,
    bf16=True,
    report_to="wandb",
    evaluation_strategy="steps",
    # Evaluate every 20% of training
    eval_steps=0.2,
)

from peft import LoraConfig

param_rank = 16
param_lora_alpha = 16
peft_config = LoraConfig(
    r=param_rank,
    lora_alpha=param_lora_alpha,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)
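
As a rough sanity check on adapter size, wrapping the model once in a scratch session and printing the trainable parameter count shows what the three extra MLP projections cost (a sketch; the exact numbers depend on the model):

from peft import get_peft_model

# Each extra target module adds LoRA weights plus their gradients plus
# 32-bit paged Adam states, which is where the extra memory goes.
peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()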

dpo_trainer = DPOTrainer(
    model,
    args=dpo_config,
    train_dataset=ds['train'],
    eval_dataset=ds["test"],
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_prompt_length=1380,
    beta=0.1,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)
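
The preprocess_logits_for_metrics I pass in is along the lines of the fix from the thread in point 1: hand back token ids instead of full-vocabulary logits so evaluation doesn't accumulate huge tensors. A minimal sketch of that pattern (my actual function may differ slightly):

import torch

def preprocess_logits_for_metrics(logits, labels):
    # The raw logits are (batch, seq_len, vocab) and Gemma2's vocabulary
    # has ~256k entries, so accumulating them during evaluation eats
    # memory; keep only the predicted token ids instead.
    if isinstance(logits, tuple):
        logits = logits[0]
    return torch.argmax(logits, dim=-1)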

Reducing the target modules to just 'k_proj', 'v_proj', 'q_proj', 'o_proj' does train successfully for 100 steps, but that adapts only the attention projections and leaves the MLP layers (gate_proj, up_proj, down_proj) untouched, which would affect the results.
Any suggestions on how I can train successfully without reducing the target modules?
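
One avenue I have not tried yet: loading the frozen base model in 4-bit (QLoRA-style) so that adapters on all seven modules still fit in memory. A rough, untested sketch (the model id and quantization settings are just my starting assumptions):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization of the frozen base weights; compute in bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b",
    quantization_config=bnb_config,
)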


optim="paged_adamw_32bit",

Maybe the OOM is due to this.
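
If so, a cheap experiment would be the 8-bit paged variant, whose optimizer states take roughly a quarter of the memory of the 32-bit ones. An untested sketch (other arguments unchanged from the config above):

dpo_config = DPOConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    optim="paged_adamw_8bit",  # 8-bit states instead of paged_adamw_32bit
    output_dir=output_dir_name,
    # ... remaining arguments as in the original config
)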