I created a dataset for preference alignment using Gemma2 models. Now I want to train the Gemma2 2B model using DPO. However, I keep running into a CUDA out-of-memory error on an A100 GPU at about 50% of training, even though I am only training for 100 steps.
I have already tried a few things:
- Why eval_accumulation_steps takes so much memory - #4 by nbroad → the error initially popped up at the first evaluation.
- model.config.use_cache = True → I had initially set it to False; changing it helped a bit (see the loading sketch below).
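For context, the model and tokenizer are loaded roughly like this (a simplified sketch rather than my exact code; google/gemma-2-2b stands in for whichever Gemma2 2B checkpoint is used):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b"  # placeholder for the Gemma2 2B checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches bf16=True in the DPOConfig below
)
model.config.use_cache = True  # was False at first; switching it helped a bit
tokenizer = AutoTokenizer.from_pretrained(model_id)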
The arguments I use are below:
from trl import DPOConfig, DPOTrainer

dpo_config = DPOConfig(
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    eval_accumulation_steps=8,
    gradient_checkpointing=True,
    learning_rate=8e-5,
    lr_scheduler_type="cosine",
    max_steps=100,
    save_strategy="no",
    logging_steps=1,
    output_dir=output_dir_name,
    optim="paged_adamw_32bit",
    warmup_steps=10,
    bf16=True,
    report_to="wandb",
    evaluation_strategy="steps",
    # Evaluate every 20% of training
    eval_steps=0.2,
)
dpo_trainer = DPOTrainer(
    model,
    args=dpo_config,
    train_dataset=ds['train'],
    eval_dataset=ds["test"],
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_prompt_length=1380,
    beta=0.1,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)
from peft import LoraConfig

param_rank = 16
param_lora_alpha = 16
peft_config = LoraConfig(
    r=param_rank,
    lora_alpha=param_lora_alpha,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj'],
)
Reducing target_modules to just ['k_proj', 'v_proj', 'q_proj', 'o_proj'] (sketched below) does train successfully for 100 steps, but that leaves the MLP projections unadapted and would likely affect the results.
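For reference, the reduced configuration that does complete 100 steps is the same LoraConfig as above with only the attention projections kept:

# Same settings as the peft_config above; only target_modules differs
peft_config = LoraConfig(
    r=param_rank,
    lora_alpha=param_lora_alpha,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'v_proj', 'q_proj', 'o_proj'],  # attention projections only, MLP modules dropped
)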
Any suggestions on how I can train successfully without reducing the target modules?