Running DPOTrainer with custom GPU management

# train_dpo.py
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(output_dir="Qwen2-0.5B-DPO", logging_steps=10)
trainer = DPOTrainer(model=model, args=training_args, processing_class=tokenizer, train_dataset=train_dataset)
trainer.train()

I have an A100 workstation with 4x 80GB GPUs.
If my understanding is correct, by default (without DeepSpeed or FSDP), running this code with accelerate launch train_dpo.py results in 8 copies of the model: 2 models per device (the policy model being trained plus a frozen reference model). Is there a way to map the reference model only to device 0 and use devices 1, 2, and 3 for the policy model?
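For context, I launch it with the plain multi-GPU accelerate setup (no DeepSpeed/FSDP plugin configured), roughly:

accelerate launch --num_processes 4 train_dpo.py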

It seems that setting device_map="cuda:0" when loading the ref_model and then passing it to DPOTrainer() gets overridden by the trainer later on.
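Roughly what I tried (a minimal sketch of the attempt described above, reusing the names from the script; ref_model is the explicit reference-model argument DPOTrainer accepts):

# Attempted workaround (sketch): pin the reference model to GPU 0 explicitly.
ref_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-0.5B-Instruct",
    device_map="cuda:0",  # intended to keep the frozen reference model on device 0 only
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # passed explicitly, but the trainer appears to move/prepare it itself
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()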
