# train_dpo.py
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
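# policy model and tokenizer; if no ref_model is passed to DPOTrainer, it creates its own frozen reference copy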
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = DPOConfig(output_dir="Qwen2-0.5B-DPO", logging_steps=10)
trainer = DPOTrainer(model=model, args=training_args, processing_class=tokenizer, train_dataset=train_dataset)
trainer.train()
I have an A100 workstation with 4x 80GB GPUs. If my understanding is correct, by default (without DeepSpeed or FSDP), running this script with accelerate launch train_dpo.py results in 8 copies of the model: 2 per device (the policy model being trained plus a reference model). Is there a way to map the reference model only to device 0 and use devices 1, 2, and 3 for the policy model?
It seems that setting device_map="cuda:0" on the ref_model and then passing that ref_model to DPOTrainer() gets overridden later by the trainer.
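For concreteness, here is roughly what that attempt looks like (a minimal sketch of the relevant lines; I am assuming the ref_model argument of DPOTrainer and device_map on from_pretrained are the right knobs for placement):

# sketch of the attempted placement; this is what appears to get overridden once training starts
ref_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-0.5B-Instruct",
    device_map="cuda:0",  # try to pin the frozen reference model to GPU 0 only
)
trainer = DPOTrainer(
    model=model,                  # policy model, replicated across the 4 GPUs by accelerate
    ref_model=ref_model,          # pass the explicitly placed reference model instead of letting the trainer copy the policy
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)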