Hello, I’m new to DPO. I’m currently working with DPOConfig(). During the training of my model, a few metrics are plotted, such as “rewards/chosen”, “rewards/rejected”, “train/logps/rejected”, etc.

While training, I see that the value for rewards/chosen goes up to 15 and rewards/rejected goes down to −35. What I don’t understand is what exactly is being plotted. What is the meaning of these numbers? They are not probabilities, so how should I interpret them?
Great question! The metrics rewards/chosen and rewards/rejected in DPO are not probabilities; they are scaled differences of log-probabilities between the policy model and a reference model. So values like +15 or −35 reflect how strongly the model favors or disfavors a response relative to the reference. For a concise breakdown of exactly what these “rewards” are and how they’re computed, check out the DPO Trainer documentation here:
https://huggingface.co/docs/trl/en/dpo_trainer
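
To make that concrete, here is a minimal sketch of how those logged rewards are typically derived (this is not the TRL source, and the log-probability values below are made up for illustration). Each reward is the DPO beta times the difference between the policy’s and the reference model’s summed token log-probabilities for the same response:

```python
import torch

beta = 0.1  # the beta from DPOConfig

# Hypothetical summed log-probabilities of the full responses
# under the trained policy and the frozen reference model.
policy_chosen_logps = torch.tensor([-45.0])
policy_rejected_logps = torch.tensor([-120.0])
ref_chosen_logps = torch.tensor([-95.0])
ref_rejected_logps = torch.tensor([-60.0])

# What gets plotted as rewards/chosen and rewards/rejected:
chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)        # +5.0
rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)  # -6.0

# rewards/margins is the gap between the two; rewards/accuracies is the
# fraction of pairs where the chosen reward beats the rejected one.
margins = chosen_rewards - rejected_rewards                      # 11.0
accuracies = (chosen_rewards > rejected_rewards).float().mean()  # 1.0

print(chosen_rewards.item(), rejected_rewards.item(),
      margins.item(), accuracies.item())
```

So a rewards/chosen of +15 means the policy assigns the chosen responses much higher (beta-scaled) log-probability than the reference does, and a rewards/rejected of −35 means it has pushed the rejected responses far below the reference. The absolute values are in units of beta-scaled log-probability, which is why they can grow well beyond the [0, 1] range you’d expect from probabilities.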
Thank you for the reply. I have another question: when I choose SFT as the loss function in the DPOConfig, how is it still considered DPO, since SFT does not take rejected responses into account?