Hello, I’m new to DPO.
I’m currently working with DPOConfig(). During training, a few metrics are plotted, such as "rewards/chosen", "rewards/rejected", "train/logps/rejected", etc.
While training, I see that the value for rewards/chosen goes up to 15 and rewards/rejected goes down to –35. What I don’t understand is what exactly is being plotted. What is the meaning of these numbers? They are not probabilities, so how should I interpret them?
Great question! In DPO, metrics like rewards/chosen and rewards/rejected are not probabilities. They are differences of log-probabilities between the policy model and a reference model, scaled by the beta hyperparameter. So values like +15 or −35 reflect how strongly the policy favors or disfavors a response relative to the reference, and they are unbounded. For a concise breakdown of exactly what these "rewards" are and how they’re computed, check out the DPO Trainer documentation here:
https://huggingface.co/docs/trl/en/dpo_trainer
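
In case a concrete example helps, here is a minimal sketch of that computation under the usual DPO formulation. The variable names and the beta value below are illustrative, not TRL's exact internals:

```python
import torch

# Sketch of how the logged "rewards" are formed in DPO-style training.
# Variable names here are illustrative placeholders, not TRL internals.

beta = 0.1  # the DPOConfig beta hyperparameter (scaling factor)

# Sequence-level log-probabilities (sum of token log-probs) for a batch of pairs
policy_chosen_logps = torch.tensor([-120.0, -95.0])    # log p_policy(chosen | prompt)
policy_rejected_logps = torch.tensor([-140.0, -180.0])  # log p_policy(rejected | prompt)
ref_chosen_logps = torch.tensor([-135.0, -110.0])       # log p_ref(chosen | prompt)
ref_rejected_logps = torch.tensor([-110.0, -120.0])     # log p_ref(rejected | prompt)

# The "implicit rewards": beta-scaled log-ratios of policy vs. reference
chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

# What gets plotted is (roughly) the batch mean of these tensors
print("rewards/chosen:  ", chosen_rewards.mean().item())    # > 0: policy prefers chosen more than ref does
print("rewards/rejected:", rejected_rewards.mean().item())  # < 0: policy prefers rejected less than ref does
print("rewards/margins: ", (chosen_rewards - rejected_rewards).mean().item())

# The standard (sigmoid) DPO loss is built from the same quantity
loss = -torch.nn.functional.logsigmoid(chosen_rewards - rejected_rewards).mean()
print("dpo loss:", loss.item())
```

So, roughly speaking, with a beta of 0.1 a rewards/chosen value of +15 means the policy assigns the chosen responses on the order of 150 nats more log-probability than the reference model does; the numbers grow as the policy drifts further from the reference, which is why they can get large.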
Thank you for the reply. I have another question: when I set the loss type to SFT in DPOConfig, how is it still considered DPO, given that SFT does not take the rejected responses into account?