DPOConfig - SFT as loss function

Hello everyone, I'm new to DPOConfig and have a quick question about it. It provides the option to choose between different loss functions, one of them being sft.

Something I don’t understand: DPO is a method based on “chosen” and “rejected” pairs, whereas SFT works only with “chosen” and has no “rejected”.

What exactly is then plotted in train/rewards/rejected? How does DPO work with SFT?

Best regards, Kathi

How does DPO work with SFT?

It seems there are a few ways it can work.

Thank you, that is a really detailed description. I have one question: since I work with DPO because my dataset has three fields (prompt, chosen, rejected), the apo_zero loss should ideally be the optimal choice.

I can see an improvement in the plots for rewards/chosen; it looks similar to the SFT rewards/chosen. But since the loss now also takes the rejected examples into consideration, I see negative values, which do not appear in SFT.

However, when I check the model outputs, I notice quite a bit of repetition and randomness. I would have expected it to perform better than SFT, since it has more information to train on, but apo_zero is not as good as SFT. Why is that?

Hmm… idk…


Short answer: DPO/APO optimize a pairwise margin, not absolute likelihood. With apo_zero you push winners up and losers down, which can widen the margin while lowering the absolute probability of some winner tokens and amplifying verbosity/length biases. That can look better in the rewards/* plots yet decode worse than SFT and repeat more. This is expected and documented. (Hugging Face)
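
For reference, the vanilla DPO loss sees the policy only through the difference of the two β-scaled log-ratios, which is why the margin can improve without the chosen likelihood improving. apo_zero instead uses separate per-side terms anchored at zero (shown here up to notation, following the APO paper and TRL's description):

```latex
% Vanilla DPO: the policy enters only through the margin of the two log-ratios
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
     \log \sigma\!\Big(
       \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
       - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
     \Big)
   \right]

% apo_zero: push the chosen log-ratio up and the rejected log-ratio down,
% each anchored at zero, with r_\theta(x,y) = \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
\mathcal{L}_{\mathrm{APO\text{-}zero}}(\theta) =
  \mathbb{E}_{(x,\,y_w,\,y_l)}\left[
     \bigl(1 - \sigma(\beta\, r_\theta(x, y_w))\bigr) + \sigma(\beta\, r_\theta(x, y_l))
   \right]
```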

Why apo_zero can underperform SFT and cause repetition/randomness

  1. Objective mismatch. Standard DPO-type losses maximize a difference; they need not increase the winner’s likelihood everywhere. When chosen/rejected are very similar, vanilla DPO can even reduce the chosen likelihood while still improving the pairwise logit (see the numeric sketch after this list). DPOP (a fix) was proposed for this exact failure mode. (arXiv)

  2. APO anchoring is situational. apo_zero assumes winners are better than your model and thus pushes winners ↑ and losers ↓. If your model is already stronger than many winners, the right variant is apo_down (decrease both, penalize losers more). Using the wrong anchor can harm quality. (Hugging Face)

  3. Preference noise. Flipped or ambiguous labels make DPO drift. Robust/cDPO variants or label smoothing mitigate this. TRL exposes label_smoothing and “Robust DPO.” (ar5iv)

  4. Length/verbosity bias. DPO magnifies length sensitivity; longer answers win, which increases repetition and randomness at decode time unless controlled. Use LD-DPO/SamPO-style regularizers. TRL ships LD-DPO. (ACL Anthology)

  5. β and reference sensitivity. DPO-style methods are sensitive to β and reference choice; poor settings yield unstable gradients and worse decoding. This has been observed empirically in small-compute settings. (cs224r.stanford.edu)

  6. It’s a known outcome. Practitioners have reported “better margins, worse generations” when pairs are too similar or data is small; mixing SFT back in helped. (GitHub)
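
To make point 1 concrete, here is a toy numeric sketch (plain Python, no TRL; the numbers are invented) showing that the pairwise sigmoid loss can decrease even though the chosen log-ratio got worse, simply because the rejected log-ratio got worse faster:

```python
import math

def dpo_loss(chosen_logratio, rejected_logratio, beta=0.1):
    """Pairwise DPO loss: -log sigmoid(beta * (chosen_logratio - rejected_logratio))."""
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Start of training: policy matches the reference on both responses.
print(dpo_loss(0.0, 0.0))    # ~0.693

# After an update that LOWERS the chosen log-ratio (chosen became less likely)
# but lowers the rejected log-ratio even more: the loss still improves.
print(dpo_loss(-1.0, -5.0))  # ~0.513, despite the chosen response losing probability
```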

About the plots you see

  • rewards/chosen and rewards/rejected are policy–reference log-prob gaps scaled by β. Negative values only mean “the policy assigns lower log-prob than the reference on that side,” not that training failed. Under apo_zero, pushing losers down often drives rewards/rejected sharply negative even while rewards/chosen only wiggles. (Hugging Face) A minimal sketch follows.
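
Assuming you already have summed per-sequence log-probs from the policy and the frozen reference model, this mirrors how TRL's DPOTrainer forms those logged rewards (up to detaching and batch averaging):

```python
import torch

def logged_rewards(policy_chosen_logps, ref_chosen_logps,
                   policy_rejected_logps, ref_rejected_logps, beta=0.1):
    """rewards/chosen and rewards/rejected are beta-scaled policy-vs-reference log-prob gaps."""
    rewards_chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rewards_rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    margins = rewards_chosen - rewards_rejected  # rewards/margins in the logs
    return rewards_chosen.mean(), rewards_rejected.mean(), margins.mean()

# Toy numbers: the policy is slightly below the reference on the chosen side and far
# below it on the rejected side -> both rewards are negative, yet the margin is positive.
print(logged_rewards(torch.tensor([-60.0]), torch.tensor([-58.0]),
                     torch.tensor([-90.0]), torch.tensor([-70.0])))
# roughly (-0.2, -2.0, 1.8)
```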

What to change that usually fixes it (config and decoding sketches follow the list)

  • Pick the right anchor. If many “winners” are worse than your current model, switch to loss_type="apo_down". If winners are truly better, keep apo_zero. (Hugging Face)

  • Add positivity pressure. Try DPOP or add an SFT term: rpo_alpha≈1.0 in TRL or multi-loss ["sigmoid","sft"]. This counters the “margin-only” failure and stabilizes decoding. (arXiv)

  • Harden against noise. Set label_smoothing>0 or use the robust DPO loss when label flips are likely. (ar5iv)

  • Control verbosity. Enable LD-DPO (set ld_alpha) or adopt a length-regularized scheme like SamPO. (Hugging Face)

  • Mind decoding. Evaluate apples-to-apples with stable decoding (lower temperature, nucleus sampling, optional repetition penalty). Many “randomness” complaints come from sampler drift, not the policy. (Hugging Face)

  • Use ref syncing if drifting. TRL exposes reference-model syncing callbacks to reduce KL drift during DPO-style training. (Hugging Face)
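
Putting the config-side knobs above together, a hedged DPOConfig sketch (illustrative values, not tuned recommendations; parameter names follow recent TRL releases, so check your installed DPOConfig before copying):

```python
from trl import DPOConfig

config = DPOConfig(
    output_dir="dpo-apo-run",  # hypothetical output path
    loss_type="apo_down",      # switch to "apo_zero" only if the chosen answers really beat your model
    beta=0.1,                  # DPO-style methods are sensitive to beta; sweep it
    rpo_alpha=1.0,             # adds an NLL/SFT term on the chosen response (RPO-style positivity pressure)
    ld_alpha=0.3,              # LD-DPO length desensitization, if present in your TRL version
    sync_ref_model=True,       # periodically sync the reference model to limit drift
    ref_model_sync_steps=512,
    ref_model_mixup_alpha=0.6,
)

# If noisy labels are the bigger worry, the sigmoid-family losses are the ones that
# actually consume label_smoothing (the APO variants ignore it), e.g.:
# config = DPOConfig(output_dir="dpo-run", loss_type="robust", label_smoothing=0.1, rpo_alpha=1.0)
# Recent TRL also accepts a list of losses, e.g. loss_type=["sigmoid", "sft"] with loss_weights=[1.0, 1.0].
```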
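
For the decoding point, a quick way to rule out sampler drift is to compare the SFT and DPO checkpoints under identical, conservative generation settings (standard transformers arguments; the path and values here are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/your-dpo-checkpoint"  # placeholder: run the same code for the SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Your evaluation prompt here", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,         # keep identical across checkpoints
    top_p=0.9,               # nucleus sampling
    repetition_penalty=1.1,  # mild penalty helps separate policy issues from sampler issues
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```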