DPOConfig - SFT as loss function

Hello everyone, I'm new to DPOConfig and have a quick question about it. It provides the option to choose between different loss functions, one of them being sft.

Something I don’t understand: DPO is a method based on “chosen” and “rejected” pairs, whereas SFT works only with “chosen” and has no “rejected”.

What exactly is then plotted in train/rewards/rejected? How does DPO work with SFT?

Best regards, Kathi

How does DPO work with SFT?

It seems there are a few ways it can work.

Thank you, that is a really detailed description. I have one question: since I work with DPO because my dataset has three fields (prompt, chosen, rejected), the apo_zero loss should ideally be the optimal choice.

I can see an improvement in the plots for rewards/chosen; it looks similar to the SFT rewards/chosen. But since the loss now also takes the rejected examples into consideration, I see negative values, which do not appear in SFT.

However, when I check the model outputs, I notice quite a bit of repetition and randomness. I would have expected it to perform better than SFT, since it has more information to train on, but apo_zero is not as good as SFT. Why is that?

Hmm… idk…


Short answer: DPO/APO optimize a pairwise margin, not absolute likelihood. With apo_zero you push winners up and losers down, which can widen the margin while lowering the absolute probability of some winner tokens and amplifying verbosity/length biases. That can look better in the rewards/* plots yet decode worse than SFT and repeat more. This is expected and documented. (Hugging Face)
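
For reference, the vanilla DPO loss sees the policy only through the difference of the two β-scaled log-ratios, which is why the margin can improve without the chosen likelihood improving. apo_zero instead uses separate per-side terms anchored at zero (shown here up to notation, following the APO paper and TRL's description):

```latex
% Vanilla DPO: the policy enters only through the margin of the two log-ratios
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
     \log \sigma\!\Big(
       \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
       - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
     \Big)
   \right]

% apo_zero: push the chosen log-ratio up and the rejected log-ratio down,
% each anchored at zero, with r_\theta(x,y) = \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
\mathcal{L}_{\mathrm{APO\text{-}zero}}(\theta) =
  \mathbb{E}_{(x,\,y_w,\,y_l)}\left[
     \bigl(1 - \sigma(\beta\, r_\theta(x, y_w))\bigr) + \sigma(\beta\, r_\theta(x, y_l))
   \right]
```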

Why apo_zero can underperform SFT and cause repetition/randomness

  1. Objective mismatch. Standard DPO-type losses maximize a difference; they need not increase the winner’s likelihood everywhere. When chosen/rejected are very similar, vanilla DPO can even reduce the chosen likelihood while still improving the pairwise logit (see the numeric sketch after this list). DPOP (a fix) was proposed for this exact failure mode. (arXiv)

  2. APO anchoring is situational. apo_zero assumes winners are better than your model and thus pushes winners ↑ and losers ↓. If your model is already stronger than many winners, the right variant is apo_down (decrease both, penalize losers more). Using the wrong anchor can harm quality. (Hugging Face)

  3. Preference noise. Flipped or ambiguous labels make DPO drift. Robust/cDPO variants or label smoothing mitigate this. TRL exposes label_smoothing and “Robust DPO.” (ar5iv)

  4. Length/verbosity bias. DPO magnifies length sensitivity; longer answers win, which increases repetition and randomness at decode time unless controlled. Use LD-DPO/SamPO-style regularizers. TRL ships LD-DPO. (ACL Anthology)

  5. β and reference sensitivity. DPO-style methods are sensitive to β and reference choice; poor settings yield unstable gradients and worse decoding. This has been observed empirically in small-compute settings. (cs224r.stanford.edu)

  6. It’s a known outcome. Practitioners have reported “better margins, worse generations” when pairs are too similar or data is small; mixing SFT back in helped. (GitHub)
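
To make point 1 concrete, here is a toy numeric sketch (plain Python, no TRL; the numbers are invented) showing that the pairwise sigmoid loss can decrease even though the chosen log-ratio got worse, simply because the rejected log-ratio got worse faster:

```python
import math

def dpo_loss(chosen_logratio, rejected_logratio, beta=0.1):
    """Pairwise DPO loss: -log sigmoid(beta * (chosen_logratio - rejected_logratio))."""
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Start of training: policy matches the reference on both responses.
print(dpo_loss(0.0, 0.0))    # ~0.693

# After an update that LOWERS the chosen log-ratio (chosen became less likely)
# but lowers the rejected log-ratio even more: the loss still improves.
print(dpo_loss(-1.0, -5.0))  # ~0.513, despite the chosen response losing probability
```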

About the plots you see

  • rewards/chosen and rewards/rejected are policy–reference log-prob gaps scaled by β. Negative values only mean “the policy assigns lower log-prob than the reference on that side,” not that training failed. Under apo_zero, pushing losers down often drives rewards/rejected sharply negative even while rewards/chosen only wiggles. (Hugging Face) A minimal sketch follows.
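
Assuming you already have summed per-sequence log-probs from the policy and the frozen reference model, this mirrors how TRL's DPOTrainer forms those logged rewards (up to detaching and batch averaging):

```python
import torch

def logged_rewards(policy_chosen_logps, ref_chosen_logps,
                   policy_rejected_logps, ref_rejected_logps, beta=0.1):
    """rewards/chosen and rewards/rejected are beta-scaled policy-vs-reference log-prob gaps."""
    rewards_chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rewards_rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    margins = rewards_chosen - rewards_rejected  # rewards/margins in the logs
    return rewards_chosen.mean(), rewards_rejected.mean(), margins.mean()

# Toy numbers: the policy is slightly below the reference on the chosen side and far
# below it on the rejected side -> both rewards are negative, yet the margin is positive.
print(logged_rewards(torch.tensor([-60.0]), torch.tensor([-58.0]),
                     torch.tensor([-90.0]), torch.tensor([-70.0])))
# roughly (-0.2, -2.0, 1.8)
```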

What to change that usually fixes it (config and decoding sketches follow the list)

  • Pick the right anchor. If many “winners” are worse than your current model, switch to loss_type="apo_down". If winners are truly better, keep apo_zero. (Hugging Face)

  • Add positivity pressure. Try DPOP or add an SFT term: rpo_alpha≈1.0 in TRL or multi-loss ["sigmoid","sft"]. This counters the “margin-only” failure and stabilizes decoding. (arXiv)

  • Harden against noise. Set label_smoothing>0 or use the robust DPO loss when label flips are likely. (ar5iv)

  • Control verbosity. Enable LD-DPO (set ld_alpha) or adopt a length-regularized scheme like SamPO. (Hugging Face)

  • Mind decoding. Evaluate apples-to-apples with stable decoding (lower temperature, nucleus sampling, optional repetition penalty). Many “randomness” complaints come from sampler drift, not the policy. (Hugging Face)

  • Use ref syncing if drifting. TRL exposes reference-model syncing callbacks to reduce KL drift during DPO-style training. (Hugging Face)
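
Putting the config-side knobs above together, a hedged DPOConfig sketch (illustrative values, not tuned recommendations; parameter names follow recent TRL releases, so check your installed DPOConfig before copying):

```python
from trl import DPOConfig

config = DPOConfig(
    output_dir="dpo-apo-run",  # hypothetical output path
    loss_type="apo_down",      # switch to "apo_zero" only if the chosen answers really beat your model
    beta=0.1,                  # DPO-style methods are sensitive to beta; sweep it
    rpo_alpha=1.0,             # adds an NLL/SFT term on the chosen response (RPO-style positivity pressure)
    ld_alpha=0.3,              # LD-DPO length desensitization, if present in your TRL version
    sync_ref_model=True,       # periodically sync the reference model to limit drift
    ref_model_sync_steps=512,
    ref_model_mixup_alpha=0.6,
)

# If noisy labels are the bigger worry, the sigmoid-family losses are the ones that
# actually consume label_smoothing (the APO variants ignore it), e.g.:
# config = DPOConfig(output_dir="dpo-run", loss_type="robust", label_smoothing=0.1, rpo_alpha=1.0)
# Recent TRL also accepts a list of losses, e.g. loss_type=["sigmoid", "sft"] with loss_weights=[1.0, 1.0].
```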
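
For the decoding point, a quick way to rule out sampler drift is to compare the SFT and DPO checkpoints under identical, conservative generation settings (standard transformers arguments; the path and values here are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/your-dpo-checkpoint"  # placeholder: run the same code for the SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Your evaluation prompt here", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,         # keep identical across checkpoints
    top_p=0.9,               # nucleus sampling
    repetition_penalty=1.1,  # mild penalty helps separate policy issues from sampler issues
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```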