DPO Training ruins my model’s conversational coherence

Hi everyone,

I’m currently fine-tuning a chatbot. My pipeline first applies SFT to establish the desired style, then incorporates DPO training (with a mixed-in SFT loss for stability) to help the model understand its capability boundaries — e.g., to avoid making unrealistic promises like “I can help you turn on the air conditioner.”
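
For concreteness, here is a minimal sketch of what I mean by "DPO with a mixed-in SFT loss" (the function name and the `beta` / `sft_weight` values are placeholders, not my exact config):

```python
import torch.nn.functional as F

def dpo_plus_sft_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      chosen_nll, beta=0.1, sft_weight=1.0):
    """Standard DPO loss plus a weighted SFT (NLL) term on the chosen response.

    *_logps: summed log-probs of each response under the policy / reference model.
    chosen_nll: token-averaged negative log-likelihood of the chosen response
                under the policy (the usual SFT loss).
    beta, sft_weight: placeholder values, not my actual hyperparameters.
    """
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # DPO: push the policy's chosen-vs-rejected margin above the reference's.
    dpo_loss = -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
    # Mix in the SFT loss to keep the model anchored to the chosen style.
    return dpo_loss + sft_weight * chosen_nll.mean()
```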

The SFT phase works fine; however, once I apply DPO, the model’s behavior completely collapses. Specifically, with a system prompt the model starts producing incoherent or repetitive output after a few normal turns. Without a system prompt the degradation is even worse: the output is pure noise or nonsensical most of the time.
I’ve used DPO in other contexts, and while results can vary, I’ve never seen it completely destroy a model’s ability to hold a coherent conversation.

Some additional details:

- I’ve tried both my own custom trainer and existing frameworks like Swift, with similar outcomes.

- My training data follows the standard DPO format: conversation history, instruction, chosen, and rejected. (Note: system prompts are not included in the training data.) A sample record is sketched below.

- The loss is computed over every assistant turn in the conversation. I also tried the more common setup of computing it only over the last turn, but it made no difference (see the masking sketch after this list).

- I ran the experiments on both 7B and 32B models; the behavior was the same.
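
To make the data format concrete, a single record looks roughly like this (the texts are made up and the exact key names depend on the framework, so treat it as an illustration rather than my literal data):

```python
sample = {
    "history": [
        ["Can you turn on the air conditioner?",
         "I'm a text-based assistant, so I can't control physical devices, "
         "but I can walk you through your thermostat settings."],
    ],
    "instruction": "Could you also book a repair appointment for me?",
    "chosen": "I can't make bookings myself, but I can help you draft the request "
              "or find the service's contact details.",
    "rejected": "Sure, I've booked a technician for tomorrow at 10 am.",
    # no "system" key: system prompts are not included in the training data
}
```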
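
And this is the gist of the difference between scoring every assistant turn and scoring only the last one when building the per-token label mask (a simplified sketch; `-100` is the usual ignore index, and `build_labels` / `token_spans` are illustrative names, not real framework APIs):

```python
def build_labels(token_spans, input_ids, last_turn_only=False, ignore_index=-100):
    """token_spans: list of (start, end, role) spans over input_ids.

    Returns labels where only the selected assistant spans keep their tokens;
    everything else is set to ignore_index and is skipped by the loss.
    """
    labels = [ignore_index] * len(input_ids)
    assistant_spans = [(s, e) for s, e, role in token_spans if role == "assistant"]
    if last_turn_only:
        assistant_spans = assistant_spans[-1:]  # score only the final assistant reply
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels
```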

Has anyone encountered similar issues, or do you have any insights on what might be going wrong?

Any pointers would be greatly appreciated. Thank you!


This issue might be similar.
