Plateau in Eval Loss after 100 steps in DPO Training

I recently expanded my dataset to include broader and higher-quality data for DPO training of an LLM. Despite this, I’m seeing the eval loss curve consistently flatten after exactly 100 steps, every single run, with no improvement in model performance afterwards. The same happens with all the evaluation metrics, including the reward metrics. I’ve tried various adjustments to the DPO beta and the learning rate, but the issue persists. The plateau occurs despite the richer dataset, which, to my understanding, should lead to better learning and performance.
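
For context, here is a minimal sketch of the kind of setup I’m describing (not my exact script). I’m assuming TRL’s `DPOConfig`/`DPOTrainer` API; the model name, data file, and hyperparameter values below are placeholders, and some argument names differ slightly between library versions:

```python
# Illustrative DPO setup sketch -- placeholders only, not my exact training script.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "my-base-model"  # placeholder for the actual checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Placeholder for my preference data (prompt / chosen / rejected columns)
dataset = load_dataset("json", data_files="preference_pairs.json")["train"]
split = dataset.train_test_split(test_size=0.05)
train_dataset, eval_dataset = split["train"], split["test"]

training_args = DPOConfig(
    output_dir="dpo-output",
    beta=0.1,                       # one of several beta values I've tried
    learning_rate=5e-7,             # likewise swept over a few values
    per_device_train_batch_size=4,
    num_train_epochs=1,
    eval_strategy="steps",          # "evaluation_strategy" in older versions
    eval_steps=50,                  # eval loss goes flat after ~100 train steps
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    processing_class=tokenizer,     # "tokenizer=" in older TRL versions
)
trainer.train()
```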

Has anyone else faced a similar issue when introducing a more diverse dataset? Could there be a hidden factor in the training dynamics that might be causing this? Any insights or suggestions for troubleshooting this would be greatly appreciated.

Thank you!