Practical Exercise: GRPO with Unsloth reward curve

khalilbibi · March 31, 2025, 5:29pm

Hi, I used the code from “Practical Exercise: GRPO with Unsloth”, and my reward curve looks like this:

I was wondering whether this is what it is expected to look like. I was expecting the reward to become more stable after 150-200 steps.
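Per-step GRPO rewards are computed on small batches of sampled completions, so the raw curve tends to be noisy. One way to check whether there is an underlying trend is to smooth the logged values. Below is a minimal sketch, assuming the rewards have been exported as a plain Python list; `rolling_mean` and the `window` size are illustrative names and values, not part of the exercise code.

```python
def rolling_mean(values, window=20):
    """Trailing moving average; early steps average over what is available."""
    smoothed = []
    total = 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            # Drop the value that just left the trailing window.
            total -= values[i - window]
        smoothed.append(total / min(i + 1, window))
    return smoothed


# Hypothetical reward log, for illustration only.
rewards = [0.1, 0.4, 0.0, 0.6, 0.3, 0.7, 0.2, 0.8]
print(rolling_mean(rewards, window=4))
```

If the smoothed curve is flat over several hundred steps, that points to a lack of learning rather than ordinary step-to-step noise.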

John6666 · April 1, 2025, 10:31am

I wonder… It’s normal for there to be variation, but it’s a bit strange that there’s no gradient?

github.com/huggingface/trl

GRPO: Why does loss start at 0 for first K steps and then increase over time?

opened 06:27PM - 30 Jan 25 UTC

arnavgarg1

❓ question 🏋 GRPO

### Reproduction

Hi all! I've been trying to train a variety of models using …GRPO, but I noticed that the train/loss metric remains 0 or close to 0 throughout training, even after a large number of steps (>200). My mean rewards also don't change significantly during this same period. This seems unexpected and might indicate that the optimization step isn't working as intended.

However, on the official docs page, the learning curves shown suggest that the loss can remain close to 0 for long periods and actually increases instead of decreasing: https://huggingface.co/docs/trl/main/en/grpo_trainer#trl.GRPOConfig. Similarly, in @philschmid's blog post from earlier today (https://www.philschmid.de/mini-deepseek-r1), I noticed a similar trend where the loss stayed at 0 for nearly 200 steps before increasing. This makes it seem like the expected behavior, but I'm having a hard time understanding it.

I had a few questions that I am hoping someone can help me with:

1. Is there an issue with the loss computation in GRPO?
2. Should loss values be expected to remain near zero in this setting? What does an increasing loss suggest?
3. Could this be related to specific hyperparameters or gradient updates not being applied correctly?

Would appreciate any insights or guidance on debugging this! Thanks!

### System Info

- OS: Linux
- Transformers 4.48.1
- TRL: Main Branch

### Checklist

- [x] I have checked that my issue isn't already filed (see [open issues](https://github.com/huggingface/trl/issues?q=is%3Aissue))
- [x] I have included my system information
- [x] Any code provided is minimal, complete, and reproducible ([more on MREs](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks))
- [x] Any code provided is properly formatted in code blocks, (no screenshot, [more on code blocks](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks))
- [x] Any traceback provided is complete
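One reason a near-zero loss need not mean a broken optimizer: GRPO normalizes advantages within each group of completions, so they average to zero. While the policy still matches the one that generated the samples (probability ratio of 1, zero KL to the reference), the loss value is therefore about zero even though its gradient is not. The following is a minimal sketch of that arithmetic, assuming sequence-level terms, no clipping, and a population standard deviation; the function names and the `beta` value are illustrative and are not TRL's API.

```python
import math


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def grpo_loss_value(rewards, ratio=1.0, kl=0.0, beta=0.04):
    """Simplified loss value: mean of -(ratio * advantage) + beta * kl.

    Assumption: the same ratio and KL apply to every completion, which
    only holds at the very start of training.
    """
    advantages = group_advantages(rewards)
    per_sample = [-(ratio * a) + beta * kl for a in advantages]
    return sum(per_sample) / len(per_sample)


# Hypothetical rewards for one group of four completions.
rewards = [0.0, 1.0, 2.0, 0.5]
print(grpo_loss_value(rewards))           # approximately 0
print(grpo_loss_value(rewards, kl=0.5))   # positive once KL is nonzero
```

Under this simplified view, the reported loss moves away from zero mainly as the KL term grows, so the reward curves are usually a more informative signal of progress than the loss itself.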
