Hi, I used the code from “Practical Exercise: GRPO with Unsloth”, and my reward curve looks like this:
I was wondering whether this is how it's expected to look? I was expecting the reward to become more stable after 150-200 steps.
I wonder… Some variation is normal, but it's a bit strange that the curve shows no overall upward slope?
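If it helps, here is a rough way to check whether there is any trend hiding under the noise. This is only a sketch: it assumes you have copied the per-step reward values out of your training logs into a plain list, and the `rewards` values, the window size, and the helper names `moving_average` and `linear_slope` are all made up for illustration, not taken from the exercise.

```python
def moving_average(values, window):
    """Trailing moving average; early entries average over fewer points."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the value that just fell out of the window.
            running -= values[i - window]
        out.append(running / min(i + 1, window))
    return out


def linear_slope(values):
    """Least-squares slope of values against their step index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


if __name__ == "__main__":
    # Hypothetical reward values; replace with the ones from your own run.
    rewards = [0.1, 0.4, 0.2, 0.5, 0.3, 0.6, 0.4, 0.7]
    smoothed = moving_average(rewards, window=4)
    print("smoothed:", smoothed)
    print("slope per step:", linear_slope(smoothed))
```

A slope close to zero on the smoothed values would suggest the reward really is flat rather than just noisy, while a clearly positive slope would mean the raw curve is hiding some improvement.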