Format Reward Function in GRPO Training Doesn't Stabilise

I have been experimenting with GRPO training using a format reward plus a custom accuracy reward function. I have run a few small experiments in a compute-limited environment (on my local Mac) to get the setup right before I start a larger training run.

I am using Qwen/Qwen2.5-1.5B-Instruct as the base model, training a LoRA adapter. I am running 4 generations per prompt with 2 gradient accumulation steps.
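For concreteness, here is a minimal sketch of this kind of setup with TRL's GRPOTrainer. The reward functions, dataset, and hyperparameter values below are placeholders rather than my exact code, and it assumes completions come back as plain strings:

```python
# Simplified sketch: GRPO with a format reward + accuracy reward on a LoRA adapter.
# Reward functions and dataset are placeholders, not my actual implementation.
import re

from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer


def format_reward(completions, **kwargs):
    # 1.0 if the completion matches the desired <think>...</think><answer>...</answer>
    # structure, 0.0 otherwise (placeholder pattern).
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in completions]


def accuracy_reward(completions, **kwargs):
    # Placeholder for the custom accuracy reward.
    return [0.0 for _ in completions]


training_args = GRPOConfig(
    output_dir="qwen2.5-1.5b-grpo-lora",
    num_generations=4,              # 4 completions per prompt
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    beta=0.04,                      # KL penalty coefficient
    max_completion_length=256,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=[format_reward, accuracy_reward],
    args=training_args,
    train_dataset=load_dataset("my_dataset", split="train"),  # placeholder dataset name
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```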

I am aware that this is not the ideal base model or hyperparameter setup, and I am planning to scale both up for a full run, but I just wanted to check that everything was working. For the most part it went better than I expected: the format reward quickly shot up to near 1.0, and then the model started optimising the accuracy reward.

One thing I noticed is that the format reward shot up quickly but then plateaued at an average score of around 0.96. For comparison, if I run supervised fine-tuning on examples in the desired format, the model picks it up to essentially 100% accuracy.

My current theory is that once the format reward gets close to perfect, all the generations in a batch receive the same (correct) format reward, so there is no advantage signal in the GRPO loss for that component. The KL divergence term, however, still provides a signal that drags the model's distribution back towards the base model's distribution, which did not naturally use the desired format. I end up in a bit of a yo-yo situation: the model unlearns the format, then quickly relearns it the moment it starts getting it wrong and an advantage signal appears again.
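To make the zero-advantage point concrete, here is a toy illustration of the group-relative advantage as I understand it (rewards standardised within each group of generations; not the exact library implementation):

```python
# Toy sketch of GRPO's group-relative advantage with 4 generations per prompt.
import numpy as np


def group_advantages(rewards, eps=1e-4):
    # Advantage = (reward - group mean) / group std.
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Mixed format success: a clear advantage signal for the well-formatted completions.
print(group_advantages([1.0, 1.0, 1.0, 0.0]))  # roughly [ 0.58,  0.58,  0.58, -1.73]

# All four completions formatted correctly: advantages collapse to zero, so the only
# remaining gradient comes from the KL penalty pulling back towards the base model.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # [0., 0., 0., 0.]
```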

I just wanted to know if:

  1. Has anyone else noticed this sort of behaviour?
  2. Does my theory on what is happening sound sensible?