Format Reward Function in GRPO Training Doesn't Stabilise

I have been experimenting with GRPO training using a format reward plus a custom accuracy reward function. I have run a few small experiments in a compute-limited environment (on my local Mac) to get the setup right before I start a larger training run.

I am using Qwen/Qwen2.5-1.5B-Instruct as the base model, training a LoRA adapter. I am running 4 generations per prompt with 2 gradient accumulation steps.
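For concreteness, here is a minimal sketch of this kind of setup with TRL's GRPOTrainer. The reward functions, dataset, and hyperparameter values below are placeholders rather than my exact code, and it assumes completions come back as plain strings:

```python
# Simplified sketch: GRPO with a format reward + accuracy reward on a LoRA adapter.
# Reward functions and dataset are placeholders, not my actual implementation.
import re

from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer


def format_reward(completions, **kwargs):
    # 1.0 if the completion matches the desired <think>...</think><answer>...</answer>
    # structure, 0.0 otherwise (placeholder pattern).
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in completions]


def accuracy_reward(completions, **kwargs):
    # Placeholder for the custom accuracy reward.
    return [0.0 for _ in completions]


training_args = GRPOConfig(
    output_dir="qwen2.5-1.5b-grpo-lora",
    num_generations=4,              # 4 completions per prompt
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    beta=0.04,                      # KL penalty coefficient
    max_completion_length=256,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=[format_reward, accuracy_reward],
    args=training_args,
    train_dataset=load_dataset("my_dataset", split="train"),  # placeholder dataset name
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```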

I am aware that this is not the ideal base model or hyperparameter setup, and I am planning to scale both up for a full run, but I just wanted to check that everything was working. For the most part it went better than I expected: the format reward quickly shot up to near 1.0, and then the model started optimising the accuracy reward.

One thing I noticed is that the format reward shot up quickly but then plateaued at an average score of around 0.96. For comparison, if I run supervised fine-tuning on examples in the desired format, the model picks it up to essentially 100% accuracy.

My current theory is that once the format reward gets close to perfect, all the generations in a batch receive the same (correct) format reward, so there is no advantage signal in the GRPO loss for that component. The KL divergence term, however, still provides a signal that drags the model's distribution back towards the base model's distribution, which did not naturally use the desired format. I end up in a bit of a yo-yo situation: the model unlearns the format, then quickly relearns it the moment it starts getting it wrong and an advantage signal appears again.
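To make the zero-advantage point concrete, here is a toy illustration of the group-relative advantage as I understand it (rewards standardised within each group of generations; not the exact library implementation):

```python
# Toy sketch of GRPO's group-relative advantage with 4 generations per prompt.
import numpy as np


def group_advantages(rewards, eps=1e-4):
    # Advantage = (reward - group mean) / group std.
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Mixed format success: a clear advantage signal for the well-formatted completions.
print(group_advantages([1.0, 1.0, 1.0, 0.0]))  # roughly [ 0.58,  0.58,  0.58, -1.73]

# All four completions formatted correctly: advantages collapse to zero, so the only
# remaining gradient comes from the KL penalty pulling back towards the base model.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # [0., 0., 0., 0.]
```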

I just wanted to know if:

  1. Has anyone else noticed this sort of behaviour?
  2. Does my theory on what is happening sound sensible?