Offering a Technical Deep Dive on GRPO/DAPO/Dr. GRPO Algorithms

Hello Hugging Face community!

I recently spent significant time analyzing the GRPO family of RL algorithms (GRPO, DAPO, and Dr. GRPO) while working on a documentation PR for the TRL library (PR #3395). During this process, I uncovered several discrepancies between the original papers and the documentation, particularly around when to use KL divergence and how normalization affects optimization.

I’ve drafted a comprehensive technical blog post that:

  1. Clarifies the theoretical foundations of all three algorithms
  2. Explains key implementation details in the TRL library
  3. Provides practical guidance on choosing the right configuration for different use cases
  4. Highlights when KL divergence should be included or excluded (and why)

Content Preview

The post covers:

  • GRPO: Original formulation, group-relative advantage estimation, and multi-iteration updates (see the sketch after this list)
  • DAPO: Token-level normalization, Clip-Higher strategy, and Dynamic Sampling
  • Dr. GRPO: Removing optimization biases and the question-level difficulty bias
  • Implementation: Code examples showing how the algorithms are actually implemented in TRL
  • Configuration Tips: Practical guidance for tuning hyperparameters
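
To make the group-relative advantage estimation concrete, here is a minimal sketch of how GRPO and Dr. GRPO differ on this point. It follows the formulas in the papers rather than TRL's actual source; the reward values and the small epsilon constant are illustrative assumptions.

```python
import torch

# Illustrative sketch (not TRL source code): rewards for G completions
# sampled from the same prompt; 0/1 values mimic a rule-based reward.
rewards = torch.tensor([1.0, 0.0, 1.0, 1.0])

# GRPO: advantage is the reward normalized by the group's mean and
# standard deviation (the 1e-4 is an assumed stability constant).
grpo_adv = (rewards - rewards.mean()) / (rewards.std() + 1e-4)

# Dr. GRPO: drops the std division (one of the biases it removes),
# keeping only the mean-centered reward as the advantage.
dr_grpo_adv = rewards - rewards.mean()

print(grpo_adv)     # scale depends on the group's reward spread
print(dr_grpo_adv)  # scale is independent of the group's reward spread
```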

Why This Matters

Many practitioners are unsure when to use which algorithm variant and how implementation details affect performance. For instance, the decision to include or exclude the KL divergence term can dramatically change results with rule-based rewards, but this isn’t clearly documented.
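
In TRL this choice comes down to a single coefficient. Here is a minimal sketch assuming the `beta` argument of `GRPOConfig` in a recent TRL release (verify the argument names against your installed version):

```python
from trl import GRPOConfig

# Keep the KL penalty: the policy is regularized toward the reference model.
config_with_kl = GRPOConfig(
    output_dir="grpo-with-kl",
    beta=0.04,  # KL coefficient, as in the original GRPO setup
)

# Drop the KL penalty entirely, a common choice with rule-based rewards
# where drifting away from the reference model is acceptable.
config_without_kl = GRPOConfig(
    output_dir="grpo-no-kl",
    beta=0.0,
)
```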


Publishing Question

I’d love to share this content with the HF community but don’t currently have a premium membership. Are there opportunities for guest blog posts on the HF platform for technical deep dives like this? Alternatively, would anyone with blogging access be interested in collaborating to publish this content?

My goal is simply to help the community better understand these algorithms and their implementation details, and to save others the time I spent piecing this information together.


Here’s the full draft in Markdown (GitHub Gist):
👉 .md · GitHub

Thank you for any insights!
Jen


I think it would be very useful for the community, but I don’t have enough knowledge about the algorithms themselves…

Is there a mechanism that could be used in this case? For Spaces, the Community Grant program is similar, but… @not-lain @meganariley @lunarflu

Thank you, John, for the support and for tagging people who might be able to help!
