Offering a Technical Deep Dive on GRPO/DAPO/Dr. GRPO Algorithms

Hello Hugging Face community!

I recently spent significant time analyzing the GRPO family of RL algorithms (GRPO, DAPO, and Dr. GRPO) while working on a documentation PR for the TRL library (PR #3395). During this process, I uncovered several discrepancies between the original papers and the documentation, particularly around when to use KL divergence and how normalization affects optimization.

I’ve drafted a comprehensive technical blog post that:

  1. Clarifies the theoretical foundations of all three algorithms
  2. Explains key implementation details in the TRL library
  3. Provides practical guidance on choosing the right configuration for different use cases
  4. Highlights when KL divergence should be included or excluded (and why)

Content Preview

The post covers:

  • GRPO: Original formulation, group-relative advantage estimation, and multi-iteration updates (see the sketch after this list)
  • DAPO: Token-level normalization, Clip-Higher strategy, and Dynamic Sampling
  • Dr. GRPO: Removing optimization biases and the question-level difficulty bias
  • Implementation: Code examples showing how the algorithms are actually implemented in TRL
  • Configuration Tips: Practical guidance for tuning hyperparameters
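
To make the group-relative advantage estimation concrete, here is a minimal sketch of how GRPO and Dr. GRPO differ on this point. It follows the formulas in the papers rather than TRL's actual source; the reward values and the small epsilon constant are illustrative assumptions.

```python
import torch

# Illustrative sketch (not TRL source code): rewards for G completions
# sampled from the same prompt; 0/1 values mimic a rule-based reward.
rewards = torch.tensor([1.0, 0.0, 1.0, 1.0])

# GRPO: advantage is the reward normalized by the group's mean and
# standard deviation (the 1e-4 is an assumed stability constant).
grpo_adv = (rewards - rewards.mean()) / (rewards.std() + 1e-4)

# Dr. GRPO: drops the std division (one of the biases it removes),
# keeping only the mean-centered reward as the advantage.
dr_grpo_adv = rewards - rewards.mean()

print(grpo_adv)     # scale depends on the group's reward spread
print(dr_grpo_adv)  # scale is independent of the group's reward spread
```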

Why This Matters

Many practitioners are unsure when to use which algorithm variant and how implementation details affect performance. For instance, the decision to include or exclude the KL divergence term can dramatically change results with rule-based rewards, but this isn’t clearly documented.
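
In TRL this choice comes down to a single coefficient. Here is a minimal sketch assuming the `beta` argument of `GRPOConfig` in a recent TRL release (verify the argument names against your installed version):

```python
from trl import GRPOConfig

# Keep the KL penalty: the policy is regularized toward the reference model.
config_with_kl = GRPOConfig(
    output_dir="grpo-with-kl",
    beta=0.04,  # KL coefficient, as in the original GRPO setup
)

# Drop the KL penalty entirely, a common choice with rule-based rewards
# where drifting away from the reference model is acceptable.
config_without_kl = GRPOConfig(
    output_dir="grpo-no-kl",
    beta=0.0,
)
```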


Publishing Question

I’d love to share this content with the HF community but don’t currently have a premium membership. Are there opportunities for guest blog posts on the HF platform for technical deep dives like this? Alternatively, would anyone with blogging access be interested in collaborating to publish this content?

My goal is simply to help the community better understand these algorithms and their implementation details, and to save others the time I spent piecing this information together.


Here’s the full draft in Markdown (GitHub Gist):
👉 .md · GitHub

Thank you for any insights!
Jen


I think it would be very useful for the community, but I don’t have enough knowledge about the algorithms themselves…

Is there a mechanism that could be used in this case? For Spaces, the Community Grant program is similar, but… @not-lain @meganariley @lunarflu

Thank you, John, for the support and for tagging people who might be able to help!
