Hi there! After taking a look at your logs and the TRL library’s GRPO implementation, I think I can help explain what’s happening with your “loss always zero” issue:
Your model IS being trained, and here’s why:
- **Math checks out:** I calculated 0.0005513303462066687 (your final KL) * 0.04 (default beta) = 2.205321384826675e-05, which almost exactly matches your reported final `train_loss: 2.1668560384568992e-05`. This confirms the loss is being computed correctly as `beta * KL`.
- **Non-zero gradients:** Your `grad_norm` values are consistently non-zero (roughly 0.06-0.12), which shows your model parameters are being updated throughout training.
- **Working as designed:** With `num_iterations=1` (the default) in `GRPOTrainer`, the importance ratio is exactly 1 and the group-normalized advantages average to zero, so the only numerical contribution to the loss value is the KL term (the gradients themselves are still non-zero). This matches exactly what qgallouedec explained in the GitHub thread John6666 referenced. Also, FYI, the loss type defaults to `bnpo` in the trainer file (see the implementation). A short sketch of this reduction follows the list.
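To make the "loss value reduces to `beta * KL`" point concrete, here is a tiny self-contained sketch in plain PyTorch. This is *not* the actual TRL code: the group size, tensor names, and toy KL value are made up for illustration, and I'm using the unclipped form of the objective.

```python
import torch

# Toy illustration only - not the actual TRL implementation.
beta = 0.04            # default KL coefficient discussed above
group_size = 8
rewards = torch.randn(group_size)

# Group-normalized advantages have (near-)zero mean by construction.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)

# With num_iterations=1 the "old" policy equals the current policy,
# so the importance ratio is exactly 1 in value (gradients still flow
# through the log-probs in the real trainer).
ratio = torch.ones(group_size)

# Stand-in per-sample KL, roughly your logged final value.
per_token_kl = torch.full((group_size,), 0.00055)

# Unclipped GRPO-style loss: the advantage term averages to ~0,
# leaving only beta * KL as the reported loss value.
loss = (-(ratio * advantages) + beta * per_token_kl).mean()
print(loss.item())                   # ~2.2e-05
print(0.0005513303462066687 * 0.04)  # 2.205321384826675e-05
```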
Some observations about the HF implementation:
- **Logging inconsistency:** I'm not sure why your intermediate logs show `loss: 0.0` while the final loss is non-zero. It could be a display/precision issue in the TRL library's logging.
- **Reference model behavior:** According to the DeepSeek paper (Algorithm 1, page 14), the reference model should be periodically updated to match the current policy model, which would reset the KL to zero. Your logs show the KL steadily increasing from ~0.0004 to ~0.0006 without any resets, which suggests the HF implementation might differ from the paper in this respect (see the config sketch after this list).
- **Default settings:** HF's TRL library uses `num_iterations=1` by default, which simplifies the GRPO objective function considerably.
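On the reference-model point: if you want behavior closer to the paper's periodic reset, my understanding is that recent TRL versions expose reference-model syncing options on the config. Please treat the exact field names below as an assumption and check the `GRPOConfig` documentation for your installed version before relying on them.

```python
from trl import GRPOConfig

# Assumption: these reference-model sync options exist in your TRL version;
# verify against the GRPOConfig documentation before using them.
config = GRPOConfig(
    output_dir="grpo-ref-sync",   # placeholder output directory
    sync_ref_model=True,          # periodically update the reference model
    ref_model_sync_steps=512,     # how often (in steps) to sync
    ref_model_mixup_alpha=0.6,    # how strongly to mix the policy into the reference
)
```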
My assessment:
Is your model being trained? Yes, definitely.
Is the training optimal? Probably not - there appear to be some differences between the TRL implementation and the full iterative approach described in the DeepSeek paper.
Suggestions:
- Try setting `num_iterations=4`, or some other value greater than 1, in your `GRPOConfig` to see if it improves training dynamics (a minimal sketch follows this list).
- Check out qgallouedec's insights in the GitHub thread - the mathematical explanation of why the loss starts at zero and increases during training is spot on.
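For concreteness, here is a minimal sketch of the first suggestion. The values are illustrative only and `output_dir` is a placeholder, not a recommendation.

```python
from trl import GRPOConfig

# Illustrative values only; output_dir is a placeholder.
training_args = GRPOConfig(
    output_dir="grpo-num-iterations-4",
    num_iterations=4,   # > 1, so the clipped-ratio term contributes to the loss value
    beta=0.04,          # KL coefficient (the default mentioned above)
)
```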
Disclaimer: I am a fellow community member who has been studying the papers and the TRL implementation closely, not the author of the library. My intention is to help; if I've misunderstood anything, I welcome corrections from those more familiar with the codebase!
Hope this helps clarify things!