Hi there! After taking a look at your logs and the TRL library’s GRPO implementation, I think I can help explain what’s happening with your “loss always zero” issue:
Your model IS being trained, and here’s why:
- **Math checks out:** I calculated 0.0005513303462066687 (your final KL) * 0.04 (default beta) = 2.205321384826675e-05, which almost exactly matches your reported final `train_loss: 2.1668560384568992e-05`. This confirms the loss is being computed correctly as `beta * KL`.
- **Non-zero gradients:** Your `grad_norm` values are consistently non-zero (roughly 0.06-0.12), which shows your model parameters are being updated throughout training.
- **Working as designed:** With `num_iterations=1` (the default) in `GRPOTrainer`, the importance ratio is exactly 1 and the group-normalized advantages average to zero, so the only numerical contribution to the loss value is the KL term (the gradients themselves are still non-zero). This matches exactly what qgallouedec explained in the GitHub thread John6666 referenced. Also, FYI, the loss type defaults to `bnpo` in the trainer file (see the implementation). A short sketch of this reduction follows the list.
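To make the "loss value reduces to `beta * KL`" point concrete, here is a tiny self-contained sketch in plain PyTorch. This is *not* the actual TRL code: the group size, tensor names, and toy KL value are made up for illustration, and I'm using the unclipped form of the objective.

```python
import torch

# Toy illustration only - not the actual TRL implementation.
beta = 0.04            # default KL coefficient discussed above
group_size = 8
rewards = torch.randn(group_size)

# Group-normalized advantages have (near-)zero mean by construction.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)

# With num_iterations=1 the "old" policy equals the current policy,
# so the importance ratio is exactly 1 in value (gradients still flow
# through the log-probs in the real trainer).
ratio = torch.ones(group_size)

# Stand-in per-sample KL, roughly your logged final value.
per_token_kl = torch.full((group_size,), 0.00055)

# Unclipped GRPO-style loss: the advantage term averages to ~0,
# leaving only beta * KL as the reported loss value.
loss = (-(ratio * advantages) + beta * per_token_kl).mean()
print(loss.item())                   # ~2.2e-05
print(0.0005513303462066687 * 0.04)  # 2.205321384826675e-05
```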
Some observations about the HF implementation:
- **Logging inconsistency:** I'm not sure why your intermediate logs show `loss: 0.0` while the final loss is non-zero. It could be a display/precision issue in the TRL library's logging.
- **Reference model behavior:** According to the DeepSeek paper (Algorithm 1, page 14), the reference model should be periodically updated to match the current policy model, which would reset the KL to zero. Your logs show the KL steadily increasing from ~0.0004 to ~0.0006 without any resets, which suggests the HF implementation might differ from the paper in this respect (see the config sketch after this list).
- **Default settings:** HF's TRL library uses `num_iterations=1` by default, which simplifies the GRPO objective function considerably.
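On the reference-model point: if you want behavior closer to the paper's periodic reset, my understanding is that recent TRL versions expose reference-model syncing options on the config. Please treat the exact field names below as an assumption and check the `GRPOConfig` documentation for your installed version before relying on them.

```python
from trl import GRPOConfig

# Assumption: these reference-model sync options exist in your TRL version;
# verify against the GRPOConfig documentation before using them.
config = GRPOConfig(
    output_dir="grpo-ref-sync",   # placeholder output directory
    sync_ref_model=True,          # periodically update the reference model
    ref_model_sync_steps=512,     # how often (in steps) to sync
    ref_model_mixup_alpha=0.6,    # how strongly to mix the policy into the reference
)
```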
My assessment:
Is your model being trained? Yes, definitely.
Is the training optimal? Probably not - there appear to be some differences between the TRL implementation and the full iterative approach described in the DeepSeek paper.
Suggestions:
- Try setting `num_iterations=4`, or some other value greater than 1, in your `GRPOConfig` to see if it improves training dynamics (a minimal sketch follows this list).
- Check out qgallouedec's insights in the GitHub thread - the mathematical explanation of why the loss starts at zero and increases during training is spot on.
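For concreteness, here is a minimal sketch of the first suggestion. The values are illustrative only and `output_dir` is a placeholder, not a recommendation.

```python
from trl import GRPOConfig

# Illustrative values only; output_dir is a placeholder.
training_args = GRPOConfig(
    output_dir="grpo-num-iterations-4",
    num_iterations=4,   # > 1, so the clipped-ratio term contributes to the loss value
    beta=0.04,          # KL coefficient (the default mentioned above)
)
```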
Disclaimer: I am a fellow community member who has been studying the papers and the TRL implementation closely, not the author of the library. My intention is to help; if I've misunderstood anything, I welcome corrections from those more familiar with the codebase!
Hope this helps clarify things!