I have been looking at the GRPO training example in the trl docs (GRPO Trainer) and am confused about the reward function they define.
In the example they define the reward function to be
# Define the reward function, which rewards completions that are close to 20 characters
def reward_len(completions, **kwargs):
    return [abs(20 - len(completion)) for completion in completions]
To me this suggests that the model is being rewarded for making completions as long as possible, since abs(20 - len(completion)) grows without bound as len(completion) grows. Yet the example claims it is training the model to make completions close to 20 characters long. Shouldn't the reward be -abs(20 - len(completion)), so that deviating from 20 characters is penalized rather than rewarded?
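To make my confusion concrete, here is a quick check of the function exactly as written in the docs, evaluated on two dummy completions (one 20 characters, one 100 characters):

```python
# reward_len exactly as quoted from the docs example
def reward_len(completions, **kwargs):
    return [abs(20 - len(completion)) for completion in completions]

# A 20-character completion gets reward 0, while a 100-character
# completion gets reward 80 -- the longer one scores higher.
print(reward_len(["a" * 20, "a" * 100]))  # [0, 80]
```

So if higher reward is better, this seems to push the policy toward ever-longer completions, not toward 20 characters.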