I have been looking at the GRPO training example in the trl docs (GRPO Trainer) and am confused about the reward function they define.
In the example they define the reward function to be
# Define the reward function, which rewards completions that are close to 20 characters
def reward_len(completions, **kwargs):
    return [abs(20 - len(completion)) for completion in completions]
To me this suggests that the model is being rewarded for making completions as long as possible, since abs(20 - len(completion)) grows without bound as len(completion) grows. Yet the example claims it is training the model to make completions close to 20 characters long. Shouldn't the reward be -abs(20 - len(completion)), so that deviating from 20 characters is penalized rather than rewarded?
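To make my confusion concrete, here is a quick check of the function exactly as written in the docs, evaluated on two dummy completions (one 20 characters, one 100 characters):

```python
# reward_len exactly as quoted from the docs example
def reward_len(completions, **kwargs):
    return [abs(20 - len(completion)) for completion in completions]

# A 20-character completion gets reward 0, while a 100-character
# completion gets reward 80 -- the longer one scores higher.
print(reward_len(["a" * 20, "a" * 100]))  # [0, 80]
```

So if higher reward is better, this seems to push the policy toward ever-longer completions, not toward 20 characters.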