Reward Hacking Solutions

entfane · August 9, 2025, 9:27pm

In PPO, we mitigate reward hacking by penalizing the policy's divergence from a reference policy. However, reward hacking still persists, and it depends on more than just reward model complexity and the divergence penalty. How do we handle those cases, and how do we identify this behavior?
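For context, here is a minimal sketch of the divergence penalty I mean, assuming the common RLHF setup where a scaled per-token log-ratio between the policy and a frozen reference model is subtracted from the reward, and the reward model score is added on the final token. The function name, argument names, and the default `beta` are hypothetical and only for illustration; this is not the API of any particular library.

```python
from typing import List


def kl_penalized_rewards(
    policy_logprobs: List[float],
    ref_logprobs: List[float],
    rm_score: float,
    beta: float = 0.1,  # hypothetical default penalty coefficient
) -> List[float]:
    """Return per-token rewards for one sampled response.

    Each token is penalized by beta * (log pi(token) - log pi_ref(token)),
    a per-token estimate of the KL divergence from the reference policy.
    The scalar reward model score is added to the last token only.
    """
    if not policy_logprobs or len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must be non-empty and of equal length")

    # Negative scaled log-ratio: drifting away from the reference costs reward.
    rewards = [
        -beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)
    ]
    # The reward model scores the whole response, so credit it at the end.
    rewards[-1] += rm_score
    return rewards
```

With a larger `beta` the policy stays closer to the reference model, but as noted above, this alone does not rule out reward hacking.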
