Reward Hacking Solutions

In PPO we mitigate reward hacking by adding a policy divergence penalty. However, reward hacking can still persist, and it does not depend only on the reward model's complexity and the strength of the divergence penalty. How do we handle those cases, and how do we identify this behavior in the first place?
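
For context, here is a minimal sketch of the kind of divergence penalty I mean, assuming the common setup where the penalty is a per-token log-probability ratio against a frozen reference policy and the reward model score is added on the last token. The function name `kl_penalized_rewards` and the coefficient `beta` are my own illustrative names, not taken from any particular library.

```python
def kl_penalized_rewards(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-token rewards for one sampled response (illustrative sketch).

    rm_score:    scalar score from the reward model for the whole response
    logp_policy: log-probs of the sampled tokens under the current policy
    logp_ref:    log-probs of the same tokens under the frozen reference policy
    beta:        penalty coefficient (hypothetical default)
    """
    if len(logp_policy) != len(logp_ref):
        raise ValueError("log-prob sequences must have the same length")

    # Per-token KL estimate: log pi(a|s) - log pi_ref(a|s), scaled by -beta.
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]

    # The reward model score is credited on the final token only.
    if rewards:
        rewards[-1] += rm_score
    return rewards


if __name__ == "__main__":
    print(kl_penalized_rewards(1.0, [-1.0, -2.0], [-1.5, -2.0], beta=0.1))
```

A larger `beta` keeps the policy closer to the reference, but as noted above, this alone does not rule out the policy exploiting flaws in the reward model.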
