In PPO we mitigate reward hacking by adding a policy-divergence (KL) penalty against a reference policy. Even so, reward hacking still occurs, and it does not depend only on reward-model quality and the divergence penalty. How do we handle those cases, and how do we identify this behavior during training?
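For context, here is a minimal sketch of the KL-shaped reward typically used in RLHF-style PPO, plus one crude signal people watch for reward hacking (reward-model score climbing while KL keeps growing). The function name, shapes, coefficient, and thresholds below are illustrative assumptions, not TRL's actual API:

```python
import torch

def kl_shaped_rewards(rm_reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token reward with a KL penalty toward the reference policy.

    rm_reward:        scalar reward-model score for the full response
    policy_logprobs:  log-probs of sampled tokens under the current policy, shape (T,)
    ref_logprobs:     log-probs of the same tokens under the frozen reference policy, shape (T,)
    beta:             KL penalty coefficient (illustrative default)
    """
    # Token-level KL estimate: log pi(a|s) - log pi_ref(a|s)
    kl = policy_logprobs - ref_logprobs
    # Penalise divergence at every token...
    rewards = -beta * kl
    # ...and add the reward-model score on the final token.
    rewards[-1] += rm_reward
    return rewards, kl.sum()

# Toy usage: reward-model score is high while the policy drifts from the reference.
policy_lp = torch.tensor([-1.2, -0.8, -0.3])
ref_lp = torch.tensor([-1.0, -1.1, -1.5])
rewards, total_kl = kl_shaped_rewards(rm_reward=2.0,
                                      policy_logprobs=policy_lp,
                                      ref_logprobs=ref_lp)

# A crude reward-hacking signal: reward keeps rising while KL also keeps rising.
if total_kl > 1.0 and rewards[-1] > 1.5:  # thresholds are purely illustrative
    print(f"possible reward hacking: reward={rewards[-1]:.2f}, KL={total_kl:.2f}")
```

In practice this kind of monitoring (reward vs. KL over training steps) is only a heuristic; it flags divergence, not hacking per se, which is part of what my question is about.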