Negative KL-divergence RLHF implementation

maxime · September 2, 2023, 5:25pm

I am struggling to understand one part of the FAQ of the Transformer Reinforcement Learning (TRL) library from HuggingFace:

What Is the Concern with Negative KL Divergence?

If you generate text by purely sampling from the model distribution, things work fine in general. But when you use the generate method, there are a few caveats because, depending on the settings, it does not always purely sample, which can cause the KL-divergence to go negative. Essentially, when the active model achieves log_p_token_active < log_p_token_ref, we get negative KL-div. This can happen in several cases:

top-k sampling: the model can smooth out the probability distribution, causing the top-k tokens to have a smaller probability than those of the reference model, but they are still selected

min_length: this ignores the EOS token until min_length is reached. Thus the model can assign a very high log prob to the EOS token and a very low prob to all others until min_length is reached

batched generation: finished sequences in a batch are padded until all generations are finished. The model can learn to assign very low probabilities to the padding tokens unless they are properly masked or removed.
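The top-k case can be made concrete with a toy example. This is a minimal sketch, not TRL's code: the four-token vocabulary, the two distributions, and the helper names `true_kl` and `top_k_kl_estimate` are all assumptions chosen for illustration. It compares the exact KL over the full vocabulary with the expected value of log_p_token_active - log_p_token_ref when tokens are drawn only from the top-k tokens of the active model.

```python
import math

def true_kl(p, q):
    """Exact KL(p || q) summed over the full vocabulary; never negative."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def top_k_kl_estimate(p_active, p_ref, k):
    """Expected log p_active(x) - log p_ref(x) when x is drawn from the
    top-k truncated, renormalized active distribution (hypothetical helper)."""
    # Indices of the k most likely tokens under the active model.
    top = sorted(range(len(p_active)), key=lambda i: p_active[i], reverse=True)[:k]
    mass = sum(p_active[i] for i in top)
    # Sampling weights are renormalized, but the log-probs are the untruncated ones.
    return sum(
        (p_active[i] / mass) * (math.log(p_active[i]) - math.log(p_ref[i]))
        for i in top
    )

# Illustrative distributions: the active model is smoother than the reference.
p_ref = [0.6, 0.3, 0.05, 0.05]
p_active = [0.4, 0.3, 0.15, 0.15]

exact = true_kl(p_active, p_ref)
estimate = top_k_kl_estimate(p_active, p_ref, k=2)
print(f"exact KL:       {exact:.4f}")
print(f"top-2 estimate: {estimate:.4f}")
```

With these numbers the exact KL is positive while the top-2 estimate is negative: the tokens that would contribute positive terms (those where the active model is more confident than the reference) are never sampled, so only the negative terms remain.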

These are just a few examples. Why is negative KL an issue? The total reward R is computed as R = r - beta * KL, so if the model can learn how to drive the KL-divergence negative, it effectively gets a positive reward. In many cases it can be much easier to exploit such a bug in the generation than to actually learn the reward function. In addition, the KL can become arbitrarily small (that is, arbitrarily negative), so the actual reward can be very small compared to it.
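The effect on the reward can be checked with simple arithmetic. The sketch below is a scalar simplification of R = r - beta * KL; the function name `total_reward` and the numbers are illustrative assumptions, and a real implementation may apply the penalty per token rather than once per sequence.

```python
def total_reward(r, kl, beta):
    """Scalar form of R = r - beta * KL (hypothetical helper for illustration)."""
    return r - beta * kl

# Same task reward r, opposite signs of the KL estimate.
with_positive_kl = total_reward(r=1.0, kl=5.0, beta=0.2)
with_negative_kl = total_reward(r=1.0, kl=-5.0, beta=0.2)

print(with_positive_kl)
print(with_negative_kl)
```

A positive KL estimate reduces the total reward below r, as intended. A negative estimate of the same magnitude raises it above r, so the penalty term turns into a bonus that the policy can grow without improving r at all.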

I understand why the KL-divergence computed here is an approximation that, unlike the true KL-divergence, can be negative. However, I cannot wrap my head around the details of why these specific sampling parameters would lead to negative KL-divergence. Could someone elaborate on these points?

baek26 · May 13, 2024, 7:03am

What do you mean by “these specific sampling parameters”?
min_length?
