Negative KL-divergence RLHF implementation

I am struggling to understand one part of the FAQ of the transformer reinforcement learning library from HuggingFace:

What Is the Concern with Negative KL Divergence?

If you generate text by purely sampling from the model distribution things work fine in general. But when you use the generate method there are a few caveats because it does not always purely sample depending on the settings which can cause KL-divergence to go negative. Essentially when the active model achieves log_p_token_active < log_p_token_ref we get negative KL-div. This can happen in a several cases:

  • top-k sampling: the model can smooth out the probability distribution causing the top-k tokens having a smaller probability than those of the reference model but they still are selected
  • min_length: this ignores the EOS token until min_length is reached. thus the model can assign a very high log prob to the EOS token and very low prob to all others until min_length is reached
  • batched generation: finished sequences in a batch are padded until all generations are finished. The model can learn to assign very low probabilities to the padding tokens unless they are properly masked or removed.

These are just a few examples. Why is negative KL an issue? The total reward R is computed R = r - beta * KL so if the model can learn how to drive KL-divergence negative it effectively gets a positive reward. In many cases it can be much easier to exploit such a bug in the generation than actually learning the reward function. In addition the KL can become arbitrarily small thus the actual reward can be very small compared to it.

I understand why the KL-divergence that is computed here is an approximation that can be negative as opposed to the real one. However, I cannot wrap my head around the details of why these specific sampling parameters would lead to negative KL-divergence. Could someone elaborate on these points?