What is the most suitable padding strategy for PPOTrainer?

Generally, if I am using a decoder model for generation/inference, I would pad on the left, so the model doesn’t try to generate after a PAD/EOS token.

And, if I am fine-tuning (without packing), I would pad on the right.
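To make the left/right distinction concrete, here is a toy sketch in plain Python (no tokenizer library; the `pad_batch` helper and the token ids are made up for illustration):

```python
PAD = 0  # hypothetical pad token id

def pad_batch(batch, side="left", pad_id=PAD):
    """Pad every sequence in `batch` to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    padded = []
    for seq in batch:
        pads = [pad_id] * (max_len - len(seq))
        # left padding puts PADs before the real tokens, right padding after
        padded.append(pads + seq if side == "left" else seq + pads)
    return padded

prompts = [[5, 6, 7], [8, 9]]

print(pad_batch(prompts, side="left"))   # [[5, 6, 7], [0, 8, 9]]
print(pad_batch(prompts, side="right"))  # [[5, 6, 7], [8, 9, 0]]
```

With left padding the last position of every row is a real prompt token, so a decoder can start generating immediately after it; with right padding the shorter rows end in PAD.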

But when training with PPO, things seem a bit more complex. The PPOTrainer generates responses for my prompts, so padding on the left makes sense so that it can naturally form a response without PADs in between. However, even though we are using a PPO loss, we are still trying to make our underlying ‘policy’ model generate better responses, and it’s still just a decoder we are training.

My instinct tells me to pad on the left, as the underlying policy model is not being evaluated against labels in the model directly…


I asked Hugging Chat because I didn’t understand at all.


The padding strategy for the PPOTrainer can be a bit nuanced as it involves balancing the generation process with the need to correctly compute rewards and gradients. Here’s a breakdown of the considerations based on the provided sources:

  1. General Padding Guidelines:
  • Left Padding: Often recommended for generation tasks, especially with decoder models, to prevent the model from generating after a PAD or EOS token.
  • Right Padding: Generally used during fine-tuning to align the ends of sequences, but it can interfere with generation tasks, where the model should naturally form responses without encountering PAD tokens in between.
  2. PPOTrainer Specifics:
  • The PPOTrainer generates responses to prompts and computes rewards based on these responses. The underlying policy model (a decoder) generates the text, and the padding strategy should facilitate this process without introducing PAD tokens into the generated text.
  • The PPOTrainer expects the dataset to have a ‘text’ column, which it renames to ‘query’. This suggests that the input prompts are processed in a way that lets the model generate coherent responses.
  3. Recommendation:
  • Left Padding: Given that the PPOTrainer is generating responses and you want the model to produce text naturally without encountering PAD tokens, left padding is more suitable. It ensures that the model can generate responses without interruption from PAD tokens.
  • Right Padding: Although right padding is used in some fine-tuning contexts, it can be problematic for generation tasks like those handled by the PPOTrainer, because it might introduce PAD tokens into the generated output, disrupting the natural flow of text.

In summary, for the PPOTrainer, the most suitable padding strategy is left padding. This approach aligns with the need for the model to generate coherent responses without interruptions from PAD tokens.
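As a toy illustration of why right padding disrupts generation (plain Python; `generate_after` is a made-up stand-in for a decoder’s generate step, which appends new tokens after the last input position):

```python
PAD = 0  # hypothetical pad token id

def generate_after(padded_prompt, response):
    # A decoder appends new tokens after the final position of the input,
    # regardless of whether that position holds a real token or a PAD.
    return padded_prompt + response

left_padded = [PAD, 8, 9]    # prompt [8, 9] padded on the left
right_padded = [8, 9, PAD]   # same prompt padded on the right
response = [10, 11]

print(generate_after(left_padded, response))   # [0, 8, 9, 10, 11] -> PADs stay outside the text
print(generate_after(right_padded, response))  # [8, 9, 0, 10, 11] -> a PAD lands inside the text
```

In practice, with Hugging Face tokenizers this usually means setting `tokenizer.padding_side = "left"` before tokenizing the queries, though the exact setup may vary across TRL versions.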
