Generally, if I am using a decoder model for generation/inference I would pad to the left, so the model doesn't try to generate after a PAD/EOS token.
And if I am fine-tuning (without packing) I would pad to the right.
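As a quick illustration, here is a minimal sketch of both conventions with a `transformers` tokenizer (GPT-2 and the prompts are just placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no PAD token by default

# Generation: pad on the left so each prompt ends flush with the
# position where new tokens will be produced.
tokenizer.padding_side = "left"
gen_batch = tokenizer(["Hello", "A much longer prompt"], padding=True, return_tensors="pt")

# Fine-tuning (no packing): pad on the right so labels align with
# the real tokens at the start of each row.
tokenizer.padding_side = "right"
train_batch = tokenizer(["Hello", "A much longer prompt"], padding=True, return_tensors="pt")
```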
But when training with PPO, things seem a bit more complex. The PPOTrainer generates responses for my prompts, so padding to the left makes sense so it can naturally form a response without PADs in between. However, even though we are using a PPO loss, we are still trying to make our underlying "policy" model generate better responses, and it's still just a decoder we are training.
My instinct tells me to pad to the left, as the underlying policy model is not being evaluated using the labels directly…
I asked Hugging Chat because I didn't understand at all.
The padding strategy for the `PPOTrainer` can be a bit nuanced, as it involves balancing the generation process with the need to correctly compute rewards and gradients. Here's a breakdown of the considerations based on the provided sources:
- General Padding Guidelines:
  - Left Padding: Often recommended for generation tasks, especially with decoder models, to prevent the model from generating after a PAD or EOS token [1].
  - Right Padding: Generally used during fine-tuning to align the end of sequences, but this can interfere with generation tasks, where the model should naturally form responses without encountering PAD tokens in between [1].
- PPOTrainer Specifics:
  - The `PPOTrainer` generates responses to prompts and computes rewards based on these responses. The underlying policy model (decoder) generates text, and the padding strategy should facilitate this process without introducing PAD tokens into the generated text [3].
  - The `PPOTrainer` expects the dataset to have a "text" column, which it renames to "query". This suggests that the input prompts are processed in a way that lets the model generate coherent responses [1].
- Recommendation:
  - Left Padding: Given that the `PPOTrainer` is generating responses and you want the model to generate text naturally without encountering PAD tokens, left padding is more suitable. This ensures that the model can generate responses without interruption from PAD tokens (see the sketch after this list) [3].
  - Right Padding: Although right padding is used in some contexts for fine-tuning, it can be problematic for generation tasks like those handled by the `PPOTrainer`, because it might introduce PAD tokens into the generated output, disrupting the natural flow of text.
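To make the recommendation concrete, here is a minimal sketch of batched generation with left padding, assuming a GPT-2-style placeholder model; with right padding, the PAD tokens would instead sit between each prompt and its generated continuation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute your policy model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # the key setting for generation

model = AutoModelForCausalLM.from_pretrained(model_name)

prompts = ["The capital of France is", "Hi"]
batch = tokenizer(prompts, padding=True, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **batch,
        max_new_tokens=10,
        pad_token_id=tokenizer.pad_token_id,
    )

# Because padding is on the left, every generated token follows
# directly after the last real prompt token in each row.
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```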
In summary, for the `PPOTrainer`, the most suitable padding strategy is left padding. This approach aligns with the need for the model to generate coherent responses without interruptions from PAD tokens.