How does padding side affect training?

It seems that most causal LMs nowadays use padding_side="left". I have several questions about this:

  1. If I pad on the right, the real tokens always start at position 0; but if I pad on the left, the real tokens start at an arbitrary position id (unless position ids are recomputed from the attention mask). So isn't right padding more ideal?
  2. If I properly mask my labels as well (with -100, say), does padding_side really matter (modulo the positional-encoding difference)? See the sketch after this list.
  3. Another seemingly tempting reason to use right padding is that we can safely ignore attention_mask when attention is causal – each real token is guaranteed to attend only to real tokens on its left, since all the padding sits to its right. If we properly mask the labels, the loss is still computed correctly. OTOH, with left padding a correct attention_mask is required, otherwise real tokens attend to the pad tokens on their left. Is this a valid point?
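
To make the questions concrete, here is a minimal sketch (assuming the Hugging Face `transformers` library and the `gpt2` tokenizer, both of which are just illustrative choices on my part) that builds the same batch with right and left padding, masks the pad positions in the labels with -100, and prints the resulting tensors so the positional difference and the attention_mask are visible:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # gpt2 has no pad token by default

texts = ["Hello world", "A somewhat longer example sentence"]

for side in ("right", "left"):
    tok.padding_side = side
    batch = tok(texts, return_tensors="pt", padding=True)

    # Labels: copy input_ids and set pad positions to -100 so they are
    # ignored by the cross-entropy loss.
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100

    # With default position ids 0..L-1, right padding puts real tokens at
    # positions starting from 0, while left padding shifts them to start
    # at a nonzero position.
    print(f"--- padding_side={side} ---")
    print("input_ids:\n", batch["input_ids"])
    print("attention_mask:\n", batch["attention_mask"])
    print("labels:\n", labels)
```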

I appreciate any thoughts on these points. Thanks!