How does padding side affect training?

It seems that most causal LMs nowadays use padding_side="left". I have several questions about this:

  1. If I pad on the right, the real tokens always start at position 0; but if I pad on the left, the real tokens start at an arbitrary position id (unless position ids are recomputed from the attention mask). So isn't right padding more ideal?
  2. If I properly mask my labels as well (with -100, say), does padding_side really matter (modulo the positional-encoding difference)? See the sketch after this list.
  3. Another seemingly tempting reason to use right padding is that we can safely ignore attention_mask when attention is causal – each real token is guaranteed to attend only to real tokens on its left, since all the padding sits to its right. If we properly mask the labels, the loss is still computed correctly. OTOH, with left padding a correct attention_mask is required, otherwise real tokens attend to the pad tokens on their left. Is this a valid point?
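
To make the questions concrete, here is a minimal sketch (assuming the Hugging Face `transformers` library and the `gpt2` tokenizer, both of which are just illustrative choices on my part) that builds the same batch with right and left padding, masks the pad positions in the labels with -100, and prints the resulting tensors so the positional difference and the attention_mask are visible:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # gpt2 has no pad token by default

texts = ["Hello world", "A somewhat longer example sentence"]

for side in ("right", "left"):
    tok.padding_side = side
    batch = tok(texts, return_tensors="pt", padding=True)

    # Labels: copy input_ids and set pad positions to -100 so they are
    # ignored by the cross-entropy loss.
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100

    # With default position ids 0..L-1, right padding puts real tokens at
    # positions starting from 0, while left padding shifts them to start
    # at a nonzero position.
    print(f"--- padding_side={side} ---")
    print("input_ids:\n", batch["input_ids"])
    print("attention_mask:\n", batch["attention_mask"])
    print("labels:\n", labels)
```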

I appreciate any thoughts on these points. Thanks!