Confusion over use of -100 pad value for GPT2 Causal Modeling Fine-tuning

In the many sample notebooks and scripts available online for fine-tuning GPT2 on English and non-English text, including the official training scripts here, the pad token IDs in the labels (which are copied from input_ids) are never converted to -100 for causal language modeling fine-tuning.
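For concreteness, this is a simplified sketch of the grouping step I see in the official causal-LM example (paraphrased from memory, so the details may differ from the actual run_clm.py): the labels are just a copy of the input_ids, with no -100 masking anywhere.

```python
# Paraphrased sketch of the Transformers causal-LM example preprocessing
# (e.g. run_clm.py): labels are simply a copy of input_ids, no pad masking.
def group_texts(examples, block_size=1024):
    # Concatenate all tokenized texts and split them into fixed-size blocks.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # Note: labels are an unmodified copy of input_ids -- nothing is set to -100.
    result["labels"] = result["input_ids"].copy()
    return result
```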

Yet setting the pad positions in the labels to -100, so that the forward method ignores them in the loss calculation, is mentioned in the documentation and in some GitHub threads such as here and here:

All labels set to -100 are ignored (masked), the loss is only computed for labels in [0, ..., config.vocab_size]
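As I understand it, this masking works because the loss is a PyTorch cross-entropy with the default ignore_index of -100. A minimal toy illustration of that behaviour (my own sketch, not taken from the docs):

```python
import torch
import torch.nn.functional as F

# Toy example: vocab of size 5, sequence of 3 target tokens.
logits = torch.randn(3, 5)            # (seq_len, vocab_size)
labels = torch.tensor([2, 4, -100])   # last position should be ignored

# F.cross_entropy skips targets equal to ignore_index (default -100),
# so the loss is averaged only over the first two positions.
loss = F.cross_entropy(logits, labels, ignore_index=-100)
print(loss)
```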

To add to my confusion (I was specifically working with aubmindlab/aragpt2-mega, in case that is relevant), when I did convert the pad positions in the labels to -100 using the inverse of the attention_mask, the Trainer threw a CUDA error: device-side assert triggered during causal fine-tuning. Only when I left the pad labels unchanged did training run normally. Roughly, my preprocessing looked like the sketch below.
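This is a simplified version of what I did (the helper name, padding settings, and max length are just for illustration, not my exact code):

```python
import torch

def tokenize_with_masked_labels(batch, tokenizer, max_length=128):
    # Standard tokenization with padding to a fixed length.
    enc = tokenizer(
        batch["text"],
        padding="max_length",
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    labels = enc["input_ids"].clone()
    # Use the inverse of the attention mask to mark pad positions,
    # so that the loss should ignore them.
    labels[enc["attention_mask"] == 0] = -100
    enc["labels"] = labels
    return enc
```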

I’d like to understand what is happening here.

It makes sense that we do not want the pad token to be considered in the cross-entropy loss. Is this error occurring because the specific model I am working with does not use cross-entropy loss? And why do none of the tutorials and scripts convert the labels, as I mentioned above? Should we or should we not change the pad labels to -100?