Confusion over use of -100 pad value for GPT2 Causal Modeling Fine-tuning

In many of the sample scripts and notebooks available online for fine-tuning GPT2 on English and non-English data, including the official training scripts here, the pad token IDs in the labels (copied from input_ids) are not converted to -100 when doing causal language modeling fine-tuning.
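For concreteness, those examples typically prepare the labels with something like the following (a simplified sketch of that pattern; the tokenizer setup and max_length are illustrative, not taken from any specific script):

```python
from transformers import AutoTokenizer

# Illustrative setup; any GPT2-style checkpoint would do.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT2 has no pad token by default

def tokenize_fn(examples):
    # Pad/truncate to a fixed length, as many tutorials do.
    enc = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )
    # Labels are simply a copy of input_ids -- pad positions are NOT set to -100.
    enc["labels"] = enc["input_ids"].copy()
    return enc
```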

Yet setting the pad positions in the labels to -100, so that the forward method ignores them in the loss calculation, is mentioned in the documentation and in some GitHub threads, such as here and here:

All labels set to -100 are ignored (masked), the loss is only computed for labels in [0, ..., config.vocab_size]
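My understanding is that this corresponds to the default ignore_index=-100 of PyTorch's cross-entropy loss, roughly like this (a toy illustration, not code from any of the scripts above):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 50257)                # 4 token positions, GPT2-sized vocab
labels = torch.tensor([10, 42, -100, -100])   # last two positions masked out

# -100 is the default ignore_index, so the masked positions contribute
# nothing to the loss; only the first two positions are averaged.
loss = F.cross_entropy(logits, labels)
```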

To add to my confusion (I was specifically working with the model aubmindlab/aragpt2-mega, in case that is relevant), when I did convert the pad label positions to -100 using the inverse of the attention_mask, the Trainer threw a CUDA error: device-side assert triggered during causal fine-tuning. Only when I left the pad labels unchanged did the Trainer run normally.
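Roughly, the conversion I applied looked like this (a simplified sketch of my preprocessing; the function name is just illustrative):

```python
import torch

def mask_pad_labels(batch):
    # batch["input_ids"] and batch["attention_mask"] are tensors of shape
    # (batch_size, seq_len) produced by the tokenizer with padding enabled.
    labels = batch["input_ids"].clone()
    # attention_mask is 1 on real tokens and 0 on padding, so its inverse
    # picks out exactly the pad positions.
    labels[batch["attention_mask"] == 0] = -100
    batch["labels"] = labels
    return batch
```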

I'd like to understand what is happening here.

It makes sense that we do not want pad tokens to be considered in the cross-entropy loss. Is this error occurring because the specific model I am working with does not use cross-entropy loss? And why do none of the tutorials and scripts convert the labels, as I mentioned above? Should we or should we not change the pad labels to -100?