Confusion over use of -100 pad value for GPT2 Causal Modeling Fine-tuning

In many of the sample scripts and notebooks available online for fine-tuning GPT2 on English and non-English data, including the official training scripts here, the pad token IDs in the labels (copied from input_ids) are not converted to -100 when doing causal language modeling fine-tuning.
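For concreteness, those examples typically prepare the labels with something like the following (a simplified sketch of that pattern; the tokenizer setup and max_length are illustrative, not taken from any specific script):

```python
from transformers import AutoTokenizer

# Illustrative setup; any GPT2-style checkpoint would do.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT2 has no pad token by default

def tokenize_fn(examples):
    # Pad/truncate to a fixed length, as many tutorials do.
    enc = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )
    # Labels are simply a copy of input_ids -- pad positions are NOT set to -100.
    enc["labels"] = enc["input_ids"].copy()
    return enc
```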

Yet setting the pad positions in the labels to -100, so that the forward method ignores them in the loss calculation, is mentioned in the documentation and in some GitHub threads, such as here and here:

All labels set to -100 are ignored (masked), the loss is only computed for labels in [0, ..., config.vocab_size]
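My understanding is that this corresponds to the default ignore_index=-100 of PyTorch's cross-entropy loss, roughly like this (a toy illustration, not code from any of the scripts above):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 50257)                # 4 token positions, GPT2-sized vocab
labels = torch.tensor([10, 42, -100, -100])   # last two positions masked out

# -100 is the default ignore_index, so the masked positions contribute
# nothing to the loss; only the first two positions are averaged.
loss = F.cross_entropy(logits, labels)
```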

To add to my confusion (I was specifically working with the model aubmindlab/aragpt2-mega, in case that is relevant), when I did convert the pad label positions to -100 using the inverse of the attention_mask, the Trainer threw a CUDA error: device-side assert triggered during causal fine-tuning. Only when I left the pad labels unchanged did the Trainer run normally.
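Roughly, the conversion I applied looked like this (a simplified sketch of my preprocessing; the function name is just illustrative):

```python
import torch

def mask_pad_labels(batch):
    # batch["input_ids"] and batch["attention_mask"] are tensors of shape
    # (batch_size, seq_len) produced by the tokenizer with padding enabled.
    labels = batch["input_ids"].clone()
    # attention_mask is 1 on real tokens and 0 on padding, so its inverse
    # picks out exactly the pad positions.
    labels[batch["attention_mask"] == 0] = -100
    batch["labels"] = labels
    return batch
```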

I'd like to understand what is happening here.

It makes sense that we do not want pad tokens to be considered in the cross-entropy loss. Is this error occurring because the specific model I am working with does not use cross-entropy loss? And why do none of the tutorials and scripts convert the labels, as I mentioned above? Should we or should we not change the pad labels to -100?