Why does my loss become NaN when I set the padding token in the labels to -100?

Hello everyone, I’m a new member of the Hugging Face community, and this is my first topic!

Recently I have been trying to generate music scores with GPT-2. The tokenized measures differ in length, so I have to pad them to the same length. The course said I should use -100 as the padding value in the labels. But once I use -100, the loss becomes NaN and the accuracy starts going down. If I use 0 as the padding token in the labels (the same as in the inputs), there is no such problem. Why does this happen?

My model is TFGPT2LMHeadModel, and the loss function is tf.keras.losses.SparseCategoricalCrossentropy.
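
For reference, here is a minimal sketch of what I understand the -100 convention is supposed to do: positions labelled -100 are skipped when the loss is averaged, so padding never contributes. This is plain Python, not my actual TensorFlow training code; the function name `masked_sparse_ce` and the toy logits are made up for illustration. My guess is that a loss without this kind of masking would treat -100 as an ordinary (invalid) class index.

```python
import math

IGNORE_INDEX = -100  # label value that marks padded positions


def masked_sparse_ce(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean sparse cross-entropy over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padded position: excluded from the loss
        # Numerically stable log-sum-exp of the logits for this position
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]  # -log softmax(row)[label]
        count += 1
    # If every position is padding, return 0.0 instead of dividing by zero
    return total / count if count else 0.0


# Toy example: three positions, vocabulary of size 3, last position is padding
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.0, 0.0, 0.0]]
labels = [0, 1, IGNORE_INDEX]
loss = masked_sparse_ce(logits, labels)
print(loss)
```

With the mask in place the padded position has no effect, so the result is the same as computing the loss on the first two positions only.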