Should the padding token be ignored in the loss function?

Hi.

I’m trying to train a GPT2 model, and looking at how the loss is computed, I don’t see the padding token or the EOS token being ignored by the loss function. Why is that? In RNNs and similar networks we usually ignore pads in the loss, since their backward pass doesn’t matter to us and we don’t want to waste compute, but reading the code, I get the impression the reasoning here might be a bit different?
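For reference, this is roughly what I would have expected: mask the padded positions in the labels with -100 so that CrossEntropyLoss skips them. This is just a minimal sketch, assuming the Hugging Face `GPT2LMHeadModel` (which ignores label positions set to -100) and reusing the EOS token as the pad token:

```python
import torch
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

model = GPT2LMHeadModel.from_pretrained("gpt2")

batch = tokenizer(
    ["a short example", "a slightly longer example sentence"],
    padding=True,
    return_tensors="pt",
)

# Copy the inputs as labels, then mark padded positions so the loss ignores them.
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100  # -100 is CrossEntropyLoss's ignore_index

outputs = model(**batch, labels=labels)
print(outputs.loss)
```

Is this masking step expected to be done by the user, or is there a reason it isn’t built into the loss computation?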

Thanks!
