I understand that the -100 label id is used so that predictions at those positions are not included when calculating the loss.
However, here they state “complicated list comprehension here because pad_token_id alone is not good enough to know whether label should be excluded or not” when replacing pad tokens. The implementation uses nn.CrossEntropyLoss(), which has an “ignore_index” argument.
Is there any benefit to changing the label id to -100, as opposed to passing ignore_index to the loss and setting it to the pad token id? Or are the results the same?
The way it is written makes me think there is some benefit, but going by its description, “ignore_index” appears to achieve the same thing. Or was this just a choice in case someone changes the pad token id?
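
To make the comparison concrete, here is a minimal sketch of the two options I have in mind. The pad token id of 0, the vocabulary size, and the toy tensors are all made up for illustration and are not taken from the implementation in question.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

pad_token_id = 0   # hypothetical pad id, for illustration only
vocab_size = 5     # hypothetical vocabulary size

# Toy logits for 4 target positions; the last two labels are padding.
logits = torch.randn(4, vocab_size)
labels = torch.tensor([2, 3, pad_token_id, pad_token_id])

# Option A: replace pad labels with -100 and rely on the default
# ignore_index=-100 of nn.CrossEntropyLoss.
labels_masked = labels.clone()
labels_masked[labels_masked == pad_token_id] = -100
loss_a = nn.CrossEntropyLoss()(logits, labels_masked)

# Option B: keep the labels as they are and tell the loss to ignore
# the pad token id directly.
loss_b = nn.CrossEntropyLoss(ignore_index=pad_token_id)(logits, labels)

print(loss_a.item(), loss_b.item())
print(torch.allclose(loss_a, loss_b))
```

In this toy case both losses are averaged over the same two non-padding positions, so they come out equal. It only covers the situation where every label equal to the pad id really is padding, which may be exactly what the quoted comment is getting at.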