Pad token vs -100 index_id

I understand that the label id -100 is used so that the predictions at those positions are not included when calculating the loss.

However, here they state “complicated list comprehension here because pad_token_id alone is not good enough to know whether label should be excluded or not” when replacing pad tokens. The implementation uses nn.CrossEntropyLoss(), which has an “ignore_index” argument.
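For readers who have not seen the pattern, the pad-token replacement being discussed can be sketched roughly like this. It is a minimal illustration, not the code from the linked implementation: the names PAD_TOKEN_ID, IGNORE_INDEX and mask_padding are hypothetical, and the pad id of 0 is an assumption.

```python
PAD_TOKEN_ID = 0     # assumed pad id, for illustration only
IGNORE_INDEX = -100  # default ignore_index of nn.CrossEntropyLoss

def mask_padding(label_batch, pad_token_id=PAD_TOKEN_ID):
    """Return a copy of the label batch with pad ids replaced by -100."""
    return [
        [(tok if tok != pad_token_id else IGNORE_INDEX) for tok in seq]
        for seq in label_batch
    ]

# Two already-padded label sequences (hypothetical token ids).
labels = [[5, 7, 2, 0, 0], [9, 2, 0, 0, 0]]
masked = mask_padding(labels)
print(masked)
```

Only the label ids change; the input ids still carry the real pad token.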

Is there any benefit to changing the id to -100 as opposed to adding the argument ignore_index in the loss and setting it as the pad token id? Or are the results the same?

The way it is written makes me think there is some benefit, but the description of “ignore_index” appears to achieve what is wanted. Or was this just a choice in case someone chose to change the pad token id?


It's just for when someone wants to change the pad token id.
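A quick way to check this is to compare the two approaches on a toy batch. The sketch below assumes PyTorch is installed, and the pad id of 0, the tensor shapes and the variable names are made up for illustration. It relies on -100 being the default value of ignore_index in nn.CrossEntropyLoss.

```python
import torch
import torch.nn as nn

PAD_ID = 0      # assumed pad id, for illustration only
VOCAB_SIZE = 5

torch.manual_seed(0)
logits = torch.randn(2, 4, VOCAB_SIZE)  # (batch, seq_len, vocab_size)
labels = torch.tensor([[3, 1, PAD_ID, PAD_ID],
                       [2, 4, 1, PAD_ID]])

# Option A: replace pad ids with -100 and use the default ignore_index.
masked_labels = labels.masked_fill(labels == PAD_ID, -100)
loss_a = nn.CrossEntropyLoss()(
    logits.reshape(-1, VOCAB_SIZE), masked_labels.reshape(-1)
)

# Option B: keep the labels as they are and tell the loss which id to skip.
loss_b = nn.CrossEntropyLoss(ignore_index=PAD_ID)(
    logits.reshape(-1, VOCAB_SIZE), labels.reshape(-1)
)

print(torch.allclose(loss_a, loss_b))
```

With -100 in the labels, the loss never needs to know the tokenizer's pad id, so the same code works if that id changes. The two options would stop matching if the pad id were also a real target token, for example when the pad token is set to the EOS token.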

