Difference between setting label index to -100 & setting attention mask to 0

According to the docs, setting a token's label index to -100 makes the model skip that token when computing the loss, and the attention mask seems to do the same thing.
Is the functionality of the two the same, or do they differ in where one or the other should be used?

Thanks

No, the attention mask is not used in the loss computation. It’s just there to make sure your model is not paying attention to the masked tokens.
The two things should be used together.
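Here’s a minimal sketch of the difference in plain PyTorch (the shapes and label values are made up for illustration): the -100 convention comes from CrossEntropyLoss, whose ignore_index defaults to -100, while the attention mask is a separate model input that never enters the loss.

```python
import torch
from torch.nn import CrossEntropyLoss

# Classification/LM heads in Transformers use CrossEntropyLoss under the hood;
# its ignore_index defaults to -100, so positions labelled -100 contribute nothing.
loss_fct = CrossEntropyLoss()

num_labels = 10
logits = torch.randn(1, 4, num_labels)       # (batch, seq_len, num_labels)
labels = torch.tensor([[2, 5, -100, -100]])  # last two positions ignored in the loss

loss = loss_fct(logits.view(-1, num_labels), labels.view(-1))

# The attention mask is a separate input to the model's forward pass:
# it only controls which positions the model attends to and plays no part
# in the loss computation above.
attention_mask = torch.tensor([[1, 1, 1, 0]])  # e.g. the last position is padding
```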

Thanks @sgugger. So if I’m ignoring the special tokens (CLS/SEP) in the loss computation using -100, is it recommended to mask them out of the loss that way?

Yes, and you probably don’t want to ignore them with the attention mask (like we do with the padding) since they give useful information to the model.

Alrighty, so in short:

- special tokens: attend & don’t compute loss
- padding: don’t attend & don’t compute loss
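Something like this, as a rough sketch (assuming a BERT-style fast tokenizer and a token-classification setup; the label values are made up):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Pad so the example contains special tokens, real tokens and padding.
enc = tokenizer("Hello world", padding="max_length", max_length=8, return_tensors="pt")

word_labels = [3, 7]  # hypothetical per-word labels for "Hello" and "world"

labels = []
for word_id in enc.word_ids(batch_index=0):
    if word_id is None:
        # [CLS], [SEP] and [PAD] map to no word: exclude them from the loss
        labels.append(-100)
    else:
        labels.append(word_labels[word_id])

# The attention mask stays exactly as the tokenizer produced it:
# 1 for [CLS]/[SEP] and real tokens (still attended to), 0 only for padding.
print(enc["attention_mask"][0].tolist())  # e.g. [1, 1, 1, 1, 0, 0, 0, 0]
print(labels)                             # e.g. [-100, 3, 7, -100, -100, -100, -100, -100]
```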

That would be my recommendation, yes.
