According to the docs, setting a token's label index to -100
makes the model not compute loss on that token.
The attention mask seems to do the same thing.
Is the functionality of both the same, or does it differ in where one or the other is used?
Thanks
No, the attention mask is not used in the loss computation. It’s just there to make sure your model is not paying attention to the masked tokens.
The two things should be used together.
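To make that concrete, here is a minimal sketch (a hypothetical toy setup, using `bert-base-uncased` with a freshly initialized 2-label token classification head): `attention_mask` controls which positions the model attends to, while `-100` in `labels` tells the loss (a cross-entropy with `ignore_index=-100`) which positions to skip.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical toy setup: a 2-label token classification head on top of BERT.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=2)

# "hello world" -> [CLS] hello world [SEP] [PAD] [PAD]
enc = tokenizer("hello world", padding="max_length", max_length=6, return_tensors="pt")
print(enc["attention_mask"])  # tensor([[1, 1, 1, 1, 0, 0]]) -> padding is not attended to

# labels: -100 on [CLS], [SEP] and padding so they are skipped by the loss;
# real label ids (0/1) only on the word tokens we actually train on.
labels = torch.tensor([[-100, 1, 0, -100, -100, -100]])

out = model(**enc, labels=labels)
# The loss uses ignore_index=-100, so only the two word tokens contribute to it.
print(out.loss)
```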
Thanks @sgugger. So if I’m ignoring the special tokens (CLS/SEP) in the loss computation using -100, is that the recommended way to mask them out?
Yes, and you probably don’t want to ignore them with the attention mask (like we do with the padding) since they give useful information to the model.
Alrighty, so in short:
special tokens - attend & don’t compute loss
padding - don’t attend & don’t compute loss
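As a sanity check of that summary, a rough sketch of how one might build the label vector by hand (a hypothetical `build_labels` helper, assuming a fast tokenizer so `word_ids()` is available, and assuming every subword inherits its word's label):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def build_labels(text, word_labels, max_length=8):
    """Special tokens: attended (mask 1) but label -100.
    Padding: not attended (mask 0) and label -100."""
    enc = tokenizer(text.split(), is_split_into_words=True,
                    padding="max_length", max_length=max_length,
                    truncation=True, return_tensors="pt")
    labels = []
    for word_id in enc.word_ids(batch_index=0):
        # word_id is None for [CLS], [SEP] and [PAD]
        labels.append(-100 if word_id is None else word_labels[word_id])
    enc["labels"] = torch.tensor([labels])
    return enc

batch = build_labels("hello world", word_labels=[1, 0])
print(batch["attention_mask"])  # 1 on [CLS] hello world [SEP], 0 on padding
print(batch["labels"])          # -100 on [CLS]/[SEP]/padding, real ids elsewhere
```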
That would be my recommendation, yes.