Difference between setting label index to -100 & setting attention mask to 0

According to the docs, setting a token's label index to -100 makes the model skip that token when computing the loss, and the attention mask seems to do the same thing.
Is the functionality of the two the same, or do they differ in where one or the other should be used?

Thanks

No, the attention mask is not used in the loss computation. It’s just there to make sure your model is not paying attention to the masked tokens.
The two things should be used together.
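Here’s a minimal sketch of the difference in plain PyTorch (the shapes and label values are made up for illustration): the -100 convention comes from CrossEntropyLoss, whose ignore_index defaults to -100, while the attention mask is a separate model input that never enters the loss.

```python
import torch
from torch.nn import CrossEntropyLoss

# Classification/LM heads in Transformers use CrossEntropyLoss under the hood;
# its ignore_index defaults to -100, so positions labelled -100 contribute nothing.
loss_fct = CrossEntropyLoss()

num_labels = 10
logits = torch.randn(1, 4, num_labels)       # (batch, seq_len, num_labels)
labels = torch.tensor([[2, 5, -100, -100]])  # last two positions ignored in the loss

loss = loss_fct(logits.view(-1, num_labels), labels.view(-1))

# The attention mask is a separate input to the model's forward pass:
# it only controls which positions the model attends to and plays no part
# in the loss computation above.
attention_mask = torch.tensor([[1, 1, 1, 0]])  # e.g. the last position is padding
```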

Thanks @sgugger. So if I’m ignoring the special tokens (CLS/SEP) in the loss computation using -100, is it recommended to mask them out of the loss that way?

Yes, and you probably don’t want to ignore them with the attention mask (like we do with the padding) since they give useful information to the model.

Alrighty, so in short:

- special tokens: attend & don’t compute loss
- padding: don’t attend & don’t compute loss
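Something like this, as a rough sketch (assuming a BERT-style fast tokenizer and a token-classification setup; the label values are made up):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Pad so the example contains special tokens, real tokens and padding.
enc = tokenizer("Hello world", padding="max_length", max_length=8, return_tensors="pt")

word_labels = [3, 7]  # hypothetical per-word labels for "Hello" and "world"

labels = []
for word_id in enc.word_ids(batch_index=0):
    if word_id is None:
        # [CLS], [SEP] and [PAD] map to no word: exclude them from the loss
        labels.append(-100)
    else:
        labels.append(word_labels[word_id])

# The attention mask stays exactly as the tokenizer produced it:
# 1 for [CLS]/[SEP] and real tokens (still attended to), 0 only for padding.
print(enc["attention_mask"][0].tolist())  # e.g. [1, 1, 1, 1, 0, 0, 0, 0]
print(labels)                             # e.g. [-100, 3, 7, -100, -100, -100, -100, -100]
```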

That would be my recommendation, yes.
