Does anyone know how exactly the masking token is processed? I've seen code that uses the -100 id as a mask value to mask certain parts of the "labels" for supervised fine-tuning. I think I get what it's meant to do (specify which parts count as the actual "labels"), but I haven't been able to make sense of how exactly it works (in terms of how it interacts with the loss, or how it guides the model to specific outputs). I've tried looking at the GitHub repository for the Trainer but haven't really been able to make sense of it.
If anyone knows of an explanation, a guide, or really any resource relating to this topic, I'd really appreciate the help.
-100 is typically used in PyTorch loss functions to say "skip me". For example, the word "Chocolate" might be tokenised as "Cho", "co", "o", "l", "ate". If we computed the loss and updated the gradients based on each of those tokens, we would effectively be doing so five times (once per token). Compare that with "rain", which might be tokenised as just "ra", "in": the longer word would have more than double the impact on the loss. So longer words with more tokens would affect the model weights more than shorter words with fewer tokens.
Therefore we usually keep the label only on the first token, and self-attention creates the association between the first token and the subsequent tokens in the word.
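As a rough sketch of that first-token labelling (the tokenisation, word ids, and class labels here are all made up for illustration):

```python
# Hypothetical word pieces for "Chocolate rain": label only the first
# sub-token of each word; mark continuation tokens with -100 so the loss
# function skips them.
tokens = ["Cho", "co", "o", "l", "ate", "ra", "in"]
word_ids = [0, 0, 0, 0, 0, 1, 1]   # which word each token belongs to
word_labels = [2, 5]               # hypothetical class id per word

labels = []
prev = None
for wid in word_ids:
    # First token of a new word gets the word's label; the rest get -100.
    labels.append(word_labels[wid] if wid != prev else -100)
    prev = wid

print(labels)  # [2, -100, -100, -100, -100, 5, -100]
```

This way each word contributes exactly one position to the loss, regardless of how many tokens it was split into.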
At the internal level this just means not calculating the loss for certain positions: any label equal to -100 is ignored by PyTorch's cross-entropy loss by default (it's the default value of the `ignore_index` argument, which can be changed).
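A minimal PyTorch sketch of that skipping behaviour (the logits and labels are arbitrary toy values), using `torch.nn.functional.cross_entropy`, whose `ignore_index` defaults to -100:

```python
import torch
import torch.nn.functional as F

# Toy logits for 5 token positions over a vocab of 10.
torch.manual_seed(0)
logits = torch.randn(5, 10)

# Labels: positions set to -100 are skipped by the loss entirely.
labels = torch.tensor([3, -100, -100, 7, 1])

# cross_entropy ignores targets equal to ignore_index (-100 by default),
# so the mean is taken over the 3 real labels only.
loss_masked = F.cross_entropy(logits, labels)

# Equivalent: compute the loss only on the kept positions.
keep = labels != -100
loss_manual = F.cross_entropy(logits[keep], labels[keep])

assert torch.allclose(loss_masked, loss_manual)
```

So the masked positions produce no loss and therefore no gradient; the model is never pushed toward (or away from) any output at those positions.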
The above is true of encoder-style models; I assume -100 fills a similar purpose in decoder models trained with the SFTTrainer, e.g. masking out the prompt portion of the labels so only the response tokens are scored.
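For the decoder/SFT case, a common pattern (the token ids here are made up) is to copy `input_ids` into `labels` and overwrite the prompt positions with -100, so the loss is computed only on the response:

```python
# Hypothetical SFT example: mask the prompt part of `labels` with -100 so
# only the response tokens contribute to the causal-LM loss.
prompt_ids = [101, 42, 17, 9]      # made-up token ids for the prompt
response_ids = [55, 23, 88, 102]   # made-up token ids for the response

input_ids = prompt_ids + response_ids
labels = [-100] * len(prompt_ids) + response_ids

print(input_ids)  # [101, 42, 17, 9, 55, 23, 88, 102]
print(labels)     # [-100, -100, -100, -100, 55, 23, 88, 102]
```

The model still sees the full prompt in `input_ids` (so it conditions on it), but it is only trained to predict the response.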