Processing the [-100] Mask in SFT

Does anyone know how exactly the masking token is processed? I've seen code that uses the [-100] id as a mask value for certain parts of the 'labels' during supervised fine-tuning. I think I get what it's meant to do (mark which parts count as the actual 'labels'), but I haven't been able to make sense of how exactly it works, in terms of how it interacts with the loss or how it guides the model toward specific outputs. I've tried looking at the GitHub repository for the Trainer but haven't really been able to make sense of it.

If there's anyone who can offer an explanation, a guide, or really any resource relating to this topic, I'd really appreciate the help.

-100 is typically used in PyTorch loss functions to say "skip me". For example, the word "Chocolate" may be tokenised as "Cho", "c", "o", "l", "ate". If we compute the loss and update the gradients based on each token, we will effectively be doing so 5 times (once per token). Compare that with "rain", which may be tokenised as "ra", "in": the loss from "Chocolate" would carry more than double the weight. So longer words with more tokens would have a greater impact on the model weights than shorter words with fewer tokens.

Therefore we usually keep the label only on the first token of each word, and self-attention creates the association between that first token and the subsequent tokens of the word.
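To make that concrete, here is a minimal sketch of the label alignment, using the hypothetical tokenisation above and made-up class ids; in practice the word boundaries usually come from the tokenizer itself (e.g. fast tokenizers expose a word_ids() mapping).

```python
# Hypothetical tokenisation of "Chocolate rain" (sub-token splits made up for illustration):
tokens      = ["Cho", "c", "o", "l", "ate", "ra", "in"]
word_ids    = [0,     0,   0,   0,   0,     1,    1]   # which word each sub-token belongs to
word_labels = [7, 3]                                   # one (made-up) class id per word

labels, previous_word = [], None
for word_id in word_ids:
    if word_id != previous_word:
        labels.append(word_labels[word_id])  # first sub-token of a word keeps its label
    else:
        labels.append(-100)                  # remaining sub-tokens are skipped by the loss
    previous_word = word_id

print(labels)  # [7, -100, -100, -100, -100, 3, -100]
```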

At the internal level this just means that no loss is calculated for those positions: any position whose label equals the loss function's ignore_index, which defaults to -100 in PyTorch, is skipped. That value can be changed if needed.
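As a quick check of that behaviour, the snippet below (with made-up logits and labels) shows that nn.CrossEntropyLoss drops the -100 positions and averages only over the remaining ones, which is exactly what the default ignore_index=-100 means.

```python
import torch
import torch.nn as nn

# Toy logits for a sequence of 4 positions over a vocabulary of 5 classes.
logits = torch.randn(4, 5)

# Labels: positions 1 and 2 are masked with -100 and are ignored by the loss.
labels = torch.tensor([2, -100, -100, 4])

loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 is also the default
loss = loss_fn(logits, labels)

# The same result comes from averaging the loss over only the unmasked positions.
manual = nn.CrossEntropyLoss()(logits[[0, 3]], labels[[0, 3]])
print(torch.allclose(loss, manual))  # True
```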

The above is true of encoder-style models; I assume -100 fills a similar purpose in decoder models trained with the SFTTrainer, i.e. masking out the prompt tokens so the loss is only computed on the response.
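For the decoder/SFT case, the usual pattern is to copy the input ids into the labels and overwrite the prompt positions with -100, so only the response tokens contribute to the loss. This is a sketch of the idea, not the exact SFTTrainer collator code; the ids and prompt length below are invented.

```python
import torch

# Suppose the prompt occupies the first `prompt_len` positions of `input_ids`
# and the response fills the rest (ids below are made up).
input_ids  = torch.tensor([101, 2023, 2003, 1996, 3437, 2003, 2182, 102])
prompt_len = 4

# Labels start as a copy of the inputs (standard causal-LM training),
# then the prompt positions are set to -100 so only the response is scored.
labels = input_ids.clone()
labels[:prompt_len] = -100

print(labels)  # tensor([-100, -100, -100, -100, 3437, 2003, 2182, 102])
```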

See:
CrossEntropyLoss - PyTorch 2.2 documentation

I also wrote about it here, though I admit I could certainly revisit this article to make it more accurate.

Transformer Attention and Tokenisation | N E R | D S (medium.com)
