Does anyone know how exactly the masking token is processed? I've seen code that uses the -100 id as a mask value to mask certain parts of the "labels" for supervised fine-tuning. I think I get what it's meant to do (specify which parts count as the actual "labels"), but I haven't been able to make sense of how exactly it works (in terms of how it interacts with the loss, or how it guides the model to specific outputs). I've tried looking at the GitHub repository for the Trainer but haven't really been able to make sense of it.
If anyone knows of an explanation, a guide, or really any resource relating to this topic, I'd really appreciate the help.
-100 is typically used in PyTorch loss functions to say "skip me". For example, the word "Chocolate" might be tokenised as "Cho", "co", "o", "l", "ate". If we computed the loss and updated the gradients based on each of those tokens, we would effectively be doing so five times (once per token). Compare that with "rain", which might be tokenised as just "ra", "in": the longer word would have more than double the impact on the loss. So longer words with more tokens would affect the model weights more than shorter words with fewer tokens.
Therefore we usually keep the label only on the first token, and self-attention creates the association between the first token and the subsequent tokens in the word.
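As a rough sketch of that first-token labelling (the tokenisation, word ids, and class labels here are all made up for illustration):

```python
# Hypothetical word pieces for "Chocolate rain": label only the first
# sub-token of each word; mark continuation tokens with -100 so the loss
# function skips them.
tokens = ["Cho", "co", "o", "l", "ate", "ra", "in"]
word_ids = [0, 0, 0, 0, 0, 1, 1]   # which word each token belongs to
word_labels = [2, 5]               # hypothetical class id per word

labels = []
prev = None
for wid in word_ids:
    # First token of a new word gets the word's label; the rest get -100.
    labels.append(word_labels[wid] if wid != prev else -100)
    prev = wid

print(labels)  # [2, -100, -100, -100, -100, 5, -100]
```

This way each word contributes exactly one position to the loss, regardless of how many tokens it was split into.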
At the internal level this just means not calculating the loss for certain positions: any label equal to -100 is ignored by PyTorch's cross-entropy loss by default (it's the default value of the `ignore_index` argument, which can be changed).
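A minimal PyTorch sketch of that skipping behaviour (the logits and labels are arbitrary toy values), using `torch.nn.functional.cross_entropy`, whose `ignore_index` defaults to -100:

```python
import torch
import torch.nn.functional as F

# Toy logits for 5 token positions over a vocab of 10.
torch.manual_seed(0)
logits = torch.randn(5, 10)

# Labels: positions set to -100 are skipped by the loss entirely.
labels = torch.tensor([3, -100, -100, 7, 1])

# cross_entropy ignores targets equal to ignore_index (-100 by default),
# so the mean is taken over the 3 real labels only.
loss_masked = F.cross_entropy(logits, labels)

# Equivalent: compute the loss only on the kept positions.
keep = labels != -100
loss_manual = F.cross_entropy(logits[keep], labels[keep])

assert torch.allclose(loss_masked, loss_manual)
```

So the masked positions produce no loss and therefore no gradient; the model is never pushed toward (or away from) any output at those positions.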
The above is true of encoder-style models; I assume -100 fills a similar purpose in decoder models trained with the SFTTrainer, e.g. masking out the prompt portion of the labels so only the response tokens are scored.
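For the decoder/SFT case, a common pattern (the token ids here are made up) is to copy `input_ids` into `labels` and overwrite the prompt positions with -100, so the loss is computed only on the response:

```python
# Hypothetical SFT example: mask the prompt part of `labels` with -100 so
# only the response tokens contribute to the causal-LM loss.
prompt_ids = [101, 42, 17, 9]      # made-up token ids for the prompt
response_ids = [55, 23, 88, 102]   # made-up token ids for the response

input_ids = prompt_ids + response_ids
labels = [-100] * len(prompt_ids) + response_ids

print(input_ids)  # [101, 42, 17, 9, 55, 23, 88, 102]
print(labels)     # [-100, -100, -100, -100, 55, 23, 88, 102]
```

The model still sees the full prompt in `input_ids` (so it conditions on it), but it is only trained to predict the response.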