I noticed that in seq2seq tasks like translation, the labels are padded with -100, while the predictions are padded with the special padding token.
So why don’t we use the same padding value for both? Why does each one use a different value?
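The -100 you mentioned is not really a token from the vocabulary. It is the value the loss function is told to ignore (the default ignore_index of PyTorch’s cross-entropy loss), so it acts as a loss mask and serves a somewhat different purpose than the padding token.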
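It isn’t only that the labels are “padded with -100.” Any position whose labels value is -100 is skipped when the loss is computed. In seq2seq training the data collator fills the padded label positions with -100, and in setups where the prompt and the target share one sequence, the positions corresponding to the inputs are given labels values of -100 as well.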
This -100 mask tells the loss function, and therefore the Trainer, that the associated position and its prediction should not count toward the loss. As a result, the model focuses solely on learning how to generate the “outputs,” which is the translated text in this case.
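To make the "not used for loss" part concrete, here is a minimal pure-Python sketch of a cross-entropy that skips -100 positions. It only imitates the behavior of an ignore index and is not the Trainer's or PyTorch's actual code; the function name masked_cross_entropy and the toy logits are made up for illustration.

```python
import math

IGNORE_INDEX = -100  # the value PyTorch's CrossEntropyLoss ignores by default


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this position contributes nothing to the loss
        # numerically stable log-sum-exp over the vocabulary for this position
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0


# Toy example: 3 positions, vocabulary of size 3 (hypothetical numbers).
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.0, 0.0, 3.0]]
labels = [0, 1, IGNORE_INDEX]  # the last position is masked out

loss = masked_cross_entropy(logits, labels)

# Changing the prediction at the masked position does not change the loss.
perturbed = [logits[0], logits[1], [9.0, -4.0, 0.0]]
loss_perturbed = masked_cross_entropy(perturbed, labels)
```

Whatever the model predicts at a -100 position, the loss and therefore the gradient from that position stay the same, which is why those positions are never "learned."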
Just because an input has been masked, however, does not mean that it is meaningless. The model learns to generate the labels based on the inputs, which means that even though the inputs have been masked and the model does not explicitly learn to predict them, it still uses them as context for generating the labels.
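Padding tokens, on the other hand, are real vocabulary tokens that have no “meaning,” so to speak, and serve the sole purpose of making every sequence in a batch the same length. Predictions are padded with the pad token because generated ids have to be valid vocabulary ids that the tokenizer can decode, and -100 is not one.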
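Here is a rough sketch of how the two kinds of padding sit side by side in a batch, and why -100 is usually swapped back to the pad id before decoding labels for metrics. The pad id of 0, the token ids, and the helper names pad_batch and to_decodable are all assumptions for illustration; the real pad id depends on your tokenizer.

```python
PAD_TOKEN_ID = 0     # assumption: the actual value comes from the tokenizer
LABEL_PAD_ID = -100  # ignored by the loss, not a valid vocabulary id


def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_value] * (max_len - len(s)) for s in sequences]


# Inputs are padded with the real pad token (made-up token ids).
input_ids = pad_batch([[11, 12, 13, 1], [21, 1]], PAD_TOKEN_ID)

# Labels are padded with -100 so the padded positions are skipped by the loss.
labels = pad_batch([[31, 32, 1], [41, 42, 43, 44, 1]], LABEL_PAD_ID)


def to_decodable(ids, pad_token_id=PAD_TOKEN_ID):
    """Replace -100 with the pad id so a tokenizer could decode the rows."""
    return [[pad_token_id if t == LABEL_PAD_ID else t for t in row] for row in ids]


decodable_labels = to_decodable(labels)
```

The same replacement step is what you would typically do inside a metrics function before comparing decoded labels with decoded predictions.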