Seq2seq padding

I noticed that in seq2seq tasks like translation the labels are padded with -100, while the predictions are padded with the special padding token.
Why don’t we use the same pad value for both? Why does each one use a different value?

The -100 value you mentioned is actually a “mask” value (the loss function’s ignore index), and it serves a somewhat different purpose than the padding token.

It isn’t quite that the labels are “padded with -100”; rather, the label positions that the model should not be trained to predict (such as the positions that correspond to padding) are given the value -100.

The -100 value tells the Trainer (more precisely, the loss function) that the token and prediction at that position should not be used for the loss. This means the model focuses solely on learning how to generate the “outputs,” which in this case is the translated text.
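As a minimal sketch of what happens under the hood: PyTorch’s `nn.CrossEntropyLoss` uses `ignore_index=-100` by default, so any position whose label is -100 simply drops out of the loss. The toy shapes below are made up for illustration.

```python
import torch
import torch.nn as nn

# Toy logits: batch of 1 sequence, 4 positions, vocabulary of 5 tokens.
logits = torch.randn(1, 4, 5)

# The last two positions are labeled -100, so they are ignored by the loss.
labels = torch.tensor([[2, 3, -100, -100]])

# CrossEntropyLoss skips targets equal to ignore_index (-100 by default).
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
loss = loss_fn(logits.view(-1, 5), labels.view(-1))

# Same value as computing the loss over only the first two positions.
loss_first_two = loss_fn(logits[:, :2, :].reshape(-1, 5), labels[:, :2].reshape(-1))
print(loss, loss_first_two)
```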

Just because a position has been masked out of the loss, however, does not mean that it is meaningless. The model learns to generate the labels based on the inputs, which means that even though the masked positions are not explicitly predicted, they are still used as context for generating the labels.

Padding tokens, on the other hand, have no real “meaning,” so to speak; their sole purpose is to bring all sequences in a batch up to the same length so they can be fed to the model together.
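To see the two values side by side, here is a small sketch using `DataCollatorForSeq2Seq` (the `t5-small` checkpoint and the example sentences are just placeholders): the collator pads `input_ids` with the tokenizer’s pad token and pads `labels` with `label_pad_token_id`, which defaults to -100.

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

# Example checkpoint; any seq2seq tokenizer behaves the same way here.
tokenizer = AutoTokenizer.from_pretrained("t5-small")

collator = DataCollatorForSeq2Seq(tokenizer, label_pad_token_id=-100)

# Two examples of different lengths.
features = [
    {"input_ids": tokenizer("translate English to German: Hello").input_ids,
     "labels": tokenizer("Hallo").input_ids},
    {"input_ids": tokenizer("translate English to German: How are you today?").input_ids,
     "labels": tokenizer("Wie geht es dir heute?").input_ids},
]

batch = collator(features)
print(batch["input_ids"])   # shorter input is padded with tokenizer.pad_token_id
print(batch["labels"])      # shorter label sequence is padded with -100
```

The padded input positions are then hidden from the model through the attention mask, while the -100 label positions are hidden from the loss, which is why the two use different values.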