Seems like a silly question, but I’m learning and can’t find anything definitive…
In models where input_ids and labels may be of different lengths (e.g. denoising, where a span of several tokens in labels may have been replaced by a single sentinel token in input_ids), should the attention_mask correspond to labels (so the original chunk size) or to input_ids (so resized after noising)?
The attention_mask tells the model which positions in the input to attend to, i.e., which tokens are real vs. padding. It only affects how attention is computed over the input_ids during the forward pass, so it should match the length of input_ids.
The labels are not used in the attention computation at all; they are only used in the loss computation.
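A minimal sketch with made-up token ids: the attention_mask lines up element-for-element with input_ids, while labels can have a different length entirely.

```python
# Hypothetical token ids, just to illustrate the shapes.
example = {
    # noised input: a multi-token span was replaced by a single sentinel token, then padded
    "input_ids":      [37, 423, 32099, 58, 1, 0, 0],
    # one entry per input_ids position: 1 = real token, 0 = padding
    "attention_mask": [1,  1,   1,     1,  1, 0, 0],
    # original (longer) target; its length is independent of input_ids
    "labels":         [32099, 423, 1176, 10, 58, 1, -100, -100, -100],  # -100 is ignored by the loss
}
assert len(example["attention_mask"]) == len(example["input_ids"])
```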
But I guess my question still stands regarding decoder_input_ids, in case they're based on labels (see my other question), which would mean, if I understand correctly, that the labels (shifted right) are used during computation on the decoder side, no?
So in my dataset, I should include both attention_mask and decoder_attention_mask? Will the model know which mask to use at which phase? I’m a bit confused…
With the HF Trainer, you only need to pass input_ids, attention_mask, and labels.
If you pass labels, the model will:
1. Automatically shift them to create decoder_input_ids
2. Create the decoder_attention_mask to match the decoder_input_ids
3. Handle masking and loss computation (ignoring -100 in labels)
So the full decoder setup is inferred internally — as long as you provide labels.
You do not need to manually include decoder_input_ids or decoder_attention_mask; they are derived automatically at runtime by the model (or by the data collator, if you use one that prepares them).
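For example, with a T5-style model (a sketch adapted from the standard denoising example; the sentinel-token strings are T5-specific), passing only labels is enough and the loss comes back directly:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Encoder side: noised text; attention_mask matches input_ids
enc = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt")
# Decoder side: only labels are needed (possibly a different length)
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>",
                   return_tensors="pt").input_ids

# No decoder_input_ids / decoder_attention_mask passed: the model shifts
# labels right internally to build decoder_input_ids and computes the loss.
outputs = model(input_ids=enc.input_ids,
                attention_mask=enc.attention_mask,
                labels=labels)
print(outputs.loss)
```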
So just to make it absolutely clear (just correct me if I'm wrong; ignore otherwise): I must pass attention_mask based on the noised text (input_ids) for the encoder, and I can leave the (possibly longer) decoder_attention_mask for the trainer to handle. Great!
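For reference, here's roughly what the dataset entries and collator could look like in that setup (token ids are made up; DataCollatorForSeq2Seq is the usual collator for seq2seq training and pads labels with -100):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Each example carries only the encoder-side mask; labels may be longer or shorter.
features = [
    {"input_ids": [37, 423, 32099, 1], "attention_mask": [1, 1, 1, 1],
     "labels": [32099, 423, 1176, 10, 58, 1]},
    {"input_ids": [100, 19, 1], "attention_mask": [1, 1, 1],
     "labels": [100, 19, 4347, 1]},
]

# Passing `model=` lets the collator also prepare decoder_input_ids from labels;
# label padding uses -100 so padded positions are ignored by the loss.
collator = DataCollatorForSeq2Seq(tokenizer, model=model, label_pad_token_id=-100)
batch = collator(features)
print(batch.keys())  # input_ids, attention_mask, labels, decoder_input_ids
```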