Does attention_mask refer to input_ids or to labels?

Seems like a silly question, but I’m learning and can’t find anything definitive…

In models where input_ids and labels may have different lengths (e.g. denoising, where a span of several tokens from the original text is replaced by a single token in the noised input), should the attention_mask correspond to labels (so the original chunk size) or to input_ids (so the size after noising)?
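To make the length mismatch concrete, here is a minimal sketch of the kind of setup I mean (T5-style span corruption; the strings and sentinel tokens are just illustrative):

```python
# Illustrative only: T5-style span corruption, where input_ids and labels
# end up with different lengths.
original = "the quick brown fox jumps over the lazy dog"

# Noised encoder input: a multi-token span is collapsed into one sentinel token.
noised_input = "the quick <extra_id_0> over the lazy dog"

# Decoder target: the sentinel followed by the dropped span.
target = "<extra_id_0> brown fox jumps <extra_id_1>"

# After tokenization, len(input_ids) != len(labels), so it is not obvious
# which of the two the attention_mask should be sized to.
```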

2 Likes

The attention_mask tells the model which positions in the input to attend to, i.e., which tokens are real vs padding. It applies only to the forward pass — specifically, how attention is computed over the input_ids.

The labels are not used during attention computation; they are only used in the loss computation.
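A minimal sketch of what that looks like in practice (assuming a T5-style seq2seq model; the checkpoint name is only an example):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")      # example checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# attention_mask is produced alongside input_ids and always has the same shape.
enc = tokenizer("translate English to German: Hello", return_tensors="pt")
labels = tokenizer("Hallo", return_tensors="pt").input_ids

print(enc.input_ids.shape, enc.attention_mask.shape)  # identical shapes
print(labels.shape)                                    # may be a different length

# attention_mask only masks attention over input_ids;
# labels are only consumed by the loss.
outputs = model(
    input_ids=enc.input_ids,
    attention_mask=enc.attention_mask,
    labels=labels,
)
print(outputs.loss)
```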

2 Likes

Thanks, that’s a clear and succinct explanation!

But I guess my question still stands for decoder_input_ids, in case they are based on labels (see my other question). That would mean, if I understand correctly, that labels (shifted right) are used during computation on the decoder side, no?

1 Like

My bad, I completely missed that.

Yes, the decoder_attention_mask (or just the attention_mask over decoder_input_ids) should match the decoder input, which is usually labels shifted right.

decoder_input_ids are either provided manually or auto-generated by shifting labels right.
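Roughly how that shift works, as a simplified sketch (each model has its own internal helper, so this is an illustration rather than the exact library code; the token IDs are made up):

```python
import torch

def shift_right(labels: torch.Tensor, decoder_start_token_id: int, pad_token_id: int) -> torch.Tensor:
    """Build decoder_input_ids from labels: prepend the decoder start token
    and drop the last position."""
    shifted = labels.new_zeros(labels.shape)
    shifted[:, 1:] = labels[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    # -100 only matters for the loss, so replace it with the pad token here.
    shifted.masked_fill_(shifted == -100, pad_token_id)
    return shifted

labels = torch.tensor([[42, 43, 44, 1, -100]])  # example label IDs, padded with -100
decoder_input_ids = shift_right(labels, decoder_start_token_id=0, pad_token_id=0)

# A decoder_attention_mask, if you build one yourself, should line up with
# these decoder_input_ids (i.e. the shifted labels), not with the encoder's input_ids.
```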

2 Likes

So in my dataset, I should include both attention_mask and decoder_attention_mask? Will the model know which mask to use at which phase? I’m a bit confused…

1 Like

With the HF Trainer, you only need to pass input_ids, attention_mask, and labels.

If you pass labels, the model will:
1. Automatically shift them to create decoder_input_ids
2. Create the decoder_attention_mask to match the decoder_input_ids
3. Handle masking and loss computation (ignoring -100 in labels)

So the full decoder setup is inferred internally — as long as you provide labels.

You do not need to include decoder_input_ids or decoder_attention_mask manually; they are derived automatically at runtime by the model (or by the data collator).
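As a sketch, a preprocessing function along these lines is all the dataset needs (the column names and checkpoint are just examples, and this assumes a recent transformers version with the text_target argument; DataCollatorForSeq2Seq pads the batch and sets padded label positions to -100):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("t5-small")      # example checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def preprocess(example):
    # Encoder side: the noised text; this also produces the matching attention_mask.
    model_inputs = tokenizer(example["noised_text"], truncation=True)
    # Decoder side: only labels; decoder_input_ids and decoder_attention_mask
    # are derived at runtime.
    model_inputs["labels"] = tokenizer(
        text_target=example["original_text"], truncation=True
    )["input_ids"]
    return model_inputs

# Pads input_ids/attention_mask normally and pads labels with -100,
# so padded label positions are ignored by the loss.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
```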

1 Like

Thank you!

So just to make it absolutely clear (correct me if I’m wrong; ignore otherwise :wink: ): I must pass an attention_mask based on the noised text (input_ids) for the encoder, and I can leave the (possibly longer) decoder_attention_mask for the model/Trainer to handle. Great!

2 Likes
