Seems like a silly question, but I’m learning and can’t find anything definitive…
In models where input_ids and labels may be of different lengths (e.g. denoising, where a span of several tokens in labels may have been replaced by a single sentinel token in input_ids), should the attention_mask correspond to labels (so the original chunk size) or to input_ids (so resized after noising)?
The attention_mask tells the model which positions in the input to attend to, i.e., which tokens are real vs. padding. It only affects how attention is computed over the input_ids during the forward pass, so it should match the length of input_ids.
The labels are not used in the attention computation at all; they are only used in the loss computation.
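A minimal sketch with made-up token ids: the attention_mask lines up element-for-element with input_ids, while labels can have a different length entirely.

```python
# Hypothetical token ids, just to illustrate the shapes.
example = {
    # noised input: a multi-token span was replaced by a single sentinel token, then padded
    "input_ids":      [37, 423, 32099, 58, 1, 0, 0],
    # one entry per input_ids position: 1 = real token, 0 = padding
    "attention_mask": [1,  1,   1,     1,  1, 0, 0],
    # original (longer) target; its length is independent of input_ids
    "labels":         [32099, 423, 1176, 10, 58, 1, -100, -100, -100],  # -100 is ignored by the loss
}
assert len(example["attention_mask"]) == len(example["input_ids"])
```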
But I guess my question still stands regarding decoder_input_ids, in case they're based on labels (see my other question), which would mean, if I understand correctly, that the labels (shifted right) are used during computation on the decoder side, no?
So in my dataset, I should include both attention_mask and decoder_attention_mask? Will the model know which mask to use at which phase? I’m a bit confused…
With the HF Trainer, you only need to pass input_ids, attention_mask, and labels.
If you pass labels, the model will:
1. Automatically shift them to create decoder_input_ids
2. Create the decoder_attention_mask to match the decoder_input_ids
3. Handle masking and loss computation (ignoring -100 in labels)
So the full decoder setup is inferred internally — as long as you provide labels.
You do not need to manually include decoder_input_ids or decoder_attention_mask; they are derived automatically at runtime by the model (or by the data collator, if you use one that prepares them).
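For example, with a T5-style model (a sketch adapted from the standard denoising example; the sentinel-token strings are T5-specific), passing only labels is enough and the loss comes back directly:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Encoder side: noised text; attention_mask matches input_ids
enc = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt")
# Decoder side: only labels are needed (possibly a different length)
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>",
                   return_tensors="pt").input_ids

# No decoder_input_ids / decoder_attention_mask passed: the model shifts
# labels right internally to build decoder_input_ids and computes the loss.
outputs = model(input_ids=enc.input_ids,
                attention_mask=enc.attention_mask,
                labels=labels)
print(outputs.loss)
```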
So just to make it absolutely clear (just correct me if I'm wrong; ignore otherwise): I must pass attention_mask based on the noised text (input_ids) for the encoder, and I can leave the (possibly longer) decoder_attention_mask for the trainer to handle. Great!
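For reference, here's roughly what the dataset entries and collator could look like in that setup (token ids are made up; DataCollatorForSeq2Seq is the usual collator for seq2seq training and pads labels with -100):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Each example carries only the encoder-side mask; labels may be longer or shorter.
features = [
    {"input_ids": [37, 423, 32099, 1], "attention_mask": [1, 1, 1, 1],
     "labels": [32099, 423, 1176, 10, 58, 1]},
    {"input_ids": [100, 19, 1], "attention_mask": [1, 1, 1],
     "labels": [100, 19, 4347, 1]},
]

# Passing `model=` lets the collator also prepare decoder_input_ids from labels;
# label padding uses -100 so padded positions are ignored by the loss.
collator = DataCollatorForSeq2Seq(tokenizer, model=model, label_pad_token_id=-100)
batch = collator(features)
print(batch.keys())  # input_ids, attention_mask, labels, decoder_input_ids
```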