Decoder attention mask in text2text/seq2seq generation encoder-decoder models

Hi guys!

Suppose I have a batch of just two sentences (for simplicity) with different lengths (let len(sent_1) > len(sent_2)). For training I have to provide the labels parameter:

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(some_list_of_two_sents, padding=True)

Suppose I got

labels["input_ids"] = [[x,   x,   x,   x,   x,     x,   eos],  # sent_1
                       [x,   x,   x,   x,   eos,   pad, pad]]  # sent_2
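
and, correspondingly, the attention mask the tokenizer returns alongside it (1 for real tokens, 0 for padding):

labels["attention_mask"] = [[1,   1,   1,   1,   1,     1,   1],   # sent_1
                            [1,   1,   1,   1,   1,     0,   0]]   # sent_2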

In case of training conditional generation models (e.g. T5ForConditionalGeneration or BartForConditionalGeneration), when the user omits the decoder_input_ids parameter, it is created automatically by shifting the labels to the right:

decoder_input_ids = shift_tokens_right(labels["input_ids"], self.config.pad_token_id, 
                                       self.config.decoder_start_token_id)

so for our simple example

decoder_input_ids = [[bos,  x,   x,   x,   x,   x,   x ],   # sent_1
                     [bos,  x,   x,   x,   x,  eos, pad]]   # sent_2
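
For reference, this is roughly what that helper does (a simplified sketch of the version in modeling_bart.py; the real one additionally checks that pad_token_id is defined):

    import torch

    def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int, decoder_start_token_id: int):
        # Shift every row one position to the right (the last token is dropped)
        # and put the decoder start token (bos) at position 0.
        shifted = input_ids.new_zeros(input_ids.shape)
        shifted[:, 1:] = input_ids[:, :-1].clone()
        shifted[:, 0] = decoder_start_token_id
        # Label positions set to -100 (ignored by the loss) are replaced with
        # real pad ids, since the decoder needs valid token ids as input.
        shifted.masked_fill_(shifted == -100, pad_token_id)
        return shifted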

The question is: how can I deduce the decoder_attention_mask parameter?
So far I see two options here:

  1. decoder_attention_mask = labels["attention_mask"]
  2. decoder_attention_mask = some_manipulations_on(labels["attention_mask"]), maybe a right shift as well? (see the sketch right after this list)
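
To make the comparison concrete, this is how I would compute both candidates on the toy batch above (just an illustration; since no return_tensors was passed, labels["attention_mask"] here is a plain list of lists):

    # option 1: reuse the labels' padding mask as-is
    decoder_attention_mask_1 = labels["attention_mask"]
    # -> [[1, 1, 1, 1, 1, 1, 1],   # sent_1
    #     [1, 1, 1, 1, 1, 0, 0]]   # sent_2: the eos in decoder_input_ids is masked out

    # option 2: shift the mask the same way the labels were shifted
    # (prepend a 1 for the inserted bos, drop the last column)
    decoder_attention_mask_2 = [[1] + m[:-1] for m in labels["attention_mask"]]
    # -> [[1, 1, 1, 1, 1, 1, 1],   # sent_1
    #     [1, 1, 1, 1, 1, 1, 0]]   # sent_2: the eos in decoder_input_ids is attended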

In my opinion, option 1) should be the case, since the difference between labels and decoder_input_ids is the following (taking sent_2 from the example above):

labels            =       bla_bla + eos + pad + pad
decoder_input_ids = bos + bla_bla + eos + pad

and one can see that labels["attention_mask"] masks out the position where eos ends up in the shifted sequence and covers the inserted bos instead. But I'm not sure, which is why I am asking you.
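
For completeness, if option 1 is indeed the right choice, my training step would look something like this (t5-small is just an example checkpoint, and I pass return_tensors="pt" so everything is a tensor):

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    src = ["translate English to German: a first, longer source sentence",
           "translate English to German: a short one"]
    tgt = ["ein erster, laengerer Zielsatz", "ein kurzer"]

    inputs = tokenizer(src, padding=True, return_tensors="pt")
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(tgt, padding=True, return_tensors="pt")

    outputs = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        labels=labels["input_ids"],
        # option 1: reuse the labels' padding mask for the decoder
        decoder_attention_mask=labels["attention_mask"],
    )
    print(outputs.loss)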

Thanks!