Hi guys!
Suppose I have a batch of just two sentences (for simplicity) with different lengths (say len(sent_1) > len(sent_2)). For training I have to provide the labels parameter:
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(some_list_of_two_sents, padding=True)
Suppose I get
labels["input_ids"] = [[x,   x,   x,   x,   x,     x,   eos],  # sent_1
                       [x,   x,   x,   x,   eos,   pad, pad]]  # sent_2
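For reference, here is a minimal runnable version of this setup (I picked t5-small and two arbitrary sentences just to make the toy example concrete, those choices are mine):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
some_list_of_two_sents = ["the first, noticeably longer target sentence", "a short one"]

with tokenizer.as_target_tokenizer():
    labels = tokenizer(some_list_of_two_sents, padding=True)

print(labels["input_ids"])       # padded ids; the shorter sentence ends with eos followed by pads
print(labels["attention_mask"])  # 1 for real tokens (including eos), 0 for the pads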
In the case of training conditional generation models (e.g. T5ForConditionalGeneration or BartForConditionalGeneration), when the user omits the decoder_input_ids parameter, it is created automatically by shifting the labels to the right:
decoder_input_ids = shift_tokens_right(labels["input_ids"], self.config.pad_token_id, 
                                       self.config.decoder_start_token_id)
so for our simple example (writing start for the decoder_start_token_id, which is not necessarily a bos token):
decoder_input_ids = [[start, x,   x,   x,   x,   x,   x  ],   # sent_1
                     [start, x,   x,   x,   x,   eos, pad]]   # sent_2
The question is: how can I compute/derive the decoder_attention_mask parameter?
So far I see two options here:
1) decoder_attention_mask = labels["attention_mask"]
2) decoder_attention_mask = some_manipulations_on(labels["attention_mask"]), maybe a right shift as well?
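In code, the two options would look roughly like this (my interpretation of 2) is "shift the mask the same way as the ids and unmask the inserted start token" — that reading is just my guess):

import torch

labels_attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1, 1],   # sent_1
                                      [1, 1, 1, 1, 1, 0, 0]])  # sent_2

# option 1): reuse the labels mask unchanged
decoder_attention_mask_1 = labels_attention_mask

# option 2): shift the mask right together with the ids and set position 0 to 1
decoder_attention_mask_2 = torch.zeros_like(labels_attention_mask)
decoder_attention_mask_2[:, 1:] = labels_attention_mask[:, :-1]
decoder_attention_mask_2[:, 0] = 1

print(decoder_attention_mask_1)
# tensor([[1, 1, 1, 1, 1, 1, 1],
#         [1, 1, 1, 1, 1, 0, 0]])
print(decoder_attention_mask_2)
# tensor([[1, 1, 1, 1, 1, 1, 1],
#         [1, 1, 1, 1, 1, 1, 0]])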
In my opinion, 1) should be the case, since the difference between labels and decoder_input_ids is the following:
labels            = bla_bla + eos + pads
decoder_input_ids = start + bla_bla + pads
and one can see that labels["attention_mask"] ignores the eos in the shifted labels and instead takes the inserted start token into account. But I'm not sure, which is why I'm asking you.
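To make the question concrete with the padded sentence from the example above, the three tensors line up like this:

labels["input_ids"][1]      = [x,     x,   x,   x,   eos, pad, pad]
labels["attention_mask"][1] = [1,     1,   1,   1,   1,   0,   0  ]
decoder_input_ids[1]        = [start, x,   x,   x,   x,   eos, pad]

so, if I see it correctly, the two options only differ in whether the position holding the shifted eos gets a 1 or a 0.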
Thanks!