Is T5 expected to ignore padding tokens in `decoder_input_ids` when `decoder_attention_mask` is not provided?

I’m currently trying to train a T5ForConditionalGeneration model for a seq2seq task, and I was wondering: can we expect T5 to internally ignore padding tokens in decoder_input_ids (e.g. by generating an attention mask) if we don’t explicitly provide decoder_attention_mask?

I noticed from the code that T5 simply creates an attention mask of all 1s if decoder_attention_mask is not provided, so it seems we’re attending to padding tokens. I also ran a sanity check to see whether providing decoder_attention_mask makes any meaningful difference to the logits, and it does.
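For reference, the default I’m referring to lives in T5Stack.forward; paraphrased (so the exact lines may differ between transformers versions), it boils down to something like this standalone sketch:

import torch

# Paraphrase of the default in T5Stack.forward (not verbatim, version-dependent):
# when no decoder_attention_mask is passed, the decoder stack fills it with ones,
# so padded positions are attended to.
batch_size, mask_seq_length = 2, 16  # example shapes
decoder_attention_mask = None
if decoder_attention_mask is None:
    decoder_attention_mask = torch.ones(batch_size, mask_seq_length)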

So I’m wondering whether this is by design, because it doesn’t seem to make sense to attend to padding tokens in batched passes.

Below is the sanity check that I ran (I know decoder_input_ids is normally supposed to be different from input_ids, but I figured that’s not important for this particular issue).

import torch
import transformers

model = transformers.T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = transformers.T5Tokenizer.from_pretrained("t5-base")
model.cuda()
model.eval()

texts = ["This is a test input.", "This is a test input to test T5 padding scheme."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
inputs = inputs.to("cuda")

with torch.inference_mode():
    # Shift decoder input ids to the right
    decoder_input_ids = model._shift_right(inputs.input_ids)

    # Manually give a correct decoder attention mask
    # (the encoder mask shifted right, matching decoder_input_ids)
    with_attn_mask_logits = model(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=torch.cat((
            torch.ones_like(inputs.attention_mask[:, :1]),
            inputs.attention_mask[:, :-1]), dim=1
        ),
    ).logits

    # Give an attention mask of all 1s explicitly
    all_1_attn_mask_logits = model(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=torch.ones_like(decoder_input_ids),
    ).logits

    # Don't give a decoder attention mask at all
    no_attn_mask_logits = model(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        decoder_input_ids=decoder_input_ids,
    ).logits

    print(torch.allclose(with_attn_mask_logits, no_attn_mask_logits))  # False
    print(torch.equal(with_attn_mask_logits, no_attn_mask_logits))  # False

    print(torch.allclose(all_1_attn_mask_logits, no_attn_mask_logits))  # True
    print(torch.equal(all_1_attn_mask_logits, no_attn_mask_logits))  # True

Good catch. I think that in the decoder one attends to all tokens (including padding), but since the labels of the padding tokens need to be set to -100, those positions are not taken into account by the loss function.
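A minimal sketch of that label masking, reusing tokenizer, model, and inputs from the snippet above and assuming a hypothetical target_texts list; the loss then skips the padded positions even though attention does not:

# Sketch: set padded label positions to -100 so the loss ignores them.
# `target_texts` is a hypothetical list of target strings.
targets = tokenizer(target_texts, padding=True, truncation=True, return_tensors="pt").to("cuda")
labels = targets.input_ids.clone()
labels[labels == tokenizer.pad_token_id] = -100  # -100 is the ignore_index of the loss

outputs = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    labels=labels,  # decoder_input_ids are created internally by shifting the labels right
)
loss = outputs.loss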

But it’s weird indeed, as in the encoder one uses the attention mask to ignore padding tokens when calculating the attention scores.
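If you do want the decoder to skip padding the way the encoder does, one option (just a sketch, not an official recommendation) is to derive the mask from the shifted decoder inputs; note that T5 reuses the pad token as decoder_start_token_id, so position 0 has to stay unmasked:

# Sketch: explicit decoder_attention_mask derived from the shifted decoder inputs
# (decoder_input_ids and tokenizer as in the snippet above).
# T5's decoder_start_token_id equals pad_token_id, so a plain comparison would
# mask position 0; force it back to 1.
decoder_attention_mask = (decoder_input_ids != tokenizer.pad_token_id).long()
decoder_attention_mask[:, 0] = 1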

cc @patrickvonplaten @valhalla