Is T5 expected to ignore padding tokens in `decoder_input_ids` when `decoder_attention_mask` is not provided?

I’m currently trying to train a T5ForConditionalGeneration model for a seq2seq task, and I was wondering: can we expect T5 to internally ignore padding tokens in decoder_input_ids (e.g. by generating an attention mask) if we don’t explicitly provide decoder_attention_mask?

I noticed from the code that T5 simply creates an attention mask of all 1s if decoder_attention_mask is not provided, so it seems we are attending to padding tokens. I also ran a sanity check to see whether providing decoder_attention_mask makes any meaningful difference to the logits, and it does.

So I’m wondering whether this is by design, because it doesn’t seem to make sense to attend to padding tokens in batched forward passes.

Below is the sanity check that I ran (I know decoder_input_ids would normally differ from input_ids, but I figured that’s not important for this particular issue).

import torch
import transformers

model = transformers.T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = transformers.T5Tokenizer.from_pretrained("t5-base")
model.cuda()
model.eval()

texts = ["This is a test input.", "This is a test input to test T5 padding scheme."]
input_ids = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
input_ids.to("cuda")

with torch.inference_mode():
    # Shift decoder input ids to the right
    decoder_input_ids = model._shift_right(input_ids.input_ids)

    # Manually give correct attention mask
    with_attn_mask_logits = model(
        input_ids=input_ids.input_ids,
        attention_mask=input_ids.attention_mask,
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=torch.cat(
            (torch.tensor([[1], [1]], device="cuda"), input_ids.attention_mask[:, :-1]),
            dim=1,
        ),
    ).logits

    # Give an attention mask of all 1s explicitly
    all_1_attn_mask_logits = model(
        input_ids=input_ids.input_ids,
        attention_mask=input_ids.attention_mask,
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=torch.ones(decoder_input_ids.shape, device="cuda"),
    ).logits

    # Don't give an attention mask at all.
    no_attn_mask_logits = model(
        input_ids=input_ids.input_ids,
        attention_mask=input_ids.attention_mask,
        decoder_input_ids=decoder_input_ids,
    ).logits

    print(torch.all(torch.isclose(with_attn_mask_logits, no_attn_mask_logits)).item())  # False
    print(torch.equal(with_attn_mask_logits, no_attn_mask_logits))  # False

    print(torch.all(torch.isclose(all_1_attn_mask_logits, no_attn_mask_logits)).item())  # True
    print(torch.equal(all_1_attn_mask_logits, no_attn_mask_logits))  # True

Good catch. I think that in the decoder, one attends to all tokens (including padding), but as one needs to set the labels of the padding tokens to -100, they are not taken into account by the loss function.

But it’s weird indeed, as in the encoder one uses the attention mask to ignore padding tokens when calculating the attention scores.
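
To make that concrete, here is a minimal sketch of the usual labelling convention (the example texts and variable names are just illustrative). Replacing the pad token ids in the labels with -100 makes the cross-entropy loss skip those positions; when only labels are passed, the model derives decoder_input_ids by shifting the labels right and substituting the pad token for the -100 values.

import torch
import transformers

tokenizer = transformers.T5Tokenizer.from_pretrained("t5-base")
model = transformers.T5ForConditionalGeneration.from_pretrained("t5-base")
model.eval()

sources = ["translate English to German: This is a test.",
           "translate English to German: This is a much longer test input."]
targets = ["Das ist ein Test.", "Das ist eine viel längere Testeingabe."]

inputs = tokenizer(sources, padding=True, return_tensors="pt")
labels = tokenizer(targets, padding=True, return_tensors="pt").input_ids

# Replace pad token ids with -100 so the loss skips the padded positions.
labels[labels == tokenizer.pad_token_id] = -100

with torch.inference_mode():
    loss = model(input_ids=inputs.input_ids,
                 attention_mask=inputs.attention_mask,
                 labels=labels).loss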

cc @patrickvonplaten @valhalla

Hi @nielsr,

Thanks for the insight. If I may, I would like to ask a follow-up question:
Should I still convert the labels to -100 if I already provide the correct decoder_attention_mask?
I am using PyTorch.

i.e., is there any difference between loss_1 and loss_2 below?

loss_1 = model(
    input_ids=batch["input_ids"],
    labels=labels_original_with_0_pad,
    attention_mask=batch["attention_mask"],
    decoder_attention_mask=batch["target_attention_mask"],
).loss  # returns 12.38

loss_2 = model(
    input_ids=batch["input_ids"],
    labels=labels_with_neg_100,
    attention_mask=batch["attention_mask"],
).loss  # returns 0.041

But I found that there actually is a difference (12.38 vs. 0.041), so I am still a bit confused here.

Thanks!!

Ideally the labels should be -100 wherever there is a pad token, since you want to neither compute the loss on it nor attend to it.
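
That also explains the gap between loss_1 and loss_2: decoder_attention_mask only controls which positions can be attended to as keys, while every label that is not -100 still contributes a term to the cross-entropy. A small illustration (the token ids below are made up):

import torch

pad_id = 0
labels_with_0_pad = torch.tensor([[37, 19, 3, 9, 794, 1, pad_id, pad_id]])
labels_with_neg_100 = labels_with_0_pad.masked_fill(labels_with_0_pad == pad_id, -100)

# Number of positions the (mean-reduced) cross-entropy is computed over:
print((labels_with_0_pad != -100).sum().item())    # 8 -> the pad positions are scored too
print((labels_with_neg_100 != -100).sum().item())  # 6 -> the pad positions are skipped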

I examined the T5ForConditionalGeneration code and found that if you don’t specify decoder_attention_mask, the model creates one automatically by assigning a value of 1 to every token in the target sequence, including padding tokens. Nevertheless, because decoding is auto-regressive (causally masked), if the padding tokens are on the right and their labels are set to -100, the model won’t compute a loss for them, and the non-padding tokens, which are decoded earlier, never attend to the padding tokens that come after them.

However, if padding tokens sit between non-padding tokens (for some reason), you should provide an attention mask even if you set their labels to -100, because the non-padding tokens decoded after the padding tokens would otherwise attend to them when their loss is computed.
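
For that case, here is a sketch of how one could pass an explicit mask, mirroring the shifted construction from the first post (the source/target texts are just placeholders):

import torch
import transformers

tokenizer = transformers.T5Tokenizer.from_pretrained("t5-base")
model = transformers.T5ForConditionalGeneration.from_pretrained("t5-base")

source = tokenizer(["This is a test input.", "A second, longer test input."],
                   padding=True, return_tensors="pt")
target = tokenizer(["Target one.", "A longer target sequence."],
                   padding=True, return_tensors="pt")

labels = target.input_ids.clone()
labels[labels == tokenizer.pad_token_id] = -100

# The model builds decoder_input_ids by shifting the labels right, so shift the target
# attention mask the same way: prepend a column of 1s for the start token, drop the last column.
decoder_attention_mask = torch.cat(
    (torch.ones_like(target.attention_mask[:, :1]), target.attention_mask[:, :-1]), dim=1
)

outputs = model(
    input_ids=source.input_ids,
    attention_mask=source.attention_mask,
    labels=labels,
    decoder_attention_mask=decoder_attention_mask,
)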
