XLMProphetNet returning different results when using padding

Issue also available here: XLMProphetNet returning different results when using padding · Issue #24289 · huggingface/transformers · GitHub

I'm noticing a behavior with the XLMProphetNet model where padding the decoder_input_ids changes the decoder output at the non-padded positions, even though the pad position is masked out in the decoder attention mask. This results in a different loss being computed for a padded batch versus each sample individually without padding.

Code to reproduce:

from transformers import XLMProphetNetTokenizer, XLMProphetNetForConditionalGeneration
import torch

tokenizer = XLMProphetNetTokenizer.from_pretrained("microsoft/xprophetnet-large-wiki100-cased-xglue-ntg")
model = XLMProphetNetForConditionalGeneration.from_pretrained("microsoft/xprophetnet-large-wiki100-cased-xglue-ntg").eval()

# Encode a dummy source sequence
enc_input = tokenizer("test", return_tensors="pt")
input_ids = enc_input.input_ids
attention_mask = enc_input.attention_mask

# Decoder input containing only the decoder start token
dec_input_ids = torch.tensor([[model.config.decoder_start_token_id]], dtype=torch.int64)
dec_attention_mask = torch.tensor([[1]], dtype=torch.int64)

# The same decoder input with one extra pad token, masked out in the attention mask
dec_input_ids_pad = torch.tensor([[model.config.decoder_start_token_id, model.config.pad_token_id]], dtype=torch.int64)
dec_attention_mask_pad = torch.tensor([[1, 0]], dtype=torch.int64)

out1 = model(
    input_ids=input_ids, attention_mask=attention_mask,
    decoder_input_ids=dec_input_ids, decoder_attention_mask=dec_attention_mask
)

out2 = model(
    input_ids=input_ids, attention_mask=attention_mask,
    decoder_input_ids=dec_input_ids_pad, decoder_attention_mask=dec_attention_mask_pad
)

# The logits at the first (non-padded) decoder position should match, but they don't,
# even with a loose tolerance
torch.isclose(out1.logits, out2.logits[:, 0], atol=1e-1).all()  # tensor(False)
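
The same discrepancy shows up in the loss. Here is a minimal sketch; the target id 42 is arbitrary and only serves to show that the same label yields a different cross-entropy depending on whether the padded or unpadded logits are used:

import torch.nn.functional as F

# Arbitrary target token id for the first decoder position (illustration only)
target = torch.tensor([42])

# Cross-entropy over the logits of the first decoder position in both runs
loss_unpadded = F.cross_entropy(out1.logits[:, 0, :], target)
loss_padded = F.cross_entropy(out2.logits[:, 0, :], target)

print(loss_unpadded.item(), loss_padded.item())  # the two values differ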

Is this expected? During training I'm noticing that runs that don't use any padding (using gradient accumulation iterations instead) converge, whereas any of my runs that use padding fail to train properly.

@patrickvonplaten Do you have any insight on why this is happening? Thanks!