Issue also available here: XLMProphetNet returning different results when using padding · Issue #24289 · huggingface/transformers · GitHub
I'm noticing that the XLMProphetNet model returns different decoder outputs for the same decoder_input_ids depending on whether they are padded. As a result, the loss computed for a batch differs from the loss computed for each sample individually without padding.
Code to reproduce:
from transformers import XLMProphetNetTokenizer, XLMProphetNetForConditionalGeneration
import torch
tokenizer = XLMProphetNetTokenizer.from_pretrained("microsoft/xprophetnet-large-wiki100-cased-xglue-ntg")
model = XLMProphetNetForConditionalGeneration.from_pretrained("microsoft/xprophetnet-large-wiki100-cased-xglue-ntg").eval()
enc_input = tokenizer("test", return_tensors="pt")
input_ids = enc_input.input_ids
attention_mask = enc_input.attention_mask
dec_input_ids = torch.tensor([[model.config.decoder_start_token_id]], dtype=torch.int64)
dec_attention_mask = torch.tensor([[1]], dtype=torch.int64)
dec_input_ids_pad = torch.tensor([[model.config.decoder_start_token_id, model.config.pad_token_id]], dtype=torch.int64)
dec_attention_mask_pad = torch.tensor([[1, 0]], dtype=torch.int64)
out1 = model(
    input_ids=input_ids, attention_mask=attention_mask,
    decoder_input_ids=dec_input_ids, decoder_attention_mask=dec_attention_mask
)
out2 = model(
    input_ids=input_ids, attention_mask=attention_mask,
    decoder_input_ids=dec_input_ids_pad, decoder_attention_mask=dec_attention_mask_pad
)
torch.isclose(out1.logits, out2.logits[:, 0], atol=1e-1).all()  # evaluates to tensor(False)
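To see how large the mismatch is rather than only getting a boolean, something like the following might help. It is only a rough sketch: `max_abs_diff_on_real_tokens` is a hypothetical helper name, not part of transformers, and the demo uses small synthetic tensors in place of the real model outputs.

```python
import torch


def max_abs_diff_on_real_tokens(logits_a, logits_b, mask_a):
    """Largest absolute logit difference over the positions mask_a marks as real.

    logits_a: (batch, len_a, vocab) from the unpadded run
    logits_b: (batch, len_b, vocab) from the padded run, with len_b >= len_a
    mask_a:   (batch, len_a), 1 for real tokens and 0 for padding
    """
    len_a = logits_a.shape[1]
    # Compare only the positions both runs share, reducing over the vocab axis.
    diff = (logits_a - logits_b[:, :len_a]).abs().amax(dim=-1)  # (batch, len_a)
    return diff[mask_a.bool()].max().item()


# Synthetic stand-ins for out1.logits / out2.logits (assumed shapes only).
demo_a = torch.zeros(1, 1, 4)
demo_b = torch.zeros(1, 2, 4)
demo_b[0, 0, 2] = 0.5  # mismatch at the real token position
demo_b[0, 1, :] = 9.0  # values at the padded position, which are not compared
demo_mask = torch.tensor([[1]])
demo_diff = max_abs_diff_on_real_tokens(demo_a, demo_b, demo_mask)
```

With the reproduction above, the call would be `max_abs_diff_on_real_tokens(out1.logits, out2.logits, dec_attention_mask)`.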
Is this expected? During training, runs that don't use any padding (i.e. that use accumulation iterations) converge, whereas all of my runs that use padding fail to train properly.
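For reference, this is why I would expect the batched and per-sample losses to agree. The sketch below uses a plain token-level cross-entropy with `ignore_index=-100`; this is an assumption for illustration, not XLMProphetNet's actual loss code, and `token_loss` is a hypothetical helper. If the logits at the real positions are identical and the padded label positions are set to -100, padding should not change the loss.

```python
import torch
import torch.nn.functional as F


def token_loss(logits, labels):
    # Mean cross-entropy over all label positions that are not -100.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )


torch.manual_seed(0)
vocab_size = 5  # toy vocabulary, not the real model's

# One real target token, no padding.
logits = torch.randn(1, 1, vocab_size)
labels = torch.tensor([[3]])

# Same real position plus one padded position with arbitrary logits.
pad_logits = torch.randn(1, 1, vocab_size)
logits_padded = torch.cat([logits, pad_logits], dim=1)
labels_padded = torch.tensor([[3, -100]])

loss_unpadded = token_loss(logits, labels)
loss_padded = token_loss(logits_padded, labels_padded)
```

Here the two losses match because the ignored position contributes nothing, so any difference in the real model would have to come from the logits at the non-padded positions themselves.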