XLMProphetNet returning different results when using padding

Issue also available here: XLMProphetNet returning different results when using padding · Issue #24289 · huggingface/transformers · GitHub

I'm noticing a behavior with the XLMProphetNet model where padding the decoder_input_ids changes the decoder output at the non-padded positions, even though the pad position is masked out in the decoder attention mask. This results in a different loss being computed for a padded batch versus each sample individually without padding.

Code to reproduce:

from transformers import XLMProphetNetTokenizer, XLMProphetNetForConditionalGeneration
import torch

tokenizer = XLMProphetNetTokenizer.from_pretrained("microsoft/xprophetnet-large-wiki100-cased-xglue-ntg")
model = XLMProphetNetForConditionalGeneration.from_pretrained("microsoft/xprophetnet-large-wiki100-cased-xglue-ntg").eval()

# Encode a dummy source sequence
enc_input = tokenizer("test", return_tensors="pt")
input_ids = enc_input.input_ids
attention_mask = enc_input.attention_mask

# Decoder input containing only the decoder start token
dec_input_ids = torch.tensor([[model.config.decoder_start_token_id]], dtype=torch.int64)
dec_attention_mask = torch.tensor([[1]], dtype=torch.int64)

# The same decoder input with one extra pad token, masked out in the attention mask
dec_input_ids_pad = torch.tensor([[model.config.decoder_start_token_id, model.config.pad_token_id]], dtype=torch.int64)
dec_attention_mask_pad = torch.tensor([[1, 0]], dtype=torch.int64)

out1 = model(
    input_ids=input_ids, attention_mask=attention_mask,
    decoder_input_ids=dec_input_ids, decoder_attention_mask=dec_attention_mask
)

out2 = model(
    input_ids=input_ids, attention_mask=attention_mask,
    decoder_input_ids=dec_input_ids_pad, decoder_attention_mask=dec_attention_mask_pad
)

# The logits at the first (non-padded) decoder position should match, but they don't,
# even with a loose tolerance
torch.isclose(out1.logits, out2.logits[:, 0], atol=1e-1).all()  # tensor(False)
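
The same discrepancy shows up in the loss. Here is a minimal sketch; the target id 42 is arbitrary and only serves to show that the same label yields a different cross-entropy depending on whether the padded or unpadded logits are used:

import torch.nn.functional as F

# Arbitrary target token id for the first decoder position (illustration only)
target = torch.tensor([42])

# Cross-entropy over the logits of the first decoder position in both runs
loss_unpadded = F.cross_entropy(out1.logits[:, 0, :], target)
loss_padded = F.cross_entropy(out2.logits[:, 0, :], target)

print(loss_unpadded.item(), loss_padded.item())  # the two values differ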

Is this expected? During training I'm noticing that runs that don't use any padding (using gradient accumulation iterations instead) converge, whereas any of my runs that use padding fail to train properly.

@patrickvonplaten Do you have any insight on why this is happening? Thanks!