GPT2Model model output inconsistency between different transformers versions

We fine-tuned GPT2Model (distilgpt2) some time ago. Due to tool vulnerability issues, we have to upgrade to transformers 4.48.0 or above. However, the exact same GPT-2 model produces different outputs for the exact same input after the upgrade. It seems that the masked portion of the model output changed, while the unmasked portion stays the same. Therefore, after applying a classification head (a linear layer) on top of the GPT-2 output, we get different scores for the same input. Can anyone help point out what changed?
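For context, here is a minimal sketch of the kind of classification head involved. The pooling variants are assumptions for illustration only; they show why a head that touches masked (padding) positions produces different scores when only the masked part of the hidden states changes, while a mask-aware head does not.

import torch
import torch.nn as nn

# Hypothetical head for illustration: distilgpt2 hidden size is 768, two classes.
head = nn.Linear(768, 2)

def score(last_hidden_state, attention_mask):
    # Mean pooling over all positions (including left padding): if the backbone's
    # output at masked positions changes between versions, these scores change too.
    scores_all = head(last_hidden_state.mean(dim=1))
    # Mask-aware pooling: only unmasked positions contribute, so these scores stay
    # stable as long as the unmasked hidden states match across versions.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    pooled = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    return scores_all, head(pooled)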

The code to reproduce the results:
import torch
import tokenizers
import transformers
from transformers import GPT2Model, GPT2Tokenizer

Sample input

tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

text = 'Model output changed'
model_inputs = tokenizer(text, padding='max_length', max_length=12,
                         truncation=True, return_tensors="pt")
input_ids, attention_mask = model_inputs["input_ids"], model_inputs["attention_mask"]
print('input_ids:', input_ids)
print('mask:', attention_mask)

Load GPT-2 Model

model = GPT2Model.from_pretrained("distilgpt2")
model.eval()

Run model

with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)

last_hidden_state = outputs.last_hidden_state
print(last_hidden_state)
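To be sure each output is compared against the right environment, it can also help to print the library versions alongside the results:

import torch
import transformers

# Confirm exactly which versions are active in each environment.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)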

Here are the two requirements.txt files and the corresponding model outputs:
Before:
torch==2.4.0
transformers==4.41.0
huggingface_hub==0.27.1

input_ids: tensor([[50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 17633, 5072, 3421]])
mask: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]])
Model output:
tensor([[[-0.1352, 0.0991, -0.2160, …, -0.1755, -0.0512, -0.0338],
[-0.5171, -0.0978, -0.3561, …, -0.3091, 0.1552, -0.1503],
[-0.4233, -0.1778, -0.1415, …, -0.0925, 0.1203, -0.1014],
…,
[-0.3410, 0.2196, -0.1369, …, -0.4246, 0.3772, -0.4357],
[-0.6979, 0.1779, -1.0862, …, -0.5422, 0.1065, -0.2090],
[-0.5766, 0.1015, -0.2526, …, -1.4290, -0.1708, 0.1124]]])

After:
torch==2.4.0
transformers==4.42.0
huggingface_hub==0.27.1

input_ids: tensor([[50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 17633, 5072, 3421]])
mask: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]])
Model output:
tensor([[[-5.1260e-02, 1.1421e-01, -6.7051e-02, …, -8.8936e-02,
-7.6510e-02, 8.6264e-03],
[-1.5280e-01, -5.6395e-02, 2.1665e-01, …, 1.1190e-01,
2.2004e-02, -9.5938e-02],
[-1.1987e-01, -5.4886e-02, 2.0053e-01, …, 1.3524e-01,
-4.1297e-04, -8.2952e-02],
…,
[-3.4099e-01, 2.1960e-01, -1.3687e-01, …, -4.2462e-01,
3.7722e-01, -4.3574e-01],
[-6.9789e-01, 1.7786e-01, -1.0862e+00, …, -5.4218e-01,
1.0647e-01, -2.0897e-01],
[-5.7657e-01, 1.0148e-01, -2.5263e-01, …, -1.4290e+00,
-1.7080e-01, 1.1240e-01]]])


Possibly related to this phenomenon.

Also, the KV cache-related code has changed quite a bit recently, so that area is worth checking.

Thanks @John6666 for your input. I tried it, and it did not work. That thread was about resolving model output inconsistency between a batched run and a single run, whereas my issue is output inconsistency between different transformers versions (4.39.2 vs 4.48.0). Also, in my case the inconsistency lies in the masked portion only, not in the unmasked portion.
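Here is a minimal sketch of how the masked-only difference can be verified, assuming the last_hidden_state from each environment has been saved with torch.save (the file names below are hypothetical):

import torch

# Hypothetical file names: last_hidden_state saved from each environment.
before = torch.load("hidden_transformers_4.41.pt")
after = torch.load("hidden_transformers_4.42.pt")

# attention_mask from the tokenizer call above: 1 = real token, 0 = padding.
mask = torch.tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]]).bool()

# Unmasked (real-token) positions match closely across versions ...
print(torch.allclose(before[0][mask[0]], after[0][mask[0]], atol=1e-4))
# ... while the padded positions are where the difference shows up.
print(torch.allclose(before[0][~mask[0]], after[0][~mask[0]], atol=1e-4))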


After digging into it a little deeper, I found that the model output inconsistency was introduced between transformers v4.41.0 and v4.42.0.


Perhaps this? SDPA is now the default attention implementation.


Really appreciate your help @John6666. It worked after I switched back to the "eager" attention implementation with attn_implementation="eager".
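For anyone else hitting this, a minimal sketch of the workaround using the attn_implementation argument of from_pretrained:

from transformers import GPT2Model

# Force the eager attention implementation instead of the SDPA default used
# by newer transformers versions.
model = GPT2Model.from_pretrained("distilgpt2", attn_implementation="eager")
model.eval()

# Optional sanity check (private attribute, so it may change between releases):
print(model.config._attn_implementation)  # expected: "eager"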

