GPT2Model model output inconsistency between different transformers versions

Wenzhong2005 · March 21, 2025, 5:36pm

We fine-tuned a GPT2Model (distilgpt2) some time ago. Due to tool vulnerability issues, we have to upgrade transformers to 4.48.0 or above. However, the exact same GPT-2 model produces different outputs for the exact same input after the upgrade. It seems to me that the masked portion of the model output changed, while the unmasked portion stays the same. As a result, after applying a classification head (linear layer) on top of the GPT-2 output, we get different scores for the same input. Can anyone help point out what has changed?

The code to reproduce the results:
import torch
import tokenizers
import transformers
from transformers import GPT2Model, GPT2Tokenizer

# Sample input

tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

text = 'Model output changed'
model_inputs = tokenizer(text, padding='max_length', max_length=12,
                         truncation=True, return_tensors="pt")
input_ids, attention_mask = model_inputs["input_ids"], model_inputs["attention_mask"]
print('input_ids:', input_ids)
print('mask:', attention_mask)

# Load GPT-2 Model

model = GPT2Model.from_pretrained("distilgpt2")
model.eval()

# Run model

with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)

last_hidden_state = outputs.last_hidden_state
print(last_hidden_state)

Here are the 2 requirements.txt files and model outputs:
Before:
torch==2.4.0
transformers==4.41.0
huggingface_hub==0.27.1

input_ids: tensor([[50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 17633, 5072, 3421]])
mask: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]])
Model output:
tensor([[[-0.1352,  0.0991, -0.2160,  ..., -0.1755, -0.0512, -0.0338],
         [-0.5171, -0.0978, -0.3561,  ..., -0.3091,  0.1552, -0.1503],
         [-0.4233, -0.1778, -0.1415,  ..., -0.0925,  0.1203, -0.1014],
         ...,
         [-0.3410,  0.2196, -0.1369,  ..., -0.4246,  0.3772, -0.4357],
         [-0.6979,  0.1779, -1.0862,  ..., -0.5422,  0.1065, -0.2090],
         [-0.5766,  0.1015, -0.2526,  ..., -1.4290, -0.1708,  0.1124]]])

After:
torch==2.4.0
transformers==4.42.0
huggingface_hub==0.27.1

input_ids: tensor([[50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 17633, 5072, 3421]])
mask: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]])
Model output:
tensor([[[-5.1260e-02,  1.1421e-01, -6.7051e-02,  ..., -8.8936e-02,
          -7.6510e-02,  8.6264e-03],
         [-1.5280e-01, -5.6395e-02,  2.1665e-01,  ...,  1.1190e-01,
           2.2004e-02, -9.5938e-02],
         [-1.1987e-01, -5.4886e-02,  2.0053e-01,  ...,  1.3524e-01,
          -4.1297e-04, -8.2952e-02],
         ...,
         [-3.4099e-01,  2.1960e-01, -1.3687e-01,  ..., -4.2462e-01,
           3.7722e-01, -4.3574e-01],
         [-6.9789e-01,  1.7786e-01, -1.0862e+00,  ..., -5.4218e-01,
           1.0647e-01, -2.0897e-01],
         [-5.7657e-01,  1.0148e-01, -2.5263e-01,  ..., -1.4290e+00,
          -1.7080e-01,  1.1240e-01]]])

John6666 · March 21, 2025, 6:31pm

This is possibly related to this phenomenon.

Also, the KV cache-related code seems to have changed quite a bit recently.

Wenzhong2005 · March 21, 2025, 8:36pm

Thanks @John6666 for your input. I tried it and it did not work. They were trying to resolve an output inconsistency between batched and single runs, but my issue is an output inconsistency between different transformers versions (4.39.2 vs 4.48.0). Also, the inconsistency is only in the masked portion, not in the unmasked portion.
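One way to check the "masked portion only" observation numerically is sketched below. It assumes the `last_hidden_state` tensors from the two environments have been saved (for example with `torch.save`) and loaded side by side together with the `attention_mask`. The helper name `max_abs_diff_by_mask` is hypothetical and not part of any library.

```python
import torch


def _max_or_zero(values):
    # Guard against an empty selection (e.g. a batch with no padding at all).
    return values.max().item() if values.numel() > 0 else 0.0


def max_abs_diff_by_mask(old, new, attention_mask):
    """Compare two (batch, seq, hidden) tensors separately on real and padded tokens.

    Returns (max abs diff on unmasked tokens, max abs diff on masked tokens).
    Hypothetical helper for illustration only.
    """
    per_token = (old - new).abs().amax(dim=-1)  # shape: (batch, seq)
    keep = attention_mask.bool()                # True where the token is real
    return _max_or_zero(per_token[keep]), _max_or_zero(per_token[~keep])
```

If only the padding positions changed between versions, the first returned value should be close to zero while the second is not.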

Wenzhong2005 · March 21, 2025, 10:23pm

After digging into it a little deeper, I found that the model output inconsistency was introduced between transformers v4.41.0 and v4.42.0.

John6666 · March 22, 2025, 4:55am

Perhaps this? SDPA is now the default attention implementation.

github.com/huggingface/transformers

[`GPT2`] Add SDPA support (#31172)

committed 07:40AM - 19 Jun 24 UTC

vasqu

+191 -11

* `gpt2` sdpa support
* fix (at least) one test, style, repo consistency
* fix sdpa mask in forward --> fixes generation
* test
* test2
* test3
* test4
* simplify shapes for attn mask creation and small comments
* hub fail test
* benchmarks
* flash attn 2 mask should not be inverted on enc-dec setup
* fix comment
* apply some suggestion from code review
  - only save _attn_implentation once
  - remove unnecessary comment
* change elif logic
* [run-slow] gpt2
* modify `test_gpt2_sample_max_time` to follow previous assertion patterns

Wenzhong2005 · March 22, 2025, 6:25pm

Really appreciate your help @John6666. It worked after I switched back to the "eager" attention with attn_implementation="eager".

