GPT2Model output inconsistency between transformers versions

We fine-tuned GPT2Model (distilgpt2) some time ago. Due to tooling vulnerability issues, we have to upgrade transformers to 4.48.0 or above. However, the exact same GPT-2 model produces different outputs for the exact same input after the upgrade. It looks like the masked (padded) portion of the model output changed, while the unmasked portion stays the same. As a result, after applying a classification head (a linear layer) on top of the GPT-2 output, we get different scores for the same input. Can anyone point out what changed?
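
For context, the scores come from a linear head on top of the hidden states. The sketch below is a hypothetical illustration (not our exact code) of why the padded part of last_hidden_state can affect the score when the pooling is not masked:

import torch

# Hypothetical classification head: a linear layer over a pooled hidden state
# (illustration only; hidden size 768 matches distilgpt2).
head = torch.nn.Linear(768, 2)

def score_naive(last_hidden_state):
    # Mean over all positions: values at padded positions leak into the score,
    # so a change in the masked part of the output changes the score.
    return head(last_hidden_state.mean(dim=1))

def score_masked(last_hidden_state, attention_mask):
    # Mean over unmasked positions only: insensitive to whatever the model
    # writes at padded positions.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    pooled = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    return head(pooled)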

The code to reproduce the results:
import torch
import tokenizers
import transformers
from transformers import GPT2Model, GPT2Tokenizer

# Sample input

tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

text = 'Model output changed'
model_inputs = tokenizer(text, padding='max_length', max_length=12,
                         truncation=True, return_tensors="pt")
input_ids, attention_mask = model_inputs["input_ids"], model_inputs["attention_mask"]
print('input_ids:', input_ids)
print('mask:', attention_mask)

# Load GPT-2 model

model = GPT2Model.from_pretrained("distilgpt2")
model.eval()

# Run model

with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)

last_hidden_state = outputs.last_hidden_state
print(last_hidden_state)

Here are the two requirements.txt files and the corresponding model outputs:
Before:
torch==2.4.0
transformers==4.41.0
huggingface_hub==0.27.1

input_ids: tensor([[50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 17633, 5072, 3421]])
mask: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]])
Model output:
tensor([[[-0.1352, 0.0991, -0.2160, ..., -0.1755, -0.0512, -0.0338],
[-0.5171, -0.0978, -0.3561, ..., -0.3091, 0.1552, -0.1503],
[-0.4233, -0.1778, -0.1415, ..., -0.0925, 0.1203, -0.1014],
...,
[-0.3410, 0.2196, -0.1369, ..., -0.4246, 0.3772, -0.4357],
[-0.6979, 0.1779, -1.0862, ..., -0.5422, 0.1065, -0.2090],
[-0.5766, 0.1015, -0.2526, ..., -1.4290, -0.1708, 0.1124]]])

After:
torch==2.4.0
transformers==4.42.0
huggingface_hub==0.27.1

input_ids: tensor([[50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 17633, 5072, 3421]])
mask: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]])
Model output:
tensor([[[-5.1260e-02, 1.1421e-01, -6.7051e-02, ..., -8.8936e-02,
-7.6510e-02, 8.6264e-03],
[-1.5280e-01, -5.6395e-02, 2.1665e-01, ..., 1.1190e-01,
2.2004e-02, -9.5938e-02],
[-1.1987e-01, -5.4886e-02, 2.0053e-01, ..., 1.3524e-01,
-4.1297e-04, -8.2952e-02],
...,
[-3.4099e-01, 2.1960e-01, -1.3687e-01, ..., -4.2462e-01,
3.7722e-01, -4.3574e-01],
[-6.9789e-01, 1.7786e-01, -1.0862e+00, ..., -5.4218e-01,
1.0647e-01, -2.0897e-01],
[-5.7657e-01, 1.0148e-01, -2.5263e-01, ..., -1.4290e+00,
-1.7080e-01, 1.1240e-01]]])
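
For reference, this is roughly how I compared the two runs (a minimal sketch; out_old and out_new are names I'm using here for the two last_hidden_state tensors above, saved from each environment):

import torch

# keep = positions that are actually attended to (attention_mask == 1)
keep = attention_mask[0].bool()

# Unmasked positions agree between the two versions (prints True in my runs)...
print(torch.allclose(out_old[0, keep], out_new[0, keep], atol=1e-4))
# ...while the padded (masked-out) positions do not (prints False in my runs).
print(torch.allclose(out_old[0, ~keep], out_new[0, ~keep], atol=1e-4))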
