We fine-tuned GPT2Model (distilgpt2) some time ago. After the upgrade, the exact same GPT-2 model produces different outputs for the exact same input, so the classification head (a linear layer) applied on top of the GPT-2 output now returns different scores for the same input. It looks to me like the masked (padding) portion of the model output changed while the unmasked portion stayed the same. In a past upgrade, the default value of attn_implementation changed from "eager" to "sdpa" (see my previous topic). Due to tool vulnerability issues, we have to upgrade to transformers 4.52.3 or above. This time I already specified attn_implementation="eager", yet I still get different results after the upgrade. Can anyone point out what changed?
The code to reproduce the results:
import torch
from transformers import GPT2Model, GPT2Tokenizer

# Sample input: left-pad with the EOS token to a fixed length of 12
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
text = "DAVID DAVIS"
model_inputs = tokenizer(text, padding="max_length", max_length=12, truncation=True, return_tensors="pt")
input_ids, attention_mask = model_inputs["input_ids"], model_inputs["attention_mask"]
print("input_ids:", input_ids)
print("mask:", attention_mask)

# Load GPT-2 model, pinning the eager attention implementation
model = GPT2Model.from_pretrained("distilgpt2", attn_implementation="eager")

# Run model
model.eval()
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
last_hidden_state = outputs.last_hidden_state
print(last_hidden_state)
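To make the comparison concrete, here is a minimal sketch of how I check which positions actually changed between the two environments. The file names "before.pt" and "after.pt" are hypothetical; the idea is to dump last_hidden_state once per environment, then compare the masked and unmasked positions separately:

import torch

# Sanity check: confirm eager attention is actually in effect
# (_attn_implementation is a private config attribute in recent versions)
print(model.config._attn_implementation)

# Save the hidden states once per environment; "before.pt" / "after.pt"
# are hypothetical names used only for this comparison.
torch.save(last_hidden_state, "after.pt")

# Then, in either environment, load both dumps and compare the two regions.
before = torch.load("before.pt")
after = torch.load("after.pt")
valid = attention_mask[0].bool()
print("unmasked rows equal:", torch.allclose(before[0, valid], after[0, valid], atol=1e-5))
print("padded rows equal:  ", torch.allclose(before[0, ~valid], after[0, ~valid], atol=1e-5))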
Here are the two requirements.txt files and the corresponding model outputs:
Before:
torch==2.6.0
transformers==4.50.0
huggingface_hub==0.33.4
input_ids: tensor([[50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 5631, 11008, 42274, 1797]])
mask: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])
Model output: tensor([[[-3.1153e-01,  1.1569e-01,  2.4667e-02,  ..., -1.6813e-01, -1.9119e-01, -4.2739e-02],
         [-8.7119e-01,  2.1186e-04,  5.6834e-01,  ..., -1.1233e-01, -4.8243e-01,  4.7066e-02],
         [-7.1241e-01, -4.7743e-02,  5.6767e-01,  ...,  1.0435e-02, -4.7335e-01,  2.1707e-04],
         ...,
         [-1.3753e+00,  2.9666e-01,  5.7950e-01,  ..., -6.4851e-01, -1.2977e+00, -8.4751e-02],
         [-1.2291e+00,  1.6299e-01,  4.4637e-01,  ..., -5.1411e-01, -6.0615e-01,  4.3908e-01],
         [-1.3633e+00,  8.3929e-02,  5.4821e-01,  ..., -5.7178e-01, -6.4784e-01,  4.6220e-01]]])
After:
torch==2.6.0
transformers==4.52.3
huggingface_hub==0.33.4
input_ids: tensor([[50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 5631, 11008, 42274, 1797]])
mask: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])
Model output: tensor([[[-0.0724,  0.4212,  0.0130,  ..., -0.1462,  0.1229, -0.0698],
         [-0.0360,  0.4466, -0.0973,  ..., -0.0136,  0.1273, -0.0545],
         [ 0.0104,  0.3948, -0.0099,  ...,  0.0273,  0.1091, -0.0364],
         ...,
         [-1.3753,  0.2967,  0.5795,  ..., -0.6485, -1.2978, -0.0848],
         [-1.2291,  0.1630,  0.4464,  ..., -0.5141, -0.6062,  0.4391],
         [-1.3633,  0.0839,  0.5482,  ..., -0.5718, -0.6479,  0.4622]]])
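Note that the last rows of the two dumps (the unmasked positions) agree to the printed precision; only the padded rows differ. So if the downstream scores changed, I suspect the head is reading positions that include padding. For reference, a mask-aware pooling sketch that keeps the score independent of the padded rows; the classifier below is a hypothetical stand-in for our fine-tuned linear head:

import torch

# With padding_side="left", the final position is always a real token:
pooled = last_hidden_state[:, -1]

# Alternatively, mean-pool over valid positions only, so hidden states at
# padded positions can never influence the score:
mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
pooled = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

# Hypothetical stand-in for the fine-tuned classification head (2 classes assumed)
classifier = torch.nn.Linear(last_hidden_state.size(-1), 2)
scores = classifier(pooled)
print(scores)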