Inconsistent GPT2Model results between transformers versions

We fine-tuned GPT2Model (distilgpt2) some time ago. After the upgrade, the exact same GPT-2 model produces different outputs for the exact same input, so the classification head (a linear layer) we apply on top of the GPT-2 output now yields different scores for the same input. It looks to me like the masked portion of the model output changed while the unmasked portion stays the same. In a previous upgrade we saw the default value of attn_implementation change from "eager" to "sdpa" (see my previous topic). Due to tooling vulnerability issues we have to upgrade to transformers 4.52.3 or above. This time I explicitly specified attn_implementation="eager", yet I still get different results after the upgrade. Can anyone help point to what changed?

The code to reproduce the results:
import torch
import tokenizers
import transformers
from transformers import GPT2Model, GPT2Tokenizer

# Sample input
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'

text = 'DAVID DAVIS'
model_inputs = tokenizer(text, padding='max_length', max_length=12, truncation=True, return_tensors='pt')
input_ids, attention_mask = model_inputs['input_ids'], model_inputs['attention_mask']
print('input_ids:', input_ids)
print('mask:', attention_mask)

# Load GPT-2 model
model = GPT2Model.from_pretrained('distilgpt2', attn_implementation='eager')

# Run model
model.eval()
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)

last_hidden_state = outputs.last_hidden_state
print(last_hidden_state)

Here are the two requirements.txt files and the corresponding model outputs:
Before:
torch==2.6.0
transformers==4.50.0
huggingface_hub==0.33.4

input_ids: tensor([[50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 5631, 11008, 42274, 1797]])
mask: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])
Model output: tensor([[[-3.1153e-01,  1.1569e-01,  2.4667e-02,  ..., -1.6813e-01, -1.9119e-01, -4.2739e-02],
         [-8.7119e-01,  2.1186e-04,  5.6834e-01,  ..., -1.1233e-01, -4.8243e-01,  4.7066e-02],
         [-7.1241e-01, -4.7743e-02,  5.6767e-01,  ...,  1.0435e-02, -4.7335e-01,  2.1707e-04],
         ...,
         [-1.3753e+00,  2.9666e-01,  5.7950e-01,  ..., -6.4851e-01, -1.2977e+00, -8.4751e-02],
         [-1.2291e+00,  1.6299e-01,  4.4637e-01,  ..., -5.1411e-01, -6.0615e-01,  4.3908e-01],
         [-1.3633e+00,  8.3929e-02,  5.4821e-01,  ..., -5.7178e-01, -6.4784e-01,  4.6220e-01]]])

After:
torch==2.6.0
transformers==4.52.3
huggingface_hub==0.33.4

input_ids: tensor([[50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 5631, 11008, 42274, 1797]])
mask: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])
Model output: tensor([[[-0.0724,  0.4212,  0.0130,  ..., -0.1462,  0.1229, -0.0698],
         [-0.0360,  0.4466, -0.0973,  ..., -0.0136,  0.1273, -0.0545],
         [ 0.0104,  0.3948, -0.0099,  ...,  0.0273,  0.1091, -0.0364],
         ...,
         [-1.3753,  0.2967,  0.5795,  ..., -0.6485, -1.2978, -0.0848],
         [-1.2291,  0.1630,  0.4464,  ..., -0.5141, -0.6062,  0.4391],
         [-1.3633,  0.0839,  0.5482,  ..., -0.5718, -0.6479,  0.4622]]])


Although not mentioned in the release notes, it appears that the implementation of masks and attention has been significantly changed…

@John6666 thanks for the response. I gather that the latest version now implements the masks and attention correctly, in both directions: from padded to non-padded tokens and the other way around. In the long term I think we had better rebuild the fine-tuned model on the latest version. However, for security reasons we need to upgrade now, and the performance impact is too big to ignore. Is there any workaround for this issue?


Since the same code produces the same output, there are two options: simply download the old version of the source code and drop it in as a replacement, or fork Transformers and revert only the specific changes.

Another option is a monkey patch like the one below. I haven't confirmed whether it works or not…

# full_monkey_patch_gpt2_mask.py

import torch
from transformers import GPT2Model, GPT2Tokenizer
from transformers.modeling_attn_mask_utils import AttentionMaskConverter

# ─── 1. Legacy v4.50.0 mask helpers ──────────────────────────────────────────
# Copied from https://raw.githubusercontent.com/huggingface/transformers/v4.50.0/.../modeling_attn_mask_utils.py

def old_expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: int = None):
    bsz, src_len = mask.size()
    tgt_len = tgt_len if tgt_len is not None else src_len
    expanded = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
    inv = 1.0 - expanded
    return inv.masked_fill(inv.to(torch.bool), torch.finfo(dtype).min)

def old_to_causal_4d(
    attention_mask: torch.Tensor,
    input_shape: tuple[int, int],
    inputs_embeds: torch.Tensor,
    past_key_values_length: int,
    sliding_window: int | None = None,
):
    # Reconstruct converter usage from v4.50.0
    converter = AttentionMaskConverter(is_causal=True, sliding_window=sliding_window)
    key_value_length = input_shape[-1] + past_key_values_length
    if attention_mask is not None and attention_mask.dim() == 2:
        return converter.to_4d(
            attention_mask,
            input_shape[-1],
            key_value_length=key_value_length,
            dtype=inputs_embeds.dtype,
        )
    return converter.to_causal_4d(
        input_shape[0],
        input_shape[-1],
        key_value_length,
        dtype=inputs_embeds.dtype,
        device=inputs_embeds.device,
    )

# ─── 2. Monkey-patch the new converter ───────────────────────────────────────
# This forces Transformers ≥ 4.51 to use our old logic instead of the refactored one

AttentionMaskConverter._expand_mask    = staticmethod(old_expand_mask)
AttentionMaskConverter.to_causal_4d   = staticmethod(old_to_causal_4d)
AttentionMaskConverter.to_4d          = staticmethod(lambda mask, qlen, key_value_length=None, dtype=None: 
    old_expand_mask(mask, dtype, tgt_len=qlen))

# Prevent SDPA from dropping masks on trivial sequences:
AttentionMaskConverter._ignore_causal_mask_sdpa = staticmethod(lambda *args, **kwargs: False)
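
A minimal usage sketch, assuming the patch above is saved as full_monkey_patch_gpt2_mask.py next to your script: import it before running the model so the class-level patches are in place when the forward pass builds its masks.

import full_monkey_patch_gpt2_mask  # noqa: F401 — applies the patches at import time
from transformers import GPT2Model

model = GPT2Model.from_pretrained('distilgpt2', attn_implementation='eager')
model.eval()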

Thanks @John6666. I tried the monkey patch you provided above, but it does not change the model output.


As a last resort, downloading this file and saving it locally should allow you to import the old version of GPT2Model. Compared to forking and reverting the commits, this method is slightly less consistent, but it has the advantage of not being affected by version updates.
The import statements at the beginning can be rewritten to suit your environment.

Additionally, you could simply copy and paste the code from the old version, define the GPT2Model class yourself, and use it. Since the modules are designed to have minimal dependencies on each other, the implementation should not be too difficult.
If you decide to use AutoModel, there is an extra step, but if you only use GPT2Model, defining the class is all that's needed.
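
For example, a minimal sketch of that approach, assuming the v4.50.0 modeling_gpt2.py has been saved locally under the hypothetical name legacy_modeling_gpt2.py with its relative imports rewritten to absolute transformers imports:

from legacy_modeling_gpt2 import GPT2Model as LegacyGPT2Model  # local copy of the old class

# from_pretrained still downloads the distilgpt2 weights as usual;
# only the forward/mask logic comes from the old file
model = LegacyGPT2Model.from_pretrained('distilgpt2')
model.eval()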

Thanks @John6666, this is a good recommendation. We found a workaround by pinning a slightly lower version, v4.51.3, which still satisfies our security requirements, so we are fine for now.
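
For reference, the pinned requirements for that workaround look like this (assuming the other pins stay unchanged):

torch==2.6.0
transformers==4.51.3
huggingface_hub==0.33.4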

