Inconsistent GPT2Model results between transformers versions

We fine-tuned GPT2Model (distilgpt2) some time ago. After the upgrade, the exact same GPT-2 model produces different outputs for the exact same input, so the classification head (a linear layer) we apply on top of the GPT-2 output now yields different scores for the same input. It looks to me like the masked portion of the model output changed while the unmasked portion stays the same. In a previous upgrade we saw the default value of attn_implementation change from "eager" to "sdpa" (see my previous topic). Due to tooling vulnerability issues we have to upgrade to transformers 4.52.3 or above. This time I explicitly specified attn_implementation="eager", yet I still get different results after the upgrade. Can anyone help point to what changed?

The code to reproduce the results:
import torch
import tokenizers
import transformers
from transformers import GPT2Model, GPT2Tokenizer

# Sample input
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'

text = 'DAVID DAVIS'
model_inputs = tokenizer(text, padding='max_length', max_length=12, truncation=True, return_tensors='pt')
input_ids, attention_mask = model_inputs['input_ids'], model_inputs['attention_mask']
print('input_ids:', input_ids)
print('mask:', attention_mask)

# Load GPT-2 model
model = GPT2Model.from_pretrained('distilgpt2', attn_implementation='eager')

# Run model
model.eval()
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)

last_hidden_state = outputs.last_hidden_state
print(last_hidden_state)

Here are the two requirements.txt files and the corresponding model outputs:
Before:
torch==2.6.0
transformers==4.50.0
huggingface_hub==0.33.4

input_ids: tensor([[50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 5631, 11008, 42274, 1797]])
mask: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])
Model output: tensor([[[-3.1153e-01,  1.1569e-01,  2.4667e-02,  ..., -1.6813e-01, -1.9119e-01, -4.2739e-02],
         [-8.7119e-01,  2.1186e-04,  5.6834e-01,  ..., -1.1233e-01, -4.8243e-01,  4.7066e-02],
         [-7.1241e-01, -4.7743e-02,  5.6767e-01,  ...,  1.0435e-02, -4.7335e-01,  2.1707e-04],
         ...,
         [-1.3753e+00,  2.9666e-01,  5.7950e-01,  ..., -6.4851e-01, -1.2977e+00, -8.4751e-02],
         [-1.2291e+00,  1.6299e-01,  4.4637e-01,  ..., -5.1411e-01, -6.0615e-01,  4.3908e-01],
         [-1.3633e+00,  8.3929e-02,  5.4821e-01,  ..., -5.7178e-01, -6.4784e-01,  4.6220e-01]]])

After:
torch==2.6.0
transformers==4.52.3
huggingface_hub==0.33.4

input_ids: tensor([[50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 5631, 11008, 42274, 1797]])
mask: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])
Model output: tensor([[[-0.0724,  0.4212,  0.0130,  ..., -0.1462,  0.1229, -0.0698],
         [-0.0360,  0.4466, -0.0973,  ..., -0.0136,  0.1273, -0.0545],
         [ 0.0104,  0.3948, -0.0099,  ...,  0.0273,  0.1091, -0.0364],
         ...,
         [-1.3753,  0.2967,  0.5795,  ..., -0.6485, -1.2978, -0.0848],
         [-1.2291,  0.1630,  0.4464,  ..., -0.5141, -0.6062,  0.4391],
         [-1.3633,  0.0839,  0.5482,  ..., -0.5718, -0.6479,  0.4622]]])


Although not mentioned in the release notes, it appears that the implementation of masks and attention has been significantly changed…

@John6666 thanks for the response. I gather that the latest version now implements the masks and attention correctly, in both directions: from padded to non-padded tokens and the other way around. In the long term I think we had better rebuild the fine-tuned model on the latest version. However, for security reasons we need to upgrade now, and the performance impact is too big to ignore. Is there any workaround for this issue?


Since the same code produces the same output, there are two options: simply download the old version of the source code and drop it in as a replacement, or fork Transformers and revert only the specific changes.

Another option is a monkey patch like the one below. I haven't confirmed whether it works or not…

# full_monkey_patch_gpt2_mask.py

import torch
from transformers import GPT2Model, GPT2Tokenizer
from transformers.modeling_attn_mask_utils import AttentionMaskConverter

# ─── 1. Legacy v4.50.0 mask helpers ──────────────────────────────────────────
# Copied from https://raw.githubusercontent.com/huggingface/transformers/v4.50.0/.../modeling_attn_mask_utils.py

def old_expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: int = None):
    bsz, src_len = mask.size()
    tgt_len = tgt_len if tgt_len is not None else src_len
    expanded = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
    inv = 1.0 - expanded
    return inv.masked_fill(inv.to(torch.bool), torch.finfo(dtype).min)

def old_to_causal_4d(
    attention_mask: torch.Tensor,
    input_shape: tuple[int, int],
    inputs_embeds: torch.Tensor,
    past_key_values_length: int,
    sliding_window: int | None = None,
):
    # Reconstruct converter usage from v4.50.0
    converter = AttentionMaskConverter(is_causal=True, sliding_window=sliding_window)
    key_value_length = input_shape[-1] + past_key_values_length
    if attention_mask is not None and attention_mask.dim() == 2:
        return converter.to_4d(
            attention_mask,
            input_shape[-1],
            key_value_length=key_value_length,
            dtype=inputs_embeds.dtype,
        )
    return converter.to_causal_4d(
        input_shape[0],
        input_shape[-1],
        key_value_length,
        dtype=inputs_embeds.dtype,
        device=inputs_embeds.device,
    )

# ─── 2. Monkey-patch the new converter ───────────────────────────────────────
# This forces Transformers ≥ 4.51 to use our old logic instead of the refactored one

AttentionMaskConverter._expand_mask    = staticmethod(old_expand_mask)
AttentionMaskConverter.to_causal_4d   = staticmethod(old_to_causal_4d)
AttentionMaskConverter.to_4d          = staticmethod(lambda mask, qlen, key_value_length=None, dtype=None: 
    old_expand_mask(mask, dtype, tgt_len=qlen))

# Prevent SDPA from dropping masks on trivial sequences:
AttentionMaskConverter._ignore_causal_mask_sdpa = staticmethod(lambda *args, **kwargs: False)
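
A minimal usage sketch, assuming the patch above is saved as full_monkey_patch_gpt2_mask.py next to your script: import it before running the model so the class-level patches are in place when the forward pass builds its masks.

import full_monkey_patch_gpt2_mask  # noqa: F401 — applies the patches at import time
from transformers import GPT2Model

model = GPT2Model.from_pretrained('distilgpt2', attn_implementation='eager')
model.eval()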

Thanks @John6666. I tried the monkey patch you provided above, but it does not change the model output.


As a last resort, downloading this file and saving it locally should allow you to import the old version of GPT2Model. Compared to forking and reverting the commits, this method is slightly less consistent, but it has the advantage of not being affected by version updates.
The import statements at the beginning can be rewritten to suit your environment.

Additionally, you could simply copy and paste the code from the old version, define the GPT2Model class yourself, and use it. Since the modules are designed to have minimal dependencies on each other, the implementation should not be too difficult.
If you decide to use AutoModel, there is an extra step, but if you only use GPT2Model, defining the class is all that's needed.
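
For example, a minimal sketch of that approach, assuming the v4.50.0 modeling_gpt2.py has been saved locally under the hypothetical name legacy_modeling_gpt2.py with its relative imports rewritten to absolute transformers imports:

from legacy_modeling_gpt2 import GPT2Model as LegacyGPT2Model  # local copy of the old class

# from_pretrained still downloads the distilgpt2 weights as usual;
# only the forward/mask logic comes from the old file
model = LegacyGPT2Model.from_pretrained('distilgpt2')
model.eval()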

Thanks @John6666, this is a good recommendation. We found a workaround by pinning a slightly lower version, v4.51.3, which still satisfies our security requirements, so we are fine for now.
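
For reference, the pinned requirements for that workaround look like this (assuming the other pins stay unchanged):

torch==2.6.0
transformers==4.51.3
huggingface_hub==0.33.4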

