Impossible to train a model using both bf16 mixed precision training and torch.compile: RuntimeError: expected mat1 and mat2 to have the same dtype

Environment:

  • transformers version: 4.45.2
  • datasets version: 3.0.1
  • Platform: Linux-5.15.0-1070-aws-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.26.1
  • Safetensors version: 0.4.5
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.0+cu118 (True)
  • Tensorflow version (GPU?): 2.14.1 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: NVIDIA A10G

Bug description:
I am unable to train a model using both bfloat16 mixed precision and torch.compile: I get RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16, even though all parameters of the model appear to have torch.bfloat16 dtype (see script below). When disabling torch compilation or using float32 (or both), everything works fine.

Minimal reproducible example:

import torch
from transformers import pipeline
from transformers import TrainingArguments, Trainer
from datasets import load_dataset

# Load classification pipeline from pretrained model
pipe = pipeline(
    "text-classification",
    model="Qwen/Qwen2.5-0.5B" ,
    model_kwargs={
        "num_labels": 5,
    },
    device_map="cuda"
)
print({p.data.dtype for p in pipe.model.parameters()})

# Load + format dataset
dataset = load_dataset("yelp_review_full")["train"].select(range(100))
def tokenize_function(examples):
    return pipe.tokenizer(
        examples["text"], 
        max_length=124, 
        padding="max_length", 
        truncation=True
    )
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Train 
training_args = TrainingArguments(
    per_device_train_batch_size=8,
    num_train_epochs=1,
    torch_compile=True, 
    bf16=True,  # use bfloat16 mixed precision training
    output_dir="/tmp/tests/test_1",
)
trainer = Trainer(
    model=pipe.model,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets,
    args=training_args,
    tokenizer=pipe.tokenizer,
)
trainer.train()

End of traceback:

File /tmp/torchinductor_root/sq/csqz5rruxwlzuuvfjpvwprouxopxgytlrulekcxpejp4ojprvao7.py:637, in call(args)
    635 buf20 = empty_strided_cuda((992, 896), (896, 1), torch.float32)
    636 # Topologically Sorted Source Nodes: [linear_3], Original ATen: [aten.mm]
--> 637 extern_kernels.mm(reinterpret_tensor(buf19, (992, 896), (896, 1), 0), reinterpret_tensor(primals_12, (896, 896), (1, 896), 0), out=buf20)
    638 buf21 = reinterpret_tensor(buf20, (8, 124, 896), (111104, 896, 1), 0); del buf20  # reuse
    639 buf22 = empty_strided_cuda((8, 124, 1), (124, 1, 992), torch.float32)

It looks like a bug to me, but I want to be sure before opening an issue.


The problem does not seem to occur when PyTorch is downgraded to version 2.4.1.

I am not fully sure, though, because in that case another error occurs: `RuntimeError: invalid dtype for bias` when use compile + autocast · Issue #124901 · pytorch/pytorch · GitHub (at the end of that issue they mention the problem is fixed in PyTorch 2.5.0, but then the issue above occurs, so I am stuck in a circular loop :sweat_smile: ).


The same problem seems to occur with float16 instead of bfloat16 (but apparently not with tensorfloat32).
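For completeness, the only change relative to the script above that is needed to hit the same error with float16 is the precision flag (the output_dir here is just an arbitrary path):

# Same script as above, only the mixed precision flag changes
training_args = TrainingArguments(
    per_device_train_batch_size=8,
    num_train_epochs=1,
    torch_compile=True,
    fp16=True,  # float16 mixed precision triggers the same dtype mismatch
    output_dir="/tmp/tests/test_2",
)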


Surprisingly, the same code works perfectly well with "facebook/bart-large" instead of "Qwen/Qwen2.5-0.5B". It may be related to the way "Qwen/Qwen2.5-0.5B" is implemented (maybe a forced cast is done in the forward pass of the Qwen network, or something like that).
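One quick (and admittedly crude) way to check this hypothesis is to grep the Qwen2 modeling source for explicit dtype casts; this is just my own diagnostic sketch, not something from the Trainer:

import inspect
from transformers.models.qwen2 import modeling_qwen2

# Print lines of modeling_qwen2.py that contain explicit dtype casts,
# which could conflict with autocast under torch.compile
src = inspect.getsource(modeling_qwen2)
for lineno, line in enumerate(src.splitlines(), start=1):
    if ".to(torch." in line or "float32" in line:
        print(lineno, line.strip())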


But other models, like "TinyLlama/TinyLlama_v1.1", suffer from the same issue as "Qwen/Qwen2.5-0.5B".

I am not sure what exactly is causing this problem.


Perhaps.


Thanks for the link, @John6666.

I’ve seen on that page that "Type mismatch errors in an autocast-enabled region are a bug; if this is what you observe, please file an issue."

It is a type mismatch issue, and even though I have not dived deep into the Hugging Face code, I know that autocast is used with dtype=torch.bfloat16 when the bf16 parameter is set to True. However, I am hesitant to open an issue on the PyTorch side, since from what I understand it could also be a problem on the transformers side for some models (for example, the documentation also states that "You should not call half() or bfloat16() on your model(s) or inputs when using autocasting", and something like that may be present in the code).
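To help decide on which side to file the issue, here is a minimal standalone sketch of what I believe the Trainer roughly does with bf16=True and torch_compile=True (an fp32 model whose compiled forward runs inside an autocast(bfloat16) region); if this also fails outside the Trainer, it points more towards PyTorch:

import torch

# Standalone sketch (my approximation, not the Trainer's actual code):
# keep the parameters in float32 and rely on autocast for bfloat16 matmuls,
# as recommended by the autocast documentation
model = torch.nn.Linear(896, 896).cuda()   # parameters stay float32
compiled_model = torch.compile(model)

x = torch.randn(8, 896, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = compiled_model(x)                # matmul should run in bfloat16
loss = out.float().sum()
loss.backward()
print({p.dtype for p in model.parameters()})  # expected: {torch.float32}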