Impossible to train a model using both bf16 mixed precision training and torch compile, RuntimeError: expected mat1 and mat2 to have the same dtype

rfruit · October 22, 2024, 10:17pm

Environment:

transformers version: 4.45.2
datasets version: 3.0.1
Platform: Linux-5.15.0-1070-aws-x86_64-with-glibc2.35
Python version: 3.10.12
Huggingface_hub version: 0.26.1
Safetensors version: 0.4.5
Accelerate version: 1.0.1
Accelerate config: not found
PyTorch version (GPU?): 2.5.0+cu118 (True)
Tensorflow version (GPU?): 2.14.1 (True)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?: no
Using GPU in script?: yes
GPU type: NVIDIA A10G

Bug description:
I am unable to train a model using both bfloat16 and torch compile, I am getting RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16 even though all parameters in the model appear to have torch.bfloat16 dtype (see script below). When disabling torch compilation or using float32 (or doing both), everything works fine.

Minimal reproducible example:

import torch
from transformers import pipeline
from transformers import TrainingArguments, Trainer
from datasets import load_dataset

# Load classification pipeline from pretrained model
pipe = pipeline(
    "text-classification",
    model="Qwen/Qwen2.5-0.5B" ,
    model_kwargs={
        "num_labels": 5,
    },
    device_map="cuda"
)
print({p.data.dtype for p in pipe.model.parameters()})

# Load + format dataset
dataset = load_dataset("yelp_review_full")["train"].select(range(100))
def tokenize_function(examples):
    return pipe.tokenizer(
        examples["text"], 
        max_length=124, 
        padding="max_length", 
        truncation=True
    )
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Train 
training_args = TrainingArguments(
    per_device_train_batch_size=8,
    num_train_epochs=1,
    torch_compile=True, 
    bf16=True,  # use bfloat16 mixed precision training
    output_dir="/tmp/tests/test_1",
)
trainer = Trainer(
    model=pipe.model,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets,
    args=training_args,
    tokenizer=pipe.tokenizer,
)
trainer.train()

End of traceback:

File /tmp/torchinductor_root/sq/csqz5rruxwlzuuvfjpvwprouxopxgytlrulekcxpejp4ojprvao7.py:637, in call(args)
    635 buf20 = empty_strided_cuda((992, 896), (896, 1), torch.float32)
    636 # Topologically Sorted Source Nodes: [linear_3], Original ATen: [aten.mm]
--> 637 extern_kernels.mm(reinterpret_tensor(buf19, (992, 896), (896, 1), 0), reinterpret_tensor(primals_12, (896, 896), (1, 896), 0), out=buf20)
    638 buf21 = reinterpret_tensor(buf20, (8, 124, 896), (111104, 896, 1), 0); del buf20  # reuse
    639 buf22 = empty_strided_cuda((8, 124, 1), (124, 1, 992), torch.float32)

It looks like a bug to me but I want to be sure before opening an issue.

rfruit · October 23, 2024, 9:46am

The problem does not seem to occur when pytorch is downgraded to version 2.4.1.

I am not fully sure though because in this case another error occur: `RuntimeError: invalid dtype for bias` when use compile + autocast · Issue #124901 · pytorch/pytorch · GitHub (at the end of the issue they mention that the problem is fixed with pytorch 2.5.0, but then the issue above occurs, I am stuck in a circular loop )

rfruit · October 23, 2024, 4:13pm

The same problem seems to occur with float16 instead of bfloat16 (but not for tensorfloat32 apparently).

rfruit · October 25, 2024, 3:24pm

Surprisingly, the same code works perfectly well with "facebook/bart-large" instead of "Qwen/Qwen2.5-0.5B". It may be related to the way "Qwen/Qwen2.5-0.5B" is coded (maybe a forced casting is done in the forward pass of Qwen network or something like that).

rfruit · October 25, 2024, 3:33pm

But other models like "TinyLlama/TinyLlama_v1.1" suffer from the same issue as "Qwen/Qwen2.5-0.5B".

Not sure what is causing this problem exactly

John6666 · October 25, 2024, 4:07pm

Perhaps.

rfruit · October 25, 2024, 8:26pm

Thanks for the link @John6666

I’ve seen on this page that Type mismatch errors in an autocast-enabled region are a bug; if this is what you observe, please file an issue.

It is a type mismatch issue and even though I have not deep dived a lot into the hugging face code, I know that autocast is used with dtype=torch.bfloat16 when parameter bf16 is set to True. But I am hesitant to open an issue on pytorch side since from what I understood, it could also be a problem on transformers side for some models (for e.g., it is also mentioned in the documentation that You should not call half() or bfloat16() on your model(s) or inputs when using autocasting and something like that may be present in the code).

John6666 · October 26, 2024, 1:16am

something like that may be present in the code

It’s a personal hunch, but maybe there is. Then we’d better report it to the transformers github or HF Discord to be sure.
If it’s a torch bug, they should open an issue on torch.

rfruit · October 28, 2024, 4:31pm

I have created an issue: Impossible to train a model using both bf16 mixed precision training and torch compile, RuntimeError: expected mat1 and mat2 to have the same dtype · Issue #34470 · huggingface/transformers · GitHub

Topic		Replies	Views
Loading in Float32 vs Float16 has very different speed 🤗Transformers	1	126	February 20, 2025
Bfloat16 conversion results in significantly slower computation for various transformer models 🤗Transformers	0	1420	December 20, 2021
Confused with setting up torch_dtype while using CPU as device 🤗Transformers	0	2269	October 12, 2022
Expected mat1 and mat2 to have the same dtype, but got: c10::Half != float 🤗Transformers	3	1766	July 8, 2024
Looks like the new transformer 4.49.0 has some issues 🤗Transformers	3	252	March 6, 2025

Impossible to train a model using both bf16 mixed precision training and torch compile, RuntimeError: expected mat1 and mat2 to have the same dtype

Related topics