Hello!
I am facing a strange issue when trying to load a model in float16/bfloat16. Essentially, if I load the model in float16, it gets stuck. If I load it in float32 instead, loading is very quick and everything works.
This is the code I am using, and the only thing that changes is the dtype passed. Any ideas of what could be happening? I have tried removing low_cpu_mem_usage, local_files_only, and device_map, but nothing seems to work.
self.llm = AutoModelForCausalLM.from_pretrained(
    llm_model_name,
    torch_dtype=self.dtype,
    low_cpu_mem_usage=True,
    device_map="auto",
    local_files_only=True,
).to(device=self.device)
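For reference, here is a minimal standalone sketch of the comparison I am describing (the timing loop is just for illustration and is not part of my actual code; the model name matches the snippet further down):

import time
import torch
from transformers import AutoModelForCausalLM

# Load the same checkpoint in each dtype and time how long it takes.
for dtype in (torch.float32, torch.float16):
    t0 = time.time()
    model = AutoModelForCausalLM.from_pretrained(
        "google/gemma-2-9b",
        torch_dtype=dtype,
        local_files_only=True,
    )
    print(f"{dtype}: loaded in {time.time() - t0:.1f}s")
    del model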
I have also tried the following:
import torch
from transformers import AutoModelForCausalLM

torch.cuda.empty_cache()
torch.backends.cuda.matmul.allow_tf32 = True  # Allow TF32 in matmuls

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",
    torch_dtype=torch.float32,
    local_files_only=True,
).to("cuda")  # Move to GPU
model.half()  # Convert to float16
and for some reason it still gets stuck at the model.half() conversion step. I have tried this on both an A100 and a Quadro RTX 8000 and see the same issue.
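In case it helps narrow things down, this is a sketch of how I could time each stage separately to see exactly where the stall happens (the timed helper is hypothetical, and casting to float16 on the CPU before moving to the GPU is just one variant to try):

import time
import torch
from transformers import AutoModelForCausalLM

def timed(label, fn):
    # Tiny helper (hypothetical) to print how long each stage takes.
    t0 = time.time()
    result = fn()
    print(f"{label}: {time.time() - t0:.1f}s")
    return result

# Variant: cast to float16 on the CPU first, then move the converted weights over.
model = timed("load in float32", lambda: AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b", torch_dtype=torch.float32, local_files_only=True))
model = timed("cast to float16 on CPU", lambda: model.half())
model = timed("move to GPU", lambda: model.to("cuda"))

If the load itself is fast and the CPU-side half() hangs too, that would at least rule the GPUs out.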
Thank you!