Loading in Float32 vs Float16 has very different speeds

Hello!

I am facing huge issues when trying to load a model in float16/bfloat16. Essentially, if I load the model in float16 it gets stuck. If I load it in float32 it is very quick and works.

This is the code I am using, and the only thing that changes is the dtype passed. Any ideas what could be happening? I have also tried removing low_cpu_mem_usage, local_files_only, and device_map, but nothing seems to work.

        # self.dtype is torch.float16 / torch.bfloat16 (hangs) or torch.float32 (works)
        self.llm = AutoModelForCausalLM.from_pretrained(
            llm_model_name,
            torch_dtype=self.dtype,
            low_cpu_mem_usage=True,
            device_map="auto",
            local_files_only=True,
        ).to(device=self.device)

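For reference, the stripped-down variant without those keyword arguments shows the same behaviour for me (using google/gemma-2-9b, the model from the snippet further down):

    # Minimal variant without low_cpu_mem_usage / local_files_only / device_map.
    # With torch_dtype=torch.float16 or torch.bfloat16 this hangs; with
    # torch.float32 it loads quickly.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "google/gemma-2-9b",
        torch_dtype=torch.float16,
    ).to("cuda")
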
I have also tried the following:

    import torch
    from transformers import AutoModelForCausalLM

    torch.cuda.empty_cache()
    torch.backends.cuda.matmul.allow_tf32 = True  # enable TF32 tensor-core matmul

    model = AutoModelForCausalLM.from_pretrained(
        "google/gemma-2-9b",
        torch_dtype=torch.float32,
        local_files_only=True,
    ).to("cuda")  # move to GPU

    model.half()  # convert to float16

and for some reason it still gets stuck at the conversion to float16. I have tried this on both an A100 and a Quadro RTX 8000 and see the same issue.
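
In case it helps with the diagnosis, here is a minimal timing sketch of those same steps (the print markers are just illustrative). Consistent with the above, the first two stages finish quickly and it is the .half() call that never returns:

    import time
    import torch
    from transformers import AutoModelForCausalLM

    t0 = time.perf_counter()
    model = AutoModelForCausalLM.from_pretrained(
        "google/gemma-2-9b",
        torch_dtype=torch.float32,
        local_files_only=True,
    )
    print(f"from_pretrained (float32): {time.perf_counter() - t0:.1f}s")

    t0 = time.perf_counter()
    model = model.to("cuda")  # moving the float32 weights to the GPU is also fast
    print(f"to cuda: {time.perf_counter() - t0:.1f}s")

    t0 = time.perf_counter()
    model = model.half()  # this is the step that gets stuck for me
    print(f"half: {time.perf_counter() - t0:.1f}s")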

Thank you!
