Loading in Float32 vs Float16 has very different speeds

Hello!

I am facing a major issue when trying to load a model in float16/bfloat16. Essentially, if I load the model in float16 it gets stuck, whereas if I load it in float32 it is very quick and works fine.

This is the code I am using; the only thing that changes is the dtype passed. Any ideas what could be happening? I have tried removing low_cpu_mem_usage, local_files_only, and device_map, but nothing seems to work.

        self.llm = AutoModelForCausalLM.from_pretrained(
            llm_model_name,
            torch_dtype=self.dtype,   # float16/bfloat16 hangs, float32 works
            low_cpu_mem_usage=True,
            device_map="auto",
            local_files_only=True,
        ).to(device=self.device)

I have also tried the following:

import torch
from transformers import AutoModelForCausalLM

torch.cuda.empty_cache()
torch.backends.cuda.matmul.allow_tf32 = True  # Allow TF32 for float32 matmuls on Ampere GPUs

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",
    torch_dtype=torch.float32,
    local_files_only=True
).to("cuda")  # Move to GPU

model.half()  # Convert to float16

and for some reason it still gets stuck at the conversion to float16. I have tried this on both an A100 and a Quadro RTX 8000 and see the same issue.

Thank you!


This is very strange. If you were using a GeForce from the 20x0 generation it would be easily explained (those cards do not support bfloat16, and the 10x0 series does not support float16), but you are using an A100.
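
As a quick sanity check (just a sketch, assuming a single-GPU setup), you can confirm what the card actually supports:

    import torch

    # Print the detected GPU and its compute capability.
    # bfloat16 generally needs an Ampere-class GPU (compute capability 8.0+) such as the A100;
    # the Quadro RTX 8000 (Turing, 7.5) has fast float16 but no native bfloat16.
    print(torch.cuda.get_device_name(0))
    print(torch.cuda.get_device_capability(0))
    print(torch.cuda.is_bf16_supported())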

Perhaps your version of the CUDA Toolkit or PyTorch is very old?
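
To rule that out, here is a minimal sketch that prints the installed versions and times a small float16 operation in isolation, to see whether half precision itself hangs or only the model load does:

    import time
    import torch

    print(torch.__version__)   # PyTorch version
    print(torch.version.cuda)  # CUDA version PyTorch was built against

    # Time a small float16 matmul; if this also hangs or is extremely slow,
    # the problem is likely at the driver/toolkit level rather than in transformers.
    x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    torch.cuda.synchronize()
    start = time.time()
    y = x @ x
    torch.cuda.synchronize()
    print(f"fp16 matmul: {time.time() - start:.3f} s")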