Loading in Float32 vs Float16 has very different speeds

Hello!

I am facing huge issues when trying to load a model in float16/bfloat16. Essentially, if I load the model in float16 it gets stuck. If I load it in float32 it is very quick and works.

This is the code I am using, and the only thing that changes is the dtype passed. Any ideas what could be happening? I have also tried removing low_cpu_mem_usage, local_files_only, and device_map, but nothing seems to work.

        # self.dtype is torch.float16 / torch.bfloat16 (hangs) or torch.float32 (works)
        self.llm = AutoModelForCausalLM.from_pretrained(
            llm_model_name,
            torch_dtype=self.dtype,
            low_cpu_mem_usage=True,
            device_map="auto",
            local_files_only=True,
        ).to(device=self.device)

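For reference, the stripped-down variant without those keyword arguments shows the same behaviour for me (using google/gemma-2-9b, the model from the snippet further down):

    # Minimal variant without low_cpu_mem_usage / local_files_only / device_map.
    # With torch_dtype=torch.float16 or torch.bfloat16 this hangs; with
    # torch.float32 it loads quickly.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "google/gemma-2-9b",
        torch_dtype=torch.float16,
    ).to("cuda")
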
I have also tried the following:

    import torch
    from transformers import AutoModelForCausalLM

    torch.cuda.empty_cache()
    torch.backends.cuda.matmul.allow_tf32 = True  # enable TF32 tensor-core matmul

    model = AutoModelForCausalLM.from_pretrained(
        "google/gemma-2-9b",
        torch_dtype=torch.float32,
        local_files_only=True,
    ).to("cuda")  # move to GPU

    model.half()  # convert to float16

and for some reason it still gets stuck at the conversion to float16. I have tried this on both an A100 and a Quadro RTX 8000 and see the same issue.
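
In case it helps with the diagnosis, here is a minimal timing sketch of those same steps (the print markers are just illustrative). Consistent with the above, the first two stages finish quickly and it is the .half() call that never returns:

    import time
    import torch
    from transformers import AutoModelForCausalLM

    t0 = time.perf_counter()
    model = AutoModelForCausalLM.from_pretrained(
        "google/gemma-2-9b",
        torch_dtype=torch.float32,
        local_files_only=True,
    )
    print(f"from_pretrained (float32): {time.perf_counter() - t0:.1f}s")

    t0 = time.perf_counter()
    model = model.to("cuda")  # moving the float32 weights to the GPU is also fast
    print(f"to cuda: {time.perf_counter() - t0:.1f}s")

    t0 = time.perf_counter()
    model = model.half()  # this is the step that gets stuck for me
    print(f"half: {time.perf_counter() - t0:.1f}s")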

Thank you!
