Hello!
I am facing a strange issue when trying to load a model in float16/bfloat16. Essentially, if I load the model in float16, it gets stuck. If I load it in float32 instead, loading is very quick and everything works.
This is the code I am using, and the only thing that changes is the dtype passed. Any ideas of what could be happening? I have tried removing low_cpu_mem_usage, local_files_only, and device_map, but nothing seems to work.
self.llm = AutoModelForCausalLM.from_pretrained(
    llm_model_name,
    torch_dtype=self.dtype,
    low_cpu_mem_usage=True,
    device_map="auto",
    local_files_only=True,
).to(device=self.device)
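For reference, here is a minimal standalone sketch of the comparison I am describing (the timing loop is just for illustration and is not part of my actual code; the model name matches the snippet further down):

import time
import torch
from transformers import AutoModelForCausalLM

# Load the same checkpoint in each dtype and time how long it takes.
for dtype in (torch.float32, torch.float16):
    t0 = time.time()
    model = AutoModelForCausalLM.from_pretrained(
        "google/gemma-2-9b",
        torch_dtype=dtype,
        local_files_only=True,
    )
    print(f"{dtype}: loaded in {time.time() - t0:.1f}s")
    del model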
I have also tried the following:
import torch
from transformers import AutoModelForCausalLM

torch.cuda.empty_cache()
torch.backends.cuda.matmul.allow_tf32 = True  # Allow TF32 in matmuls

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",
    torch_dtype=torch.float32,
    local_files_only=True,
).to("cuda")  # Move to GPU
model.half()  # Convert to float16
and for some reason it still gets stuck at the model.half() conversion step. I have tried this on both an A100 and a Quadro RTX 8000 and see the same issue.
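In case it helps narrow things down, this is a sketch of how I could time each stage separately to see exactly where the stall happens (the timed helper is hypothetical, and casting to float16 on the CPU before moving to the GPU is just one variant to try):

import time
import torch
from transformers import AutoModelForCausalLM

def timed(label, fn):
    # Tiny helper (hypothetical) to print how long each stage takes.
    t0 = time.time()
    result = fn()
    print(f"{label}: {time.time() - t0:.1f}s")
    return result

# Variant: cast to float16 on the CPU first, then move the converted weights over.
model = timed("load in float32", lambda: AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b", torch_dtype=torch.float32, local_files_only=True))
model = timed("cast to float16 on CPU", lambda: model.half())
model = timed("move to GPU", lambda: model.to("cuda"))

If the load itself is fast and the CPU-side half() hangs too, that would at least rule the GPUs out.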
Thank you!