I have been encountering the following error message for a while:
.to is not supported for 4-bit or 8-bit bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct dtype.
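For reference, the same error can be reproduced with a minimal sketch like the one below. This is only an illustration under a few assumptions: transformers, accelerate, and bitsandbytes are installed, a CUDA GPU is available, and the model id is just a placeholder.

# Minimal sketch reproducing the error (assumes transformers, accelerate,
# and bitsandbytes are installed and a CUDA GPU is available; the model id
# below is only a placeholder).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",  # placeholder id; gated models also need token=...
    quantization_config=bnb_config,
    device_map="auto",
)
model.to("cuda")  # raises the ".to is not supported for 4-bit or 8-bit bitsandbytes models" error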
The code in my project that throws the error is as follows:
# Loading an LLM locally
import os
import torch
from dotenv import load_dotenv
from huggingface_hub import login
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
)
from transformers.utils import is_flash_attn_2_available
# 00. Load the environment variables
load_dotenv()
# 01. Grab the token
hf_token = os.getenv("HF_TOKEN")
if not hf_token:
    raise RuntimeError("Hugging Face token not found! Please set HF_TOKEN in your .env file.")
# 02. (Optional) Cache it locally so you can also use CLI-backed commands
login(token=hf_token)
# 1. Create a quantization config
use_quantization_config = True  # whether to load the model in 4-bit with bitsandbytes

if use_quantization_config:
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
else:
    quantization_config = None
# Flash Attention 2 - a faster attention mechanism (requires a GPU with compute capability 8.0+)
if is_flash_attn_2_available() and (torch.cuda.get_device_capability(0)[0] >= 8):
    attn_implementation = "flash_attention_2"
else:
    attn_implementation = "sdpa"  # scaled dot product attention
# 2. Pick a model
model_id = "google/gemma-2-2b-it"
# 3. Instantiate the tokenizer (the tokenizer turns text into tokens)
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=model_id,
    token=hf_token,
)
# 4. Instantiate the model
llm_model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_id,
    device_map="auto",
    token=hf_token,
    torch_dtype=torch.float16,
    quantization_config=quantization_config if use_quantization_config else None,
    low_cpu_mem_usage=True,
    attn_implementation=attn_implementation,
)
# Move the model to the GPU only when no quantization config was used
# (a quantized model is already placed on the correct devices by device_map="auto")
if not use_quantization_config:
    llm_model.to("cuda")
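To help with debugging, I also print whether the loaded model actually ended up 4-bit quantized and how it was dispatched. This is a small sketch under the assumption that from_pretrained sets the is_loaded_in_4bit and hf_device_map attributes (names can differ between transformers versions, so treat it as illustrative):

# Debug prints (attribute names may vary across transformers versions;
# getattr with a default keeps this safe if they are missing).
print("4-bit quantized:", getattr(llm_model, "is_loaded_in_4bit", False))
print("device map:", getattr(llm_model, "hf_device_map", None))
print("first parameter device:", next(llm_model.parameters()).device)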