DeepSpeed inference and Infinity offload with bitsandbytes 4-bit loaded models

Is it possible to use DeepSpeed inference with a model quantized to 4 or 8 bit using bitsandbytes?

I use the bitsandbytes package like this:

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

# 4-bit NF4 quantization config
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(self.model_id)
model = AutoModelForCausalLM.from_pretrained(
    self.model_id,
    device_map="auto",
    quantization_config=nf4_config,
)

# ZeRO stage 3 with parameter offload to CPU
zero_config = {
    "stage": 3,
    "offload_param": {
        "device": "cpu"
    }
}

ds_model = deepspeed.init_inference(
    model=model,
    mp_size=1,
    zero=zero_config,
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=self.max_new_tokens,
)

However, it throws an error:

ValueError: .to is not supported for 4-bit or 8-bit models. Please use the model as it is, since the model
has already been set to the correct devices and casted to the correct dtype.
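If I understand the error correctly, transformers blocks .to() on any bitsandbytes-quantized model, so any wrapper that tries to move the model (which init_inference presumably does internally) hits this check. Here is a minimal sketch of what I assume is the same restriction, reproduced without DeepSpeed (the small model name is only a placeholder for illustration):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# "facebook/opt-125m" is just a placeholder model for illustration.
small_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

# Moving a quantized model raises the same ValueError as above.
small_model.to("cuda:0")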

The ultimate goal is to combine quantization with DeepSpeed ZeRO-Infinity offload, in the hope of running a larger model that currently does not fit on my GPU.
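For reference, this is roughly the offload setup I would like to combine with 4-bit loading. It follows the usual HfDeepSpeedConfig / ZeRO-3 inference recipe and is only a sketch (the model name and batch sizes are placeholders); whether the quantized weights can be partitioned and offloaded this way at all is exactly what I am unsure about:

import torch
import deepspeed
from transformers import AutoModelForCausalLM
from transformers.integrations import HfDeepSpeedConfig

# ZeRO-3 with parameter offload to CPU (NVMe would be the Infinity case).
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
}

# Must be created before from_pretrained so the weights are initialized
# directly under ZeRO-3; keep this object alive.
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", torch_dtype=torch.bfloat16)

# Run with the deepspeed launcher, e.g. `deepspeed --num_gpus 1 script.py`.
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()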