DeepSpeed inference and ZeRO-Infinity offload with bitsandbytes 4-bit loaded models

Is it possible to use DeepSpeed inference with a 4-bit/8-bit quantized model loaded with bitsandbytes?

I use the bitsandbytes package like this:

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    self.model_id,
    device_map="auto",
    quantization_config=nf4_config,
)
tokenizer = AutoTokenizer.from_pretrained(self.model_id)

zero_config = {
    "stage": 3,
    "offload_param": {
        "device": "cpu",
    },
}

ds_model = deepspeed.init_inference(
    model=model,
    mp_size=1,
    zero=zero_config,
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=self.max_new_tokens,
)

However, it throws an error:

ValueError: .to is not supported for 4-bit or 8-bit models. Please use the model as it is, since the model
has already been set to the correct devices and casted to the correct dtype.

The ultimate goal is to combine 4-bit quantization with DeepSpeed ZeRO-Infinity offload, in the hope of running a larger model that currently does not fit on my GPU.
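For reference, the offload-only setup I am comparing against (plain ZeRO-Inference, without bitsandbytes) follows the documented HfDeepSpeedConfig pattern. This is only a rough sketch: the model id is a placeholder, it assumes a single GPU, and it is meant to be started with the deepspeed launcher (e.g. deepspeed --num_gpus 1 script.py).

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig  # transformers.deepspeed in older versions

model_id = "facebook/opt-1.3b"  # placeholder

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
}

# Must exist before from_pretrained so the model is created under zero.Init
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = ds_engine.module.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

This works without quantization; the question is whether the same offload can be combined with a bitsandbytes 4-bit model.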

Hey,
I'm also currently trying to achieve the same thing.

Did you manage to get bitsandbytes 4bit and deepspeed working together?

My error is slightly different:
ValueError: .half() is not supported for 4-bit or 8-bit models. Please use the model as it is, since the model has already been casted to the correct dtype.
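For context, this is roughly my call site; the dtype argument is my guess at what triggers the internal .half() cast (model here is the 4-bit model loaded as in the original post):

import torch
import deepspeed

# Requesting fp16 seems to make DeepSpeed cast the model with .half(),
# which the bitsandbytes-quantized model refuses.
ds_model = deepspeed.init_inference(
    model=model,
    mp_size=1,
    dtype=torch.half,
    replace_with_kernel_inject=False,
)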

Consider this a bump :slight_smile:

I am getting this error too.

Edit: There appears to be support now: