DeepSpeed inference and ZeRO-Infinity offload with bitsandbytes 4-bit loaded models

Is it possible to use DeepSpeed inference with a 4-bit/8-bit quantized model loaded with bitsandbytes?

I use the bitsandbytes package like this:

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    self.model_id,
    device_map="auto",
    quantization_config=nf4_config,
)
tokenizer = AutoTokenizer.from_pretrained(self.model_id)

zero_config = {
    "stage": 3,
    "offload_param": {
        "device": "cpu",
    },
}

ds_model = deepspeed.init_inference(
    model=model,
    mp_size=1,
    zero=zero_config,
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=self.max_new_tokens,
)

However, it throws an error:

ValueError: .to is not supported for 4-bit or 8-bit models. Please use the model as it is, since the model
has already been set to the correct devices and casted to the correct dtype.

The ultimate goal is to combine 4-bit quantization with DeepSpeed ZeRO-Infinity offload, in the hope of running a larger model that currently does not fit on my GPU.
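For reference, the offload-only setup I am comparing against (plain ZeRO-Inference, without bitsandbytes) follows the documented HfDeepSpeedConfig pattern. This is only a rough sketch: the model id is a placeholder, it assumes a single GPU, and it is meant to be started with the deepspeed launcher (e.g. deepspeed --num_gpus 1 script.py).

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig  # transformers.deepspeed in older versions

model_id = "facebook/opt-1.3b"  # placeholder

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
}

# Must exist before from_pretrained so the model is created under zero.Init
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = ds_engine.module.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

This works without quantization; the question is whether the same offload can be combined with a bitsandbytes 4-bit model.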

Hey,
I'm also currently trying to achieve the same thing.

Did you manage to get bitsandbytes 4bit and deepspeed working together?

My error is slightly different:
ValueError: .half() is not supported for 4-bit or 8-bit models. Please use the model as it is, since the model has already been casted to the correct dtype.
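For context, this is roughly my call site; the dtype argument is my guess at what triggers the internal .half() cast (model here is the 4-bit model loaded as in the original post):

import torch
import deepspeed

# Requesting fp16 seems to make DeepSpeed cast the model with .half(),
# which the bitsandbytes-quantized model refuses.
ds_model = deepspeed.init_inference(
    model=model,
    mp_size=1,
    dtype=torch.half,
    replace_with_kernel_inject=False,
)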

Consider this a bump :slight_smile:

I am getting this error too.

Edit: There appears to be support now: