DeepSpeed inference and Infinity offload with bitsandbytes 4-bit loaded models

Is it possible to use DeepSpeed inference with a model quantized to 4 or 8 bit using bitsandbytes?

I use the bitsandbytes package like this:

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

# 4-bit NF4 quantization config
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(self.model_id)
model = AutoModelForCausalLM.from_pretrained(
    self.model_id,
    device_map="auto",
    quantization_config=nf4_config,
)

# ZeRO stage 3 with parameter offload to CPU
zero_config = {
    "stage": 3,
    "offload_param": {
        "device": "cpu"
    }
}

ds_model = deepspeed.init_inference(
    model=model,
    mp_size=1,
    zero=zero_config,
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=self.max_new_tokens,
)

However, it throws an error:

ValueError: .to is not supported for 4-bit or 8-bit models. Please use the model as it is, since the model
has already been set to the correct devices and casted to the correct dtype.
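If I understand the error correctly, transformers blocks .to() on any bitsandbytes-quantized model, so any wrapper that tries to move the model (which init_inference presumably does internally) hits this check. Here is a minimal sketch of what I assume is the same restriction, reproduced without DeepSpeed (the small model name is only a placeholder for illustration):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# "facebook/opt-125m" is just a placeholder model for illustration.
small_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

# Moving a quantized model raises the same ValueError as above.
small_model.to("cuda:0")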

The ultimate goal is to combine quantization with DeepSpeed ZeRO-Infinity offload, in the hope of running a larger model that currently does not fit on my GPU.
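For reference, this is roughly the offload setup I would like to combine with 4-bit loading. It follows the usual HfDeepSpeedConfig / ZeRO-3 inference recipe and is only a sketch (the model name and batch sizes are placeholders); whether the quantized weights can be partitioned and offloaded this way at all is exactly what I am unsure about:

import torch
import deepspeed
from transformers import AutoModelForCausalLM
from transformers.integrations import HfDeepSpeedConfig

# ZeRO-3 with parameter offload to CPU (NVMe would be the Infinity case).
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
}

# Must be created before from_pretrained so the weights are initialized
# directly under ZeRO-3; keep this object alive.
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", torch_dtype=torch.bfloat16)

# Run with the deepspeed launcher, e.g. `deepspeed --num_gpus 1 script.py`.
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()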