Is it possible to use DeepSpeed inference with a model quantized to 4 or 8 bits via bitsandbytes?
I load the model with the bitsandbytes package like this:
```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

# 4-bit NF4 quantization via bitsandbytes
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    self.model_id, device_map="auto", quantization_config=nf4_config
)
tokenizer = AutoTokenizer.from_pretrained(self.model_id)

# ZeRO stage 3 with parameter offload to CPU
zero_config = {
    "stage": 3,
    "offload_param": {"device": "cpu"},
}
ds_model = deepspeed.init_inference(
    model=model,
    mp_size=1,
    zero=zero_config,
)
pipe = pipeline(
    "text-generation", model=model, tokenizer=tokenizer, max_new_tokens=self.max_new_tokens
)
```
However, it throws an error:
```
ValueError: `.to` is not supported for `4-bit` or `8-bit` models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.
```
The ultimate goal is to combine quantization with DeepSpeed ZeRO-Infinity offload, in the hope of running a larger model that currently does not fit on my GPU.
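For reference, this is roughly the kind of ZeRO-3 / ZeRO-Infinity CPU-offload setup I am trying to combine with the quantized model, based on the Hugging Face non-Trainer DeepSpeed integration. It is only a sketch: the dtype, batch size, and prompt are placeholder assumptions, and it is meant to be launched with the `deepspeed` launcher. What I cannot figure out is how to make this work once the model is loaded with `load_in_4bit`.

```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig

# ZeRO-3 with parameter offload to CPU; values here are placeholders.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,  # required key, unused for inference
}

# HfDeepSpeedConfig must be created *before* from_pretrained so that the
# weights are loaded directly in ZeRO-3 partitioned (and offloadable) form.
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

model = AutoModelForCausalLM.from_pretrained(self.model_id, torch_dtype=torch.bfloat16)
engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

tokenizer = AutoTokenizer.from_pretrained(self.model_id)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = engine.module.generate(**inputs, max_new_tokens=self.max_new_tokens)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```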