Is it possible to use DeepSpeed inference with a 4-/8-bit quantized model using bitsandbytes?
I use the bitsandbytes package like this:
```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

# 4-bit NF4 quantization config for bitsandbytes
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    self.model_id,
    device_map="auto",
    quantization_config=nf4_config,  # the config defined above
)
tokenizer = AutoTokenizer.from_pretrained(self.model_id)

# ZeRO stage 3 with parameter offload to CPU
zero_config = {
    "stage": 3,
    "offload_param": {
        "device": "cpu"
    }
}

ds_model = deepspeed.init_inference(
    model=model,
    mp_size=1,
    zero=zero_config
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=self.max_new_tokens,
)
```
However, it throws an error:
```
ValueError: `.to` is not supported for `4-bit` or `8-bit` models. Please use the model as it is,
since the model has already been set to the correct devices and casted to the correct `dtype`.
```
The ultimate goal is to combine the 4-bit quantization with DeepSpeed ZeRO-Infinity offload, in the hope of running a larger model that currently does not fit on my GPU.
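For context, this is roughly the plain ZeRO-Inference CPU-offload setup (without bitsandbytes) that I am trying to combine with the 4-bit quantization, loosely following the HuggingFace non-Trainer DeepSpeed integration; the model id, bfloat16 and single-GPU settings below are only placeholders:

```python
# Sketch of ZeRO-3 inference with parameter offload to CPU, no quantization.
# "facebook/opt-1.3b" is just a placeholder model id; bf16 assumes an Ampere+ GPU.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig

model_id = "facebook/opt-1.3b"  # placeholder

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
}

# Must be created *before* from_pretrained and kept alive so the weights
# are loaded directly into ZeRO-3 partitions instead of fully on one device.
dschf = HfDeepSpeedConfig(ds_config)  # noqa: F841

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Wrap the model in a DeepSpeed engine and run generation through it.
ds_engine = deepspeed.initialize(model=model, config=ds_config)[0]
ds_engine.module.eval()

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = ds_engine.module.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```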