I’ve fine-tuned a model with QLoRA by following this notebook: Google Colab
And I pushed the adapter weights to the Hugging Face Hub.
When it comes time to predict with the base model+adapters, should I quantize the base model again (given the adapters were trained alongside a frozen quantized base model)?
Or is it valid to load the base model unquantized, attach/merge the adapters as usual, and predict away?
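To make the concern concrete, here is a toy NumPy sketch (my own illustration, not from the notebook) of why I'm hesitant: the adapters were optimized against the *dequantized* base weights, so merging them into the full-precision weights produces slightly different outputs than what they were trained to correct for.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_quantize(w, n_levels=16):
    # crude stand-in for 4-bit quantization: round onto a uniform grid
    scale = np.abs(w).max() / (n_levels / 2)
    return (np.round(w / scale) * scale).astype(np.float32)

W = rng.normal(size=(8, 8)).astype(np.float32)   # full-precision base weight
W_q = fake_quantize(W)                           # what QLoRA effectively trained against
A = rng.normal(scale=0.01, size=(8, 2)).astype(np.float32)
B = rng.normal(scale=0.01, size=(2, 8)).astype(np.float32)
delta = A @ B                                    # merged LoRA update

x = rng.normal(size=8).astype(np.float32)
y_trained = x @ (W_q + delta)    # what the adapters were fit to produce
y_merged_fp = x @ (W + delta)    # merging into the unquantized base instead

# the gap is exactly the base-weight quantization error, x @ (W_q - W)
print(np.abs(y_trained - y_merged_fp).max())
```

The drift is small but nonzero, which is the whole question: does it matter in practice, or is it lost in the noise?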
## TRAINING

```python
bnb_config = BitsAndBytesConfig(...)
model = AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)

# fit adapters alongside the frozen, quantized base model
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(...)
trainer = SFTTrainer(model=model, peft_config=peft_config, ...)
trainer.train()
trainer.push_to_hub()  # pushes adapter weights only
```

## INFERENCE

```python
peft_config = PeftConfig.from_pretrained(<hub_id>)
base_model = AutoModelForCausalLM.from_pretrained(
    peft_config.base_model_name_or_path,
    device_map="auto",
)
# should I quantize here as I did when fitting the adapters with QLoRA?
peft_model = PeftModel.from_pretrained(base_model, <hub_id>)
model = peft_model.merge_and_unload()
```
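For completeness, the re-quantized variant I'm asking about would presumably look like the sketch below (assuming `bnb_config` is re-created with the same settings used during training; `<hub_id>` as above):

```python
# Option 1 sketch: re-quantize the base model at inference time,
# mirroring the training-time setup.
bnb_config = BitsAndBytesConfig(...)  # same settings as during training
base_model = AutoModelForCausalLM.from_pretrained(
    peft_config.base_model_name_or_path,
    quantization_config=bnb_config,
    device_map="auto",
)
peft_model = PeftModel.from_pretrained(base_model, <hub_id>)
# Note: depending on the PEFT version, merging into a quantized base may
# be unsupported or lossy, so I'd likely keep the adapters unmerged here.
```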