Difference between AutoModelForCausalLM and peft_model.merge_and_unload() for a LoRA model during inference

Hello,

I have fine-tuned a model with a PEFT (LoRA) configuration, and I am trying to load the base model and the adapter for inference.

I have read in this answer that loading the model you saved with:

trainer.model.save_pretrained(new_model)

using AutoModelForCausalLM:

model = AutoModelForCausalLM.from_pretrained(
    new_model,
    quantization_config=bnb_config,
    device_map=device_map
)

would load both the base model and the adapters.

However, most tutorials suggest using merge_and_unload() when working with PEFT models:

base_model = AutoModelForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    return_dict=True,
    device_map=device_map,
    quantization_config=bnb_config,
)

model = PeftModel.from_pretrained(
    model=base_model,
    model_id=new_model,
    quantization_config=bnb_config,
    device_map=device_map,
)

model = model.merge_and_unload(safe_merge=True)

As far as I understand, these two blocks of code should yield the same results, but when I print the model structures they are nothing alike. Moreover, they yield different results at inference: the output of the latter resembles the output of the base model, whereas the output of the former looks like the training data I used.
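To make the expected equivalence concrete, here is a minimal toy sketch of the arithmetic behind a LoRA merge. An adapted linear layer computes W x + (lora_alpha / r) * B A x, and merging folds that update into the weight as W + (lora_alpha / r) * B A. This would also account for the printed structures differing, since a merged model no longer contains separate adapter matrices. The sketch uses plain Python lists instead of peft, all helper names are hypothetical, and it only illustrates the math under these assumptions, not what merge_and_unload() does internally for a quantized model.

```python
# Toy illustration of a LoRA merge with plain Python lists (no torch/peft needed).
# All helper names below are hypothetical and exist only for this sketch.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def lora_forward(W, A, B, scale, x):
    """Unmerged path: base output plus the scaled low-rank update."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

def merge_lora(W, A, B, scale):
    """Merged path: fold the update into the base weight, W + scale * (B @ A)."""
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

# 2x2 base weight with a rank-1 adapter; scale stands in for lora_alpha / r.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # shape (r, in_features)
B = [[0.5], [-1.0]]   # shape (out_features, r)
scale = 2.0
x = [3.0, 4.0]

unmerged = lora_forward(W, A, B, scale, x)       # adapter kept separate
merged = matvec(merge_lora(W, A, B, scale), x)   # adapter folded into W
base_only = matvec(W, x)                         # adapter never applied

print("unmerged :", unmerged)
print("merged   :", merged)
print("base only:", base_only)
```

In this toy case the merged and unmerged outputs match each other and differ from the base-only output, which is what I would expect from the two loading paths above. Output that looks like the base model after merging would be consistent with the adapter weights not having been applied before the merge, though that is only a guess.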
