If you train a model with LoRA (Low-Rank Adaptation), you only train adapters on top of the base model. E.g. if you fine-tune Llama with LoRA, you only train a small set of low-rank matrices (the so-called adapters) that are injected into the linear layers of the original (also called base) model. Hence calling save_pretrained() or push_to_hub() will only save 2 things:
- the adapter configuration (in an adapter_config.json file)
- the adapter weights (typically in a safetensors file).
See ybelkada/opt-350m-lora on the Hub for an example; here, OPT-350m is the base model.
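For context, this is roughly how such an adapter repository is produced in the first place. A minimal sketch with the peft library; the LoRA hyperparameters, target modules and output folder name are only illustrative:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model (OPT-350m in the example above)
base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

# Inject LoRA adapters into the attention projections (illustrative settings)
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base_model, lora_config)

# ... fine-tune as usual ...

# Saving writes only adapter_config.json and the adapter weights, not the base model
model.save_pretrained("opt-350m-lora")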
In order to merge these adapter layers into the base model, one can call the merge_and_unload method. Afterwards, you can call save_pretrained() on it, which will save the full model: the merged weights (as safetensors files) and the configuration (in a config.json file):
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model_name = "facebook/opt-350m"          # the base model
adapter_model_name = "ybelkada/opt-350m-lora"  # the LoRA adapter repository

# Load the base model and attach the adapter weights on top of it
model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_model_name)

# Merge the adapters into the base weights and drop the PEFT wrapper
model = model.merge_and_unload()
model.save_pretrained("my_model")
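The resulting "my_model" folder is a regular Transformers checkpoint, so it can be loaded back without the PeftModel wrapper:

from transformers import AutoModelForCausalLM

# Reload the merged checkpoint; peft is no longer needed for this
model = AutoModelForCausalLM.from_pretrained("my_model")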
One feature of the Transformers library is that it has PEFT integration, which means that you can call from_pretrained() directly on a folder/repository that only contains this adapter_config.json file and the adapter weights, and it will automatically load the weights of the base model + adapters. See PEFT integrations. Hence we could also just have done this:
from transformers import AutoModelForCausalLM

# Point from_pretrained() at a repo/folder that only contains adapter files,
# e.g. the ybelkada/opt-350m-lora example above; the base model is loaded automatically
model = AutoModelForCausalLM.from_pretrained("ybelkada/opt-350m-lora")
model.save_pretrained("my_model")
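If you want to check which files ended up in the output folder, a quick listing from the standard library does the job:

import os

# Show the files that save_pretrained() wrote
print(sorted(os.listdir("my_model")))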