Further finetuning a LoRA finetuned CausalLM Model

Hey everyone,

I am a bit unsure how to proceed regarding the mentioned topic.

The baseline is a model loaded with Hugging Face’s AutoModelForCausalLM and fine-tuned with PEFT using LoRA, with the adapter weights subsequently merged into the base model.

I now want to further fine-tune the model without losing its original properties, in this case via instruction fine-tuning or prefix tuning.

My approach would be the following:

 model = AutoModelForCausalLM.from_pretrained(
        use_cache=False if gradient_checkpointing else True

model = create_peft_config(model)

output_dir = "/tmp"
training_args = TrainingArguments(

trainer = Trainer(



del model
del trainer

peft_config = PeftConfig.from_pretrained(output_dir)
model = AutoModelForCausalLM.from_pretrained(
model = PeftModel.from_pretrained(
os.makedirs("lora", exist_ok=True)

merged_model = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained(model_id)

In principle, I load the original model with the merged weights, fine-tune it on new data (again with PEFT and LoRA), and afterwards merge the new weights into the base model again.
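In weight terms, the loop described above can be sketched with a toy example. This is plain Python on 2x2 matrices, not the PEFT implementation; the helper names `matmul` and `merge_lora` are hypothetical, and the update rule W + (alpha / r) * B @ A is the standard LoRA formulation. It only shows that two merge rounds accumulate two low-rank updates in the dense weights; it says nothing about whether the original capabilities survive.

```python
# Toy illustration of two successive LoRA merges (hypothetical helpers, not PEFT code).

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

base = [[1.0, 0.0], [0.0, 1.0]]

# Round 1: a rank-1 adapter trained on the original base, then merged.
b1, a1 = [[1.0], [0.0]], [[0.5, 0.5]]
merged_once = merge_lora(base, b1, a1, alpha=1, rank=1)

# Round 2: a new rank-1 adapter trained on top of the merged weights, merged again.
b2, a2 = [[0.0], [2.0]], [[1.0, 0.0]]
merged_twice = merge_lora(merged_once, b2, a2, alpha=1, rank=1)

print(merged_once)   # base plus the first update
print(merged_twice)  # base plus both updates
```

The second adapter is only meaningful relative to `merged_once`, since that is the model it was trained against.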

Is this a sensible approach, or is there reason to think that I might significantly compromise the original capabilities by doing so? If something speaks against it, what would be a better approach?

Kind regards and thanks in advance


Were you able to figure it out? I am also looking for the same thing.


I am also interested in fine-tuning a CausalLM model with PEFT, saving it, then continuing to fine-tune on a different dataset, and repeating this pattern. The goal is to avoid wasting an experiment if training fails after an extended period, since training on the entire dataset will take a long time in my case. I am getting errors when attempting to further fine-tune the previously fine-tuned model, and I am now also seeing posts stating that it is advisable to train on the combined dataset all at once so that the broader patterns are learned.

So even if I could figure out how to iteratively fine-tune, is it even advisable?

Now I am reading about merging multiple fine-tuned models. Does this also risk hurting performance, and should we simply train the model on the entire dataset all at once?


I think there are several possibilities:

  • adding an adapter which you train on task 1, then merge with the base model, and add another adapter which is trained on task 2.
  • adding an adapter which you train on one dataset, then you train it on another dataset, etc. (so only 1 adapter). Eventually you could merge it into the base model.
  • adding and training several adapters (one for each task/dataset) separately and merging them simultaneously

For the first case, if you fine-tuned an xxxForCausalLM model using PEFT (i.e. by adding an adapter), you can load the model with its adapter using the AutoPeftModelForCausalLM class. Note that the adapter weights will still be separated from the base model. You could merge the adapter weights into the base model by calling the merge_and_unload method. Next, you could add another adapter, and apply PEFT again.

Let’s show this in code.

Step 1: load base model + adapter weights

First of all, note that there are 2 ways to load a base model with its adapter weights:

from peft import PeftModel, PeftConfig, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM

# let's say you fine-tuned OPT using PEFT

# method 1: separately
base_model_id = "facebook/opt-350m"
adapter_id = "ybelkada/opt-350m-lora"
base_model = AutoModelForCausalLM.from_pretrained(base_model_id)
base_with_adapters_model = PeftModel.from_pretrained(base_model, adapter_id)

# method 2: conveniently with the AutoPeftModelForCausalLM class
base_with_adapters_model = AutoPeftModelForCausalLM.from_pretrained("ybelkada/opt-350m-lora")

Note that in both cases, the adapter weights are still stored separately from the base model (you can see this in the state dictionary, which still includes separate base_model and adapter keys).

Step 2: merge adapter weights into the base model

Hence, you can merge the adapter parameters with the base model:

# now we just have a regular AutoModelForCausalLM Transformers model
model = base_with_adapters_model.merge_and_unload()

Step 3: add another adapter

Next, we could apply PEFT again by adding another adapter:

# next, we could apply PEFT again by adding another adapter
from peft import get_peft_model, LoraConfig, TaskType

lora_config = LoraConfig(
    target_modules=["q_proj", "v_proj"],
)

base_model_with_new_adapter = get_peft_model(model, lora_config)

You can fine-tune the base_model_with_new_adapter using the Trainer API or PyTorch.

See this guide for more info: PEFT integrations.
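For the third possibility listed above (several adapters trained separately, then merged together), a minimal sketch could look like this. It assumes all adapters were trained against the same base model; the function name `combine_adapters`, the adapter names, and the paths are hypothetical. `load_adapter`, `add_weighted_adapter`, and `set_adapter` are PEFT methods, but check the signatures against your installed version. As far as I know, the "linear" combination type expects all adapters to share the same rank.

```python
def combine_adapters(base_model_id, adapter_paths, weights, combined_name="combined"):
    """Load several LoRA adapters onto one base model and blend them.

    adapter_paths maps a (hypothetical) adapter name to its saved directory or Hub id.
    weights holds one blending weight per adapter, in the same order.
    """
    if len(adapter_paths) != len(weights):
        raise ValueError("need exactly one weight per adapter")

    # Imported lazily so the heavy libraries are only loaded when the function runs.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    names = list(adapter_paths)
    base_model = AutoModelForCausalLM.from_pretrained(base_model_id)

    # The first adapter creates the PeftModel, the rest are attached next to it.
    model = PeftModel.from_pretrained(
        base_model, adapter_paths[names[0]], adapter_name=names[0]
    )
    for name in names[1:]:
        model.load_adapter(adapter_paths[name], adapter_name=name)

    # Build a new adapter as a weighted sum of the loaded ones.
    # Assumption: "linear" requires equal LoRA ranks across adapters.
    model.add_weighted_adapter(
        adapters=names,
        weights=list(weights),
        adapter_name=combined_name,
        combination_type="linear",
    )
    model.set_adapter(combined_name)
    return model
```

Calling `merge_and_unload()` on the returned model should then fold the combined adapter into the base weights, the same way as in step 2.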


Thanks a lot for your reply @nielsr.

I was not aware that a model trained with PEFT is considered an “adapter”. I thought it was essentially the same as a regular fine-tuned model, because I have been able to load the saved PEFT-trained model with AutoModelForCausalLM by passing the local path where it was saved, and to run inference with it. I did not need to load it with PeftModel.from_pretrained(), passing both the adapter ID and a loaded base model.

I see the second method you offered only requires the adapter ID, so anyways this gives me a lot to consider and look into and I really appreciate your reply.


Well PEFT is actually integrated into the Transformers library, see here: PEFT integrations.

This means that if you do the following:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("ybelkada/opt-350m-lora")

it will automatically load the base model + adapter weights (as the base_model_name_or_path is present in the config). Note that the adapter weights will still be separated, which you can see by printing the parameter names and shapes:

for name, param in model.named_parameters():
    print(name, param.shape)

which prints (among other things):

model.decoder.layers.23.self_attn.k_proj.weight torch.Size([1024, 1024])
model.decoder.layers.23.self_attn.k_proj.bias torch.Size([1024])
model.decoder.layers.23.self_attn.v_proj.base_layer.weight torch.Size([1024, 1024])
model.decoder.layers.23.self_attn.v_proj.base_layer.bias torch.Size([1024])
model.decoder.layers.23.self_attn.v_proj.lora_A.default.weight torch.Size([16, 1024])
model.decoder.layers.23.self_attn.v_proj.lora_B.default.weight torch.Size([1024, 16])
model.decoder.layers.23.self_attn.q_proj.base_layer.weight torch.Size([1024, 1024])
model.decoder.layers.23.self_attn.q_proj.base_layer.bias torch.Size([1024])
model.decoder.layers.23.self_attn.q_proj.lora_A.default.weight torch.Size([16, 1024])
model.decoder.layers.23.self_attn.q_proj.lora_B.default.weight torch.Size([1024, 16])
model.decoder.layers.23.self_attn.out_proj.weight torch.Size([1024, 1024])
model.decoder.layers.23.self_attn.out_proj.bias torch.Size([1024])

So here you can clearly see that the adapter weights (lora_A, lora_B) are added to the query and value projection layers (base_layer).
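To put the shapes above in perspective, here is a small arithmetic sketch. It uses only the numbers from the printout (hidden size 1024, rank 16); the helper name `lora_param_count` is made up for illustration.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters in one LoRA pair: lora_A is (rank x in), lora_B is (out x rank)."""
    return rank * in_features + out_features * rank

hidden = 1024  # from the [1024, 1024] base_layer weights above
rank = 16      # from the [16, 1024] lora_A weights above

dense = hidden * hidden                             # one frozen projection matrix
per_layer = lora_param_count(hidden, hidden, rank)  # trainable adapter weights for it
ratio = per_layer / dense

print(dense, per_layer, ratio)
```

So for each adapted projection, the trainable LoRA weights are about 3% of the size of the frozen matrix they sit on.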


Hi @nielsr ,

Thanks for writing this, it was very helpful.

Can you please extend this to show how exactly we would load the step 2 fine-tuned model for inference, using PeftModel.from_pretrained(some_model, adapter_2)?
What exactly is some_model here?
a) the original model “facebook/opt-350m”, or
b) base_with_adapters_model.merge_and_unload()?

I am really confused about this. Any help would be really appreciated.
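A sketch of how this could be loaded, assuming the second adapter was trained on top of the merged model as in steps 2 and 3 above. In that case the second adapter's weights are deltas relative to the merged model, so option (b) is the matching base. The function name and the `adapter_2_path` argument are hypothetical.

```python
def load_second_stage_model(base_model_id, adapter_1_id, adapter_2_path):
    """Rebuild the merged stage-1 model, then attach the stage-2 adapter to it."""
    # Imported lazily so the heavy libraries are only loaded when the function runs.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    # Step 1: original base model plus the first adapter.
    base_model = AutoModelForCausalLM.from_pretrained(base_model_id)
    stage_1 = PeftModel.from_pretrained(base_model, adapter_1_id)

    # Step 2: fold the first adapter into the base weights.
    merged = stage_1.merge_and_unload()

    # Step 3: the second adapter was trained against `merged`, so load it on top of that.
    return PeftModel.from_pretrained(merged, adapter_2_path)
```

To avoid repeating the merge on every load, it may be simpler to call `save_pretrained` on the merged model once and use that directory as the base afterwards. Loading the second adapter on the untouched "facebook/opt-350m" would drop the first adapter's contribution.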

Hi, could you please elaborate on the second possibility:

adding an adapter which you train on one dataset, then you train it on another dataset, etc. (so only 1 adapter). Eventually you could merge it into the base model

Hi, is there a way to implement the second possibility?
“adding an adapter which you train on one dataset, then you train it on another dataset, etc. (so only 1 adapter). Eventually you could merge it into the base model”

Hi @luispintoc @mayjul
As per nielsr, if I understand correctly, you can fine-tune using LoRA on your custom datasets:

  • Create a dataset for task 1 and fine-tune on it using LoRA, but don't save the adapter yet.
  • Take a new dataset for task 2 and continue fine-tuning on top of the previously fine-tuned weights.
  • Repeat this iteratively, and once you are finished with the different datasets, save the adapter weights.

These adapter weights should contain the knowledge from the different iterations you ran on the different datasets.

This is just one possibility. Alternatively, you can combine the different datasets and train on them all at once.
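The steps above could be sketched roughly as follows. This is a sketch under assumptions, not a tested recipe: both function names are made up, and each dataset is assumed to be already tokenized with input_ids, attention_mask, and labels of equal length so the default collator works. `is_trainable=True` is a real argument of `PeftModel.from_pretrained`; without it the adapter is loaded frozen for inference, which may be one reason for errors when resuming.

```python
def train_adapter_sequentially(model, datasets, output_dir):
    """Train one PEFT model on several datasets in turn, saving the adapter per stage."""
    # Imported lazily so the heavy libraries are only loaded when the function runs.
    from transformers import Trainer, TrainingArguments

    for stage, dataset in enumerate(datasets):
        stage_dir = f"{output_dir}/stage_{stage}"
        trainer = Trainer(
            model=model,
            args=TrainingArguments(output_dir=stage_dir),
            train_dataset=dataset,
        )
        trainer.train()
        # On a PeftModel this writes only the adapter weights, not the base model.
        model.save_pretrained(stage_dir)
    return model


def reload_adapter_for_training(base_model_id, adapter_dir):
    """Reload a saved adapter so that training can continue in a later session."""
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base_model = AutoModelForCausalLM.from_pretrained(base_model_id)
    # is_trainable=True keeps the LoRA weights unfrozen after loading.
    return PeftModel.from_pretrained(base_model, adapter_dir, is_trainable=True)
```

Saving after every stage is optional, but it gives a restart point if a later stage fails.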

Do correct me if you meant otherwise, nielsr.

Hi @JulianGerhard

The approach is fine and it does work. I tried something similar for my use case, although I am still skeptical about how much the fine-tuned adapters actually influence the results with updated knowledge.

It's quite interesting to read. But how were models updated before LoRA/PEFT, and can they still be updated or fine-tuned that way?

Actually, I have the same question as you. I want to know how much the original base model has been affected, because when I evaluate the model it seems to be unchanged.
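One quick thing to check in that situation is whether the adapter learned anything at all. With PEFT's default initialization, lora_B starts at zero, so an adapter that never received gradient updates adds exactly nothing to the base layer. A minimal sketch of such a check follows; the helper name `adapter_is_noop` is hypothetical, and it assumes a plain LoRA adapter that has not been merged yet.

```python
def adapter_is_noop(model, atol=0.0):
    """Return True if every lora_B weight is (near) zero, i.e. the adapter adds nothing.

    Works on any object exposing named_parameters() that yields tensor-like values,
    for example a PeftModel before merge_and_unload().
    """
    found = False
    for name, param in model.named_parameters():
        if "lora_B" in name:
            found = True
            # Largest absolute entry of this lora_B matrix.
            if param.detach().abs().max().item() > atol:
                return False
    if not found:
        raise ValueError("no lora_B parameters found; is this an unmerged LoRA model?")
    return True
```

If this returns True after training, the adapter weights were probably frozen or never updated. If it returns False and evaluation still looks unchanged, the evaluation set may simply not exercise what the adapter learned.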


Yes, I think LoRAs are not great unless they are saved after the first pass, so if the LoRA training was repeatedly stopped and restarted, it is worthless.
I also did a low-rank decomposition on my model and added it to the Mistral base, and it did not make much difference. Personally, I merge all my LoRAs into my model.
A LoRA should always be created from the Mistral base, not from a model built on top of another model. Then the LoRA can be used as an add-on.

Or use only your own base model (nothing merged); then the LoRAs you make will have an effect. But if you merge them, your model has already moved forward from the point where the previous LoRAs were created, so adding them again will send you back in time. That is only useful if your latest training messed things up and the model is doing mad stuff: then you can use your LoRAs to revert a little bit back in time.

So create a great base model.
Then create many LoRAs for specific tasks (so you don't mess with your model), and use the LoRAs as adapters for your model. When you share them, people need to use your base, unless you did not upgrade it from the original base.
But:
If you upgrade your base, then all previous LoRAs are invalid and won't have the intended effect, and they may degrade the model back to its previous bad responses.


How do you stack LoRAs or add multiple LoRAs to a model? Are those the same thing, or is “stacking” something different?

What you are saying is very interesting, but I am not sure I fully understand it. I would really appreciate it if you could share a simple code example of what you mean by adding as many LoRAs as you like without merging.
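A minimal sketch of keeping several LoRAs attached to one unmerged base model and switching between them. The function name, adapter names, and paths are placeholders; `adapter_name`, `load_adapter`, and `set_adapter` are PEFT APIs, though behavior can differ between versions. All adapters are assumed to have been trained against the same base model.

```python
def load_switchable_adapters(base_model_id, adapter_paths):
    """Attach several LoRA adapters to one base model without merging any of them.

    adapter_paths maps a (hypothetical) adapter name to its saved directory or Hub id.
    """
    if not adapter_paths:
        raise ValueError("adapter_paths must contain at least one adapter")

    # Imported lazily so the heavy libraries are only loaded when the function runs.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    names = list(adapter_paths)
    base_model = AutoModelForCausalLM.from_pretrained(base_model_id)

    # The first adapter wraps the base model, later ones are added alongside it.
    model = PeftModel.from_pretrained(
        base_model, adapter_paths[names[0]], adapter_name=names[0]
    )
    for name in names[1:]:
        model.load_adapter(adapter_paths[name], adapter_name=name)
    return model

# Example usage (names and paths are placeholders):
# model = load_switchable_adapters(
#     "facebook/opt-350m", {"task_a": "./lora_task_a", "task_b": "./lora_task_b"}
# )
# model.set_adapter("task_b")  # only task_b is active for the next forward pass
```

The base weights stay untouched here, so each adapter remains valid as long as the base model itself is not changed.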