LLM2Vec QLoRA quantization after merge_and_unload()

I am currently working on a project where I want to fine-tune an LLM2Vec model with QLoRA. The ‘McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-supervised’ model consists of two LoRA adapters which, according to the documentation, should be loaded in the following way:

from transformers import AutoConfig, AutoModel
from peft import PeftModel

config = AutoConfig.from_pretrained(...)
model = AutoModel.from_pretrained(...)

# load the first adapter and merge it into the base weights
model = PeftModel.from_pretrained(
    model, first_adapter,
)
model = model.merge_and_unload()

# load the second adapter on top of the merged model
model = PeftModel.from_pretrained(
    model, second_adapter,
)

I now want to fine-tune the LLM2Vec model on a downstream task using QLoRA. The usual approach for QLoRA, according to the Hugging Face documentation, is to load the model with from_pretrained and a 4-bit bnb config (LINK). Merging the adapters into an already-quantized model introduces rounding errors and degrades model performance, so I am looking for a way to apply the quantization only after the adapters have been merged into the base model. Thanks for the help in advance.
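For reference, this is roughly the standard QLoRA loading I am referring to, as I understand the Hugging Face docs (the quantization settings are just the commonly recommended NF4 values, and base_model_id is a placeholder):

import torch
from transformers import AutoModel, BitsAndBytesConfig

# typical 4-bit QLoRA quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# the base weights are quantized directly at load time,
# which is exactly what I want to avoid before the adapters are merged
model = AutoModel.from_pretrained(
    base_model_id,  # placeholder
    quantization_config=bnb_config,
)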

What I tried so far:

  1. Calling save_pretrained after merge_and_unload of the two adapters and then loading the result with from_pretrained and the quantization config.
  • I get the error “UnboundLocalError: cannot access local variable ‘active_adapters’ where it is not associated with a value” while saving the model with save_pretrained(). I am not sure whether this is a bug or intended behaviour.
# ...previous code
model = model.merge_and_unload()
model.save_pretrained(...)
# UnboundLocalError: cannot access local variable 'active_adapters' where it is not associated with a value

# intended next step: reload the merged checkpoint with 4-bit quantization
bnb_config = BitsAndBytesConfig(...)  # 4-bit config, see above
quantized_model = AutoModel.from_pretrained(..., quantization_config=bnb_config)
  2. Calling save_pretrained after adding an empty LoRA adapter for fine-tuning and then loading the result with from_pretrained and the quantization config.
# ...previous code
from peft import LoraConfig, get_peft_model

model = model.merge_and_unload()

# attach a fresh, empty LoRA adapter so that save_pretrained works
peft_config = LoraConfig(...)
model = get_peft_model(model, peft_config)
model.save_pretrained(...)

# load the checkpoint with 4-bit quantization and re-attach the empty adapter
bnb_config = BitsAndBytesConfig(...)  # 4-bit config, see above
quantized_model = AutoModel.from_pretrained(..., quantization_config=bnb_config)
quantized_peft_model = PeftModel.from_pretrained(
    quantized_model,
    empty_adapter,
)
  • This seems to work. I am not sure, though, whether this is the intended way to do it, or why I can't save the model without an adapter. A sketch of how I would continue the fine-tuning from here is below.
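For completeness, here is a sketch of how I would continue the QLoRA fine-tuning on top of attempt 2; I am assuming that prepare_model_for_kbit_training from peft and is_trainable=True are the right way to make the re-attached adapter trainable (merged_checkpoint_dir and empty_adapter are placeholders):

import torch
from transformers import AutoModel, BitsAndBytesConfig
from peft import PeftModel, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# load the merged checkpoint with 4-bit quantization
quantized_model = AutoModel.from_pretrained(
    merged_checkpoint_dir,  # placeholder: directory from model.save_pretrained(...) above
    quantization_config=bnb_config,
)

# prepare the quantized weights for k-bit training
# (upcasts layer norms and enables gradient checkpointing by default)
quantized_model = prepare_model_for_kbit_training(quantized_model)

# re-attach the empty adapter and mark it as trainable for the downstream task
quantized_peft_model = PeftModel.from_pretrained(
    quantized_model,
    empty_adapter,  # placeholder path to the saved empty adapter
    is_trainable=True,
)
quantized_peft_model.print_trainable_parameters()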

Here is the architecture of the model after merge_and_unload():

MistralEncoderModel(
  (embed_tokens): Embedding(32000, 4096)
  (layers): ModuleList(
    (0-31): 32 x ModifiedMistralDecoderLayer(
      (self_attn): ModifiedMistralSdpaAttention(
        (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
        (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
        (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (rotary_emb): MistralRotaryEmbedding()
      )
      (mlp): MistralMLP(
        (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
        (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
        (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
        (act_fn): SiLU()
      )
      (input_layernorm): MistralRMSNorm()
      (post_attention_layernorm): MistralRMSNorm()
    )
  )
  (norm): MistralRMSNorm()
)