PEFT model from_pretrained: loading in 8/4-bit

So I’m training this QLoRA model and then saving the adapter.

Then, I do

import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = 'tiiuae/falcon-7b'

tokenizer = AutoTokenizer.from_pretrained(
    model_name, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    # torch_dtype=torch.bfloat16,
    # load_in_8bit=True,
    # device_map={"":"cpu"},
    # quantization_config=bnb_config,
    # offload_folder="offload",
)


adapters_path = f'./checkpoints/from_s3/{model_name}'

model = PeftModel.from_pretrained(model, adapters_path)
model = model.merge_and_unload()

print(f"Successfully loaded the model {model_name} into memory")

to load the adapter and finally merge everything.
Now I would like to accomplish the same result without having to load the whole model in 32/16-bit precision, but there doesn’t seem to be a way…

It’s very dumb that I can train the model in this sci-fi quantized version but need a ton of RAM to load it back, haha.
And the documentation is very unclear on this, but anyway.

Any workaround?

My understanding is that if I want to load back the 4-bit version of my QLoRA-finetuned model, I need to either:

  1. do model.merge_and_unload() after training and then save the merged model all together (a rough sketch of this is below), or
  2. load the model and adapter in 16-bit, then model.merge_and_unload(), then save it (lol) and then load it back in 4-bit (mega lol)
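A rough sketch of what option 1 might look like; trainer and tokenizer are the objects from the training run, the paths are placeholders, and merging straight after QLoRA training assumes a peft version that can merge into (i.e. dequantize) a 4-bit base:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 1. Right after training: merge the LoRA weights into the base and save the result.
merged = trainer.model.merge_and_unload()
merged.save_pretrained("./merged_model")      # placeholder path
tokenizer.save_pretrained("./merged_model")

# 2. Later: reload the merged checkpoint directly in 4-bit, no fp16/fp32 pass needed.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "./merged_model",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)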

Hey, did you find a way around this? In your second point, do you mean loading the model in 16-bit as in a ‘sharded’ version?

I am also trying to do something similar with an LLM, but I’m facing issues. It would be nice to get some pointers on this.

Still nothing here.

I’m still looking for a way to store and load both the model and (possibly multiple) adapters in 8-bit or 4-bit!
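For the multiple-adapter part, peft does expose an API for attaching several adapters to one quantized base and switching between them; a minimal sketch, with placeholder adapter paths:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Base model loaded once, in 4-bit.
base = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Attach a first adapter, then load more by name (paths are placeholders).
model = PeftModel.from_pretrained(base, "./adapters/task_a", adapter_name="task_a")
model.load_adapter("./adapters/task_b", adapter_name="task_b")

# Pick which adapter is active at inference time.
model.set_adapter("task_a")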

Also, I am struggling to understand what happens when I set the load_in_8bit flag while loading a model that’s stored in 32-bit format… is it sort of streamed through a quantization process, or is it first loaded into memory and then quantized?
Is there a link to the code for this?
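As far as I understand, with load_in_8bit the checkpoint shards are quantized as they are loaded, so the full fp32 copy never has to sit on the GPU, but I haven’t traced the code myself. One way to at least see the end result is to inspect the module types after loading (a sketch):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    load_in_8bit=True,
    device_map="auto",
    trust_remote_code=True,
)

# The nn.Linear layers should have been replaced by bitsandbytes 8-bit modules
# (Linear8bitLt) during loading; print their types to confirm.
for name, module in model.named_modules():
    if "Linear" in type(module).__name__:
        print(name, type(module).__name__)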

THIS IS NOT AN ADVERTISEMENT, SIMPLY TRYING TO GIVE AN ALTERNATE SOURCE OF INFO AS NO ONE REPLIES IN THESE FORUMS - EVER. THANKS FOR FLAGGING AS SPAM. HOW ABOUT REPLYING INSTEAD OF REPORTING???
Have you tried code_your_own_AI on YouTube?

Boost Fine-Tuning Performance of LLM: Optimal Architecture w/ PEFT LoRA Adapter-Tuning on Your GPU

He walks you through the steps in a slow, methodical tutorial and then shows the code. In the video I posted above he talks about what happens, and in this one below he goes way into it!

PEFT LoRA Explained in Detail - Fine-Tune your LLM on your local GPU
^ I think this one can really help explain it exactly, or at least one of his videos. He does a great one about QLoRA.


I used the following hack. Hope it helps someone or points you in the direction of a less hacky solution. I have some ugly Colab notebooks here for reference.
In the training phase I use SFTTrainer from trl to train a 4-bit version of the falcon model and save the QLoRA weights using trainer.save_model('./qlora_weights/').
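A minimal sketch of what that training setup might look like (the dataset, LoRA hyperparameters and output paths are placeholders, and the dataset_text_field / max_seq_length arguments follow the older trl SFTTrainer API):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Base model loaded in 4-bit for QLoRA training.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# LoRA config; target_modules is the usual choice for falcon, adjust for your model.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,              # your text dataset (placeholder)
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,
    args=TrainingArguments(output_dir="./qlora_out", per_device_train_batch_size=1),
)
trainer.train()
trainer.save_model('./qlora_weights/')  # saves only the adapter weights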

In the inference notebook I want to load the model and the newly created weights, again in 4-bit, so I can fit into the free tier on Colab. The hack was the last line in the snippet below:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = 'tiiuae/falcon-7b'  # falcon base model used for training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False

from peft import PeftModel, PeftConfig, prepare_model_for_kbit_training
config = PeftConfig.from_pretrained('drive/My Drive/falcon_weights/bank_regs_qlora')
model = PeftModel.from_pretrained(model, 'drive/My Drive/falcon_weights/bank_regs_qlora')
# Even though we are not going to train the model, I struggled with the implementation of some of the libraries
# that could not reconcile the different floating point precision in the Model and in the LoRAs. The command
# below manages to reconcile the different precisions
model = prepare_model_for_kbit_training(model)

After this I created the pipeline and it all worked out OK. Hopefully I understood your question correctly and this helps you load the trained model using less GPU RAM than loading the full-precision model and weights.
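The pipeline step would look something like this (a sketch; the tokenizer load and the prompt are placeholders):

from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Text-generation pipeline on top of the 4-bit base + LoRA adapter loaded above.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=200,
)

print(pipe("Your prompt here")[0]["generated_text"])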

The code you have commented out when loading the base model is all that’s needed to load a large model with LoRA weights onto a GPU with less memory.

Below is the code I used to load a llama-2-13b-hf model in 8-bit, along with LoRA weights I trained, onto a T4 GPU (15 GB) on Colab for running inference.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import PeftModel

model_id = "meta-llama/Llama-2-13b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)

model = PeftModel.from_pretrained(model, "vijayfound/ludwig-llama2-demo")

pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                max_new_tokens=200,
)

prompt = "Sort an array:"

print(pipe(prompt)[0]['generated_text'])