PEFT model from_pretrained: loading in 8/4-bit

So I’m training this QLoRA model and then saving the adapter.

Then, I do

import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = 'tiiuae/falcon-7b'

tokenizer = AutoTokenizer.from_pretrained(
    model_name, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    # torch_dtype=torch.bfloat16,
    # load_in_8bit=True,
    # device_map={"":"cpu"},
    # quantization_config=bnb_config,
    # offload_folder="offload",
)


adapters_path = f'./checkpoints/from_s3/{model_name}'

model = PeftModel.from_pretrained(model, adapters_path)
model = model.merge_and_unload()

print(f"Successfully loaded the model {model_name} into memory")

to load the adapter and finally merge everything.
Now I would like to accomplish the same result without having to load the whole model in 32/16-bit precision, but there doesn’t seem to be a way…

It’s very dumb that I can train the model in this sci-fi quantized version but need a ton of RAM to load it back, haha.
And the documentation is very unclear on this, but anyway.

Any workaround?

My understanding of this is that if I want to load back the 4-bit version of my QLoRA-finetuned model I need to either:

  1. do model.merge_and_unload() after training and then save the model all together, or
  2. load the model and adapter in 16-bit, then model.merge_and_unload(), then save it (lol) and then load it in 4-bit (super lol) (see the sketch after this list)
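
Something like option 2 is sketched below; the merged-output directory is made up and I haven’t verified this end to end, so treat it as a rough sketch:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

model_name = 'tiiuae/falcon-7b'
adapters_path = f'./checkpoints/from_s3/{model_name}'

# Load the base once in 16-bit, merge the adapter, and save the merged weights.
base = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, trust_remote_code=True)
merged = PeftModel.from_pretrained(base, adapters_path)
merged = merged.merge_and_unload()
merged.save_pretrained('./falcon-7b-merged')  # hypothetical output directory

# From then on, reload the merged checkpoint directly in 4-bit.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    './falcon-7b-merged', quantization_config=bnb_config, trust_remote_code=True)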

Hey, did you find a way around this? In your second point, do you mean loading the model in 16-bit as in a ‘sharded’ version?

I am also trying to do something similar with an LLM, but I’m facing issues. It would be nice to get some pointers on this.

Still nothing here.

I’m still looking for a way to store and load both the model and (possibly multiple) adapters in 8-bit or 4-bit!
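
The closest I’ve gotten is to keep the base model quantized and keep the adapters un-merged, switching between them by name. A rough sketch (the adapter paths are made up and I’m not sure this is the intended way):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# The base model stays quantized in 4-bit the whole time; the adapters sit on top of it.
base = AutoModelForCausalLM.from_pretrained(
    'tiiuae/falcon-7b', quantization_config=bnb_config, trust_remote_code=True)

model = PeftModel.from_pretrained(base, './adapters/task_a', adapter_name='task_a')
model.load_adapter('./adapters/task_b', adapter_name='task_b')
model.set_adapter('task_b')  # pick which adapter is active for inference

As far as I can tell you can’t merge an adapter into the 4-bit base this way, so there is a small LoRA overhead at inference time, but memory-wise it stays within the quantized budget.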

Also, I am struggling to understand what happens when I set the load_in_8bit flag while loading a model that’s stored in 32-bit format… is it sort of streamed through a quantization process, or is it first loaded in memory and then quantized?
Is there a link to the code for this?
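
For what it’s worth, one way to poke at this is to check the module types right after from_pretrained returns; the linear layers should already be bitsandbytes Linear8bitLt modules, though that alone doesn’t say whether the quantization is streamed per shard or done after a full-precision load. A sketch, assuming bitsandbytes is installed:

import bitsandbytes as bnb
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'tiiuae/falcon-7b', load_in_8bit=True, device_map='auto', trust_remote_code=True)

# Count the linear layers that were replaced by 8-bit modules and check the footprint.
n_8bit = sum(1 for m in model.modules() if isinstance(m, bnb.nn.Linear8bitLt))
print(n_8bit, "Linear8bitLt modules")
print(f"~{model.get_memory_footprint() / 1e9:.1f} GB in memory")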

THIS IS NOT AN ADVERTISEMENT, SIMPLY TRYING TO GIVE AN ALTERNATE SOURCE OF INFO AS NO ONE REPLIES IN THESE FORUMS - EVER. THANKS FOR FLAGGING AS SPAM. HOW ABOUT REPLYING INSTEAD OF REPORTING???
Have you tried code_your_own_AI on YouTube?

Boost Fine-Tuning Performance of LLM: Optimal Architecture w/ PEFT LoRA Adapter-Tuning on Your GPU

He walks you through the steps in a slow, methodical, easy-to-understand tutorial and then shows the code. In the video I posted above he talks about what happens, and in the one below he goes way into it!

PEFT LoRA Explained in Detail - Fine-Tune your LLM on your local GPU
^ I think this one can really help explain it exactly, or at least one of his videos will. He does a great one about QLoRA.


I used the following hack. Hope it helps someone or points you in the direction of a less hacky solution. I have some ugly Colab notebooks here for reference.
In the training phase I use SFTTrainer from trl to train a 4-bit version of the Falcon model and save the QLoRA weights using trainer.save_model('./qlora_weights/').

In the inference notebook I want to load the model and the newly created weights, again in 4-bit, so I can fit in the free tier on Colab. The hack was the last line in the snippet below:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# model_name is the Falcon base model used for training (defined earlier in the notebook).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False

from peft import prepare_model_for_kbit_training

from peft import PeftModel, PeftConfig
config = PeftConfig.from_pretrained('drive/My Drive/falcon_weights/bank_regs_qlora')
model = PeftModel.from_pretrained(model, 'drive/My Drive/falcon_weights/bank_regs_qlora')
# Even though we are not going to train the model, I struggled with the implementation of some of the libraries
# that could not reconcile the different floating point precision in the Model and in the LoRAs. The command
# below manages to reconcile the different precisions
model = prepare_model_for_kbit_training(model)

After this I created the pipeline and everything worked out OK. Hopefully I understood your question and this helps you load the trained model using less GPU RAM than loading the full-precision model and weights.
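
For completeness, a sketch of what the pipeline step can look like (the prompt and max_new_tokens are placeholders, and I’m assuming the tokenizer comes from the same base checkpoint):

from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

pipe = pipeline(
    "text-generation",
    model=model,        # the 4-bit base with the QLoRA adapter attached
    tokenizer=tokenizer,
    max_new_tokens=100,
)
print(pipe("Your prompt here")[0]["generated_text"])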


The code you have commented out when loading the base model is all that’s needed to load a large model with LoRA weights into a GPU with less memory.

Below is the code I used to load a Llama-2-13b-hf model in 8-bit, along with LoRA weights I trained, into a T4 GPU (15 GB) on Colab for running inference.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import PeftModel

model_id = "meta-llama/Llama-2-13b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The base model is quantized to 8-bit as it loads, so it fits in the T4's 15 GB.
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)

# Attach the trained LoRA weights on top of the quantized base.
model = PeftModel.from_pretrained(model, "vijayfound/ludwig-llama2-demo")

pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                max_new_tokens=200,
)

prompt = "Sort an array:"

print(pipe(prompt)[0]['generated_text'])