So I’m training this QLoRA model and then saving the adapter.
Then, I do:
import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
model_name = 'tiiuae/falcon-7b'
tokenizer = AutoTokenizer.from_pretrained(
    model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    # torch_dtype=torch.bfloat16,
    # load_in_8bit=True,
    # device_map={"": "cpu"},
    # quantization_config=bnb_config,
    # offload_folder="offload",
)
adapters_path = f'./checkpoints/from_s3/{model_name}'
model = PeftModel.from_pretrained(model, adapters_path)
model = model.merge_and_unload()
print(f"Successfully loaded the model {model_name} into memory")
to load the adapter and finally merge the stuff.
Now I would like to accomplish the same result without having to load the whole model in 32/16-bit precision, but there doesn’t seem to be a way…
It’s very dumb that I can train the model in this sci-fi quantized version but need a ton of RAM to load it back, haha.
And the documentation is very unclear on this, but anyway.
Any workaround?
My understanding of this is that if I want to load back the 4-bit version of my QLoRA-finetuned model I need to either:
- do model.merge_and_unload() after training and then save the whole merged model, or
- load the model and adapter in 16-bit, then model.merge_and_unload(), then save it (lol) and then load it back in 4-bit (stralol) - rough sketch of this one below.
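Here is a rough sketch of what I mean by the second option. The model name and adapter path are the same ones from my snippet above; the merged output directory is just an example:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = 'tiiuae/falcon-7b'
adapters_path = f'./checkpoints/from_s3/{model_name}'
merged_path = './falcon-7b-merged'  # example output directory

# 1. Load the base model and the adapter in 16-bit and merge them
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, adapters_path)
model = model.merge_and_unload()

# 2. Save the merged full-precision model to disk
model.save_pretrained(merged_path)

# 3. Reload the merged model quantized to 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    merged_path,
    quantization_config=bnb_config,
    trust_remote_code=True,
)
So you pay the 16-bit memory cost once, at merge time, and from then on only the 4-bit copy is needed.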
Hey, did you find a way around this? In your second point, do you mean loading the model in 16-bit, as in a ‘sharded’ version?
I am also trying to do something similar with an LLM, but facing issues. It would be nice to get some pointers on this.
Still nothing here.
I’m still looking for a way to store and load both the model and (possibly multiple) adapters in 8-bit or 4-bit!
Also, I am struggling to understand what happens when I set the load_in_8bit flag while loading a model that’s stored in 32-bit format… is it sort of streamed through a quantization process, or is it first loaded into memory and then quantized?
Is there a link to the code for this?
THIS IS NOT AN ADVERTISEMENT, SIMPLY TRYING TO GIVE AN ALTERNATE SOURCE OF INFO AS NO ONE REPLIES IN THESE FORUMS - EVER. THANKS FOR FLAGGING AS SPAM. HOW ABOUT REPLYING INSTEAD OF REPORTING?
Have you tried code_your_own_AI on YouTube?
Boost Fine-Tuning Performance of LLM: Optimal Architecture w/ PEFT LoRA Adapter-Tuning on Your GPU
He walks you through the steps in a slow, methodical tutorial and then shows the code. In the video I posted above he talks about what happens, and in the one below he goes way into it!
PEFT LoRA Explained in Detail - Fine-Tune your LLM on your local GPU
^ I think this one can really help explain it exactly. Or at least one of his videos - he does a great one about QLoRA.
I used the following hack. Hope it helps someone or points you in the direction of a less hacky solution. I have some ugly Colab notebooks here for reference.
In the training phase I use SFTTrainer from trl to train a 4-bit version of the Falcon model and save the QLoRA weights using trainer.save_model('./qlora_weights/') - a rough sketch of that setup is below.
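For context, the training side looks roughly like this. The dataset path, LoRA hyperparameters and TrainingArguments are placeholders, and the exact SFTTrainer signature varies between trl versions, so treat this as a sketch rather than exact code:
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTTrainer

model_name = 'tiiuae/falcon-7b'  # base Falcon checkpoint

# Load the base model in 4-bit (QLoRA-style)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)

# LoRA config; values here are just typical placeholders
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # Falcon attention projection
    bias="none",
    task_type="CAUSAL_LM",
)

dataset = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder dataset

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    tokenizer=tokenizer,
    dataset_text_field="text",  # argument handling differs across trl versions
    args=TrainingArguments(
        output_dir="./qlora_checkpoints",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
    ),
)
trainer.train()
trainer.save_model('./qlora_weights/')  # writes only the adapter weights, not the base model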
In the inference notebook I want to load the model and the newly created weights, again in 4-bit, so I can fit in the free tier on Colab. The hack was the last line in the snippet below:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel, PeftConfig, prepare_model_for_kbit_training

model_name = 'tiiuae/falcon-7b'  # or whichever Falcon checkpoint you trained on

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
)
model.config.use_cache = False

config = PeftConfig.from_pretrained('drive/My Drive/falcon_weights/bank_regs_qlora')
model = PeftModel.from_pretrained(model, 'drive/My Drive/falcon_weights/bank_regs_qlora')

# Even though we are not going to train the model, I struggled with some of the libraries,
# which could not reconcile the different floating-point precisions in the model and in the
# LoRA weights. The command below manages to reconcile the different precisions.
model = prepare_model_for_kbit_training(model)
After this I created the pipeline and everything worked out OK. Hopefully I understood your question, and this helps you load the trained model using less GPU RAM than loading the full-precision model and weights.
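For reference, the pipeline part was just the standard transformers text-generation pipeline, roughly like this (the tokenizer line and the prompt are only illustrative):
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

pipe = pipeline(
    "text-generation",
    model=model,       # the 4-bit base model with the QLoRA adapter attached above
    tokenizer=tokenizer,
    max_new_tokens=200,
)
print(pipe("Summarize the new banking regulations:")[0]['generated_text'])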
The code you have commented out when loading the base model is all that’s needed to load a large model with LoRA weights into a GPU with less memory.
Below is the code I used to load a llama-2-13b-hf model in 8-bit, along with LoRA weights I trained, into a T4 GPU (15 GB) on Colab for running inference.
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import pipeline
from peft import PeftModel, PeftConfig
model_id = "meta-llama/Llama-2-13b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
model = PeftModel.from_pretrained(model, "vijayfound/ludwig-llama2-demo")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=200,
)
prompt = "Sort an array:"
print(pipe(prompt)[0]['generated_text'])