I’m trying to test the new QLoRA model (guanaco-7b) locally but I’m facing an error loading the Llama model.
This is the code to load the model:
# Load the model.
# Note: It can take a while to download LLaMA and add the adapter modules.
# You can also use the 13B model by loading in 4bits.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer
model_name = "decapoda-research/llama-7b-hf"
adapters_name = 'timdettmers/guanaco-7b'
print(f"Starting to load the model {model_name} into memory")
m = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
)
m = PeftModel.from_pretrained(m, adapters_name)
#m = m.merge_and_unload()
tok = LlamaTokenizer.from_pretrained(model_name)
tok.bos_token_id = 1
stop_token_ids = [0]
print(f"Successfully loaded the model {model_name} into memory")
And this is the error that I face:
TypeError: LlamaForCausalLM.__init__() got an unexpected keyword argument 'load_in_4bit'
The funny thing is that the same code works when I run it in Colab, so I thought it must be a version issue. But the transformers version is the same in both cases.
# Load the model.
# Note: It can take a while to download LLaMA and add the adapter modules.
# You can also use the 13B model by loading in 4bits.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer, BitsAndBytesConfig
from torch import cuda, bfloat16
model_name = "decapoda-research/llama-13b-hf"
adapters_name = 'timdettmers/guanaco-13b'
print(f"Starting to load the model {model_name} into memory")
m = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4',
    ),
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
)
m = PeftModel.from_pretrained(m, adapters_name)
#m = m.merge_and_unload()
tok = LlamaTokenizer.from_pretrained(model_name)
tok.bos_token_id = 1
stop_token_ids = [0]
print(f"Successfully loaded the model {model_name} into memory")
Thanks, @rhamnett. While your solution is technically correct and it works, it does not quantize the model itself, and as a result my machine runs out of VRAM. Basically, your solution does not use QLoRA, and using it is the whole point.
To reiterate, load_in_4bit=True must be part of the from_pretrained() arguments, or the model is not quantized and the GPU runs out of memory. What I don't understand is why that argument is sometimes not part of the function contract while at other times it is, even though the version numbers match in both cases. It does not make sense to me at all.
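For what it's worth, my understanding (not from this thread, so take it as a hedged guess) is that load_in_4bit as a direct from_pretrained() argument only exists in sufficiently recent transformers releases (4-bit bitsandbytes support landed around 4.30), and that an older transformers forwards the unrecognized keyword to LlamaForCausalLM.__init__(), which produces exactly the TypeError above. A quick way to confirm which builds each environment really imports:
import importlib.metadata as md
import transformers, peft

print("transformers:", transformers.__version__)
print("transformers path:", transformers.__file__)  # reveals which install is actually imported
print("peft:", peft.__version__)
print("bitsandbytes:", md.version("bitsandbytes"))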
I have managed to run my version of the code (shown in the question) in a Docker container without running out of GPU memory, but the same code still fails to load the model directly on my OS. It's a very annoying error.
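Since the Docker container works and the host does not, one thing worth ruling out on the host (just a guess on my part) is whether PyTorch sees the GPU at all, because 4-bit loading depends on a working CUDA setup:
import torch

print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU 0:", torch.cuda.get_device_name(0))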
I tried the above code in my setup, but BitsAndBytesConfig and the other classes are not getting imported. Below is the transformers library version:
Name: transformers
Version: 4.30.2
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /home/user/jupyter_env/lib/python3.11/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: peft, sentence-transformers
The bitsandbytes version details are shown below:
Name: bitsandbytes
Version: 0.40.1.post1
Summary: k-bit optimizers and matrix multiplication routines.
Home-page: https://github.com/TimDettmers/bitsandbytes
Author: Tim Dettmers
Author-email: dettmers@cs.washington.edu
License: MIT
Location: /home/kamal/jupyter_env/lib/python3.11/site-packages
Requires:
Required-by:
The error that occurs is shown below:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
Cell In[16], line 1
----> 1 from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList, BitsAndBytesConfig
2 from torch import cuda, bfloat16
ImportError: cannot import name 'BitsAndBytesConfig' from 'transformers' (/home/user/jupyter_env/lib/python3.11/site-packages/transformers/__init__.py)
The code is in a Jupyter notebook running inside a virtual environment. I'm unable to figure out the cause of the issue.
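As far as I know, BitsAndBytesConfig should be importable from transformers 4.30.2, so an ImportError like this usually means the notebook kernel is importing a different transformers than the one pip show reports. A quick sanity check (run in the same cell that fails) might be:
import sys
print(sys.executable)            # should point at the virtual environment's Python if the kernel uses it

import transformers
print(transformers.__version__)  # version the kernel actually imports
print(transformers.__file__)     # should match the Location reported by pip show
print(hasattr(transformers, "BitsAndBytesConfig"))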
Is there a way to push the adapter model using the .push_to_hub() method? I'm also curious why the TypeError occurred when I installed the GitHub version of the package.
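Regarding push_to_hub: as far as I know, a PeftModel can upload just the adapter weights with the regular push_to_hub() method. A minimal sketch, where the repo id is a placeholder and m/tok are the objects from the loading code above:
from huggingface_hub import login

login()  # or run `huggingface-cli login` beforehand

# Pushes only the adapter files (adapter_config.json + adapter weights), not the base model.
m.push_to_hub("your-username/guanaco-7b-adapter")
tok.push_to_hub("your-username/guanaco-7b-adapter")  # optional: keep the tokenizer with the adapter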