But it is still too large and runs into an Out of Memory error. Is there a way to minimize the memory footprint so it fits in the available VRAM? Some pipeline parameters or something like that? Thanks.
Hi John, I already tried “device_map=auto” and got the following error:
ValueError: You are attempting to load an AWQ model with a device_map that contains a CPU or disk device. This is not supported. Please remove the CPU or disk device from the device_map.
This is my actual code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig
model_id = "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"
quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,  # Note: Update this as per your use-case
    do_fuse=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    offload_buffers=True,
    device_map=0,
    quantization_config=quantization_config,
)
prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]
inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0])
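For what it's worth, a rough back-of-the-envelope estimate (my own numbers, not from the model card) suggests the 70B INT4 weights alone are around 35 GB before the KV cache and fused-module buffers, so a single 24 GB card won't hold them:

# Rough VRAM estimate for a 70B model quantized to 4-bit (assumed numbers, not measured)
num_params = 70e9                        # ~70 billion parameters
weight_gb = num_params * 0.5 / 1e9       # 4 bits = 0.5 bytes per parameter -> ~35 GB
overhead_gb = 2                          # rough allowance for KV cache, buffers, CUDA context
print(f"~{weight_gb + overhead_gb:.0f} GB of VRAM needed")  # ~37 GB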
The other possibility is that the accelerate library is not installed, or the version is old?
It's also possible that the model isn't actually being quantized when loaded and is consuming VRAM at 16-bit precision. The general idea is the same as the code above, and in my Space it does load at 4 bits, though that's an 8B model with BNB.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16)
text_model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=nf4_config, device_map="cuda", torch_dtype=torch.bfloat16).eval()
# text_model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=nf4_config, device_map="auto", torch_dtype=torch.bfloat16).eval()  # In your case, this is how it should be.
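To check whether the weights really loaded at 4 bits (rather than silently falling back to 16-bit), you can print the model's reported footprint after loading; this is just a quick sanity check, not part of the original snippet:

# Sanity check: report how much memory the loaded weights occupy
print(f"Model footprint: {text_model.get_memory_footprint() / 1e9:.1f} GB")
print(f"CUDA allocated:  {torch.cuda.memory_allocated() / 1e9:.1f} GB")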
Maybe it’s not possible because AWQ can’t do on-the-fly quantization…?
I think it says we can’t offload to CPU either. (Maybe I’m reading the table wrong.) GGUF seems to be able to offload to CPU, but GGUF support in Transformers is still incomplete, so maybe that means we should use llama-cpp-python for that purpose.
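If you go the GGUF route, a minimal llama-cpp-python sketch might look like the following; the GGUF filename and the number of offloaded layers are placeholders you'd need to adjust to your own download and VRAM budget:

# Minimal llama-cpp-python sketch (hypothetical GGUF path; tune n_gpu_layers to your VRAM)
from llama_cpp import Llama

llm = Llama(
    model_path="./Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf",  # placeholder filename
    n_gpu_layers=40,  # offload only part of the layers to the GPU; the rest stays on CPU
    n_ctx=2048,       # context window
)
out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
        {"role": "user", "content": "What's Deep Learning?"},
    ],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])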