Run Llama 3.1 70B on 32 GB VRAM?

I’m using an Nvidia V100 GPU with 32 GB of VRAM. I would like to run a Llama 3.1 70B model, and I found this:

https://huggingface.co/hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4

But it is still too large and runs out of memory. Is there a way to minimize the memory footprint so it fits in the available VRAM? Some pipeline parameters or something like that? Thanks.

To do this, it seems to be sufficient to specify device_map="auto".

However, HF libraries in general are still not very good at dealing with quantised files, so it would not be surprising if there were errors.
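
For example, something like this (an untested sketch; it assumes the accelerate package is installed so that device_map="auto" can spread layers across devices):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # let accelerate decide which layers go to GPU, CPU or disk
    low_cpu_mem_usage=True,
)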

Hi John, I already tried device_map="auto" and got the following error.

ValueError: You are attempting to load an AWQ model with a device_map that contains a CPU or disk device. This is not supported. Please remove the CPU or disk device from the device_map.

This is my current code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

model_id = "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"

quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512, # Note: Update this as per your use-case
    do_fuse=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
  model_id,
  torch_dtype=torch.float16,
  low_cpu_mem_usage=True,
  offload_buffers=True,
  device_map=0,
  quantization_config=quantization_config
)

prompt = [
  {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
  {"role": "user", "content": "What's Deep Learning?"},
]

inputs = tokenizer.apply_chat_template(
  prompt,
  tokenize=True,
  add_generation_prompt=True,
  return_tensors="pt",
  return_dict=True,
).to("cuda")

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0])

Hi. I think you are right in the main. But…

  torch_dtype=torch.float16, # bfloat16 is better for this usage
  device_map=0, # CUDA:0 is specified...

.to("cuda") # Wouldn't everything go to VRAM...?

Still Out Of Memory :confused:

pip install accelerate

The other possibility is that the accelerate library is not installed, or the version is old?
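
(A quick way to check, if you want:)

import accelerate
print(accelerate.__version__)  # make sure accelerate is importable and reasonably recent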
Or it’s possible that the model isn’t really being quantized when loaded and is actually consuming VRAM at 16-bit precision. The code below follows the same general idea as yours, and in my Space it does load at 4 bits… an 8B model with BnB, though.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16)

text_model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=nf4_config, device_map="cuda", torch_dtype=torch.bfloat16).eval()
# text_model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=nf4_config, device_map="auto", torch_dtype=torch.bfloat16).eval()  # In your case, this is how it should be.

Maybe it’s not possible because AWQ can’t do on-the-fly quantization…?

I think the table says we can’t offload AWQ to CPU either. (Maybe I’m reading the table wrong.) GGUF seems to be able to offload to CPU, but GGUF support in transformers is still incomplete, so maybe that means we should use llama-cpp-python for that purpose.
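
Something along these lines with llama-cpp-python might be worth trying (an untested sketch; the GGUF repo and filename below are just examples to verify, Llama.from_pretrained needs huggingface_hub installed, and n_gpu_layers needs tuning for 32 GB):

from llama_cpp import Llama

# Download a GGUF quant from the Hub and offload part of the layers to the GPU;
# whatever does not fit in VRAM stays on the CPU.
llm = Llama.from_pretrained(
    repo_id="bartowski/Meta-Llama-3.1-70B-Instruct-GGUF",  # example repo, check it exists
    filename="*Q4_K_M.gguf",  # glob for a 4-bit quant file
    n_gpu_layers=60,          # tune: as many layers as fit in your VRAM
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What's Deep Learning?"}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])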