Run Llama 3.1 70B on 32 GB VRAM?

I’m using an Nvidia V100 GPU with 32 GB of VRAM. I would like to run a Llama 3.1 70B model, and I found this:

https://huggingface.co/hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4

But it is still too large and runs out of memory. Is there a way to minimize the memory footprint so it fits in the available VRAM? Some pipeline parameters or something like that? Thanks.

To do this, it seems to be sufficient to specify device_map="auto".

However, HF libraries in general are still not very good at dealing with quantised files, so it would not be surprising if there were errors.
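
For example, something like this (an untested sketch; it assumes the accelerate package is installed so that device_map="auto" can spread layers across devices):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # let accelerate decide which layers go to GPU, CPU or disk
    low_cpu_mem_usage=True,
)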

Hi John, I already tried device_map="auto" and got the following error.

ValueError: You are attempting to load an AWQ model with a device_map that contains a CPU or disk device. This is not supported. Please remove the CPU or disk device from the device_map.

This is my current code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

model_id = "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"

quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512, # Note: Update this as per your use-case
    do_fuse=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
  model_id,
  torch_dtype=torch.float16,
  low_cpu_mem_usage=True,
  offload_buffers=True,
  device_map=0,
  quantization_config=quantization_config
)

prompt = [
  {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
  {"role": "user", "content": "What's Deep Learning?"},
]

inputs = tokenizer.apply_chat_template(
  prompt,
  tokenize=True,
  add_generation_prompt=True,
  return_tensors="pt",
  return_dict=True,
).to("cuda")

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0])

Hi. I think you are right in the main. But…

  torch_dtype=torch.float16, # bfloat16 is better for this usage
  device_map=0, # CUDA:0 is specified...

.to("cuda") # Wouldn't everything go to VRAM...?

Still Out Of Memory :confused:

pip install accelerate

The other possibility is that the accelerate library is not installed, or the version is old?
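
(A quick way to check, if you want:)

import accelerate
print(accelerate.__version__)  # make sure accelerate is importable and reasonably recent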
Or it’s possible that the model isn’t really being quantized when loaded and is actually consuming VRAM at 16-bit precision. The code below follows the same general idea as yours, and in my Space it does load at 4 bits… an 8B model with BnB, though.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16)

text_model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=nf4_config, device_map="cuda", torch_dtype=torch.bfloat16).eval()
# text_model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=nf4_config, device_map="auto", torch_dtype=torch.bfloat16).eval()  # In your case, this is how it should be.

Maybe it’s not possible because AWQ can’t do on-the-fly quantization…?

I think the table says we can’t offload AWQ to CPU either. (Maybe I’m reading the table wrong.) GGUF seems to be able to offload to CPU, but GGUF support in transformers is still incomplete, so maybe that means we should use llama-cpp-python for that purpose.
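
Something along these lines with llama-cpp-python might be worth trying (an untested sketch; the GGUF repo and filename below are just examples to verify, Llama.from_pretrained needs huggingface_hub installed, and n_gpu_layers needs tuning for 32 GB):

from llama_cpp import Llama

# Download a GGUF quant from the Hub and offload part of the layers to the GPU;
# whatever does not fit in VRAM stays on the CPU.
llm = Llama.from_pretrained(
    repo_id="bartowski/Meta-Llama-3.1-70B-Instruct-GGUF",  # example repo, check it exists
    filename="*Q4_K_M.gguf",  # glob for a 4-bit quant file
    n_gpu_layers=60,          # tune: as many layers as fit in your VRAM
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What's Deep Learning?"}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])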