How to run large LLMs like Llama 3.1 70B or Mixtral 8x22B with limited GPU VRAM?

I am trying to run large LLMs like Llama 3.1 70B and Mixtral 8x22B locally on my system. I have three RTX 3090s and an RTX 4080 Super 16 GB.

I read on Reddit that people were able to run those models on a single 3090 with 4-bit quantization. I am trying to run the same models using 4-bit quantization but have been unsuccessful so far.
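As a rough sanity check on VRAM (a back-of-the-envelope sketch; the helper name below is made up, and it ignores the KV cache, activations, and quantization overhead), 4-bit weights take about half a byte per parameter:

def approx_4bit_weight_gb(num_params_billion: float) -> float:
    # 4-bit (NF4/INT4) weights ~ 0.5 bytes per parameter; ignores KV cache and runtime overhead
    return num_params_billion * 0.5

print(approx_4bit_weight_gb(70))    # Llama 3.1 70B -> roughly 35 GB of weights
print(approx_4bit_weight_gb(141))   # Mixtral 8x22B (~141B total params) -> roughly 70 GB

That suggests the 4-bit weights alone would not fit on a single 24 GB 3090 without offloading or a more aggressive quantization, but device_map="auto" should be able to shard them across my three 3090s plus the 4080 Super (about 88 GB of VRAM combined).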

I tried to run this model for Llama and this model for Mixtral. I also tried to run the base Llama model by passing 4-bit quantization as a parameter, but no dice.

I was able to run Llama 3.1 7B Instruct and Mixtral 7B Instruct locally (without 4-bit quantization).

Does anyone have an idea how I can run those models?


You mean you were loading a quantized model expanded back into a float dtype? In any case, those models are not gated, and even a single GPU should be enough. My GPU is much weaker than yours, even comparing one GPU to one, and the quantized version still worked for me.

# pip install bitsandbytes accelerate

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "..."  # repo id of the model you want to load

# 4-bit NF4 quantization: weights stored in 4 bits, compute in bfloat16
nf4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16)

model_nf4 = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=nf4_config,
                                                 device_map="auto", torch_dtype=torch.bfloat16).eval()

# unquantized bfloat16 load, shown for comparison (needs far more VRAM)
model_bf16 = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto",
                                                  torch_dtype=torch.bfloat16).eval()
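If the NF4 load works, a quick generation sanity check (just a sketch; the prompt and max_new_tokens are placeholders) would be something like:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Explain NF4 quantization in one sentence.", return_tensors="pt").to(model_nf4.device)

with torch.no_grad():
    output_ids = model_nf4.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# shows how the layers were sharded across your GPUs by device_map="auto"
print(model_nf4.hf_device_map)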

Ok, I will give it a try… thanks