How to run large LLMs like Llama 3.1 70B or Mixtral 8x22B with limited GPU VRAM?

I am trying to run large LLMs like Llama 3.1 70B and Mixtral 8x22B locally on my system. I have three RTX 3090s and an RTX 4080 Super 16 GB.

I read on Reddit that people were able to run those models on a single 3090 with 4-bit quantization. I am trying to run the same models using 4-bit quantization but have been unsuccessful so far.
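As a rough sanity check on VRAM (a back-of-the-envelope sketch; the helper name below is made up, and it ignores the KV cache, activations, and quantization overhead), 4-bit weights take about half a byte per parameter:

def approx_4bit_weight_gb(num_params_billion: float) -> float:
    # 4-bit (NF4/INT4) weights ~ 0.5 bytes per parameter; ignores KV cache and runtime overhead
    return num_params_billion * 0.5

print(approx_4bit_weight_gb(70))    # Llama 3.1 70B -> roughly 35 GB of weights
print(approx_4bit_weight_gb(141))   # Mixtral 8x22B (~141B total params) -> roughly 70 GB

That suggests the 4-bit weights alone would not fit on a single 24 GB 3090 without offloading or a more aggressive quantization, but device_map="auto" should be able to shard them across my three 3090s plus the 4080 Super (about 88 GB of VRAM combined).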

I tried to run this model for Llama and this model for Mixtral. I also tried to run the base Llama model by passing 4-bit quantization as a parameter, but no dice.

I was able to run Llama 3.1 7B Instruct and Mixtral 7B Instruct locally (without 4-bit quantization).

Does anyone have an idea how I can run those models?


You mean you were loading a quantized model expanded back into a float dtype? In any case, those models are not gated, and even a single GPU should be enough. My GPU is much weaker than yours, even comparing one GPU to one, and the quantized version still worked for me.

# pip install bitsandbytes accelerate

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "..."  # repo id of the model you want to load

# 4-bit NF4 quantization: weights stored in 4 bits, compute in bfloat16
nf4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16)

model_nf4 = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=nf4_config,
                                                 device_map="auto", torch_dtype=torch.bfloat16).eval()

# unquantized bfloat16 load, shown for comparison (needs far more VRAM)
model_bf16 = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto",
                                                  torch_dtype=torch.bfloat16).eval()
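If the NF4 load works, a quick generation sanity check (just a sketch; the prompt and max_new_tokens are placeholders) would be something like:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Explain NF4 quantization in one sentence.", return_tensors="pt").to(model_nf4.device)

with torch.no_grad():
    output_ids = model_nf4.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# shows how the layers were sharded across your GPUs by device_map="auto"
print(model_nf4.hf_device_map)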

Ok, I will give it a try… thanks