Best LLMs that can run on 4GB VRAM

What large language model should I choose to run locally on my PC?

After going through many resources, I noticed that Mistral 7B was the most recommended, as it can be run on small GPUs.

My goal is to fine-tune the model on alerts/reports related to cybersecurity incidents, and I expect the model to generate a report. Any advice? :slight_smile:


First of all, let's assume the model is 4-bit quantized, since 4GB of VRAM isn't enough for anything less compressed. Down to 4-bit quantization, as long as the quantization algorithm is decent, the drop in accuracy during inference is hard to notice. At 3-bit quantization it suddenly becomes a gamble.

In addition to the weights themselves, inference needs a little extra VRAM for the KV cache and activations.
If you want to fine-tune, plan on several times the VRAM you need for inference…
Whatever doesn't fit can be offloaded to system RAM, but the offloaded part runs orders of magnitude slower.
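
As a rough back-of-the-envelope check (just a sketch; actual usage also depends on context length, KV cache size, and framework overhead), you can estimate the weight footprint from the parameter count and the bit width:

```python
# Rough rule of thumb: the weights alone take about (parameters * bits / 8) bytes.
# This ignores KV cache, activations, and framework overhead, which all add more.
def weights_gb(params_billions: float, bits: int) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billions * bits / 8

for name, params in [("Llama 3.2 3B", 3.2), ("Mistral 7B", 7.2)]:
    print(f"{name}: ~{weights_gb(params, 4):.1f} GB at 4-bit, "
          f"~{weights_gb(params, 3):.1f} GB at 3-bit (weights only)")
```

Running this gives roughly 1.6 GB for a 3B model and 3.6 GB for a 7B model at 4-bit, weights only.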

Regarding models, Mistral 7B and other 7B-9B class models are certainly the standard recommendation, but 4GB of VRAM is slightly insufficient even just to load a 4-bit 7B model. Since the shortfall is small, though, it will still run, albeit slowly.
Below 7B, models with good performance are limited, but relatively new-generation models such as Qwen 2.5 3B and Llama 3.2 3B are excellent.
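
For reference, here is a minimal sketch of loading one of those small models in 4-bit with Transformers + bitsandbytes. The model ID and prompt are just examples, and `device_map="auto"` lets Accelerate spill whatever doesn't fit in VRAM into system RAM, at a large speed cost:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-3B-Instruct"  # example checkpoint; any causal LM works here

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb,
    device_map="auto",  # offloads layers to CPU RAM if VRAM runs out
)

prompt = "Summarize this alert: suspicious PowerShell execution on host WS-12."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(tok.decode(out[0], skip_special_tokens=True))
```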

Transformers is useful for fine-tuning, but the quantization setup is fiddly and it is easy to fail on VRAM-related issues, so I recommend trying Ollama first to pick a base model.
Once you have chosen a GGUF-format model that works in Ollama, you can usually find the same model in Transformers format by searching the Hugging Face Hub.
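
Since your end goal is generating reports from alerts, here is a rough QLoRA-style sketch with Transformers + PEFT once you have picked the base model. The dataset file `incident_reports.jsonl` and its `"text"` field are hypothetical placeholders; on 4GB of VRAM you would realistically want a 1B-3B base, a short max length, and gradient accumulation as below, and you should still expect part of the model to offload to RAM and train slowly:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "Qwen/Qwen2.5-3B-Instruct"  # example; use the base you validated in Ollama

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.float16)

tok = AutoTokenizer.from_pretrained(model_id)
tok.pad_token = tok.pad_token or tok.eos_token

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb,
                                             device_map="auto")
model = prepare_model_for_kbit_training(model)

# Train small low-rank adapters (LoRA) instead of the full 4-bit weights.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Hypothetical dataset: one JSON object per line with a "text" field that already
# contains the alert followed by the report you want the model to produce.
ds = load_dataset("json", data_files="incident_reports.jsonl", split="train")
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
            remove_columns=ds.column_names)

args = TrainingArguments(output_dir="qlora-incident-reports",
                         per_device_train_batch_size=1,
                         gradient_accumulation_steps=8,
                         num_train_epochs=1, learning_rate=2e-4,
                         fp16=True, logging_steps=10)

Trainer(model=model, args=args, train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False)).train()
```

After training, `model.save_pretrained(...)` stores only the small LoRA adapter, which you can later load on top of (or merge into) the base model for generating reports.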

