Best LLMs that can run on 4GB VRAM

What large language model should I choose to run locally on my PC?

After going through many resources, I noticed that Mistral 7B was the most recommended, as it can be run on small GPUs.

My goal is to fine-tune the model on alerts/reports related to cybersecurity incidents, and I expect the model to generate a report. Any advice? :slight_smile:


First of all, let's assume the model is 4-bit quantized, since 4GB of VRAM isn't enough for anything less compressed. Down to 4-bit quantization, as long as the quantization algorithm is decent, the drop in accuracy during inference is hard to notice. At 3-bit quantization it suddenly becomes a gamble.

In addition to the weights themselves, inference needs a little extra VRAM for the KV cache and activations.
If you want to fine-tune, plan on several times the VRAM you need for inference…
Whatever doesn't fit can be offloaded to system RAM, but the offloaded part runs orders of magnitude slower.
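
As a rough back-of-the-envelope check (just a sketch; actual usage also depends on context length, KV cache size, and framework overhead), you can estimate the weight footprint from the parameter count and the bit width:

```python
# Rough rule of thumb: the weights alone take about (parameters * bits / 8) bytes.
# This ignores KV cache, activations, and framework overhead, which all add more.
def weights_gb(params_billions: float, bits: int) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billions * bits / 8

for name, params in [("Llama 3.2 3B", 3.2), ("Mistral 7B", 7.2)]:
    print(f"{name}: ~{weights_gb(params, 4):.1f} GB at 4-bit, "
          f"~{weights_gb(params, 3):.1f} GB at 3-bit (weights only)")
```

Running this gives roughly 1.6 GB for a 3B model and 3.6 GB for a 7B model at 4-bit, weights only.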

Regarding models, Mistral 7B and other 7B-9B class models are certainly the standard recommendation, but 4GB of VRAM is slightly insufficient even just to load a 4-bit 7B model. Since the shortfall is small, though, it will still run, albeit slowly.
Below 7B, models with good performance are limited, but relatively new-generation models such as Qwen 2.5 3B and Llama 3.2 3B are excellent.
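
For reference, here is a minimal sketch of loading one of those small models in 4-bit with Transformers + bitsandbytes. The model ID and prompt are just examples, and `device_map="auto"` lets Accelerate spill whatever doesn't fit in VRAM into system RAM, at a large speed cost:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-3B-Instruct"  # example checkpoint; any causal LM works here

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb,
    device_map="auto",  # offloads layers to CPU RAM if VRAM runs out
)

prompt = "Summarize this alert: suspicious PowerShell execution on host WS-12."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(tok.decode(out[0], skip_special_tokens=True))
```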

Transformers is useful for fine-tuning, but the quantization setup is fiddly and it is easy to fail on VRAM-related issues, so I recommend trying Ollama first to pick a base model.
Once you have chosen a GGUF-format model that works in Ollama, you can usually find the same model in Transformers format by searching the Hugging Face Hub.
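
Since your end goal is generating reports from alerts, here is a rough QLoRA-style sketch with Transformers + PEFT once you have picked the base model. The dataset file `incident_reports.jsonl` and its `"text"` field are hypothetical placeholders; on 4GB of VRAM you would realistically want a 1B-3B base, a short max length, and gradient accumulation as below, and you should still expect part of the model to offload to RAM and train slowly:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "Qwen/Qwen2.5-3B-Instruct"  # example; use the base you validated in Ollama

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.float16)

tok = AutoTokenizer.from_pretrained(model_id)
tok.pad_token = tok.pad_token or tok.eos_token

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb,
                                             device_map="auto")
model = prepare_model_for_kbit_training(model)

# Train small low-rank adapters (LoRA) instead of the full 4-bit weights.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Hypothetical dataset: one JSON object per line with a "text" field that already
# contains the alert followed by the report you want the model to produce.
ds = load_dataset("json", data_files="incident_reports.jsonl", split="train")
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
            remove_columns=ds.column_names)

args = TrainingArguments(output_dir="qlora-incident-reports",
                         per_device_train_batch_size=1,
                         gradient_accumulation_steps=8,
                         num_train_epochs=1, learning_rate=2e-4,
                         fp16=True, logging_steps=10)

Trainer(model=model, args=args, train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False)).train()
```

After training, `model.save_pretrained(...)` stores only the small LoRA adapter, which you can later load on top of (or merge into) the base model for generating reports.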

