Recommended hardware for running LLMs locally

Not sure if this question is bad form given HF sells compute, but here goes…

I tried running Mistral-7B-Instruct-v0.2 with this example code on my modest 16GB MacBook Air M2, although I replaced CUDA with MPS as my GPU device.
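
Roughly, it was the standard Transformers snippet with the device swapped to `mps`, something like this (a sketch, not my exact code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "mps"  # swapped in for "cuda"

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.bfloat16,  # full bf16 weights, roughly 14-15 GB
).to(device)

messages = [{"role": "user", "content": "What is your favourite condiment?"}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)

output_ids = model.generate(input_ids, max_new_tokens=100, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```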

Given the gushing praise for the model’s performance versus its small size, I thought this would work. However, I get out-of-memory errors both on the CPU and when using the MPS GPU.

Can someone tell me if I’m just doing it wrong™ or if I need better hardware? If it’s hardware, can someone recommend a good system configuration? I can afford an RTX 4090, but I’m looking for holistic advice.

I am not sure why it does not work here; I have an M1 Pro with 16GB, so a similar setup.

I’ve seen posts on r/LocalLLaMA where they run 7B models just fine.

But for some reason, with Hugging Face Transformers the models take forever. I’ve even downloaded Ollama (ollama.ai) and it runs 7B models very quickly.

My issue is not CPU timeouts, but the fact that the pipeline call just runs forever. Either way, it is very frustrating, but please let me know if you figure out what’s going on!

@robot1125 7B models in bfloat16 take approximately 14-15 GB of memory; you should check your memory usage after loading the model and during inference. Processing your prompt requires additional RAM on top of that: the larger the prompt, the more memory it takes. I think it’ll be okay if you only run small prompts; also consider clearing the cache after each generation, which helps avoid build-up. If you really want to run the model locally on that budget, try a quantized version of the model instead. A 4090 with 24GB of VRAM would be okay, but quite tight if you plan to try half-precision 13B models in the future.
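
For example, something along these lines lets you watch MPS memory and clear the cache between generations (a sketch assuming a recent PyTorch build with MPS support; older versions may not have these helpers):

```python
import torch

def report_mps_memory(tag: str) -> None:
    # Current tensor memory allocated on the MPS device, in GB.
    allocated_gb = torch.mps.current_allocated_memory() / 1024**3
    print(f"[{tag}] MPS memory allocated: {allocated_gb:.2f} GB")

report_mps_memory("after model load")

# ... run model.generate(...) on your prompt here ...

report_mps_memory("after generation")

# Clearing the MPS cache between generations helps avoid memory build-up.
torch.mps.empty_cache()
```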

It is stated in the post that @zzif shared that they are running a quantized version of the 7B model with llama.cpp, which requires less memory. Yes, Ollama will also help, as it runs int8-quantized models by default (if I remember correctly), which take roughly half the memory needed to run bfloat16.
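
If you want to go the quantized route on a 16GB Mac, a 4-bit GGUF of Mistral-7B via llama-cpp-python looks roughly like this (the file name is just an example; point it at whichever quantized GGUF you actually download):

```python
from llama_cpp import Llama

# Illustrative path -- use whatever quantized GGUF file you downloaded.
llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_ctx=4096,       # context window
    n_gpu_layers=-1,  # offload all layers to Metal on Apple Silicon
)

output = llm(
    "[INST] Recommend hardware for running 7B models locally. [/INST]",
    max_tokens=200,
)
print(output["choices"][0]["text"])
```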