Recommended hardware for running LLMs locally

Not sure if this question is bad form given HF sells compute, but here goes…

I tried running Mistral-7B-Instruct-v0.2 with this example code on my modest 16GB MacBook Air M2, although I replaced CUDA with MPS as my GPU device.

Given the gushing praise for the model’s performance vs its small size, I thought this would work. However, I get out-of-memory errors both with just the CPU and with the MPS GPU.

Can someone tell me if I’m just doing it wrong ™ or if I need better hardware? If it’s the hardware, can someone recommend a good system configuration? I can afford an RTX 4090, but I’m looking for holistic advice.

I am not sure why it does not work here; I have an M1 Pro with 16GB, so a similar setup.

I’ve seen posts on r/LocalLLaMA where they run 7B models just fine: Reddit - Dive into anything

But for some reason, with Hugging Face transformers, the models take forever. I’ve even downloaded and it works very quickly for 7B models.

My issue is not CPU timeouts, but the fact that the pipeline command just runs forever. Either way, it is very frustrating, but please let me know if you figure out what’s going on!

@robot1125 A 7B model in bfloat16 takes approximately 14-15GB of memory for the weights alone, so you should check your memory usage after loading the model and during inference. Processing your prompt then requires additional RAM, and the longer the prompt, the more memory it takes. I think it’ll be okay if you only run small prompts. Also consider clearing the cache after each generation; it helps avoid buildup. If you really want to run the model locally within that memory budget, try a quantized version of the model instead. A 4090 with 24GB of VRAM would be OK for 7B, but half-precision 13B models need roughly 26GB for the weights alone, so they would not fit without quantization or offloading if you plan to try those in the future.
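For the cache-clearing suggestion, here is a minimal sketch, assuming PyTorch 2.0 or later (where `torch.mps.empty_cache()` is available). The helper name `free_accelerator_cache` is made up for illustration. Note that it only releases cached allocator blocks that are no longer in use; it does not shrink the memory held by the model weights themselves.

```python
import gc


def free_accelerator_cache() -> str:
    """Release cached, unused accelerator memory and report which backend was handled.

    Hypothetical helper for illustration; call it after each generation,
    once references to the previous outputs have been dropped.
    """
    # Collect unreachable Python objects first so their tensors can be freed.
    gc.collect()
    try:
        import torch  # imported lazily so the helper is harmless without PyTorch
    except ImportError:
        return "no-torch"

    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return "cuda"

    # Assumption: PyTorch >= 2.0, which provides the torch.mps module.
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available() and hasattr(torch, "mps"):
        torch.mps.empty_cache()
        return "mps"

    return "cpu"
```

In a generation loop you would typically `del` the output tensors and then call `free_accelerator_cache()` before the next prompt.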

The post that @zzif shared states that they are running a quantized version of the 7B model with llama.cpp, which requires less memory. Yes, ollama will also help, as it runs int8 (if I remember correctly) quantized models by default, which take approximately half the memory of bfloat16.
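The numbers quoted in this thread can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below is an approximation, not a measurement: it assumes roughly 7.24 billion parameters for Mistral-7B and the published Mistral-7B attention shape (32 layers, 8 key/value heads, head dimension 128), and it ignores activations, framework overhead, and the small per-block overhead that real quantized formats add. The function names are made up for illustration.

```python
GB = 1e9  # decimal gigabytes

# Storage per parameter for a few common precisions (idealized, no format overhead).
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "float16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Approximate memory needed just to hold the weights."""
    return n_params * BYTES_PER_PARAM[dtype] / GB


def kv_cache_gb(
    n_tokens: int,
    n_layers: int = 32,      # assumed Mistral-7B value
    n_kv_heads: int = 8,     # assumed Mistral-7B value (grouped-query attention)
    head_dim: int = 128,     # assumed Mistral-7B value
    bytes_per_value: float = 2.0,  # half-precision cache
) -> float:
    """Approximate key/value cache size, which grows linearly with prompt length."""
    # The factor 2 accounts for one key and one value tensor per layer.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * n_tokens / GB


MISTRAL_7B_PARAMS = 7.24e9  # approximate parameter count (assumption)

for dtype in ("bfloat16", "int8", "int4"):
    print(f"7B weights in {dtype}: {weight_memory_gb(MISTRAL_7B_PARAMS, dtype):.2f} GB")
print(f"KV cache for a 4096-token prompt: {kv_cache_gb(4096):.2f} GB")
print(f"13B weights in float16: {weight_memory_gb(13e9, 'float16'):.2f} GB")
```

Under these assumptions the bfloat16 weights alone nearly fill a 16GB machine that also has to hold the OS and the KV cache, which is consistent with the out-of-memory errors, while an int8 copy needs about half as much.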