Running an LLM with high output quality locally

Hi everyone,

I need to run an LLM locally due to confidentiality concerns. I was thinking of running DeepSeek V3, but I don’t have more than roughly $3.5K to spend on a GPU. Do you think there are options that would perform about as well as ChatGPT-4o (ish)?

Thanks a lot for your time!!


OK, I’ve been told that “performing as well as” is not specific enough :sneezing_face: I’d need roughly 7–10 tokens/s and high output quality.


Let me say this first: that’s reckless… :sob: The amount of money needed is off by an order of magnitude.

Possible compromises would be to use a weaker model with similar characteristics, or to accept hopelessly slow speeds and load the model into RAM or onto an SSD instead of VRAM and run inference from there. RAM is cheap now, although consumer Windows (Home) caps you at 128 GB…
Hugging Face’s accelerate library can also run models offloaded to SSD; a rough sketch is below.
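
To make that concrete, here is a minimal sketch of CPU/SSD offload with transformers + accelerate. The model ID is just an example (any causal LM repo you have access to works), and the offload folder is simply a directory on the SSD:

```python
# Minimal sketch of GPU -> CPU RAM -> SSD offload with transformers + accelerate.
# The model ID is only an example; swap in whatever you actually want to run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-72B-Instruct"  # example model, not a specific recommendation

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",          # accelerate spreads layers across GPU, then CPU RAM, then disk
    offload_folder="offload",   # layers that fit nowhere else are paged to this SSD folder
)

inputs = tokenizer("Summarize the following clause:", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Expect single-digit (or worse) tokens per second once layers start living on CPU RAM or SSD; that’s the compromise.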

For example, models with characteristics similar to o1 (reasoning-focused models) are as follows.


https://unsloth.ai/blog/deepseekr1-dynamic
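
For reference, a rough sketch of grabbing one of those dynamic quants and running it with llama.cpp. The repo ID, file pattern, and CLI flags below are assumptions based on that post, so check the blog for the exact names:

```python
# Sketch: download one of Unsloth's dynamic DeepSeek-R1 GGUF quants with huggingface_hub.
# Repo ID and file pattern are assumptions taken from the linked blog post; verify them there.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",   # assumed repo name from the blog
    allow_patterns=["*UD-IQ1_S*"],        # the ~1.58-bit dynamic quant shards
    local_dir="DeepSeek-R1-GGUF",
)

# Then run it with llama.cpp (built separately), pointing at the first GGUF shard:
#   ./llama-cli -m DeepSeek-R1-GGUF/<first-shard>.gguf --prompt "Hello" -n 128 --threads 16
```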


Thanks a lot for your reply!!

Two questions: 1) Is there a way to test the output quality of deepseekr1-dynamic, or is the only way to do this to deploy the model on a Google Cloud VM, for instance? 2) If I raise my budget to $6K, would I be able to run this model comfortably?

Also, I found a thread on Twitter explaining how someone managed to run R1 on a $6K machine using only RAM and two CPUs. Do you know whether I could apply the same logic to a config that would run DeepSeek V3? :thinking:

Thanks again for your time, and sorry for my very ignorant questions :')


Evaluating output quality yourself isn’t easy. Someone has most likely already checked whether a given quant is usable or not; you should rely on those reports.

Anyway, the lower the budget, the more knowledge and experience you need…
It’s a juggling act.
$6K is a huge amount, but even $60K wouldn’t be enough… :sob:

That said, the expensive part is the GPU, so if you’re assuming CPU and RAM, $3K should be enough. The problem is purely accuracy and speed: it’s a question of how much you’re willing to compromise.

The full-size V3 and R1 models are originally distributed in 8-bit precision, but people running them on CPU and RAM are forced to quantize them down to 4-bit, 2-bit, or even ~1.5-bit, so of course output quality is sacrificed. Without reducing the precision, though, they simply won’t run on a home PC; that’s how big these models are. Some back-of-the-envelope numbers are below.
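
To give a feel for the sizes, here is a back-of-the-envelope calculation assuming roughly 671B parameters for V3/R1 (weights only, ignoring KV cache, activations, and quantization metadata):

```python
# Rough weight-memory footprint for a ~671B-parameter model (DeepSeek V3/R1 scale).
params = 671e9

for bits in (8, 4, 2, 1.58):
    gib = params * bits / 8 / 2**30
    print(f"{bits:>5}-bit weights: ~{gib:,.0f} GiB")

# Roughly: 8-bit ~625 GiB, 4-bit ~312 GiB, 2-bit ~156 GiB, 1.58-bit ~123 GiB.
# Even the most aggressive quant is far beyond any consumer GPU's VRAM,
# which is why people spill the weights into system RAM or onto SSD.
```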

For practical use, as opposed to research or demos, you can often get better results by starting with a smaller model at 4-bit or 8-bit precision instead; see the sketch below.
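
As a sketch of that last option, this loads a mid-size instruct model in 4-bit with bitsandbytes. The model ID is just an example of a “smaller” model, not a specific recommendation:

```python
# Sketch: run a smaller model in 4-bit instead of a heavily squeezed giant one.
# Example model only; a 32B model in 4-bit needs roughly 20+ GB of VRAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-32B-Instruct"  # example of a capable mid-size model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Draft a confidentiality clause for ...", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```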