Best model for Local LLM for Hard Math/Reasoning Questions - Less than 80B parameters

I am a physician trying to learn LLM technology to solve some problems in oncology. I have a remote background in Computer and Electrical Engineering (25+ years ago). I am trying to find a local LLM that can solve some hard problems, as a way to select a very strong math-reasoning model for my research. Here are a few problems (from the Putnam):

(1) For each positive integer k, let A(k) be the number of odd divisors of k in the interval [1, √(2k) ). Evaluate

∑ from k=1 to ∞ of (−1)^(k−1) * A(k)/k

ANSWER: (Pi^2)/16
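
(To grade a model's final answer on (1) without trusting my own arithmetic, a rough numeric sanity check helps; a minimal Python sketch, not a proof - the partial sums converge slowly, so expect agreement only to about three decimals:)

```python
import math

# Numeric sanity check for problem (1): partial sums of
#   sum_{k>=1} (-1)^(k-1) * A(k)/k,
# where A(k) counts odd divisors d of k with d < sqrt(2k).
# Takes a few seconds; agreement with pi^2/16 is only approximate.
N = 2_000_000
A = [0] * (N + 1)
for d in range(1, math.isqrt(2 * N) + 1, 2):   # odd candidate divisors
    lo = d * d // 2 + 1                        # need k > d^2/2, i.e. d < sqrt(2k)
    start = ((lo + d - 1) // d) * d            # first qualifying multiple of d
    for k in range(start, N + 1, d):
        A[k] += 1

partial = sum((-1) ** (k - 1) * A[k] / k for k in range(1, N + 1))
print(partial, "vs", math.pi ** 2 / 16)
```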

(2) Let n be a positive integer. For i and j in {1, 2, . . . , n}, let s(i, j) be the number of pairs (a, b) of nonnegative integers satisfying ai + bj = n. Let S be the n-by-n matrix whose (i, j) entry is s(i, j). For example, when n = 5, we have

S =
[ 6 3 2 2 2 ]
[ 3 0 1 0 1 ]
[ 2 1 0 0 1 ]
[ 2 0 0 0 1 ]
[ 2 1 1 1 2 ]

Compute the determinant of S.
ANSWER: (−1)^(⌈n/2⌉−1) * 2 * ⌈n/2⌉, where ⌈ ⌉ is the ceiling function
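
(Same idea for (2): a small brute-force check of the determinant formula for small n; a minimal sketch using sympy:)

```python
import sympy as sp

# Problem (2): s(i, j) counts nonnegative pairs (a, b) with a*i + b*j = n.
def det_S(n):
    def s(i, j):
        return sum(1 for a in range(n // i + 1) if (n - a * i) % j == 0)
    S = sp.Matrix(n, n, lambda i, j: s(i + 1, j + 1))  # sympy's lambda is 0-indexed
    return S.det()

for n in range(1, 9):
    half = sp.ceiling(sp.Rational(n, 2))
    formula = (-1) ** (half - 1) * 2 * half
    print(n, det_S(n), formula)
```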

(3) Let c_0, c_1, c_2, . . . be the sequence defined so that

(1 − 3x − √(1 − 14x + 9x^2)) / 4 = ∑ from k=0 to ∞ of c_k * x^k

for sufficiently small x. For a positive integer n, let A be the n-by-n matrix with (i, j) entry c_(i+j−1) for i and j in {1, . . . , n}. Find the determinant of A.
ANSWER: 10^(n(n−1)/2)
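
(And for (3), a sympy sketch that expands the series and checks the determinant for small n - again just for answer-checking:)

```python
import sympy as sp

# Problem (3): c_k are the Taylor coefficients of f, and A has (i, j) entry c_(i+j-1).
x = sp.symbols('x')
f = (1 - 3 * x - sp.sqrt(1 - 14 * x + 9 * x ** 2)) / 4

def det_A(n):
    ser = sp.expand(sp.series(f, x, 0, 2 * n).removeO())
    c = [ser.coeff(x, k) for k in range(2 * n)]
    A = sp.Matrix(n, n, lambda i, j: c[i + j + 1])  # 0-indexed, so this is c_(i+j-1)
    return A.det()

for n in range(1, 6):
    print(n, det_A(n), 10 ** (n * (n - 1) // 2))
```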

(4) For a real number a, let F_a(x) = ∑ over n≥1 of n^a * e^(2n) * x^(n^2) for 0 ≤ x < 1. Find a real number c such that

lim as x→1− of F_a(x) * e^(−1/(1−x)) = 0 for all a < c, and
lim as x→1− of F_a(x) * e^(−1/(1−x)) = ∞ for all a > c.

ANSWER: c = −1/2

I have had some success with Qwen3 32B every few runs on some of these, but it gets them wrong most of the time. The first one (1) seems to be the hardest, and none of the models I tried solve it. I have 64GB RAM and a laptop 4090 with 16GB VRAM, but I am thinking of upgrading to a bigger computer. Please let me know the strongest reasoning models you have worked with - those will probably have the highest chance of solving these. Each run takes 20 minutes to 3 hours per problem, so after two weeks of evenings I have not been able to try many models. Qwen3 seems to be the strongest so far, but there are also Qwen3 44B and 48B variants… Not sure if these are stronger. Too many to try. If there are other models, please let me know. Note that most full models (DeepSeek, ChatGPT, etc.) do better, but they still often get these wrong, especially problem (1).

1 Like

When there are too many candidates, the quickest filter is a benchmark ranking such as a leaderboard. However, benchmarks are just benchmarks, and they do not capture the individual characteristics of each model. That said, experience shows that models from the same family tend to share strengths and weaknesses, so it may be a good idea to try one model from each family first and then investigate the promising families in depth.

Appreciate your help. I have tried some of the “Math” tuned LLMs and they miserably fail on these questions. I will look at the leaderboard. Thank you for your time again! Parvez

1 Like

There may be people knowledgeable about mathematical models in ML-related Discord communities such as Hugging Face Discord. You could try asking them for help. Good luck!:laughing:

Hey there :waving_hand:,

TL;DR

  • If you just want to solve the math problem and move on, hit our fully-free test endpoint at https://libraxis.cloud/api/v1/chat/completions and ask qwen3-14b-Q5-MLX or llama-3.3-nemotron-49b-super-Q5-MLX to do the heavy lifting.
  • If you’d rather run something locally on that RTX 4090-laptop (16 GB VRAM), grab a GPTQ / GGUF build of a math-tuned 7-14 B model (see options below).

1 Why bother with our API?

  • Zero-cost smoke tests – we literally want you to try to break the cluster.
  • Apple Silicon M3 Ultra swarm – silly fast, ~45 tok/s on the 14 B model.
  • One-step auth – any string that starts with sk- works as an API key while the promo is live. :shushing_face:
  • Drop-in OpenAI compatibility – same /chat/completions JSON. No client-side code changes.
curl https://libraxis.cloud/api/v1/chat/completions \
  -H "Authorization: Bearer sk-test-anything" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-14b-q5-mlx",
        "messages": [
          {"role":"system","content":"You are a brilliant mathematician."},
          {"role":"user","content":"Prove that the sum of the reciprocals of the primes diverges."}
        ],
        "temperature": 0.2,
        "stream": false
      }'
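
If you prefer Python over curl, the same call should go through the official openai client unchanged (a minimal sketch, assuming the drop-in compatibility above holds):

```python
from openai import OpenAI

# Same request as the curl example, via the OpenAI Python SDK pointed at our base URL.
# Any string starting with "sk-" works as the key while the promo is live.
client = OpenAI(base_url="https://libraxis.cloud/api/v1", api_key="sk-test-anything")

resp = client.chat.completions.create(
    model="qwen3-14b-q5-mlx",
    messages=[
        {"role": "system", "content": "You are a brilliant mathematician."},
        {"role": "user", "content": "Prove that the sum of the reciprocals of the primes diverges."},
    ],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```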


⸻

2 Which model for serious maths?

| Scenario | Recommended model | Why |
| --- | --- | --- |
| Remote / quickest result | qwen3-14b-Q5-MLX | Strong chain-of-thought, handles Olympiad-level proofs without hallucinating. |
| Remote / largest context | llama-3.3-nemotron-49b-super-Q5-MLX | Up to 128k tokens, shines at multi-step derivations and lengthy LaTeX. |
| Local / 16 GB VRAM | qwen2-7b-instruct-GPTQ-4bit or DeepSeekMath-7b-GGUF-Q4_K_M | Fits comfortably, still beats GPT-3.5 on GSM8K & MATH. |


⸻

3 Running local models on a 4090 laptop (16 GB VRAM)
1. Pick a 4-bit quantised checkpoint
   • qwen2-7b-chat-GPTQ-4bit-128g.safetensors
   • deepseek-math-7b-gguf-q4_k_m.bin
2. Use ExLlamaV2 (for GPTQ/EXL2 weights) or llama.cpp (for GGUF weights)

# llama.cpp example for the GGUF checkpoint above
# (note: ExLlamaV2 loads GPTQ/EXL2 weights, not GGUF files)
pip install "llama-cpp-python[server]"
python -m llama_cpp.server --model deepseek-math-7b-gguf-q4_k_m.bin --n_gpu_layers -1

3. Mind the context – stay below ~8k tokens to keep memory happy (see the Python sketch just below).
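
If you would rather drive the GGUF model from a script than from the server, here is a minimal llama-cpp-python sketch (paths and the prompt are placeholders):

```python
from llama_cpp import Llama

# Load the 4-bit GGUF fully onto the GPU and cap the context at ~8k tokens (point 3 above).
llm = Llama(
    model_path="deepseek-math-7b-gguf-q4_k_m.bin",
    n_gpu_layers=-1,   # offload every layer to the GPU
    n_ctx=8192,        # modest context to stay within 16 GB VRAM
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Evaluate the sum of 1/n^2 over n >= 1."}],
    temperature=0.2,
    max_tokens=2048,
)
print(out["choices"][0]["message"]["content"])
```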

⸻

4 Roadmap & fine print
• Throughput throttling kicks in above 30 req/min per key – fair-use, not rate-limiting for fun.
• Open weights for Qwen3-14B-MLX will drop as soon as the upstream license is finalised.
• Model requests welcome – if it’s on HF, we can probably quantise & host it.

⸻

Have fun frying our servers!
— Maciej @ Libraxis AI (a.k.a. the guy who already blocked the 🇷🇺 spam-bots) 🚀
1 Like

Thank you! I will try it. I do prefer the local option, as ultimately I will be using the chosen model on patient data, which I cannot send to a cloud. It has to be local. I am getting a workstation with 512GB RAM and 2x RTX A6000 with 48GB each (got work to agree to pay for it for research). What models can I run on this workstation for problems like these?

1 Like

What models can I run on this workstation for problems like these?

Whether or not you quantize the model for inference makes a big difference. Quantization is a form of irreversible compression that slightly reduces output accuracy but significantly reduces VRAM usage - to a half or even a quarter. Even at a quarter (4-bit), the degradation is often not noticeable for short outputs.

Assuming 1/4 (4-bit) quantization, roughly 96GB of VRAM should be enough for comfortable inference with models of about 96B parameters or smaller - in other words, about 1GB of VRAM per billion parameters. This is not very precise, but it is a practical, easy-to-remember metric, so it is commonly used. With 1/2 (8-bit) quantization, twice the VRAM is required, and without quantization (16-bit), four times.
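
(As a rough calculator for that rule of thumb - my own sketch, not an exact formula, since KV cache and context length also matter:)

```python
# Rule-of-thumb VRAM estimate from above:
# ~1 GB per billion parameters at 4-bit, x2 at 8-bit, x4 at 16-bit.
def vram_gb(params_billion: float, bits: int = 4) -> float:
    return params_billion * bits / 4

for name, size in [("Qwen3 32B", 32), ("Nemotron 49B", 49),
                   ("Llama 3.3 70B", 70), ("Qwen3 235B-A22B", 235)]:
    print(f"{name}: ~{vram_gb(size):.0f} GB at 4-bit, ~{vram_gb(size, 16):.0f} GB at 16-bit")
```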

Additionally, while it is possible to offload the insufficient VRAM to RAM or SSD, this incurs significant performance overhead. It should be considered more of a contingency measure.

By the way, there seem to be new architectures emerging that run LLMs efficiently by making better use of system RAM. I’m not very familiar with that area, though…

2 Likes

You mention new architectures - should I wait before buying the workstation with dual RTX A6000s? Or do you have another suggestion for about the same cost as the dual A6000s? (I know they are not that new, unlike the RTX 5090, which has less VRAM but is faster.) Also, the links you posted are helpful in understanding some of this stuff. Thx!

1 Like

should I wait before buying the workstation with dual RTX A6000s

No, I think it’s better to start early if you can.:grinning_face:

That doesn’t mean RAM has become faster than GPU VRAM, though.
CUDA GPUs are still the most user-friendly option for AI frameworks. If competitors step up their game, prices will come down…
When it comes to AI hardware, VRAM size is the most important factor, followed by GPU die generation and then processing speed. For GeForce, I recommend the 30x0 generation or later. Older generations may not support certain operations, which can make them several times slower. While the 50x0 series is certainly good, some libraries may not yet support the newest models, so be cautious about that.

For specific purchase consultations, I recommend the general channel on the Hugging Face Discord server. There are many knowledgeable people there who can provide detailed information, including pricing.

1 Like

For my work, speed is not a huge problem; accuracy is more important. 4-5 tokens/s is acceptable if the quality is high, so offloading is acceptable. I have run larger models like 72B slowly, at 3-4 tokens/s with offloading, on my laptop. I have even run Llama 4 Scout, which is a 109B MoE, at about 3-5 tokens/s. But I am not sure what I am doing wrong: it does not solve these problems and gives up very quickly. I am not sure if I need to give specific keywords in the prompt or set parameters to encourage more of a reasoning mode.

I also tried Nemotron 49B, Llama 3.3 70B, and Mixtral 8x7B. None of them seem to make a serious attempt at following the “thinking” and just stop/quit after 500 to 3000 tokens. Qwen3, Phi 4, and Gemma truly mount an effort, and I can see good reasoning/ideas towards solving the problems even when they fail. Qwen3 32B has been the strongest. The bigger models ranked high on math but don’t seem to live up to their ranking. Maybe they are good at high-school math, which is probably what they were tested on, and not at really serious, challenging problems that stretch their reasoning.

It would be nice to try Qwen3 235B-A22B once I get my workstation. I did try to set up rStar-Math but ran into difficulties; I must say I did not spend too much time on it. If anyone knows a script or an easy way to set it up, it would be nice to try that. Appreciate everyone’s input.

1 Like


just stop/quit after 500 to 3000 tokens

Perhaps max_new_tokens was left at its default?

Edit:
Backends other than Transformers each have their own characteristics, even without quantization, so it’s worth trying a few. In any case, you can do most things with a combination of Transformers, Accelerate, and BitsAndBytes. Even within Transformers, appropriate quantization plus a larger-parameter model can be advantageous in terms of accuracy. One thing to note: llama.cpp sometimes behaves strangely when the context is very long.
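
A minimal sketch of that combination - 4-bit BitsAndBytes loading plus an explicit max_new_tokens so a reasoning model is not cut off mid-thought (the Qwen3-32B checkpoint and the sampling settings are just examples):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load a large model in 4-bit and give it a generous generation budget.
model_id = "Qwen/Qwen3-32B"   # example checkpoint; swap in whatever you are testing
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

messages = [{"role": "user", "content": "State and prove the AM-GM inequality for two variables."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The default max_new_tokens is tiny; long chains of reasoning need a much larger budget.
output = model.generate(inputs, max_new_tokens=32768, temperature=0.6, do_sample=True)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```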

1 Like