It is not about Python. It is about an inference-only stack that is laser-focused on CPU and cache behavior.
What llama.cpp does that PyTorch usually does not on CPU
Uses aggressive quantization, such as 4-bit and 5-bit GGUF formats with per-block scales, in a layout that matches the matmul kernels. Moving fewer bytes is the main win on CPU.
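To make "per-block scales" concrete, here is a minimal C++ sketch of 4-bit block quantization: one scale per 32 weights, two 4-bit values packed per byte. It only illustrates the idea and is not the exact GGUF block format.

```cpp
// Minimal sketch of per-block 4-bit quantization (illustrative, not the exact
// GGUF/llama.cpp block layout). Each block stores one scale plus 32 packed
// 4-bit values, so 32 weights cost 16 bytes of quants + 4 bytes of scale
// instead of 128 bytes in fp32.
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int kBlockSize = 32;

struct BlockQ4 {
    float scale;                      // per-block scale
    uint8_t quants[kBlockSize / 2];   // two 4-bit values per byte
};

// Quantize one block of 32 floats to 4-bit values in [-8, 7].
BlockQ4 quantize_block(const float* x) {
    float amax = 0.0f;
    for (int i = 0; i < kBlockSize; ++i) amax = std::max(amax, std::fabs(x[i]));
    BlockQ4 b;
    b.scale = amax / 7.0f;
    const float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;
    for (int i = 0; i < kBlockSize; i += 2) {
        int q0 = std::clamp((int)std::lround(x[i]     * inv), -8, 7) + 8;
        int q1 = std::clamp((int)std::lround(x[i + 1] * inv), -8, 7) + 8;
        b.quants[i / 2] = (uint8_t)(q0 | (q1 << 4));   // pack two nibbles
    }
    return b;
}

// Dequantize back to floats: w = scale * (q - 8).
void dequantize_block(const BlockQ4& b, float* out) {
    for (int i = 0; i < kBlockSize; i += 2) {
        out[i]     = b.scale * (int(b.quants[i / 2] & 0x0F) - 8);
        out[i + 1] = b.scale * (int(b.quants[i / 2] >> 4)   - 8);
    }
}
```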
Ships hand-tuned kernels that use SIMD (AVX2 or AVX-512 on x86, NEON on Arm) with careful cache tiling and prefetching. These kernels are written for the model shapes that matter.
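For a flavor of the SIMD style, here is a toy AVX2 + FMA dot product in C++. It is not a llama.cpp kernel (those operate directly on quantized blocks and add tiling and prefetching); it just shows the pattern: eight lanes per instruction, fused multiply-add, one horizontal reduction, a scalar tail.

```cpp
// Toy AVX2 + FMA dot product. Compile with: g++ -O2 -mavx2 -mfma
#include <immintrin.h>
#include <cstddef>

float dot_avx2(const float* a, const float* b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);     // acc += va * vb, 8 lanes at once
    }
    // Horizontal sum of the 8 accumulator lanes.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    float sum = _mm_cvtss_f32(lo);
    for (; i < n; ++i) sum += a[i] * b[i];      // scalar tail for leftover elements
    return sum;
}
```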
Avoids framework overhead: no autograd, no shape-polymorphism checks, no dispatcher hops. Static shapes and a static graph for inference.
Memory-maps weights, so cold start is faster and the working set streams in as needed, with very little extra copying.
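A minimal sketch of the memory-mapping idea on a POSIX system, with a placeholder file name; pages are faulted in lazily as tensors are first touched, so there is no upfront read and no second copy of the weights.

```cpp
// Memory-map a weights file (POSIX). "model.gguf" is just a placeholder path.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char* path = "model.gguf";            // placeholder path
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // Read-only shared mapping: the kernel's page cache backs the weights,
    // and pages are loaded only when they are actually touched.
    void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    // Optional hint for the first sequential pass over the file.
    madvise(data, st.st_size, MADV_SEQUENTIAL);

    // ... interpret `data` as the model's tensor blobs ...

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```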
Threads are pinned and scheduled for cache locality. The KV-cache layout and RoPE math are optimized for batch size one and small batches.
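A minimal, Linux-specific sketch of pinning one worker thread per core so each thread's slice of the work stays hot in that core's caches; a real scheduler would also account for hyperthreads and NUMA.

```cpp
// Pin one worker thread per core (Linux). Compile with: g++ -O2 -pthread
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

void pin_current_thread(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    const int n = (int)std::thread::hardware_concurrency();
    std::vector<std::thread> workers;
    for (int core = 0; core < n; ++core) {
        workers.emplace_back([core] {
            pin_current_thread(core);   // keep this thread's data in one core's caches
            // ... run this thread's slice of the matmul rows ...
        });
    }
    for (auto& t : workers) t.join();
    return 0;
}
```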
Fuses small ops so there are fewer passes over memory. Think dequantize and matmul in one sweep.
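A sketch of that fusion, reusing the 4-bit block layout from the quantization sketch above: each block is decoded on the fly and consumed immediately, so the full-precision row is never written out to memory and read back, unlike a separate dequantize-then-matmul pass.

```cpp
// Fused dequantize + dot product over a row stored as 4-bit blocks
// (same illustrative BlockQ4 layout as in the earlier sketch).
#include <cstdint>
#include <cstddef>

constexpr int kBlockSize = 32;

struct BlockQ4 {
    float scale;                      // per-block scale
    uint8_t quants[kBlockSize / 2];   // two 4-bit values per byte
};

// y = dot(w_row, x), where w_row is stored as n / 32 quantized blocks.
float fused_dequant_dot(const BlockQ4* blocks, const float* x, size_t n) {
    float sum = 0.0f;
    for (size_t b = 0; b < n / kBlockSize; ++b) {
        const BlockQ4& blk = blocks[b];
        float block_sum = 0.0f;
        for (int i = 0; i < kBlockSize; i += 2) {
            // Decode two 4-bit weights in registers and use them immediately.
            int q0 = int(blk.quants[i / 2] & 0x0F) - 8;
            int q1 = int(blk.quants[i / 2] >> 4)   - 8;
            block_sum += q0 * x[b * kBlockSize + i] + q1 * x[b * kBlockSize + i + 1];
        }
        sum += blk.scale * block_sum;   // apply the per-block scale once
    }
    return sum;
}
```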
Why PyTorch can look slower on CPU
It is a general platform. The CPU path carries checks, allocations, layout conversions, and dispatcher costs that help many models but cost cycles here.
Its quantized CPU kernels are improving but are not yet as specialized as llama.cpp's for this exact workload.
Many PyTorch setups keep weights in 8-bit or 16-bit, and that alone moves two to four times as much data through memory as a 4-bit model.
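A back-of-the-envelope calculation shows why bit width dominates on a bandwidth-bound CPU. It assumes an illustrative 7B-parameter model, one full read of the weights per generated token, and ignores per-block scale overhead and the KV cache.

```cpp
// Bytes streamed per generated token at different weight bit widths,
// for an illustrative 7B-parameter model.
#include <cstdio>

int main() {
    const double params = 7e9;
    const double bits[] = {4, 5, 8, 16};
    for (double b : bits) {
        double gib = params * b / 8.0 / (1024.0 * 1024.0 * 1024.0);
        std::printf("%4.0f-bit weights: %.1f GiB per token\n", b, gib);
    }
    // On a CPU with a few tens of GB/s of usable memory bandwidth, halving
    // the bytes roughly halves the minimum time per token.
    return 0;
}
```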
When PyTorch wins
On a GPU with cuBLAS and Tensor Cores, a PyTorch model running in half precision or better can outrun a CPU build by a large margin.
With large batches or complex pipelines where the framework graph and kernels are already well optimized.
Rule of thumb
For CPU, small-batch inference with strong quantization, llama.cpp usually wins. On GPU, or with larger batches, PyTorch often wins.