PyTorch offers a Python API, but the bulk of the processing is executed by the underlying C++ implementation (LibTorch).
GGML / Llama.cpp claims to be much faster because it was written in C/C++.
Why is that the case? I don’t think the Python binding adds much overhead, so shouldn’t they perform similarly?
Rather than PyTorch being slow, I think the key to Llama.cpp’s speed is its optimization of the generation strategy for CPU and its GGUF quantized model weights. Hugging Face TGI, for example, uses PyTorch as one of its backends yet remains fast. Python alone is slow and struggles with multi-core handling, but in scenarios where only the backend speed matters, that is often not much of an issue.
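As a rough illustration of why the binding itself is rarely the bottleneck, here is a minimal timing sketch (the shapes and iteration counts are arbitrary): one large matmul spends nearly all of its wall time inside the C++ kernel, while thousands of tiny ops expose the per-call Python and dispatcher overhead.

```python
import time
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

# One large matmul: almost all of the time is spent inside the C++/BLAS kernel,
# so the cost of the Python call itself is negligible.
t0 = time.perf_counter()
c = a @ b
t1 = time.perf_counter()

# Many tiny ops: per-call dispatch and Python overhead dominate, which is where
# a general framework pays and a purpose-built C++ loop does not.
x = torch.randn(64, 64)
t2 = time.perf_counter()
for _ in range(10_000):
    x = x + 1.0
t3 = time.perf_counter()

print(f"one 4096x4096 matmul: {t1 - t0:.4f} s")
print(f"10k tiny adds:        {t3 - t2:.4f} s")
```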
It is not about Python. It is about an inference-only stack that is laser-focused on CPU and cache behavior.
What llama.cpp does that PyTorch usually does not on CPU
- Uses very aggressive quantization, like 4-bit and 5-bit GGUF with per-block scales, in a layout that matches the matmul kernels. Fewer bytes moved is the main win on CPU (a minimal quantization sketch follows this list).
- Ships hand-tuned kernels that use SIMD (AVX2 or AVX-512 on x86, NEON on ARM) with careful cache tiling and prefetch. These kernels are written for the model shapes that matter.
- Avoids framework overhead: no autograd, no shape-polymorphism checks, no dispatcher hops. Static shapes and a static graph for inference.
- Memory-maps weights, so cold start is faster and working sets stream in as needed (see the memory-mapping sketch below). Very little extra copying.
- Pins and schedules threads for cache locality. The KV cache layout and RoPE math are optimized for batch size one and small batches.
- Fuses small ops, so there are fewer passes over memory. Think dequantize and matmul in one sweep.
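To give a feel for what "per-block scales" and "dequantize and matmul in one sweep" mean, here is a minimal NumPy sketch. The block size, the symmetric scheme, and the shapes are simplifications, not the actual GGUF format or the llama.cpp kernels.

```python
import numpy as np

BLOCK = 32  # elements per quantization block; GGUF formats use similar block sizes

def quantize_rows_q4(W):
    """Symmetric 4-bit per-block quantization of each row: int values in [-8, 7]
    plus one float scale per block, roughly 4.5 bits per weight when packed."""
    rows, cols = W.shape
    Wb = W.reshape(rows, cols // BLOCK, BLOCK)
    scale = np.abs(Wb).max(axis=2, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(Wb / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float32)

def dequant_matvec(q, scale, x):
    """Dequantize and multiply block by block in one sweep, so the weights are
    read only once in their compressed form (illustrative, not llama.cpp's kernel)."""
    rows, nblocks, _ = q.shape
    xb = x.reshape(nblocks, BLOCK)
    # Per-block partial dot products, then reduced over blocks.
    partial = np.einsum("rbk,bk->rb", q.astype(np.float32) * scale, xb)
    return partial.sum(axis=1)

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 1024)).astype(np.float32)
x = rng.standard_normal(1024).astype(np.float32)

q, s = quantize_rows_q4(W)
y_ref = W @ x
y_q = dequant_matvec(q, s, x)
print("max abs error vs full precision:", np.abs(y_ref - y_q).max())
```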
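And a small sketch of the memory-mapping point, using a toy raw-float32 file (a stand-in for a real checkpoint, not GGUF): nothing is read up front, and the OS only pages in the parts that are actually touched.

```python
import numpy as np

# Write a toy weight file once (stand-in for a real model checkpoint).
np.random.default_rng(0).standard_normal((4096, 4096), dtype=np.float32).tofile("weights.bin")

# Memory-map it: the file is not read eagerly; pages are faulted in by the OS
# only when a slice is touched, which is roughly what mmap'd GGUF loading buys
# llama.cpp for cold starts.
W = np.memmap("weights.bin", dtype=np.float32, mode="r", shape=(4096, 4096))
x = np.ones(4096, dtype=np.float32)
y = W[:256] @ x  # only the pages backing the first 256 rows are pulled in
print(y[:4])
```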
Why PyTorch can look slower on CPU
- It is a general platform. The CPU path carries checks, allocations, layout conversions, and dispatcher cost that help many models but cost cycles here.
- Its quantized CPU kernels are improving but are not yet as specialized as llama.cpp's for this exact workload.
- Many PyTorch setups keep weights in 8-bit or 16-bit, and that alone moves two to four times more data through memory (rough numbers below).
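As a back-of-the-envelope check of the "bytes moved" point, assuming every weight is read once per generated token and roughly 4.5 bits per weight for a 4-bit format with per-block scales (both simplifications):

```python
# Rough memory traffic per token for a 7B-parameter model, ignoring the KV
# cache, activations, and any caching effects.
params = 7e9
bytes_per_weight = {
    "fp16": 2.0,
    "int8": 1.0,
    "4-bit + per-block scales (~4.5 bits)": 4.5 / 8,
}
for name, b in bytes_per_weight.items():
    print(f"{name:>38}: {params * b / 1e9:5.1f} GB per token")
```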
When PyTorch wins
- On GPU, with cuBLAS and Tensor Cores, a PyTorch model running in half precision or better can outrun a CPU build by a large margin (a timing sketch follows this list).
- With large batches or complex pipelines where the framework graph and kernels are already well optimized.
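A minimal sketch of the GPU side, assuming a CUDA device is available (the matrix size is arbitrary; this only times a single fp16 matmul with CUDA events):

```python
import torch

# Warm-up so kernel selection and CUDA context setup are not in the timing.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
_ = a @ b
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
c = a @ b
end.record()
torch.cuda.synchronize()
print(f"fp16 4096x4096 matmul on GPU: {start.elapsed_time(end):.2f} ms")
```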
Rule of thumb
For CPU, small-batch inference with strong quantization, llama.cpp usually wins. On GPU, or with larger batches, PyTorch often wins.
Reply generated by TD Ai.