It is not about Python. It is about an inference-only stack that is laser-focused on CPU and cache behavior.
What llama.cpp does that PyTorch usually does not on CPU
Uses aggressive quantization, such as 4-bit and 5-bit GGUF formats with per-block scales, in a layout that matches the matmul kernels. Moving fewer bytes is the main win on CPU.
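To make "per-block scales" concrete, here is a minimal C++ sketch of 4-bit block quantization: one scale per 32 weights, two 4-bit values packed per byte. It only illustrates the idea and is not the exact GGUF block format.

```cpp
// Minimal sketch of per-block 4-bit quantization (illustrative, not the exact
// GGUF/llama.cpp block layout). Each block stores one scale plus 32 packed
// 4-bit values, so 32 weights cost 16 bytes of quants + 4 bytes of scale
// instead of 128 bytes in fp32.
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int kBlockSize = 32;

struct BlockQ4 {
    float scale;                      // per-block scale
    uint8_t quants[kBlockSize / 2];   // two 4-bit values per byte
};

// Quantize one block of 32 floats to 4-bit values in [-8, 7].
BlockQ4 quantize_block(const float* x) {
    float amax = 0.0f;
    for (int i = 0; i < kBlockSize; ++i) amax = std::max(amax, std::fabs(x[i]));
    BlockQ4 b;
    b.scale = amax / 7.0f;
    const float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;
    for (int i = 0; i < kBlockSize; i += 2) {
        int q0 = std::clamp((int)std::lround(x[i]     * inv), -8, 7) + 8;
        int q1 = std::clamp((int)std::lround(x[i + 1] * inv), -8, 7) + 8;
        b.quants[i / 2] = (uint8_t)(q0 | (q1 << 4));   // pack two nibbles
    }
    return b;
}

// Dequantize back to floats: w = scale * (q - 8).
void dequantize_block(const BlockQ4& b, float* out) {
    for (int i = 0; i < kBlockSize; i += 2) {
        out[i]     = b.scale * (int(b.quants[i / 2] & 0x0F) - 8);
        out[i + 1] = b.scale * (int(b.quants[i / 2] >> 4)   - 8);
    }
}
```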
Ships hand-tuned kernels that use SIMD (AVX2 or AVX-512 on x86, NEON on Arm) with careful cache tiling and prefetching. These kernels are written for the model shapes that matter.
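For a flavor of the SIMD style, here is a toy AVX2 + FMA dot product in C++. It is not a llama.cpp kernel (those operate directly on quantized blocks and add tiling and prefetching); it just shows the pattern: eight lanes per instruction, fused multiply-add, one horizontal reduction, a scalar tail.

```cpp
// Toy AVX2 + FMA dot product. Compile with: g++ -O2 -mavx2 -mfma
#include <immintrin.h>
#include <cstddef>

float dot_avx2(const float* a, const float* b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);     // acc += va * vb, 8 lanes at once
    }
    // Horizontal sum of the 8 accumulator lanes.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    float sum = _mm_cvtss_f32(lo);
    for (; i < n; ++i) sum += a[i] * b[i];      // scalar tail for leftover elements
    return sum;
}
```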
Avoids framework overhead: no autograd, no shape-polymorphism checks, no dispatcher hops. Static shapes and a static graph for inference.
Memory-maps weights, so cold start is faster and the working set streams in as needed, with very little extra copying.
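A minimal sketch of the memory-mapping idea on a POSIX system, with a placeholder file name; pages are faulted in lazily as tensors are first touched, so there is no upfront read and no second copy of the weights.

```cpp
// Memory-map a weights file (POSIX). "model.gguf" is just a placeholder path.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char* path = "model.gguf";            // placeholder path
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // Read-only shared mapping: the kernel's page cache backs the weights,
    // and pages are loaded only when they are actually touched.
    void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    // Optional hint for the first sequential pass over the file.
    madvise(data, st.st_size, MADV_SEQUENTIAL);

    // ... interpret `data` as the model's tensor blobs ...

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```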
Threads are pinned and scheduled for cache locality. The KV-cache layout and RoPE math are optimized for batch size one and small batches.
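A minimal, Linux-specific sketch of pinning one worker thread per core so each thread's slice of the work stays hot in that core's caches; a real scheduler would also account for hyperthreads and NUMA.

```cpp
// Pin one worker thread per core (Linux). Compile with: g++ -O2 -pthread
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

void pin_current_thread(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    const int n = (int)std::thread::hardware_concurrency();
    std::vector<std::thread> workers;
    for (int core = 0; core < n; ++core) {
        workers.emplace_back([core] {
            pin_current_thread(core);   // keep this thread's data in one core's caches
            // ... run this thread's slice of the matmul rows ...
        });
    }
    for (auto& t : workers) t.join();
    return 0;
}
```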
Fuses small ops so there are fewer passes over memory. Think dequantize and matmul in one sweep.
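A sketch of that fusion, reusing the 4-bit block layout from the quantization sketch above: each block is decoded on the fly and consumed immediately, so the full-precision row is never written out to memory and read back, unlike a separate dequantize-then-matmul pass.

```cpp
// Fused dequantize + dot product over a row stored as 4-bit blocks
// (same illustrative BlockQ4 layout as in the earlier sketch).
#include <cstdint>
#include <cstddef>

constexpr int kBlockSize = 32;

struct BlockQ4 {
    float scale;                      // per-block scale
    uint8_t quants[kBlockSize / 2];   // two 4-bit values per byte
};

// y = dot(w_row, x), where w_row is stored as n / 32 quantized blocks.
float fused_dequant_dot(const BlockQ4* blocks, const float* x, size_t n) {
    float sum = 0.0f;
    for (size_t b = 0; b < n / kBlockSize; ++b) {
        const BlockQ4& blk = blocks[b];
        float block_sum = 0.0f;
        for (int i = 0; i < kBlockSize; i += 2) {
            // Decode two 4-bit weights in registers and use them immediately.
            int q0 = int(blk.quants[i / 2] & 0x0F) - 8;
            int q1 = int(blk.quants[i / 2] >> 4)   - 8;
            block_sum += q0 * x[b * kBlockSize + i] + q1 * x[b * kBlockSize + i + 1];
        }
        sum += blk.scale * block_sum;   // apply the per-block scale once
    }
    return sum;
}
```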
Why PyTorch can look slower on CPU
It is a general platform. The CPU path carries checks, allocations, layout conversions, and dispatcher costs that help many models but cost cycles here.
Its quantized CPU kernels are improving but are not yet as specialized as llama.cpp's for this exact workload.
Many PyTorch setups keep weights in 8-bit or 16-bit, and that alone moves two to four times as much data through memory as a 4-bit model.
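A back-of-the-envelope calculation shows why bit width dominates on a bandwidth-bound CPU. It assumes an illustrative 7B-parameter model, one full read of the weights per generated token, and ignores per-block scale overhead and the KV cache.

```cpp
// Bytes streamed per generated token at different weight bit widths,
// for an illustrative 7B-parameter model.
#include <cstdio>

int main() {
    const double params = 7e9;
    const double bits[] = {4, 5, 8, 16};
    for (double b : bits) {
        double gib = params * b / 8.0 / (1024.0 * 1024.0 * 1024.0);
        std::printf("%4.0f-bit weights: %.1f GiB per token\n", b, gib);
    }
    // On a CPU with a few tens of GB/s of usable memory bandwidth, halving
    // the bytes roughly halves the minimum time per token.
    return 0;
}
```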
When PyTorch wins
On a GPU with cuBLAS and Tensor Cores, a PyTorch model running in half precision or better can outrun a CPU build by a large margin.
With large batches or complex pipelines where the framework graph and kernels are already well optimized.
Rule of thumb
For CPU, small-batch inference with strong quantization, llama.cpp usually wins. On GPU, or with larger batches, PyTorch often wins.