KAIO: Rust-native GPU kernel framework, 92.5% of cuBLAS sgemm (custom ops without CUDA C++)

Hey everyone! I just released v0.2.0 of KAIO, a framework for writing custom GPU kernels in Rust. Sharing here because it’s directly relevant to anyone hitting the “I need a custom op but Triton is Python-only and CUDA C++ is painful” wall.

The problem KAIO solves: if you’re building inference pipelines with candle, burn, or any Rust-based ML stack, and you need a kernel that doesn’t exist: a fused activation, a custom attention variant, a quantization op, your only option today is writing CUDA C++, building FFI bindings, and losing Windows support. Python developers reach for triton.jit and move on. Rust developers had no equivalent. KAIO is an attempt to fill that gap.

You write a kernel in Rust:

#[gpu_kernel(block_size = 256)]
fn fused_silu_gate(x: &[f32], gate: &[f32], out: &mut [f32], n: u32) {
    let idx = thread_idx_x() + block_idx_x() * block_dim_x();
    if idx < n {
        let xi = x[idx];
        let sig = 1.0f32 / (1.0f32 + exp(-xi));
        out[idx] = xi * sig * gate[idx];
    }
}

The proc macro lowers it to PTX at build time. The runtime dispatches via the NVIDIA driver. No CUDA toolkit needed to build, works on Windows and Linux.

Performance: The tensor-core matmul hits 92.5% of cuBLAS sgemm at 4096² on an RTX 4090 (fp16 inputs, fp32 accumulation, cp.async double-buffered). That’s not a wrapper around cuBLAS, it’s a from-scratch tiled matmul with mma.sync fragment mapping, multi-warp blocks, and edge-tile handling, all generated from Rust-authored IR.

Where this fits in the HF ecosystem: candle is excellent for inference but doesn’t have a custom kernel authoring story. If you’re extending candle with a new op today, you’re writing CUDA C++ and binding it through candle’s FFI layer. KAIO could serve as the kernel-authoring layer underneath, write the GPU op in Rust, call it from candle, stay in one language with one toolchain. That integration doesn’t exist yet, but the kernel primitives are ready.

Limitations (being upfront): NVIDIA-only (SM 7.0+), inference-focused (no autograd), pre-1.0 API. The kernel DSL is a Rust subset — no closures, traits, or generics inside kernel bodies. Small-shape matmul trails cuBLAS significantly. The tensor-core comparison is fp16→fp32 vs cuBLAS sgemm f32→f32, which is a project-local baseline, not an apples-to-apples precision claim.

Links:

Curious what custom kernels people in the HF community would find most useful, fused attention variants? Quantization ops for GPTQ/AWQ? Custom normalization layers? That feedback would directly shape the roadmap.

1 Like