KAIO
Write custom GPU kernels in Rust — no CUDA C++, no Python, no toolkit. KAIO compiles a Rust-subset DSL to PTX at build time and JIT-loads it on the GPU at launch. Works on Windows and Linux with just the NVIDIA display driver installed.
KAIO (from the Greek καιω — to burn, to ignite) is for Rust engineers who need custom GPU kernels today — fused attention variants, quantization ops, novel activations — and would otherwise be writing CUDA C++ because their framework doesn't support them.
Key highlights
- No CUDA toolkit required — just the NVIDIA display driver. Build in CI on a standard GitHub runner; host tests pass without a GPU. This is KAIO's single biggest differentiator vs CUDA C++, rust-cuda, and Triton.
- Meets or beats cuBLAS sgemm at 4096² — on RTX 4090 across 10 consecutive benchmark runs, KAIO tensor-core matmul (async, fp16 inputs with fp32 accumulation) reached 58.74 TFLOPS worst observed and 65.12 TFLOPS median. Against the corresponding cuBLAS sgemm baseline, the worst observed ratio was 115% of cuBLAS. These are floors, not peaks. Full distribution →
- Windows and Linux native. No WSL2, no Triton's Linux-only runtime, no Python. `cargo build` works everywhere.
- Pure-Rust kernel authorship. The `#[gpu_kernel]` proc macro lowers Rust to a PTX IR module at compile time; at launch the module is validated against the current GPU's SM target, emitted to PTX text, and handed to the CUDA driver for JIT compilation. Type-safe kernel signatures catch dtype mismatches at compile time, not as silent GPU corruption at runtime.
Try KAIO in 30 seconds
Clone the repo, run one command, see real ML kernels build and execute on your GPU:
git clone https://github.com/dmriding/kaio.git
cd kaio
cargo xtask showcase
You'll see fused_silu_gate, gelu_comparison, rms_norm, layer_norm, softmax, int8_dequant, int8_matmul, int4_matmul, and quantized_attention compile, launch, verify correctness against a CPU reference, and report median latency. The nine examples span activations, normalizations, reductions, the quantize → matmul pipeline, and end-to-end quantized attention — the canonical transformer-primitive arc plus the W8A8 / W4A16 / fused-QKV headline ops.
Want the performance pitch instead? cargo xtask bench runs the tensor-core matmul benchmark against cuBLAS sgemm across five sizes. Or cargo xtask all for both. cargo xtask --help for the full tooling surface.
Requires an NVIDIA GPU with an installed display driver (NVIDIA 525 or newer — any standard Game Ready or Studio driver works). No CUDA toolkit install needed.
Quick Start
Requires an NVIDIA GPU (SM 7.0+, Volta or newer) with an installed display driver — NVIDIA 525 or newer (any standard Game Ready or Studio driver works; no Tesla/TCC-specific drivers needed).
cargo add kaio
use kaio::prelude::*;
#[gpu_kernel(block_size = 256)]
fn saxpy(x: *const [f32], y: *mut [f32], alpha: f32, n: u32) {
let idx = thread_idx_x() + block_idx_x() * block_dim_x();
if idx < n {
y[idx] = alpha * x[idx] + y[idx];
}
}
fn main() -> Result<()> {
let device = KaioDevice::new(0)?;
let n = 1024u32;
let x = device.alloc_from(&vec![1.0f32; n as usize])?;
let mut y = device.alloc_from(&vec![2.0f32; n as usize])?;
saxpy::launch(&device, &x, &mut y, 2.5f32, n)?;
let result = y.to_host(&device)?;
println!("result: {:?}", &result[..8]);
Ok(())
}
$ cargo run
result: [4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5]
Kernel buffer parameters use *const [T] / *mut [T] to reflect
device-pointer semantics (thousands of threads, no aliasing contract);
&[T] / &mut [T] are accepted as ergonomic sugar and lower
identically — see RFC-0001.
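For illustration, here is the same SAXPY kernel written with the reference-style sugar — a sketch assuming, per the RFC-0001 note above, that the reference form lowers identically to the pointer form:
use kaio::prelude::*;
// Same kernel as the Quick Start, using the &[T] / &mut [T] sugar from
// RFC-0001. Per the note above this is expected to lower to identical PTX.
#[gpu_kernel(block_size = 256)]
fn saxpy_ref(x: &[f32], y: &mut [f32], alpha: f32, n: u32) {
    let idx = thread_idx_x() + block_idx_x() * block_dim_x();
    if idx < n {
        y[idx] = alpha * x[idx] + y[idx];
    }
}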
The real pitch — fused ML kernels
SAXPY is for learning the DSL. The actual value looks like this:
use kaio::prelude::*;
// Gated SiLU — the feedforward activation in every LLaMA / Mistral /
// Qwen block. llama.cpp, vLLM, and TensorRT-LLM all ship hand-written
// CUDA for it. With KAIO it's 7 lines of Rust, lowered to a PTX IR
// module at compile time and JIT-loaded at launch.
#[gpu_kernel(block_size = 256)]
fn fused_silu_gate(x: *const [f32], gate: *const [f32], out: *mut [f32], n: u32) {
let idx = thread_idx_x() + block_idx_x() * block_dim_x();
if idx < n {
let xi = x[idx];
let sig = 1.0f32 / (1.0f32 + exp(-xi));
out[idx] = xi * sig * gate[idx];
}
}
Run it from the repo root with no directory changes:
$ cargo xtask showcase silu
=== fused_silu_gate ===
Input size: 1048576 elements
Correctness: PASS (max_abs_err = 1.49e-8)
Median latency: 188.8 μs (of 100 timed runs, 5 warm-ups skipped)
Or run all nine showcases in sequence with cargo xtask showcase:
fused SiLU-gate,
exact vs fast GELU,
single-block RMSNorm,
single-block LayerNorm,
single-block softmax,
INT8 dequantization,
INT8 matmul,
INT4 matmul,
quantized attention. Each is a
complete standalone project with correctness + timing (with its own
Cargo.toml so you can copy the directory out of the repo as a
reference for your own kernel); the GELU comparison's README explains
why kernel fusion matters more than arithmetic optimization for ML
workloads (the bandwidth-bound teaching moment).
When to use KAIO
Reach for KAIO when:
- Your framework can't support a custom op (novel attention, fused activation, quantization) and you don't want to drop into CUDA C++ for one kernel.
- You need GPU inference on Windows without WSL2 or Triton's Linux-only runtime.
- Your CI runs on standard GitHub runners without GPU or CUDA toolkit access. KAIO's host tests pass without a GPU; only the `#[ignore]`-gated integration tests need one. Flip the matrix on later (see the sketch after this list).
- You need deterministic VRAM usage, explicit buffer reuse, or zero-copy transfers that high-level frameworks abstract away.
- You're prototyping GPU code in a language you already know (Rust) without learning CUDA C++.
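A sketch of that CI split: host-only runners skip the GPU tests, and a machine with an NVIDIA card runs them with cargo test -- --ignored. The test below assumes the saxpy kernel from the Quick Start is in scope; names are illustrative.
// Sketch of an #[ignore]-gated GPU integration test. Assumes the saxpy
// kernel from the Quick Start is in scope; run with `cargo test -- --ignored`
// on a machine with an NVIDIA GPU.
#[test]
#[ignore = "requires an NVIDIA GPU"]
fn saxpy_runs_on_gpu() {
    use kaio::prelude::*;

    let device = KaioDevice::new(0).expect("no CUDA device available");
    let n = 256u32;
    let x = device.alloc_from(&vec![1.0f32; n as usize]).expect("alloc x");
    let mut y = device.alloc_from(&vec![2.0f32; n as usize]).expect("alloc y");
    saxpy::launch(&device, &x, &mut y, 2.5f32, n).expect("launch saxpy");
    let out = y.to_host(&device).expect("copy back");
    assert!(out.iter().all(|&v| (v - 4.5f32).abs() < 1e-6));
}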
KAIO is not a replacement for Candle or Burn. It is the layer you use when you need more control than they provide.
| | KAIO | cudarc | Candle / Burn | Triton (Python) | Raw CUDA |
|---|---|---|---|---|---|
| Write kernels in Rust | Yes | No (load PTX) | No | No (Python) | No |
| Automatic PTX generation | Yes | No | N/A | Yes (runtime) | No |
| Windows support | Yes | Yes | Partial | No | Yes |
| No CUDA toolkit needed | Yes | Yes | Varies | No | No |
| Compile-time kernel codegen | Yes | N/A | N/A | No (runtime JIT) | Yes |
| Type-safe kernel signatures | Yes | No | N/A | No | No |
| ML framework integration | candle (via kaio-candle) | Standalone | Built-in | PyTorch | Manual |
What this is not
- Not compiled Rust. `#[gpu_kernel]` bodies use Rust syntax but are parsed into KAIO's own IR and lowered directly to PTX. rustc's backend (LLVM, MIR, borrow checker) never sees the kernel body. You cannot call Rust functions declared outside the kernel from inside it.
- Not CUDA bindings. KAIO generates PTX itself. It does not wrap cuDNN, cuBLAS, CUTLASS, or any CUDA C++ library. The comparison to cuBLAS sgemm in this README is a measurement reference, not a dependency.
- Not a full ML framework. No autograd (beyond the handful of `kaio-candle` backward bindings), no model zoo, no training loop. KAIO is the layer you use when Candle or Burn don't have the op you need.
Performance
Performance is optimized for large ML workloads (transformer-scale shapes). Small sizes are launch-overhead dominated and will lag cuBLAS — see the apples-to-apples notes below.
Numbers below are worst observed across 10 consecutive
cargo xtask bench runs on RTX 4090 sm_89, release build, warm GPU
(median of 20 timed iterations per run, then minimum across runs).
These are the conservative local numbers used for README claims.
Thermally throttled or non-steady-state cards may see lower; the
methodology is in docs/benchmarks.md.
Tensor-core matmul (f16 × f16 → f32):
| Size | TC sync worst | TC async worst | cuBLAS sgemm worst | sync vs cuBLAS | async vs cuBLAS |
|---|---|---|---|---|---|
| 256³ | 0.09 TF | 0.09 TF | 1.71 TF | 5.3% | 5.3% |
| 512³ | 0.72 TF | 0.53 TF | 10.74 TF | 6.7% | 4.9% |
| 1024³ | 5.14 TF | 4.98 TF | 31.86 TF | 16.1% | 15.6% |
| 2048³ | 27.11 TF | 27.90 TF | 43.12 TF | 62.9% | 64.7% |
| 4096³ | 54.63 TF | 58.74 TF | 51.05 TF | 107.0% | 115.1% |
Quantized matmul at 4096³ (same bench, worst-of-10):
| Kernel | Worst TOPS | Median TOPS | Best TOPS |
|---|---|---|---|
| `matmul_int8` (W8A8) | 84.07 | 92.58 | 93.38 |
| `matmul_int4` (W4A16) | 52.02 | 57.52 | 58.04 |
Apples-to-apples disclaimer: KAIO tensor-core matmul uses fp16 inputs
with fp32 accumulation; cuBLAS sgemm is f32 in / f32 out. The INT8 and
INT4 columns compare against cuBLAS sgemm because cublasGemmEx INT8
is not cleanly exposed by cudarc 0.19 — weight bandwidth alone
(0.5 B/weight for INT4 vs 4 B for sgemm) dominates these ratios.
Compare these numbers sprint-over-sprint for regression detection; the
"vs cuBLAS sgemm" column is a project-local baseline, not a
precision-identity claim. See docs/performance.md
for the full distribution (min / median / max across all sizes).
The problem KAIO solves
The Rust ML ecosystem can't keep up with Python. Every time a new model architecture drops with a custom operation — a novel attention variant, a fused activation, a custom quantization kernel — frameworks like candle and burn can't support it until someone writes the GPU function. Today, that means writing CUDA C++, fighting FFI bindings, and giving up on Windows.
Meanwhile, Python developers write a Triton kernel in an afternoon and move on. Triton doesn't support Windows, requires Python, and JIT-compiles at runtime — but it works, and Rust has no equivalent.
KAIO is that equivalent.
Patterns
Copy these skeletons, fill in your logic.
Bounds-checked element-wise:
#[gpu_kernel(block_size = 256)]
fn my_kernel(input: *const [f32], output: *mut [f32], n: u32) {
let idx = thread_idx_x() + block_idx_x() * block_dim_x();
if idx < n {
output[idx] = input[idx] * 2.0; // your logic here
}
}
Shared memory tiling:
#[gpu_kernel(block_size = 256)]
fn tiled(data: *const [f32], out: *mut [f32], n: u32) {
let tid = thread_idx_x();
let idx = tid + block_idx_x() * block_dim_x();
let tile = shared_mem![f32; 256];
if idx < n { tile[tid] = data[idx]; }
bar_sync();
if idx < n { out[idx] = tile[tid]; } // read from shared
}
Block reduction:
#[gpu_kernel(block_size = 256)]
fn reduce(input: *const [f32], out: *mut [f32], n: u32) {
let idx = thread_idx_x() + block_idx_x() * block_dim_x();
let val = if idx < n { input[idx] } else { 0.0f32 };
let sum = block_reduce_sum(val);
if thread_idx_x() == 0u32 { out[block_idx_x()] = sum; }
}
Feature summary
| Feature | Notes |
|---|---|
| `#[gpu_kernel]` proc macro | Rust → PTX IR at compile time; PTX emitted + JIT-loaded at launch. Type-safe launch wrapper auto-generated. |
| Shared memory + reductions + warp shuffles | shared_mem![], bar_sync(), block_reduce_sum/max/min, warp_reduce_sum/max/min, shfl_sync_*. |
| 2D blocks, FMA, math builtins | block_size = (16,16), fma, sqrt, exp, log, tanh, abs, min, max. |
| Scalar tiled matmul | kaio_ops::matmul / matmul_auto — 31% of cuBLAS sgemm. Any SM. |
| Fused attention + FlashAttention | kaio_ops::attention, attention_flash (O(d_k) memory). Any SM. |
| Tensor-core matmul | kaio_ops::matmul_tc / matmul_tc_async / matmul_auto_tc — f16 → f32, SM 8.0+, worst-of-10 at 4096³ on RTX 4090: sync 107% / async 115% of cuBLAS sgemm. |
| INT8 dequantize-matmul (W8A8) | kaio_ops::matmul_int8 — symmetric i8 × i8 → f32 with single-scalar scale, SM 8.0+, K%32==0. Worst-of-10 at 4096³: 84.07 TOPS (median 92.58, best 93.38). |
| INT4 dequantize-matmul (W4A16, GPTQ-style) | kaio_ops::matmul_int4 — packed signed-INT4 weights × f16 activations → f32, f16 group scales (group_size=128), DEQUANT-F16 via mma.sync.m16n8k16, SM 8.0+, K%128==0. Worst-of-10 at 4096³: 52.02 TOPS (median 57.52, best 58.04). |
| Fused tri-output QKV projection (INT8, W8A16) | kaio_ops::qkv_project_int8 — f16 activations × i8 weights × per-projection scalar scales → three f16 outputs (Q, K, V). Decode tier ~3× faster than three standalone matmuls; prefill performance varies by shape. SM 8.0+, K%16==0, N%2==0. |
| Fused tri-output QKV projection (INT4, W4A16) | kaio_ops::qkv_project_int4 — packed INT4 weights × f16 activations × f16 group scales → three f16 outputs. Decode tier ~3× faster than three standalone calls; for prefill-heavy workloads at M≥2048, three separate matmul_int4 calls may be faster. SM 8.0+, K%128==0, group_size=128. |
| Auto-tuner + cache | tune_matmul, matmul_auto, matmul_auto_tc with JSON cache. |
| PTX inspection | KAIO_DUMP_PTX=1, KAIO_PTX_STATS=1, KAIO_PTX_ANNOTATE=1. |
See docs.rs/kaio for the full API surface and the internal IR types.
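To give one concrete flavor of the math builtins listed above, here is an illustrative element-wise kernel using the tanh builtin — a tanh-approximation GELU sketch, not the repo's gelu_comparison showcase code:
use kaio::prelude::*;
// Illustrative element-wise kernel using the tanh builtin: the common
// tanh approximation of GELU. A sketch, not the gelu_comparison showcase.
#[gpu_kernel(block_size = 256)]
fn gelu_tanh(input: *const [f32], output: *mut [f32], n: u32) {
    let idx = thread_idx_x() + block_idx_x() * block_dim_x();
    if idx < n {
        let x = input[idx];
        // 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
        let inner = 0.7978845608f32 * (x + 0.044715f32 * x * x * x);
        output[idx] = 0.5f32 * x * (1.0f32 + tanh(inner));
    }
}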
Project status and constraints
KAIO is pre-1.0 software. Current engineering constraints:
- NVIDIA only. SM 7.0+ (Volta, Turing, Ampere, Ada Lovelace, Hopper). No AMD, no Intel, no Apple Silicon.
- Matmul performance is size-dependent. Tensor-core matmul meets or beats cuBLAS sgemm at 4096² on RTX 4090 (worst-of-10: 107% sync / 115% async) but lags heavily at ≤1024² (5–17%) because a 64×64 multi-warp block tile doesn't fill the SM array until the grid is large. Scalar matmul tops out at 31% of cuBLAS. For small shapes prefer cuBLAS or the scalar path. Details →
- Mostly inference. The `kaio-candle` bridge ships 8 forward ops; `matmul_tc` and `matmul_tc_async` support backward via mixed-precision autograd (gradients are computed in f16, matching the forward-pass precision). Attention and quantized-op backward are not yet implemented.
- DSL is a Rust subset, not compiled Rust. `#[gpu_kernel]` function bodies use Rust syntax but are parsed into KAIO's own IR and lowered directly to PTX. The kernel body never reaches rustc's backend — no LLVM, no MIR, no borrow checker. Kernel parameters are written as `*const [T]` / `*mut [T]` (primary, per RFC-0001) to signal the on-device reality: thousands of GPU threads access the same buffer concurrently, and correctness depends on disjoint access patterns (e.g. `if idx < n` guards), not on compiler-enforced uniqueness. `&[T]` / `&mut [T]` are accepted as permanent sugar and lower identically. You cannot call Rust functions declared outside the kernel inside the kernel body. No closures, traits, generics, method calls, or string operations. Arithmetic, comparisons, bitwise operators (`&`, `|`, `^`, `<<`, `>>`, `!`), short-circuit `&&` / `||`, and compound assignment (including bitwise `&=` / `|=` / `<<=` etc.) all supported — see the operator sketch after this list.
- FlashAttention d_k limit. `attention_flash()` requires d_k ≤ 256 (one thread per output dimension).
- Single-device. No multi-GPU support.
- First-call PTX module load. The `#[gpu_kernel]` macro generates PTX at Rust compile time, but CUDA-driver module loading still happens at first launch. First call pays the module-load latency; subsequent launches are dispatch-only. (The in-process PTX cache was removed in v0.2.1 to simplify the load path and run kernels through `PtxModule::validate()` on every launch.)
- API will change before 1.0. Breaking changes documented in CHANGELOG per release.
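The operator sketch referenced above — a small kernel exercising shifts and bitwise AND inside the supported subset (illustrative only, not a shipped op):
use kaio::prelude::*;
// Illustrative use of the supported operator subset: shift and bitwise AND
// on the thread index to split interleaved data into two output buffers.
#[gpu_kernel(block_size = 256)]
fn deinterleave(input: *const [f32], even: *mut [f32], odd: *mut [f32], n: u32) {
    let idx = thread_idx_x() + block_idx_x() * block_dim_x();
    if idx < n {
        let pair = idx >> 1;   // which output slot this element maps to
        let lane = idx & 1u32; // 0 = even lane, 1 = odd lane
        if lane == 0u32 {
            even[pair] = input[idx];
        } else {
            odd[pair] = input[idx];
        }
    }
}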
Architecture
Four layers, bottom to top:
+-------------------------------------------+
| Layer 4: Block-Level Operations | tiled matmul, fused attention, TC matmul
+-------------------------------------------+
| Layer 3: #[gpu_kernel] Proc Macro | Rust syntax → PTX automatically
+-------------------------------------------+
| Layer 2: Runtime (kaio-runtime) | device memory, kernel launch, SM validation
+-------------------------------------------+
| Layer 1: PTX Codegen (kaio-core) | IR types, instruction emitters, fragments
+-------------------------------------------+
| Crate | Description |
|---|---|
| `kaio` | Umbrella crate — re-exports everything via prelude |
| `kaio-macros` | `#[gpu_kernel]` proc macro |
| `kaio-core` | PTX IR, instruction emitters, fragment containers, zero external deps |
| `kaio-runtime` | CUDA driver wrapper via cudarc |
| `kaio-ops` | Pre-built GPU operations (matmul, attention, TC matmul, auto-tuner) |
| `kaio-candle` | candle bridge — 8 forward ops + 2 backward (matmul_tc, matmul_tc_async), event-based stream sync. Standalone crate at kaio-candle/ |
Candle integration
The kaio-candle crate bridges KAIO's GPU kernels into
candle's tensor graph. 8 forward
ops + 2 backward ops, event-based stream sync (CUDA Graph compatible).
use std::sync::Arc;
use kaio::prelude::KaioDevice;
let kd = Arc::new(KaioDevice::new(0)?);
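// `a` and `b`: candle Tensors already on the matching CUDA device (assumed, not shown here)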
let c = kaio_candle::matmul_tc(&kd, &a, &b)?; // candle Tensor in, candle Tensor out
See kaio-candle/README.md for the full op
surface and usage guide.
Target hardware
- GPUs: NVIDIA SM 7.0+ (Volta, Turing, Ampere, Ada Lovelace, Hopper). Tensor-core kernels (`matmul_tc*`) require SM 8.0+ (Ampere or newer).
- Platforms: Windows 10/11 and Linux (Ubuntu 22.04+).
- Driver: NVIDIA 525+ (CUDA 12.0+ compatible). Standard Game Ready or Studio drivers work on consumer cards; Tesla/TCC drivers are not required and not needed for KAIO's dynamic-loading path.
- Tested on: RTX 4090 (sm_89) under Windows.
Building
cargo build --workspace
cargo test --workspace # host tests (no GPU needed)
cargo test --workspace -- --ignored # GPU tests (requires NVIDIA GPU)
KAIO_DUMP_PTX=1 cargo test # inspect generated PTX
Requires Rust 1.94+ (pinned via rust-toolchain.toml). The version
floor reflects edition-2024 features and const-evaluation patterns used
in kernel tile-layout computation. No CUDA toolkit is needed to build
— KAIO resolves the NVIDIA driver at runtime via dynamic loading
(nvcuda.dll on Windows, libcuda.so on Linux).
Debugging
When something goes wrong — launch errors, silent NaN, unexpectedly slow
performance — docs/debugging.md is the single
entry point. It covers the env vars (KAIO_DUMP_PTX, KAIO_PTX_STATS,
KAIO_PTX_ANNOTATE, KAIO_SM_TARGET, KAIO_TUNE_CACHE,
KAIO_SUPPRESS_DEBUG_WARNING), the async-launch error model,
compute-sanitizer usage for silent-corruption diagnosis, tolerance
guidance for numerical verification, and a troubleshooting flowchart
for "did it compile → launch → produce right output?"
Test coverage
88.39% line coverage across the 21,569-line workspace (2,504 lines
uncovered, mostly host-side parser error paths, the xtask repo-tooling
binary, and the unreachable-by-design host stubs for GPU builtins in
kaio/src/gpu_builtins.rs). Shipped kernel crates remain well above
the workspace average — kaio-ops/src/matmul_int8_kernel.rs at 97.18%,
matmul_tc_kernel.rs at 97.94%, matmul_tc_async_kernel.rs at 99.87%,
attention_tc_kernel.rs at 97.70%. Measured on RTX 4090 sm_89 via
cargo llvm-cov with the host test suite and the full GPU-only
--ignored test suite merged:
cargo install cargo-llvm-cov # one-time
cargo llvm-cov clean --workspace
cargo llvm-cov --workspace --no-report
cargo llvm-cov --workspace --no-report -- --ignored
cargo llvm-cov report --summary-only
The number is static (updated per release, not per CI run) because
the GPU-ignored tests require actual NVIDIA hardware and can't run on
standard GitHub Actions runners. See
docs/testing-strategy.md for the full
testing model (host tests, GPU integration tests, ptxas_verify
structural checks, and the matmul_tc_bench performance harness).
How it works
The #[gpu_kernel] macro:
- Parses your Rust function body into a kernel IR.
- Lowers expressions to PTX instruction sequences.
- Generates a `build_ptx()` function that constructs the IR at runtime (first call only, cached via `OnceLock`).
- Emits PTX text through `PtxWriter`.
- Wraps everything in a typed `launch()` function that handles device memory and kernel dispatch.
The generated PTX is validated against nvcc output and passes
ptxas --verify on every tested kernel.
IR API (advanced)
You can also build kernels directly via the Layer 1 IR API for maximum control:
use kaio_core::emit::{Emit, PtxWriter};
use kaio_core::ir::*;
use kaio_core::types::PtxType;
let mut alloc = RegisterAllocator::new();
let mut kernel = PtxKernel::new("my_kernel");
kernel.add_param(PtxParam::pointer("data", PtxType::F32));
// ... build instructions ...
let mut module = PtxModule::new("sm_89");
module.add_kernel(kernel);
let mut w = PtxWriter::new();
module.emit(&mut w).unwrap();
println!("{}", w.finish()); // valid PTX assembly
See kaio-runtime/tests/vector_add_e2e.rs for a complete end-to-end example.
Roadmap
- Phase 1 — PTX codegen + runtime (IR → PTX → GPU execution)
- Phase 2 — `#[gpu_kernel]` proc macro (arithmetic, control flow, memory access, math builtins)
- Phase 3 — Loops, shared memory, reductions, softmax
- Phase 4 — Tiled matmul (31% of cuBLAS), `kaio-ops` crate, 2D blocks, FMA, PTX inspection tools
- Phase 5 — Fused attention + FlashAttention + auto-tuning, crates.io v0.1.0
- Phase 6 — Tensor cores (`mma.sync` fp16/bf16), async copies (cp.async), bank-conflict padding. Initial v0.2.0 publish shipped at 82.3% sync / 92.5% async of cuBLAS sgemm; subsequent warm-steady 10-run worst-case is 107% / 115% (see Performance). Three standalone showcase examples. crates.io v0.2.0.
- Phase 7 — Quantized kernels (INT8/INT4, fused QKV projection), candle integration (`kaio-candle` bridge — 8 forward ops, 2 backward ops, event-based stream sync).
- Phase 8 — PyO3 bindings (Python access to `kaio-ops`).
See CHANGELOG.md for per-release detail and docs/phases.md for deeper phase plans.
Common pitfalls
- Always bounds-check array writes. `if idx < n` before every global memory access — out-of-bounds GPU writes corrupt memory silently.
- Shared memory must fit within block limits. Default is 48 KB per block. `shared_mem![f32; 12288]` = 48 KB = the limit.
- Thread indexing errors are the most common bug. Double-check your row / col math, especially with 2D blocks (see the sketch after this list). Off-by-one in a kernel doesn't panic — it writes to the wrong address.
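A row/col indexing sketch to pair with the first and third pitfalls — a flat 1D launch over a 2D buffer, with one guard covering the partially-filled last block (illustrative; assumes integer division in the DSL):
use kaio::prelude::*;
// Row/col math from a flat index, with a single bounds check covering the
// tail block. Illustrative sketch, not a shipped kernel.
#[gpu_kernel(block_size = 256)]
fn scale_matrix(input: *const [f32], output: *mut [f32], rows: u32, cols: u32) {
    let idx = thread_idx_x() + block_idx_x() * block_dim_x();
    if idx < rows * cols {
        let row = idx / cols;
        let col = idx - row * cols;
        output[row * cols + col] = input[row * cols + col] * 2.0f32;
    }
}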
Feedback
If something is confusing, awkward, or broken — open an issue. Even small friction matters. This project is actively developed and feedback directly shapes what gets built next.
License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.