9 releases (4 breaking)
Uses new Rust 2024
| 0.4.1 | Apr 24, 2026 |
|---|---|
| 0.4.0 | Apr 17, 2026 |
| 0.3.0 | Apr 14, 2026 |
| 0.2.2 | Apr 13, 2026 |
| 0.0.4 | Apr 11, 2026 |
#284 in Programming languages
Used in 3 crates
320KB
5.5K
SLoC
KAIO
Rust-native GPU kernel authoring framework.
Write GPU compute kernels in Rust, compile to PTX, run on NVIDIA GPUs. No CUDA C++, no Python, no CUDA toolkit required. Windows + Linux.
Quick Start
use kaio::prelude::*;
#[gpu_kernel(block_size = 256)]
fn saxpy(x: *const [f32], y: *mut [f32], alpha: f32, n: u32) {
let idx = thread_idx_x() + block_idx_x() * block_dim_x();
if idx < n {
y[idx] = alpha * x[idx] + y[idx];
}
}
fn main() -> Result<()> {
let device = KaioDevice::new(0)?;
let n = 1024u32;
let x = device.alloc_from(&vec![1.0f32; n as usize])?;
let mut y = device.alloc_from(&vec![2.0f32; n as usize])?;
saxpy::launch(&device, &x, &mut y, 2.5f32, n)?;
let result = y.to_host(&device)?;
println!("result: {:?}", &result[..8]);
// prints: result: [4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5]
Ok(())
}
Requires an NVIDIA GPU with driver installed. No CUDA toolkit needed.
When to Use KAIO
KAIO is not a replacement for ML frameworks like Candle or Burn. It is the layer you use when you need more control than they provide.
- Write custom GPU kernels your framework doesn't support
- Control GPU memory explicitly — deterministic VRAM, buffer reuse
- Ship GPU binaries without Python or Triton in your dependency chain
- Run on Windows (Triton is Linux-only)
- Prototype GPU code in Rust without learning CUDA C++
Features
| Feature | Syntax | Status |
|---|---|---|
| Arithmetic | +, -, *, /, %, +=, -=, *=, /= |
Supported |
| Comparisons | <, <=, >, >=, ==, != |
Supported |
| Control flow | if/else, for, while |
Supported |
| Array access | a[idx] (global memory) |
Supported |
| Shared memory | shared_mem![f32; 256] |
Supported |
| Synchronization | bar_sync() |
Supported |
| Warp shuffle | shfl_sync_down/up/bfly() |
Supported |
| Reductions | block_reduce_sum(), block_reduce_max() |
Supported |
| Type casts | x as f32 |
Supported |
| Math builtins | sqrt, exp, log, tanh, abs, min, max |
Supported |
| FMA | fma(a, b, c) |
Supported |
| 2D blocks | block_size = (16, 16), thread_idx_y() |
Supported |
| Tiled matmul | kaio_ops::matmul() (31% of cuBLAS) |
Supported |
| Attention | kaio_ops::attention(), attention_causal() |
Supported |
| FlashAttention | kaio_ops::attention_flash() — O(d_k) memory |
Supported |
| Auto-tuner | kaio_ops::tune_matmul(), matmul_auto() |
Supported |
Architecture
| Crate | Description |
|---|---|
kaio |
Umbrella crate — re-exports everything via prelude |
kaio-macros |
#[gpu_kernel] proc macro |
kaio-core |
PTX IR, instruction emitters, zero external dependencies |
kaio-runtime |
CUDA driver wrapper via cudarc |
kaio-ops |
Pre-built GPU operations (matmul, attention, auto-tuner) |
Limitations
- NVIDIA only (SM 7.0+) — no AMD, no Intel
- Not cuBLAS-level performance (matmul reaches 31%)
- DSL subset of Rust — no closures, traits, generics, or
&&/|| - FlashAttention requires d_k <= 256
- No autograd, no multi-GPU
- API may change
Status
Phase 5 complete — attention (standard + FlashAttention), causal masking, auto-tuner, Windows CI. v0.1.0.
See the repository for full documentation, runnable examples, copy-paste patterns, and development logs.
License
Licensed under either of Apache-2.0 or MIT at your option.
Dependencies
~10MB
~175K SLoC