5 releases (breaking)
| new 0.33.0 | May 14, 2026 |
|---|---|
| 0.32.0 | May 5, 2026 |
| 0.31.2 | Apr 19, 2026 |
| 0.30.0 | Apr 15, 2026 |
| 0.29.0 | Apr 7, 2026 |
#21 in #cuda-ptx
Used in 11 crates
(9 directly)
4MB
76K
SLoC
Contains (ELF exe/lib, 5KB) elf.o
trueno-gpu
Pure Rust PTX generation for NVIDIA CUDA - no LLVM, no nvcc, no external dependencies.
Philosophy
Own the Stack - Build everything from first principles for complete control, auditability, and reproducibility.
Features
- Pure Rust PTX Generation: Generate PTX assembly directly from Rust code
- No External Dependencies: No LLVM, nvcc, or CUDA toolkit required for code generation
- Builder Pattern API: Ergonomic API for constructing PTX modules and kernels
- Hand-Optimized Kernels: Pre-built kernels for common ML operations
Quick Start
use trueno_gpu::ptx::{PtxModule, PtxKernel, PtxType};
// Build a vector addition kernel
let module = PtxModule::new()
.version(8, 0)
.target("sm_70")
.address_size(64);
let ptx_source = module.emit();
assert!(ptx_source.contains(".version 8.0"));
Available Kernels
| Kernel | Description |
|---|---|
| GEMM | Matrix multiplication (naive, tiled, tensor core) |
| GEMV | Matrix-vector multiply with warp shuffle reduction |
| Softmax | Numerically stable softmax with warp shuffle |
| LayerNorm | Fused layer normalization |
| Attention | FlashAttention-style tiled attention |
| BiasActivation | Fused bias + activation epilogue (None/ReLU/GELU) |
| Quantize | Q4_K/Q5_K/Q6_K dequantization fused with matmul |
Usage
use trueno_gpu::kernels::{GemmKernel, Kernel};
// Create a tiled GEMM kernel
let kernel = GemmKernel::tiled(1024, 1024, 1024);
let ptx = kernel.emit_ptx();
// The PTX can be loaded by CUDA driver API
println!("{}", ptx);
Examples
# PTX quickstart - basic vector addition
cargo run -p trueno-gpu --example ptx_quickstart
# GEMM kernel variants (naive, tiled, tensor core)
cargo run -p trueno-gpu --example gemm_kernel
# Bias + Activation epilogue kernel (ReLU, GELU)
cargo run -p trueno-gpu --example bias_activation
# Quantized GEMM (Q5_K, Q6_K formats)
cargo run -p trueno-gpu --example q5k_q6k_gemm
# FlashAttention (requires CUDA)
cargo run -p trueno-gpu --example flash_attention_cuda --features cuda
# Register allocation visualization
cargo run -p trueno-gpu --example register_allocation
Modules
ptx- PTX code generation (builder pattern)kernels- Hand-optimized GPU kernelsdriver- CUDA driver API (minimal FFI, optional)memory- GPU memory managementbackend- Multi-backend abstraction
Requirements
- Rust 1.70+
- For GPU execution: NVIDIA CUDA driver (optional, only needed to run generated PTX)
License
MIT License - see LICENSE for details.
Part of Trueno
This crate is part of the Trueno high-performance compute library.
Part of the Aprender monorepo — 70 workspace crates.
Dependencies
~3–54MB
~867K SLoC