5 releases (breaking)

new 0.33.0	May 14, 2026
0.32.0	May 5, 2026
0.31.2	Apr 19, 2026
0.30.0	Apr 15, 2026
0.29.0	Apr 7, 2026

trueno-gpu

Pure Rust PTX generation for NVIDIA CUDA - no LLVM, no nvcc, no external dependencies.

Philosophy

Own the Stack - Build everything from first principles for complete control, auditability, and reproducibility.

Features

Pure Rust PTX Generation: Generate PTX assembly directly from Rust code
No External Dependencies: No LLVM, nvcc, or CUDA toolkit required for code generation
Builder Pattern API: Ergonomic API for constructing PTX modules and kernels
Hand-Optimized Kernels: Pre-built kernels for common ML operations

Quick Start

use trueno_gpu::ptx::PtxModule;

// Configure a PTX module header: PTX ISA version, target architecture, address size
let module = PtxModule::new()
    .version(8, 0)
    .target("sm_70")
    .address_size(64);

let ptx_source = module.emit();
assert!(ptx_source.contains(".version 8.0"));
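To make the builder pattern above concrete, here is a minimal, self-contained sketch of how a chained builder can emit the three PTX header directives. `MiniModule` and its fields are hypothetical names invented for illustration; this is not trueno-gpu's implementation, and the crate's real emitter may format its output differently.

```rust
/// Hypothetical stand-in for a PTX module builder (NOT trueno-gpu's actual type).
#[derive(Debug, Clone)]
pub struct MiniModule {
    version: (u32, u32),
    target: String,
    address_size: u32,
}

impl Default for MiniModule {
    fn default() -> Self {
        // Assumed defaults, chosen only to match the Quick Start values.
        Self {
            version: (8, 0),
            target: "sm_70".to_string(),
            address_size: 64,
        }
    }
}

impl MiniModule {
    pub fn new() -> Self {
        Self::default()
    }

    /// Each setter takes `self` by value and returns it, which enables chaining.
    pub fn version(mut self, major: u32, minor: u32) -> Self {
        self.version = (major, minor);
        self
    }

    pub fn target(mut self, target: &str) -> Self {
        self.target = target.to_string();
        self
    }

    pub fn address_size(mut self, bits: u32) -> Self {
        self.address_size = bits;
        self
    }

    /// Emit the `.version`, `.target`, and `.address_size` directives
    /// that open a PTX module.
    pub fn emit(&self) -> String {
        format!(
            ".version {}.{}\n.target {}\n.address_size {}\n",
            self.version.0, self.version.1, self.target, self.address_size
        )
    }
}
```

Calling `MiniModule::new().version(8, 0).target("sm_70").address_size(64).emit()` yields a string that starts with `.version 8.0`, which mirrors the assertion in the Quick Start.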

Available Kernels

Kernel	Description
GEMM	Matrix multiplication (naive, tiled, tensor core)
GEMV	Matrix-vector multiply with warp shuffle reduction
Softmax	Numerically stable softmax with warp shuffle
LayerNorm	Fused layer normalization
Attention	FlashAttention-style tiled attention
BiasActivation	Fused bias + activation epilogue (None/ReLU/GELU)
Quantize	Q4_K/Q5_K/Q6_K dequantization fused with matmul
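As a reference point for the Softmax entry, the usual way to make softmax numerically stable is to subtract the row maximum before exponentiating. The sketch below is a plain CPU version of that idea, useful for checking GPU output. The function name `softmax_stable` is hypothetical, and whether the crate's kernel uses exactly this formulation is an assumption. On a GPU, the maximum and the sum would typically be reduced across a warp with shuffle instructions rather than with a sequential loop.

```rust
/// CPU reference for numerically stable softmax (illustrative sketch only;
/// not taken from trueno-gpu).
pub fn softmax_stable(input: &[f32]) -> Vec<f32> {
    if input.is_empty() {
        return Vec::new();
    }

    // Step 1: find the maximum. On a GPU this would be a warp-level max reduction.
    let max = input.iter().copied().fold(f32::NEG_INFINITY, f32::max);

    // Step 2: exponentiate the shifted values. Every exponent is <= 0,
    // so exp() cannot overflow even for very large inputs.
    let exps: Vec<f32> = input.iter().map(|&x| (x - max).exp()).collect();

    // Step 3: sum the exponentials. On a GPU this would be a warp-level sum reduction.
    let sum: f32 = exps.iter().sum();

    // Step 4: normalize so the outputs sum to 1.
    exps.iter().map(|&e| e / sum).collect()
}
```

Without the shift, an input such as `[1000.0, 1000.0]` would overflow `exp` in `f32` and produce NaN; with it, the result is two equal probabilities.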

Usage

use trueno_gpu::kernels::{GemmKernel, Kernel};

// Create a tiled GEMM kernel
let kernel = GemmKernel::tiled(1024, 1024, 1024);
let ptx = kernel.emit_ptx();

// The emitted PTX can be loaded through the CUDA driver API
println!("{}", ptx);
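To illustrate what "tiled" means for GEMM, the following CPU sketch computes C = A x B by iterating over square blocks, the same blocking idea a GPU kernel applies when it stages tiles in shared memory. The function `gemm_tiled`, its row-major layout, and its (m, n, k) argument order are assumptions made for this example; they are not confirmed to match `GemmKernel::tiled`.

```rust
/// CPU reference for blocked (tiled) matrix multiplication, row-major.
/// `a` is m x k, `b` is k x n, and the result is m x n.
/// Illustrative sketch only; not taken from trueno-gpu.
pub fn gemm_tiled(a: &[f32], b: &[f32], m: usize, n: usize, k: usize, tile: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k, "a must hold m * k elements");
    assert_eq!(b.len(), k * n, "b must hold k * n elements");
    assert!(tile > 0, "tile size must be positive");

    let mut c = vec![0.0f32; m * n];

    // Walk the output in tile x tile blocks, accumulating one k-tile at a time.
    for i0 in (0..m).step_by(tile) {
        for j0 in (0..n).step_by(tile) {
            for k0 in (0..k).step_by(tile) {
                // Clamp block edges so sizes need not be multiples of `tile`.
                let i_end = (i0 + tile).min(m);
                let j_end = (j0 + tile).min(n);
                let k_end = (k0 + tile).min(k);

                for i in i0..i_end {
                    for kk in k0..k_end {
                        let a_ik = a[i * k + kk];
                        for j in j0..j_end {
                            c[i * n + j] += a_ik * b[kk * n + j];
                        }
                    }
                }
            }
        }
    }
    c
}
```

A reference like this can be compared element by element against results copied back from the GPU, using a small tolerance, because floating-point accumulation order differs between the two.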

Examples

# PTX quickstart - basic vector addition
cargo run -p trueno-gpu --example ptx_quickstart

# GEMM kernel variants (naive, tiled, tensor core)
cargo run -p trueno-gpu --example gemm_kernel

# Bias + Activation epilogue kernel (ReLU, GELU)
cargo run -p trueno-gpu --example bias_activation

# Quantized GEMM (Q5_K, Q6_K formats)
cargo run -p trueno-gpu --example q5k_q6k_gemm

# FlashAttention (requires CUDA)
cargo run -p trueno-gpu --example flash_attention_cuda --features cuda

# Register allocation visualization
cargo run -p trueno-gpu --example register_allocation

Modules

ptx - PTX code generation (builder pattern)
kernels - Hand-optimized GPU kernels
driver - CUDA driver API (minimal FFI, optional)
memory - GPU memory management
backend - Multi-backend abstraction

Requirements

Rust 1.70+
For GPU execution: NVIDIA CUDA driver (optional, only needed to run generated PTX)

License

MIT License - see LICENSE for details.

Part of Trueno

This crate is part of the Trueno high-performance compute library.

Part of the Aprender monorepo — 70 workspace crates.
