Sweet GPU compute kernels in CUDA, wrapped in Python via CuPy.
`uv sync && uv pip install -e . && python examples/mandelbrot.py`

You need CUDA Toolkit 11.8 (newer versions may not work) and an NVIDIA GPU. The HMMA kernels require sm_75 or later, i.e. Turing or anything newer. CuPy's bundled nvrtc compiles each kernel at first use, picking up mma.h and friends from CUDA_PATH.

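
Since kernels are compiled at first use, a missing or wrong CUDA_PATH only shows up when a kernel is first called. A small pre-flight check along these lines can catch it earlier. This is a sketch, not part of the package: the helper name `find_cuda_header` is hypothetical, and it assumes headers sit in `<CUDA_PATH>/include`, as in a standard Toolkit install.

```python
import os
from pathlib import Path
from typing import Optional


def find_cuda_header(name: str = "mma.h", cuda_path: Optional[str] = None) -> Optional[Path]:
    """Return the path to `name` under <CUDA_PATH>/include, or None if it is missing."""
    # Assumption: headers live in <CUDA_PATH>/include (standard Toolkit layout).
    root = cuda_path or os.environ.get("CUDA_PATH")
    if not root:
        return None
    candidate = Path(root) / "include" / name
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    header = find_cuda_header()
    if header is None:
        print("mma.h not found; check that CUDA_PATH points at a CUDA Toolkit install")
    else:
        print(f"found {header}")
```

If the check fails, the usual fix is to export CUDA_PATH so that it points at the Toolkit root before running the examples.
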
| Kernel | What it does |
|---|---|
| reduction_v2 | sum-reduce a 1D float array |
| matrix_transpose | 2D fp32 transpose, 32×33 padded smem tile |
| softmax_vec4 | row-wise softmax with float4 loads |
| flash_attention_hmma | FA-2 forward, fp16 in / fp32 accum, wmma 16×16×16 |
| gemm_tiled | dense fp32 GEMM, 128×128 macro / 8×8 register micro, manual smem double-buffer |
| gemm_int8 | W8A8 INT8 GEMM, per-row act scale × per-channel weight scale |
| gemm_int4 | W4A16 weight-only (AWQ/GPTQ-shaped), G=64 asymmetric, dequant-fuse-matmul |
| fused_ffn_tail | RMSNorm → ×γ → +bias → GELU/SiLU → +residual, one kernel |
| rope | rotary positional embedding, Llama half-rotation (pair separation D/2), in-place safe |
| rmsnorm | RMSNorm forward + backward, per-row fused reductions, dgamma in fp32 accum |

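
The "half-rotation, pair separation D/2" layout used by `rope` is easy to confuse with the interleaved (adjacent-pair) layout, so here is a pure-Python CPU reference of the convention as commonly used for Llama. It is a sketch for checking outputs, not the kernel's code: the function name is hypothetical, and `base = 10000.0` is an assumed default that the table does not state.

```python
import math
from typing import List


def rope_half_rotation(x: List[float], pos: int, base: float = 10000.0) -> None:
    """Apply RoPE in place to one head vector of even length D at position `pos`."""
    d = len(x)
    if d % 2:
        raise ValueError("head dimension must be even")
    half = d // 2
    for i in range(half):
        # Element i is paired with element i + D/2, not with its neighbour i + 1.
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        # Read both members of the pair before writing either one;
        # this ordering is what makes the in-place update safe.
        lo, hi = x[i], x[i + half]
        x[i] = lo * c - hi * s
        x[i + half] = hi * c + lo * s


v = [1.0, 2.0, 3.0, 4.0]
rope_half_rotation(v, pos=0)  # position 0 rotates by angle 0, so v is unchanged
```

Each pair undergoes a plain 2D rotation, so the vector norm is preserved at every position, which makes a convenient sanity check against GPU output.
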
MIT