poyea/lollipop

lollipop

Sweet GPU compute kernels in CUDA, wrapped in Python via CuPy.

uv sync && uv pip install -e . && python examples/mandelbrot.py

You need CUDA Toolkit 11.8 (newer versions may not work) and an NVIDIA GPU. The HMMA kernels require sm_75 or higher, i.e. Turing or anything newer. CuPy's bundled nvrtc compiles each kernel at first use, picking up mma.h and related headers from CUDA_PATH.
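A quick preflight check for the two requirements above can save a confusing nvrtc error at first use. The sketch below is not part of lollipop: the helper names `find_wmma_header` and `supports_hmma` are hypothetical, and it assumes the toolkit keeps mma.h under `include/` inside CUDA_PATH, which is the usual layout.

```python
import os
from pathlib import Path


def find_wmma_header(env=None):
    """Return the path to mma.h under CUDA_PATH, or None if it cannot be found.

    Assumption: the header lives at <CUDA_PATH>/include/mma.h.
    """
    env = os.environ if env is None else env
    cuda_path = env.get("CUDA_PATH")
    if not cuda_path:
        return None
    header = Path(cuda_path) / "include" / "mma.h"
    return header if header.is_file() else None


def supports_hmma(major, minor):
    """True when the compute capability is sm_75 (Turing) or newer."""
    return (major, minor) >= (7, 5)


if __name__ == "__main__":
    header = find_wmma_header()
    print("mma.h:", header if header else "not found, check CUDA_PATH")
    print("sm_75 ok for HMMA:", supports_hmma(7, 5))
```

With CuPy installed, the capability can be read from `cupy.cuda.Device().compute_capability`, which returns a string such as "75" that you would split into major and minor before calling `supports_hmma`.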

Example Kernels

| Kernel | What it does |
| --- | --- |
| reduction_v2 | sum-reduce a 1D float array |
| matrix_transpose | 2D fp32 transpose, 32×33 padded smem tile |
| softmax_vec4 | row-wise softmax with float4 loads |
| flash_attention_hmma | FA-2 forward, fp16 in / fp32 accum, wmma 16×16×16 |
| gemm_tiled | dense fp32 GEMM, 128×128 macro / 8×8 register micro, manual smem double-buffer |
| gemm_int8 | W8A8 INT8 GEMM, per-row act scale × per-channel weight scale |
| gemm_int4 | W4A16 weight-only (AWQ/GPTQ-shaped), G=64 asymmetric, dequant-fuse-matmul |
| fused_ffn_tail | RMSNorm → ×γ → +bias → GELU/SiLU → +residual, one kernel |
| rope | rotary positional embedding, Llama half-rotation (pair separation D/2), in-place safe |
| rmsnorm | RMSNorm forward + backward, per-row fused reductions, dgamma in fp32 accum |
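To make "half-rotation (pair separation D/2)" concrete, here is a minimal CPU reference in plain Python that could be used to check the rope kernel's output on a single head vector. It is a sketch, not the kernel's code: the function name `rope_half_rotation` is hypothetical, and the frequency base of 10000 is an assumption taken from the common Llama convention, which this README does not state.

```python
import math


def rope_half_rotation(x, pos, base=10000.0):
    """Apply rotary embedding to one head vector x (length D, D even).

    Element i is paired with element i + D/2 (half-rotation layout),
    not with its neighbour i + 1. base=10000.0 is an assumed default.
    """
    d = len(x)
    if d % 2:
        raise ValueError("head dimension must be even")
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        # Rotation angle for pair i at token position pos.
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        # Read both members of the pair before writing either one;
        # this ordering is what lets the same update run in place.
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


if __name__ == "__main__":
    print(rope_half_rotation([1.0, 2.0, 3.0, 4.0], pos=0))
```

Each pair is rotated by a plain 2D rotation, so the vector's norm is unchanged and position 0 leaves the input untouched; both make convenient sanity checks against GPU output.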

License

MIT
