lollipop

Sweet GPU compute kernels in CUDA, wrapped in Python via CuPy.

uv sync && uv pip install -e . && python examples/mandelbrot.py

You need CUDA Toolkit 11.8 (newer versions may not work) and an NVIDIA GPU. The HMMA kernels require sm_75 or newer, that is, Turing or any later architecture. CuPy's bundled nvrtc compiles each kernel at first use, picking up `mma.h` and related headers from `CUDA_PATH`.
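
To check the sm_75 requirement before running the HMMA kernels, something like the sketch below may help. The helper names (`parse_compute_capability`, `supports_hmma`, `detect_capability`) are hypothetical and not part of lollipop. The sketch assumes CuPy's `cupy.cuda.Device(0).compute_capability`, which reports the capability as a string such as `"75"`.

```python
def parse_compute_capability(cc: str) -> tuple:
    """Split a capability string such as "75" or "86" into (major, minor).

    Assumes the last character is the minor version and the rest is the major.
    """
    return int(cc[:-1]), int(cc[-1])


def supports_hmma(cc: str) -> bool:
    """True when the capability is sm_75 (Turing) or newer."""
    return parse_compute_capability(cc) >= (7, 5)


def detect_capability():
    """Return the capability string of device 0, or None without CuPy or a GPU."""
    try:
        import cupy  # imported lazily so the helpers above work without a GPU
        return cupy.cuda.Device(0).compute_capability
    except Exception:
        return None


if __name__ == "__main__":
    cc = detect_capability()
    if cc is None:
        print("No CUDA device detected (or CuPy is not installed).")
    else:
        print(f"sm_{cc}: HMMA kernels supported = {supports_hmma(cc)}")
```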

Example Kernels

| Kernel | What it does |
| --- | --- |
| `reduction_v2` | sum-reduce a 1D float array |
| `matrix_transpose` | 2D fp32 transpose, 32×33 padded smem tile |
| `softmax_vec4` | row-wise softmax with `float4` loads |
| `flash_attention_hmma` | FA-2 forward, fp16 in / fp32 accum, `wmma` 16×16×16 |
| `gemm_tiled` | dense fp32 GEMM, 128×128 macro / 8×8 register micro, manual smem double-buffer |
| `gemm_int8` | W8A8 INT8 GEMM, per-row act scale × per-channel weight scale |
| `gemm_int4` | W4A16 weight-only (AWQ/GPTQ-shaped), G=64 asymmetric, dequant-fuse-matmul |
| `fused_ffn_tail` | RMSNorm → ×γ → +bias → GELU/SiLU → +residual, one kernel |
| `rope` | rotary positional embedding, Llama half-rotation (pair separation D/2), in-place safe |
| `rmsnorm` | RMSNorm forward + backward, per-row fused reductions, dgamma in fp32 accum |
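
As an illustration of the half-rotation layout named in the `rope` row, here is a minimal pure-Python CPU reference. It is a sketch of the general technique and not the kernel's actual code: the function name `rope_half_rotation` and the frequency base of 10000 are assumptions. Each element `i` is paired with element `i + D/2`, and both values are read before either is written, which is what makes an in-place update safe.

```python
import math


def rope_half_rotation(x, pos, base=10000.0):
    """Apply rotary embedding in place to one head vector `x` at position `pos`.

    Pairs are (x[i], x[i + D/2]); `base` is an assumed default, not taken
    from the lollipop source.
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("head dimension must be even")
    half = d // 2
    for i in range(half):
        theta = pos * base ** (-2.0 * i / d)  # rotation angle for pair i
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]  # read both halves before writing either
        x[i] = a * c - b * s
        x[i + half] = a * s + b * c
    return x


if __name__ == "__main__":
    v = [1.0, 2.0, 3.0, 4.0]
    print(rope_half_rotation(v, pos=3))
```

Because every pair undergoes a plain 2D rotation, the vector norm is preserved, which makes a handy sanity check when comparing a GPU result against a reference like this.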

License

MIT
