See what the GitHub community is most excited about this month.

HazyResearch / ThunderKittens
Tile primitives for speedy kernels
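ThunderKittens expresses kernels in terms of small register- and shared-memory tiles. As a rough illustration of the underlying idea (plain CUDA, not the library's actual API; the 16x16 tile size is an arbitrary choice), here is the shared-memory tiled matmul that the "from scratch" tutorial further down this list also builds up to:

```cuda
// Shared-memory tiled SGEMM sketch: C = A * B for row-major N x N
// matrices. Launch with block = dim3(TILE, TILE) and
// grid = dim3(ceil(N/TILE), ceil(N/TILE)).
#define TILE 16

__global__ void tiled_sgemm(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];   // tile of A staged in shared memory
    __shared__ float Bs[TILE][TILE];   // tile of B staged in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < N; k0 += TILE) {
        // Each thread stages one element of each input tile,
        // zero-padding at the matrix edges.
        As[threadIdx.y][threadIdx.x] =
            (row < N && k0 + threadIdx.x < N) ? A[row * N + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (k0 + threadIdx.y < N && col < N) ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        // Multiply the staged tiles; every staged operand is reused TILE times.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < N && col < N)
        C[row * N + col] = acc;
}
```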
LLM training in simple, raw C/CUDA
CUDA-accelerated rasterization of Gaussian splatting
cuVS - a library for vector search and clustering on the GPU
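cuVS provides ANN indexes and clustering primitives; the baseline all of those accelerate past is brute-force distance computation. A minimal sketch of that baseline, assuming row-major float vectors (illustrative only, not cuVS's API):

```cuda
// Brute-force vector search building block: squared L2 distance from one
// query to every database vector, one thread per vector. A real library
// tiles this and fuses top-k selection instead of writing all distances.
__global__ void l2_distances(const float* __restrict__ db,     // [n, dim], row-major
                             const float* __restrict__ query,  // [dim]
                             float* __restrict__ dist,         // [n]
                             int n, int dim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int d = 0; d < dim; ++d) {
        float diff = db[i * dim + d] - query[d];
        acc += diff * diff;
    }
    dist[i] = acc;
}
```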
FlashInfer: Kernel Library for LLM Serving
Causal depthwise conv1d in CUDA, with a PyTorch interface
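This one is small enough to sketch in full: in a causal depthwise conv1d, each channel has its own short filter, and output position t may read only positions <= t. A plain-CUDA sketch, with layout and names assumed rather than taken from the repo:

```cuda
// Causal depthwise conv1d: y[c][t] = sum_k w[c][k] * x[c][t - k],
// treating x as zero for t - k < 0 (no peeking at the future).
// x, y: [channels, seqlen]; w: [channels, ksize]. One thread per (c, t);
// launch with grid = dim3(ceil(seqlen/blockDim.x), channels).
__global__ void causal_dwconv1d(const float* __restrict__ x,
                                const float* __restrict__ w,
                                float* __restrict__ y,
                                int seqlen, int ksize) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int c = blockIdx.y;
    if (t >= seqlen) return;

    float acc = 0.0f;
    for (int k = 0; k < ksize; ++k) {
        int src = t - k;                   // causal: current and past only
        if (src >= 0)
            acc += w[c * ksize + k] * x[c * seqlen + src];
    }
    y[c * seqlen + t] = acc;
}
```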
DeepEP: an efficient expert-parallel communication library
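The hard part of DeepEP is the NVLink/RDMA all-to-all itself, which does not fit a short sketch, but the step before it does: counting how many tokens route to each expert so the send buffers can be sized. A minimal sketch with hypothetical names, not DeepEP's interface:

```cuda
// Expert-parallel dispatch, step one: histogram of tokens per expert,
// used to size per-rank send buffers before the all-to-all exchange.
// `counts` must be zeroed before launch.
__global__ void count_tokens_per_expert(const int* __restrict__ expert_id, // [tokens]
                                        int* __restrict__ counts,          // [num_experts]
                                        int tokens) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < tokens)
        atomicAdd(&counts[expert_id[i]], 1);
}
```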
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
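Fine-grained scaling means each small block of the operands carries its own scale, so FP8 e4m3's narrow range (max magnitude 448) is used fully per block rather than once per tensor. A sketch of that quantize step with the FP8 cast left out and the payload kept in plain floats (block size and layout are assumptions):

```cuda
// Fine-grained (per-block) scaling sketch: every BLOCK-element chunk
// gets its own scale. Launch:
// quantize_blockwise<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(...).
#define BLOCK 128
#define FP8_E4M3_MAX 448.0f

__global__ void quantize_blockwise(const float* __restrict__ in,
                                   float* __restrict__ q,       // scaled payload
                                   float* __restrict__ scales,  // one per chunk
                                   int n) {
    int base = blockIdx.x * BLOCK;
    __shared__ float scale;

    // Naive max-abs scan by thread 0; a real kernel reduces in parallel.
    if (threadIdx.x == 0) {
        float amax = 1e-12f;
        for (int i = 0; i < BLOCK && base + i < n; ++i)
            amax = fmaxf(amax, fabsf(in[base + i]));
        scale = amax / FP8_E4M3_MAX;
        scales[blockIdx.x] = scale;
    }
    __syncthreads();

    int i = base + threadIdx.x;
    if (i < n)
        q[i] = in[i] / scale;  // now bounded by +/- FP8_E4M3_MAX; cast to fp8 here
}
// A consuming GEMM accumulates in fp32 and multiplies each partial tile
// product by scaleA[chunkA] * scaleB[chunkB] to undo the quantization.
```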
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized attention that achieves a 2-5x speedup over FlashAttention without losing end-to-end accuracy on language, image, and video models.
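The building block behind this kind of quantized attention is symmetric INT8 quantization of Q and K with per-row (or per-block) scales; the INT8 Q*K^T product is then rescaled before softmax. A hedged sketch of just that quantize step, not SageAttention's actual kernels:

```cuda
#include <cstdint>

// Symmetric per-row INT8 quantization: scale = rowmax(|x|) / 127,
// xq = round(x / scale). Launch with one thread block per row,
// e.g. quantize_rows_int8<<<rows, 256>>>(...).
__global__ void quantize_rows_int8(const float* __restrict__ x,  // [rows, dim]
                                   int8_t* __restrict__ xq,      // [rows, dim]
                                   float* __restrict__ scales,   // [rows]
                                   int dim) {
    int r = blockIdx.x;
    __shared__ float scale;

    // Naive max-abs scan by thread 0; a real kernel uses a parallel reduction.
    if (threadIdx.x == 0) {
        float amax = 1e-12f;
        for (int d = 0; d < dim; ++d)
            amax = fmaxf(amax, fabsf(x[r * dim + d]));
        scale = amax / 127.0f;
        scales[r] = scale;
    }
    __syncthreads();

    for (int d = threadIdx.x; d < dim; d += blockDim.x)
        xq[r * dim + d] = (int8_t)__float2int_rn(x[r * dim + d] / scale);
}
// The INT8 Q*K^T then runs on integer pipelines (dp4a / IMMA), and each
// logit is rescaled by scaleQ[i] * scaleK[j] in fp32 before softmax.
```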
Fast CUDA matrix multiplication from scratch
Sample code for my CUDA programming book
CUDA Library Samples
GPU accelerated decision optimization
CUDA Kernel Benchmarking Library
[ICML2025] SpargeAttention: training-free sparse attention that accelerates inference for any model.
RCCL Performance Benchmark Tests