Stars
Instant neural graphics primitives: lightning-fast NeRF and more
DeepEP: an efficient expert-parallel communication library
📚 LeetCUDA: modern CUDA learning notes with PyTorch for beginners 🐑; 200+ CUDA kernels, Tensor Cores, HGEMM, FA-2 MMA 🎉
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
FlashInfer: Kernel Library for LLM Serving
CUDA accelerated rasterization of gaussian splatting
[ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight] Quantized attention that achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models
How to optimize algorithms in CUDA
Sample code for my CUDA programming book
Learn CUDA Programming, published by Packt
Flash Attention in ~100 lines of CUDA (forward pass only)
A simple high-performance CUDA GEMM implementation
A shifted-window-based transformer for 3D sparse tasks
The CUDA version of the RWKV language model (https://github.com/BlinkDL/RWKV-LM)
Matrix multiply-accumulate with CUDA and WMMA (Tensor Cores); see the sketch after this list
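As a minimal sketch of the technique the last entry names (not code from that repository), the kernel below has one warp compute a single 16x16 tile of C = A x B with the WMMA API on Tensor Cores. The kernel name and the row-/column-major layout choices are assumptions for illustration.

```cuda
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// One warp computes one 16x16 output tile: C (float) = A (half) * B (half).
// A is 16x16 row-major, B is 16x16 col-major, both with leading dimension 16.
__global__ void wmma_tile_16x16(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);           // start from a zero accumulator
    wmma::load_matrix_sync(a_frag, a, 16);       // cooperative, warp-wide loads
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // Tensor Core multiply-accumulate
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

Launched as `wmma_tile_16x16<<<1, 32>>>(dA, dB, dC)` (exactly one warp) and compiled for compute capability 7.0 or higher, e.g. `nvcc -arch=sm_70`.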