Stars
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
DeepEP: an efficient expert-parallel communication library
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.
How to optimize some algorithms in CUDA.
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates inference for any model.
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
CUDA Matrix Multiplication Optimization
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.