Stars
My learning notes for ML SYS.
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
DeepEP: an efficient expert-parallel communication library
Ring attention implementation with flash attention
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
FlagGems is an operator library for large language models implemented in the Triton Language.
A series of GPU optimization topics explaining in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including: elementwise, reduce, s…
Development repository for the Triton language and compiler