Stars
DeepEP: an efficient expert-parallel communication library
📚 LeetCUDA: modern CUDA learning notes with PyTorch for beginners, covering 200+ CUDA kernels, Tensor Cores, HGEMM, and FA-2 MMA.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
FlashInfer: Kernel Library for LLM Serving
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized attention that achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
How to optimize some common algorithms in CUDA.
A series of GPU optimization topics that introduces how to optimize CUDA kernels in detail, covering several basic kernel optimizations, including elementwise, reduce, s… (a minimal reduction-kernel sketch follows this list).
Flash Attention in ~100 lines of CUDA (forward pass only)
Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)
Code from the "CUDA Crash Course" YouTube series by CoffeeBeforeArch
[ICML2025] SpargeAttention: a training-free sparse attention method that accelerates inference for any model.
Causal depthwise conv1d in CUDA, with a PyTorch interface
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores, via the WMMA API and MMA PTX instructions (see the WMMA sketch after this list).
Flash Attention tutorial written in Python, Triton, CUDA, and CUTLASS.
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
An implementation of the transformer architecture as NVIDIA CUDA kernels.
A lightweight design for computation-communication overlap.
NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer
⚡️ Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak performance.
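To make the "basic kernel optimizations" entry above concrete, here is a minimal sketch of the classic shared-memory sum reduction that such tutorials typically start from. The kernel name, block size, and launch shape are illustrative assumptions, not taken from any of the listed repositories.

```cuda
#include <cuda_runtime.h>

#define BLOCK 256  // assumed threads per block (illustrative)

// Block-wide sum reduction: each block writes one partial sum to `out`.
__global__ void reduce_sum(const float* in, float* out, int n) {
    __shared__ float sdata[BLOCK];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * BLOCK * 2 + tid;

    // First add during the global load: each thread sums two elements.
    float v = 0.0f;
    if (i < n)         v += in[i];
    if (i + BLOCK < n) v += in[i + BLOCK];
    sdata[tid] = v;
    __syncthreads();

    // Tree reduction in shared memory with sequential addressing,
    // which avoids warp divergence and shared-memory bank conflicts.
    for (unsigned int s = BLOCK / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // One partial sum per block; a second pass (or atomicAdd) finishes the sum.
    if (tid == 0) out[blockIdx.x] = sdata[0];
}
```

Launched as `reduce_sum<<<gridSize, BLOCK>>>(d_in, d_partial, n)` with `gridSize = ceil(n / (2.0 * BLOCK))`, this covers the "reduce" case; the elementwise case is the same launch pattern with a per-thread map instead of the shared-memory tree.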
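For the WMMA-based HGEMM entries, the sketch below shows one warp computing a single 16x16 output tile with Tensor Cores. It assumes row-major A (MxK), column-major B (KxN), row-major FP32 C (MxN), and M, N, K all multiples of 16; the kernel name and thread mapping are hypothetical and meant only to illustrate the API, not any repository's tuned implementation.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_hgemm_naive(const half* A, const half* B, float* C,
                                 int M, int N, int K) {
    // One warp owns one 16x16 tile of C.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;  // tile row
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;               // tile col
    if (warpM * 16 >= M || warpN * 16 >= N) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
    wmma::fill_fragment(acc_frag, 0.0f);

    // March along K in steps of 16, issuing one Tensor Core MMA per step.
    for (int k = 0; k < K; k += 16) {
        const half* a_tile = A + warpM * 16 * K + k;  // element (warpM*16, k), ld = K
        const half* b_tile = B + warpN * 16 * K + k;  // element (k, warpN*16), ld = K
        wmma::load_matrix_sync(a_frag, a_tile, K);
        wmma::load_matrix_sync(b_frag, b_tile, K);
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    }

    // Write back the accumulated FP32 tile.
    float* c_tile = C + warpM * 16 * N + warpN * 16;
    wmma::store_matrix_sync(c_tile, acc_frag, N, wmma::mem_row_major);
}
```

The optimized repositories above go well beyond this: shared-memory staging, double buffering, swizzled layouts, and raw MMA PTX or CuTe instead of the WMMA API.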