Stars
📚LeetCUDA: modern CUDA learning notes with PyTorch for beginners🐑; 200+ CUDA kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
DeepEP: an efficient expert-parallel communication library
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling (a minimal sketch of per-block scaling follows this list)
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized attention that achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
This package contains the original 2012 AlexNet code.
[ICML2025] SpargeAttention: a training-free sparse attention that accelerates inference for any model.
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques in sparse architecture, speculative sampling and qua…
A lightweight design for computation-communication overlap.
NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer
Benchmark code for the "Online normalizer calculation for softmax" paper (see the one-pass sketch after this list)
Batch computation of the linear assignment problem on GPU.
Source code for the CPU-Free model, a fully autonomous execution model for multi-GPU applications that removes all CPU involvement beyond the initial kernel launch.
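Two of the entries above name techniques concrete enough to sketch. First, DeepGEMM's "fine-grained scaling": rather than one scale factor per tensor, each small block of values gets its own scale so that every block fits the FP8 E4M3 range. Below is a hedged NumPy sketch of the per-block quantize/dequantize round trip; the block size, function names, and the use of 448 as the E4M3 max are illustrative assumptions, not DeepGEMM's actual API or tiling.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
BLOCK = 128       # per-block granularity (assumed for illustration)

def quantize_blockwise(x: np.ndarray):
    """Emulate fine-grained FP8 scaling: one scale per BLOCK-sized chunk,
    chosen so the chunk's max magnitude maps onto the E4M3 range."""
    pad = (-len(x)) % BLOCK
    blocks = np.pad(x, (0, pad)).reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scales = np.where(amax > 0, amax / E4M3_MAX, 1.0)
    q = np.clip(blocks / scales, -E4M3_MAX, E4M3_MAX)  # would be cast to FP8 here
    return q, scales, len(x)

def dequantize_blockwise(q, scales, n):
    return (q * scales).reshape(-1)[:n]

x = np.random.randn(300).astype(np.float32) * 10.0
q, scales, n = quantize_blockwise(x)
x_hat = dequantize_blockwise(q, scales, n)
# Near-zero error: only the scaling is emulated here, not actual FP8 rounding.
print(np.abs(x - x_hat).max())
```

Per-block scales bound the quantization error by each block's local dynamic range rather than the whole tensor's, so an outlier in one block does not wash out precision everywhere else.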
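Second, the online softmax benchmark: the "Online normalizer calculation for softmax" paper fuses softmax's two reduction passes (max, then sum of exponentials) into one by rescaling the running normalizer whenever a new maximum appears; the same recurrence is what lets FlashAttention process attention rows tile by tile. A minimal single-pass sketch in plain Python (the function name is mine, not from the benchmark repo):

```python
import math

def online_softmax(xs):
    """Single pass: maintain the running max m and the running
    normalizer d = sum(exp(x - m)) over the elements seen so far."""
    m = float("-inf")
    d = 0.0
    for x in xs:
        m_new = max(m, x)
        # Rescale the old normalizer to the new max, then add this term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]

# Agrees with the naive two-pass softmax:
print(online_softmax([1.0, 2.0, 3.0]))  # ~[0.0900, 0.2447, 0.6652]
```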