Stars
Optimized FP16/BF16 x FP4 GPU kernels for AMD GPUs
FlashMLA: Efficient Multi-head Latent Attention Kernels
🚀🚀 Efficient implementations of Native Sparse Attention
An introductory guide to the digital circuits lab at the University of Science and Technology of China, created in 2022 by teaching assistant Ma Zirui (马子睿). The repository is meant to let subsequent teaching assistants keep iterating on it
CUDA Templates and Python DSLs for High-Performance Linear Algebra
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
🥢 Cook like Lao Xiang Ji (老乡鸡) 🐔. The main content was completed in 2024; this is not an official Lao Xiang Ji repository. The text comes from the "Lao Xiang Ji Dish Traceability Report" and has been collated, edited, and organized. CookLikeHOC.
gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI
A high-throughput and memory-efficient inference and serving engine for LLMs
A domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
Development repository for the Triton language and compiler
Updated December 2025: a roundup of Docker registry mirrors currently usable in mainland China, a list of DockerHub mirror accelerators for China, 🚀 DockerHub mirror accelerator
Fast and memory-efficient exact attention
Flash Attention in ~100 lines of CUDA (forward pass only)
Stanford computer networking lab: an elegant TCP/IP implementation
A wrapper script to build whole-program LLVM bitcode files
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
Fully open reproduction of DeepSeek-R1