Stars
My learning notes for MLSys.
An NCCL extension library designed to efficiently offload GPU memory allocated by the NCCL communication library.
An early-research-stage expert-parallel load balancer for MoE models, based on linear programming.
📚LeetCUDA: Modern CUDA Learning Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
This project shares the technical principles behind large language models along with hands-on experience (LLM engineering and deploying LLM applications in production).
FlagTree is a unified compiler for custom deep learning operations that supports multiple AI chip backends; it is forked from triton-lang/triton.
🧑🏫 60+ Implementations/tutorials of deep learning papers with side-by-side notes 📝; including transformers (original, xl, switch, feedback, vit, ...), optimizers (adam, adabelief, sophia, ...), ga…
A course on LLM inference serving on Apple Silicon for systems engineers: build a tiny vLLM + Qwen.
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide better performance…
AIInfra (AI infrastructure) covers the AI system stack from underlying hardware such as chips up through the software stack that supports training and inference of large AI models.
Supercharge Your LLM with the Fastest KV Cache Layer
[ICML 2025] Official PyTorch implementation of "FlatQuant: Flatness Matters for LLM Quantization"
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
CUDA Python: Performance meets Productivity
A domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels.
FlashMLA: Efficient Multi-head Latent Attention Kernels
A framework for few-shot evaluation of language models.
A collection of AIGC, CV, and LLM interview questions and answers, along with new ideas, questions, resources, and projects encountered in work and research.
[ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
VPTQ: a flexible, extreme low-bit quantization algorithm
[ICLR'25] ARB-LLM: Alternating Refined Binarizations for Large Language Models
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models