Stars
A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.
Tutel MoE: Optimized Mixture-of-Experts Library, Support GptOss/DeepSeek/Kimi-K2/Qwen3 using FP8/NVFP4/MXFP4
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
Awesome LLM compression research papers and tools.
Understanding Deep Learning - Simon J.D. Prince
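This project aims to share the technical principles behind large models along with hands-on experience (large-model engineering and deploying large-model applications).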
Code repo for the paper "SpinQuant: LLM quantization with learned rotations"
My learning notes for ML SYS.
Code for the paper "Evaluating Large Language Models Trained on Code"
A framework for the evaluation of autoregressive code generation language models.
SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, and ONNX Runtime
A low-latency & high-throughput serving engine for LLMs
A throughput-oriented high-performance serving framework for LLMs
ROCm / flash-attention
Forked from Dao-AILab/flash-attention
Fast and memory-efficient exact attention
8-bit CUDA functions for PyTorch
[ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation
Running large language models on a single GPU for throughput-oriented scenarios.
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
An acceleration library that supports arbitrary bit-width combinatorial quantization operations