Stars
Puzzles for learning Triton
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
Achieve state-of-the-art inference performance with modern accelerators on Kubernetes
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
Distributed Compiler based on Triton for Parallel Systems
Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
A Datacenter Scale Distributed Inference Serving Framework
My learning notes for ML systems (MLSys).
A bidirectional pipeline parallelism algorithm for computation-communication overlap in DeepSeek V3/R1 training.
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
verl: Volcano Engine Reinforcement Learning for LLMs
SGLang is a high-performance serving framework for large language models and multimodal models.
Infinity is a high-throughput, low-latency serving engine for text embeddings, reranking models, CLIP, CLAP, and ColPali
A high-throughput and memory-efficient inference and serving engine for LLMs
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups at small-to-medium batch sizes of 16-32 tokens.
Code for the NeurIPS 2024 paper QuaRot: end-to-end 4-bit inference for large language models.
[DEPRECATED] Moved to ROCm/rocm-libraries repo. NOTE: develop branch is maintained as a read-only mirror
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
A high-performance inference system for large language models, designed for production environments.
TVM Documentation in Simplified Chinese (TVM 中文文档)
How to optimize common algorithms in CUDA.
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
A series of GPU optimization topics, introducing CUDA kernel optimization in detail through several basic kernel optimizations, including: elementwise, reduce, s…
Ring attention implementation with flash attention
Course to get into Large Language Models (LLMs) with roadmaps and Colab notebooks.
cube studio: an open-source, cloud-native, one-stop machine learning / deep learning / large-model AI platform. Covers the full MLOps pipeline: compute rental, online notebook development, drag-and-drop pipeline orchestration, multi-node multi-GPU distributed training, hyperparameter search, inference serving with vGPU virtualization, edge computing, automated data labeling, SFT fine-tuning / reward-model / reinforcement-learning training for large models such as DeepSeek, multi-node inference with vLLM/Ollama/MindIE, private knowledge bases, AI model marketplace…