Stars
A high-throughput and memory-efficient inference and serving engine for LLMs
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Running large language models on a single GPU for throughput-oriented scenarios.
A Datacenter Scale Distributed Inference Serving Framework
Training and serving large-scale neural networks with auto parallelization.
A unified inference and post-training framework for accelerated video generation.
An end-to-end PyTorch framework for image and video classification
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
Resource-adaptive cluster scheduler for deep learning training.
[ICML 2024] CLLMs: Consistency Large Language Models
Swarm training framework using Haiku + JAX + Ray for layer parallel transformer language models on unreliable, heterogeneous nodes
[NeurIPS 2025] A simple extension on vLLM to help you speed up reasoning models without training.
PMLS-Caffe: Distributed Deep Learning Framework for Parallel ML System
GPU-specialized parameter server for GPU machine learning.
Automatic Photo Adjustment Using Deep Neural Networks
AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI 23)
The source of LMSYS website and blogs
[NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank
[NeurIPS 2025] Scaling Speculative Decoding with Lookahead Reasoning
Multi-Turn RL Training System with AgentTrainer for Language Model Game Reinforcement Learning
A curated list of recent papers on efficient video attention for video diffusion models, covering sparsification, quantization, caching, and more.
[NeurIPS 2022] Automatically finding good model-parallel strategies, especially for complex models and clusters.