Stars
[NeurIPS 2025] Scaling Speculative Decoding with Lookahead Reasoning
A multi-turn RL training system with AgentTrainer for reinforcement learning of language models on games
A curated list of recent papers on efficient video attention for video diffusion models, including sparsification, quantization, and caching.
A Datacenter-Scale Distributed Inference Serving Framework
[NeurIPS 2025] A simple extension to vLLM that speeds up reasoning models without training.
A unified inference and post-training framework for accelerated video generation.
[NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank
[ICML 2024] CLLMs: Consistency Large Language Models
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
A high-throughput and memory-efficient inference and serving engine for LLMs
Training and serving large-scale neural networks with auto parallelization.
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
The source of the LMSYS website and blogs
Swarm training framework using Haiku + JAX + Ray for layer-parallel transformer language models on unreliable, heterogeneous nodes
Running large language models on a single GPU for throughput-oriented scenarios.
[OSDI 2023] AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
[NeurIPS 2022] Automatically finding good model-parallel strategies, especially for complex models and clusters.
Cavs: An Efficient Runtime System for Dynamic Neural Networks
zhisbug / ray
Forked from ray-project/ray. An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.
Code for "BayesAdapter: Being Bayesian, Inexpensively and Robustly, via Bayesian Fine-tuning"
Resource-adaptive cluster scheduler for deep learning training.
An end-to-end PyTorch framework for image and video classification