Mooncake is the serving platform for Kimi, a leading LLM service provided by
Moonshot AI.
Now both the Transfer Engine and Mooncake Store are open-sourced!
This repository also hosts its technical report and the open-sourced traces.
- May 7, 2026: π vLLM officially features Mooncake Store β a deep dive into how Mooncake's distributed KVCache engine supercharges vLLM inference with high-throughput, memory-efficient, cross-instance KV cache sharing!
- Apr 29, 2026: SGLang introduces RDMA-based P2P weight transfer for large-scale distributed RL using Mooncake TransferEngine, achieving 7x faster weight updates for the 1T-parameter Kimi-K2 model (53s β 7.2s) with zero-copy RDMA transfer across thousands of GPUs.
- Mar 19, 2026: TorchSpec: Speculative Decoding Training at Scale is open sourced, using Mooncake to decouple inference and training via efficient hidden states management.
- Mar 5, 2026: LightX2V now supports disaggregated deployment based on Mooncake, enabling encoder/transformer service decoupling with Mooncake Transfer Engine for high-performance cross-device and cross-machine data transfer.
- Feb 25, 2026: SGLang merged Encoder Global Cache Manager, introducing a Mooncake-powered global multimodal embedding cache that enables cross-instance sharing of ViT embeddings to avoid redundant GPU computation.
More
- Feb 24, 2026: vLLM-Omni introduces disaggregated inference connectors with support for both
MooncakeStoreConnectorandMooncakeTransferEngineConnectorfor multi-node omni-modality pipelines. - Feb 12, 2026: Mooncake Joins PyTorch Ecosystem We are thrilled to announce that Mooncake has officially joined the PyTorch Ecosystem!
- Jan 28, 2026: FlexKV, a distributed KV store and cache system from Tencent and NVIDIA in collaboration with the community, now supports distributed KVCache reuse with the Mooncake Transfer Engine.
- Dec 27, 2025: Collaboration with ROLL! Check out the paper here.
- Dec 23, 2025: SGLang introduces Encode-Prefill-Decode (EPD) Disaggregation with Mooncake as a transfer backend. This integration allows decoupling compute-intensive multimodal encoders (e.g., Vision Transformers) from language model nodes, utilizing Mooncake's RDMA engine for zero-copy transfer of large multimodal embeddings.
- Dec 19, 2025: Mooncake Transfer Engine has been integrated into TensorRT LLM for KVCache transfer in PD-disaggregated inference.
- Dec 19, 2025: Mooncake Transfer Engine has been directly integrated into vLLM v1 as a KV Connector in PD-disaggregated setups.
- Nov 07, 2025: RBG + SGLang HiCache + Mooncake, a role-based out-of-the-box solution for cloud native deployment, which is elastic, scalable, and high-performance.
- Sept 18, 2025: Mooncake Store empowers vLLM Ascend by serving as the distributed KV cache pool backend.
- Sept 10, 2025: SGLang officially supports Mooncake Store as a hierarchical KV caching storage backend. The integration extends RadixAttention with multi-tier KV cache storage across device, host, and remote storage layers.
- Sept 10, 2025: The official & high-performance version of Mooncake P2P Store is open-sourced as checkpoint-engine. It has been successfully applied in K1.5 and K2 production training, updating Kimi-K2 model (1T parameters) across thousands of GPUs in ~20s.
- Aug 23, 2025: xLLM high-performance inference engine builds hybrid KV cache management based on Mooncake, supporting global KV cache management with intelligent offloading and prefetching.
- Aug 18, 2025: vLLM-Ascend integrates Mooncake Transfer Engine for KV cache register and disaggregate prefill, enabling efficient distributed inference on Ascend NPUs.
- Jul 20, 2025: Mooncake powers the deployment of Kimi K2 on 128 H200 GPUs with PD disaggregation and large-scale expert parallelism, achieving 224k tokens/sec prefill throughput and 288k tokens/sec decode throughput.
- Jun 20, 2025: Mooncake becomes a PD disaggregation backend for LMDeploy.
- May 9, 2025: NIXL officially supports Mooncake Transfer Engine as a backend plugin.
- May 8, 2025: Mooncake x LMCache unite to pioneer KVCache-centric LLM serving system.
- May 5, 2025: Supported by Mooncake Team, SGLang release guidance to deploy DeepSeek with PD Disaggregation on 96 H100 GPUs.
- Apr 22, 2025: LMCache officially supports Mooncake Store as a remote connector.
- Apr 10, 2025: SGLang officially supports Mooncake Transfer Engine for disaggregated prefilling and KV cache transfer.
- Mar 7, 2025: We open-sourced the Mooncake Store, a distributed KVCache based on Transfer Engine. vLLM's xPyD disaggregated prefilling & decoding based on Mooncake Store will be released soon.
- Feb 25, 2025: Mooncake receives the Best Paper Award at FAST 2025!
- Feb 21, 2025: The updated traces used in our FAST'25 paper have been released.
- Dec 16, 2024: vLLM officially supports Mooncake Transfer Engine for disaggregated prefilling and KV cache transfer.
- Nov 28, 2024: We open-sourced the Transfer Engine, the central component of Mooncake. We also provide two demonstrations of Transfer Engine: a P2P Store and vLLM integration.
- July 9, 2024: We open-sourced the trace as a JSONL file.
- June 27, 2024: We present a series of Chinese blogs with more discussions on zhihu 1, 2, 3, 4, 5, 6, 7.
- June 26, 2024: Initial technical report release.
Mooncake features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated KVCache pool.
The core of Mooncake is its KVCache-centric scheduler, which balances maximizing overall effective throughput while meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake faces challenges in highly overloaded scenarios. To mitigate these, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios. Compared to the baseline method, Mooncake can achieve up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncakeβs innovative architecture enables Kimi to handle 75% more requests.
The core of Mooncake is the Transfer Engine (TE), a high-performance data transfer framework. TE offers a unified interface for batched data movement across diverse storage, network, and accelerator environments. By supporting multiple transport protocols, topology-aware routing, multi-NIC bandwidth aggregation, and automatic failover, TE delivers low-latency, scalable, and robust data transmission for distributed AI workloads. See the Transfer Engine guide for details.
Highlights
-
Efficient use of multiple RDMA NIC devices. Transfer Engine supports the use of multiple RDMA NIC devices to achieve the aggregation of transfer bandwidth.
-
Topology-aware path selection. Transfer Engine can select optimal devices based on the location (NUMA affinity, etc.) of both source and destination.
-
Robust against temporary network errors. Once transmission fails, Transfer Engine will try to use alternative paths for data delivery automatically.
-
Superior performance at scale. With 40 GB of data (equivalent to the size of the KVCache generated by 128k tokens in the LLaMA3-70B model), Mooncake Transfer Engine delivers up to 87 GB/s and 190 GB/s of bandwidth in 4Γ200 Gbps and 8Γ400 Gbps RoCE networks respectively, which are about 2.4x and 4.6x faster than the TCP protocol.
-
Broad support for heterogeneous transports and accelerators. Transfer Engine provides unified data transfer across diverse protocols, including TCP, RDMA, AWS EFA, NVMe-oF, NVLink, HIP, Barex, CXL, and Ascend-family transports. When built with the corresponding runtime, Transfer Engine can detect accelerator memory and select suitable transport paths for efficient data movement across CUDA, MUSA, HIP, MACA, Cambricon MLU, and Ascend-enabled environments. For a complete list of supported protocols and configuration guide, see the Supported Protocols Documentation.
-
Widely adopted across the LLM ecosystem. TE is used in production inference stacks such as SGLang, vLLM, TensorRT-LLM, vLLM-Ascend, checkpoint-engine, and NIXL, among others, to efficiently transfer KV cache, embeddings, model weights, and other data.
Mooncake Store is built on the Transfer Engine and provides distributed key/value caching. It is a distributed KVCache storage engine specialized for LLM inference. The goal of Mooncake Store is to store the reusable KV caches across various locations in an inference cluster. Mooncake Store has been supported in SGLang's Hierarchical KV Caching, vLLM's prefill serving and is now integrated with LMCache to provide enhanced KVCache management capabilities. See the Mooncake Store guide.
-
High bandwidth utilization: Mooncake Store supports large-object striping, parallel I/O, and end-to-end zero-copy data transfer, fully utilizing aggregated bandwidth across multiple NICs.
-
Multi-tier cache hierarchy. Mooncake Store supports a multi-level cache design across DRAM and SSD/NVMe, enabling larger cache capacity.
-
Elastic and disaggregated storage: Mooncake Store decouples KVCache storage from inference engines, allowing storage nodes to be dynamically added or removed while keeping cached data independent from engine restarts, upgrades, and scheduling decisions.
Mooncake adds elasticity and fault tolerance support for MoE model inference, enabling inference systems to remain responsive and recoverable in the event of GPU failures or changes in resource configuration. This functionality includes automatic faulty rank detection and can work with the EPLB module to dynamically route tokens to healthy ranks during inference.
Mooncake establishes a full-stack, Tensor-oriented AI infrastructure where Tensors serve as the fundamental data carrier. The ecosystem spans from the Transfer Engine, which accelerates Tensor data movement across heterogeneous storage (DRAM/VRAM/NVMe), to Mooncake Store for distributed management of Tensor objects (e.g., KVCache and model weight), up to the Mooncake Backend enabling Tensor-based elastic distributed computing. This architecture is designed to maximize Tensor processing efficiency for large-scale model inference and training.
SGLang Integration (Guide)
SGLang officially supports Mooncake Store as a HiCache storage backend. This integration enables scalable KV cache retention and high-performance access for large-scale LLM serving scenarios.
- Hierarchical KV Caching: Mooncake Store serves as an external storage backend in SGLang's HiCache system, extending RadixAttention with multi-level KV cache storage across device, host, and remote storage layers.
- Flexible Cache Management: Supports multiple cache policies including write-through, write-through-selective, and write-back modes, with intelligent prefetching strategies for optimal performance.
- Comprehensive Optimizations: Features advanced data plane optimizations including page-first memory layout for improved I/O efficiency, zero-copy mechanisms for reduced memory overhead, GPU-assisted I/O kernels delivering fast CPU-GPU transfers, and layer-wise overlapping for concurrent KV cache loading while computation executes.
- Elastic Expert Parallel: Mooncake's collective communication backend and expert parallel kernels are integrated into SGLang to enable fault-tolerant expert parallel inference (sglang#11657).
- Significant Performance Gains: The multi-turn benchmark demonstrates substantial performance improvements over the non-HiCache setting. See our benchmark report for more details.
- Community Feedback: Effective KV caching significantly reduces TTFT by eliminating redundant and costly re-computation. Integrating SGLang HiCache with the Mooncake service enables scalable KV cache retention and high-performance access. In our evaluation, we tested the DeepSeek-R1-671B model under PD-disaggregated deployment using in-house online requests sampled from a general QA scenario. On average, cache hits achieved an 84% reduction in TTFT compared to full re-computation. β Ant Group
vLLM Integration (Guide v0.2)
To optimize LLM inference, the vLLM community is working on supporting disaggregated prefilling (PR 10502). This feature allows separating the prefill phase from the decode phase in different processes. vLLM uses nccl and gloo as the transport layer by default, but currently it cannot efficiently decouple both phases in different machines.
We have implemented vLLM integration, which uses Transfer Engine as the network layer instead of nccl and gloo, to support inter-node KVCache transfer (PR 10884). Transfer Engine provides simpler interfaces and more efficient use of RDMA devices.
We will soon release the new vLLM integration based on Mooncake Store, which supports xPyD prefill/decode disaggregation.
Update[Dec 16, 2024]: Here is the latest vLLM Integration (Guide v0.2) that is based on vLLM's main branch.
By supporting Topology-Aware Path Selection and multi-card bandwidth aggregation, the mean TTFT of vLLM with Transfer Engine is up to 25% lower than traditional TCP-based transports. In the future, we will further improve TTFT through GPUDirect RDMA and zero-copy.
| Backend/Setting | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) |
|---|---|---|---|---|---|
| Transfer Engine (RDMA) | 12.06 | 2042.74 | 1056.76 | 635.00 | 4006.59 |
| TCP | 12.05 | 2041.13 | 1414.05 | 766.23 | 6035.36 |
- Click here to access detailed benchmark results.
More advanced features are coming soon, so stay tuned!
Mooncake supports hardware backends across accelerator vendors, cloud fabrics, and standard datacenter interconnects.
The following hardware partners and cloud platforms are supported by the Mooncake, covering GPUs, specialized AI accelerators, and cloud-native interconnects:
For complete protocol behavior, SDK requirements, and vendor-specific configuration, see the supported protocols, build guide, and Transfer Engine design docs.
Mooncake is designed and optimized for high-speed RDMA networks. Though Mooncake supports TCP-only data transfer, we strongly recommend users to evaluate the functionality and performance of Mooncake with RDMA network support.
The following need to be installed before running any component of Mooncake:
- RDMA Driver & SDK, such as Mellanox OFED.
- Python 3.10, virtual environment is recommended.
- CUDA 12.1 and above, including NVIDIA GPUDirect Storage Support, if the package is built with
-DUSE_CUDA(disabled by default). You may install them from here. - Cambricon Neuware, if the package is built with
-DUSE_MLU. By default Mooncake looks for Neuware underNEUWARE_HOMEor/usr/local/neuware. - Hygon DTK SDK, if the package is built with
-DUSE_HYGON. By default Mooncake looks for DTK underDTK_HOMEor/opt/dtk. - Iluvatar CoreX SDK, if the package is built with
-DUSE_COREX. By default Mooncake looks for CoreX underCOREX_HOMEor/usr/local/corex.
The simplest way to use Mooncake Transfer Engine is using pip:
For CUDA-enabled systems:
- CUDA < 13.0
pip install mooncake-transfer-engine- CUDA >= 13.0
pip install mooncake-transfer-engine-cuda13For non-CUDA systems:
pip install mooncake-transfer-engine-non-cudaImportant
- The CUDA version (
mooncake-transfer-engine) includes Mooncake-EP and GPU topology detection, requiring CUDA 12.1+. - The non-CUDA version (
mooncake-transfer-engine-non-cuda) is for environments without CUDA dependencies. - MLU support is currently available through source builds with
-DUSE_MLU=ON; there is no dedicated prebuilt MLU wheel yet. - If users encounter problems such as missing
lib*.so, they should uninstall the package they installed and build the binaries manually.
For the default source build, use the automatic dependency script and standard CMake flow:
git clone https://github.com/kvcache-ai/Mooncake.git
cd Mooncake
sudo bash dependencies.sh
mkdir build
cd build
cmake ..
make -j
sudo make install # optional, make it ready to be used by vLLM/SGLangFor custom accelerator backends, Docker deployment, NVMe-oF, EFA, CXL, Redis / HTTP metadata, Rust bindings, or other advanced build options, see the Build Guide.
{
"timestamp": 27482,
"input_length": 6955,
"output_length": 52,
"hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2353, 2354]
}
{
"timestamp": 30535,
"input_length": 6472,
"output_length": 26,
"hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2366]
}The above presents two samples from our trace dataset. The trace includes the timing of request arrivals, the number of input tokens, the number of output tokens, and the remapped block hash. To protect our customers' privacy, we applied several mechanisms to remove user-related information while preserving the dataset's utility for simulated evaluation. More descriptions of the trace (e.g., up to 50% cache hit ratio) can be found in Section 4 of the technical report.
Update[Feb 21, 2025]: The updated traces used in our FAST'25 paper have been released! Please refer to the paper's appendix (found here) for more details.
Please kindly cite our papers if you find the papers or the traces are useful:@article{sun2026survivingpartialrankfailures,
title = {Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference},
author = {Xun Sun and Shaoyuan Chen and Pingchuan Ma and Yue Chen and Ziwei Yuan and Zhanhao Cao and Han Han and Shangming Cai and Teng Ma and Xuchun Shang and Xinpeng Zhao and Ke Yang and Junlin Wei and Lianzhi Lin and Yuji Liu and Feng Ren and Haoran Hu and Cheng Wan and Yingdi Shan and Yongwei Wu and Mingxing Zhang},
year = {2026},
url = {https://arxiv.org/abs/2605.10670},
}
@article{qin2025mooncake_tos,
author = {Qin Ruoyu and Li Zheming and He Weiran and Cui Jialei and Tang Heyi and Ren Feng and Ma Teng and Cai Shangming and Zhang Yineng and Zhang Mingxing and Wu Yongwei and Zheng Weimin and Xu Xinran},
title = {Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving},
year = {2025},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
issn = {1553-3077},
url = {https://doi.org/10.1145/3773772},
doi = {10.1145/3773772},
journal = {ACM Trans. Storage},
month = {nov},
keywords = {Machine learning system, LLM serving, KVCache},
}
@inproceedings{qin2025mooncake,
author = {Ruoyu Qin and Zheming Li and Weiran He and Jialei Cui and Feng Ren and Mingxing Zhang and Yongwei Wu and Weimin Zheng and Xinran Xu},
title = {Mooncake: Trading More Storage for Less Computation {\textemdash} A {KVCache-centric} Architecture for Serving {LLM} Chatbot},
booktitle = {23rd USENIX Conference on File and Storage Technologies (FAST 25)},
year = {2025},
isbn = {978-1-939133-45-8},
address = {Santa Clara, CA},
pages = {155--170},
url = {https://www.usenix.org/conference/fast25/presentation/qin},
publisher = {USENIX Association},
month = {feb},
}
@article{qin2024mooncake_arxiv,
title = {Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving},
author = {Ruoyu Qin and Zheming Li and Weiran He and Mingxing Zhang and Yongwei Wu and Weimin Zheng and Xinran Xu},
year = {2024},
url = {https://arxiv.org/abs/2407.00079},
}