kvcached is a new KV cache management system that enables elastic KV cache memory for autoregressive LLMs. It brings the benefits of virtual memory in operating systems to LLM serving: instead of statically reserving large blocks of GPU memory at startup (the common practice today), it allows serving engines to allocate and release KV cache on demand based on actual workload needs.
- 🔄 Elastic KV cache: allocate and reclaim KV memory dynamically to match live load.
- 🗺️ GPU virtual memory: decouple logical KV from physical GPU memory via runtime mapping.
- 🛠️ Memory control CLI: enforce per-model memory limits with the kvcached CLI.
- 🧭 Frontend router and sleep manager: route requests to the corresponding backend and put models to sleep when idle.
- 🤝 SGLang and vLLM support: integrates with both engines as a plugin.
- 🔀 Multi‑LLM serving: kvcached allows multiple LLMs to share a GPU's memory elastically, enabling concurrent deployment without the rigid memory partitioning used today. This improves GPU utilization and saves serving costs.
- ⚡ Serverless LLM: By allocating KV cache only when needed, kvcached supports serverless deployments where models can spin up and down on demand.
- 🧩 Compound AI systems: kvcached makes compound AI systems practical on limited hardware by elastically allocating memory across specialized models in a pipeline (e.g., retrieval, reasoning, and summarization).
- 🖥️ GPU workload colocation: kvcached allows LLM inference to coexist with other GPU workloads such as training jobs, fine-tuning, or vision models.
See concrete examples in kvcached/examples.
kvcached enables dynamic memory sharing between LLMs, allowing them to share the same GPU memory elastically. In contrast, current serving engines must statically reserve GPU memory at startup.
This benchmark shows the performance benefits of kvcached when serving three Llama-3.1-8B
models on an A100-80G GPU under workloads with intermittent peaks. Details can be found in benchmarks/bench_latency_benefit.
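The elastic behavior above can be illustrated with a toy Python sketch (a conceptual model only, not the actual kvcached API): a large virtual page range is reserved up front, while physical pages are mapped lazily as the KV cache grows and unmapped when requests finish.

```python
class ElasticKVCache:
    """Toy model of elastic KV cache memory: a large *virtual* range is
    reserved, but physical pages are attached only on demand."""

    def __init__(self, virtual_pages: int, page_size: int):
        self.virtual_pages = virtual_pages
        self.page_size = page_size
        self.mapped = {}  # virtual page index -> physical backing

    def touch(self, page: int) -> bytearray:
        # Lazily back a virtual page with physical memory on first use.
        if page not in self.mapped:
            self.mapped[page] = bytearray(self.page_size)
        return self.mapped[page]

    def release(self, page: int) -> None:
        # Return physical memory when the KV cache shrinks.
        self.mapped.pop(page, None)

    def physical_bytes(self) -> int:
        # Physical footprint tracks live load, not the reserved virtual size.
        return len(self.mapped) * self.page_size
```

In a real serving engine, the virtual range corresponds to the logical KV cache layout, while kvcached maps and unmaps GPU physical memory behind it at runtime.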
- Python (tested with 3.9 - 3.12)
- PyTorch (compatible with SGLang and vLLM)
kvcached can be installed as a plugin with SGLang and vLLM.
cd engine_integration/scripts
# install kvcached with SGLang v0.4.9
./setup.sh --engine sglang --engine-method source --engine-version 0.4.9
# install kvcached with vLLM v0.9.2
./setup.sh --engine vllm --engine-method source --engine-version 0.9.2
This script downloads the specified versions of SGLang and vLLM, creates separate venv environments (using uv), compiles the code, applies the necessary patches, and installs kvcached.
You can test or develop kvcached with Docker.
To test kvcached with SGLang or vLLM:
docker pull ghcr.io/ovg-project/[kvcached-sglang|kvcached-vllm]:latest
For development:
docker pull ghcr.io/ovg-project/kvcached-dev:latest
More instructions can be found here.
kvcached can be enabled or disabled by setting `ENABLE_KVCACHED=true` or `ENABLE_KVCACHED=false`. To verify the installation and benchmark the performance of SGLang/vLLM with kvcached, run:
cd benchmarks/simple_bench
export VENV_PATH=../../engine_integration/[sglang|vllm]-kvcached-venv
./start_server.sh [sglang|vllm] --venv-path $VENV_PATH --model meta-llama/Llama-3.2-1B
# Wait until LLM server is ready
./start_client.sh [sglang|vllm] --venv-path $VENV_PATH --model meta-llama/Llama-3.2-1B
The benchmark scripts automatically set `ENABLE_KVCACHED=true`. Please refer to each script for instructions on how to run SGLang/vLLM with kvcached.
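As a sketch of what the flag does, an integration point can gate kvcached on this environment variable. The helper below is illustrative, not part of the kvcached API:

```python
import os

def kvcached_enabled() -> bool:
    # kvcached is toggled purely via the ENABLE_KVCACHED environment
    # variable; anything other than "true"/"1" leaves it disabled.
    return os.environ.get("ENABLE_KVCACHED", "false").lower() in ("true", "1")
```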
kvcached includes a built-in CLI tool for monitoring GPU memory usage and managing memory limits across different applications. A command, `kvctl`, is installed along with the kvcached package:
kvctl
Once inside the CLI, type `help` to view all supported commands:
kvcached> help
Available commands:
list [ipc ...] List IPC segments and usage
limit <ipc> <size> Set absolute limit (e.g. 512M, 2G)
limit-percent <ipc> <pct> Set limit as percentage of total GPU RAM
watch [-n sec] [ipc ...] Continuously display usage table
kvtop [ipc ...] [--refresh r] Launch curses kvtop UI (q to quit)
!<shell cmd> Run command in system shell
help Show this help message
delete <ipc> Delete IPC segment and its limit entry
exit | quit Exit the shell
kvcached>
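For example, a session that caps each engine's KV cache and then watches usage might look like this (the IPC segment names `SGLANG` and `VLLM` are illustrative; use `list` to see the segments on your machine):

```
kvcached> list
kvcached> limit SGLANG 20G
kvcached> limit-percent VLLM 25
kvcached> watch -n 2 SGLANG VLLM
```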
Use the `kvtop` command for real-time visualization of memory usage:
KVCache Memory Usage

IPC: SGLANG
[==##################----------------------------------------]
Prealloc: 792.0 MB | Used: 11.2 GB / 39.9 GB (30.1%) | Free: 27.9 GB

IPC: VLLM
[==#######----------------------------------------------------]
Prealloc: 768.0 MB | Used: 3.6 GB / 37.4 GB (11.7%) | Free: 33.0 GB

GPU Memory Usage
[########################################--------------------]
Used: 52.9 GB / 79.2 GB (66.8%) | Free: 26.3 GB

Press 'q' to quit
We are grateful for and open to contributions and collaborations of any kind.
We use pre-commit to ensure a consistent coding style. You can set it up by
pip install pre-commit
pre-commit install
Before pushing your code, please run the following and make sure all checks pass:
pre-commit run --all-files
Feel free to contact us for contributions and collaborations.
Jiarong Xing (jxing@rice.edu)
Yifan Qiao (yifanqiao@berkeley.edu)
Shan Yu (shanyu1@g.ucla.edu)