DéjàVu

Overview

With DéjàVu, we aim to achieve fault-tolerant and resource-efficient serving of LLMs. We observe that distributed LLM serving is costly and often underutilizes hardware accelerators due to three key challenges:

  1. Bubbles in pipeline-parallel deployments caused by the bimodal latency of prompt and token processing
  2. GPU memory overprovisioning
  3. Long recovery times in case of failures

DéjàVu addresses all these challenges using a versatile and efficient KV cache streaming library: DéjàVuLib. Using DéjàVuLib, we propose and implement:

  1. Efficient prompt-token disaggregation to reduce pipeline bubbles
  2. Microbatch swapping for efficient GPU memory management
  3. State replication for fault-tolerance

DéjàVu is implemented on top of NVIDIA FasterTransformer. Like the original FasterTransformer implementation, it supports both tensor and pipeline parallelism.

Supported Features - DéjàVuLib

DéjàVuLib is a library built to handle KV cache streaming to and from GPU memory. We support the following (currently tested for the GPT, OPT, and BLOOM models; a minimal streaming sketch follows this list):

  • Streaming of the KV cache to/from CPU memory, and flushing to local disk
  • Streaming of the KV cache to/from another GPU (in a different machine) via NCCL
  • Streaming of the KV cache to local CPU memory, and then flushing to another machine's CPU over the network, via MPI or Boost
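
As a rough illustration of the first bullet, the sketch below copies one KV cache block from GPU memory into pinned host memory on a dedicated CUDA stream, so the copy can overlap with ongoing computation before the host buffer is flushed to disk or over the network. This is a minimal, self-contained example; the buffer names and sizes are placeholders and it is not DéjàVuLib's actual API.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Illustrative sketch only: buffer names and sizes are placeholders,
// not DéjàVuLib's API. Streams one KV cache block from device memory
// into pinned host memory on a dedicated CUDA stream.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(1);
    }
}

int main() {
    const size_t kv_bytes = 64UL << 20;  // hypothetical per-layer KV block size (64 MiB)

    void* d_kv = nullptr;  // device-side KV cache block (filled by attention kernels in practice)
    void* h_kv = nullptr;  // pinned host staging buffer
    check(cudaMalloc(&d_kv, kv_bytes), "cudaMalloc");
    check(cudaMallocHost(&h_kv, kv_bytes), "cudaMallocHost");

    cudaStream_t copy_stream;
    check(cudaStreamCreate(&copy_stream), "cudaStreamCreate");

    // Asynchronous device-to-host copy; compute on other streams proceeds concurrently.
    check(cudaMemcpyAsync(h_kv, d_kv, kv_bytes, cudaMemcpyDeviceToHost, copy_stream),
          "cudaMemcpyAsync");

    // Wait for the copy to finish before flushing h_kv to local disk or the network.
    check(cudaStreamSynchronize(copy_stream), "cudaStreamSynchronize");

    cudaStreamDestroy(copy_stream);
    cudaFreeHost(h_kv);
    cudaFree(d_kv);
    return 0;
}
```

Pinned (page-locked) host memory is what keeps the device-to-host copy asynchronous; with pageable host memory, cudaMemcpyAsync falls back to a synchronous copy.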

Supported Features - DéjàVu

  • Disaggregation of prompt and token processing (a transfer sketch follows this list)
  • Fault-tolerance support with KV cache replication
  • Microbatch swapping to CPU memory for pipeline parallelism
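
To make the disaggregation item more concrete: at the transport level, the prompt worker's KV cache must reach the worker doing token generation. The sketch below is a single-process, two-GPU NCCL point-to-point transfer of one KV block. It is only an assumption-laden illustration of that step (buffer names, sizes, and the single-process setup are placeholders), not DéjàVu's implementation, which runs across machines.

```cpp
#include <cuda_runtime.h>
#include <nccl.h>
#include <cstdio>
#include <cstdlib>

// Illustrative sketch only: a single-process, two-GPU NCCL transfer of one
// KV cache block from a "prompt" GPU to a "token" GPU. Names, sizes, and
// the single-process setup are assumptions, not DéjàVu's implementation.
#define CHECK_CUDA(cmd) do { cudaError_t e = (cmd); if (e != cudaSuccess) { \
    std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e)); std::exit(1); } } while (0)
#define CHECK_NCCL(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
    std::fprintf(stderr, "NCCL error: %s\n", ncclGetErrorString(r)); std::exit(1); } } while (0)

int main() {
    const size_t kv_elems = 16UL << 20;   // hypothetical KV block size, in float elements
    int devs[2] = {0, 1};                 // GPU 0 = prompt worker, GPU 1 = token worker
    ncclComm_t comms[2];
    CHECK_NCCL(ncclCommInitAll(comms, 2, devs));

    void* bufs[2];
    cudaStream_t streams[2];
    for (int i = 0; i < 2; ++i) {
        CHECK_CUDA(cudaSetDevice(devs[i]));
        CHECK_CUDA(cudaMalloc(&bufs[i], kv_elems * sizeof(float)));
        CHECK_CUDA(cudaStreamCreate(&streams[i]));
    }

    // Point-to-point transfer of the KV block: the prompt GPU sends, the token GPU receives.
    CHECK_NCCL(ncclGroupStart());
    CHECK_NCCL(ncclSend(bufs[0], kv_elems, ncclFloat, /*peer=*/1, comms[0], streams[0]));
    CHECK_NCCL(ncclRecv(bufs[1], kv_elems, ncclFloat, /*peer=*/0, comms[1], streams[1]));
    CHECK_NCCL(ncclGroupEnd());

    for (int i = 0; i < 2; ++i) {
        CHECK_CUDA(cudaSetDevice(devs[i]));
        CHECK_CUDA(cudaStreamSynchronize(streams[i]));
        CHECK_CUDA(cudaFree(bufs[i]));
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```

In a truly disaggregated setting the prompt and token workers live on different machines, so each side would set up its NCCL communicator from a shared ncclUniqueId (ncclGetUniqueId / ncclCommInitRank) rather than ncclCommInitAll within one process.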

Documentation

  1. Installation: Check docs/install.md
  2. DéjàVuLib documentation and microbenchmarks: Check docs/dejavulib.md
  3. DéjàVu serving system documentation and benchmarks: Check docs/dejavu.md
  4. DéjàVu Planner documentation: Check docs/dejavu_planner.md
  5. DéjàVu simulator: Check docs/dejavu_simulator.md
  6. For the original FasterTransformer documentation: Check docs/original_ft

Paper

If you use DéjàVu or DéjàVuLib in your research, please cite our paper:

```bibtex
@misc{strati2024dejavu,
      title={D\'ej\`aVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving},
      author={Foteini Strati and Sara Mcallister and Amar Phanishayee and Jakub Tarnawski and Ana Klimovic},
      year={2024},
      eprint={2403.01876},
      archivePrefix={arXiv},
      primaryClass={cs.DC}
}
```
