LLM Inference server

Developing my own LLM inference server, similar to vLLM.
- Online and Offline mode.
- KV Caching
- Multiple requests processing.
- Scheduler to schedule requests.
- Separate prefill and decode.
- Prefix Caching
- Continuous batching
- Chunked prefill
- Torch compilation
- CUDA Graphs
- Speculative decoding
- Quantization
- Distributed inference etc.
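
As a rough illustration of how a few of the items above fit together (a scheduler, separate prefill and decode, and continuous batching), here is a minimal toy sketch. It is not this project's implementation: `Request`, `ToyScheduler`, `fake_model`, and `run_demo` are hypothetical names made up for this example, the "model" is a stand-in function, and there is no KV cache or GPU work.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical request record for this sketch (not the project's type)."""
    req_id: int
    prompt: list[int]
    max_new_tokens: int  # assumed to be >= 1
    output: list[int] = field(default_factory=list)
    prefilled: bool = False

    @property
    def finished(self) -> bool:
        return len(self.output) >= self.max_new_tokens


class ToyScheduler:
    """Toy continuous batching: free batch slots are refilled on every step."""

    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.waiting: deque[Request] = deque()
        self.running: list[Request] = []

    def add(self, req: Request) -> None:
        self.waiting.append(req)

    def has_work(self) -> bool:
        return bool(self.waiting or self.running)

    def step(self, model_step) -> list[Request]:
        # Admit waiting requests into free slots without draining the batch.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())

        # Split the batch: new requests need prefill, the rest decode.
        prefill = [r for r in self.running if not r.prefilled]
        decode = [r for r in self.running if r.prefilled]

        # Prefill processes the whole prompt and yields the first token.
        for r in prefill:
            r.output.append(model_step(r.prompt))
            r.prefilled = True

        # Decode yields one more token per running request.
        for r in decode:
            r.output.append(model_step(r.prompt + r.output))

        # Retire finished requests so their slots free up for the next step.
        finished = [r for r in self.running if r.finished]
        self.running = [r for r in self.running if not r.finished]
        return finished


def fake_model(tokens: list[int]) -> int:
    """Stand-in for a forward pass: returns a dummy next-token id."""
    return len(tokens)


def run_demo() -> tuple[list[int], int]:
    """Run three requests through a batch of size 2; return finish order and step count."""
    sched = ToyScheduler(max_batch_size=2)
    sched.add(Request(0, [1, 2, 3], max_new_tokens=2))
    sched.add(Request(1, [4, 5], max_new_tokens=4))
    sched.add(Request(2, [6], max_new_tokens=1))
    order, steps = [], 0
    while sched.has_work():
        order.extend(r.req_id for r in sched.step(fake_model))
        steps += 1
    return order, steps


if __name__ == "__main__":
    print(run_demo())
```

In this demo, request 2 joins the batch on the step right after request 0 finishes, while request 1 is still decoding. That is the core idea of continuous batching: the batch never has to wait for its slowest member before accepting new work.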

| Total requests | tokn | vLLM |
|---|---|---|
| 102 | 457 tok/sec | 30,000 tok/sec |