A lightweight yet high-performance inference framework for Large Language Models.
Mini-SGLang is a compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems. At roughly 5,000 lines of Python, it serves as both a capable inference engine and a transparent reference for researchers and developers.
- High Performance: Achieves state-of-the-art throughput and latency with advanced optimizations.
- Lightweight & Readable: A clean, modular, and fully type-annotated codebase that is easy to understand and modify.
- Advanced Optimizations:
  - Radix Cache: Reuses KV cache for shared prefixes across requests (see the sketch after this list).
  - Chunked Prefill: Reduces peak memory usage for long-context serving.
  - Overlap Scheduling: Hides CPU scheduling overhead behind GPU computation.
  - Tensor Parallelism: Scales inference across multiple GPUs.
  - Optimized Kernels: Integrates FlashAttention and FlashInfer for maximum efficiency.
  - ...
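The snippet below is a minimal sketch of the idea behind radix caching, not Mini-SGLang's actual implementation: a trie keyed by token IDs records which prefixes already have KV cache, so a new request only needs to prefill the tokens beyond its longest cached prefix. All names are illustrative.

```python
# Illustrative sketch of prefix reuse (not Mini-SGLang's real data structures):
# a trie over token IDs tracks which prefixes already have cached KV entries.

class _TrieNode:
    def __init__(self) -> None:
        self.children: dict[int, "_TrieNode"] = {}

class PrefixCache:
    def __init__(self) -> None:
        self.root = _TrieNode()

    def match_prefix(self, tokens: list[int]) -> int:
        """Return the number of leading tokens whose KV cache already exists."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Record that KV cache now exists for every prefix of `tokens`."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, _TrieNode())

cache = PrefixCache()
cache.insert([101, 7592, 2088, 102])        # first request's prompt tokens
print(cache.match_prefix([101, 7592, 99]))  # -> 2: only the last token needs prefill
```

A production radix cache additionally compresses token runs into edges, reference-counts nodes, and evicts cold branches to reclaim GPU memory; the sketch keeps only the lookup idea.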
We recommend using uv for a fast and reliable installation (note that uv does not conflict with conda).
```bash
# Create a virtual environment (Python 3.10+ recommended)
uv venv --python=3.12
source .venv/bin/activate
```

Prerequisites: Mini-SGLang relies on CUDA kernels that are JIT-compiled. Ensure you have the NVIDIA CUDA Toolkit installed and that its version is compatible with your driver. You can check the highest CUDA version your driver supports with `nvidia-smi`.
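As an additional sanity check from Python (assuming PyTorch is installed, which the FlashAttention/FlashInfer kernel stack builds against), you can confirm the GPU and CUDA stack are visible before launching:

```python
# Quick sanity check of the CUDA stack (assumes PyTorch is installed).
import torch

print(torch.cuda.is_available())      # True if the driver and GPU are visible
print(torch.version.cuda)             # CUDA version PyTorch was compiled with
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA H200"
```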
Install Mini-SGLang directly from source:
```bash
git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang
uv pip install -e .
```

Launch an OpenAI-compatible API server with a single command.
```bash
# Deploy Qwen/Qwen3-0.6B on a single GPU
python -m minisgl --model "Qwen/Qwen3-0.6B"

# Deploy meta-llama/Llama-3.1-70B-Instruct on 4 GPUs with tensor parallelism, on port 30000
python -m minisgl --model "meta-llama/Llama-3.1-70B-Instruct" --tp 4 --port 30000
```

Once the server is running, you can send requests using standard tools like curl or any OpenAI-compatible client.
Chat with your model directly in the terminal by adding the `--shell` flag.

```bash
python -m minisgl --model "Qwen/Qwen3-0.6B" --shell
```

You can also use `/reset` to clear the chat history.
See `bench_nanovllm.py` for more details. To run an ablation of overlap scheduling, set `MINISGL_DISABLE_OVERLAP_SCHEDULING=1`.
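For instance, a small driver can toggle the flag for an A/B comparison (a hypothetical sketch; the only documented knob is the environment variable, and `bench_nanovllm.py`'s own CLI arguments are not shown here):

```python
# Hypothetical A/B driver for the overlap-scheduling ablation.
import os
import subprocess

for disable in ("0", "1"):
    env = {**os.environ, "MINISGL_DISABLE_OVERLAP_SCHEDULING": disable}
    label = "disabled" if disable == "1" else "enabled"
    print(f"--- overlap scheduling {label} ---")
    subprocess.run(["python", "bench_nanovllm.py"], env=env, check=True)
```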
Test Configuration:
- Hardware: 1× H200 GPU
- Models: Qwen3-0.6B, Qwen3-14B
- Total Requests: 256 sequences
- Input Length: Randomly sampled between 100 and 1024 tokens
- Output Length: Randomly sampled between 100 and 1024 tokens
See `benchmark_qwen.py` for more details.
Test Configuration:
- Hardware: 4× H200 GPUs, connected via NVLink
- Model: Qwen3-32B
- Dataset: Qwen trace, replaying the first 1000 requests
Launch commands:

```bash
# Mini-SGLang
python -m minisgl --model "Qwen/Qwen3-32B" --tp 4 --cache naive

# SGLang
python3 -m sglang.launch_server --model "Qwen/Qwen3-32B" --tp 4 \
    --disable-radix --port 1919 --decode-attention flashinfer
```

- Detailed Features: Explore all available features and command-line arguments.
- System Architecture: Dive deep into the design and data flow of Mini-SGLang.