
Mini-SGLang

A lightweight yet high-performance inference framework for Large Language Models.


Mini-SGLang is a compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems. At roughly 5,000 lines of Python, it serves as both a capable inference engine and a transparent reference for researchers and developers.

✨ Key Features

  • High Performance: Achieves high throughput and low latency through the optimizations listed below (see the Benchmark section for a comparison against SGLang).
  • Lightweight & Readable: A clean, modular, and fully type-annotated codebase that is easy to understand and modify.
  • Advanced Optimizations:
    • Radix Cache: Reuses KV cache for shared prefixes across requests (see the sketch after this list).
    • Chunked Prefill: Reduces peak memory usage for long-context serving.
    • Overlap Scheduling: Hides CPU scheduling overhead with GPU computation.
    • Tensor Parallelism: Scales inference across multiple GPUs.
    • Optimized Kernels: Integrates FlashAttention and FlashInfer for maximum efficiency.
    • ...
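
To make the Radix Cache idea concrete, here is a minimal, self-contained sketch of prefix matching over token IDs. It is not Mini-SGLang's implementation: the class names are invented for illustration, each node holds a single token rather than a compressed run, and a real radix cache attaches KV-cache pages to nodes and handles eviction. This toy version only counts how many prompt tokens could be reused.

from typing import Dict, List


class RadixNode:
    # One node per token ID; a real radix tree compresses runs of tokens
    # into a single edge and stores the corresponding KV-cache pages here.
    def __init__(self) -> None:
        self.children: Dict[int, "RadixNode"] = {}


class ToyPrefixCache:
    """Toy prefix index: record served prompts, then ask how many leading
    tokens of a new prompt were already seen (and could reuse KV cache)."""

    def __init__(self) -> None:
        self.root = RadixNode()

    def insert(self, tokens: List[int]) -> None:
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())

    def match_prefix(self, tokens: List[int]) -> int:
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched


if __name__ == "__main__":
    cache = ToyPrefixCache()
    cache.insert([1, 2, 3, 4, 5])            # e.g. tokens of a shared system prompt
    print(cache.match_prefix([1, 2, 3, 9]))  # -> 3 tokens of KV cache reusable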

🚀 Quick Start

1. Environment Setup

We recommend using uv for a fast and reliable installation (note that uv does not conflict with conda).

# Create a virtual environment (Python 3.10+ recommended)
uv venv --python=3.12
source .venv/bin/activate

Prerequisites: Mini-SGLang relies on JIT-compiled CUDA kernels. Ensure the NVIDIA CUDA Toolkit is installed and that its version is compatible with your driver (nvidia-smi reports the highest CUDA version your driver supports).
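
If PyTorch is already installed in the environment, a quick sanity check of the CUDA setup looks like this (generic PyTorch calls, not Mini-SGLang-specific):

import torch

# CUDA version PyTorch was built against, and whether a GPU is visible.
print("torch CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))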

2. Installation

Install Mini-SGLang directly from the source:

git clone https://github.com/sgl-project/mini-sglang.git
cd mini-sglang
uv pip install -e .
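
As a quick sanity check (the module name minisgl is taken from the python -m minisgl commands in the next section), the package should import cleanly after the editable install:

import minisgl  # should succeed after `uv pip install -e .`
print(minisgl)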

3. Online Serving

Launch an OpenAI-compatible API server with a single command.

# Deploy Qwen/Qwen3-0.6B on a single GPU
python -m minisgl --model "Qwen/Qwen3-0.6B"

# Deploy meta-llama/Llama-3.1-70B-Instruct on 4 GPUs with Tensor Parallelism, on port 30000
python -m minisgl --model "meta-llama/Llama-3.1-70B-Instruct" --tp 4 --port 30000

Once the server is running, you can send requests using standard tools like curl or any OpenAI-compatible client.
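
For example, with the official openai Python client pointed at the server. This is a sketch assuming the standard OpenAI-compatible /v1 routes and the --port 30000 launch above; the api_key value is a placeholder, since local servers typically do not check it.

from openai import OpenAI

# Adjust the port to match your launch command (e.g. --port 30000 above).
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",  # whichever model the server was launched with
    messages=[{"role": "user", "content": "Explain tensor parallelism in one sentence."}],
)
print(resp.choices[0].message.content)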

4. Interactive Shell

Chat with your model directly in the terminal by adding the --shell flag.

python -m minisgl --model "Qwen/Qwen3-0.6B" --shell

[Screenshot: interactive shell session]

Inside the shell, you can use /reset to clear the chat history.

Benchmark

Offline inference

See bench.py for details. Set MINISGL_DISABLE_OVERLAP_SCHEDULING=1 to run an ablation of overlap scheduling.
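
For instance, one way to script both runs (the bare python bench.py invocation is a placeholder; check bench.py for its actual arguments):

import os
import subprocess

# Baseline: overlap scheduling enabled (the default).
subprocess.run(["python", "bench.py"], check=True)

# Ablation: disable overlap scheduling via the environment variable above.
env = {**os.environ, "MINISGL_DISABLE_OVERLAP_SCHEDULING": "1"}
subprocess.run(["python", "bench.py"], check=True, env=env)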

Test Configuration:

  • Hardware: 1xH200 GPU.
  • Models: Qwen3-0.6B, Qwen3-14B
  • Total Requests: 256 sequences
  • Input Length: Randomly sampled between 100-1024 tokens
  • Output Length: Randomly sampled between 100-1024 tokens

[Figure: offline inference benchmark results]

Online inference

See benchmark_qwen.py for more details.

Test Configuration:

  • Hardware: 4x H200 GPUs, connected via NVLink
  • Model: Qwen3-32B
  • Dataset: Qwen trace, replaying the first 1,000 requests

Launch command:

# Mini-SGLang
python -m minisgl --model "Qwen/Qwen3-32B" --tp 4 --cache naive

# SGLang
python3 -m sglang.launch_server --model "Qwen/Qwen3-32B" --tp 4 \
    --disable-radix --port 1919 --decode-attention flashinfer

[Figure: online inference benchmark results]

📚 Learn More
