m4r1k/vllm-1mtps

1 Million Tokens Per Second with vLLM on GKE

Kubernetes manifests and benchmark scripts for serving Qwen 3.5 27B at 1M+ total tokens per second on GKE Autopilot with NVIDIA B200 GPUs.

Companion repo for the Medium blog post: 1 Million Tokens Per Second: Qwen 3.5 27B on GKE with B200 GPUs

Results

Nodes  GPUs     Total tok/s  Per-node tok/s  Scaling efficiency
1      8 B200   95,317       95,317          100%
2      16 B200  190,000      95,000          99.7%
4      32 B200  376,074      94,019          98.6%
8      64 B200  740,192      92,524          97.1%
12     96 B200  1,103,941    91,995          96.5%

Benchmark: InferenceMAX methodology (input sequence length ISL=1024, output sequence length OSL=512, 0% prefix cache hit). Serving stack: vLLM v0.18.0, data parallelism DP=8 with MTP-1 speculative decoding, FP8 KV cache.
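
The per-node and scaling-efficiency columns can be rechecked from the totals alone. The sketch below assumes efficiency means per-node throughput divided by the single-node throughput; the README does not state the definition, but this one reproduces the published figures. The names `RESULTS` and `scaling_rows` are illustrative and not part of the repo.

```python
import math

# Total tok/s per node count, copied from the Results table.
RESULTS = {1: 95_317, 2: 190_000, 4: 376_074, 8: 740_192, 12: 1_103_941}


def scaling_rows(results):
    """Return (nodes, per_node_tok_s, efficiency_percent) for each node count.

    Assumption: efficiency = per-node throughput / single-node throughput.
    """
    baseline = results[1]
    rows = []
    for nodes, total in sorted(results.items()):
        per_node = total / nodes
        # Round half up so 94,018.5 is reported as 94,019, as in the table.
        per_node_rounded = math.floor(per_node + 0.5)
        efficiency = round(100 * per_node / baseline, 1)
        rows.append((nodes, per_node_rounded, efficiency))
    return rows


if __name__ == "__main__":
    for nodes, per_node, efficiency in scaling_rows(RESULTS):
        print(f"{nodes:>2} nodes: {per_node:>6,} tok/s per node, {efficiency}% efficiency")
```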

Repo structure

k8s/
  single-replica-qwen35-27b.yaml   # Single-node deployment (DP=8, MTP-1, FP8 KV)
  multi-replica-qwen35-27b.yaml    # Multi-node with ReadOnlyMany PVC
  benchmark-pod.yaml               # 16 vCPU C4 benchmark pod
  hyperdisk-ml.yaml                # StorageClass + PVC for model weights
  hyperdisk-ml-readonly.yaml       # ReadOnlyMany PVC from disk image snapshot
  inference-gateway-qwen35.yaml    # GKE Inference Gateway + HTTPRoute
  model-download-job.yaml          # Git LFS download job on C4A (Arm)
  disagg-qwen35-27b.yaml           # Disaggregated P/D manifest (experimental)
scripts/
  parallel-bench.sh                # Parallel benchmark clients (synthetic + ShareGPT)

Quick start

# 1. Create cluster
gcloud container clusters create-auto vllm-inference-cluster \
    --project="${PROJECT_ID}" \
    --region=europe-west4 \
    --release-channel=rapid

# 2. Create HF token secret
kubectl create secret generic hf-token \
    --from-literal=token="${HF_TOKEN}"

# 3. Deploy single replica
kubectl apply -f k8s/single-replica-qwen35-27b.yaml

# 4. Deploy benchmark pod
kubectl apply -f k8s/benchmark-pod.yaml

# 5. Scale out
kubectl apply -f k8s/hyperdisk-ml-readonly.yaml
kubectl apply -f k8s/multi-replica-qwen35-27b.yaml

# 6. Run parallel benchmark (16 clients x 1K concurrency)
kubectl cp scripts/parallel-bench.sh vllm-benchmark:/usr/local/bin/parallel-bench.sh
kubectl exec -it vllm-benchmark -- parallel-bench.sh qwen35-server 8000 16 1000
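
To get a feel for the load in step 6, the sketch below estimates how many requests are in flight per vLLM engine. It rests on several assumptions not confirmed by the README: that DP=8 means eight engines per node, that round-robin spreads requests evenly across engines, and that the same client settings (16 clients, 1,000 concurrent requests each) are used at every node count. The function name `in_flight_per_engine` is hypothetical.

```python
def in_flight_per_engine(clients, concurrency_per_client, nodes, engines_per_node=8):
    """Estimate total and per-engine in-flight requests.

    Assumptions: one engine per GPU (DP=8 on an 8-GPU node) and a perfectly
    even spread of requests across engines.
    """
    total_in_flight = clients * concurrency_per_client
    engines = nodes * engines_per_node
    return total_in_flight, total_in_flight / engines


if __name__ == "__main__":
    for nodes in (1, 2, 4, 8, 12):
        total, per_engine = in_flight_per_engine(16, 1000, nodes)
        print(f"{nodes:>2} nodes: {total:,} in flight, about {per_engine:.0f} per engine")
```

Under these assumptions the same 16,000 concurrent requests thin out as nodes are added, so client concurrency may need to grow with the cluster to keep every engine saturated.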

Key findings

  • DP=8 beats TP=8 for small models on large GPUs (96K vs 22K tok/s)
  • MTP-1 speculative decoding is the single biggest throughput lever (~1.9 tokens per decode step)
  • FP8 KV cache more than triples capacity (959K vs 288K tokens/engine, about 3.3x)
  • ClusterIP round-robin scales to 96.5% efficiency at 12 nodes
  • Inference Gateway adds ~35% overhead from ext_proc (single EPP bottleneck)
  • Benchmark client becomes the bottleneck before the servers do, so use parallel clients
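
The ratios behind these findings follow directly from the numbers quoted above. The sketch below works them out. The acceptance-rate estimate assumes MTP-1 proposes exactly one draft token per decode step, so the expected tokens per step is 1 plus the acceptance rate; that model is an assumption, not something the README states. All function names are illustrative.

```python
def speedup(new, old):
    """Ratio of two throughput or capacity figures."""
    return new / old


def mtp1_acceptance_rate(tokens_per_step):
    """Assumed model: one guaranteed token plus one draft token accepted with probability p."""
    return tokens_per_step - 1.0


def per_gpu_throughput(node_tok_s, gpus_per_node=8):
    """Single-node throughput divided evenly across its GPUs."""
    return node_tok_s / gpus_per_node


if __name__ == "__main__":
    print(f"DP=8 vs TP=8: {speedup(96_000, 22_000):.1f}x")
    print(f"FP8 vs baseline KV capacity: {speedup(959_000, 288_000):.2f}x")
    print(f"Implied MTP-1 draft acceptance: {mtp1_acceptance_rate(1.9):.0%}")
    print(f"Per-GPU throughput on one node: {per_gpu_throughput(95_317):,.0f} tok/s")
```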

License

MIT
