Kubernetes manifests and benchmark scripts for serving Qwen 3.5 27B at 1M+ total tokens per second on GKE Autopilot with NVIDIA B200 GPUs.
Companion repo for the Medium blog post "1 Million Tokens Per Second: Qwen 3.5 27B on GKE with B200 GPUs".
| Nodes | GPUs | Total tok/s | Per-node | Scaling efficiency |
|---|---|---|---|---|
| 1 | 8 B200 | 95,317 | 95,317 | 100% |
| 2 | 16 B200 | 190,000 | 95,000 | 99.7% |
| 4 | 32 B200 | 376,074 | 94,019 | 98.6% |
| 8 | 64 B200 | 740,192 | 92,524 | 97.1% |
| 12 | 96 B200 | 1,103,941 | 91,995 | 96.5% |
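Benchmark: InferenceMAX methodology (input sequence length ISL=1024, output sequence length OSL=512, 0% prefix cache hit rate). Serving stack: vLLM v0.18.0, data parallelism DP=8 with MTP-1 (multi-token prediction) speculative decoding and an FP8 KV cache. Scaling efficiency is per-node throughput divided by the single-node baseline of 95,317 tok/s.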
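
The per-node and scaling-efficiency columns can be recomputed from the node counts and total throughput in the table. The following is a minimal sketch, not part of the repo's scripts; the function name `scaling_efficiency` is hypothetical, and it assumes `awk` is available.

```shell
# Hypothetical helper (not part of this repo): print per-node throughput and
# scaling efficiency relative to a single-node baseline.
# Usage: scaling_efficiency <nodes> <total_tok_s> <baseline_tok_s>
scaling_efficiency() {
  awk -v n="$1" -v total="$2" -v base="$3" 'BEGIN {
    per_node = total / n
    # int(x + 0.5) rounds half up, so 94018.5 becomes 94019
    printf "%d %.1f\n", int(per_node + 0.5), 100 * per_node / base
  }'
}

BASELINE=95317  # single-node total tok/s from the table

# Each input row is: <nodes> <total tok/s>, copied from the table.
while read -r nodes total; do
  printf '%s nodes (per-node tok/s, efficiency %%): ' "$nodes"
  scaling_efficiency "$nodes" "$total" "$BASELINE"
done <<'EOF'
1 95317
2 190000
4 376074
8 740192
12 1103941
EOF
```

Efficiency here is measured against the 1-node run, so any change to the baseline configuration shifts every row.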
k8s/
single-replica-qwen35-27b.yaml # Single-node deployment (DP=8, MTP-1, FP8 KV)
multi-replica-qwen35-27b.yaml # Multi-node with ReadOnlyMany PVC
benchmark-pod.yaml # 16 vCPU C4 benchmark pod
hyperdisk-ml.yaml # StorageClass + PVC for model weights
hyperdisk-ml-readonly.yaml # ReadOnlyMany PVC from disk image snapshot
inference-gateway-qwen35.yaml # GKE Inference Gateway + HTTPRoute
model-download-job.yaml # Git LFS download job on C4A (Arm)
disagg-qwen35-27b.yaml # Disaggregated P/D manifest (experimental)
scripts/
parallel-bench.sh # Parallel benchmark clients (synthetic + ShareGPT)
# 1. Create cluster
gcloud container clusters create-auto vllm-inference-cluster \
--project="${PROJECT_ID}" \
--region=europe-west4 \
--release-channel=rapid
# 2. Create HF token secret
kubectl create secret generic hf-token \
--from-literal=token="${HF_TOKEN}"
# 3. Deploy single replica
kubectl apply -f k8s/single-replica-qwen35-27b.yaml
# 4. Deploy benchmark pod
kubectl apply -f k8s/benchmark-pod.yaml
# 5. Scale out
kubectl apply -f k8s/hyperdisk-ml-readonly.yaml
kubectl apply -f k8s/multi-replica-qwen35-27b.yaml
# 6. Run parallel benchmark (16 clients x 1K concurrency)
kubectl cp scripts/parallel-bench.sh vllm-benchmark:/usr/local/bin/parallel-bench.sh
kubectl exec -it vllm-benchmark -- parallel-bench.sh qwen35-server 8000 16 1000

- DP=8 beats TP=8 for small models on large GPUs (96K vs 22K tok/s)
- MTP-1 speculative decoding is the single biggest throughput lever (~1.9 tokens per decode step)
- FP8 KV cache triples capacity (959K vs 288K tokens/engine)
- ClusterIP round-robin scales to 96.5% efficiency at 12 nodes
- Inference Gateway adds ~35% overhead from ext_proc (single EPP bottleneck)
- Benchmark client becomes the bottleneck before servers do; use parallel clients
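
The last finding, that one benchmark client saturates before the servers do, suggests fanning out several client processes and summing their results. The sketch below illustrates that pattern only; it is not the contents of `scripts/parallel-bench.sh`. The function names `run_clients` and `sum_throughput` are hypothetical, and it assumes each client reports a single tok/s figure.

```shell
# Hypothetical fan-out helpers (illustrative; not the repo's parallel-bench.sh).

# run_clients <n> <command...>
# Start n copies of a command in the background, appending the client index
# (0..n-1) as the last argument, then wait for all of them to finish.
# Returns non-zero if any client failed.
run_clients() {
  n="$1"; shift
  i=0
  pids=""
  while [ "$i" -lt "$n" ]; do
    "$@" "$i" &
    pids="$pids $!"
    i=$((i + 1))
  done
  status=0
  for pid in $pids; do
    wait "$pid" || status=1
  done
  return "$status"
}

# sum_throughput
# Read one tok/s figure per line on stdin and print the total.
sum_throughput() {
  awk '{ s += $1 } END { printf "%d\n", s }'
}
```

A call such as `run_clients 16 ./one-client.sh` would start 16 clients, where `one-client.sh` is a placeholder for whatever command drives a single benchmark client and writes its throughput to a per-client file. Those files can then be piped through `sum_throughput` to get the aggregate.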
License: MIT