1 Million Tokens Per Second with vLLM on GKE

Kubernetes manifests and benchmark scripts for serving Qwen 3.5 27B at 1M+ total tokens per second on GKE Autopilot with NVIDIA B200 GPUs.

Companion repo for the Medium blog post: 1 Million Tokens Per Second: Qwen 3.5 27B on GKE with B200 GPUs

Results

Nodes	GPUs	Total tok/s	Per-node tok/s	Scaling efficiency
1	8 B200	95,317	95,317	100%
2	16 B200	190,000	95,000	99.7%
4	32 B200	376,074	94,019	98.6%
8	64 B200	740,192	92,524	97.1%
12	96 B200	1,103,941	91,995	96.5%
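The scaling-efficiency column can be reproduced from the other columns. The sketch below assumes efficiency is defined as per-node throughput divided by the single-node throughput; the function name and argument order are illustrative and not part of the repo.

```shell
# Scaling efficiency = (total tok/s / nodes) / single-node tok/s, as a percentage.
# scaling_efficiency is a hypothetical helper name, not a script from this repo.
scaling_efficiency() {
  local total="$1" nodes="$2" baseline="$3"
  awk -v t="$total" -v n="$nodes" -v b="$baseline" \
    'BEGIN { printf "%.1f\n", (t / n) / b * 100 }'
}

# Rows from the results table (nodes, total tok/s); the baseline is the 1-node run.
baseline=95317
while read -r nodes total; do
  printf '%2d nodes: %s%%\n' "$nodes" \
    "$(scaling_efficiency "$total" "$nodes" "$baseline")"
done <<'EOF'
1 95317
2 190000
4 376074
8 740192
12 1103941
EOF
```

Under that definition the loop prints 100.0, 99.7, 98.6, 97.1 and 96.5 percent, matching the table.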

Benchmark: InferenceMAX methodology (input sequence length ISL=1024, output sequence length OSL=512, 0% prefix cache hit rate). Serving stack: vLLM v0.18.0, DP=8 (data parallel) with MTP-1 speculative decoding and an FP8 KV cache.
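To put the headline number in per-request terms, here is a rough back-of-the-envelope sketch. It assumes that "total tok/s" counts input and output tokens together, so each request accounts for ISL + OSL = 1536 tokens; that reading is an assumption, not something this README states. The helper name is hypothetical.

```shell
# Assumption: "total tok/s" = input tokens + output tokens per second.
ISL=1024
OSL=512

# workload_rates is an illustrative helper: total tok/s in, derived rates out.
workload_rates() {
  awk -v total="$1" -v isl="$ISL" -v osl="$OSL" 'BEGIN {
    per_req = isl + osl                         # tokens accounted per request
    printf "requests/s: %.1f\n", total / per_req
    printf "output tok/s: %.0f\n", total * osl / per_req
  }'
}

# 12-node result from the table above.
workload_rates 1103941
```

If the assumption holds, the 12-node run corresponds to roughly 719 requests per second, with about one third of the token throughput being generated output.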

Repo structure

k8s/
  single-replica-qwen35-27b.yaml   # Single-node deployment (DP=8, MTP-1, FP8 KV)
  multi-replica-qwen35-27b.yaml    # Multi-node with ReadOnlyMany PVC
  benchmark-pod.yaml               # 16 vCPU C4 benchmark pod
  hyperdisk-ml.yaml                # StorageClass + PVC for model weights
  hyperdisk-ml-readonly.yaml       # ReadOnlyMany PVC from disk image snapshot
  inference-gateway-qwen35.yaml    # GKE Inference Gateway + HTTPRoute
  model-download-job.yaml          # Git LFS download job on C4A (Arm)
  disagg-qwen35-27b.yaml           # Disaggregated P/D manifest (experimental)
scripts/
  parallel-bench.sh                # Parallel benchmark clients (synthetic + ShareGPT)

Quick start

# 1. Create cluster
gcloud container clusters create-auto vllm-inference-cluster \
    --project="${PROJECT_ID}" \
    --region=europe-west4 \
    --release-channel=rapid

# 2. Create HF token secret
kubectl create secret generic hf-token \
    --from-literal=token="${HF_TOKEN}"

# 3. Deploy single replica
kubectl apply -f k8s/single-replica-qwen35-27b.yaml

# 4. Deploy benchmark pod
kubectl apply -f k8s/benchmark-pod.yaml

# 5. Scale out
kubectl apply -f k8s/hyperdisk-ml-readonly.yaml
kubectl apply -f k8s/multi-replica-qwen35-27b.yaml

# 6. Run parallel benchmark (16 clients x 1K concurrency)
kubectl cp scripts/parallel-bench.sh vllm-benchmark:/usr/local/bin/parallel-bench.sh
kubectl exec -it vllm-benchmark -- parallel-bench.sh qwen35-server 8000 16 1000
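Steps 4 and 6 can race if the benchmark pod or the server is not up yet. A minimal sketch of two guard helpers follows. The pod name vllm-benchmark and the qwen35-server:8000 target are taken from step 6; the function names, the default timeout, the presence of curl in the benchmark image, and the use of vLLM's /health endpoint as the readiness signal are all assumptions.

```shell
# Block until the benchmark pod reports Ready.
# Hypothetical helper; the 600s default timeout is an arbitrary choice.
wait_for_benchmark_pod() {
  local timeout="${1:-600s}"
  kubectl wait --for=condition=Ready pod/vllm-benchmark --timeout="$timeout"
}

# Probe the server from inside the benchmark pod.
# Assumes curl exists in the image and that qwen35-server:8000 is the Service
# passed to parallel-bench.sh in step 6.
check_server_health() {
  kubectl exec vllm-benchmark -- curl -sf http://qwen35-server:8000/health
}
```

Usage, before step 6: `wait_for_benchmark_pod && check_server_health`. Both helpers return a non-zero status on failure, so they can gate the benchmark run in a script.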

Key findings

DP=8 beats TP=8 for small models on large GPUs (96K vs 22K tok/s)
MTP-1 speculative decoding is the single biggest throughput lever (~1.9 tokens per decode step)
FP8 KV cache more than triples KV capacity (959K vs 288K tokens/engine, about 3.3x)
ClusterIP round-robin scales to 96.5% efficiency at 12 nodes
Inference Gateway adds ~35% overhead from ext_proc (single EPP bottleneck)
The benchmark client becomes the bottleneck before the servers do, so use parallel clients
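The ratios behind the first three findings can be worked out from the quoted figures. This is a sketch under one assumption: that "~1.9 tokens per decode step" is the mean number of tokens emitted per step with a single drafted token, so each step yields one guaranteed token plus the draft when it is accepted, giving an expected 1 + a tokens for acceptance rate a. The function name is illustrative.

```shell
# Derived ratios from the key findings (inputs are the rounded figures quoted above).
finding_ratios() {
  awk 'BEGIN {
    printf "DP=8 vs TP=8 speedup:   %.1fx\n", 96 / 22
    printf "FP8 KV capacity ratio:  %.2fx\n", 959 / 288
    # MTP-1: expected tokens per step = 1 + a, so a = 1.9 - 1
    printf "Implied MTP acceptance: %.0f%%\n", (1.9 - 1) * 100
  }'
}

finding_ratios
```

Under that reading, the drafted token is accepted about 90% of the time, which is why a single speculative token nearly doubles decode throughput per step.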

License

MIT
