GPU-accelerated NVIDIA NIM inference on Google Kubernetes Engine
Production-grade reference implementation for deploying NVIDIA NIM microservices on GKE with L4 GPUs, autoscaling, and cost optimization.
Based on: Google Codelabs - Deploy an AI model on GKE with NVIDIA NIM
This repository extends the official Google Codelabs tutorial with production-grade enhancements:
Operational Excellence:
- Comprehensive error handling (`set -euo pipefail` in all scripts)
- Idempotent operations (safe to run multiple times)
- 60-minute deployment monitoring script
- Troubleshooting runbook (465 lines, 6 failure modes)
- Cost tracking and optimization strategies
Automation & Testing:
- Environment validation script (prerequisites, quotas, NGC key)
- Integration test suite with load testing
- CI/CD validation (shellcheck, yamllint, security scanning)
- Automated cleanup with verification
Production Features:
- Autoscaling GPU node pool (0-2 nodes)
- Cost optimization (~$1.36/hour only while active, vs. the tutorial's always-on deployment)
- Persistent volume for model caching (faster restarts)
- Resource limits and requests defined
- Production Helm values configuration
Documentation:
- Architecture deep-dive (517 lines: GPU memory layout, autoscaling mechanics)
- Interview preparation guide (357 lines: design decisions, talking points)
- Operational runbooks (troubleshooting, monitoring, incident response)
- Quick reference guide (one-page ops commands)
- Script documentation (usage, security, examples)
Developer Experience:
- Structured repository (charts, scripts, docs, runbooks separated)
- GitHub templates (PR, issues)
- Contributing guidelines
- Verification script (validates complete setup)
Tutorial Compatibility: All core deployment steps from the Google Codelabs tutorial are preserved and enhanced, not replaced.
NIM container → TensorRT-LLM → vLLM backend → L4 GPU → GKE node pool
Components:
- Model: Meta Llama 3 8B Instruct
- Runtime: NVIDIA NIM 1.0.0 (TensorRT-LLM + vLLM)
- Orchestration: Kubernetes StatefulSet + Helm
- Compute: GKE with g2-standard-4 nodes (L4 GPU, 24GB VRAM)
- API: OpenAI-compatible REST (`/v1/chat/completions`)
Autoscaling: GPU node pool scales 0→2 based on pod requests.
Cost: ~$1.36/hour when active. $0/hour when scaled to zero.
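The scale-to-zero economics come from cluster autoscaler bounds on the GPU node pool. A minimal sketch of how such a pool is created (the deploy script handles this in practice; names mirror the rest of this README):

```bash
# Sketch only: deploy_nim_gke.sh provisions the pool for you.
# --min-nodes=0 lets the autoscaler delete every GPU node when no pod
# requests nvidia.com/gpu; --max-nodes=2 caps spend during bursts.
gcloud container node-pools create gpupool \
  --cluster=nim-demo \
  --zone=us-central1-a \
  --machine-type=g2-standard-4 \
  --accelerator=type=nvidia-l4,count=1 \
  --enable-autoscaling \
  --min-nodes=0 \
  --max-nodes=2 \
  --num-nodes=0
```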
| Requirement | Version | Purpose |
|---|---|---|
| `gcloud` CLI | Latest | GCP authentication, cluster management |
| `kubectl` | 1.28+ | Kubernetes operations |
| `helm` | 3.0+ | Chart deployment |
| NGC API Key | — | NIM image registry auth |
| GCP Project | — | Billing enabled |
| GPU Quota | 1× L4 | us-central1 or compatible region |
GPU quota approval: Required before deployment. See /docs/GPU_QUOTA_GUIDE.md.
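Before deploying, the regional quota can be checked from the CLI; a sketch, assuming `jq` is installed (the quota metric name `NVIDIA_L4_GPUS` is an assumption to verify against your own output):

```bash
# Print the L4 GPU quota entry for the target region.
gcloud compute regions describe us-central1 --format=json \
  | jq '.quotas[] | select(.metric == "NVIDIA_L4_GPUS")'
```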
```bash
# 1. Set NGC API key
export NGC_CLI_API_KEY='your-key-here'

# 2. Configure project
export PROJECT_ID="your-gcp-project"
export REGION="us-central1"
export ZONE="us-central1-a"

# 3. Deploy
./scripts/deploy_nim_gke.sh

# 4. Verify
kubectl get pods -n nim
kubectl port-forward service/my-nim-nim-llm 8000:8000 -n nim
```
```bash
# Validate environment
./scripts/setup_environment.sh

# Deploy with production values
./scripts/deploy_nim_production.sh

# Run integration tests
./scripts/test_nim_production.sh
```

Expected duration: 25-35 minutes (cluster creation + model loading).
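If you script around the deployment, you can block until the pod is Ready rather than polling by hand; a sketch using `kubectl wait` (the 30-minute timeout allows for a cold model cache):

```bash
# Returns once the NIM pod passes its readiness probe
# (model downloaded and engine loaded), or fails after 30 minutes.
kubectl wait --for=condition=Ready pod/my-nim-nim-llm-0 -n nim --timeout=30m
```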
```bash
# Health check
curl http://localhost:8000/v1/health/ready

# List models
curl http://localhost:8000/v1/models

# Inference test
curl -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [{"role": "user", "content": "What is TensorRT?"}],
    "model": "meta/llama3-8b-instruct",
    "max_tokens": 100
  }'
```

Expected response time: 3-6 seconds.
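Since the endpoint is OpenAI-compatible, token streaming should also work by adding `"stream": true`; a sketch (not part of the test suite):

```bash
# -N disables curl's buffering so server-sent events print as they arrive.
curl -N -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [{"role": "user", "content": "What is TensorRT?"}],
    "model": "meta/llama3-8b-instruct",
    "max_tokens": 100,
    "stream": true
  }'
```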
```bash
# Pod status
kubectl get pods -n nim -w

# Logs
kubectl logs -f my-nim-nim-llm-0 -n nim

# GPU utilization
kubectl exec -n nim my-nim-nim-llm-0 -- nvidia-smi

# Resource usage
kubectl top pod -n nim
```
```bash
# Manual scale (StatefulSet)
kubectl scale statefulset my-nim-nim-llm --replicas=2 -n nim

# GPU node pool resize
gcloud container node-pools resize gpupool \
  --cluster=nim-demo \
  --zone=us-central1-a \
  --num-nodes=2
```
```bash
# Remove deployment (keep cluster)
helm uninstall my-nim -n nim
# GPU nodes auto-scale to 0

# Delete cluster (stop all costs)
./scripts/cleanup.sh
```
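After cleanup, a quick billing sanity check never hurts; a sketch (cleanup.sh performs its own verification):

```bash
# Both lists should be empty once the cluster and its disks are gone.
gcloud container clusters list --filter="zone:us-central1-a"
gcloud compute disks list --filter="zone:us-central1-a"
```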
Pod stuck in Pending:

```bash
kubectl describe pod -n nim my-nim-nim-llm-0
# Check: GPU availability, node readiness, quotas
```
ImagePullBackOff:

```bash
# Verify NGC secret
kubectl get secret ngc-api -n nim -o yaml

# Recreate if needed
kubectl delete secret ngc-api -n nim
kubectl create secret generic ngc-api \
  --from-literal=NGC_API_KEY=$NGC_CLI_API_KEY \
  -n nim
```
Model loading slow:
- Expected: 10-15 minutes on first deployment
- Monitor: `kubectl logs -f my-nim-nim-llm-0 -n nim`
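Restarts should be much faster than the first deployment because the model is cached on the persistent volume; a quick check that the cache survived (a sketch):

```bash
# The PVC should stay Bound across pod restarts; if it was deleted,
# the next start re-downloads the model (see Limitations below).
kubectl get pvc -n nim
```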
See /runbooks/troubleshooting.md for complete procedures.
```
nim-gke/
├── charts/                      # Helm charts and values
│   ├── nim-llm-1.3.0.tgz        # NVIDIA NIM chart
│   └── values-production.yaml   # Production config
├── scripts/                     # Deployment and ops scripts
│   ├── deploy_nim_gke.sh        # Main deployment
│   ├── setup_environment.sh     # Prerequisite validation
│   ├── test_nim_production.sh   # Integration tests
│   ├── cleanup.sh               # Resource deletion
│   └── monitor_deployment.sh    # Status monitoring
├── docs/                        # Documentation
│   ├── DEPLOYMENT_SUCCESS.md    # Deployment guide
│   ├── PRODUCTION_GUIDE.md      # Operations manual
│   ├── GPU_QUOTA_GUIDE.md       # Quota request process
│   └── interview/               # Interview preparation materials
├── runbooks/                    # Operational procedures
│   └── troubleshooting.md       # Incident response
├── examples/                    # Configuration templates
│   └── set_ngc_key.sh.template  # NGC key setup
└── README.md                    # This file
```
Edit `charts/values-production.yaml`:

```yaml
image:
  repository: "nvcr.io/nim/meta/llama3-8b-instruct"
  tag: "1.0.0"
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
persistence:
  enabled: true
  size: 50Gi
```

| Variable | Default | Purpose |
|---|---|---|
| `PROJECT_ID` | `your-gcp-project` | GCP project |
| `REGION` | `us-central1` | GCP region |
| `ZONE` | `us-central1-a` | GKE zone |
| `CLUSTER_NAME` | `nim-demo` | Cluster identifier |
| `GPU_TYPE` | `nvidia-l4` | GPU accelerator type |
| `NODE_POOL_MACHINE_TYPE` | `g2-standard-4` | Node instance type |
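The deployment scripts read these variables from the environment, so overriding a default is a matter of exporting it first; a sketch with illustrative values:

```bash
# Any variable left unset falls back to the defaults in the table above.
export PROJECT_ID="my-project"     # illustrative project ID
export CLUSTER_NAME="nim-prod"     # illustrative cluster name
./scripts/deploy_nim_gke.sh
```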
| Metric | Value | Notes |
|---|---|---|
| First token latency | 2-3s | Cold start |
| Throughput | 15-20 tokens/s | L4 GPU, FP16 |
| Batch size | Dynamic | vLLM continuous batching |
| Context length | 8192 tokens | Llama 3 limit |
| GPU memory | ~12GB used | Of 24GB available |
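A rough way to sanity-check these numbers against a live deployment (a sketch; assumes the port-forward from the quick start is still active):

```bash
# Wall-clock time for one short completion; expect a few seconds
# per the table above (longer on the first request after scale-up).
time curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Hi"}],"model":"meta/llama3-8b-instruct","max_tokens":50}' \
  > /dev/null
```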
Baseline (no load):
- Control plane: $0.13/hour
- Total: $0.13/hour
Active (1 GPU node):
- Control plane: $0.13/hour
- GPU node (g2-standard-4): $0.50/hour
- L4 GPU: $0.73/hour
- Total: $1.36/hour (~$980/month)
Optimization strategies:
- Autoscaling to zero when idle
- Preemptible/Spot nodes (-80% cost, tolerates interruption; see the sketch below)
- Committed use discounts (-37% for 3-year)
- Regional vs. zonal deployment tradeoffs
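For the Spot/preemptible option, the GPU pool can be recreated with the `--spot` flag; a sketch (pool name is illustrative, and Spot nodes can be reclaimed at any time, dropping in-flight requests):

```bash
# Same shape as the on-demand pool, but billed at Spot rates; the
# StatefulSet reschedules onto a replacement node after reclamation.
gcloud container node-pools create gpupool-spot \
  --cluster=nim-demo \
  --zone=us-central1-a \
  --machine-type=g2-standard-4 \
  --accelerator=type=nvidia-l4,count=1 \
  --enable-autoscaling --min-nodes=0 --max-nodes=2 \
  --num-nodes=0 \
  --spot
```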
- ✅ NGC API key stored as Kubernetes Secret
- ✅ Image pull secrets for nvcr.io registry
- ✅ Service exposed via ClusterIP (internal only)
- ✅ TLS for production (configure Ingress + cert-manager; see the sketch below)
- ⚠️ Authentication: the service itself is unauthenticated; implement an API gateway for production workloads
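For the TLS item, one common path is cert-manager plus an Ingress in front of the ClusterIP service; a sketch of the install step (chart source and flags are standard cert-manager usage, not part of this repo's scripts):

```bash
# Install cert-manager; an Ingress for the my-nim-nim-llm service can then
# reference a Certificate that cert-manager issues and renews.
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --set installCRDs=true
```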
- Single GPU: Multi-GPU tensor parallelism requires code changes
- Model size: Llama 3 8B fits L4. Larger models need A100/H100
- Persistence: Model cached on PV. Deletion triggers re-download
- Regional availability: L4 not in all GCP zones
- Google Codelabs - Deploy AI on GKE with NVIDIA NIM - Original tutorial this repository is based on
- NVIDIA NIM Documentation - Official NIM microservices documentation
- GKE GPU Guide - Google Cloud GPU setup and configuration
- TensorRT-LLM - NVIDIA's optimized inference engine (FP16 precision, fused kernels)
- vLLM - High-throughput LLM serving framework (continuous batching, PagedAttention)
- Kubernetes - Container orchestration platform
- Helm - Kubernetes package manager
- NVIDIA AI Enterprise - Enterprise AI software platform
- GCP GPU Regions - GPU availability by region
- Llama 3 Model Card - Model documentation
Provided as-is for educational and reference purposes. NVIDIA NIM requires acceptance of NVIDIA AI Enterprise EULA.
Status: Production-ready ✅
Last validated: October 2025
GKE version: 1.34+
NIM version: 1.0.0