GPU-accelerated NVIDIA NIM inference on Google Kubernetes Engine
Production-grade reference implementation for deploying NVIDIA NIM microservices on GKE with L4 GPUs, autoscaling, and cost optimization.
Based on: Google Codelabs - Deploy an AI model on GKE with NVIDIA NIM
This repository extends the official Google Codelabs tutorial with production-grade enhancements:
Operational Excellence:
- Comprehensive error handling (`set -euo pipefail` in all scripts)
- Idempotent operations (safe to run multiple times)
- 60-minute deployment monitoring script
- Troubleshooting runbook (465 lines, 6 failure modes)
- Cost tracking and optimization strategies
Automation & Testing:
- Environment validation script (prerequisites, quotas, NGC key)
- Integration test suite with load testing
- CI/CD validation (shellcheck, yamllint, security scanning)
- Automated cleanup with verification
Production Features:
- Autoscaling GPU node pool (0-2 nodes)
- Cost optimization (~$1.36/hour only while active, vs. the tutorial's always-on deployment)
- Persistent volume for model caching (faster restarts)
- Resource limits and requests defined
- Production Helm values configuration
Documentation:
- Architecture deep-dive (517 lines: GPU memory layout, autoscaling mechanics)
- Interview preparation guide (357 lines: design decisions, talking points)
- Operational runbooks (troubleshooting, monitoring, incident response)
- Quick reference guide (one-page ops commands)
- Script documentation (usage, security, examples)
Developer Experience:
- Structured repository (charts, scripts, docs, runbooks separated)
- GitHub templates (PR, issues)
- Contributing guidelines
- Verification script (validates complete setup)
Tutorial Compatibility: All core deployment steps from the Google Codelabs tutorial are preserved and enhanced, not replaced.
NIM container → TensorRT-LLM → vLLM backend → L4 GPU → GKE node pool
Components:
- Model: Meta Llama 3 8B Instruct
- Runtime: NVIDIA NIM 1.0.0 (TensorRT-LLM + vLLM)
- Orchestration: Kubernetes StatefulSet + Helm
- Compute: GKE with g2-standard-4 nodes (L4 GPU, 24GB VRAM)
- API: OpenAI-compatible REST (`/v1/chat/completions`)
Autoscaling: GPU node pool scales 0→2 based on pod requests.
Cost: ~$1.36/hour when active. $0/hour when scaled to zero.
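The scale-to-zero economics come from cluster autoscaler bounds on the GPU node pool. A minimal sketch of how such a pool is created (the deploy script handles this in practice; names mirror the rest of this README):

```bash
# Sketch only: deploy_nim_gke.sh provisions the pool for you.
# --min-nodes=0 lets the autoscaler delete every GPU node when no pod
# requests nvidia.com/gpu; --max-nodes=2 caps spend during bursts.
gcloud container node-pools create gpupool \
  --cluster=nim-demo \
  --zone=us-central1-a \
  --machine-type=g2-standard-4 \
  --accelerator=type=nvidia-l4,count=1 \
  --enable-autoscaling \
  --min-nodes=0 \
  --max-nodes=2 \
  --num-nodes=0
```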
| Requirement | Version | Purpose |
|---|---|---|
| `gcloud` CLI | Latest | GCP authentication, cluster management |
| `kubectl` | 1.28+ | Kubernetes operations |
| `helm` | 3.0+ | Chart deployment |
| NGC API Key | — | NIM image registry auth |
| GCP Project | — | Billing enabled |
| GPU Quota | 1× L4 | us-central1 or compatible region |
GPU quota approval: Required before deployment. See /docs/GPU_QUOTA_GUIDE.md.
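Before deploying, the regional quota can be checked from the CLI; a sketch, assuming `jq` is installed (the quota metric name `NVIDIA_L4_GPUS` is an assumption to verify against your own output):

```bash
# Print the L4 GPU quota entry for the target region.
gcloud compute regions describe us-central1 --format=json \
  | jq '.quotas[] | select(.metric == "NVIDIA_L4_GPUS")'
```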
```bash
# 1. Set NGC API key
export NGC_CLI_API_KEY='your-key-here'

# 2. Configure project
export PROJECT_ID="your-gcp-project"
export REGION="us-central1"
export ZONE="us-central1-a"

# 3. Deploy
./scripts/deploy_nim_gke.sh

# 4. Verify
kubectl get pods -n nim
kubectl port-forward service/my-nim-nim-llm 8000:8000 -n nim
```
```bash
# Validate environment
./scripts/setup_environment.sh

# Deploy with production values
./scripts/deploy_nim_production.sh

# Run integration tests
./scripts/test_nim_production.sh
```

Expected duration: 25-35 minutes (cluster creation + model loading).
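If you script around the deployment, you can block until the pod is Ready rather than polling by hand; a sketch using `kubectl wait` (the 30-minute timeout allows for a cold model cache):

```bash
# Returns once the NIM pod passes its readiness probe
# (model downloaded and engine loaded), or fails after 30 minutes.
kubectl wait --for=condition=Ready pod/my-nim-nim-llm-0 -n nim --timeout=30m
```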
```bash
# Health check
curl http://localhost:8000/v1/health/ready

# List models
curl http://localhost:8000/v1/models

# Inference test
curl -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [{"role": "user", "content": "What is TensorRT?"}],
    "model": "meta/llama3-8b-instruct",
    "max_tokens": 100
  }'
```

Expected response time: 3-6 seconds.
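Since the endpoint is OpenAI-compatible, token streaming should also work by adding `"stream": true`; a sketch (not part of the test suite):

```bash
# -N disables curl's buffering so server-sent events print as they arrive.
curl -N -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [{"role": "user", "content": "What is TensorRT?"}],
    "model": "meta/llama3-8b-instruct",
    "max_tokens": 100,
    "stream": true
  }'
```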
```bash
# Pod status
kubectl get pods -n nim -w

# Logs
kubectl logs -f my-nim-nim-llm-0 -n nim

# GPU utilization
kubectl exec -n nim my-nim-nim-llm-0 -- nvidia-smi

# Resource usage
kubectl top pod -n nim
```
```bash
# Manual scale (StatefulSet)
kubectl scale statefulset my-nim-nim-llm --replicas=2 -n nim

# GPU node pool resize
gcloud container node-pools resize gpupool \
  --cluster=nim-demo \
  --zone=us-central1-a \
  --num-nodes=2
```
```bash
# Remove deployment (keep cluster)
helm uninstall my-nim -n nim
# GPU nodes auto-scale to 0

# Delete cluster (stop all costs)
./scripts/cleanup.sh
```
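After cleanup, a quick billing sanity check never hurts; a sketch (cleanup.sh performs its own verification):

```bash
# Both lists should be empty once the cluster and its disks are gone.
gcloud container clusters list --filter="zone:us-central1-a"
gcloud compute disks list --filter="zone:us-central1-a"
```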
Pod stuck in Pending:

```bash
kubectl describe pod -n nim my-nim-nim-llm-0
# Check: GPU availability, node readiness, quotas
```
ImagePullBackOff:

```bash
# Verify NGC secret
kubectl get secret ngc-api -n nim -o yaml

# Recreate if needed
kubectl delete secret ngc-api -n nim
kubectl create secret generic ngc-api \
  --from-literal=NGC_API_KEY=$NGC_CLI_API_KEY \
  -n nim
```
Model loading slow:
- Expected: 10-15 minutes on first deployment
- Monitor: `kubectl logs -f my-nim-nim-llm-0 -n nim`
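Restarts should be much faster than the first deployment because the model is cached on the persistent volume; a quick check that the cache survived (a sketch):

```bash
# The PVC should stay Bound across pod restarts; if it was deleted,
# the next start re-downloads the model (see Limitations below).
kubectl get pvc -n nim
```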
See /runbooks/troubleshooting.md for complete procedures.
```
nim-gke/
├── charts/                      # Helm charts and values
│   ├── nim-llm-1.3.0.tgz        # NVIDIA NIM chart
│   └── values-production.yaml   # Production config
├── scripts/                     # Deployment and ops scripts
│   ├── deploy_nim_gke.sh        # Main deployment
│   ├── setup_environment.sh     # Prerequisite validation
│   ├── test_nim_production.sh   # Integration tests
│   ├── cleanup.sh               # Resource deletion
│   └── monitor_deployment.sh    # Status monitoring
├── docs/                        # Documentation
│   ├── DEPLOYMENT_SUCCESS.md    # Deployment guide
│   ├── PRODUCTION_GUIDE.md      # Operations manual
│   ├── GPU_QUOTA_GUIDE.md       # Quota request process
│   └── interview/               # Interview preparation materials
├── runbooks/                    # Operational procedures
│   └── troubleshooting.md       # Incident response
├── examples/                    # Configuration templates
│   └── set_ngc_key.sh.template  # NGC key setup
└── README.md                    # This file
```
Edit `charts/values-production.yaml`:

```yaml
image:
  repository: "nvcr.io/nim/meta/llama3-8b-instruct"
  tag: "1.0.0"
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
persistence:
  enabled: true
  size: 50Gi
```

| Variable | Default | Purpose |
|---|---|---|
| `PROJECT_ID` | `your-gcp-project` | GCP project |
| `REGION` | `us-central1` | GCP region |
| `ZONE` | `us-central1-a` | GKE zone |
| `CLUSTER_NAME` | `nim-demo` | Cluster identifier |
| `GPU_TYPE` | `nvidia-l4` | GPU accelerator type |
| `NODE_POOL_MACHINE_TYPE` | `g2-standard-4` | Node instance type |
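The deployment scripts read these variables from the environment, so overriding a default is a matter of exporting it first; a sketch with illustrative values:

```bash
# Any variable left unset falls back to the defaults in the table above.
export PROJECT_ID="my-project"     # illustrative project ID
export CLUSTER_NAME="nim-prod"     # illustrative cluster name
./scripts/deploy_nim_gke.sh
```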
| Metric | Value | Notes |
|---|---|---|
| First token latency | 2-3s | Cold start |
| Throughput | 15-20 tokens/s | L4 GPU, FP16 |
| Batch size | Dynamic | vLLM continuous batching |
| Context length | 8192 tokens | Llama 3 limit |
| GPU memory | ~12GB used | Of 24GB available |
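A rough way to sanity-check these numbers against a live deployment (a sketch; assumes the port-forward from the quick start is still active):

```bash
# Wall-clock time for one short completion; expect a few seconds
# per the table above (longer on the first request after scale-up).
time curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Hi"}],"model":"meta/llama3-8b-instruct","max_tokens":50}' \
  > /dev/null
```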
Baseline (no load):
- Control plane: $0.13/hour
- Total: $0.13/hour
Active (1 GPU node):
- Control plane: $0.13/hour
- GPU node (g2-standard-4): $0.50/hour
- L4 GPU: $0.73/hour
- Total: $1.36/hour (~$980/month)
Optimization strategies:
- Autoscaling to zero when idle
- Preemptible/Spot nodes (-80% cost, tolerates interruption; see the sketch below)
- Committed use discounts (-37% for 3-year)
- Regional vs. zonal deployment tradeoffs
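For the Spot/preemptible option, the GPU pool can be recreated with the `--spot` flag; a sketch (pool name is illustrative, and Spot nodes can be reclaimed at any time, dropping in-flight requests):

```bash
# Same shape as the on-demand pool, but billed at Spot rates; the
# StatefulSet reschedules onto a replacement node after reclamation.
gcloud container node-pools create gpupool-spot \
  --cluster=nim-demo \
  --zone=us-central1-a \
  --machine-type=g2-standard-4 \
  --accelerator=type=nvidia-l4,count=1 \
  --enable-autoscaling --min-nodes=0 --max-nodes=2 \
  --num-nodes=0 \
  --spot
```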
- ✅ NGC API key stored as Kubernetes Secret
- ✅ Image pull secrets for nvcr.io registry
- ✅ Service exposed via ClusterIP (internal only)
- ✅ TLS for production (configure Ingress + cert-manager; see the sketch below)
- ⚠️ Authentication: the service itself is unauthenticated; implement an API gateway for production workloads
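For the TLS item, one common path is cert-manager plus an Ingress in front of the ClusterIP service; a sketch of the install step (chart source and flags are standard cert-manager usage, not part of this repo's scripts):

```bash
# Install cert-manager; an Ingress for the my-nim-nim-llm service can then
# reference a Certificate that cert-manager issues and renews.
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --set installCRDs=true
```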
- Single GPU: Multi-GPU tensor parallelism requires code changes
- Model size: Llama 3 8B fits L4. Larger models need A100/H100
- Persistence: Model cached on PV. Deletion triggers re-download
- Regional availability: L4 not in all GCP zones
- Google Codelabs - Deploy AI on GKE with NVIDIA NIM - Original tutorial this repository is based on
- NVIDIA NIM Documentation - Official NIM microservices documentation
- GKE GPU Guide - Google Cloud GPU setup and configuration
- TensorRT-LLM - NVIDIA's optimized inference engine (FP16 precision, fused kernels)
- vLLM - High-throughput LLM serving framework (continuous batching, PagedAttention)
- Kubernetes - Container orchestration platform
- Helm - Kubernetes package manager
- NVIDIA AI Enterprise - Enterprise AI software platform
- GCP GPU Regions - GPU availability by region
- Llama 3 Model Card - Model documentation
Provided as-is for educational and reference purposes. NVIDIA NIM requires acceptance of NVIDIA AI Enterprise EULA.
Status: Production-ready ✅
Last validated: October 2025
GKE version: 1.34+
NIM version: 1.0.0