A production-grade, tutorial-rich automation suite for building GPU-ready Kubernetes clusters from scratch. Start with ARM edge devices (NVIDIA Jetson Orin, Raspberry Pi) and scale seamlessly to cloud infrastructure for advanced ML/HPC workloads.
Project Repository: github.com/christimahu/k8s-mlops
Author: Christi Mahu — christimahu.dev
License: GNU General Public License v3.0
This isn't just infrastructure-as-code — it's a guided learning experience. Every script includes extensive inline documentation explaining:
- What is being installed and configured
- Why each step is necessary
- How the components work together
- When to use specific options or alternatives
- Troubleshooting guidance for common issues
The goal is to demystify production Kubernetes from bare metal to cloud, making it accessible to learners while providing the robustness needed for serious workloads.
- Deploy NVIDIA Jetson Orin clusters for GPU-accelerated inference at the edge
- Run distributed PyTorch/TensorFlow training across multiple GPU nodes
- Run Knative serverless functions that scale to zero (critical for power-constrained edge)
- Real-time computer vision pipelines with direct GPU access
- Infrastructure patterns proven to scale from 3 nodes to 100+
- Same automation works on AWS/GCP/Azure bare metal instances
- Multi-cluster federation for geo-distributed workloads
- GitOps-ready for enterprise CI/CD pipelines
- Train models in the cloud, deploy optimized inference to edge devices
- Federated learning across edge clusters
- Workload bursting from edge to cloud during peak demand
- Disaster recovery with cloud failover
- Models: Jetson Orin Nano, NX, AGX
- Architecture: ARM64 with NVIDIA Ampere GPU
- Use Case: GPU-accelerated ML inference and training at the edge
- Setup Scripts: Complete automation in jetson_orin/
- Cluster Recommendation: 3–10 nodes optimal, scales to 20+
- Power: 5–25W per node (ideal for edge/remote deployments)
- Models: Raspberry Pi 4 (8GB), Raspberry Pi 5
- Architecture: ARM64 (no GPU acceleration)
- Use Case: Control plane, lightweight workers, support services
- Setup Scripts: Coming soon in raspberry_pi/
- Cluster Recommendation: Mix with Jetson for cost-effective scaling
- Power: 5–15W per node
- Platforms: AWS EC2 (bare metal), GCP Compute, Azure VMs, on-prem servers
- Architecture: x86-64 with optional NVIDIA datacenter GPUs
- Use Case: Cloud bursting, training large models, production scale-out
- Setup Scripts: Same Kubernetes scripts work across architectures
- Cluster Recommendation: 10–1000+ nodes
.
├── jetson_orin/                      # Jetson Orin hardware setup (GPU nodes)
│   ├── setup/                        # Sequential setup scripts (01-06)
│   ├── tools/                        # Diagnostics and recovery tools
│   └── README.md                     # Complete Jetson setup guide
├── raspberry_pi/                     # Raspberry Pi setup (coming soon)
├── k8s/                              # Kubernetes cluster installation
│   ├── 01_install_deps.sh            # Container runtime and kernel config
│   ├── 02_install_kube.sh            # Kubernetes components (kubeadm, kubelet, kubectl)
│   ├── 03_bootstrap_cluster/         # Cluster initialization with CNI selection
│   ├── 04_install_ingress_nginx.sh   # NGINX Ingress Controller
│   ├── 05_install_cert_manager.sh    # Automatic TLS certificate management
│   ├── 06_install_service_mesh/      # Optional Istio or Linkerd
│   ├── addons/                       # Additional cluster features (Knative, etc.)
│   ├── deployments/                  # Example Kubernetes manifests
│   ├── ops/                          # Operational scripts (join nodes, drain, etc.)
│   └── tools/                        # Cluster tools (ArgoCD, Prometheus, Chaos Mesh, etc.)
├── support_components/               # Supporting infrastructure
│   ├── 01_tls/                       # TLS certificate authority and management
│   └── 02_components/                # Docker registry, Gitea Git server
├── utils/                            # Development utilities (Neovim, etc.)
├── LICENSE                           # GNU GPL v3.0
└── README.md                         # This file
Direct links to README files referenced above:
- jetson_orin/README.md → https://github.com/christimahu/k8s-mlops/blob/main/jetson_orin/README.md
- raspberry_pi/README.md → https://github.com/christimahu/k8s-mlops/blob/main/raspberry_pi/README.md (when available)
- Root README.md → https://github.com/christimahu/k8s-mlops/blob/main/README.md
Choose your hardware platform and complete the setup:
For NVIDIA Jetson Orin (GPU Nodes):
cd jetson_orin
# Follow jetson_orin/README.md for complete walkthrough
Jetson Orin Setup Guide — Headless configuration, SSD migration, OS optimization
For Raspberry Pi (Coming Soon):
cd raspberry_pi
# Follow raspberry_pi/README.md when available
After hardware setup is complete on all nodes:
cd k8s
Kubernetes Setup Guide — Complete cluster deployment workflow
Add production-ready supporting services:
cd support_components
Support Components Guide — TLS certificates, private registry, Git server
On each Jetson Orin device (or cloud instance):
cd jetson_orin/setup
# Run scripts sequentially (01 → 06)
sudo ./01_config_headless.sh # Configure for SSH, disable GUI
# Reboot, continue via SSH
sudo ./02_clone_os_to_ssd.sh # Migrate to SSD (optional but recommended)
sudo ./03_set_boot_to_ssd.sh # Configure SSD boot
sudo reboot
sudo ./04_strip_microsd_rootfs.sh # Security hardening
sudo ./05_update_os.sh # System updates
sudo ./06_verify_setup.sh # Validate configuration
Why this order matters: Each script builds on the previous one. Script 06 verifies the complete setup, not individual steps.
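If you want to spot-check the SSD migration yourself before trusting script 06, a couple of standard commands confirm the root filesystem actually moved off the microSD. A minimal sketch; device names are typical for Orin hardware and may differ on your unit:

# Confirm the root filesystem is mounted from the NVMe SSD, not the microSD
findmnt -n -o SOURCE /           # expect something like /dev/nvme0n1p1

# List block devices and where each partition is mounted
lsblk -o NAME,SIZE,MOUNTPOINT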
On every node (control plane and workers):
cd k8s
sudo ./01_install_deps.sh # Container runtime, kernel modules
sudo ./02_install_kube.sh # kubelet, kubeadm, kubectl
On your designated control plane node:
cd k8s/03_bootstrap_cluster
sudo ./bootstrap.sh # Interactive CNI selection and cluster init
What happens:
- Interactive menu to choose CNI (Calico, Flannel, or Weave)
- Cluster initialization with appropriate pod network CIDR
- Automatic CNI deployment
- Generation of worker join command
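For orientation, the interactive bootstrap roughly corresponds to the standard kubeadm workflow. A minimal sketch, assuming Calico's default 192.168.0.0/16 pod CIDR (the script's exact flags may differ):

# Roughly what bootstrap.sh automates (sketch, not the script's exact flags)
sudo kubeadm init --pod-network-cidr=192.168.0.0/16

# Make kubectl work for your regular user
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config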
CRITICAL: Save the join command displayed at the end! Example:
sudo kubeadm join 192.168.1.100:6443 --token abc123.def456 \
--discovery-token-ca-cert-hash sha256:1234567890abcdef...
On each worker node:
cd k8s/ops
sudo ./join_node.sh # Prompts for join command from Step 3
Essential addons (on control plane):
cd k8s
sudo ./04_install_ingress_nginx.sh # External HTTP/HTTPS access
sudo ./05_install_cert_manager.sh # Automatic TLS certificates
Optional but recommended:
Service mesh for advanced traffic management and observability
cd k8s/06_install_service_mesh
sudo ./install.sh # Interactive: Istio, Linkerd, or skip
Serverless functions with scale-to-zero (ideal for edge)
cd k8s/addons
sudo ./install_knative.sh
GitOps continuous deployment
cd k8s/tools
sudo ./install_argocd.sh
Monitoring stack
sudo ./install_prometheus.sh
kubectl get nodes # All nodes should be Ready
kubectl get pods -A # All system pods should be Running
kubectl cluster-info # Display cluster endpoints
During cluster bootstrap, you'll choose a Container Network Interface (CNI):
| CNI | Best For | Pod CIDR | Key Features |
|---|---|---|---|
| Calico | Production | 192.168.0.0/16 | Network policies, BGP routing, scales to 1000+ nodes |
| Flannel | Simplicity | 10.244.0.0/16 | Lightweight VXLAN overlay, easiest to troubleshoot |
| Weave | Multi-cloud | 10.32.0.0/12 | Automatic mesh networking, built-in encryption |
Recommendation: Start with Calico if you need production features like network policies, or Flannel for learning and resource-constrained environments.
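The bootstrap script deploys your chosen CNI automatically, but if you ever need to (re)apply one by hand, the upstream manifest is a single kubectl apply away. A sketch for Flannel, using its published release manifest:

# Apply Flannel's upstream manifest (the cluster must have been initialized
# with --pod-network-cidr=10.244.0.0/16 to match Flannel's default)
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml

# Watch the CNI pods come up before joining workers
kubectl get pods -n kube-flannel -w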
Service meshes add advanced traffic management, security (automatic mTLS), and observability:
| Service Mesh | Resource Usage | Complexity | Best For |
|---|---|---|---|
| Istio | Higher (~50MB/proxy) | More complex | Large clusters, comprehensive features |
| Linkerd | Lower (~10MB/proxy) | Simpler | Resource-constrained, ease of use |
| Skip | None | None | Simple applications (1-5 services) |
Recommendation: Use Linkerd on edge clusters (Jetson) for lower overhead, Istio on cloud for comprehensive features.
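After installing either mesh, verify it is healthy before routing traffic through it. Linkerd ships a dedicated check command; for Istio, inspecting the control-plane pods is a reasonable smoke test:

# Linkerd: comprehensive pre- and post-install validation
linkerd check

# Istio: confirm the control plane and ingress gateway are running
kubectl get pods -n istio-system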
cd support_components/01_tls
sudo ./generate_ca.sh # Create your Certificate Authority (one-time)
sudo ./generate_cert.sh --service registry --hostname registry.local --ip 192.168.1.50
sudo ./trust_ca_on_nodes.sh # Install CA on all cluster nodes
cd support_components/02_components
sudo ./install_docker_registry.sh
cd support_components/02_components
sudo ./install_gitea.sh
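Once the registry is up and your CA is trusted on every node, pushing images works like any other registry. A sketch, assuming the registry answers at registry.local on port 5000 (adjust the hostname and port to your install):

# Tag and push a local image to the private registry
# (registry.local:5000 is an assumption - use your actual endpoint)
docker tag yolov8:latest registry.local:5000/yolov8:latest
docker push registry.local:5000/yolov8:latest

# Reference it from manifests as registry.local:5000/yolov8:latest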
- Control Plane: 1x Jetson Orin NX or AGX (8GB+)
- Workers: 2–9x Jetson Orin Nano or NX
- CNI: Flannel (lowest overhead)
- Service Mesh: Linkerd or none (keep resource usage low)
- Storage: NFS on one node or external NAS
- Use Case: Real-time inference, federated learning node
- Edge: 10x Jetson Orin for inference at remote sites
- Cloud: 5x x86 instances for model training
- CNI: Calico with BGP to physical network
- Service Mesh: Istio for cross-cluster traffic management
- Storage: Cloud block storage + edge local NVMe
- Use Case: Train in cloud, deploy to edge, hybrid workloads
- Compute: 50+ bare metal x86 with NVIDIA A100/H100 GPUs
- CNI: Calico in BGP mode with network policies
- Service Mesh: Istio for advanced traffic control and mTLS
- Storage: Distributed (Ceph, MinIO) or cloud object storage
- Monitoring: Full Prometheus + Grafana + Jaeger stack
- Use Case: Large-scale model training, research workloads
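In mixed edge/cloud clusters it helps to label nodes by tier so workloads land where you intend. A sketch with hypothetical node and label names (tier is not a built-in label; pick your own scheme):

# Label nodes by tier (hypothetical names)
kubectl label node jetson-01 tier=edge-gpu
kubectl label node cloud-gpu-01 tier=cloud-gpu

# Pin an inference pod to edge GPU nodes via nodeSelector
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: edge-pinned-inference
spec:
  nodeSelector:
    tier: edge-gpu
  containers:
    - name: app
      image: registry.local/yolov8:latest
      resources:
        limits:
          nvidia.com/gpu: 1
EOF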
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-training
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:23.10-py3
      resources:
        limits:
          nvidia.com/gpu: 1  # Request 1 GPU for this container
      command: ["python", "train.py"]
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: yolov8-inference
spec:
  template:
    spec:
      containers:
        - image: registry.local/yolov8:latest
          resources:
            limits:
              nvidia.com/gpu: 1
          env:
            - name: MODEL_PATH
              value: "/models/yolov8n.pt"
# Scales to zero after 30 seconds idle - critical for edge power efficiency
kubectl get nodes -o wide
kubectl get pods -A
kubectl top nodes # Resource usage (requires metrics-server)
kubectl top pods -A
kubectl apply -f deployment.yaml
kubectl get pods -w # Watch pod creation
kubectl logs <pod-name> -f # Stream logs
cd k8s/ops
sudo ./drain_node.sh # Safely remove node from cluster
See Kubernetes README for detailed upgrade procedures.
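For reference, the standard manual workflow behind safe node removal looks like this (node name is a placeholder):

# Evict workloads and mark the node unschedulable
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# ...perform maintenance or upgrades...

# Return the node to service
kubectl uncordon <node-name>

# Or remove it from the cluster entirely
kubectl delete node <node-name>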
cd k8s/tools
sudo ./install_prometheus.sh # Prometheus + Grafana
Access dashboards:
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Open browser: http://localhost:3000
If you installed Istio:
kubectl port-forward -n istio-system svc/kiali 20001:20001
# Open browser: http://localhost:20001
If you installed Linkerd:
linkerd viz dashboard
# Automatically opens browser
# Check CNI pods
kubectl get pods -n kube-system -l k8s-app=calico-node
kubectl logs -n kube-system <cni-pod-name>
# Check kubelet
sudo journalctl -u kubelet -n 50
kubectl describe pod <pod-name>
# Look for events indicating resource constraints or node selectors
kubectl describe nodes
# Check available resources
# Check if image exists and is accessible
docker pull <image-name>
# For private registry, ensure nodes trust CA
ls -la /usr/local/share/ca-certificates/
sudo update-ca-certificates
# On Jetson nodes, verify NVIDIA device plugin
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
# Check GPU availability
kubectl get nodes -o json | jq '.items[].status.capacity'
For comprehensive troubleshooting, see Kubernetes README.
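If the device plugin looks healthy but you want end-to-end confirmation, schedule a throwaway pod that requests a GPU and runs nvidia-smi. A sketch; the image tag is an assumption, and on Jetson you will likely need an L4T-based image rather than the generic CUDA one:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04  # assumption; use an L4T image on Jetson
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

kubectl logs gpu-smoke-test   # once it completes, expect the nvidia-smi device table
kubectl delete pod gpu-smoke-test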
After successfully deploying your cluster:
Deploy GPU Workloads:
- Try example manifests in k8s/deployments/
- Deploy PyTorch/TensorFlow distributed training
- Set up Knative for serverless inference
Enable Monitoring:
- Install Prometheus and Grafana: k8s/tools/install_prometheus.sh
- Set up alerting for critical metrics
- Integrate with service mesh observability
Configure Storage:
- Set up NFS server: k8s/tools/install_nfs_server.sh
- Configure persistent volumes for stateful workloads
- Consider distributed storage (Longhorn, Rook-Ceph) for larger clusters
Explore Service Mesh:
- Enable traffic splitting for canary deployments
- Configure automatic mTLS between services
- Set up distributed tracing with Jaeger
Implement GitOps:
- Install ArgoCD: k8s/tools/install_argocd.sh
- Set up Gitea for self-hosted Git: support_components/02_components/install_gitea.sh
- Configure continuous deployment from Git
Chaos Engineering:
- Install Chaos Mesh: k8s/tools/install_chaos_mesh.sh
- Test cluster resilience with controlled failures
- Build confidence in production readiness
Contributions are welcome! Areas of particular interest:
- Additional hardware support (other ARM SBCs, cloud platforms)
- Additional CNI plugins (Cilium, Kube-router)
- Enhanced monitoring and logging integrations
- ML/AI workload examples and tutorials
- Documentation improvements
Please see CONTRIBUTING.md and review the GNU GPL v3.0 license.
This project follows the Contributor Covenant Code of Conduct.
This project is licensed under the GNU General Public License v3.0 — see the LICENSE file for details.
You are free to:
- Use this software for any purpose
- Study and modify the source code
- Share the software with others
- Distribute modified versions
Under these terms:
- Modifications must also be licensed under GPL v3.0
- You must include the original copyright notice and license
- You must disclose source code of distributed modifications
This project builds on the excellent work of:
- The Kubernetes project and CNCF community
- NVIDIA for Jetson hardware and JetPack SDK
- The Raspberry Pi Foundation
- CNI plugin maintainers (Calico, Flannel, Weave)
- Service mesh projects (Istio, Linkerd)
- The broader cloud-native and edge computing communities
Christi Mahu is a software engineer specializing in MLOps, edge computing, and Kubernetes on ARM and GPU architectures.
Visit christimahu.dev for more projects and technical writing.
Current Version: 1.0.0
Status: Production-ready for edge deployments, actively maintained
Tested Platforms:
- ✅ NVIDIA Jetson Orin Nano (8GB)
- ✅ NVIDIA Jetson Orin NX (16GB)
- ✅ Kubernetes v1.30
- ✅ JetPack 5.1.2, 6.0
- 🔄 Raspberry Pi 4/5 (coming soon)
- 🔄 Cloud platforms (AWS/GCP/Azure — patterns proven, scripts in progress)
Happy Clustering! 🚀
For questions, issues, or discussion, visit the GitHub repository.