Kubernetes MLOps - GPU-Ready Edge-to-Cloud Platform

A production-grade, tutorial-rich automation suite for building GPU-ready Kubernetes clusters from scratch. Start with ARM edge devices (NVIDIA Jetson Orin, Raspberry Pi) and scale seamlessly to cloud infrastructure for advanced ML/HPC workloads.
Project Repository: github.com/christimahu/k8s-mlops
Author: Christi Mahu — christimahu.dev
License: GNU General Public License v3.0


Project Philosophy

This isn't just infrastructure-as-code — it's a guided learning experience. Every script includes extensive inline documentation explaining:

  • What is being installed and configured
  • Why each step is necessary
  • How the components work together
  • When to use specific options or alternatives
  • Troubleshooting guidance for common issues

The goal is to demystify production Kubernetes from bare metal to cloud, making it accessible to learners while providing the robustness needed for serious workloads.


Use Cases

Edge AI/ML (Start Here)

  • Deploy NVIDIA Jetson Orin clusters for GPU-accelerated inference at the edge
  • Run distributed PyTorch/TensorFlow training across multiple GPU nodes
  • Knative serverless functions that scale to zero (critical for power-constrained edge)
  • Real-time computer vision pipelines with direct GPU access

Cloud-Scale HPC (Scale Here)

  • Pattern-proven infrastructure that scales from 3 nodes to 100+ nodes
  • Same automation works on AWS/GCP/Azure bare metal instances
  • Multi-cluster federation for geo-distributed workloads
  • GitOps-ready for enterprise CI/CD pipelines

Hybrid Edge-Cloud

  • Train models in the cloud, deploy optimized inference to edge devices
  • Federated learning across edge clusters
  • Workload bursting from edge to cloud during peak demand
  • Disaster recovery with cloud failover

Hardware Platforms

NVIDIA Jetson Orin (Primary Platform)

  • Models: Jetson Orin Nano, NX, AGX
  • Architecture: ARM64 with NVIDIA Ampere GPU
  • Use Case: GPU-accelerated ML inference and training at the edge
  • Setup Scripts: Complete automation in jetson_orin/
  • Cluster Recommendation: 3–10 nodes optimal, scales to 20+
  • Power: 5–25W per node (ideal for edge/remote deployments)

Raspberry Pi (Support/Lightweight Nodes)

  • Models: Raspberry Pi 4 (8GB), Raspberry Pi 5
  • Architecture: ARM64 (no GPU acceleration)
  • Use Case: Control plane, lightweight workers, support services
  • Setup Scripts: Coming soon in raspberry_pi/
  • Cluster Recommendation: Mix with Jetson for cost-effective scaling
  • Power: 5–15W per node

Cloud/x86 (Scale-Out Target)

  • Platforms: AWS EC2 (bare metal), GCP Compute, Azure VMs, on-prem servers
  • Architecture: x86-64 with optional NVIDIA datacenter GPUs
  • Use Case: Cloud bursting, training large models, production scale-out
  • Setup Scripts: Same Kubernetes scripts work across architectures
  • Cluster Recommendation: 10–1000+ nodes

Project Structure

.
├── jetson_orin/              # Jetson Orin hardware setup (GPU nodes)
│   ├── setup/                # Sequential setup scripts (01-06)
│   ├── tools/                # Diagnostics and recovery tools
│   └── README.md             # Complete Jetson setup guide
├── raspberry_pi/             # Raspberry Pi setup (coming soon)
├── k8s/                      # Kubernetes cluster installation
│   ├── 01_install_deps.sh    # Container runtime and kernel config
│   ├── 02_install_kube.sh    # Kubernetes components (kubeadm, kubelet, kubectl)
│   ├── 03_bootstrap_cluster/ # Cluster initialization with CNI selection
│   ├── 04_install_ingress_nginx.sh    # NGINX Ingress Controller
│   ├── 05_install_cert_manager.sh     # Automatic TLS certificate management
│   ├── 06_install_service_mesh/       # Optional Istio or Linkerd
│   ├── addons/               # Additional cluster features (Knative, etc.)
│   ├── deployments/          # Example Kubernetes manifests
│   ├── ops/                  # Operational scripts (join nodes, drain, etc.)
│   └── tools/                # Cluster tools (ArgoCD, Prometheus, Chaos Mesh, etc.)
├── support_components/       # Supporting infrastructure
│   ├── 01_tls/               # TLS certificate authority and management
│   └── 02_components/        # Docker registry, Gitea Git server
├── utils/                    # Development utilities (Neovim, etc.)
├── LICENSE                   # GNU GPL v3.0
└── README.md                 # This file

Direct links to README files referenced above:

  • jetson_orin/README.md - Jetson Orin setup guide
  • k8s/README.md - Kubernetes cluster installation guide
  • support_components/README.md - Supporting infrastructure guide


Quick Start

Phase 1: Hardware Preparation

Choose your hardware platform and complete the setup:

For NVIDIA Jetson Orin (GPU Nodes):

cd jetson_orin
# Follow jetson_orin/README.md for complete walkthrough

Jetson Orin Setup Guide — Headless configuration, SSD migration, OS optimization

For Raspberry Pi (Coming Soon):

cd raspberry_pi
# Follow raspberry_pi/README.md when available

Phase 2: Kubernetes Installation

After hardware setup is complete on all nodes:

cd k8s

Kubernetes Setup Guide — Complete cluster deployment workflow

Phase 3: Supporting Infrastructure (Optional)

Add production-ready supporting services:

cd support_components

Support Components Guide — TLS certificates, private registry, Git server


Complete Installation Workflow

Step 1: Prepare All Hardware Nodes

On each Jetson Orin device (or cloud instance):

cd jetson_orin/setup
# Run scripts sequentially (01 → 06)
sudo ./01_config_headless.sh     # Configure for SSH, disable GUI
# Reboot, continue via SSH
sudo ./02_clone_os_to_ssd.sh     # Migrate to SSD (optional but recommended)
sudo ./03_set_boot_to_ssd.sh     # Configure SSD boot
sudo reboot
sudo ./04_strip_microsd_rootfs.sh  # Security hardening
sudo ./05_update_os.sh           # System updates
sudo ./06_verify_setup.sh        # Validate configuration

Why this order matters: Each script builds on the previous one. Script 06 verifies the complete setup, not individual steps.

Step 2: Install Kubernetes Components (All Nodes)

On every node (control plane and workers):

cd k8s
sudo ./01_install_deps.sh        # Container runtime, kernel modules
sudo ./02_install_kube.sh        # kubelet, kubeadm, kubectl
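
A quick sanity check before moving on, assuming the scripts install the standard binaries:

kubeadm version -o short
kubectl version --client
kubelet --version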

Step 3: Bootstrap Cluster (Control Plane Only)

On your designated control plane node:

cd k8s/03_bootstrap_cluster
sudo ./bootstrap.sh              # Interactive CNI selection and cluster init

What happens:

  • Interactive menu to choose CNI (Calico, Flannel, or Weave)
  • Cluster initialization with appropriate pod network CIDR
  • Automatic CNI deployment
  • Generation of worker join command

CRITICAL: Save the join command displayed at the end! Example:

sudo kubeadm join 192.168.1.100:6443 --token abc123.def456 \
    --discovery-token-ca-cert-hash sha256:1234567890abcdef...
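
If the join command scrolls away or its token expires (kubeadm tokens last 24 hours by default), regenerate it on the control plane:

sudo kubeadm token create --print-join-command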

Step 4: Join Worker Nodes

On each worker node:

cd k8s/ops
sudo ./join_node.sh              # Prompts for join command from Step 3

Step 5: Install Cluster Addons (Control Plane)

Essential addons (on control plane):

cd k8s
sudo ./04_install_ingress_nginx.sh       # External HTTP/HTTPS access
sudo ./05_install_cert_manager.sh        # Automatic TLS certificates

Optional but recommended:

Service mesh for advanced traffic management and observability

cd k8s/06_install_service_mesh
sudo ./install.sh                # Interactive: Istio, Linkerd, or skip

Serverless functions with scale-to-zero (ideal for edge)

cd k8s/addons
sudo ./install_knative.sh

GitOps continuous deployment

cd k8s/tools
sudo ./install_argocd.sh

Monitoring stack

sudo ./install_prometheus.sh

Step 6: Verify Cluster (Control Plane)

kubectl get nodes                # All nodes should be Ready
kubectl get pods -A              # All system pods should be Running
kubectl cluster-info             # Display cluster endpoints

CNI Options Explained

During cluster bootstrap, you'll choose a Container Network Interface (CNI):

| CNI | Best For | Pod CIDR | Key Features |
| --- | --- | --- | --- |
| Calico | Production | 192.168.0.0/16 | Network policies, BGP routing, scales to 1000+ nodes |
| Flannel | Simplicity | 10.244.0.0/16 | Lightweight VXLAN overlay, easiest to troubleshoot |
| Weave | Multi-cloud | 10.32.0.0/12 | Automatic mesh networking, built-in encryption |

Recommendation: Start with Calico for production features, or Flannel for learning and resource-constrained environments.
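
To make "network policies" concrete, here is a standard NetworkPolicy (names illustrative) that denies all ingress to pods in a namespace until explicitly allowed. Calico enforces it; plain Flannel has no policy engine and will accept but ignore it:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: default
spec:
  podSelector: {}      # empty selector matches every pod in the namespace
  policyTypes:
  - Ingress            # no ingress rules listed, so all inbound pod traffic is denied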


Service Mesh Options

Service meshes add advanced traffic management, security (automatic mTLS), and observability:

| Service Mesh | Resource Usage | Complexity | Best For |
| --- | --- | --- | --- |
| Istio | Higher (~50MB/proxy) | More complex | Large clusters, comprehensive features |
| Linkerd | Lower (~10MB/proxy) | Simpler | Resource-constrained, ease of use |
| Skip | None | None | Simple applications (1–5 services) |

Recommendation: Use Linkerd on edge clusters (Jetson) for lower overhead, Istio on cloud for comprehensive features.
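
Whichever mesh you pick, workloads opt in to sidecar injection per namespace. Assuming a default install from the script above:

kubectl label namespace default istio-injection=enabled          # Istio
kubectl annotate namespace default linkerd.io/inject=enabled     # Linkerd

Existing deployments need a rollout restart so their pods are re-created with the proxy.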


Supporting Infrastructure

TLS Certificate Management

cd support_components/01_tls
sudo ./generate_ca.sh            # Create your Certificate Authority (one-time)
sudo ./generate_cert.sh --service registry --hostname registry.local --ip 192.168.1.50
sudo ./trust_ca_on_nodes.sh      # Install CA on all cluster nodes
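
Before trusting the certificate cluster-wide, verify it against the CA. File names here are illustrative; use the paths the scripts actually print:

openssl verify -CAfile ca.crt registry.crt
openssl x509 -in registry.crt -noout -subject -ext subjectAltName    # confirm hostname/IP SANs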

Private Container Registry

cd support_components/02_components
sudo ./install_docker_registry.sh
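
Assuming the registry answers at registry.local (the hostname from the TLS example above), pushing an image is the usual tag-and-push:

docker tag yolov8:latest registry.local/yolov8:latest
docker push registry.local/yolov8:latest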

Git Server (Gitea)

cd support_components/02_components
sudo ./install_gitea.sh

Architecture Patterns

Small Edge Cluster (3–10 nodes)

  • Control Plane: 1x Jetson Orin NX or AGX (8GB+)
  • Workers: 2–9x Jetson Orin Nano or NX
  • CNI: Flannel (lower overhead)
  • Service Mesh: Linkerd, or none for the smallest clusters
  • Storage: NFS on one node or external NAS
  • Use Case: Real-time inference, federated learning node

Medium Hybrid Cluster (10–50 nodes)

  • Edge: 10x Jetson Orin for inference at remote sites
  • Cloud: 5x x86 instances for model training
  • CNI: Calico with BGP to physical network
  • Service Mesh: Istio for cross-cluster traffic management
  • Storage: Cloud block storage + edge local NVMe
  • Use Case: Train in cloud, deploy to edge, hybrid workloads

Large HPC Cluster (50+ nodes)

  • Compute: 50+ bare metal x86 with NVIDIA A100/H100 GPUs
  • CNI: Calico in BGP mode with network policies
  • Service Mesh: Istio for advanced traffic control and mTLS
  • Storage: Distributed (Ceph, MinIO) or cloud object storage
  • Monitoring: Full Prometheus + Grafana + Jaeger stack
  • Use Case: Large-scale model training, research workloads

GPU Workload Examples

Distributed PyTorch Training

apiVersion: v1
kind: Pod
metadata:
  name: pytorch-training
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:23.10-py3
    resources:
      limits:
        nvidia.com/gpu: 1  # Request 1 GPU per pod
    command: ["python", "train.py"]

Knative Serverless Inference

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: yolov8-inference
spec:
  template:
    spec:
      containers:
      - image: registry.local/yolov8:latest
        resources:
          limits:
            nvidia.com/gpu: 1
        env:
        - name: MODEL_PATH
          value: "/models/yolov8n.pt"
# Scales to zero after an idle window - critical for edge power efficiency.
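
The idle window is tunable per service via Knative's standard autoscaling annotations on the revision template; for example, a 30-second stable window:

spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/window: "30s"    # window the autoscaler evaluates before scaling down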

Operational Tasks

View Cluster Status

kubectl get nodes -o wide
kubectl get pods -A
kubectl top nodes               # Resource usage (requires metrics-server)
kubectl top pods -A

Deploy Application

kubectl apply -f deployment.yaml
kubectl get pods -w             # Watch pod creation
kubectl logs <pod-name> -f      # Stream logs

Node Maintenance

cd k8s/ops
sudo ./drain_node.sh            # Safely remove node from cluster
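
The script presumably wraps the standard drain workflow; done by hand it looks like this, with uncordon returning the node to service afterwards:

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# ...perform maintenance, reboot, etc....
kubectl uncordon <node-name>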

Cluster Upgrade

See Kubernetes README for detailed upgrade procedures.


Monitoring and Observability

Install Monitoring Stack

cd k8s/tools
sudo ./install_prometheus.sh    # Prometheus + Grafana

Access dashboards:

kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Open browser: http://localhost:3000

Service Mesh Observability

If you installed Istio:

kubectl port-forward -n istio-system svc/kiali 20001:20001
# Open browser: http://localhost:20001

If you installed Linkerd:

linkerd viz dashboard
# Automatically opens browser

Troubleshooting

Nodes Stuck "NotReady"

# Check CNI pods (label shown is for Calico; adjust for your CNI)
kubectl get pods -n kube-system -l k8s-app=calico-node
kubectl logs -n kube-system <cni-pod-name>
# Check kubelet
sudo journalctl -u kubelet -n 50

Pods Not Scheduling

kubectl describe pod <pod-name>
# Look for events indicating resource constraints or node selectors
kubectl describe nodes
# Check available resources

ImagePullBackOff

# Check if image exists and is accessible
docker pull <image-name>        # or: sudo crictl pull <image-name> on containerd-only nodes
# For private registry, ensure nodes trust CA
ls -la /usr/local/share/ca-certificates/
sudo update-ca-certificates

GPU Not Detected

# On Jetson nodes, verify NVIDIA device plugin
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
# Check GPU availability
kubectl get nodes -o json | jq '.items[].status.capacity'
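
The same information appears in kubectl describe; a missing or zero nvidia.com/gpu entry under Capacity usually means the device plugin isn't running on that node:

kubectl describe node <node-name> | grep -A 6 'Capacity'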

For comprehensive troubleshooting, see Kubernetes README.


What's Next?

After successfully deploying your cluster:

Deploy GPU Workloads:

  • Try example manifests in k8s/deployments/
  • Deploy PyTorch/TensorFlow distributed training
  • Set up Knative for serverless inference

Enable Monitoring:

  • Install Prometheus and Grafana: k8s/tools/install_prometheus.sh
  • Set up alerting for critical metrics
  • Integrate with service mesh observability

Configure Storage:

  • Set up NFS server: k8s/tools/install_nfs_server.sh
  • Configure persistent volumes for stateful workloads (sample claim below)
  • Consider distributed storage (Longhorn, Rook-Ceph) for larger clusters
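
As a minimal sketch of persistent volumes for stateful workloads, here is a PersistentVolumeClaim for shared model storage; the storageClassName is an assumption and must match your provisioner (for example, the NFS server above):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-store
spec:
  accessModes:
  - ReadWriteMany                # shared read/write, a natural fit for NFS
  storageClassName: nfs-client   # assumed class name; check your provisioner
  resources:
    requests:
      storage: 10Gi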

Explore Service Mesh:

  • Enable traffic splitting for canary deployments (sketch below)
  • Configure automatic mTLS between services
  • Set up distributed tracing with Jaeger
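
A sketch of canary traffic splitting with Istio (all names hypothetical; assumes a DestinationRule already defines the v1/v2 subsets):

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: inference-split
spec:
  hosts:
  - inference            # in-mesh service name
  http:
  - route:
    - destination:
        host: inference
        subset: v1
      weight: 90         # stable revision keeps 90% of traffic
    - destination:
        host: inference
        subset: v2
      weight: 10         # canary receives 10%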

Implement GitOps:

  • Install ArgoCD: k8s/tools/install_argocd.sh
  • Set up Gitea for self-hosted Git: support_components/02_components/install_gitea.sh
  • Configure continuous deployment from Git (example Application below)
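
A minimal ArgoCD Application tying those pieces together (repo URL and paths are hypothetical; point it at your Gitea instance):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: inference-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitea.local/ml/inference.git    # hypothetical Gitea repo
    targetRevision: main
    path: manifests                                  # directory of Kubernetes manifests
  destination:
    server: https://kubernetes.default.svc           # deploy into this same cluster
    namespace: default
  syncPolicy:
    automated: {}                                    # sync automatically on Git changes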

Chaos Engineering:

  • Install Chaos Mesh: k8s/tools/install_chaos_mesh.sh
  • Test cluster resilience with controlled failures (example below)
  • Build confidence in production readiness
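
A gentle first experiment is Chaos Mesh's pod-kill action (namespace illustrative; start somewhere non-critical):

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-demo
spec:
  action: pod-kill    # kill a pod and verify the workload self-heals
  mode: one           # pick a single random pod from the selection
  selector:
    namespaces:
    - default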

Contributing

Contributions are welcome! Areas of particular interest:

  • Additional hardware support (other ARM SBCs, cloud platforms)
  • Additional CNI plugins (Cilium, Kube-router)
  • Enhanced monitoring and logging integrations
  • ML/AI workload examples and tutorials
  • Documentation improvements

Please see CONTRIBUTING.md and review the GNU GPL v3.0 license.


Code of Conduct

This project follows the Contributor Covenant Code of Conduct.


License

This project is licensed under the GNU General Public License v3.0 — see the LICENSE file for details.

You are free to:

  • Use this software for any purpose
  • Study and modify the source code
  • Share the software with others
  • Distribute modified versions

Under these terms:

  • Modifications must also be licensed under GPL v3.0
  • You must include the original copyright notice and license
  • You must disclose source code of distributed modifications

Acknowledgments

This project builds on the excellent work of:

  • The Kubernetes project and CNCF community
  • NVIDIA for Jetson hardware and JetPack SDK
  • The Raspberry Pi Foundation
  • CNI plugin maintainers (Calico, Flannel, Weave)
  • Service mesh projects (Istio, Linkerd)
  • The broader cloud-native and edge computing communities

About the Author

Christi Mahu is a software engineer specializing in MLOps, edge computing, and Kubernetes on ARM and GPU architectures.
Visit christimahu.dev for more projects and technical writing.


Project Status

Current Version: 1.0.0
Status: Production-ready for edge deployments, actively maintained

Tested Platforms:

  • ✅ NVIDIA Jetson Orin Nano (8GB)
  • ✅ NVIDIA Jetson Orin NX (16GB)
  • ✅ Kubernetes v1.30
  • ✅ JetPack 5.1.2, 6.0
  • 🔄 Raspberry Pi 4/5 (coming soon)
  • 🔄 Cloud platforms (AWS/GCP/Azure — patterns proven, scripts in progress)

Happy Clustering! 🚀
For questions, issues, or discussion, visit the GitHub repository.
