Kubernetes MLOps - GPU-Ready Edge-to-Cloud Platform

A production-grade, tutorial-rich automation suite for building GPU-ready Kubernetes clusters from scratch. Start with ARM edge devices (NVIDIA Jetson Orin, Raspberry Pi) and scale seamlessly to cloud infrastructure for advanced ML/HPC workloads.
Project Repository: github.com/christimahu/k8s-mlops
Author: Christi Mahu — christimahu.dev
License: GNU General Public License v3.0


Project Philosophy

This isn't just infrastructure-as-code — it's a guided learning experience. Every script includes extensive inline documentation explaining:

  • What is being installed and configured
  • Why each step is necessary
  • How the components work together
  • When to use specific options or alternatives
  • Troubleshooting guidance for common issues

The goal is to demystify production Kubernetes from bare metal to cloud, making it accessible to learners while providing the robustness needed for serious workloads.


Use Cases

Edge AI/ML (Start Here)

  • Deploy NVIDIA Jetson Orin clusters for GPU-accelerated inference at the edge
  • Run distributed PyTorch/TensorFlow training across multiple GPU nodes
  • Knative serverless functions that scale to zero (critical for power-constrained edge)
  • Real-time computer vision pipelines with direct GPU access

Cloud-Scale HPC (Scale Here)

  • Pattern-proven infrastructure that scales from 3 nodes to 100+ nodes
  • Same automation works on AWS/GCP/Azure bare metal instances
  • Multi-cluster federation for geo-distributed workloads
  • GitOps-ready for enterprise CI/CD pipelines

Hybrid Edge-Cloud

  • Train models in the cloud, deploy optimized inference to edge devices
  • Federated learning across edge clusters
  • Workload bursting from edge to cloud during peak demand
  • Disaster recovery with cloud failover

Hardware Platforms

NVIDIA Jetson Orin (Primary Platform)

  • Models: Jetson Orin Nano, NX, AGX
  • Architecture: ARM64 with NVIDIA Ampere GPU
  • Use Case: GPU-accelerated ML inference and training at the edge
  • Setup Scripts: Complete automation in jetson_orin/
  • Cluster Recommendation: 3–10 nodes optimal, scales to 20+
  • Power: 5–25W per node (ideal for edge/remote deployments)

Raspberry Pi (Support/Lightweight Nodes)

  • Models: Raspberry Pi 4 (8GB), Raspberry Pi 5
  • Architecture: ARM64 (no GPU acceleration)
  • Use Case: Control plane, lightweight workers, support services
  • Setup Scripts: Coming soon in raspberry_pi/
  • Cluster Recommendation: Mix with Jetson for cost-effective scaling
  • Power: 5–15W per node

Cloud/x86 (Scale-Out Target)

  • Platforms: AWS EC2 (bare metal), GCP Compute, Azure VMs, on-prem servers
  • Architecture: x86-64 with optional NVIDIA datacenter GPUs
  • Use Case: Cloud bursting, training large models, production scale-out
  • Setup Scripts: Same Kubernetes scripts work across architectures
  • Cluster Recommendation: 10–1000+ nodes

Project Structure

.
├── jetson_orin/              # Jetson Orin hardware setup (GPU nodes)
│   ├── setup/                # Sequential setup scripts (01-06)
│   ├── tools/                # Diagnostics and recovery tools
│   └── README.md             # Complete Jetson setup guide
├── raspberry_pi/             # Raspberry Pi setup (coming soon)
├── k8s/                      # Kubernetes cluster installation
│   ├── 01_install_deps.sh    # Container runtime and kernel config
│   ├── 02_install_kube.sh    # Kubernetes components (kubeadm, kubelet, kubectl)
│   ├── 03_bootstrap_cluster/ # Cluster initialization with CNI selection
│   ├── 04_install_ingress_nginx.sh    # NGINX Ingress Controller
│   ├── 05_install_cert_manager.sh     # Automatic TLS certificate management
│   ├── 06_install_service_mesh/       # Optional Istio or Linkerd
│   ├── addons/               # Additional cluster features (Knative, etc.)
│   ├── deployments/          # Example Kubernetes manifests
│   ├── ops/                  # Operational scripts (join nodes, drain, etc.)
│   └── tools/                # Cluster tools (ArgoCD, Prometheus, Chaos Mesh, etc.)
├── support_components/       # Supporting infrastructure
│   ├── 01_tls/               # TLS certificate authority and management
│   └── 02_components/        # Docker registry, Gitea Git server
├── utils/                    # Development utilities (Neovim, etc.)
├── LICENSE                   # GNU GPL v3.0
└── README.md                 # This file

Direct links to README files referenced above:

  • jetson_orin/README.md - Jetson Orin setup guide
  • k8s/README.md - Kubernetes cluster installation guide
  • support_components/README.md - Supporting infrastructure guide


Quick Start

Phase 1: Hardware Preparation

Choose your hardware platform and complete the setup:

For NVIDIA Jetson Orin (GPU Nodes):

cd jetson_orin
# Follow jetson_orin/README.md for complete walkthrough

Jetson Orin Setup Guide — Headless configuration, SSD migration, OS optimization

For Raspberry Pi (Coming Soon):

cd raspberry_pi
# Follow raspberry_pi/README.md when available

Phase 2: Kubernetes Installation

After hardware setup is complete on all nodes:

cd k8s

Kubernetes Setup Guide — Complete cluster deployment workflow

Phase 3: Supporting Infrastructure (Optional)

Add production-ready supporting services:

cd support_components

Support Components Guide — TLS certificates, private registry, Git server


Complete Installation Workflow

Step 1: Prepare All Hardware Nodes

On each Jetson Orin device (or cloud instance):

cd jetson_orin/setup
# Run scripts sequentially (01 → 06)
sudo ./01_config_headless.sh     # Configure for SSH, disable GUI
# Reboot, continue via SSH
sudo ./02_clone_os_to_ssd.sh     # Migrate to SSD (optional but recommended)
sudo ./03_set_boot_to_ssd.sh     # Configure SSD boot
sudo reboot
sudo ./04_strip_microsd_rootfs.sh  # Security hardening
sudo ./05_update_os.sh           # System updates
sudo ./06_verify_setup.sh        # Validate configuration

Why this order matters: Each script builds on the previous one. Script 06 verifies the complete setup, not individual steps.

Step 2: Install Kubernetes Components (All Nodes)

On every node (control plane and workers):

cd k8s
sudo ./01_install_deps.sh        # Container runtime, kernel modules
sudo ./02_install_kube.sh        # kubelet, kubeadm, kubectl
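
A quick sanity check before moving on, assuming the scripts install the standard binaries:

kubeadm version -o short
kubectl version --client
kubelet --version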

Step 3: Bootstrap Cluster (Control Plane Only)

On your designated control plane node:

cd k8s/03_bootstrap_cluster
sudo ./bootstrap.sh              # Interactive CNI selection and cluster init

What happens:

  • Interactive menu to choose CNI (Calico, Flannel, or Weave)
  • Cluster initialization with appropriate pod network CIDR
  • Automatic CNI deployment
  • Generation of worker join command

CRITICAL: Save the join command displayed at the end! Example:

sudo kubeadm join 192.168.1.100:6443 --token abc123.def456 \
    --discovery-token-ca-cert-hash sha256:1234567890abcdef...
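
If the join command scrolls away or its token expires (kubeadm tokens last 24 hours by default), regenerate it on the control plane:

sudo kubeadm token create --print-join-command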

Step 4: Join Worker Nodes

On each worker node:

cd k8s/ops
sudo ./join_node.sh              # Prompts for join command from Step 3

Step 5: Install Cluster Addons (Control Plane)

Essential addons (on control plane):

cd k8s
sudo ./04_install_ingress_nginx.sh       # External HTTP/HTTPS access
sudo ./05_install_cert_manager.sh        # Automatic TLS certificates

Optional but recommended:

Service mesh for advanced traffic management and observability

cd k8s/06_install_service_mesh
sudo ./install.sh                # Interactive: Istio, Linkerd, or skip

Serverless functions with scale-to-zero (ideal for edge)

cd k8s/addons
sudo ./install_knative.sh

GitOps continuous deployment

cd k8s/tools
sudo ./install_argocd.sh

Monitoring stack

sudo ./install_prometheus.sh

Step 6: Verify Cluster (Control Plane)

kubectl get nodes                # All nodes should be Ready
kubectl get pods -A              # All system pods should be Running
kubectl cluster-info             # Display cluster endpoints

CNI Options Explained

During cluster bootstrap, you'll choose a Container Network Interface (CNI):

| CNI | Best For | Pod CIDR | Key Features |
| --- | --- | --- | --- |
| Calico | Production | 192.168.0.0/16 | Network policies, BGP routing, scales to 1000+ nodes |
| Flannel | Simplicity | 10.244.0.0/16 | Lightweight VXLAN overlay, easiest to troubleshoot |
| Weave | Multi-cloud | 10.32.0.0/12 | Automatic mesh networking, built-in encryption |

Recommendation: Start with Calico for production features, or Flannel for learning and resource-constrained environments.
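
To make "network policies" concrete, here is a standard NetworkPolicy (names illustrative) that denies all ingress to pods in a namespace until explicitly allowed. Calico enforces it; plain Flannel has no policy engine and will accept but ignore it:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: default
spec:
  podSelector: {}      # empty selector matches every pod in the namespace
  policyTypes:
  - Ingress            # no ingress rules listed, so all inbound pod traffic is denied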


Service Mesh Options

Service meshes add advanced traffic management, security (automatic mTLS), and observability:

| Service Mesh | Resource Usage | Complexity | Best For |
| --- | --- | --- | --- |
| Istio | Higher (~50MB/proxy) | More complex | Large clusters, comprehensive features |
| Linkerd | Lower (~10MB/proxy) | Simpler | Resource-constrained, ease of use |
| Skip | None | None | Simple applications (1–5 services) |

Recommendation: Use Linkerd on edge clusters (Jetson) for lower overhead, Istio on cloud for comprehensive features.
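
Whichever mesh you pick, workloads opt in to sidecar injection per namespace. Assuming a default install from the script above:

kubectl label namespace default istio-injection=enabled          # Istio
kubectl annotate namespace default linkerd.io/inject=enabled     # Linkerd

Existing deployments need a rollout restart so their pods are re-created with the proxy.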


Supporting Infrastructure

TLS Certificate Management

cd support_components/01_tls
sudo ./generate_ca.sh            # Create your Certificate Authority (one-time)
sudo ./generate_cert.sh --service registry --hostname registry.local --ip 192.168.1.50
sudo ./trust_ca_on_nodes.sh      # Install CA on all cluster nodes
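
Before trusting the certificate cluster-wide, verify it against the CA. File names here are illustrative; use the paths the scripts actually print:

openssl verify -CAfile ca.crt registry.crt
openssl x509 -in registry.crt -noout -subject -ext subjectAltName    # confirm hostname/IP SANs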

Private Container Registry

cd support_components/02_components
sudo ./install_docker_registry.sh
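
Assuming the registry answers at registry.local (the hostname from the TLS example above), pushing an image is the usual tag-and-push:

docker tag yolov8:latest registry.local/yolov8:latest
docker push registry.local/yolov8:latest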

Git Server (Gitea)

cd support_components/02_components
sudo ./install_gitea.sh

Architecture Patterns

Small Edge Cluster (3–10 nodes)

  • Control Plane: 1x Jetson Orin NX or AGX (8GB+)
  • Workers: 2–9x Jetson Orin Nano or NX
  • CNI: Flannel (lower overhead)
  • Service Mesh: Linkerd, or none for the smallest clusters
  • Storage: NFS on one node or external NAS
  • Use Case: Real-time inference, federated learning node

Medium Hybrid Cluster (10–50 nodes)

  • Edge: 10x Jetson Orin for inference at remote sites
  • Cloud: 5x x86 instances for model training
  • CNI: Calico with BGP to physical network
  • Service Mesh: Istio for cross-cluster traffic management
  • Storage: Cloud block storage + edge local NVMe
  • Use Case: Train in cloud, deploy to edge, hybrid workloads

Large HPC Cluster (50+ nodes)

  • Compute: 50+ bare metal x86 with NVIDIA A100/H100 GPUs
  • CNI: Calico in BGP mode with network policies
  • Service Mesh: Istio for advanced traffic control and mTLS
  • Storage: Distributed (Ceph, MinIO) or cloud object storage
  • Monitoring: Full Prometheus + Grafana + Jaeger stack
  • Use Case: Large-scale model training, research workloads

GPU Workload Examples

Distributed PyTorch Training

apiVersion: v1
kind: Pod
metadata:
  name: pytorch-training
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:23.10-py3
    resources:
      limits:
        nvidia.com/gpu: 1  # Request 1 GPU per pod
    command: ["python", "train.py"]

Knative Serverless Inference

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: yolov8-inference
spec:
  template:
    spec:
      containers:
      - image: registry.local/yolov8:latest
        resources:
          limits:
            nvidia.com/gpu: 1
        env:
        - name: MODEL_PATH
          value: "/models/yolov8n.pt"
# Scales to zero after an idle window - critical for edge power efficiency.
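
The idle window is tunable per service via Knative's standard autoscaling annotations on the revision template; for example, a 30-second stable window:

spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/window: "30s"    # window the autoscaler evaluates before scaling down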

Operational Tasks

View Cluster Status

kubectl get nodes -o wide
kubectl get pods -A
kubectl top nodes               # Resource usage (requires metrics-server)
kubectl top pods -A

Deploy Application

kubectl apply -f deployment.yaml
kubectl get pods -w             # Watch pod creation
kubectl logs <pod-name> -f      # Stream logs

Node Maintenance

cd k8s/ops
sudo ./drain_node.sh            # Safely remove node from cluster
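
The script presumably wraps the standard drain workflow; done by hand it looks like this, with uncordon returning the node to service afterwards:

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# ...perform maintenance, reboot, etc....
kubectl uncordon <node-name>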

Cluster Upgrade

See Kubernetes README for detailed upgrade procedures.


Monitoring and Observability

Install Monitoring Stack

cd k8s/tools
sudo ./install_prometheus.sh    # Prometheus + Grafana

Access dashboards:

kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Open browser: http://localhost:3000

Service Mesh Observability

If you installed Istio:

kubectl port-forward -n istio-system svc/kiali 20001:20001
# Open browser: http://localhost:20001

If you installed Linkerd:

linkerd viz dashboard
# Automatically opens browser

Troubleshooting

Nodes Stuck "NotReady"

# Check CNI pods (label shown is for Calico; adjust for your CNI)
kubectl get pods -n kube-system -l k8s-app=calico-node
kubectl logs -n kube-system <cni-pod-name>
# Check kubelet
sudo journalctl -u kubelet -n 50

Pods Not Scheduling

kubectl describe pod <pod-name>
# Look for events indicating resource constraints or node selectors
kubectl describe nodes
# Check available resources

ImagePullBackOff

# Check if image exists and is accessible
docker pull <image-name>        # or: sudo crictl pull <image-name> on containerd-only nodes
# For private registry, ensure nodes trust CA
ls -la /usr/local/share/ca-certificates/
sudo update-ca-certificates

GPU Not Detected

# On Jetson nodes, verify NVIDIA device plugin
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
# Check GPU availability
kubectl get nodes -o json | jq '.items[].status.capacity'
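
The same information appears in kubectl describe; a missing or zero nvidia.com/gpu entry under Capacity usually means the device plugin isn't running on that node:

kubectl describe node <node-name> | grep -A 6 'Capacity'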

For comprehensive troubleshooting, see Kubernetes README.


What's Next?

After successfully deploying your cluster:

Deploy GPU Workloads:

  • Try example manifests in k8s/deployments/
  • Deploy PyTorch/TensorFlow distributed training
  • Set up Knative for serverless inference

Enable Monitoring:

  • Install Prometheus and Grafana: k8s/tools/install_prometheus.sh
  • Set up alerting for critical metrics
  • Integrate with service mesh observability

Configure Storage:

  • Set up NFS server: k8s/tools/install_nfs_server.sh
  • Configure persistent volumes for stateful workloads (sample claim below)
  • Consider distributed storage (Longhorn, Rook-Ceph) for larger clusters
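
As a minimal sketch of persistent volumes for stateful workloads, here is a PersistentVolumeClaim for shared model storage; the storageClassName is an assumption and must match your provisioner (for example, the NFS server above):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-store
spec:
  accessModes:
  - ReadWriteMany                # shared read/write, a natural fit for NFS
  storageClassName: nfs-client   # assumed class name; check your provisioner
  resources:
    requests:
      storage: 10Gi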

Explore Service Mesh:

  • Enable traffic splitting for canary deployments (sketch below)
  • Configure automatic mTLS between services
  • Set up distributed tracing with Jaeger
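
A sketch of canary traffic splitting with Istio (all names hypothetical; assumes a DestinationRule already defines the v1/v2 subsets):

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: inference-split
spec:
  hosts:
  - inference            # in-mesh service name
  http:
  - route:
    - destination:
        host: inference
        subset: v1
      weight: 90         # stable revision keeps 90% of traffic
    - destination:
        host: inference
        subset: v2
      weight: 10         # canary receives 10%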

Implement GitOps:

  • Install ArgoCD: k8s/tools/install_argocd.sh
  • Set up Gitea for self-hosted Git: support_components/02_components/install_gitea.sh
  • Configure continuous deployment from Git (example Application below)
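
A minimal ArgoCD Application tying those pieces together (repo URL and paths are hypothetical; point it at your Gitea instance):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: inference-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitea.local/ml/inference.git    # hypothetical Gitea repo
    targetRevision: main
    path: manifests                                  # directory of Kubernetes manifests
  destination:
    server: https://kubernetes.default.svc           # deploy into this same cluster
    namespace: default
  syncPolicy:
    automated: {}                                    # sync automatically on Git changes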

Chaos Engineering:

  • Install Chaos Mesh: k8s/tools/install_chaos_mesh.sh
  • Test cluster resilience with controlled failures (example below)
  • Build confidence in production readiness
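
A gentle first experiment is Chaos Mesh's pod-kill action (namespace illustrative; start somewhere non-critical):

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-demo
spec:
  action: pod-kill    # kill a pod and verify the workload self-heals
  mode: one           # pick a single random pod from the selection
  selector:
    namespaces:
    - default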

Contributing

Contributions are welcome! Areas of particular interest:

  • Additional hardware support (other ARM SBCs, cloud platforms)
  • Additional CNI plugins (Cilium, Kube-router)
  • Enhanced monitoring and logging integrations
  • ML/AI workload examples and tutorials
  • Documentation improvements

Please see CONTRIBUTING.md and review the GNU GPL v3.0 license.


Code of Conduct

This project follows the Contributor Covenant Code of Conduct.


License

This project is licensed under the GNU General Public License v3.0 — see the LICENSE file for details.

You are free to:

  • Use this software for any purpose
  • Study and modify the source code
  • Share the software with others
  • Distribute modified versions

Under these terms:

  • Modifications must also be licensed under GPL v3.0
  • You must include the original copyright notice and license
  • You must disclose source code of distributed modifications

Acknowledgments

This project builds on the excellent work of:

  • The Kubernetes project and CNCF community
  • NVIDIA for Jetson hardware and JetPack SDK
  • The Raspberry Pi Foundation
  • CNI plugin maintainers (Calico, Flannel, Weave)
  • Service mesh projects (Istio, Linkerd)
  • The broader cloud-native and edge computing communities

About the Author

Christi Mahu is a software engineer specializing in MLOps, edge computing, and Kubernetes on ARM and GPU architectures.
Visit christimahu.dev for more projects and technical writing.


Project Status

Current Version: 1.0.0
Status: Production-ready for edge deployments, actively maintained

Tested Platforms:

  • ✅ NVIDIA Jetson Orin Nano (8GB)
  • ✅ NVIDIA Jetson Orin NX (16GB)
  • ✅ Kubernetes v1.30
  • ✅ JetPack 5.1.2, 6.0
  • 🔄 Raspberry Pi 4/5 (coming soon)
  • 🔄 Cloud platforms (AWS/GCP/Azure — patterns proven, scripts in progress)

Happy Clustering! 🚀
For questions, issues, or discussion, visit the GitHub repository.
