Skip to content

togettoyou/kpilot

Repository files navigation

KPilot

Unified GPU + model platform for Kubernetes.

English · 中文

KPilot — cluster ops, GPU scheduling, monitoring, and plugin management in one console

License Stars Last commit Helm chart


What is KPilot

KPilot is a control plane for running GPU workloads on Kubernetes. Cluster operations, Volcano-based batch scheduling, vGPU governance, hardware telemetry, plugin lifecycle, and model serving live behind one console with a consistent permission and audit surface.

Multi-cluster is the default — a single KPilot Server manages many clusters, with the in-cluster agent dialing back over gRPC. No inbound ports on the cluster side, no shared kubeconfigs, no per-cloud divergence.

Architecture

KPilot architecture (C4 container diagram)

Server owns the UI, API, and durable state (cluster registry, plugin metadata, accounts) but holds no kubeconfigs. Worker runs inside each managed cluster, dials the Server over a single long-lived gRPC stream, and brokers every Kubernetes operation on its behalf — no inbound ports, no shared credentials, no cross-cloud divergence. Plugins ship as Helm charts and reconcile via an in-cluster CRD, executing in the cluster's own RBAC context.

Quick Start

Install the Server (control-plane cluster):

helm install kpilot oci://ghcr.io/togettoyou/charts/kpilot \
  --version 0.0.0-dev \
  --namespace kpilot-system --create-namespace \
  --set server.admin.password='<change-me>'

Port-forward the UI and log in with kpilot / <your password>:

kubectl -n kpilot-system port-forward svc/kpilot-server 8080:80
open http://localhost:8080

Install the Worker (each managed cluster). Create a cluster row in the UI, copy the one-time ClusterToken, then:

helm install kpilot-worker oci://ghcr.io/togettoyou/charts/kpilot \
  --version 0.0.0-dev \
  --namespace kpilot-system --create-namespace \
  --set server.enabled=false,worker.enabled=true,postgresql.enabled=false \
  --set worker.serverAddr='kpilot-server-grpc.kpilot-system.svc:9090' \
  --set worker.clusterToken='<paste-token>'

The cluster row in the Server UI transitions to Online within a few seconds. Production exposure (Ingress, external Postgres, image registry mirrors) is covered in deploy/README.md.

Use Cases

  • Multi-cluster GPU operations — run a single platform team across clusters in different VPCs, regions, or clouds without touching network policies.
  • Shared GPU tenancy — partition each card into vGPU slices and govern allocation through Volcano queues with explicit capability / guarantee / deserved policies.
  • GPU usage metering — produce GPU-Hour reports per node and per card straight from DCGM, then drill into hotspots from the same UI.
  • Self-service AI platform (roadmap) — let teams deploy inference endpoints from a model catalog and run distributed fine-tuning without writing YAML.

Key Features

Cluster Management
  • Multi-cluster onboarding via a single-use token; no kubeconfig sharing
  • Live node and workload browser covering native and custom resources
  • In-browser Pod logs, terminal, and per-container CPU / memory metrics
  • Inline YAML editor with apply / describe / delete for any resource
Compute Scheduling
  • Volcano gang scheduling across Queue, Job, CronJob, PodGroup, HyperNode
  • Fine-grained GPU sharing via volcano-vgpu-device-plugin (slot / framebuffer / SM cores)
  • Multi-resource queue quotas with capability, guarantee, allocated, and deserved views
  • Visual scheduler-policy editor for actions, tiers, and plugin parameters
GPU Observability
  • Per-card panels for utilization, temperature, power, framebuffer, SM clock, tensor activity
  • DCGM-driven GPU-Hour usage reports across 1h / 24h / 7d / 30d windows
  • Alerting on DCGM XID, ECC, thermal, and framebuffer-pressure conditions
  • vGPU view mapping every physical card to its current slice holders
Plugin Management
  • Built-in Helm registry covering Volcano, DCGM Exporter, VictoriaMetrics, VictoriaLogs, Grafana, Metrics Server, kube-state-metrics
  • Per-cluster enable / disable / upgrade with the install log streamed live
  • Bring-your-own charts with per-cluster values overrides
  • The same plugin pipeline that powers customer workloads also bootstraps KPilot's own observability stack

Screenshots

Cluster Management — docs/clusters.md

Pod browser with live logs and terminal Self-rendered cluster monitoring
Cluster logging Embedded Grafana escape hatch

Compute Scheduling — docs/compute.md

Visual scheduler policy editor Queue quotas
vGPU view Volcano Job authoring

Plugin Management — docs/plugins.md

Plugin admin Plugin install

Roadmap — Model Serving

Coming in upcoming releases:

  • Model repository with curated vLLM templates for Qwen, DeepSeek, Llama, and other open-weights families
  • One-click inference deployment with a built-in chat playground
  • OpenAI-compatible routing with canary and A/B controls
  • Distributed fine-tuning on Volcano gang scheduling

About

KPilot: A Kubernetes-native GPU orchestration pilot.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors