KPilot

Unified GPU + model platform for Kubernetes.

What is KPilot

KPilot is a control plane for running GPU workloads on Kubernetes. Cluster operations, Volcano-based batch scheduling, vGPU governance, hardware telemetry, plugin lifecycle, and model serving live behind one console with a consistent permission and audit surface.

Multi-cluster is the default: a single KPilot Server manages many clusters, and the in-cluster Worker agent dials back to it over gRPC. Managed clusters open no inbound ports and share no kubeconfigs, and onboarding works the same way on every cloud.

Architecture

The Server owns the UI, API, and durable state (cluster registry, plugin metadata, accounts) but holds no kubeconfigs. A Worker runs inside each managed cluster, dials the Server over a single long-lived gRPC stream, and brokers every Kubernetes operation on the Server's behalf, so the managed cluster exposes no inbound ports and hands out no credentials. Plugins ship as Helm charts and are reconciled through an in-cluster CRD, so they execute in the cluster's own RBAC context.
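
Because the Worker only dials out, the one network requirement on a managed cluster is outbound TCP reachability to the Server's gRPC endpoint. The following is a minimal pre-flight sketch, not part of KPilot: the function name `can_reach` is hypothetical, and it assumes `bash` (for its `/dev/tcp` redirection) and coreutils `timeout` are available on the machine or Pod where it runs.

```shell
# can_reach HOST PORT
# Succeeds if a TCP connection to HOST:PORT opens within 5 seconds.
# Hypothetical helper: it proves TCP reachability only, not that a
# gRPC server is answering on the other side.
can_reach() {
  local host="$1" port="$2"
  # bash opens a TCP socket when a redirection targets /dev/tcp/HOST/PORT.
  # Running it in a child bash under `timeout` bounds the wait.
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example (replace with the address where your Server's gRPC port is exposed):
#   can_reach kpilot.example.com 9090 && echo reachable || echo blocked
```

Run it from inside the managed cluster (for example from a debug Pod), since that is where the Worker's outbound connection originates.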

Quick Start

Install the Server (control-plane cluster):

helm install kpilot oci://ghcr.io/togettoyou/charts/kpilot \
  --version 0.0.0-dev \
  --namespace kpilot-system --create-namespace \
  --set server.admin.password='<change-me>'
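
If you would rather keep the admin password out of shell history, the same override can live in a values file. A minimal sketch: the file name `kpilot-server-values.yaml` is arbitrary, and the nested keys simply mirror the `server.admin.password` path used in the `--set` flag above.

```shell
# Write the override that --set server.admin.password=... would apply.
cat > kpilot-server-values.yaml <<'EOF'
server:
  admin:
    password: "<change-me>"
EOF

# Restrict the file to the current user, since it holds a credential.
chmod 600 kpilot-server-values.yaml
```

Then replace the `--set server.admin.password=...` line of the install command with `--values kpilot-server-values.yaml`.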

Port-forward the UI and log in as kpilot with the password you set above:

kubectl -n kpilot-system port-forward svc/kpilot-server 8080:80
open http://localhost:8080

Install the Worker (each managed cluster). Create a cluster row in the UI, copy the one-time ClusterToken, then:

helm install kpilot-worker oci://ghcr.io/togettoyou/charts/kpilot \
  --version 0.0.0-dev \
  --namespace kpilot-system --create-namespace \
  --set server.enabled=false,worker.enabled=true,postgresql.enabled=false \
  --set worker.serverAddr='kpilot-server-grpc.kpilot-system.svc:9090' \
  --set worker.clusterToken='<paste-token>'

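The cluster row in the Server UI transitions to Online within a few seconds. The worker.serverAddr shown above is an in-cluster Service DNS name, so it resolves only when the Worker runs in the same cluster as the Server; for a remote cluster, point it at the address where the Server's gRPC port is exposed. Production exposure (Ingress, external Postgres, image registry mirrors) is covered in deploy/README.md.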

Use Cases

Multi-cluster GPU operations — run a single platform team across clusters in different VPCs, regions, or clouds without touching network policies.
Shared GPU tenancy — partition each card into vGPU slices and govern allocation through Volcano queues with explicit capability / guarantee / deserved policies.
GPU usage metering — produce GPU-Hour reports per node and per card straight from DCGM, then drill into hotspots from the same UI.
Self-service AI platform (roadmap) — let teams deploy inference endpoints from a model catalog and run distributed fine-tuning without writing YAML.
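
For the shared GPU tenancy case, the objects involved are ordinary Volcano resources. The sketch below assumes Volcano and volcano-vgpu-device-plugin are already installed in the cluster. The queue name, quantities, job name, and container image are placeholders, and the field layout follows the upstream Volcano Queue and Job APIs rather than anything KPilot-specific, so check it against the CRD versions you actually run.

```shell
# Write a Volcano Queue with vGPU quotas and a Job that consumes one slice.
cat > team-a-gpu.yaml <<'EOF'
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-a                      # placeholder queue name
spec:
  reclaimable: true
  capability:                       # hard ceiling for the queue
    volcano.sh/vgpu-number: 8
  deserved:                         # share the queue is entitled to under contention
    volcano.sh/vgpu-number: 4
  guarantee:                        # reserved for this queue at all times
    resource:
      volcano.sh/vgpu-number: 2
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vgpu-demo                   # placeholder job name
spec:
  schedulerName: volcano
  queue: team-a
  minAvailable: 1
  tasks:
    - name: worker
      replicas: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cuda
              image: nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder image
              command: ["nvidia-smi"]
              resources:
                limits:
                  volcano.sh/vgpu-number: 1     # one vGPU slice
                  volcano.sh/vgpu-memory: 4096  # optional: device memory per slice
                  volcano.sh/vgpu-cores: 50     # optional: percent of SM cores
EOF
```

Apply it with `kubectl apply -f team-a-gpu.yaml`. The three quota fields correspond to the capability, guarantee, and deserved policies named in the use case above.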

Key Features


Cluster Management

Multi-cluster onboarding via a single-use token; no kubeconfig sharing
Live node and workload browser covering native and custom resources
In-browser Pod logs, terminal, and per-container CPU / memory metrics
Inline YAML editor with apply / describe / delete for any resource

Compute Scheduling

Volcano gang scheduling across Queue, Job, CronJob, PodGroup, HyperNode
Fine-grained GPU sharing via volcano-vgpu-device-plugin (slot / framebuffer / SM cores)
Multi-resource queue quotas with capability, guarantee, allocated, and deserved views
Visual scheduler-policy editor for actions, tiers, and plugin parameters

GPU Observability

Per-card panels for utilization, temperature, power, framebuffer, SM clock, tensor activity
DCGM-driven GPU-Hour usage reports across 1h / 24h / 7d / 30d windows
Alerting on DCGM XID, ECC, thermal, and framebuffer-pressure conditions
vGPU view mapping every physical card to its current slice holders

Plugin Management

Built-in Helm registry covering Volcano, DCGM Exporter, VictoriaMetrics, VictoriaLogs, Grafana, Metrics Server, kube-state-metrics
Per-cluster enable / disable / upgrade with the install log streamed live
Bring-your-own charts with per-cluster values overrides
The same plugin pipeline that powers customer workloads also bootstraps KPilot's own observability stack
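
This README does not spell out how the GPU-Hour figures are derived. One common definition, shown here as a sketch, is utilization-weighted time: each utilization sample (for example from the DCGM exporter's `DCGM_FI_DEV_GPU_UTIL` metric) contributes `utilization / 100 × sampling interval` busy seconds, and the total is divided by 3600. KPilot may instead count allocated time or use another formula; the function name `gpu_hours` and the input format are hypothetical.

```shell
# gpu_hours INTERVAL_SECONDS < samples
# Reads "card utilization_percent" pairs, one sample per line, taken every
# INTERVAL_SECONDS, and prints utilization-weighted GPU-hours per card.
gpu_hours() {
  awk -v step="$1" '
    { busy[$1] += ($2 / 100) * step }   # busy seconds contributed by this sample
    END { for (c in busy) printf "%s %.2f\n", c, busy[c] / 3600 }
  ' | sort
}

# Example: two cards sampled every 900 s (15 min) over one hour.
gpu_hours 900 <<'EOF'
gpu0 100
gpu0 100
gpu0 50
gpu0 50
gpu1 0
gpu1 20
gpu1 20
gpu1 0
EOF
# prints:
#   gpu0 0.75
#   gpu1 0.10
```

A 1h, 24h, 7d, or 30d report under this definition is the same sum restricted to the samples that fall inside the window.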

Screenshots

Cluster Management — `docs/clusters.md`

Compute Scheduling — `docs/compute.md`

Plugin Management — `docs/plugins.md`

Roadmap — Model Serving

Planned for upcoming releases:

Model repository with curated vLLM templates for Qwen, DeepSeek, Llama, and other open-weights families
One-click inference deployment with a built-in chat playground
OpenAI-compatible routing with canary and A/B controls
Distributed fine-tuning on Volcano gang scheduling

