k10s: GPU-Aware Kubernetes Toolkit

Alpha. APIs will change rapidly. Consider becoming a contributor to shape the beta.

k10s is two things:

kitty - a DaemonSet that runs on your Kubernetes cluster and collects node-level GPU and network telemetry
k10s - (kittens) a TUI/CLI that shows the ML training jobs in your cluster and surfaces misbehaving ranks

The outcomes will be:

Your idle / misbehaving GPUs are LOUD so you know if you are burning $$
You know exactly WHY your training job is messed up (straggler ranks, OOM issues, network chokes, etc.)
You see a performance profile of your custom CUDA kernels
You don't have to leave your terminal

These are the problems that I have, and that's the main reason I'm building this. If you have them too, join our Discord and consider becoming a contributor to help shape this tool.

Installation

Helm (recommended)

helm repo add k10s https://shvbsle.github.io/k10s
helm repo update
helm install kitty k10s/kitty

This creates the k10s namespace and deploys the kitty daemonset with GPU node tolerations out of the box. See all available values: helm show values k10s/kitty

Add one env var to your training pods and the straggler-detection metrics below appear at :9100/metrics, labeled by rank:

env:
  - name: KITTY_WATCH
    value: "1"

Straggler Detection

Network

Status	Problem	Metric	What you'll see
✅	Network latency	`net.tcp_rtt_us`	Jumps from ~2ms to 50ms+ on the affected rank
✅	Packet loss	`net.tcp_retransmits`	Counter climbs on the affected rank
✅	Bandwidth throttle	`net.tcp_out_segs` + `net.tcp_tx_queue_bytes`	Segment rate drops to zero (stall), tx_queue spikes on recovery
planned	NCCL transport downgrade
planned	Cross-rack routing
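As a rough illustration of how the RTT signal in the table above could be turned into a straggler flag, the sketch below compares each rank's RTT against the median across ranks. The function name `find_stragglers`, the 5x threshold, and the sample values are assumptions made for this example; they are not taken from k10s itself.

```python
from statistics import median


def find_stragglers(rtt_us_by_rank, factor=5.0):
    """Return ranks whose RTT exceeds `factor` times the median RTT.

    `rtt_us_by_rank` maps rank -> RTT in microseconds.
    `factor` is a hypothetical default, not a k10s setting.
    """
    if not rtt_us_by_rank:
        return []
    baseline = median(rtt_us_by_rank.values())
    if baseline <= 0:
        return []
    return sorted(
        rank for rank, rtt in rtt_us_by_rank.items() if rtt > factor * baseline
    )


# Hypothetical sample: rank 2 jumps from ~2 ms to 50 ms.
sample = {0: 2000, 1: 2100, 2: 50000, 3: 1900}
print(find_stragglers(sample))
```

Using the median as the baseline keeps a single slow rank from dragging the reference value up, which a plain mean would do.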

GPU / Compute

Status	Problem
planned	Power throttling (SW cap)
planned	Memory fragmentation
planned	Thermal throttling
planned	ECC errors

CPU / Data Pipeline

Status	Problem
planned	DataLoader starvation
planned	CPU contention
planned	Memory pressure

See Metrics Reference for the full list and how each metric works under the hood.

Verify

kubectl get pods -n k10s
kubectl port-forward -n k10s daemonset/kitty 9100:9100
curl -s http://localhost:9100/metrics | grep net.tcp_rtt_us

If your training pods have KITTY_WATCH=1 set, you should see per-rank RTT values. If not, you'll still see GPU metrics (gpu.sm_utilization, gpu.power_draw_watts, etc.) for all GPUs on the node.
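If you want to go past grep, here is a minimal sketch for pulling per-rank values out of the /metrics output. It assumes a Prometheus-style text layout (`name{label="value"} number`) and a label literally called `rank`; both are assumptions about the output format, so check them against what your endpoint actually returns. The helper name `per_rank_values` is made up for this example.

```python
import re

# Matches lines like: net.tcp_rtt_us{rank="3"} 2150
# The line layout and the `rank` label name are assumptions.
_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:.][\w:.]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def per_rank_values(metrics_text, metric_name):
    """Collect {rank: value} for one metric from /metrics text."""
    values = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        m = _LINE.match(line)
        if not m or m.group("name") != metric_name:
            continue
        labels = dict(_LABEL.findall(m.group("labels") or ""))
        if "rank" in labels:
            values[int(labels["rank"])] = float(m.group("value"))
    return values


# Hypothetical sample of what the endpoint might return.
sample = '''
# TYPE net.tcp_rtt_us gauge
net.tcp_rtt_us{rank="0"} 2000
net.tcp_rtt_us{rank="1"} 51000
gpu.sm_utilization{gpu="0"} 97
'''
print(per_rank_values(sample, "net.tcp_rtt_us"))
```

To run it against a live node, replace `sample` with the body returned by the curl command above, for example by reading it from `sys.stdin`.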

Project Structure

src/crates/
├── kitty/         # daemonset agent (GPU, network, eBPF collectors)
├── kitty-ebpf/    # BPF kernel program (tcp_rtt_estimator kprobe)
├── tui/           # tui duh
└── e2e/           # End-to-end tests
charts/
└── kitty/         # Helm chart

Documentation

Metrics Reference — full list of metrics emitted by kitty and the roadmap

License

Apache 2.0. See LICENSE for details.

What happened to the Go version?

There was a vibe-coded Go version of the TUI. It is still available here: https://github.com/shvbsle/k10s/tree/archive/go-v0.4.0

It became unmaintainable, so I archived that branch and decided to hand-write the TUI from scratch in Rust. I talk more about it here: Blog: I'm going back to writing code by hand
