Alpha. APIs will change rapidly. Consider becoming a contributor to shape the beta.
k10s is two things:
- kitty - a DaemonSet that runs on your Kubernetes cluster and collects node-level GPU and network telemetry
- k10s - (kittens) a TUI/CLI that shows the ML training jobs in your cluster and surfaces ranks that are misbehaving
The outcomes will be:
- Your idle / misbehaving GPUs are LOUD so you know if you are burning $$
- You know exactly WHY your training job is messed up (straggler ranks, OOM issues, network chokes, etc.)
- You see a performance profile of your custom CUDA kernels
- You don't have to leave your terminal
These are the problems that I have, and that's the main reason to build this. If you also have these problems, join our Discord and consider becoming a contributor to shape this tool.
helm repo add k10s https://shvbsle.github.io/k10s
helm repo update
helm install kitty k10s/kitty
This creates the k10s namespace and deploys the kitty daemonset with GPU node tolerations out of the box. See all available values: helm show values k10s/kitty
Add one env var to your training pods and these metrics appear at :9100/metrics, labeled by rank:
env:
  - name: KITTY_WATCH
    value: "1"

| Status | Problem | Metric | What you'll see |
|---|---|---|---|
| ✅ | Network latency | net.tcp_rtt_us | Jumps from ~2ms to 50ms+ on the affected rank |
| ✅ | Packet loss | net.tcp_retransmits | Counter climbs on the affected rank |
| ✅ | Bandwidth throttle | net.tcp_out_segs + net.tcp_tx_queue_bytes | Segment rate drops to zero (stall), tx_queue spikes on recovery |
| planned | NCCL transport downgrade | | |
| planned | Cross-rack routing | | |
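To make the three implemented rows concrete, here is a rough sketch of how two consecutive scrapes for one rank could be turned into the symptoms listed in the table. Everything in it is illustrative: the `Sample` type, the `diagnose` function, and the 50 ms threshold are hypothetical and not part of kitty, and it assumes net.tcp_retransmits and net.tcp_out_segs are cumulative counters.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One scrape of a rank's network metrics (hypothetical type, not a kitty API)."""
    t: float          # scrape time in seconds
    rtt_us: float     # net.tcp_rtt_us
    retransmits: int  # net.tcp_retransmits (assumed cumulative counter)
    out_segs: int     # net.tcp_out_segs (assumed cumulative counter)


def diagnose(prev: Sample, cur: Sample, rtt_limit_us: float = 50_000.0) -> list[str]:
    """Return the table symptoms visible between two scrapes of the same rank."""
    dt = cur.t - prev.t
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    symptoms = []
    if cur.rtt_us >= rtt_limit_us:             # RTT jumped to 50ms+
        symptoms.append("network latency")
    if cur.retransmits > prev.retransmits:     # retransmit counter climbed
        symptoms.append("packet loss")
    if (cur.out_segs - prev.out_segs) / dt == 0:  # segment rate dropped to zero
        symptoms.append("bandwidth stall")
    return symptoms


healthy = diagnose(Sample(0, 2_000, 10, 1_000), Sample(5, 2_100, 10, 9_000))
stalled = diagnose(Sample(0, 2_000, 10, 1_000), Sample(5, 60_000, 14, 1_000))
print(healthy)
print(stalled)
```

A real check would likely compare each rank against its peers rather than a fixed threshold, since a healthy baseline RTT depends on the cluster.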
| Status | Problem |
|---|---|
| planned | Power throttling (SW cap) |
| planned | Memory fragmentation |
| planned | Thermal throttling |
| planned | ECC errors |
| Status | Problem |
|---|---|
| planned | DataLoader starvation |
| planned | CPU contention |
| planned | Memory pressure |
See Metrics Reference for the full list and how each metric works under the hood.
kubectl get pods -n k10s
kubectl port-forward -n k10s daemonset/kitty 9100:9100
curl -s http://localhost:9100/metrics | grep net.tcp_rtt_us
If your training pods have KITTY_WATCH=1 set, you should see per-rank RTT values. If not, you'll still see GPU metrics (gpu.sm_utilization, gpu.power_draw_watts, etc.) for all GPUs on the node.
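If you want more than a grep, the per-rank values can be pulled out with a few lines of Python. This is only a sketch: it assumes the endpoint emits Prometheus-style text lines of the form `name{label="value"} number` and that the rank is carried in a label called `rank`. The sample text, the `pod` label, and the `rtt_by_rank` helper are made up for illustration, so check the real output before relying on it.

```python
import re

# Hypothetical sample; kitty's real output format and label names may differ.
SAMPLE = '''\
# TYPE net.tcp_rtt_us gauge
net.tcp_rtt_us{rank="0",pod="train-0"} 2100
net.tcp_rtt_us{rank="1",pod="train-1"} 51234
gpu.sm_utilization{gpu="0"} 97
'''

# name{labels} value, with dots allowed in the metric name
LINE = re.compile(r'^(?P<name>[\w.:]+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def rtt_by_rank(text: str, metric: str = "net.tcp_rtt_us") -> dict[int, float]:
    """Map rank -> RTT in microseconds for every matching line that has a rank label."""
    out = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m or m.group("name") != metric:
            continue  # comments, other metrics, unlabeled lines
        labels = dict(LABEL.findall(m.group("labels")))
        if "rank" in labels:
            out[int(labels["rank"])] = float(m.group("value"))
    return out


rtts = rtt_by_rank(SAMPLE)
slow = sorted(rank for rank, us in rtts.items() if us >= 50_000)  # 50ms+
print(rtts, slow)
```

To run it against a live node, replace `SAMPLE` with the text returned by the curl command above.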
src/crates/
├── kitty/ # daemonset agent (GPU, network, eBPF collectors)
├── kitty-ebpf/ # BPF kernel program (tcp_rtt_estimator kprobe)
├── tui/ # tui duh
└── e2e/ # End-to-end tests
charts/
└── kitty/ # Helm chart
- Metrics Reference — full list of metrics emitted by kitty and the roadmap
Apache 2.0. See LICENSE for details.
There was a vibe-coded Go version of the TUI. It is still available here: https://github.com/shvbsle/k10s/tree/archive/go-v0.4.0
It became unmaintainable, so I archived that branch and decided to hand-write the TUI again from scratch in Rust. I talk more about it in the blog post: I'm going back to writing code by hand