k10s: GPU-Aware Kubernetes Toolkit

Alpha. APIs will change rapidly. Consider becoming a contributor to shape the beta.

k10s is two things:

  • kitty - a DaemonSet that runs on your Kubernetes cluster and collects node-level GPU and network telemetry
  • k10s - (kittens) a TUI/CLI that shows the ML training jobs in your cluster and surfaces misbehaving ranks

The outcomes will be:

  1. Your idle / misbehaving GPUs are LOUD so you know if you are burning $$
  2. You know exactly WHY your training job is messed up (straggler ranks, OOM issues, network chokes, etc.)
  3. You see a performance profile of your custom CUDA kernels
  4. You don't have to leave your terminal

These are the problems I have, and that's the main reason I'm building this. If you have them too, join our Discord and consider becoming a contributor to help shape this tool.

Installation

Helm (recommended)

helm repo add k10s https://shvbsle.github.io/k10s
helm repo update
helm install kitty k10s/kitty

This creates the k10s namespace and deploys the kitty daemonset with GPU node tolerations out of the box. See all available values: helm show values k10s/kitty

Add one env var to your training pods and the per-rank metrics listed under Straggler Detection appear at :9100/metrics, labeled by rank:

env:
  - name: KITTY_WATCH
    value: "1"

Straggler Detection

Network

| Status | Problem | Metric | What you'll see |
|---|---|---|---|
| | Network latency | net.tcp_rtt_us | Jumps from ~2 ms to 50 ms+ on the affected rank |
| | Packet loss | net.tcp_retransmits | Counter climbs on the affected rank |
| | Bandwidth throttle | net.tcp_out_segs + net.tcp_tx_queue_bytes | Segment rate drops to zero (stall), tx_queue spikes on recovery |
| planned | NCCL transport downgrade | | |
| planned | Cross-rack routing | | |

GPU / Compute

| Status | Problem |
|---|---|
| planned | Power throttling (SW cap) |
| planned | Memory fragmentation |
| planned | Thermal throttling |
| planned | ECC errors |

CPU / Data Pipeline

| Status | Problem |
|---|---|
| planned | DataLoader starvation |
| planned | CPU contention |
| planned | Memory pressure |

See Metrics Reference for the full list and how each metric works under the hood.
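
The network rows above come down to one rank drifting away from its peers. Here is a minimal sketch of that comparison, assuming you already have one net.tcp_rtt_us reading per rank. The find_stragglers helper and the 5x-median threshold are illustrative assumptions, not how k10s itself flags stragglers:

```python
from statistics import median


def find_stragglers(rtt_us_by_rank, ratio=5.0):
    """Return ranks whose RTT exceeds `ratio` times the median RTT of all ranks.

    `rtt_us_by_rank` maps rank -> RTT in microseconds. Both the helper and the
    default ratio are hypothetical; tune the ratio to your own fabric.
    """
    if not rtt_us_by_rank:
        return []
    baseline = median(rtt_us_by_rank.values())
    return sorted(
        rank for rank, rtt in rtt_us_by_rank.items() if rtt > ratio * baseline
    )


# Three healthy ranks around 2 ms and one rank at 52 ms.
samples = {0: 2100.0, 1: 1900.0, 2: 52000.0, 3: 2000.0}
print(find_stragglers(samples))  # [2]
```

Using the median as the baseline keeps a single bad rank from dragging the reference value up, which a mean would do.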


Verify

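# Check that the kitty pods are running
kubectl get pods -n k10s
# port-forward stays in the foreground; run curl from a second terminal
kubectl port-forward -n k10s daemonset/kitty 9100:9100
curl -s http://localhost:9100/metrics | grep net.tcp_rtt_us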

If your training pods have KITTY_WATCH=1 set, you should see per-rank RTT values. If not, you'll still see GPU metrics (gpu.sm_utilization, gpu.power_draw_watts, etc.) for all GPUs on the node.
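
If you would rather script this check than eyeball the grep output, the sketch below pulls per-rank values out of the scraped text. It assumes each sample line looks like `net.tcp_rtt_us{rank="0"} 2100` (a Prometheus-style line with a `rank` label); the README does not confirm the exact output format, and per_rank_values is a hypothetical helper, so adjust the patterns to what your endpoint actually returns:

```python
import re

# Assumed line shape (not confirmed by the README):
#   metric.name{label="x",rank="N"} value [timestamp]
LINE_RE = re.compile(
    r'^(?P<name>[\w.:]+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)(?:\s+\d+)?$'
)
# Match a label named exactly `rank`, not e.g. `local_rank`.
RANK_RE = re.compile(r'(?:^|,)\s*rank="(\d+)"')


def per_rank_values(metrics_text, metric_name):
    """Map rank -> value for every sample of `metric_name` carrying a rank label."""
    values = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        match = LINE_RE.match(line)
        if match is None or match.group("name") != metric_name:
            continue
        rank = RANK_RE.search(match.group("labels"))
        if rank is None:
            continue  # sample without a rank label
        values[int(rank.group(1))] = float(match.group("value"))
    return values


# Made-up sample text in the assumed format, not real kitty output.
sample = (
    'net.tcp_rtt_us{rank="0",pod="a"} 2100\n'
    'net.tcp_retransmits{rank="0"} 3\n'
    'net.tcp_rtt_us{pod="b",rank="1"} 52000\n'
)
print(per_rank_values(sample, "net.tcp_rtt_us"))  # {0: 2100.0, 1: 52000.0}
```

To run it against a live cluster, pipe the output of the curl command above into a file and pass its contents as metrics_text.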

Project Structure

src/crates/
├── kitty/         # daemonset agent (GPU, network, eBPF collectors)
├── kitty-ebpf/    # BPF kernel program (tcp_rtt_estimator kprobe)
├── tui/           # k10s terminal UI
└── e2e/           # end-to-end tests
charts/
└── kitty/         # Helm chart

Documentation

License

Apache 2.0. See LICENSE for details.


What happened to the Go version?

There was a vibe-coded Go version of the TUI. It is still available here: https://github.com/shvbsle/k10s/tree/archive/go-v0.4.0

It became unmaintainable, so I archived that branch and decided to hand-write the TUI again from scratch in Rust. I talk more about it here: Blog: I'm going back to writing code by hand
