Kubernetes platform for AI/ML workloads. K3s cluster with GitOps automation.
| Component | Spec |
|---|---|
| CPU | Intel Core Ultra 9 285K (8P + 16E cores) |
| GPU | NVIDIA RTX 5080 16GB |
| RAM | 128GB DDR5-5600 |
| Storage | XFS (/cache, /data) |
Comparable to AWS g6.8xlarge (~$1,780/month).
homelab/
βββ bootstrap/ # Phase 1: Helmfile cluster init
β βββ helmfile.yaml
β βββ Justfile
β βββ releases/ # argocd, cilium, gpu-operator, infisical
βββ platform/ # Phase 2: Argo CD GitOps
β βββ gitops/ # ApplicationSet templates
β βββ stacks/ # Kustomize overlays (00-core ~ 06-labs)
βββ docs/ # Architecture docs
Two-phase deployment:
Phase 1: Bootstrap (Helmfile)
βββ K3s cluster
βββ Cilium eBPF CNI
βββ GPU Operator
βββ Infisical secrets
Phase 2: Platform (Argo CD)
βββ 00-core # cert-manager, cloudflared, tailscale
βββ 01-platforms # argo-workflows, harbor, buildkit
βββ 02-o11y # grafana, tempo, quickwit
βββ 03-data # postgres, redis, clickhouse, redpanda
βββ 04-ml # feast, mlflow, ray, qdrant
βββ 05-workloads # deepfx, mt5-trader
βββ 06-labs # jupyterhub, n8n, superset
| Layer | Tech |
|---|---|
| Cluster | K3s |
| Network | Cilium eBPF + Gateway API |
| GitOps | Argo CD (App-of-Apps) |
| Secrets | Infisical Operator |
| GPU | NVIDIA MPS + Ray |
| Observability | VictoriaMetrics, Tempo, Pyroscope, Quickwit, Grafana |
| Data | PostgreSQL, Redis, ClickHouse, Redpanda |
| ML | Feast, MLflow, Ray, Qdrant |
NVIDIA MPS splits RTX 5080 into 16 logical units (1GB each). Ray schedules workloads across hybrid cores:
- P-cores (0-7): GPU tasks, training, inference
- E-cores (8-23): Control plane, scheduling
See GPU Partitioning.
Logs: Vector β Redpanda β Quickwit β Grafana
Metrics: Prometheus β VictoriaMetrics β Grafana
Traces: OTel SDK β Alloy β Tempo β Grafana
APM: Pyroscope (service map, trace-log correlation, error analysis)
cd bootstrap
just up
just nvidia-smi
just argocd-password
export KUBECONFIG=$HOME/.kube/config.home
kubectl get pods -A
open https://argocd.home.lab| Domain | Method |
|---|---|
*.home.lab |
Tailscale + CoreDNS |
*.restack.tech |
Cloudflare Tunnel |
GitHub Actions runners (ARC) with Claude integration:
- Issue analysis and PR generation
- Code review for security and performance
- Manifest validation
Skills are loaded on-demand based on issue content keywords:
| Skill | Trigger Keywords |
|---|---|
argocd-generator |
deploy, helm, chart, λ°°ν¬ |
troubleshoot |
error, crash, pending, μλ¬ |
kubernetes-review |
review, validate, yaml, 리뷰 |
infisical-manager |
secret, env, μν¬λ¦Ώ |
Only relevant skills (max 2) are loaded per request, reducing prompt size and token usage.
- Local (Qwen): Privacy-sensitive tasks
- Cloud (Claude, Gemini): Complex reasoning
MIT