chore: sync develop into main (k8s monitoring stack — full Rust observability)#101
Merged
Adds an end-to-end observability stack deployable in a single
helm install kxn-stack:
- kxn-stack umbrella chart (deploy/helm/kxn-stack/)
  - standalone PostgreSQL (image postgres:18-alpine, password fixed in
    values so secret + initdb stay in sync; no Bitnami chart sync bugs)
  - schema-init Job (post-install hook, 16 tables, idempotent)
  - kxn-monitor sub-chart (compliance + Discord/Slack/Teams alerts, save→pg)
  - grafana sub-chart with 20 kxn dashboards via sidecar
  - 6 collectors (CronJobs): pod_resource, disk_usage, k8s_jobs,
    netpol_coverage, tls_certs, logs (sample Python embedded in ConfigMap)
  - extended ClusterRole for collectors (nodes/proxy, pods/log,
    networkpolicies)
  - webhook secret auto-created with placeholder URL if not provided
- Standalone helm charts for users who want only one component:
  - deploy/helm/kxn-monitor: compliance + alerts only
  - deploy/helm/kxn-logs: pod log forwarding to Loki
- 20 anonymized Grafana dashboards (deploy/examples/grafana/):
  cluster, k8s-pods, namespaces, compliance, errors, postgres, security,
  disks, certs, netpolicies, ingress, traefik (×3), tenants, top-consumers,
  backups, k8s-system, cloud-logs, pod-logs (with namespace/pod/level
  selectors), + README
- Production-ready TOML rules (deploy/examples/rules/):
  cluster-health.toml, postgres-app-rules.toml, README
- Claude Code skill (skills/kxn-scan/SKILL.md): drop into ~/.claude/skills/
  for natural-language driving of kxn
- End-to-end cookbook (docs/cookbook-k8s-monitoring.md): deploy on any
  Kubernetes cluster in 10 minutes
Tested reproducibly on a fresh AKS cluster: helm install kxn-stack with
no patches yields 16 tables + collectors writing data + dashboards
populated within ~3 minutes.
- New collectors/logs-collector.sh that polls pod logs every minute,
classifies severity, and inserts into the logs table with tags->>
{namespace,pod,node,container} so the Pod / Namespace / k8s-pods
dashboards can group by these fields.
- Wire schedules.logs in values.yaml and the collectors template.
- Pod logs explorer dashboard: replace TimescaleDB-only time_bucket()
with date_trunc('minute', time) for stock PostgreSQL compatibility,
and switch the Recent log lines panel to a table with namespace/pod
columns so the user can spot which pod produced which line.
- Fix kxn-cloud-logs uid (had a literal space, broke API import).
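The commit message says the logs collector "classifies severity" but does not spell out the rules. A minimal sketch of what a keyword-based classifier could look like is shown here; the `Severity` enum, the keyword list, and the priority order are all assumptions made for illustration, not the collector's actual logic.

```rust
/// Severity levels; the names are illustrative, not taken from the collector.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Error,
    Warn,
    Info,
    Debug,
}

/// Classify a raw log line by keyword, checked in priority order.
/// The keywords are hypothetical; a naive substring match like this will
/// also flag lines such as "no errors found" as errors.
pub fn classify_severity(line: &str) -> Severity {
    let lower = line.to_lowercase();
    if ["error", "fatal", "panic"].iter().any(|k| lower.contains(k)) {
        Severity::Error
    } else if lower.contains("warn") {
        Severity::Warn
    } else if lower.contains("debug") || lower.contains("trace") {
        Severity::Debug
    } else {
        // Anything without a recognised keyword falls back to Info.
        Severity::Info
    }
}
```

Substring matching keeps the sketch short; a real collector would more likely parse a structured level field when one is present.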
…hema
The dashboard used the metric names from an older RTK build:
  active_connections -> connections_active
  database_size_mb -> total_size_bytes (divided by 1024^2 in panel)
Also retire the database_size series from the trends timeseries since
the unit no longer matches; replace it with transactions_committed,
which is more useful as a 'is the DB doing anything' line.
The chart shipped 6 Python CronJobs as a stopgap to fill secondary
tables (pod_resource, disk_usage, k8s_jobs, netpol_coverage, tls_certs,
logs). Production runs only the Rust binary, so the chart should too.
Changes:
- Remove templates/collectors.yaml and files/collectors/*.sh.
- Rename collectors-rbac.yaml -> extra-rbac.yaml; rebind to the
kxn-monitor ServiceAccount so the Rust watchers keep nodes/proxy,
nodes/stats, secrets, pods/log, NetworkPolicies, Jobs/CronJobs,
Deployments and friends.
- Add templates/kxn-logs.yaml: a Deployment that runs
'kxn logs kubernetes://in-cluster --stream' (mirrors the kxn-logs
deployment running on prod), batching pod logs into the logs table.
- values.yaml: drop the 'collectors' block, introduce 'logs' (image,
stream batch settings, resources).
- NOTES.txt updated to reflect the new component list.
Tables previously filled by Python CronJobs are now expected to come
from the Rust binary (kxn watch kubernetes://). Probes still missing
from the Rust K8s provider (pod_resource, disk_usage, k8s_jobs,
netpol_coverage, tls_certs) are tracked separately and will be added
incrementally; the Python stopgap is gone.
New gather_tls_certs() in the native Kubernetes provider. Lists Secrets
of type kubernetes.io/tls, base64-decodes tls.crt, parses the leaf
certificate via x509-parser, and emits one JSON value per cert with:
  name, namespace, common_name, issuer, not_before, not_after,
  expires_in_days, expired
The cert PEM itself is never returned, only the fields a monitoring
dashboard needs to track expiry.

Replaces what the old Python certs-collector CronJob was doing (scraping
secrets, calling openssl, INSERTing into a side table). This moves it
back into the Rust binary, registered as a regular kxn resource_type, so
it goes through the standard scan/save pipeline.
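The two derived fields, expires_in_days and expired, can be computed from not_after alone. The sketch below assumes Unix timestamps in seconds and floors partial days; the `CertExpiry` struct, the `cert_expiry` helper, and the rounding direction are assumptions, since the commit message only names the output fields.

```rust
/// Expiry fields derived from a certificate's validity window.
/// Field names follow the commit message; the struct itself is hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub struct CertExpiry {
    pub expires_in_days: i64,
    pub expired: bool,
}

const SECONDS_PER_DAY: i64 = 86_400;

/// `not_after` and `now` are Unix timestamps in seconds (an assumption).
pub fn cert_expiry(not_after: i64, now: i64) -> CertExpiry {
    let remaining = not_after - now;
    CertExpiry {
        // div_euclid floors for a positive divisor, so a certificate that
        // expired one hour ago reports -1 rather than 0.
        expires_in_days: remaining.div_euclid(SECONDS_PER_DAY),
        expired: remaining <= 0,
    }
}
```

Flooring means a dashboard threshold such as "fewer than 14 days left" triggers slightly early rather than slightly late.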
New gather_pod_resource() returns one JSON row per (namespace, pod,
container) with cpu_millicores + memory_mib + timestamp. Dashboards that
group by namespace/pod/container no longer have to flatten the nested
'containers' array from pod_metrics.

Same upstream source (metrics.k8s.io/v1beta1/pods), different shape: flat
instead of nested. Replaces the Python pod-resource-collector CronJob.
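The nested-to-flat reshaping described above can be sketched as follows. The struct names and the `flatten_pod_metrics` helper are hypothetical and the input is assumed to be already parsed with values already converted to millicores and MiB; only the output field names come from the commit message.

```rust
/// One container entry inside a pod's metrics (nested input shape).
#[derive(Debug, Clone)]
pub struct ContainerUsage {
    pub name: String,
    pub cpu_millicores: u64,
    pub memory_mib: u64,
}

/// Pod-level metrics with a nested `containers` array.
#[derive(Debug, Clone)]
pub struct PodMetrics {
    pub namespace: String,
    pub pod: String,
    pub timestamp: String,
    pub containers: Vec<ContainerUsage>,
}

/// Flat output row: one per (namespace, pod, container).
#[derive(Debug, Clone, PartialEq)]
pub struct PodResourceRow {
    pub namespace: String,
    pub pod: String,
    pub container: String,
    pub cpu_millicores: u64,
    pub memory_mib: u64,
    pub timestamp: String,
}

/// Expand every pod into one row per container, copying the pod-level
/// fields onto each row.
pub fn flatten_pod_metrics(pods: &[PodMetrics]) -> Vec<PodResourceRow> {
    pods.iter()
        .flat_map(|p| {
            p.containers.iter().map(move |c| PodResourceRow {
                namespace: p.namespace.clone(),
                pod: p.pod.clone(),
                container: c.name.clone(),
                cpu_millicores: c.cpu_millicores,
                memory_mib: c.memory_mib,
                timestamp: p.timestamp.clone(),
            })
        })
        .collect()
}
```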
New gather_k8s_jobs() returns one JSON row per Job with derived timing
fields:
- state: active|succeeded|failed|unknown
- duration_seconds: completion_time - start_time
- age_seconds: now - start_time
- last_success_age_seconds / last_failure_age_seconds (for stale-job
alerts)
- owner_kind/owner_name: CronJob this Job belongs to (when applicable)
Replaces what the Python jobs-collector CronJob computed externally.
Same upstream source as the existing 'jobs' resource_type
(batch/v1/jobs), different shape — purpose-built for monitoring
dashboards instead of compliance scans.
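The derived fields listed above reduce to a few lines of arithmetic. In this sketch the `JobStatus` struct, the use of Unix seconds, and the precedence used to pick a state (active before succeeded before failed) are assumptions; the duration and age formulas are the ones given in the list.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobState {
    Active,
    Succeeded,
    Failed,
    Unknown,
}

/// Subset of a Job's status, with times as Unix seconds (hypothetical shape).
#[derive(Debug, Clone)]
pub struct JobStatus {
    pub active: u32,
    pub succeeded: u32,
    pub failed: u32,
    pub start_time: Option<i64>,
    pub completion_time: Option<i64>,
}

/// Pick a single state; the precedence order is an assumption.
pub fn job_state(s: &JobStatus) -> JobState {
    if s.active > 0 {
        JobState::Active
    } else if s.succeeded > 0 {
        JobState::Succeeded
    } else if s.failed > 0 {
        JobState::Failed
    } else {
        JobState::Unknown
    }
}

/// duration_seconds = completion_time - start_time, when both are known.
pub fn duration_seconds(s: &JobStatus) -> Option<i64> {
    match (s.start_time, s.completion_time) {
        (Some(start), Some(end)) => Some(end - start),
        _ => None,
    }
}

/// age_seconds = now - start_time, when the Job has started.
pub fn age_seconds(s: &JobStatus, now: i64) -> Option<i64> {
    s.start_time.map(|start| now - start)
}
```

Returning `Option` keeps a Job that has not started yet from showing up with a misleading zero duration.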
New gather_netpol_coverage() computes per-namespace NetworkPolicy
coverage: for each namespace, list NetworkPolicies and Pods, then for
each pod check whether at least one NP podSelector matches its labels.
Emits one row per namespace with:
  pods_total, pods_covered, coverage_pct, pods_uncovered

A pod with no targeting NP shows up in pods_uncovered, which lets
dashboards highlight workloads that fall through the network policy net.
Uncovered pods are typically a finding worth alerting on in zero-trust
setups.

MVP only honours podSelector.matchLabels (and the empty-selector case
which matches every pod). matchExpressions support is left for a
follow-up iteration.

Replaces the Python netpol-collector CronJob.
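The matchLabels-only matching and the per-namespace roll-up can be sketched as follows. The `Labels` alias, the `Coverage` struct, and both helper names are hypothetical, and reporting 0.0 for a namespace with no pods is an assumption; the commit message does not say how that case is handled.

```rust
use std::collections::BTreeMap;

pub type Labels = BTreeMap<String, String>;

/// matchLabels semantics: every selector pair must be present on the pod.
/// An empty selector has no pairs to check, so it matches every pod.
pub fn selector_matches(match_labels: &Labels, pod_labels: &Labels) -> bool {
    match_labels
        .iter()
        .all(|(k, v)| pod_labels.get(k) == Some(v))
}

/// One output row per namespace (field names from the commit message).
#[derive(Debug, Clone, PartialEq)]
pub struct Coverage {
    pub pods_total: usize,
    pub pods_covered: usize,
    pub coverage_pct: f64,
    pub pods_uncovered: Vec<String>,
}

/// `policies` holds the podSelector.matchLabels of each NetworkPolicy in the
/// namespace; `pods` holds (pod name, pod labels) pairs.
pub fn namespace_coverage(policies: &[Labels], pods: &[(String, Labels)]) -> Coverage {
    let mut pods_uncovered = Vec::new();
    for (name, labels) in pods {
        if !policies.iter().any(|sel| selector_matches(sel, labels)) {
            pods_uncovered.push(name.clone());
        }
    }
    let pods_total = pods.len();
    let pods_covered = pods_total - pods_uncovered.len();
    // Assumption: an empty namespace reports 0.0 instead of dividing by zero.
    let coverage_pct = if pods_total == 0 {
        0.0
    } else {
        pods_covered as f64 * 100.0 / pods_total as f64
    };
    Coverage { pods_total, pods_covered, coverage_pct, pods_uncovered }
}
```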
New gather_disk_usage() walks every node and calls the kubelet stats
summary endpoint
(/api/v1/nodes/{name}/proxy/stats/summary) to expose:
- kind=node, fs_kind=root : node root filesystem
- kind=node, fs_kind=image : container image filesystem
- kind=pvc : per-pod PVC volume stats
Each row carries capacity_bytes, used_bytes, available_bytes,
used_pct. Replaces the Python disk-usage-collector CronJob.
Requires the kxn ServiceAccount to hold nodes/proxy + nodes/stats
verbs (already granted in deploy/helm/kxn-stack/templates/extra-rbac.yaml).
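A sketch of the per-row shape and the proxy path follows. The path template is the one quoted above; the `DiskUsageRow` struct, the helper names, and the choice to compute used_pct as used_bytes / capacity_bytes are assumptions, since the commit message does not state the formula.

```rust
/// Build the kubelet stats summary path for one node, as proxied by the
/// API server.
pub fn stats_summary_path(node_name: &str) -> String {
    format!("/api/v1/nodes/{}/proxy/stats/summary", node_name)
}

/// One emitted row; field names follow the commit message.
#[derive(Debug, Clone, PartialEq)]
pub struct DiskUsageRow {
    pub kind: String,            // "node" or "pvc"
    pub fs_kind: Option<String>, // "root" or "image" for node rows
    pub capacity_bytes: u64,
    pub used_bytes: u64,
    pub available_bytes: u64,
    pub used_pct: f64,
}

/// Assemble a row and derive used_pct (assumed to be used / capacity).
pub fn disk_usage_row(
    kind: &str,
    fs_kind: Option<&str>,
    capacity_bytes: u64,
    used_bytes: u64,
    available_bytes: u64,
) -> DiskUsageRow {
    // Guard against a zero capacity so the percentage is never NaN.
    let used_pct = if capacity_bytes == 0 {
        0.0
    } else {
        used_bytes as f64 * 100.0 / capacity_bytes as f64
    };
    DiskUsageRow {
        kind: kind.to_string(),
        fs_kind: fs_kind.map(str::to_string),
        capacity_bytes,
        used_bytes,
        available_bytes,
        used_pct,
    }
}
```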
flatten_gathered() skipped any resource_type not listed in
METRIC_RESOURCE_TYPES, which meant the new monitoring probes (tls_certs,
pod_resource, k8s_jobs, netpol_coverage, disk_usage) produced 0 rows in
the metrics table on every cycle: dashboards saw no continuous data
unless a rule actually fired a violation.

Add the 5 probes to the whitelist so each gather() output now generates
one MetricRecord per JSON field per item, on every watch cycle,
regardless of rule outcome.
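The whitelist behaviour can be illustrated with a simplified model. The names flatten_gathered, METRIC_RESOURCE_TYPES, and MetricRecord come from the commit message, but the signature, the record fields, and the use of plain numeric maps instead of JSON values are assumptions made to keep the sketch dependency-free.

```rust
use std::collections::BTreeMap;

/// Resource types allowed to emit metric rows; here only the five probes
/// named in the commit message.
const METRIC_RESOURCE_TYPES: &[&str] = &[
    "tls_certs",
    "pod_resource",
    "k8s_jobs",
    "netpol_coverage",
    "disk_usage",
];

/// Simplified stand-ins: one gathered item is a map of numeric fields.
pub type Item = BTreeMap<String, f64>;
pub type Gathered = BTreeMap<String, Vec<Item>>;

#[derive(Debug, Clone, PartialEq)]
pub struct MetricRecord {
    pub resource_type: String,
    pub field: String,
    pub value: f64,
}

/// Emit one record per field per item, skipping any resource_type that is
/// not whitelisted.
pub fn flatten_gathered(gathered: &Gathered) -> Vec<MetricRecord> {
    let mut out = Vec::new();
    for (resource_type, items) in gathered {
        if !METRIC_RESOURCE_TYPES.contains(&resource_type.as_str()) {
            continue; // the case that left dashboards empty before the fix
        }
        for item in items {
            for (field, value) in item {
                out.push(MetricRecord {
                    resource_type: resource_type.clone(),
                    field: field.clone(),
                    value: *value,
                });
            }
        }
    }
    out
}
```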
Add backend.type switch in values.yaml so the same chart can deploy
different storage tiers without forking templates:
- postgres : default. Postgres StatefulSet + schema-init Job + the
existing 19 SQL Grafana dashboards. No behavior change for
the existing 'helm install kxn-stack' path.
- loki : single-binary Loki StatefulSet with filesystem storage.
kxn save pushes via /loki/api/v1/push, grafana datasource
points at it, sql dashboards are inert (use LogQL ones
instead). New templates/loki.yaml + values.yaml [loki].
- mongo : single-replica MongoDB StatefulSet with a kxn user
bootstrapped via post-install Job. New templates/mongo.yaml
+ values.yaml [mongo].
Other moves:
- templates/postgres.yaml + schema-init-job.yaml gated on
eq backend.type 'postgres'.
- templates/kxn-logs.yaml renders the right [[save]] block + env vars
based on backend.type.
- templates/grafana-datasource.yaml: per-backend datasource ConfigMap
picked up by the grafana sidecar (label grafana_datasource=1) so the
sub-chart values.yaml stays static.
- templates/grafana-deps-placeholder.yaml: when backend != postgres,
create a placeholder kxn-stack-postgresql Secret so the grafana
sub-chart's POSTGRES_PASSWORD env reference still resolves and the
pod boots cleanly.
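The gating rules above live in Helm templates, but the decision logic can be modelled in a few lines. The sketch below is only a model of that logic: the `Backend` enum and its methods are hypothetical and are not part of the chart or of kxn.

```rust
use std::str::FromStr;

/// The three values accepted by backend.type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Postgres,
    Loki,
    Mongo,
}

impl FromStr for Backend {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "postgres" => Ok(Backend::Postgres),
            "loki" => Ok(Backend::Loki),
            "mongo" => Ok(Backend::Mongo),
            other => Err(format!("unknown backend.type: {}", other)),
        }
    }
}

impl Backend {
    /// postgres.yaml and schema-init-job.yaml render only for postgres.
    pub fn renders_postgres(self) -> bool {
        self == Backend::Postgres
    }

    /// The placeholder kxn-stack-postgresql Secret is needed whenever the
    /// real PostgreSQL Secret is not rendered.
    pub fn needs_placeholder_secret(self) -> bool {
        self != Backend::Postgres
    }
}
```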
Validated end-to-end on AKS: helm install kxn-stack -f
values-loki-override.yaml renders 4 pods (grafana, kxn-monitor,
kxn-logs, loki), kxn metric stream pushed continuously into Loki
(100 lines / 3 min).
Adds the resource_types that were missing for a complete monitoring +
compliance picture of a Kubernetes cluster:
  endpoints / endpoint_slices  : service connectivity layer; flag
                                 services that have no backing pods
  replicasets                  : the deploy -> rs -> pod chain so stale
                                 rollouts can be spotted
  leases                       : controller leader-election state, with
                                 renew_age_seconds for split-brain alerts
  csi_drivers / csi_nodes /
  volume_attachments           : storage-layer health, including stuck
                                 attach/detach errors
  certificate_signing_requests : pending CSRs (kubelet bootstrap stuck)
  runtime_classes              : non-default runtimes (gVisor, kata)
  pod_restarts                 : per-container restart counts +
                                 last_termination_reason + oom_killed
                                 (continuous monitoring metric)
  pod_status_phase             : counts by namespace and phase
                                 (Running/Pending/Failed/Succeeded)
  container_oom_kills          : OOMKill events rolled up per namespace
All 9 are added to METRIC_RESOURCE_TYPES so flatten_gathered() emits
rows in the metrics table on every watch cycle, regardless of rule
outcome — same fix path as the previous 5 probes.
extra-rbac.yaml also picks up the new API groups: discovery.k8s.io,
coordination.k8s.io, storage.k8s.io (csidrivers/csinodes/volumeattachments),
certificates.k8s.io, node.k8s.io.
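As an illustration of the roll-up style used by pod_status_phase, the sketch below counts pods by (namespace, phase). The function name and the tuple-based input are assumptions; the actual probe reads Pod objects from the API server.

```rust
use std::collections::BTreeMap;

/// Count pods per (namespace, phase). Input pairs are (namespace, phase),
/// for example ("default", "Running").
pub fn count_by_namespace_and_phase(
    pods: &[(&str, &str)],
) -> BTreeMap<(String, String), usize> {
    let mut counts = BTreeMap::new();
    for (namespace, phase) in pods {
        // Insert a zero on first sight, then increment.
        *counts
            .entry((namespace.to_string(), phase.to_string()))
            .or_insert(0) += 1;
    }
    counts
}
```

A `BTreeMap` keeps the output ordered by namespace and then phase, which makes the emitted rows stable from one watch cycle to the next.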
Sprint 1: APF + admission + discovery (6 probes)
  flow_schemas, priority_level_configurations,
  validating_admission_policies, validating_admission_policy_bindings,
  server_version, api_resources_summary

Sprint 2: fine container/node metrics (2 probes)
  container_network (rx/tx/errors per pod via kubelet stats summary)
  node_runtime (pods used vs allocatable, image fs usage)

Sprint 3: top 5 operator CRDs (8 probes, gracefully no-op if CRD absent)
  istio_virtual_services, istio_gateways
  argo_applications, argo_workflows
  cert_manager_certificates
  prometheus_rules, service_monitors
  flux_kustomizations

Sprint 4: synthetic DNS health (1 probe)
  dns_health: resolve kubernetes.default.svc.cluster.local from the kxn
  pod, emit resolved/latency_ms/error so dashboards can plot DNS
  availability of the in-cluster stub

All 18 added to METRIC_RESOURCE_TYPES so they emit continuous rows in the
metrics table on every cycle. RBAC extended with flowcontrol +
admissionregistration API groups (CRD probes inherit core RBAC and return
[] silently when the CRD is not installed).
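A synthetic DNS probe of the kind described for dns_health can be sketched with the standard library alone. The output fields resolved, latency_ms, and error come from the commit message; the `DnsHealth` struct, both function names, and the port number are assumptions (std's resolver API requires a port, and 443 is an arbitrary choice that is never connected to).

```rust
use std::net::ToSocketAddrs;
use std::time::Instant;

#[derive(Debug, Clone, PartialEq)]
pub struct DnsHealth {
    pub resolved: bool,
    pub latency_ms: u128,
    pub error: Option<String>,
}

/// Time an arbitrary resolver. The closure returns how many addresses it
/// found, which lets the timing logic be exercised without a network.
pub fn dns_health_with<F>(resolve: F) -> DnsHealth
where
    F: FnOnce() -> Result<usize, String>,
{
    let started = Instant::now();
    let outcome = resolve();
    let latency_ms = started.elapsed().as_millis();
    match outcome {
        Ok(count) => DnsHealth { resolved: count > 0, latency_ms, error: None },
        Err(e) => DnsHealth { resolved: false, latency_ms, error: Some(e) },
    }
}

/// Resolve `host` through the system resolver (the in-cluster stub when
/// run inside a pod).
pub fn dns_health(host: &str) -> DnsHealth {
    dns_health_with(|| {
        (host, 443u16)
            .to_socket_addrs()
            .map(|addrs| addrs.count())
            .map_err(|e| e.to_string())
    })
}
```

Inside a pod, calling `dns_health("kubernetes.default.svc.cluster.local")` would exercise the cluster DNS path; outside a cluster the same call is expected to come back with `resolved: false` and an error string.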
Brings the full Kubernetes observability refactor:
* 5 monitoring probes (tls_certs, pod_resource, k8s_jobs,
netpol_coverage, disk_usage) replacing the Python collectors
* 9 coverage probes (endpoints, endpoint_slices, replicasets,
leases, csi_drivers/nodes, volume_attachments, csr,
runtime_classes, pod_restarts, pod_status_phase, container_oom_kills)
* 18 full-coverage probes — Sprints 1-4 (APF, admission, discovery,
container network, node runtime, top operator CRDs, DNS health)
* METRIC_RESOURCE_TYPES whitelist fix so probes emit on every cycle
even when no rule fails
* kxn-stack helm chart: drop Python CronJobs, add multi-backend
support (postgres / loki / mongo)
* Pod logs explorer dashboard + grafana datasource selection per backend
* RBAC extension: discovery, coordination, storage, certificates,
node, flowcontrol, admissionregistration API groups
Validated end-to-end on AKS: 25 resource_types emit continuous metrics
(672 container_network rows / 3 min, 912 pod_resource, 352
api_resources_summary, etc). All test clusters torn down.
Sync `develop` into `main` to trigger the release pipeline.
Scope
The full Kubernetes monitoring + compliance refactor:
Rust probes (32 new resource_types)
Continuous monitoring (5): replaces the Python collectors
`tls_certs`, `pod_resource`, `k8s_jobs`, `netpol_coverage`, `disk_usage`
Coverage gaps (9)
`endpoints`, `endpoint_slices`, `replicasets`, `leases`,
`csi_drivers`, `csi_nodes`, `volume_attachments`,
`certificate_signing_requests`, `runtime_classes`,
`pod_restarts`, `pod_status_phase`, `container_oom_kills`
APF + admission + discovery (6)
`flow_schemas`, `priority_level_configurations`,
`validating_admission_policies`, `validating_admission_policy_bindings`,
`server_version`, `api_resources_summary`
Fine kubelet metrics (2) — `container_network`, `node_runtime`
Top operator CRDs (8) — Istio, Argo, cert-manager, Prom Operator, Flux
Synthetic (1) — `dns_health`
Save layer
resource_types so they emit rows on every `kxn watch` cycle, regardless
of rule outcome (this was the cause of the empty graph panels).
kxn-stack helm chart
`grafana-deps-placeholder.yaml`, `kxn-logs.yaml` (Rust streamer).
storage, certificates, node, flowcontrol, admissionregistration).
Validation
End-to-end on AKS (kxn-final-rg, sub 4urcloud_sponsor, now torn down):
25 resource_types emit continuous metrics — 4400+ rows in 3 minutes,
672 `container_network`, 912 `pod_resource`, 352
`api_resources_summary`, etc.
Production OVH (`prd-rtk-cld-01 / kxn-metrics`) was migrated mid-PR:
3 deployments running the new image, 5 Python CronJobs deleted,
graphs populated continuously from the Rust probes.
Release flow
Merging this PR onto `main` triggers `release-please` which will open
the actual release PR with the version bump and CHANGELOG. Merging that
release PR tags the repo and publishes `kexa/kxn:0.40.0` on Docker Hub.