chore: sync develop into main (k8s monitoring stack — full Rust observability) by pszymkowiak · Pull Request #101 · kexa-io/kxn

pszymkowiak · 2026-05-06T11:38:46Z

Sync `develop` into `main` to trigger the release pipeline.

Scope

The full Kubernetes monitoring + compliance refactor:

Rust probes (32 new resource_types)

Continuous monitoring (5), replacing the Python collectors
`tls_certs`, `pod_resource`, `k8s_jobs`, `netpol_coverage`, `disk_usage`
Coverage gaps (9)
`endpoints`, `endpoint_slices`, `replicasets`, `leases`,
`csi_drivers`, `csi_nodes`, `volume_attachments`,
`certificate_signing_requests`, `runtime_classes`,
`pod_restarts`, `pod_status_phase`, `container_oom_kills`
APF + admission + discovery (6)
`flow_schemas`, `priority_level_configurations`,
`validating_admission_policies`, `validating_admission_policy_bindings`,
`server_version`, `api_resources_summary`
Fine kubelet metrics (2) — `container_network`, `node_runtime`
Top operator CRDs (8) — Istio, Argo, cert-manager, Prom Operator, Flux
Synthetic (1) — `dns_health`

Save layer

fix(save): `METRIC_RESOURCE_TYPES` whitelist now includes the 32 new
resource_types so they emit rows on every `kxn watch` cycle, regardless
of rule outcome (the missing entries were the cause of the empty graph panels).

kxn-stack helm chart

Drop the 6 Python collector CronJobs — full Rust now.
Multi-backend support: `backend.type = postgres | loki | mongo`.
New templates: `loki.yaml`, `mongo.yaml`, `grafana-datasource.yaml`,
`grafana-deps-placeholder.yaml`, `kxn-logs.yaml` (Rust streamer).
RBAC extension covers the new API groups (discovery, coordination,
storage, certificates, node, flowcontrol, admissionregistration).

Validation

End-to-end on AKS (kxn-final-rg, sub 4urcloud_sponsor, now torn down):
25 resource_types emit continuous metrics — 4400+ rows in 3 minutes,
672 `container_network`, 912 `pod_resource`, 352
`api_resources_summary`, etc.

Production OVH (`prd-rtk-cld-01 / kxn-metrics`) was migrated mid-PR:
3 deployments running the new image, 5 Python CronJobs deleted,
graphs populated continuously from the Rust probes.

Release flow

Merging this PR into `main` triggers `release-please`, which will open
the actual release PR with the version bump and CHANGELOG. Merging that
release PR tags the repo and publishes `kexa/kxn:0.40.0` on Docker Hub.

Adds an end-to-end observability stack deployable in a single helm install kxn-stack:

- kxn-stack umbrella chart (deploy/helm/kxn-stack/)
  - standalone PostgreSQL (image postgres:18-alpine, password fixed in values so secret + initdb stay in sync; no Bitnami chart sync bugs)
  - schema-init Job (post-install hook, 16 tables idempotent)
  - kxn-monitor sub-chart (compliance + Discord/Slack/Teams alerts, save→pg)
  - grafana sub-chart with 20 kxn dashboards via sidecar
  - 6 collectors (CronJobs): pod_resource, disk_usage, k8s_jobs, netpol_coverage, tls_certs, logs (sample Python embedded in ConfigMap)
  - extended ClusterRole for collectors (nodes/proxy, pods/log, networkpolicies)
  - webhook secret auto-created with placeholder URL if not provided
- Standalone helm charts for users who want only one component:
  - deploy/helm/kxn-monitor: compliance + alerts only
  - deploy/helm/kxn-logs: pod log forwarding to Loki
- 20 anonymized Grafana dashboards (deploy/examples/grafana/): cluster, k8s-pods, namespaces, compliance, errors, postgres, security, disks, certs, netpolicies, ingress, traefik (×3), tenants, top-consumers, backups, k8s-system, cloud-logs, pod-logs (with namespace/pod/level selectors), + README
- Production-ready TOML rules (deploy/examples/rules/): cluster-health.toml, postgres-app-rules.toml, README
- Claude Code skill (skills/kxn-scan/SKILL.md): drop into ~/.claude/skills/ for natural-language driving of kxn
- End-to-end cookbook (docs/cookbook-k8s-monitoring.md): deploy on any Kubernetes cluster in 10 minutes

Tested reproducibly on a fresh AKS cluster: helm install kxn-stack with no patches yields 16 tables + collectors writing data + dashboards populated within ~3 minutes.

- New collectors/logs-collector.sh that polls pod logs every minute, classifies severity, and inserts into the logs table with tags->> {namespace,pod,node,container} so the Pod / Namespace / k8s-pods dashboards can group by these fields.
- Wire schedules.logs in values.yaml and the collectors template.
- Pod logs explorer dashboard: replace TimescaleDB-only time_bucket() with date_trunc('minute', time) for stock PostgreSQL compatibility, and switch the Recent log lines panel to a table with namespace/pod columns so the user can spot which pod produced which line.
- Fix kxn-cloud-logs uid (had a literal space, broke API import).

…hema

The dashboard used the metric names from an older RTK build:

  active_connections -> connections_active
  database_size_mb   -> total_size_bytes (divided by 1024^2 in panel)

Also retire the database_size series from the trends timeseries since the unit no longer matches; replace it with transactions_committed, which is more useful as an 'is the DB doing anything' line.

The chart shipped 6 Python CronJobs as a stopgap to fill secondary tables (pod_resource, disk_usage, k8s_jobs, netpol_coverage, tls_certs, logs). Production runs only the Rust binary, so the chart should too.

Changes:

- Remove templates/collectors.yaml and files/collectors/*.sh.
- Rename collectors-rbac.yaml -> extra-rbac.yaml; rebind to the kxn-monitor ServiceAccount so the Rust watchers keep nodes/proxy, nodes/stats, secrets, pods/log, NetworkPolicies, Jobs/CronJobs, Deployments and friends.
- Add templates/kxn-logs.yaml: a Deployment that runs 'kxn logs kubernetes://in-cluster --stream' (mirrors the kxn-logs deployment running on prod), batching pod logs into the logs table.
- values.yaml: drop the 'collectors' block, introduce 'logs' (image, stream batch settings, resources).
- NOTES.txt updated to reflect the new component list.

Tables previously filled by Python CronJobs are now expected to come from the Rust binary (kxn watch kubernetes://). Probes still missing from the Rust K8s provider (pod_resource, disk_usage, k8s_jobs, netpol_coverage, tls_certs) are tracked separately and will be added incrementally; the Python patch is gone.

New gather_tls_certs() in the native Kubernetes provider. Lists Secrets of type kubernetes.io/tls, base64-decodes tls.crt, parses the leaf certificate via x509-parser, and emits one JSON value per cert with:

  name, namespace, common_name, issuer, not_before, not_after, expires_in_days, expired

The cert PEM itself is never returned; only the fields a monitoring dashboard needs to track expiry are emitted.

Replaces what the old Python certs-collector CronJob was doing (scraping secrets, calling openssl, INSERTing into a side table). This moves it back into the Rust binary, registered as a regular kxn resource_type, so it goes through the standard scan/save pipeline.
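The two derived fields, expires_in_days and expired, follow directly from not_after and the current time. The sketch below shows one way they could be computed. It assumes timestamps are available as Unix seconds; `expiry_fields` is a hypothetical helper for illustration, not the function used in kxn, and the x509-parser step is left out.

```rust
/// Hypothetical helper (not from the kxn source): derive the expiry
/// fields from Unix timestamps expressed in seconds.
/// Returns (expires_in_days, expired).
fn expiry_fields(not_after_unix: i64, now_unix: i64) -> (i64, bool) {
    const SECONDS_PER_DAY: i64 = 86_400;
    let remaining = not_after_unix - now_unix;
    // div_euclid rounds toward negative infinity, so a certificate that
    // expired one hour ago reports -1 day instead of 0.
    let expires_in_days = remaining.div_euclid(SECONDS_PER_DAY);
    // Assumption: a certificate whose not_after equals "now" counts as expired.
    (expires_in_days, remaining <= 0)
}
```

Rounding down is a choice made for this sketch: it keeps a dashboard from showing "0 days left" for a certificate that is already expired.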

New gather_pod_resource() returns one JSON row per (namespace, pod, container) with cpu_millicores + memory_mib + timestamp. Dashboards that group by namespace/pod/container no longer have to flatten the nested 'containers' array from pod_metrics. Same upstream source (metrics.k8s.io/v1beta1/pods), different shape: flat instead of nested. Replaces the Python pod-resource-collector CronJob.
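The nested-to-flat reshaping described above can be sketched as follows. The struct and function names are assumptions made for illustration, the timestamp field is omitted for brevity, and the real probe reads from metrics.k8s.io rather than from in-memory structs.

```rust
/// Simplified, hypothetical input shape: one entry per pod with a nested
/// list of containers, similar in spirit to pod_metrics.
struct ContainerUsage {
    name: String,
    cpu_millicores: u64,
    memory_mib: u64,
}

struct PodMetrics {
    namespace: String,
    pod: String,
    containers: Vec<ContainerUsage>,
}

/// Flat output shape: one row per (namespace, pod, container).
#[derive(Debug, PartialEq)]
struct PodResourceRow {
    namespace: String,
    pod: String,
    container: String,
    cpu_millicores: u64,
    memory_mib: u64,
}

/// Expand each pod into one row per container.
fn flatten_pod_metrics(pods: &[PodMetrics]) -> Vec<PodResourceRow> {
    pods.iter()
        .flat_map(|p| {
            p.containers.iter().map(move |c| PodResourceRow {
                namespace: p.namespace.clone(),
                pod: p.pod.clone(),
                container: c.name.clone(),
                cpu_millicores: c.cpu_millicores,
                memory_mib: c.memory_mib,
            })
        })
        .collect()
}
```

With rows in this shape, a dashboard query can group by namespace, pod, or container without unpacking an array first.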

New gather_k8s_jobs() returns one JSON row per Job with derived timing fields:

- state: active|succeeded|failed|unknown
- duration_seconds: completion_time - start_time
- age_seconds: now - start_time
- last_success_age_seconds / last_failure_age_seconds (for stale-job alerts)
- owner_kind/owner_name: CronJob this Job belongs to (when applicable)

Replaces what the Python jobs-collector CronJob computed externally. Same upstream source as the existing 'jobs' resource_type (batch/v1/jobs), different shape: purpose-built for monitoring dashboards instead of compliance scans.
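A minimal sketch of the derived fields, assuming times are already converted to Unix seconds. The `JobStatus` struct, the function names, and the precedence used to pick a state are all assumptions for illustration; the commit does not spell out how kxn resolves a Job that has both failed and succeeded pods.

```rust
/// Simplified, hypothetical view of a Job's status; times are Unix seconds.
struct JobStatus {
    active: u32,
    succeeded: u32,
    failed: u32,
    start_time: Option<i64>,
    completion_time: Option<i64>,
}

#[derive(Debug, PartialEq)]
enum JobState {
    Active,
    Succeeded,
    Failed,
    Unknown,
}

/// Assumed precedence: running pods win, then success, then failure.
fn job_state(s: &JobStatus) -> JobState {
    if s.active > 0 {
        JobState::Active
    } else if s.succeeded > 0 {
        JobState::Succeeded
    } else if s.failed > 0 {
        JobState::Failed
    } else {
        JobState::Unknown
    }
}

/// duration_seconds = completion_time - start_time, when both are known.
fn duration_seconds(s: &JobStatus) -> Option<i64> {
    Some(s.completion_time? - s.start_time?)
}

/// age_seconds = now - start_time, when the Job has started.
fn age_seconds(s: &JobStatus, now: i64) -> Option<i64> {
    s.start_time.map(|start| now - start)
}
```

Returning `Option` keeps a Job that has not started or not finished from reporting a misleading zero.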

New gather_netpol_coverage() computes per-namespace NetworkPolicy coverage: for each namespace, list NetworkPolicies and Pods, then for each pod check whether at least one NP podSelector matches its labels. Emits one row per namespace with:

  pods_total, pods_covered, coverage_pct, pods_uncovered

A pod with no targeting NP shows up in pods_uncovered, which lets dashboards highlight workloads that fall through the network policy net. Uncovered pods are typically a finding worth alerting on in zero-trust setups.

MVP only honours podSelector.matchLabels (and the empty-selector case which matches every pod). matchExpressions support is left for a follow-up iteration.

Replaces the Python netpol-collector CronJob.
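The matchLabels-only coverage check can be sketched with plain maps. All names below are hypothetical. Two details are assumptions of this sketch rather than facts from the commit: pods_uncovered is modelled as a list of pod names, and an empty namespace reports 0% coverage.

```rust
use std::collections::BTreeMap;

type Labels = BTreeMap<String, String>;

/// A matchLabels selector matches when every key/value pair it lists is
/// present on the pod. An empty selector has nothing to check, so it
/// matches every pod.
fn selector_matches(match_labels: &Labels, pod_labels: &Labels) -> bool {
    match_labels.iter().all(|(k, v)| pod_labels.get(k) == Some(v))
}

struct Coverage {
    pods_total: usize,
    pods_covered: usize,
    coverage_pct: f64,
    pods_uncovered: Vec<String>,
}

/// `policies` holds the podSelector.matchLabels of each NetworkPolicy in
/// one namespace; `pods` holds (pod name, pod labels) for the same namespace.
fn namespace_coverage(policies: &[Labels], pods: &[(String, Labels)]) -> Coverage {
    let mut pods_uncovered = Vec::new();
    for (name, labels) in pods {
        let covered = policies.iter().any(|sel| selector_matches(sel, labels));
        if !covered {
            pods_uncovered.push(name.clone());
        }
    }
    let pods_total = pods.len();
    let pods_covered = pods_total - pods_uncovered.len();
    // Assumption: an empty namespace reports 0% rather than 100%.
    let coverage_pct = if pods_total == 0 {
        0.0
    } else {
        pods_covered as f64 * 100.0 / pods_total as f64
    };
    Coverage { pods_total, pods_covered, coverage_pct, pods_uncovered }
}
```

Because `all` over an empty iterator is true, the empty-selector case needs no special branch.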

New gather_disk_usage() walks every node and calls the kubelet stats summary endpoint (/api/v1/nodes/{name}/proxy/stats/summary) to expose:

- kind=node, fs_kind=root : node root filesystem
- kind=node, fs_kind=image : container image filesystem
- kind=pvc : per-pod PVC volume stats

Each row carries capacity_bytes, used_bytes, available_bytes, used_pct.

Replaces the Python disk-usage-collector CronJob. Requires the kxn ServiceAccount to hold nodes/proxy + nodes/stats verbs (already granted in deploy/helm/kxn-stack/templates/extra-rbac.yaml).
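As an illustration of the per-row fields, the sketch below builds one row from the three byte counters and derives used_pct from them. It assumes used_pct is used_bytes over capacity_bytes; the commit does not state the formula. `FsUsage` and `fs_usage` are hypothetical names.

```rust
/// Hypothetical row shape carrying the four fields named above.
#[derive(Debug, PartialEq)]
struct FsUsage {
    capacity_bytes: u64,
    used_bytes: u64,
    available_bytes: u64,
    used_pct: f64,
}

/// Assumption: used_pct = used_bytes / capacity_bytes * 100, and a
/// filesystem that reports zero capacity yields 0% instead of NaN.
fn fs_usage(capacity_bytes: u64, used_bytes: u64, available_bytes: u64) -> FsUsage {
    let used_pct = if capacity_bytes == 0 {
        0.0
    } else {
        used_bytes as f64 * 100.0 / capacity_bytes as f64
    };
    FsUsage { capacity_bytes, used_bytes, available_bytes, used_pct }
}
```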

flatten_gathered() skipped any resource_type not listed in METRIC_RESOURCE_TYPES, which meant the new monitoring probes (tls_certs, pod_resource, k8s_jobs, netpol_coverage, disk_usage) produced 0 rows in the metrics table on every cycle — dashboards saw no continuous data unless a rule actually fired a violation. Add the 5 probes to the whitelist so each gather() output now generates one MetricRecord per JSON field per item, on every watch cycle, regardless of rule outcome.
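The effect of the whitelist can be shown with a simplified model. This is not the kxn implementation: the record layout, the numeric-only field values, and the names `METRIC_TYPES` and `flatten_sketch` are assumptions made to keep the example short.

```rust
/// Illustrative whitelist; the real list lives in the kxn save layer.
const METRIC_TYPES: &[&str] = &[
    "tls_certs",
    "pod_resource",
    "k8s_jobs",
    "netpol_coverage",
    "disk_usage",
];

/// Hypothetical record: one per field per gathered item.
#[derive(Debug, PartialEq)]
struct MetricRecord {
    resource_type: String,
    item: usize,
    field: String,
    value: f64,
}

/// Emit one record per field per item, but only for whitelisted
/// resource types. Anything else yields no rows at all, which is the
/// behaviour that left the dashboards empty before the fix.
fn flatten_sketch(resource_type: &str, items: &[Vec<(&str, f64)>]) -> Vec<MetricRecord> {
    if !METRIC_TYPES.iter().any(|t| *t == resource_type) {
        return Vec::new();
    }
    let mut records = Vec::new();
    for (item, fields) in items.iter().enumerate() {
        for (field, value) in fields {
            records.push(MetricRecord {
                resource_type: resource_type.to_string(),
                item,
                field: field.to_string(),
                value: *value,
            });
        }
    }
    records
}
```

In this model, a probe missing from the list returns an empty vector regardless of how much data it gathered, so adding its name to the list is the whole fix.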

Add backend.type switch in values.yaml so the same chart can deploy different storage tiers without forking templates:

- postgres : default. Postgres StatefulSet + schema-init Job + the existing 19 SQL Grafana dashboards. No behavior change for the existing 'helm install kxn-stack' path.
- loki : single-binary Loki StatefulSet with filesystem storage. kxn save pushes via /loki/api/v1/push, grafana datasource points at it, sql dashboards are inert (use LogQL ones instead). New templates/loki.yaml + values.yaml [loki].
- mongo : single-replica MongoDB StatefulSet with a kxn user bootstrapped via post-install Job. New templates/mongo.yaml + values.yaml [mongo].

Other moves:

- templates/postgres.yaml + schema-init-job.yaml gated on eq backend.type 'postgres'.
- templates/kxn-logs.yaml renders the right [[save]] block + env vars based on backend.type.
- templates/grafana-datasource.yaml: per-backend datasource ConfigMap picked up by the grafana sidecar (label grafana_datasource=1) so the sub-chart values.yaml stays static.
- templates/grafana-deps-placeholder.yaml: when backend != postgres, create a placeholder kxn-stack-postgresql Secret so the grafana sub-chart's POSTGRES_PASSWORD env reference still resolves and the pod boots cleanly.

Validated end-to-end on AKS: helm install kxn-stack -f values-loki-override.yaml renders 4 pods (grafana, kxn-monitor, kxn-logs, loki), kxn metric stream pushed continuously into Loki (100 lines / 3 min).

Adds the resource_types that were missing for a complete monitoring + compliance picture of a Kubernetes cluster:

- endpoints / endpoint_slices: service connectivity layer; flag services that have no backing pods
- replicasets: the deploy -> rs -> pod chain so stale rollouts can be spotted
- leases: controller leader-election state, with renew_age_seconds for split-brain alerts
- csi_drivers / csi_nodes / volume_attachments: storage-layer health, including stuck attach/detach errors
- certificate_signing_requests: pending CSRs (kubelet bootstrap stuck)
- runtime_classes: non-default runtimes (gVisor, kata)
- pod_restarts: per-container restart counts + last_termination_reason + oom_killed (continuous monitoring metric)
- pod_status_phase: counts by namespace and phase (Running/Pending/Failed/Succeeded)
- container_oom_kills: OOMKill events rolled up per namespace

All 9 are added to METRIC_RESOURCE_TYPES so flatten_gathered() emits rows in the metrics table on every watch cycle, regardless of rule outcome (same fix path as the previous 5 probes).

extra-rbac.yaml also picks up the new API groups: discovery.k8s.io, coordination.k8s.io, storage.k8s.io (csidrivers/csinodes/volumeattachments), certificates.k8s.io, node.k8s.io.
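The pod_status_phase rollup (counts by namespace and phase) amounts to a group-and-count. A minimal sketch follows, taking (namespace, phase) pairs as plain strings; the function name and the input shape are assumptions for illustration.

```rust
use std::collections::BTreeMap;

/// Count pods per (namespace, phase). Input pairs are (namespace, phase),
/// for example ("default", "Running").
fn count_by_phase(pods: &[(&str, &str)]) -> BTreeMap<(String, String), usize> {
    let mut counts = BTreeMap::new();
    for (namespace, phase) in pods {
        // entry() inserts a zero the first time a pair is seen.
        *counts
            .entry((namespace.to_string(), phase.to_string()))
            .or_insert(0) += 1;
    }
    counts
}
```

A BTreeMap keeps the output ordered by namespace and then phase, which makes the emitted rows stable between cycles.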

Sprint 1: APF + admission + discovery (6 probes)
  flow_schemas, priority_level_configurations, validating_admission_policies, validating_admission_policy_bindings, server_version, api_resources_summary

Sprint 2: fine container/node metrics (2 probes)
  container_network (rx/tx/errors per pod via kubelet stats summary)
  node_runtime (pods used vs allocatable, image fs usage)

Sprint 3: top 5 operator CRDs (8 probes, gracefully no-op if CRD absent)
  istio_virtual_services, istio_gateways
  argo_applications, argo_workflows
  cert_manager_certificates
  prometheus_rules, service_monitors
  flux_kustomizations

Sprint 4: synthetic DNS health (1 probe)
  dns_health: resolve kubernetes.default.svc.cluster.local from the kxn pod, emit resolved/latency_ms/error so dashboards can plot DNS availability of the in-cluster stub

All 18 added to METRIC_RESOURCE_TYPES so they emit continuous rows in the metrics table on every cycle. RBAC extended with flowcontrol + admissionregistration API groups (CRD probes inherit core RBAC and return [] silently when the CRD is not installed).
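A synthetic DNS check in the spirit of dns_health can be written with the standard library alone. The sketch below is an assumption about the general approach, not the kxn code: `probe_dns` and `DnsHealth` are hypothetical names, and the resolver used is whatever the operating system provides through `std::net::ToSocketAddrs`.

```rust
use std::net::ToSocketAddrs;
use std::time::Instant;

/// Hypothetical result shape mirroring resolved/latency_ms/error.
struct DnsHealth {
    resolved: bool,
    latency_ms: u128,
    error: Option<String>,
}

/// Resolve `host` through the system resolver and time the lookup.
fn probe_dns(host: &str) -> DnsHealth {
    let start = Instant::now();
    // ToSocketAddrs needs a port; 443 is arbitrary here and no
    // connection is opened, only name resolution is performed.
    let result = (host, 443).to_socket_addrs();
    let latency_ms = start.elapsed().as_millis();
    match result {
        Ok(mut addrs) => DnsHealth {
            // A lookup that succeeds but returns no address is not "resolved".
            resolved: addrs.next().is_some(),
            latency_ms,
            error: None,
        },
        Err(e) => DnsHealth {
            resolved: false,
            latency_ms,
            error: Some(e.to_string()),
        },
    }
}
```

Inside a pod, calling `probe_dns("kubernetes.default.svc.cluster.local")` would exercise the cluster DNS path; elsewhere that name is expected to fail and populate `error`.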

Brings the full Kubernetes observability refactor:

* 5 monitoring probes (tls_certs, pod_resource, k8s_jobs, netpol_coverage, disk_usage) replacing the Python collectors
* 9 coverage probes (endpoints, endpoint_slices, replicasets, leases, csi_drivers/nodes, volume_attachments, csr, runtime_classes, pod_restarts, pod_status_phase, container_oom_kills)
* 18 full-coverage probes, Sprints 1-4 (APF, admission, discovery, container network, node runtime, top operator CRDs, DNS health)
* METRIC_RESOURCE_TYPES whitelist fix so probes emit on every cycle even when no rule fails
* kxn-stack helm chart: drop Python CronJobs, add multi-backend support (postgres / loki / mongo)
* Pod logs explorer dashboard + grafana datasource selection per backend
* RBAC extension: discovery, coordination, storage, certificates, node, flowcontrol, admissionregistration API groups

Validated end-to-end on AKS: 25 resource_types emit continuous metrics (672 container_network rows / 3 min, 912 pod_resource, 352 api_resources_summary, etc). All test clusters torn down.

pszymkowiak added 14 commits May 5, 2026 16:36

pszymkowiak merged commit 2a94db0 into main May 6, 2026
8 of 10 checks passed

pszymkowiak merged 14 commits into main from chore/sync-develop-into-main
