chore: sync develop into main (k8s monitoring stack — full Rust observability)#101

Merged
pszymkowiak merged 14 commits into main from chore/sync-develop-into-main
May 6, 2026

@pszymkowiak
Contributor

Sync `develop` into `main` to trigger the release pipeline.

Scope

The full Kubernetes monitoring + compliance refactor:

Rust probes (32 new resource_types)

  • Continuous monitoring (5), replacing the Python collectors
    `tls_certs`, `pod_resource`, `k8s_jobs`, `netpol_coverage`, `disk_usage`

  • Coverage gaps (9)
    `endpoints`, `endpoint_slices`, `replicasets`, `leases`,
    `csi_drivers`, `csi_nodes`, `volume_attachments`,
    `certificate_signing_requests`, `runtime_classes`,
    `pod_restarts`, `pod_status_phase`, `container_oom_kills`

  • APF + admission + discovery (6)
    `flow_schemas`, `priority_level_configurations`,
    `validating_admission_policies`, `validating_admission_policy_bindings`,
    `server_version`, `api_resources_summary`

  • Fine kubelet metrics (2) — `container_network`, `node_runtime`

  • Top operator CRDs (8) — Istio, Argo, cert-manager, Prom Operator, Flux

  • Synthetic (1) — `dns_health`

Save layer

  • fix(save): `METRIC_RESOURCE_TYPES` whitelist now includes the 32 new
    resource_types so they emit rows on every `kxn watch` cycle, regardless
    of rule outcome (their absence was the cause of the empty graph panels).

kxn-stack helm chart

  • Drop the 6 Python collector CronJobs — full Rust now.
  • Multi-backend support: `backend.type = postgres | loki | mongo`.
  • New templates: `loki.yaml`, `mongo.yaml`, `grafana-datasource.yaml`,
    `grafana-deps-placeholder.yaml`, `kxn-logs.yaml` (Rust streamer).
  • RBAC extension covers the new API groups (discovery, coordination,
    storage, certificates, node, flowcontrol, admissionregistration).

Validation

End-to-end on AKS (kxn-final-rg, sub 4urcloud_sponsor, now torn down):
25 resource_types emit continuous metrics — 4400+ rows in 3 minutes,
672 `container_network`, 912 `pod_resource`, 352
`api_resources_summary`, etc.

Production OVH (`prd-rtk-cld-01 / kxn-metrics`) was migrated mid-PR:
3 deployments running the new image, 5 Python CronJobs deleted,
graphs populated continuously from the Rust probes.

Release flow

Merging this PR onto `main` triggers `release-please` which will open
the actual release PR with the version bump and CHANGELOG. Merging that
release PR tags the repo and publishes `kexa/kxn:0.40.0` on Docker Hub.

pszymkowiak added 14 commits May 5, 2026 16:36
Adds an end-to-end observability stack deployable in a single
helm install kxn-stack:

- kxn-stack umbrella chart (deploy/helm/kxn-stack/)
  - standalone PostgreSQL (image postgres:18-alpine, password fixed in values
    so secret + initdb stay in sync — no Bitnami chart sync bugs)
  - schema-init Job (post-install hook, creates 16 tables, idempotent)
  - kxn-monitor sub-chart (compliance + Discord/Slack/Teams alerts, save→pg)
  - grafana sub-chart with 20 kxn dashboards via sidecar
  - 6 collectors (CronJobs): pod_resource, disk_usage, k8s_jobs,
    netpol_coverage, tls_certs, logs (sample Python embedded in ConfigMap)
  - extended ClusterRole for collectors (nodes/proxy, pods/log, networkpolicies)
  - webhook secret auto-created with placeholder URL if not provided

- Standalone helm charts for users who want only one component:
  - deploy/helm/kxn-monitor: compliance + alerts only
  - deploy/helm/kxn-logs: pod log forwarding to Loki

- 20 anonymized Grafana dashboards (deploy/examples/grafana/):
  cluster, k8s-pods, namespaces, compliance, errors, postgres, security,
  disks, certs, netpolicies, ingress, traefik (×3), tenants, top-consumers,
  backups, k8s-system, cloud-logs, pod-logs (with namespace/pod/level
  selectors), + README

- Production-ready TOML rules (deploy/examples/rules/):
  cluster-health.toml, postgres-app-rules.toml, README

- Claude Code skill (skills/kxn-scan/SKILL.md) — drop into ~/.claude/skills/
  for natural-language driving of kxn

- End-to-end cookbook (docs/cookbook-k8s-monitoring.md) — deploy on any
  Kubernetes cluster in 10 minutes

Tested reproducibly on a fresh AKS cluster: helm install kxn-stack with
no patches yields 16 tables + collectors writing data + dashboards
populated within ~3 minutes.
- New collectors/logs-collector.sh that polls pod logs every minute,
  classifies severity, and inserts into the logs table with tags->>
  {namespace,pod,node,container} so the Pod / Namespace / k8s-pods
  dashboards can group by these fields.
- Wire schedules.logs in values.yaml and the collectors template.
- Pod logs explorer dashboard: replace TimescaleDB-only time_bucket()
  with date_trunc('minute', time) for stock PostgreSQL compatibility,
  and switch the Recent log lines panel to a table with namespace/pod
  columns so the user can spot which pod produced which line.
- Fix kxn-cloud-logs uid (had a literal space, broke API import).
…hema

The dashboard used the metric names from an older RTK build:
  active_connections   -> connections_active
  database_size_mb     -> total_size_bytes (divided by 1024^2 in panel)

Also retire the database_size series from the trends timeseries since
the unit no longer matches; replace it with transactions_committed,
which is more useful as a 'is the DB doing anything' line.
The chart shipped 6 Python CronJobs as a stopgap to fill secondary
tables (pod_resource, disk_usage, k8s_jobs, netpol_coverage, tls_certs,
logs). Production runs only the Rust binary, so the chart should too.

Changes:
  - Remove templates/collectors.yaml and files/collectors/*.sh.
  - Rename collectors-rbac.yaml -> extra-rbac.yaml; rebind to the
    kxn-monitor ServiceAccount so the Rust watchers keep nodes/proxy,
    nodes/stats, secrets, pods/log, NetworkPolicies, Jobs/CronJobs,
    Deployments and friends.
  - Add templates/kxn-logs.yaml: a Deployment that runs
    'kxn logs kubernetes://in-cluster --stream' (mirrors the kxn-logs
    deployment running on prod), batching pod logs into the logs table.
  - values.yaml: drop the 'collectors' block, introduce 'logs' (image,
    stream batch settings, resources).
  - NOTES.txt updated to reflect the new component list.

Tables previously filled by Python CronJobs are now expected to come
from the Rust binary (kxn watch kubernetes://). Probes still missing
from the Rust K8s provider (pod_resource, disk_usage, k8s_jobs,
netpol_coverage, tls_certs) are tracked separately and will be added
incrementally; the Python stopgap is gone.
New gather_tls_certs() in the native Kubernetes provider. Lists Secrets
of type kubernetes.io/tls, base64-decodes tls.crt, parses the leaf
certificate via x509-parser, and emits one JSON value per cert with:

  name, namespace, common_name, issuer, not_before, not_after,
  expires_in_days, expired

The cert PEM itself is never returned — only the fields a monitoring
dashboard needs to track expiry.

Replaces what the old Python certs-collector CronJob was doing
(scraping secrets, calling openssl, INSERTing into a side table). This
moves it back into the Rust binary, registered as a regular kxn
resource_type, so it goes through the standard scan/save pipeline.
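
The two derived fields, expires_in_days and expired, can be sketched as plain arithmetic on timestamps. This is a minimal illustration, not the code in gather_tls_certs(): the struct name, the use of Unix seconds, and the rounding choice are all assumptions.

```rust
/// Hypothetical holder for the two derived fields named in the commit message.
#[derive(Debug, PartialEq)]
pub struct CertExpiry {
    pub expires_in_days: i64,
    pub expired: bool,
}

const SECONDS_PER_DAY: i64 = 86_400;

/// Derive the expiry fields from Unix timestamps in seconds.
/// Assumption: days are rounded toward negative infinity, so a certificate
/// that expired one second ago reports -1 rather than 0.
pub fn cert_expiry(not_after: i64, now: i64) -> CertExpiry {
    let remaining = not_after - now;
    CertExpiry {
        expires_in_days: remaining.div_euclid(SECONDS_PER_DAY),
        expired: remaining <= 0,
    }
}
```

With this rounding, a dashboard threshold such as "expires_in_days < 30" also catches certificates that are already past not_after.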
New gather_pod_resource() returns one JSON row per (namespace, pod,
container) with cpu_millicores + memory_mib + timestamp. Dashboards
that group by namespace/pod/container no longer have to flatten the
nested 'containers' array from pod_metrics.

Same upstream source (metrics.k8s.io/v1beta1/pods), different shape:
flat instead of nested. Replaces the Python pod-resource-collector
CronJob.
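
The nested-to-flat reshaping described above might look like the sketch below. The struct names and field types are assumptions chosen for illustration; only the output columns (namespace, pod, container, cpu_millicores, memory_mib, timestamp) come from the commit message.

```rust
/// Hypothetical nested input, loosely modelled on a pod metrics item.
pub struct ContainerUsage {
    pub name: String,
    pub cpu_millicores: u64,
    pub memory_mib: u64,
}

pub struct PodMetrics {
    pub namespace: String,
    pub pod: String,
    pub timestamp: String,
    pub containers: Vec<ContainerUsage>,
}

/// One flat row per (namespace, pod, container).
#[derive(Debug, PartialEq)]
pub struct PodResourceRow {
    pub namespace: String,
    pub pod: String,
    pub container: String,
    pub cpu_millicores: u64,
    pub memory_mib: u64,
    pub timestamp: String,
}

/// Flatten the nested `containers` array so consumers can group by
/// container without unpacking anything themselves.
pub fn flatten_pod_metrics(pods: &[PodMetrics]) -> Vec<PodResourceRow> {
    pods.iter()
        .flat_map(|p| {
            p.containers.iter().map(move |c| PodResourceRow {
                namespace: p.namespace.clone(),
                pod: p.pod.clone(),
                container: c.name.clone(),
                cpu_millicores: c.cpu_millicores,
                memory_mib: c.memory_mib,
                timestamp: p.timestamp.clone(),
            })
        })
        .collect()
}
```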
New gather_k8s_jobs() returns one JSON row per Job with derived timing
fields:
  - state: active|succeeded|failed|unknown
  - duration_seconds: completion_time - start_time
  - age_seconds: now - start_time
  - last_success_age_seconds / last_failure_age_seconds (for stale-job
    alerts)
  - owner_kind/owner_name: CronJob this Job belongs to (when applicable)

Replaces what the Python jobs-collector CronJob computed externally.
Same upstream source as the existing 'jobs' resource_type
(batch/v1/jobs), different shape — purpose-built for monitoring
dashboards instead of compliance scans.
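
A rough sketch of how the derived fields could be computed from a Job status. The struct, the Unix-second timestamps, and in particular the precedence order used to pick a single state (active first, then succeeded, then failed) are assumptions; the commit message only names the fields.

```rust
#[derive(Debug, PartialEq)]
pub enum JobState {
    Active,
    Succeeded,
    Failed,
    Unknown,
}

/// Hypothetical subset of a batch/v1 Job status, with times as Unix seconds.
pub struct JobStatus {
    pub active: u32,
    pub succeeded: u32,
    pub failed: u32,
    pub start_time: Option<i64>,
    pub completion_time: Option<i64>,
}

/// Assumed precedence: active, then succeeded, then failed.
pub fn job_state(s: &JobStatus) -> JobState {
    if s.active > 0 {
        JobState::Active
    } else if s.succeeded > 0 {
        JobState::Succeeded
    } else if s.failed > 0 {
        JobState::Failed
    } else {
        JobState::Unknown
    }
}

/// duration_seconds = completion_time - start_time, when both are known.
pub fn duration_seconds(s: &JobStatus) -> Option<i64> {
    match (s.start_time, s.completion_time) {
        (Some(start), Some(end)) => Some(end - start),
        _ => None,
    }
}

/// age_seconds = now - start_time, when the Job has started.
pub fn age_seconds(s: &JobStatus, now: i64) -> Option<i64> {
    s.start_time.map(|start| now - start)
}
```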
New gather_netpol_coverage() computes per-namespace NetworkPolicy
coverage: for each namespace, list NetworkPolicies and Pods, then for
each pod check whether at least one NP podSelector matches its labels.
Emits one row per namespace with:

  pods_total, pods_covered, coverage_pct, pods_uncovered

A pod with no targeting NP shows up in pods_uncovered, which lets
dashboards highlight workloads that fall through the network policy
net. Uncovered pods are typically a finding worth alerting on in
zero-trust setups.

MVP only honours podSelector.matchLabels (and the empty-selector case
which matches every pod). matchExpressions support is left for a
follow-up iteration. Replaces the Python netpol-collector CronJob.
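
The matchLabels-only matching and the per-namespace rollup can be sketched as follows. This is an illustration under assumptions: the type and function names are invented, pods_uncovered is modelled as a count, and an empty namespace is reported as 100% covered.

```rust
use std::collections::BTreeMap;

pub type Labels = BTreeMap<String, String>;

/// Small helper for building label maps in examples.
pub fn labels(pairs: &[(&str, &str)]) -> Labels {
    pairs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
}

/// A podSelector with only matchLabels matches when every selector label is
/// present on the pod with the same value. An empty selector matches every
/// pod, because `all` over an empty iterator is true.
pub fn selector_matches(match_labels: &Labels, pod_labels: &Labels) -> bool {
    match_labels.iter().all(|(k, v)| pod_labels.get(k) == Some(v))
}

#[derive(Debug, PartialEq)]
pub struct Coverage {
    pub pods_total: usize,
    pub pods_covered: usize,
    pub pods_uncovered: usize,
    pub coverage_pct: f64,
}

/// Compute coverage for one namespace from its policies' matchLabels and
/// its pods' labels.
pub fn coverage(policy_selectors: &[Labels], pods: &[Labels]) -> Coverage {
    let pods_total = pods.len();
    let pods_covered = pods
        .iter()
        .filter(|pod| policy_selectors.iter().any(|sel| selector_matches(sel, pod)))
        .count();
    // Assumption: a namespace without pods has nothing left uncovered.
    let coverage_pct = if pods_total == 0 {
        100.0
    } else {
        100.0 * pods_covered as f64 / pods_total as f64
    };
    Coverage {
        pods_total,
        pods_covered,
        pods_uncovered: pods_total - pods_covered,
        coverage_pct,
    }
}
```

A namespace with no NetworkPolicy at all yields pods_covered = 0, since `any` over an empty list of selectors is false.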
New gather_disk_usage() walks every node and calls the kubelet stats
summary endpoint
(/api/v1/nodes/{name}/proxy/stats/summary) to expose:

  - kind=node, fs_kind=root  : node root filesystem
  - kind=node, fs_kind=image : container image filesystem
  - kind=pvc                 : per-pod PVC volume stats

Each row carries capacity_bytes, used_bytes, available_bytes,
used_pct. Replaces the Python disk-usage-collector CronJob.

Requires the kxn ServiceAccount to be granted access to the nodes/proxy
and nodes/stats subresources (already granted in
deploy/helm/kxn-stack/templates/extra-rbac.yaml).
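
As an illustration of the row shape, used_pct can be derived from the byte counters as below. The struct layout, the Option for fs_kind, and the choice to report 0.0 for a zero capacity are assumptions, not the behaviour of gather_disk_usage().

```rust
/// Hypothetical row; the field names follow the commit message.
#[derive(Debug, PartialEq)]
pub struct DiskUsageRow {
    pub kind: String,
    pub fs_kind: Option<String>,
    pub capacity_bytes: u64,
    pub used_bytes: u64,
    pub available_bytes: u64,
    pub used_pct: f64,
}

/// Percentage of capacity in use. Assumption: a filesystem that reports
/// zero capacity yields 0.0 instead of dividing by zero.
pub fn used_pct(used_bytes: u64, capacity_bytes: u64) -> f64 {
    if capacity_bytes == 0 {
        return 0.0;
    }
    100.0 * used_bytes as f64 / capacity_bytes as f64
}

/// Build a kind=node row for the root or image filesystem.
pub fn node_fs_row(fs_kind: &str, capacity_bytes: u64, used_bytes: u64, available_bytes: u64) -> DiskUsageRow {
    DiskUsageRow {
        kind: "node".to_string(),
        fs_kind: Some(fs_kind.to_string()),
        capacity_bytes,
        used_bytes,
        available_bytes,
        used_pct: used_pct(used_bytes, capacity_bytes),
    }
}
```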
flatten_gathered() skipped any resource_type not listed in
METRIC_RESOURCE_TYPES, which meant the new monitoring probes
(tls_certs, pod_resource, k8s_jobs, netpol_coverage, disk_usage)
produced 0 rows in the metrics table on every cycle — dashboards saw
no continuous data unless a rule actually fired a violation.

Add the 5 probes to the whitelist so each gather() output now
generates one MetricRecord per JSON field per item, on every watch
cycle, regardless of rule outcome.
Add backend.type switch in values.yaml so the same chart can deploy
different storage tiers without forking templates:

  - postgres : default. Postgres StatefulSet + schema-init Job + the
               existing 19 SQL Grafana dashboards. No behavior change for
               the existing 'helm install kxn-stack' path.
  - loki     : single-binary Loki StatefulSet with filesystem storage.
               kxn save pushes via /loki/api/v1/push, grafana datasource
               points at it, sql dashboards are inert (use LogQL ones
               instead). New templates/loki.yaml + values.yaml [loki].
  - mongo    : single-replica MongoDB StatefulSet with a kxn user
               bootstrapped via post-install Job. New templates/mongo.yaml
               + values.yaml [mongo].

Other moves:
  - templates/postgres.yaml + schema-init-job.yaml gated on
    eq backend.type 'postgres'.
  - templates/kxn-logs.yaml renders the right [[save]] block + env vars
    based on backend.type.
  - templates/grafana-datasource.yaml: per-backend datasource ConfigMap
    picked up by the grafana sidecar (label grafana_datasource=1) so the
    sub-chart values.yaml stays static.
  - templates/grafana-deps-placeholder.yaml: when backend != postgres,
    create a placeholder kxn-stack-postgresql Secret so the grafana
    sub-chart's POSTGRES_PASSWORD env reference still resolves and the
    pod boots cleanly.
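
The gating itself lives in the Helm templates, but its logic can be modelled in a few lines for reference. The sketch below is purely illustrative: the enum and function are invented, and the template lists only cover the backend-specific files named in this commit message.

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Backend {
    Postgres,
    Loki,
    Mongo,
}

impl FromStr for Backend {
    type Err = String;

    /// Accept exactly the three documented values of backend.type.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "postgres" => Ok(Backend::Postgres),
            "loki" => Ok(Backend::Loki),
            "mongo" => Ok(Backend::Mongo),
            other => Err(format!("unsupported backend.type: {other}")),
        }
    }
}

/// Backend-specific templates that render for a given backend.type.
/// The placeholder Secret is only needed when the backend is not postgres.
pub fn backend_templates(backend: Backend) -> &'static [&'static str] {
    match backend {
        Backend::Postgres => &["postgres.yaml", "schema-init-job.yaml"],
        Backend::Loki => &["loki.yaml", "grafana-deps-placeholder.yaml"],
        Backend::Mongo => &["mongo.yaml", "grafana-deps-placeholder.yaml"],
    }
}
```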

Validated end-to-end on AKS: helm install kxn-stack -f
values-loki-override.yaml renders 4 pods (grafana, kxn-monitor,
kxn-logs, loki), kxn metric stream pushed continuously into Loki
(100 lines / 3 min).
Adds the resource_types that were missing for a complete monitoring +
compliance picture of a Kubernetes cluster:

  endpoints / endpoint_slices : service connectivity layer; flag
                                services that have no backing pods
  replicasets                 : the deploy -> rs -> pod chain so stale
                                rollouts can be spotted
  leases                      : controller leader-election state, with
                                renew_age_seconds for split-brain alerts
  csi_drivers / csi_nodes /
    volume_attachments        : storage-layer health, including stuck
                                attach/detach errors
  certificate_signing_requests: pending CSRs (kubelet bootstrap stuck)
  runtime_classes             : non-default runtimes (gVisor, kata)
  pod_restarts                : per-container restart counts +
                                last_termination_reason + oom_killed
                                (continuous monitoring metric)
  pod_status_phase            : counts by namespace and phase
                                (Running/Pending/Failed/Succeeded)
  container_oom_kills         : OOMKill events rolled up per namespace

All 9 are added to METRIC_RESOURCE_TYPES so flatten_gathered() emits
rows in the metrics table on every watch cycle, regardless of rule
outcome — same fix path as the previous 5 probes.

extra-rbac.yaml also picks up the new API groups: discovery.k8s.io,
coordination.k8s.io, storage.k8s.io (csidrivers/csinodes/volumeattachments),
certificates.k8s.io, node.k8s.io.
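
As one example of the rollups above, the pod_status_phase counts by namespace and phase could be computed along these lines. The function name and the (namespace, phase) tuple input are assumptions made for the sketch.

```rust
use std::collections::BTreeMap;

/// Count pods per (namespace, phase). Each input tuple is one pod.
pub fn count_by_namespace_and_phase(pods: &[(&str, &str)]) -> BTreeMap<(String, String), usize> {
    let mut counts = BTreeMap::new();
    for (namespace, phase) in pods {
        *counts
            .entry((namespace.to_string(), phase.to_string()))
            .or_insert(0) += 1;
    }
    counts
}
```

Each map entry then corresponds to one emitted row, so a namespace with no Pending pods simply has no Pending row for that cycle.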
Sprint 1 — APF + admission + discovery (6 probes)
  flow_schemas, priority_level_configurations, validating_admission_policies,
  validating_admission_policy_bindings, server_version, api_resources_summary

Sprint 2 — fine container/node metrics (2 probes)
  container_network (rx/tx/errors per pod via kubelet stats summary)
  node_runtime (pods used vs allocatable, image fs usage)

Sprint 3 — top 5 operator CRDs (8 probes, gracefully no-op if CRD absent)
  istio_virtual_services, istio_gateways
  argo_applications, argo_workflows
  cert_manager_certificates
  prometheus_rules, service_monitors
  flux_kustomizations

Sprint 4 — synthetic DNS health (1 probe)
  dns_health: resolve kubernetes.default.svc.cluster.local from the kxn
  pod, emit resolved/latency_ms/error so dashboards can plot DNS
  availability of the in-cluster stub

All 18 added to METRIC_RESOURCE_TYPES so they emit continuous rows in
the metrics table on every cycle. RBAC extended with flowcontrol +
admissionregistration API groups (CRD probes inherit core RBAC and
return [] silently when the CRD is not installed).
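
A synthetic DNS probe of the kind described for dns_health can be sketched with the standard library alone. The names below are hypothetical, and the resolver is passed in as a closure so the timing logic can be exercised without a live cluster; the actual probe may be structured differently.

```rust
use std::net::ToSocketAddrs;
use std::time::Instant;

/// The three fields named in the commit message.
#[derive(Debug)]
pub struct DnsHealth {
    pub resolved: bool,
    pub latency_ms: u128,
    pub error: Option<String>,
}

/// Time a single resolution attempt. `resolve` returns the number of
/// addresses found, or an error message.
pub fn probe_dns<F>(host: &str, resolve: F) -> DnsHealth
where
    F: FnOnce(&str) -> Result<usize, String>,
{
    let started = Instant::now();
    let outcome = resolve(host);
    let latency_ms = started.elapsed().as_millis();
    match outcome {
        Ok(n) if n > 0 => DnsHealth { resolved: true, latency_ms, error: None },
        Ok(_) => DnsHealth {
            resolved: false,
            latency_ms,
            error: Some("no addresses returned".to_string()),
        },
        Err(e) => DnsHealth { resolved: false, latency_ms, error: Some(e) },
    }
}

/// Resolver backed by the system stub resolver. ToSocketAddrs needs a port;
/// 443 is arbitrary here and no connection is opened.
pub fn system_resolver(host: &str) -> Result<usize, String> {
    (host, 443u16)
        .to_socket_addrs()
        .map(|addrs| addrs.count())
        .map_err(|e| e.to_string())
}
```

Inside a pod, `probe_dns("kubernetes.default.svc.cluster.local", system_resolver)` would exercise the in-cluster resolver path; outside a cluster the same call is expected to report an error rather than panic.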
Brings the full Kubernetes observability refactor:

  * 5 monitoring probes (tls_certs, pod_resource, k8s_jobs,
    netpol_coverage, disk_usage) replacing the Python collectors
  * 9 coverage probes (endpoints, endpoint_slices, replicasets,
    leases, csi_drivers/nodes, volume_attachments, csr,
    runtime_classes, pod_restarts, pod_status_phase, container_oom_kills)
  * 18 full-coverage probes — Sprints 1-4 (APF, admission, discovery,
    container network, node runtime, top operator CRDs, DNS health)
  * METRIC_RESOURCE_TYPES whitelist fix so probes emit on every cycle
    even when no rule fails
  * kxn-stack helm chart: drop Python CronJobs, add multi-backend
    support (postgres / loki / mongo)
  * Pod logs explorer dashboard + grafana datasource selection per backend
  * RBAC extension: discovery, coordination, storage, certificates,
    node, flowcontrol, admissionregistration API groups

Validated end-to-end on AKS: 25 resource_types emit continuous metrics
(672 container_network rows / 3 min, 912 pod_resource, 352
api_resources_summary, etc). All test clusters torn down.
@pszymkowiak pszymkowiak merged commit 2a94db0 into main May 6, 2026
8 of 10 checks passed