Add HPA diagnosis insights #916

Open
nadaverell wants to merge 1 commit into main from hpa-diagnosis-insights

nadaverell commented Jun 13, 2026

Summary

HPAs can fail quietly: the target workload may still have healthy-looking pods while autoscaling is capped, unable to read metrics, pinned by configuration, or paused at zero replicas. This PR makes HPA diagnosis a first-class Radar insight so operators can understand autoscaling state directly from Radar instead of reconstructing it from raw HPA conditions.

The goal is not to turn every HPA condition into an alert. The branch separates high-signal list/dashboard states from richer drawer context: broken/capped autoscaling is surfaced prominently, while states like min-bound, stale status, partial metrics, and stabilization remain available in detail views without adding table noise.

What Changed

Shared HPA Diagnosis Engine

  • Added pkg/hpadiag, a shared Go analyzer for autoscaling/v2 HPAs.
  • The analyzer produces a structured Diagnosis with:
    • state: normalized HPA state such as limited_max, metrics_unavailable, unable_to_scale, disabled, pinned, stale, stabilized, scaling_up, and ok.
    • summary: operator-facing text intended for Radar UI / AI context.
    • target: scale target reference.
    • bounds: min/max/current/desired replica data plus generation info.
    • metrics: normalized configured/current metric rows.
    • reasons: raw condition-backed evidence, preserving Kubernetes condition type/reason/message where available.
  • Added shared fixtures in testdata/hpa-diagnosis/cases.json covering maxed, metrics unavailable, partial metrics missing, unable to scale, disabled, pinned, scaling, stale, min-limited, stabilized, stable, and “at max without controller limit condition.”
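
As a rough illustration of the structured Diagnosis described above, the sketch below models the six top-level fields as Go types and round-trips one value through JSON. Only the field names state, summary, target, bounds, metrics, and reasons come from this description; every type name, nested field, and JSON tag is an assumption and may not match pkg/hpadiag.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NOTE: all type names and nested fields are hypothetical. Only the six
// top-level keys are taken from the PR description.

// Target is an assumed shape for the scale target reference.
type Target struct {
	Kind string `json:"kind"`
	Name string `json:"name"`
}

// Bounds is an assumed shape for replica bounds plus generation info.
type Bounds struct {
	MinReplicas        int32 `json:"minReplicas"`
	MaxReplicas        int32 `json:"maxReplicas"`
	CurrentReplicas    int32 `json:"currentReplicas"`
	DesiredReplicas    int32 `json:"desiredReplicas"`
	Generation         int64 `json:"generation"`
	ObservedGeneration int64 `json:"observedGeneration"`
}

// MetricRow is an assumed shape for one normalized metric row.
type MetricRow struct {
	Type    string `json:"type"`
	Name    string `json:"name"`
	Current string `json:"current,omitempty"`
	Target  string `json:"target,omitempty"`
}

// Reason preserves the raw Kubernetes condition behind a diagnosis.
type Reason struct {
	ConditionType string `json:"conditionType"`
	Reason        string `json:"reason"`
	Message       string `json:"message,omitempty"`
}

// Diagnosis mirrors the six fields listed in the PR description.
type Diagnosis struct {
	State   string      `json:"state"`
	Summary string      `json:"summary"`
	Target  Target      `json:"target"`
	Bounds  Bounds      `json:"bounds"`
	Metrics []MetricRow `json:"metrics"`
	Reasons []Reason    `json:"reasons"`
}

// sample builds a capped-at-max diagnosis with made-up values.
func sample() Diagnosis {
	return Diagnosis{
		State:   "limited_max",
		Summary: "Autoscaler wants more replicas but is capped at maxReplicas.",
		Target:  Target{Kind: "Deployment", Name: "checkout"},
		Bounds:  Bounds{MinReplicas: 2, MaxReplicas: 10, CurrentReplicas: 10, DesiredReplicas: 10},
		Metrics: []MetricRow{{Type: "Resource", Name: "cpu", Current: "92%", Target: "70%"}},
		Reasons: []Reason{{ConditionType: "ScalingLimited", Reason: "TooManyReplicas"}},
	}
}

// roundTrip encodes a Diagnosis to JSON and decodes it back.
func roundTrip(d Diagnosis) (Diagnosis, error) {
	raw, err := json.Marshal(d)
	if err != nil {
		return Diagnosis{}, err
	}
	var out Diagnosis
	err = json.Unmarshal(raw, &out)
	return out, err
}

func main() {
	raw, _ := json.MarshalIndent(sample(), "", "  ")
	fmt.Println(string(raw))
}
```

Keeping the raw condition type and reason in reasons, separate from summary, is what would let a UI show operator-facing text first and the Kubernetes evidence second.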

Signal Policy

  • “Maxed” now requires the HPA controller’s explicit ScalingLimited=True / TooManyReplicas evidence.
  • An HPA merely sitting at current == desired == maxReplicas is treated as normal unless the controller says it wanted more replicas and was capped.
  • ScalingActive=False is classified as metrics_unavailable unless it is the intentional zero-replica ScalingDisabled case.
  • AbleToScale=False is classified separately as unable_to_scale.
  • Partial metric gaps, min bounds, stale observed generation, pinned min=max, disabled zero-replica targets, and scale-down stabilization are kept as drawer/detail context rather than dashboard issue spam.
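
The policy above can be sketched as a small classifier. This is a simplified stand-in, not the analyzer in pkg/hpadiag: it uses local types instead of the autoscaling/v2 API types, covers only five of the states, and the order in which conditions are checked is an assumption. The condition types and reasons (AbleToScale, ScalingActive, ScalingLimited, ScalingDisabled, TooManyReplicas) are the standard ones reported by the HPA controller.

```go
package main

import "fmt"

// Condition is a local stand-in for an HPA status condition.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// HPA is a hypothetical, trimmed-down view of spec and status.
type HPA struct {
	MaxReplicas     int32
	CurrentReplicas int32
	DesiredReplicas int32
	Conditions      []Condition
}

// find returns the first condition of the given type, if any.
func find(conds []Condition, typ string) (Condition, bool) {
	for _, c := range conds {
		if c.Type == typ {
			return c, true
		}
	}
	return Condition{}, false
}

// Classify maps conditions to a normalized state. The precedence
// (unable_to_scale, then ScalingActive, then ScalingLimited) is assumed.
func Classify(h HPA) string {
	if c, ok := find(h.Conditions, "AbleToScale"); ok && c.Status == "False" {
		return "unable_to_scale"
	}
	if c, ok := find(h.Conditions, "ScalingActive"); ok && c.Status == "False" {
		// Intentional zero-replica pause, not a metrics failure.
		if c.Reason == "ScalingDisabled" {
			return "disabled"
		}
		return "metrics_unavailable"
	}
	// "Maxed" requires explicit controller evidence; replica counts
	// alone are deliberately not inspected here.
	if c, ok := find(h.Conditions, "ScalingLimited"); ok &&
		c.Status == "True" && c.Reason == "TooManyReplicas" {
		return "limited_max"
	}
	return "ok"
}

func main() {
	atMax := HPA{MaxReplicas: 10, CurrentReplicas: 10, DesiredReplicas: 10}
	fmt.Println(Classify(atMax))

	atMax.Conditions = []Condition{
		{Type: "ScalingLimited", Status: "True", Reason: "TooManyReplicas"},
	}
	fmt.Println(Classify(atMax))
}
```

Note that Classify never compares CurrentReplicas with MaxReplicas: under this policy an HPA sitting at its maximum is only reported as capped when the controller says so.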

Backend Surfaces

  • Resource detail responses now include hpaDiagnosis for HPA resources.
  • Resource context includes an HPA summary so MCP/AI workflows can reason about autoscaler state without fetching raw YAML first.
  • AI summary output uses the shared analyzer for HPA issue text.
  • Dashboard/problem detection now delegates HPA state decisions to the shared analyzer instead of maintaining separate HPA heuristics.
  • Topology resource wrapper types were extended to carry HPA diagnosis through existing resource detail response paths.
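
One way an optional hpaDiagnosis field could be carried on a detail response is a pointer field with omitempty, so non-HPA resources serialize exactly as before. The sketch below assumes this approach; ResourceDetail, its other fields, and the helper names are hypothetical and are not taken from the Radar codebase. Only the hpaDiagnosis key name comes from this description.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Diagnosis is trimmed to two fields for this sketch.
type Diagnosis struct {
	State   string `json:"state"`
	Summary string `json:"summary"`
}

// ResourceDetail is a hypothetical response wrapper. The pointer plus
// omitempty drops the key entirely when no diagnosis is attached.
type ResourceDetail struct {
	Kind         string     `json:"kind"`
	Name         string     `json:"name"`
	HPADiagnosis *Diagnosis `json:"hpaDiagnosis,omitempty"`
}

// encodeDetail serializes a detail response, attaching diag if non-nil.
func encodeDetail(kind, name string, diag *Diagnosis) string {
	raw, err := json.Marshal(ResourceDetail{Kind: kind, Name: name, HPADiagnosis: diag})
	if err != nil {
		return ""
	}
	return string(raw)
}

// hasDiagnosisKey reports whether the encoded payload contains the
// hpaDiagnosis key at the top level.
func hasDiagnosisKey(payload string) bool {
	var fields map[string]json.RawMessage
	if err := json.Unmarshal([]byte(payload), &fields); err != nil {
		return false
	}
	_, ok := fields["hpaDiagnosis"]
	return ok
}

func main() {
	fmt.Println(encodeDetail("Deployment", "checkout", nil))
	fmt.Println(encodeDetail("HorizontalPodAutoscaler", "checkout",
		&Diagnosis{State: "metrics_unavailable", Summary: "Autoscaler cannot read metrics."}))
}
```

With this shape, existing consumers that ignore unknown keys are unaffected, and consumers that want the diagnosis only need a presence check.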

Frontend / UX

  • The shared HPA renderer now has a Diagnosis section showing:
    • primary diagnosis summary,
    • state badge,
    • replica bounds,
    • raw reason/evidence rows,
    • normalized metric rows.
  • The HPA Metrics section prefers backend diagnosis metrics when present and falls back to existing raw status metrics otherwise.
  • Workload details now show compact autoscaler context when the workload is controlled by an HPA.
  • Manual replica scaling is disabled for workloads whose replicas are controlled by HPA/KEDA, with the UI pointing the user toward the controlling scaler instead.
  • The HPA list/table classifier remains intentionally conservative and fixture-backed so broad list views do not become noisy.

Shared UI / Types

  • Added shared HPA diagnosis TypeScript types to @skyhook-io/k8s-ui.
  • Added resource-utils-hpa for table-state classification, label/tone mapping, and status badge generation.
  • Wired the HPA renderer through the existing renderer override path so Radar’s app can inject Prometheus charts while the shared renderer stays host-agnostic.

Reviewer Focus

  • The main correctness question is the HPA state policy in pkg/hpadiag: whether each condition/state maps to the right Radar severity and surface.
  • The main product question is whether list/dashboard signals are conservative enough. This PR intentionally does not promote metrics_incomplete, limited_min, stale, or stabilized into table warnings.
  • The main UI question is whether the drawer gives enough evidence to act without exposing raw Kubernetes condition text as the primary summary.
  • The main API question is whether adding optional hpaDiagnosis to resource detail responses is the right shape for Radar app consumers.
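
To make the list/dashboard question concrete, the split between table-worthy and drawer-only states can be sketched as a single predicate. Per this description the real table classifier is resource-utils-hpa in the TypeScript package; the Go version below is only an illustration in the analyzer's language, and the function name ShowsInTable is hypothetical. The state names are the ones used in this PR.

```go
package main

import "fmt"

// ShowsInTable reports whether a diagnosis state is promoted to a
// list/dashboard warning. Everything else stays drawer-only.
// Hypothetical helper; the state split follows the PR description.
func ShowsInTable(state string) bool {
	switch state {
	case "limited_max", "metrics_unavailable", "unable_to_scale":
		// Broken or capped autoscaling is surfaced prominently.
		return true
	default:
		// metrics_incomplete, limited_min, stale, stabilized, pinned,
		// disabled, scaling_up, and ok remain detail context.
		return false
	}
}

func main() {
	for _, s := range []string{"limited_max", "stale", "stabilized", "unable_to_scale"} {
		fmt.Printf("%s table=%v\n", s, ShowsInTable(s))
	}
}
```

Defaulting unknown states to drawer-only keeps the table conservative if new states are added to the analyzer later.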

Testing

  • go test github.com/skyhook-io/radar/pkg/hpadiag github.com/skyhook-io/radar/internal/k8s github.com/skyhook-io/radar/pkg/ai/context github.com/skyhook-io/radar/pkg/resourcecontext github.com/skyhook-io/radar/internal/server
  • npm test --workspace @skyhook-io/k8s-ui -- resource-utils-hpa.test.ts WorkloadRenderer.test.tsx ResourceRendererDispatch.test.tsx
  • make tsc
  • make test

Visual-test: ran against kind-radar-gitops-demo with live HPA fixtures covering the HPA list, maxed drawer, metrics-unavailable drawer, and workload HPA context.

Notes / Tradeoffs

  • This does not add live Prometheus metric diagnosis beyond the existing HPA charts. The diagnosis is based on Kubernetes HPA spec/status/conditions.
  • Metric row normalization is best-effort across resource, container resource, pods, object, and external metrics.
  • The table and drawer intentionally do not use identical state visibility. The drawer is the complete diagnosis surface; the table is for scan-worthy operational signal.

Note

Medium Risk
Touches problem detection and resource API shape for autoscaling; behavior changes (stricter “maxed”) could alter dashboard issue counts, but logic is fixture-backed and well-tested.

Overview
Introduces pkg/hpadiag, a shared analyzer that turns HPA spec/status/conditions into a structured diagnosis (state, summary, bounds, metrics, reasons). “Maxed” now requires controller evidence (ScalingLimited / TooManyReplicas); sitting at max replicas without that condition is no longer flagged. Metrics and scale failures map to cannot-scale issues; min-bound, stale, stabilization, and similar states stay detail-only.

Backend: Resource GET responses and topology wrappers add optional hpaDiagnosis; resource context and AI summaries use the same analyzer. DetectHPAProblems delegates to hpadiag instead of inline condition parsing.

Frontend: HPA drawers show a Diagnosis section (badge, reasons, metrics); list/table status uses conservative resource-utils-hpa classification. Workloads controlled by an HPA fetch compact scaler diagnosis; ConditionsSection supports warning tones for max-limited conditions.

Reviewed by Cursor Bugbot for commit 32c9045.

nadaverell requested a review from hisco as a code owner June 13, 2026 20:54
Comment thread web/src/components/resources/renderers/WorkloadRenderer.tsx Outdated
nadaverell force-pushed the hpa-diagnosis-insights branch from cb0d80a to 2db4ac8 Compare June 13, 2026 21:02
Comment thread pkg/hpadiag/diagnosis.go
nadaverell force-pushed the hpa-diagnosis-insights branch 3 times, most recently from 347883b to 255cb83 Compare June 13, 2026 22:58

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 255cb83.

Comment thread internal/k8s/detect_workload.go
Comment thread pkg/hpadiag/diagnosis.go
nadaverell force-pushed the hpa-diagnosis-insights branch from 255cb83 to 32c9045 Compare June 13, 2026 23:05