🚀 enhance kwatch by abahmed · Pull Request #456 · abahmed/kwatch

abahmed · 2026-06-13T22:42:22Z

No description provided.

B1: enqueuePod handles tombstones via DeletionHandlingMetaNamespaceKeyFunc
B2: Guard inner per-pod map with sync.Mutex in memory.go
B3: RemovePod resolves all incidents (remove break)
B4: CrashLoopBackOff fetches previous logs (RestartCount>0 && Running==nil)
B5: Owner-lookup errors captured and logged instead of swallowed
B6: Startup panic on unknown provider name guarded
B7: Incident messages include Logs and Events fields
MC1: Eliminate syncPod double lister Get
MC2: Restrict normalizeReason to known retry reasons
MC3: Pod-only incident key uses '.' container name for consistency

B1: TestEnqueuePodDeletedFinalStateUnknown
B2: TestMemoryConcurrentAccess
B3: TestRemovePodMultiIncidentResolve
B4: TestContainerLogsFilterCrashLoopBackOff
B5: TestPodOwnersFilterReplicaSet (verify owner stays as RS on error)
B6: TestGetProvidersUnknownSkipped
B7: TestFormatIncidentMessageWithLogsEvents
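
Item B2 guards a nested per-pod map with a sync.Mutex. A minimal sketch of that pattern follows. The `podStore` type and its methods are hypothetical and are not taken from memory.go; the point is only that both map levels are read and written while the lock is held.

```go
package main

import (
	"fmt"
	"sync"
)

// podStore is a hypothetical stand-in for a two-level store:
// an outer map keyed by pod, an inner map keyed by container.
type podStore struct {
	mu   sync.Mutex
	pods map[string]map[string]int // pod key -> container -> restart count
}

func newPodStore() *podStore {
	return &podStore{pods: make(map[string]map[string]int)}
}

// Set records a value, creating the inner map on first use.
// Both map levels are touched only while mu is held.
func (s *podStore) Set(pod, container string, restarts int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	inner, ok := s.pods[pod]
	if !ok {
		inner = make(map[string]int)
		s.pods[pod] = inner
	}
	inner[container] = restarts
}

// Get returns the stored value and whether it was present.
func (s *podStore) Get(pod, container string) (int, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.pods[pod][container] // indexing a nil inner map is safe for reads
	return v, ok
}

func main() {
	s := newPodStore()
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			s.Set("default/web", fmt.Sprintf("c%d", n), n)
		}(i)
	}
	wg.Wait()
	v, ok := s.Get("default/web", "c7")
	fmt.Println(v, ok)
}
```

Running a program like this under `go run -race` is one way to confirm that concurrent writers no longer trip the race detector.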

…tore
- Add model.ContainerState and Incident.LastContainerState
- Engine.Process now accepts and persists container state
- Add Engine.GetLastContainerState for change-detection queries
- Remove Memory from filter.Context; add LastState/PodLastState
- ContainerRestartsFilter, ContainerReasonsFilter, PodStatusFilter read from ContainerContext.LastState / Context.PodLastState
- executeContainersFilters and executePodFilters query engine for last state instead of memory; pass ContainerState to Process
- Remove h.memory.DelPod and Memory context wiring from process_pod
- Delete dead EventFilter (never registered)
- Update all tests

- Add filter.Status, filter.Detector (pure predicate, no I/O), filter.Enricher (I/O phase for broken signals)
- Pure filters (Namespace, PodName, PodStatus, ContainerName, ContainerRestarts, ContainerState, ContainerReasons, Noise) implement Detector with Status return
- I/O filters (PodOwnersFilter, ContainerLogsFilter) implement both Detector (returns StatusAlert) and Enricher
- Event-dependent filters (PodEventsFilter, ContainerKillingFilter) implement Enricher (events fetched only when needed)
- handler splits into podDetectors/podEnrichers/containerDetectors/containerEnrichers: no I/O on healthy path
- executePodFilters/executeContainersFilters: detect → enrich → engine
- Remove upfront GetPodEvents from ProcessPodObject (now fetched only when a broken signal is detected)

- Add Resource field to event.Event for signal type (pod/node/pvc)
- Engine.Process uses ev.Resource dynamically (defaults to pod)
- Add Engine.StartupQuiet config: suppresses alerts N seconds after startup to prevent re-alerting pre-existing breakage
- Add Engine.ResolveByResource: resolves incidents by resource type and name (used by node/PVC recovery)
- process_node.go: emit node signals through engine (cooldown/dedup/resolve lifecycle); drop h.memory.HasNode/AddNode/DelNode
- PVC monitor: emit pvc signals through engine; drop alertManager.Notify in favor of NotifyIncident; resolve when usage drops below threshold
- Remove unused memory field from handler struct and constructor; remove storage/memory imports from handler, main, and handler_test

…ility, docs

C1: multi-ns informers now return []cache.SharedIndexInformer; event handlers attached to all; all HasSynced collected in WaitForCacheSync.
C3: RBAC rules for batch/jobs (ClusterRole) + coordination.k8s.io/leases (Role) in chart and deploy manifests.
C2: leader-election scope gates PVC/correlator/heartbeat/startup-msg/controller; os.Exit(0) on lost leadership to prevent zombies.
C5: heartbeat rewritten to external HTTP GET/POST; no corr-engine or alert-manager involvement.
C6: baseline (seen) suppression checked unconditionally, not only during StartupQuiet window; ClearSeen on healthy pod or delete.
C7: Slack SendIncident Logs/Events blocks added to create/update paths and fallback text.
C4: removed dead CRD scaffolding (api/v1alpha1/ + deploy/crd.yaml).
Add tests: engine stale/lifecycle, baseline suppression, ClearSeen, multi-ns informer, Slack Logs/Events golden, heartbeat ping.

Add regression tests for:
- StatefulSet owner resolution in PodOwnersFilter
- STS-owned pod grouping in correlation engine
- Golden messages for stale + resolved incidents

Implement DS/SS listers (unconditional, like RS lister):
- multiDaemonSetLister and multiStatefulSetLister in listers.go
- factorySet methods daemonSetLister/informers + statefulSetLister/informers
- DS/SS fields on Controller struct, wired in New() and Run()
- SetDaemonSetLister/SetStatefulSetLister on Handler interface
- DSLister/SSLister fields on filter.Context
- PodOwnersFilter uses lister first, falls back to live API
- process_pod.go passes both listers into Context

Add multiEventLister/multiEventNamespaceLister, dedicated factory with involvedObject.kind=Pod field selector, wire into Controller.New() and Run(). Replace live GetPodEvents calls in executePodFilters and executeContainersFilters with lister-first (fallback to live API when lister is nil). Add SetEventLister to Handler interface + implementation. Tests: TestBrokenPodEventsFromCache (zero event LIST calls with lister), tighten TestBrokenPodMakesAPICalls to exact count (2 without lister, 1 with).

Add Workers field to config.Config (default 1). Replace hardcoded ctrl.Run(ctx, 1) in main.go with cfg.Workers with min-1 guard. Add TestRunMultipleWorkers in controller_test.go.

Switch baseline from pod-key (ns/name) to incident-key (ns:owner:reason:container) with TTL-based expiry.

Engine changes:
- seen becomes map[string]int64 (incident key → baselinedAt unix ts)
- isBaselined checks incident key with BaselineTTL (default 24h)
- BuildKey helper for consistent key construction
- OnBaselineChange hook fires on ClearSeen/MarkResolved/RemovePod
- NewEngine accepts initial Baseline from persistence

Persistence (state.ConfigMap):
- GetBaseline/SaveBaseline round-trip via JSON in kwatch-state ConfigMap
- main.go loads baseline before controller start, saves on changes

Controller buildSeenSet:
- Resolves owners via RS/DS/SS listers, computes incident keys
- Replaces old pod-key collection with incident-key baseline

Tests updated for new incident-key baseline semantics.

…D restore
Section 5 Items 4-9:
- PVC warn/critical thresholds (Severity field on Event, configurable tiers)
- Configurable log-block bound (MaxLogBlockLines, replaces hardcoded 100)
- DaemonSet monitor (unavailable-pod detection, gated)
- CronJob monitor (suspended/missed-schedule detection, gated)
- Operability: /readyz endpoint, /debug/pprof/* profiling handlers
- CRD scaffolding restored with all new config fields
- README updated for all new configuration options

… prefix matching
engine.ClearSeenByPrefix removes entries by prefix (ns:owner:) and fires OnBaselineChange on actual change. handler.resolveOwnerName resolves pod owner the same way controller.buildSeenSet does, so the healthy-path clear matches incident keys exactly (not stale pod-key deletes that never hit).

Replaces SetMaxLogBlockLines with SetMaxLogLines wired from cfg.MaxRecentLogLines (0 → default 100). Avoids a second lever for the same concept.

Pkg internal/crdwatch watches KwatchConfig CRs via client-go dynamic informer. Hot-applies: maxRecentLogLines, silences, severityByOwnerKind. Restart-only fields logged with 'requires restart' message. CR deletion restores boot-time ConfigMap snapshot. Gated behind crd.enabled (default off); graceful absence when CRD not installed. RBAC in both manifests.

Add correlation.EscalationTiers to engine config; BuildTiers/escalationSeverity helpers compute severity from restart count. When enabled, engine.Process sets ev.Severity before enrichment based on thresholds [3,10,50] → 3+ high, 10+ critical. Wired from main.go config.

Watch HorizontalPodAutoscalers via autoscaling/v2 informer. Alert when currentReplicas >= maxReplicas with reason HPAMaxedOut. Gated behind hpaMonitor.enabled (default false). RBAC autoscaling/horizontalpodautoscalers added to both manifests. Multi-namespace lister support included.

Periodic scanner of kubernetes.io/tls Secrets. Parses tls.crt PEM and checks NotAfter against configurable threshold (default 30d). Alerts with TlsCertExpired or TlsCertExpiringSoon reasons. Gated behind tlsMonitor.enabled (default false). Runs inside runLeaderTasks. RBAC for secrets added.

--version flag prints version.Short() and exits.

Add docs for: pprof, escalation, CRD config, HPA monitor, TLS monitor, --version flag.

Extract shared owner resolution into correlation.ResolveOwnerName — used by controller buildSeenSet and handler healthy path. Replace ClearSeen on Handler interface with ClearSeenByOwner(namespace, owner). On lister error return '' and skip baseline (never guess).

Replace BuildTiers/escalationSeverity with crossedTier/escalateSeverity. Escalation fires on UPDATE (not CREATE), comparing prev vs new RestartCount. Severity escalated AFTER enricher.Enrich (which overwrites it). Track RestartCount on both CREATE and UPDATE paths to enable cross-tier detection.
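
A sketch of cross-tier detection on update, comparing the previous and new restart counts. The signature of `crossedTier` and the `tier` struct below are guesses for illustration; only the function name, the prev-vs-new comparison, and the 3+ high / 10+ critical mapping appear in the commit messages. The 50 threshold is omitted because its severity is not stated.

```go
package main

import "fmt"

// tier pairs a restart-count threshold with the severity it escalates to.
type tier struct {
	threshold int
	severity  string
}

// crossedTier returns the highest tier whose threshold lies in (prev, next],
// that is, a tier the update newly crossed. ok is false if none was crossed.
// tiers must be sorted by ascending threshold.
func crossedTier(prev, next int, tiers []tier) (t tier, ok bool) {
	for _, c := range tiers {
		if prev < c.threshold && next >= c.threshold {
			t, ok = c, true
		}
	}
	return t, ok
}

func main() {
	tiers := []tier{{3, "high"}, {10, "critical"}}
	for _, p := range [][2]int{{1, 2}, {2, 3}, {3, 4}, {2, 12}} {
		t, ok := crossedTier(p[0], p[1], tiers)
		fmt.Printf("prev=%d new=%d crossed=%v severity=%q\n", p[0], p[1], ok, t.severity)
	}
}
```

Comparing against the previous count is what makes the escalation edge-triggered: an incident sitting at 4 restarts does not re-escalate on every update, only when it next crosses a threshold.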

Rework HPA detection: alert only when DesiredReplicas >= MaxReplicas AND CurrentReplicas < DesiredReplicas (still scaling). First-maxed timestamp tracked per HPA; sustainedMinutes window (default 0 = immediate) before alerting. Add SustainedMinutes to HpaMonitor config.

Replace LIST-API-based tlsmonitor with informer-backed TLS secret lister. Add SecretLister to handler, SweepTLSSecrets iterates cache. TLS secret informer uses field selector type=kubernetes.io/tls (separate factory). Runs 24h ticker + immediate sweep inside runLeaderTasks. RBAC unchanged.

Track active node incidents per NodeName in engine. Suppress all pod incidents on that node when an active node incident exists. Increment SuppressedPods counter on the node incident. Config inhibition: nodeSuppressesPods (default false).

Track create rate in sliding window. When threshold exceeded, buffer new incidents into digestBuf and return ActionDigest (silent). checkLifecycle flushes digest as ActionDigestFlush with Hint summary listing top 5 reason×ns combos. Config storm: {enabled, threshold, windowMinutes, digestIntervalMinutes}.

…tured log
FIX-1: os.Expand -> regex ${VAR} only (bare $ preserved); bcrypt/password-safe
FIX-2: namespaces get,list,watch RBAC in chart ClusterRole + deploy.yaml
NIT: klog.Warning -> klog.InfoS for config-file-not-found
Tests: TestConfigEnvInterpolation covers ${VAR}, bare $, missing var, bcrypt $

…proxy, BUG-IV threadMap leak
BUG-I: refreshNodeInhibition in MarkResolved + checkLifecycle pending-resolve finalize
BUG-II: delete dead ClearSeen (fires hook under lock)
BUG-III: url.Parse(appConfig.ProxyURL) instead of http.ProxyURL(nil) in client.go
BUG-IV: delete(s.threadMap, key) in Slack ActionResolved

…er,reason)
Rule 1: notifSig/edgeAction replaces cooldown-gated ActionUpdate; StateStale/ActionStale removed; renotify uses incident fields directly
Rule 2: IncidentKey drops container component (namespace:owner:reason); Containers map tracks affected containers; formatters show multi-container

DefaultConfig: all monitors ON, MaxRecentLogLines:50, ResyncSeconds:600, HealthCheck.Enabled:true, Storm ON (10/5m), Inhibition ON, ResolveHoldDown:30, Escalation ON
config.go: HealthCheck.Diagnostics field (default false, gates /incidents+/test-alert)
validate.go: Validate() for storm/escalation/pendingPod/pvc rules, called from LoadConfig
Chart: replicaCount, probes, LE auto-inject on >1 replica, RBAC gated (secrets/TLS, leases/LE, kwatchconfigs/CRD)
deploy.yaml: replicas:1, probes, commented-out optional RBAC

§5f: RenotifyInterval collapsed into RenotifyIntervalBySeverity["default"]; deprecated Interval merged in main.go
IMP-1: async notification delivery (per-provider buffered channel + worker goroutine; non-blocking send)
IMP-2: internal/metrics package with atomic counters, Prometheus /metrics handler registered on health server
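
IMP-1 pairs each provider with a buffered channel and a worker goroutine so that a slow provider cannot stall the caller. A minimal sketch of that shape follows. The `provider` type is hypothetical, delivery is simulated by appending to a slice, and dropping the newest message when the buffer is full is an assumption made for brevity.

```go
package main

import (
	"fmt"
	"sync"
)

// provider owns a buffered queue drained by one worker goroutine.
type provider struct {
	name  string
	queue chan string
	wg    sync.WaitGroup
	mu    sync.Mutex
	sent  []string // stands in for real delivery
}

func newProvider(name string, buf int) *provider {
	p := &provider{name: name, queue: make(chan string, buf)}
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		for msg := range p.queue { // worker: deliver in order until closed
			p.mu.Lock()
			p.sent = append(p.sent, msg)
			p.mu.Unlock()
		}
	}()
	return p
}

// enqueue never blocks the caller: if the buffer is full the message is
// dropped and false is returned.
func (p *provider) enqueue(msg string) bool {
	select {
	case p.queue <- msg:
		return true
	default:
		return false
	}
}

// close stops accepting messages and waits for the worker to drain.
func (p *provider) close() {
	close(p.queue)
	p.wg.Wait()
}

func main() {
	p := newProvider("slack", 8)
	for i := 0; i < 3; i++ {
		fmt.Println(p.enqueue(fmt.Sprintf("incident-%d", i)))
	}
	p.close()
	fmt.Println(len(p.sent))
}
```

The select with a default branch is what makes the send non-blocking; a real implementation would also need a policy for what to do with the rejected message, such as counting it in a metric.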

§5: config_test.go uses t.Setenv + t.TempDir instead of manual cleanup
IMP-7: truncateMsg() at 4096 chars in formatCreate/Update/ResolvedMessage
IMP-8: chart + deploy.yaml default memory 128Mi -> 256Mi

IMP-5: PeakResources tracks max resources across updates; Runbooks config maps reason->URL appended to hint
§5b: ErrImagePull -> ImagePullBackOff normalization in normalizeReason (same root cause, one incident)

…v0.11.0
IMP-3: harness_test.go with recordingAlertManager + 5 integration tests (crashloop, node, inhibition, baseline, grouping)
IMP-4: event.Signal type + h.signalEvent helper; process_node.go converted as example
§5f: deprecation warnings for ignoreContainerNames/LogPatterns/ContainerMessages/NodeReasons/NodeMessages
§6c: version bump v0.11.0, CHANGELOG.md with all changes since v0.10.x

- IMP-4: all handler files now use event.Signal via h.signalEvent() or PvcMonitor.reportSignal()
- README: upgrading section documents low-noise defaults + edge-triggered; config tables updated with new defaults
- deploy/chart/test_helm.sh: verifies replicaCount=1/2 template invariants

…lidation
- Add ActiveCount() to Engine (O(1) count, replaces O(n) Snapshot() in metrics hook)
- Add containerDisplayName() helper: fixes blank Container: line when single container + empty ContainerName in all three formatters
- Export ProviderNames() from alert package
- Validate unknown alert provider keys in Validate() / ValidateConfig() (startup failure + lint error)

…d benchmark, docs
HYST-1 (P2): PVC hysteresis with ClearThreshold (default 75)
FILTER-1 (P2): disruption filter skips phase=Failed, reason=Evicted
LOG-1 (P2): skip reasons lowered to klog.V(2)
SCAN-1: ActiveCount excludes StateResolved
SCAN-2: maxBackoff in extractRetry() on all delivery paths
SCAN-3: Init() calls shutdown() before reinitializing
NEW-2b: canonical provider-name set in config.KnownProviders
NEW-2 Change B: VerifiableProvider interface + VerifyAll() + Telegram/Discord/Slack Verify()
kwatch lint --check wiring
NEW-4: storm collapse load test + bounded state test + BenchmarkProcessStorm
§4: silences consolidation: SilenceRule gets ContainerNames, LogPatterns, ContainerMessages, NodeReasons, NodeMessages; SuppressionIndex for unified detect-time filtering; deprecation shim maps ignore* fields into synthetic SilenceRules; CRD watcher maps new fields
§5: Upgrading, Guarantees, Recipes docs; CLI table with --strict/--check
FIX-6: lenient LoadConfig + LintStrict() + lint --strict flag
FIX-2 drain: Start(ctx) drains workers on cancellation
SKIP_UPGRADE_CHECK env var wired in NewUpgrader
SeverityByReason config field checked before owner-kind

CR-1 LifecycleHook data race (clone under lock)
CR-2 Node inhibition cleared during hold-down
PVC-3 Swallowed node error auto-resolves PVCs
BUG-1 CronJob false positives (split nil/staleness)
BUG-2 PodStatusFilter Added casing (EqualFold)
BUG-3 OOMKilled hint for containers with no limit
BUG-4 Per-signal IncludeEvents/IncludeLogs (dead code)
BUG-5 tls_sweep clock (expiry.Sub now)
BUG-6 Init container error hint
BUG-7 CrashLoop dropped on transient clean exit
F2 Drop-oldest channel send can block
SCAN-1 ActiveCount overcount
SCAN-3 AlertManager Init race (idempotent shutdown)
LOG-1 Container detection log level (V(2))
SQ-1 Remove startupQuiet
PVC-1 Kubelet ctx timeout threading
PVC-2 N+1 PVC API fan-out (single List)
PVC-4 Nil PodRef panic guard
PVC-5 Div by zero guard
HTTP-1 Discord/Telegram shared HTTP client
HTTP-2 Unbounded log fetch (cap at 500/1MB)
HB-1 Heartbeat ping tied to request ctx
MAIN-1/MAIN-2 Graceful shutdown (stop channel)
HEALTH-1 /readyz returns 503 until leader tasks ready
Severity default table + escalation fix
NEW-4 integration test for controller fault path

… startup baseline summary
HPA-2: Detect misconfigured HPAs via AbleToScale/ScalingActive conditions. Uses reason-specific MarkResolved (not blanket ResolveByResource) so HPAScalingError and HPAMaxedOut coexist independently on the same HPA.
BASE-1a: PVC firstScan seeds over-threshold volumes into notifiedPvc without alerting on first checkUsage, preventing re-alerts on restart.
BASE-1b: Node alerting conditions (NotReady, MemoryPressure, etc.) are seeded into the Seen baseline at startup for restart parity.
CHRONIC-1: Emits one PreExistingAtStartup notification at startup summarizing pre-existing breakage suppressed by the baseline, gated by ReportStartupBaseline config (default true).

- self-hosted LLM sidecar (Qwen2.5-Coder-1.5B via Ollama) appends root-cause analysis to alerts; opt-in via llm.enabled (default off)
- internal/llm/ package: client, prompt, redact, selectRelevant
- circuit breaker (3-fail/60s cooldown), single-flight enrich channel
- CD-1..5: grounding, correlation info, signature hints, runbooks, investigate commands + dashboard deep-link
- deploy/llm/: baked model image, Dockerfile, Makefile target, CI
- Helm: config.llm.enabled, llm.nativeSidecar, replica guard
- Raw manifests: GOMEMLIMIT, commented LLM sidecar + config
- Metrics: kwatch_llm_enrich_{total,failed,skipped}_total
- Documentation: README, chart README, CHANGELOG, helm test suite

+		return "", fmt.Errorf("failed to marshal feishu title: %w", err)
+	}
+	body := `{"msg_type":"interactive","card":{"config":{"wide_screen_mode":true},` +
+		`"header":{"title":{"tag":"plain_text","content":` + string(titleJSON) +


abahmed added 30 commits June 10, 2026 12:07

refactor: Phases 4-8 — coverage, lifecycle signals, diagnosis, operab…

9b7781c

…ility, docs

refactor: Remediation — A1-A4 + B-P1 through B-P8 fixes

d33135f

feat: configurable workers knob

e0f76c9

chore: remove accidentally committed build binary

748aac0

fix: gate pprof behind healthCheck.pprof (default off)

8905466

fix: drop maxLogBlockLines knob, reuse MaxRecentLogLines

56cdf96

fix: sort events by timestamp, chart values, DS availability hint

56c6fb7

feat: add --version flag for operability (Part B B4)

f909cfa

docs: update README with all new config options (Part B B5)

9306c12

feat: node-pod inhibition (Part B4.1)

c8effef

abahmed added 19 commits June 14, 2026 15:55

fixes

e50e3a3

chore: §5 test hygiene, IMP-7 message-size, IMP-8 256Mi memory

9e82cc7

feat: IMP-5 PeakResources+runbooks, §5b image-pull normalization

b374f68

fix: injectable clock for handler (h.now()) — replaces direct time.No…

f027aa2

…w/time.Since in hpa, cronjob, tls_sweep

fix: container set in update/resolved messages (alert.go + Slack)

a964885

chore: test hygiene — replace os.Setenv/defer cleanup with t.Setenv/t…

8699335

….TempDir

fix: FIX-1 (P0 crash) — assign controller-level CronJob/HPA listers

cdfaeef

FIX-4+FIX-5: rune-safe truncateMsg + per-provider maxBytes

e352652

FIX-3: remove flat renotify.interval field

336615d

FIX-2+NEW-1: async delivery with per-provider channels, backoff, DLQ

c9263b4

NEW-3: stable incident_id in logs + NEW-5: strict YAML lint

8ee47c7

copy-on-emit: clone incident before async enqueue (point 7) + NEW-2: …

f05c9ba

…provider credential validation

shameemshah approved these changes Jun 15, 2026

abahmed added 9 commits June 15, 2026 13:01

fix: kwatch-llm non-root + shutdown panic + breaker single-probe + 64…

c3248e1

…-bit affinity + CI split + trims

fix: apply scan batch + remaining items (digested, GOMEMLIMIT, fanOut…

98034b6

… race, marshal errors)

fix: thread context.Context through filter pipeline, remove context.T…

c6e3f78

…ODO()

fix: wrap GetPodEvents/GetPVNameFromPVC with timeout + thread ctx thr…

49ebc19

…ough ContainsKillingStoppingContainerEvents and handler filters

github-advanced-security AI found potential problems Jun 17, 2026

Comment thread internal/alert/feishu/feishu.go

		return "", fmt.Errorf("failed to marshal feishu title: %w", err)
	}
	body := `{"msg_type":"interactive","card":{"config":{"wide_screen_mode":true},` +
		`"header":{"title":{"tag":"plain_text","content":` + string(titleJSON) +

abahmed wants to merge 81 commits into main from fix/phase-0-bugs

abahmed commented Jun 13, 2026

3 participants
