🚀 enhance kwatch#456

Open
abahmed wants to merge 81 commits into main from fix/phase-0-bugs
Conversation

@abahmed abahmed commented Jun 13, 2026

Owner

No description provided.

abahmed added 30 commits June 10, 2026 12:07
B1: enqueuePod handles tombstones via DeletionHandlingMetaNamespaceKeyFunc
B2: Guard inner per-pod map with sync.Mutex in memory.go
B3: RemovePod resolves all incidents (remove break)
B4: CrashLoopBackOff fetches previous logs (RestartCount>0 && Running==nil)
B5: Owner-lookup errors captured and logged instead of swallowed
B6: Startup panic on unknown provider name guarded
B7: Incident messages include Logs and Events fields
MC1: Eliminate syncPod double lister Get
MC2: Restrict normalizeReason to known retry reasons
MC3: Pod-only incident key uses '.' container name for consistency
B1: TestEnqueuePodDeletedFinalStateUnknown
B2: TestMemoryConcurrentAccess
B3: TestRemovePodMultiIncidentResolve
B4: TestContainerLogsFilterCrashLoopBackOff
B5: TestPodOwnersFilterReplicaSet (verify owner stays as RS on error)
B6: TestGetProvidersUnknownSkipped
B7: TestFormatIncidentMessageWithLogsEvents
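
For illustration, B2 and B3 above could look roughly like the sketch below: one mutex guards both the outer map and the inner per-pod map, and RemovePod walks every incident instead of stopping at the first. `Store`, `NewStore` and `AddIncident` are hypothetical stand-ins, not the actual types in memory.go.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a hypothetical stand-in for the in-memory store.
type Store struct {
	mu   sync.Mutex
	pods map[string]map[string]bool // pod key -> incident key -> active
}

func NewStore() *Store {
	return &Store{pods: make(map[string]map[string]bool)}
}

// AddIncident records an active incident; the lock covers the inner map too.
func (s *Store) AddIncident(pod, incident string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.pods[pod] == nil {
		s.pods[pod] = make(map[string]bool)
	}
	s.pods[pod][incident] = true
}

// RemovePod resolves every incident of the pod and reports how many.
func (s *Store) RemovePod(pod string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	resolved := 0
	for range s.pods[pod] {
		resolved++ // no break here: all incidents are resolved
	}
	delete(s.pods, pod)
	return resolved
}

func main() {
	s := NewStore()
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.AddIncident("default/web-0", fmt.Sprintf("reason-%d", i))
		}(i)
	}
	wg.Wait()
	fmt.Println(s.RemovePod("default/web-0"))
}
```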
…tore

- Add model.ContainerState and Incident.LastContainerState
- Engine.Process now accepts and persists container state
- Add Engine.GetLastContainerState for change-detection queries
- Remove Memory from filter.Context; add LastState/PodLastState
- ContainerRestartsFilter, ContainerReasonsFilter, PodStatusFilter
  read from ContainerContext.LastState / Context.PodLastState
- executeContainersFilters and executePodFilters query engine
  for last state instead of memory; pass ContainerState to Process
- Remove h.memory.DelPod and Memory context wiring from process_pod
- Delete dead EventFilter (never registered)
- Update all tests
- Add filter.Status, filter.Detector (pure predicate, no I/O),
  filter.Enricher (I/O phase for broken signals)
- Pure filters (Namespace, PodName, PodStatus, ContainerName,
  ContainerRestarts, ContainerState, ContainerReasons, Noise)
  implement Detector with Status return
- I/O filters (PodOwnersFilter, ContainerLogsFilter) implement
  both Detector (returns StatusAlert) and Enricher
- Event-dependent filters (PodEventsFilter, ContainerKillingFilter)
  implement Enricher (events fetched only when needed)
- handler splits into podDetectors/podEnrichers/containerDetectors/
  containerEnrichers — no I/O on healthy path
- executePodFilters/executeContainersFilters: detect → enrich → engine
- Remove upfront GetPodEvents from ProcessPodObject (now fetched
  only when a broken signal is detected)
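
A minimal sketch of the detect → enrich split described above, assuming simplified shapes: `Pod`, `StatusOK`, `restartDetector`, `countingEnricher` and `process` are made up for the example, and the real filter.Detector / filter.Enricher signatures may differ. The point it shows is that enrichers (the I/O phase) are never called on the healthy path.

```go
package main

import "fmt"

// Status is the result of a pure detection step.
type Status int

const (
	StatusOK    Status = iota // hypothetical name for "nothing to report"
	StatusAlert               // named in the commit message
)

// Pod is a hypothetical, trimmed-down view of a pod.
type Pod struct {
	Name     string
	Restarts int
}

// Detector is a pure predicate: no I/O.
type Detector interface{ Detect(p Pod) Status }

// Enricher does I/O and runs only for broken signals.
type Enricher interface{ Enrich(p Pod) string }

type restartDetector struct{ max int }

func (d restartDetector) Detect(p Pod) Status {
	if p.Restarts > d.max {
		return StatusAlert
	}
	return StatusOK
}

// countingEnricher counts calls so the "no I/O on healthy path" property is observable.
type countingEnricher struct{ calls int }

func (e *countingEnricher) Enrich(p Pod) string {
	e.calls++
	return "events for " + p.Name
}

// process runs detectors first and enrichers only if one of them alerts.
func process(p Pod, ds []Detector, es []Enricher) []string {
	alert := false
	for _, d := range ds {
		if d.Detect(p) == StatusAlert {
			alert = true
			break
		}
	}
	if !alert {
		return nil
	}
	var out []string
	for _, e := range es {
		out = append(out, e.Enrich(p))
	}
	return out
}

func main() {
	e := &countingEnricher{}
	ds := []Detector{restartDetector{max: 3}}
	es := []Enricher{e}
	fmt.Println(process(Pod{Name: "ok"}, ds, es), e.calls)
	fmt.Println(process(Pod{Name: "bad", Restarts: 9}, ds, es), e.calls)
}
```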
- Add Resource field to event.Event for signal type (pod/node/pvc)
- Engine.Process uses ev.Resource dynamically (defaults to pod)
- Add Engine.StartupQuiet config — suppresses alerts N seconds
  after startup to prevent re-alerting pre-existing breakage
- Add Engine.ResolveByResource — resolves incidents by resource
  type and name (used by node/PVC recovery)
- process_node.go: emit node signals through engine (cooldown/
  dedup/resolve lifecycle); drop h.memory.HasNode/AddNode/DelNode
- PVC monitor: emit pvc signals through engine; drop alertManager.Notify
  in favor of NotifyIncident; resolve when usage drops below threshold
- Remove unused memory field from handler struct and constructor;
  remove storage/memory imports from handler, main, and handler_test
C1 — multi-ns informers now return []cache.SharedIndexInformer; event
handlers attached to all; all HasSynced collected in WaitForCacheSync.
C3 — RBAC rules for batch/jobs (ClusterRole) + coordination.k8s.io/leases
(Role) in chart and deploy manifests.
C2 — leader-election scope gates PVC/correlator/heartbeat/startup-msg/
controller; os.Exit(0) on lost leadership to prevent zombies.
C5 — heartbeat rewritten to external HTTP GET/POST; no corr-engine or
alert-manager involvement.
C6 — baseline (seen) suppression checked unconditionally, not only during
StartupQuiet window; ClearSeen on healthy pod or delete.
C7 — Slack SendIncident Logs/Events blocks added to create/update paths
and fallback text.
C4 — removed dead CRD scaffolding (api/v1alpha1/ + deploy/crd.yaml).
Add tests: engine stale/lifecycle, baseline suppression, ClearSeen,
multi-ns informer, Slack Logs/Events golden, heartbeat ping.
Add regression tests for:
- StatefulSet owner resolution in PodOwnersFilter
- STS-owned pod grouping in correlation engine
- Golden messages for stale + resolved incidents

Implement DS/SS listers (unconditional, like RS lister):
- multiDaemonSetLister and multiStatefulSetLister in listers.go
- factorySet methods daemonSetLister/informers + statefulSetLister/informers
- DS/SS fields on Controller struct, wired in New() and Run()
- SetDaemonSetLister/SetStatefulSetLister on Handler interface
- DSLister/SSLister fields on filter.Context
- PodOwnersFilter uses lister first, falls back to live API
- process_pod.go passes both listers into Context
Add multiEventLister/multiEventNamespaceLister, dedicated factory with
involvedObject.kind=Pod field selector, wire into Controller.New() and
Run(). Replace live GetPodEvents calls in executePodFilters and
executeContainersFilters with lister-first (fallback to live API when
lister is nil). Add SetEventLister to Handler interface + implementation.

Tests: TestBrokenPodEventsFromCache (zero event LIST calls with lister),
tighten TestBrokenPodMakesAPICalls to exact count (2 without lister, 1 with).
Add Workers field to config.Config (default 1). Replace hardcoded
ctrl.Run(ctx, 1) in main.go with cfg.Workers with min-1 guard.
Add TestRunMultipleWorkers in controller_test.go.
Switch baseline from pod-key (ns/name) to incident-key
(ns:owner:reason:container) with TTL-based expiry.

Engine changes:
- seen becomes map[string]int64 (incident key → baselinedAt unix ts)
- isBaselined checks incident key with BaselineTTL (default 24h)
- BuildKey helper for consistent key construction
- OnBaselineChange hook fires on ClearSeen/MarkResolved/RemovePod
- NewEngine accepts initial Baseline from persistence

Persistence (state.ConfigMap):
- GetBaseline/SaveBaseline round-trip via JSON in kwatch-state ConfigMap
- main.go loads baseline before controller start, saves on changes

Controller buildSeenSet:
- Resolves owners via RS/DS/SS listers, computes incident keys
- Replaces old pod-key collection with incident-key baseline

Tests updated for new incident-key baseline semantics.
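
As a rough sketch of the incident-key baseline with TTL expiry: the key layout (ns:owner:reason:container) and the `map[string]int64` of baselinedAt timestamps come from the notes above, while the `Baseline` type, the explicit `now` parameter, and deleting expired entries on lookup are assumptions made for the example.

```go
package main

import "fmt"

// BuildKey joins the four components into ns:owner:reason:container.
func BuildKey(ns, owner, reason, container string) string {
	return ns + ":" + owner + ":" + reason + ":" + container
}

// Baseline is a hypothetical holder for the seen set.
type Baseline struct {
	seen map[string]int64 // incident key -> baselinedAt (unix seconds)
	ttl  int64            // seconds; 24h would be 86400
}

// NewBaseline copies an initial set, e.g. one loaded from persistence.
func NewBaseline(initial map[string]int64, ttlSeconds int64) *Baseline {
	seen := make(map[string]int64, len(initial))
	for k, v := range initial {
		seen[k] = v
	}
	return &Baseline{seen: seen, ttl: ttlSeconds}
}

// IsBaselined reports whether key is still suppressed at time now.
// Assumption: an expired entry is dropped when it is looked up.
func (b *Baseline) IsBaselined(key string, now int64) bool {
	at, ok := b.seen[key]
	if !ok {
		return false
	}
	if now-at >= b.ttl {
		delete(b.seen, key)
		return false
	}
	return true
}

func main() {
	key := BuildKey("prod", "web", "CrashLoopBackOff", "app")
	b := NewBaseline(map[string]int64{key: 1000}, 86400)
	fmt.Println(key)
	fmt.Println(b.IsBaselined(key, 1000+3600))  // inside the TTL
	fmt.Println(b.IsBaselined(key, 1000+86400)) // TTL elapsed
}
```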
…D restore

Section 5 Items 4-9:

- PVC warn/critical thresholds (Severity field on Event, configurable tiers)
- Configurable log-block bound (MaxLogBlockLines, replaces hardcoded 100)
- DaemonSet monitor (unavailable-pod detection, gated)
- CronJob monitor (suspended/missed-schedule detection, gated)
- Operability: /readyz endpoint, /debug/pprof/* profiling handlers
- CRD scaffolding restored with all new config fields
- README updated for all new configuration options
… prefix matching

engine.ClearSeenByPrefix removes entries by prefix (ns:owner:) and
fires OnBaselineChange on actual change. handler.resolveOwnerName
resolves pod owner the same way controller.buildSeenSet does, so the
healthy-path clear matches incident keys exactly (not stale pod-key
deletes that never hit).
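
The prefix clear can be pictured as below. This is only a sketch: the `Engine` struct here is reduced to the two fields the example needs, locking is left out, and the return value is added purely so the behaviour is easy to check.

```go
package main

import (
	"fmt"
	"strings"
)

// Engine is a hypothetical, stripped-down stand-in for the correlation engine.
type Engine struct {
	seen             map[string]int64
	OnBaselineChange func()
}

// ClearSeenByPrefix removes every key starting with prefix (e.g. "ns:owner:")
// and fires the hook only when something was actually removed.
func (e *Engine) ClearSeenByPrefix(prefix string) int {
	removed := 0
	for k := range e.seen {
		if strings.HasPrefix(k, prefix) {
			delete(e.seen, k) // deleting during range is safe in Go
			removed++
		}
	}
	if removed > 0 && e.OnBaselineChange != nil {
		e.OnBaselineChange()
	}
	return removed
}

func main() {
	e := &Engine{
		seen: map[string]int64{
			"prod:web:CrashLoopBackOff:app": 1,
			"prod:web:OOMKilled:app":        2,
			"prod:api:OOMKilled:app":        3,
		},
		OnBaselineChange: func() { fmt.Println("baseline changed") },
	}
	fmt.Println(e.ClearSeenByPrefix("prod:web:"))
	fmt.Println(e.ClearSeenByPrefix("prod:web:")) // nothing left, hook stays quiet
}
```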
Replaces SetMaxLogBlockLines with SetMaxLogLines wired from
cfg.MaxRecentLogLines (0 → default 100). Avoids a second lever for
the same concept.
Pkg internal/crdwatch watches KwatchConfig CRs via client-go dynamic
informer. Hot-applies: maxRecentLogLines, silences, severityByOwnerKind.
Restart-only fields logged with 'requires restart' message. CR deletion
restores boot-time ConfigMap snapshot. Gated behind crd.enabled (default
off); graceful absence when CRD not installed. RBAC in both manifests.
Add correlation.EscalationTiers to engine config; BuildTiers/escalationSeverity
helpers compute severity from restart count. When enabled, engine.Process
sets ev.Severity before enrichment based on thresholds [3,10,50] →
3+ high, 10+ critical. Wired from main.go config.
Watch HorizontalPodAutoscalers via autoscaling/v2 informer. Alert when
currentReplicas >= maxReplicas with reason HPAMaxedOut. Gated behind
hpaMonitor.enabled (default false). RBAC autoscaling/horizontalpodautoscalers
added to both manifests. Multi-namespace lister support included.
Periodic scanner of kubernetes.io/tls Secrets. Parses tls.crt PEM and
checks NotAfter against configurable threshold (default 30d). Alerts with
TlsCertExpired or TlsCertExpiringSoon reasons. Gated behind tlsMonitor.enabled
(default false). Runs inside runLeaderTasks. RBAC for secrets added.
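
A sketch of the expiry check on a tls.crt payload, using only the standard library. The reason strings match the ones named above; `classify`, `checkDays` and `selfSigned` are hypothetical helpers, and only the first certificate in the PEM data is inspected, which may not match what the monitor does with chains.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// classify returns "TlsCertExpired", "TlsCertExpiringSoon" or "" for the
// first certificate found in pemData.
func classify(pemData []byte, now time.Time, threshold time.Duration) (string, error) {
	block, _ := pem.Decode(pemData)
	if block == nil || block.Type != "CERTIFICATE" {
		return "", errors.New("no CERTIFICATE block in tls.crt")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return "", err
	}
	switch left := cert.NotAfter.Sub(now); {
	case left <= 0:
		return "TlsCertExpired", nil
	case left <= threshold:
		return "TlsCertExpiringSoon", nil
	}
	return "", nil
}

// checkDays is a convenience wrapper: threshold in days, "invalid" on error.
func checkDays(pemData []byte, thresholdDays int) string {
	reason, err := classify(pemData, time.Now(), time.Duration(thresholdDays)*24*time.Hour)
	if err != nil {
		return "invalid"
	}
	return reason
}

// selfSigned builds a throwaway certificate whose NotAfter is `days` from now
// (negative values give an already expired certificate).
func selfSigned(days int) []byte {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	now := time.Now()
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "example.test"},
		NotBefore:    now.Add(-365 * 24 * time.Hour),
		NotAfter:     now.Add(time.Duration(days) * 24 * time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
}

func main() {
	fmt.Printf("%q\n", checkDays(selfSigned(10), 30))
	fmt.Printf("%q\n", checkDays(selfSigned(90), 30))
	fmt.Printf("%q\n", checkDays(selfSigned(-1), 30))
}
```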
--version flag prints version.Short() and exits.
Add docs for: pprof, escalation, CRD config, HPA monitor, TLS monitor,
--version flag.
Extract shared owner resolution into correlation.ResolveOwnerName — used
by controller buildSeenSet and handler healthy path. Replace ClearSeen on
Handler interface with ClearSeenByOwner(namespace, owner). On lister error
return '' and skip baseline (never guess).
Replace BuildTiers/escalationSeverity with crossedTier/escalateSeverity.
Escalation fires on UPDATE (not CREATE), comparing prev vs new RestartCount.
Severity escalated AFTER enricher.Enrich (which overwrites it). Track
RestartCount on both CREATE and UPDATE paths to enable cross-tier detection.
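
One way to read "comparing prev vs new RestartCount" is the tier-crossing check sketched here. The function name is taken from the note above, but the signature, the ascending-order requirement, and the return shape are guesses; mapping a tier to a severity is left out on purpose.

```go
package main

import "fmt"

// crossedTier returns the highest threshold t with prev < t <= next.
// tiers is assumed to be sorted ascending, e.g. [3, 10, 50].
func crossedTier(tiers []int, prev, next int) (int, bool) {
	crossed, ok := 0, false
	for _, t := range tiers {
		if prev < t && next >= t {
			crossed, ok = t, true
		}
	}
	return crossed, ok
}

func main() {
	tiers := []int{3, 10, 50}
	fmt.Println(crossedTier(tiers, 2, 3))  // crosses the first tier
	fmt.Println(crossedTier(tiers, 3, 4))  // stays inside a tier, no escalation
	fmt.Println(crossedTier(tiers, 2, 12)) // jumps over two tiers at once
}
```

Because the check is edge-based, a pod that sits at 4 restarts does not re-escalate on every update; only the update that moves the count across a threshold does.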
Rework HPA detection: alert only when DesiredReplicas >= MaxReplicas AND
CurrentReplicas < DesiredReplicas (still scaling). First-maxed timestamp
tracked per HPA; sustainedMinutes window (default 0 = immediate) before
alerting. Add SustainedMinutes to HpaMonitor config.
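
The reworked HPA condition, sketched with plain structs instead of autoscaling/v2 types. `HPAStatus`, `tracker` and `shouldAlert` are hypothetical names, time is passed in as unix seconds to keep the example deterministic, and clearing the first-maxed timestamp once the HPA recovers is an assumption.

```go
package main

import "fmt"

// HPAStatus is a hypothetical subset of the fields the monitor compares.
type HPAStatus struct {
	Current, Desired, Max int32
}

type tracker struct {
	firstMaxed map[string]int64 // HPA key -> unix seconds when first seen maxed
	sustained  int64            // seconds; 0 means alert immediately
}

func newTracker(sustainedMinutes int64) *tracker {
	return &tracker{firstMaxed: make(map[string]int64), sustained: sustainedMinutes * 60}
}

// shouldAlert is true when the HPA wants at least max replicas, has not reached
// the desired count yet, and has been in that state for the sustained window.
func (t *tracker) shouldAlert(key string, s HPAStatus, now int64) bool {
	maxed := s.Desired >= s.Max && s.Current < s.Desired
	if !maxed {
		delete(t.firstMaxed, key) // assumption: recovery resets the window
		return false
	}
	first, ok := t.firstMaxed[key]
	if !ok {
		first = now
		t.firstMaxed[key] = now
	}
	return now-first >= t.sustained
}

func main() {
	t := newTracker(5)
	stuck := HPAStatus{Current: 8, Desired: 10, Max: 10}
	fmt.Println(t.shouldAlert("prod/web", stuck, 0))
	fmt.Println(t.shouldAlert("prod/web", stuck, 300))
	fmt.Println(t.shouldAlert("prod/web", HPAStatus{Current: 10, Desired: 10, Max: 10}, 360))
}
```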
Replace LIST-API-based tlsmonitor with informer-backed TLS secret lister.
Add SecretLister to handler, SweepTLSSecrets iterates cache. TLS secret
informer uses field selector type=kubernetes.io/tls (separate factory).
Runs 24h ticker + immediate sweep inside runLeaderTasks. RBAC unchanged.
Track active node incidents per NodeName in engine. Suppress all pod
incidents on that node when an active node incident exists. Increment
SuppressedPods counter on the node incident. Config inhibition:
nodeSuppressesPods (default false).
Track create rate in sliding window. When threshold exceeded, buffer new
incidents into digestBuf and return ActionDigest (silent). checkLifecycle
flushes digest as ActionDigestFlush with Hint summary listing top 5
reason×ns combos. Config storm: {enabled, threshold, windowMinutes,
digestIntervalMinutes}.
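
A simplified picture of the storm path: count creates inside a sliding window and, once the count exceeds the threshold, buffer new incidents instead of notifying. `Storm`, `Observe` and `Flush` are hypothetical; the Hint summary of top reason×ns combos and the lifecycle timer are omitted.

```go
package main

import "fmt"

// Storm is a hypothetical sliding-window create-rate tracker.
type Storm struct {
	threshold int
	window    int64   // seconds
	creates   []int64 // timestamps of creates still inside the window
	digestBuf []string
}

// Observe records a create at time now and reports whether the incident was
// buffered into the digest (true) or should be notified normally (false).
func (s *Storm) Observe(incident string, now int64) bool {
	keep := s.creates[:0]
	for _, ts := range s.creates {
		if now-ts < s.window {
			keep = append(keep, ts)
		}
	}
	s.creates = append(keep, now)
	if len(s.creates) > s.threshold {
		s.digestBuf = append(s.digestBuf, incident)
		return true
	}
	return false
}

// Flush returns the buffered incidents and empties the buffer.
func (s *Storm) Flush() []string {
	out := s.digestBuf
	s.digestBuf = nil
	return out
}

func main() {
	s := &Storm{threshold: 2, window: 300}
	for i, name := range []string{"a", "b", "c"} {
		fmt.Println(name, s.Observe(name, int64(i)))
	}
	fmt.Println(s.Flush())
}
```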
abahmed added 19 commits June 14, 2026 15:55
…tured log

FIX-1: os.Expand -> regex ${VAR} only (bare $ preserved); bcrypt/password-safe
FIX-2: namespaces get,list,watch RBAC in chart ClusterRole + deploy.yaml
NIT: klog.Warning -> klog.InfoS for config-file-not-found

Tests: TestConfigEnvInterpolation covers ${VAR}, bare $, missing var, bcrypt $
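
To make the FIX-1 behaviour concrete, here is a sketch of `${VAR}`-only interpolation. The regex and the `interpolate` helper are illustrative, not the code in the PR, and leaving an unset `${VAR}` untouched is an assumption (the commit only says the missing-variable case is tested).

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// envRef matches ${NAME} only; a bare $ (as in bcrypt hashes) never matches.
var envRef = regexp.MustCompile(`\$\{([A-Za-z_][A-Za-z0-9_]*)\}`)

// interpolate replaces ${NAME} using lookup. Assumption: a name that lookup
// does not know is left in place rather than replaced with an empty string.
func interpolate(s string, lookup func(string) (string, bool)) string {
	return envRef.ReplaceAllStringFunc(s, func(m string) string {
		name := m[2 : len(m)-1] // strip "${" and "}"
		if v, ok := lookup(name); ok {
			return v
		}
		return m
	})
}

func main() {
	os.Setenv("KWATCH_DEMO_TOKEN", "s3cret")
	fmt.Println(interpolate("token: ${KWATCH_DEMO_TOKEN}", os.LookupEnv))
	fmt.Println(interpolate("hash: $2a$10$abcdefghijklmnopqrstuv", os.LookupEnv))
}
```

Unlike os.Expand, which also treats `$name` as a reference, this leaves every bare `$` alone, so password hashes in the config survive loading.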
…proxy, BUG-IV threadMap leak

BUG-I: refreshNodeInhibition in MarkResolved + checkLifecycle pending-resolve finalize
BUG-II: delete dead ClearSeen (fires hook under lock)
BUG-III: url.Parse(appConfig.ProxyURL) instead of http.ProxyURL(nil) in client.go
BUG-IV: delete(s.threadMap, key) in Slack ActionResolved
…er,reason)

Rule 1: notifSig/edgeAction replaces cooldown-gated ActionUpdate; StateStale/ActionStale removed; renotify uses incident fields directly
Rule 2: IncidentKey drops container component (namespace:owner:reason); Containers map tracks affected containers; formatters show multi-container
DefaultConfig: all monitors ON, MaxRecentLogLines:50, ResyncSeconds:600, HealthCheck.Enabled:true, Storm ON (10/5m), Inhibition ON, ResolveHoldDown:30, Escalation ON
config.go: HealthCheck.Diagnostics field (default false, gates /incidents+/test-alert)
validate.go: Validate() for storm/escalation/pendingPod/pvc rules, called from LoadConfig
Chart: replicaCount, probes, LE auto-inject on >1 replica, RBAC gated (secrets/TLS, leases/LE, kwatchconfigs/CRD)
deploy.yaml: replicas:1, probes, commented-out optional RBAC
§5f: RenotifyInterval collapsed into RenotifyIntervalBySeverity["default"]; deprecated Interval merged in main.go
IMP-1: async notification delivery — per-provider buffered channel + worker goroutine; non-blocking send
IMP-2: internal/metrics package with atomic counters, Prometheus /metrics handler registered on health server
§5: config_test.go uses t.Setenv + t.TempDir instead of manual cleanup
IMP-7: truncateMsg() at 4096 chars in formatCreate/Update/ResolvedMessage
IMP-8: chart + deploy.yaml default memory 128Mi -> 256Mi
IMP-5: PeakResources tracks max resources across updates; Runbooks config maps reason->URL appended to hint
§5b: ErrImagePull -> ImagePullBackOff normalization in normalizeReason (same root cause, one incident)
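
IMP-1 above (per-provider buffered channel, worker goroutine, non-blocking send) could be sketched as follows. `worker`, `newWorker`, `Enqueue` and `Close` are hypothetical names, and dropping the newest message when the buffer is full is an assumption for the example.

```go
package main

import "fmt"

// worker is a hypothetical per-provider delivery queue.
type worker struct {
	queue chan string
	done  chan struct{}
}

// newWorker starts one goroutine that delivers queued messages in order.
func newWorker(buf int, deliver func(string)) *worker {
	w := &worker{queue: make(chan string, buf), done: make(chan struct{})}
	go func() {
		defer close(w.done)
		for msg := range w.queue {
			deliver(msg)
		}
	}()
	return w
}

// Enqueue never blocks the caller; it reports false when the buffer is full
// and the message was dropped.
func (w *worker) Enqueue(msg string) bool {
	select {
	case w.queue <- msg:
		return true
	default:
		return false
	}
}

// Close stops accepting messages and waits for the worker to drain the queue.
func (w *worker) Close() {
	close(w.queue)
	<-w.done
}

func main() {
	var delivered []string
	w := newWorker(8, func(m string) { delivered = append(delivered, m) })
	for _, m := range []string{"created", "updated", "resolved"} {
		if !w.Enqueue(m) {
			fmt.Println("dropped:", m)
		}
	}
	w.Close() // after Close returns, delivered is safe to read
	fmt.Println(delivered)
}
```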
…v0.11.0

IMP-3: harness_test.go with recordingAlertManager + 5 integration tests (crashloop, node, inhibition, baseline, grouping)
IMP-4: event.Signal type + h.signalEvent helper; process_node.go converted as example
§5f: deprecation warnings for ignoreContainerNames/LogPatterns/ContainerMessages/NodeReasons/NodeMessages
§6c: version bump v0.11.0, CHANGELOG.md with all changes since v0.10.x
- IMP-4: all handler files now use event.Signal via h.signalEvent() or PvcMonitor.reportSignal()

- README: upgrading section documents low-noise defaults + edge-triggered; config tables updated with new defaults

- deploy/chart/test_helm.sh: verifies replicaCount=1/2 template invariants
abahmed added 9 commits June 15, 2026 13:01
…lidation

- Add ActiveCount() to Engine (O(1) count, replaces O(n) Snapshot() in metrics hook)
- Add containerDisplayName() helper — fixes blank Container: line when single
  container + empty ContainerName in all three formatters
- Export ProviderNames() from alert package
- Validate unknown alert provider keys in Validate() / ValidateConfig()
  (startup failure + lint error)
…d benchmark, docs

HYST-1 (P2): PVC hysteresis with ClearThreshold (default 75)
FILTER-1 (P2): disruption filter skips phase=Failed, reason=Evicted
LOG-1 (P2): skip reasons lowered to klog.V(2)
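
HYST-1 in sketch form: alert at the upper threshold, but resolve only after usage falls below the lower ClearThreshold, so a volume hovering around the alert line does not flap. `pvcState` and `observe` are hypothetical, and the alert threshold of 80 is made up for the example (only the ClearThreshold default of 75 is stated above).

```go
package main

import "fmt"

// pvcState is a hypothetical per-volume alert state.
type pvcState struct{ alerting bool }

// observe returns "alert", "resolve" or "" for one usage sample (percent).
func (s *pvcState) observe(usage, threshold, clearThreshold float64) string {
	switch {
	case !s.alerting && usage >= threshold:
		s.alerting = true
		return "alert"
	case s.alerting && usage < clearThreshold:
		s.alerting = false
		return "resolve"
	}
	return ""
}

func main() {
	s := &pvcState{}
	for _, usage := range []float64{85, 78, 82, 74, 78} {
		fmt.Printf("%.0f%% -> %q\n", usage, s.observe(usage, 80, 75))
	}
}
```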

SCAN-1: ActiveCount excludes StateResolved
SCAN-2: maxBackoff in extractRetry() on all delivery paths
SCAN-3: Init() calls shutdown() before reinitializing

NEW-2b: canonical provider-name set in config.KnownProviders
NEW-2 Change B: VerifiableProvider interface + VerifyAll() + Telegram/Discord/Slack
  Verify() kwatch lint --check wiring
NEW-4: storm collapse load test + bounded state test + BenchmarkProcessStorm

§4: silences consolidation — SilenceRule gets ContainerNames, LogPatterns,
  ContainerMessages, NodeReasons, NodeMessages; SuppressionIndex for unified
  detect-time filtering; deprecation shim maps ignore* fields into synthetic
  SilenceRules; CRD watcher maps new fields

§5: Upgrading, Guarantees, Recipes docs; CLI table with --strict/--check

FIX-6: lenient LoadConfig + LintStrict() + lint --strict flag
FIX-2 drain: Start(ctx) drains workers on cancellation
SKIP_UPGRADE_CHECK env var wired in NewUpgrader
SeverityByReason config field checked before owner-kind
CR-1  LifecycleHook data race (clone under lock)
CR-2  Node inhibition cleared during hold-down
PVC-3 Swallowed node error auto-resolves PVCs
BUG-1 CronJob false positives (split nil/staleness)
BUG-2 PodStatusFilter Added casing (EqualFold)
BUG-3 OOMKilled hint for containers with no limit
BUG-4 Per-signal IncludeEvents/IncludeLogs (dead code)
BUG-5 tls_sweep clock (expiry.Sub now)
BUG-6 Init container error hint
BUG-7 CrashLoop dropped on transient clean exit
F2    Drop-oldest channel send can block
SCAN-1 ActiveCount overcount
SCAN-3 AlertManager Init race (idempotent shutdown)
LOG-1 Container detection log level (V(2))
SQ-1  Remove startupQuiet
PVC-1 Kubelet ctx timeout threading
PVC-2 N+1 PVC API fan-out (single List)
PVC-4 Nil PodRef panic guard
PVC-5 Div by zero guard
HTTP-1 Discord/Telegram shared HTTP client
HTTP-2 Unbounded log fetch (cap at 500/1MB)
HB-1  Heartbeat ping tied to request ctx
MAIN-1/MAIN-2 Graceful shutdown (stop channel)
HEALTH-1 /readyz returns 503 until leader tasks ready
Severity default table + escalation fix
NEW-4 integration test for controller fault path
… startup baseline summary

HPA-2: Detect misconfigured HPAs via AbleToScale/ScalingActive conditions.
Uses reason-specific MarkResolved (not blanket ResolveByResource) so
HPAScalingError and HPAMaxedOut coexist independently on the same HPA.

BASE-1a: PVC firstScan seeds over-threshold volumes into notifiedPvc
without alerting on first checkUsage, preventing re-alerts on restart.

BASE-1b: Node alerting conditions (NotReady, MemoryPressure, etc.) are
seeded into the Seen baseline at startup for restart parity.

CHRONIC-1: Emits one PreExistingAtStartup notification at startup
summarizing pre-existing breakage suppressed by the baseline, gated by
ReportStartupBaseline config (default true).
- self-hosted LLM sidecar (Qwen2.5-Coder-1.5B via Ollama) appends
  root-cause analysis to alerts; opt-in via llm.enabled (default off)
- internal/llm/ package: client, prompt, redact, selectRelevant
- circuit breaker (3-fail/60s cooldown), single-flight enrich channel
- CD-1..5: grounding, correlation info, signature hints, runbooks,
  investigate commands + dashboard deep-link
- deploy/llm/: baked model image, Dockerfile, Makefile target, CI
- Helm: config.llm.enabled, llm.nativeSidecar, replica guard
- Raw manifests: GOMEMLIMIT, commented LLM sidecar + config
- Metrics: kwatch_llm_enrich_{total,failed,skipped}_total
- Documentation: README, chart README, CHANGELOG, helm test suite
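
The "3-fail/60s cooldown" breaker in front of the LLM sidecar might behave like this sketch. `Breaker`, `Allow` and `Record` are hypothetical names, time is passed in as unix seconds, and resetting the failure count on any success is an assumption.

```go
package main

import "fmt"

// Breaker is a hypothetical consecutive-failure circuit breaker.
type Breaker struct {
	maxFails int
	cooldown int64 // seconds
	fails    int
	open     bool
	openedAt int64
}

// Allow reports whether a call may go through at time now. An open breaker
// closes again once the cooldown has elapsed.
func (b *Breaker) Allow(now int64) bool {
	if b.open && now-b.openedAt >= b.cooldown {
		b.open = false
		b.fails = 0
	}
	return !b.open
}

// Record notes the outcome of a call; maxFails consecutive failures open the breaker.
func (b *Breaker) Record(ok bool, now int64) {
	if ok {
		b.fails = 0
		return
	}
	b.fails++
	if b.fails >= b.maxFails {
		b.open = true
		b.openedAt = now
	}
}

func main() {
	b := &Breaker{maxFails: 3, cooldown: 60}
	for t := int64(0); t < 3; t++ {
		b.Record(false, t)
	}
	fmt.Println(b.Allow(3))  // open: enrichment is skipped
	fmt.Println(b.Allow(62)) // cooldown elapsed: calls allowed again
}
```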
…ough ContainsKillingStoppingContainerEvents and handler filters
return "", fmt.Errorf("failed to marshal feishu title: %w", err)
}
body := `{"msg_type":"interactive","card":{"config":{"wide_screen_mode":true},` +
`"header":{"title":{"tag":"plain_text","content":` + string(titleJSON) +
3 participants