🚀 enhance kwatch#456
Open
abahmed wants to merge 81 commits into
B1: enqueuePod handles tombstones via DeletionHandlingMetaNamespaceKeyFunc
B2: Guard inner per-pod map with sync.Mutex in memory.go
B3: RemovePod resolves all incidents (remove break)
B4: CrashLoopBackOff fetches previous logs (RestartCount>0 && Running==nil)
B5: Owner-lookup errors captured and logged instead of swallowed
B6: Startup panic on unknown provider name guarded
B7: Incident messages include Logs and Events fields
MC1: Eliminate syncPod double lister Get
MC2: Restrict normalizeReason to known retry reasons
MC3: Pod-only incident key uses '.' container name for consistency
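Item B2 can be illustrated with a minimal sketch. The `podStore` type and its methods below are hypothetical stand-ins, not the actual contents of memory.go; the point is only that a single `sync.Mutex` has to cover both the outer lookup and the write to the inner per-pod map.

```go
package main

import (
	"fmt"
	"sync"
)

// podStore is a hypothetical stand-in for the store in memory.go:
// an outer map keyed by pod and an inner map keyed by container.
type podStore struct {
	mu   sync.Mutex
	pods map[string]map[string]int // pod key -> container -> restart count
}

func newPodStore() *podStore {
	return &podStore{pods: make(map[string]map[string]int)}
}

// Set holds the lock across the outer lookup and the inner write,
// so concurrent writers never touch the inner map unguarded.
func (s *podStore) Set(pod, container string, restarts int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	inner, ok := s.pods[pod]
	if !ok {
		inner = make(map[string]int)
		s.pods[pod] = inner
	}
	inner[container] = restarts
}

// Get reads under the same lock; indexing a missing inner map is safe.
func (s *podStore) Get(pod, container string) (int, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.pods[pod][container]
	return v, ok
}

func main() {
	s := newPodStore()
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			s.Set("default/web-0", fmt.Sprintf("c%d", n%5), n)
		}(i)
	}
	wg.Wait()
	_, ok := s.Get("default/web-0", "c0")
	fmt.Println("c0 present:", ok)
}
```

Running a sketch like this under `go run -race` is one way to check that the inner map is never written without the lock.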
B1: TestEnqueuePodDeletedFinalStateUnknown
B2: TestMemoryConcurrentAccess
B3: TestRemovePodMultiIncidentResolve
B4: TestContainerLogsFilterCrashLoopBackOff
B5: TestPodOwnersFilterReplicaSet (verify owner stays as RS on error)
B6: TestGetProvidersUnknownSkipped
B7: TestFormatIncidentMessageWithLogsEvents
…tore
- Add model.ContainerState and Incident.LastContainerState
- Engine.Process now accepts and persists container state
- Add Engine.GetLastContainerState for change-detection queries
- Remove Memory from filter.Context; add LastState/PodLastState
- ContainerRestartsFilter, ContainerReasonsFilter, PodStatusFilter read from ContainerContext.LastState / Context.PodLastState
- executeContainersFilters and executePodFilters query engine for last state instead of memory; pass ContainerState to Process
- Remove h.memory.DelPod and Memory context wiring from process_pod
- Delete dead EventFilter (never registered)
- Update all tests
- Add filter.Status, filter.Detector (pure predicate, no I/O), filter.Enricher (I/O phase for broken signals)
- Pure filters (Namespace, PodName, PodStatus, ContainerName, ContainerRestarts, ContainerState, ContainerReasons, Noise) implement Detector with Status return
- I/O filters (PodOwnersFilter, ContainerLogsFilter) implement both Detector (returns StatusAlert) and Enricher
- Event-dependent filters (PodEventsFilter, ContainerKillingFilter) implement Enricher (events fetched only when needed)
- handler splits into podDetectors/podEnrichers/containerDetectors/containerEnrichers: no I/O on healthy path
- executePodFilters/executeContainersFilters: detect → enrich → engine
- Remove upfront GetPodEvents from ProcessPodObject (now fetched only when a broken signal is detected)
- Add Resource field to event.Event for signal type (pod/node/pvc)
- Engine.Process uses ev.Resource dynamically (defaults to pod)
- Add Engine.StartupQuiet config: suppresses alerts N seconds after startup to prevent re-alerting pre-existing breakage
- Add Engine.ResolveByResource: resolves incidents by resource type and name (used by node/PVC recovery)
- process_node.go: emit node signals through engine (cooldown/dedup/resolve lifecycle); drop h.memory.HasNode/AddNode/DelNode
- PVC monitor: emit pvc signals through engine; drop alertManager.Notify in favor of NotifyIncident; resolve when usage drops below threshold
- Remove unused memory field from handler struct and constructor; remove storage/memory imports from handler, main, and handler_test
C1: multi-ns informers now return []cache.SharedIndexInformer; event handlers attached to all; all HasSynced collected in WaitForCacheSync.
C3: RBAC rules for batch/jobs (ClusterRole) + coordination.k8s.io/leases (Role) in chart and deploy manifests.
C2: leader-election scope gates PVC/correlator/heartbeat/startup-msg/controller; os.Exit(0) on lost leadership to prevent zombies.
C5: heartbeat rewritten to external HTTP GET/POST; no corr-engine or alert-manager involvement.
C6: baseline (seen) suppression checked unconditionally, not only during StartupQuiet window; ClearSeen on healthy pod or delete.
C7: Slack SendIncident Logs/Events blocks added to create/update paths and fallback text.
C4: removed dead CRD scaffolding (api/v1alpha1/ + deploy/crd.yaml).
Add tests: engine stale/lifecycle, baseline suppression, ClearSeen, multi-ns informer, Slack Logs/Events golden, heartbeat ping.
Add regression tests for:
- StatefulSet owner resolution in PodOwnersFilter
- STS-owned pod grouping in correlation engine
- Golden messages for stale + resolved incidents
Implement DS/SS listers (unconditional, like RS lister):
- multiDaemonSetLister and multiStatefulSetLister in listers.go
- factorySet methods daemonSetLister/informers + statefulSetLister/informers
- DS/SS fields on Controller struct, wired in New() and Run()
- SetDaemonSetLister/SetStatefulSetLister on Handler interface
- DSLister/SSLister fields on filter.Context
- PodOwnersFilter uses lister first, falls back to live API
- process_pod.go passes both listers into Context
Add multiEventLister/multiEventNamespaceLister, dedicated factory with involvedObject.kind=Pod field selector, wire into Controller.New() and Run(). Replace live GetPodEvents calls in executePodFilters and executeContainersFilters with lister-first (fallback to live API when lister is nil). Add SetEventLister to Handler interface + implementation. Tests: TestBrokenPodEventsFromCache (zero event LIST calls with lister), tighten TestBrokenPodMakesAPICalls to exact count (2 without lister, 1 with).
Add Workers field to config.Config (default 1). Replace hardcoded ctrl.Run(ctx, 1) in main.go with cfg.Workers with min-1 guard. Add TestRunMultipleWorkers in controller_test.go.
Switch baseline from pod-key (ns/name) to incident-key (ns:owner:reason:container) with TTL-based expiry.
Engine changes:
- seen becomes map[string]int64 (incident key → baselinedAt unix ts)
- isBaselined checks incident key with BaselineTTL (default 24h)
- BuildKey helper for consistent key construction
- OnBaselineChange hook fires on ClearSeen/MarkResolved/RemovePod
- NewEngine accepts initial Baseline from persistence
Persistence (state.ConfigMap):
- GetBaseline/SaveBaseline round-trip via JSON in kwatch-state ConfigMap
- main.go loads baseline before controller start, saves on changes
Controller buildSeenSet:
- Resolves owners via RS/DS/SS listers, computes incident keys
- Replaces old pod-key collection with incident-key baseline
Tests updated for new incident-key baseline semantics.
…D restore Section 5 Items 4-9: - PVC warn/critical thresholds (Severity field on Event, configurable tiers) - Configurable log-block bound (MaxLogBlockLines, replaces hardcoded 100) - DaemonSet monitor (unavailable-pod detection, gated) - CronJob monitor (suspended/missed-schedule detection, gated) - Operability: /readyz endpoint, /debug/pprof/* profiling handlers - CRD scaffolding restored with all new config fields - README updated for all new configuration options
… prefix matching engine.ClearSeenByPrefix removes entries by prefix (ns:owner:) and fires OnBaselineChange on actual change. handler.resolveOwnerName resolves pod owner the same way controller.buildSeenSet does, so the healthy-path clear matches incident keys exactly (not stale pod-key deletes that never hit).
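A minimal sketch of prefix-based clearing, assuming a plain map and a parameterless hook. The `engine` struct, the hook signature, and the returned count are hypothetical, and locking is omitted for brevity; the commit only states that entries under the `ns:owner:` prefix are removed and that OnBaselineChange fires on an actual change.

```go
package main

import (
	"fmt"
	"strings"
)

// engine is a hypothetical, lock-free stand-in for the correlation engine.
type engine struct {
	seen             map[string]int64
	onBaselineChange func()
}

// ClearSeenByPrefix deletes every key that starts with prefix and calls
// the hook only when at least one entry was actually removed.
func (e *engine) ClearSeenByPrefix(prefix string) int {
	removed := 0
	for k := range e.seen {
		if strings.HasPrefix(k, prefix) {
			delete(e.seen, k) // deleting during range is safe in Go
			removed++
		}
	}
	if removed > 0 && e.onBaselineChange != nil {
		e.onBaselineChange()
	}
	return removed
}

func main() {
	fired := 0
	e := &engine{
		seen: map[string]int64{
			"prod:api:CrashLoopBackOff:app": 1,
			"prod:api:OOMKilled:sidecar":    2,
			"prod:web:CrashLoopBackOff:app": 3,
		},
		onBaselineChange: func() { fired++ },
	}
	fmt.Println(e.ClearSeenByPrefix("prod:api:"), fired) // 2 1
	fmt.Println(e.ClearSeenByPrefix("prod:api:"), fired) // 0 1
}
```

The trailing colon in the prefix matters: without it, `prod:api` would also match an owner named `api-gateway`.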
Replaces SetMaxLogBlockLines with SetMaxLogLines wired from cfg.MaxRecentLogLines (0 → default 100). Avoids a second lever for the same concept.
Pkg internal/crdwatch watches KwatchConfig CRs via client-go dynamic informer. Hot-applies: maxRecentLogLines, silences, severityByOwnerKind. Restart-only fields logged with 'requires restart' message. CR deletion restores boot-time ConfigMap snapshot. Gated behind crd.enabled (default off); graceful absence when CRD not installed. RBAC in both manifests.
Add correlation.EscalationTiers to engine config; BuildTiers/escalationSeverity helpers compute severity from restart count. When enabled, engine.Process sets ev.Severity before enrichment based on thresholds [3,10,50] → 3+ high, 10+ critical. Wired from main.go config.
Watch HorizontalPodAutoscalers via autoscaling/v2 informer. Alert when currentReplicas >= maxReplicas with reason HPAMaxedOut. Gated behind hpaMonitor.enabled (default false). RBAC autoscaling/horizontalpodautoscalers added to both manifests. Multi-namespace lister support included.
Periodic scanner of kubernetes.io/tls Secrets. Parses tls.crt PEM and checks NotAfter against configurable threshold (default 30d). Alerts with TlsCertExpired or TlsCertExpiringSoon reasons. Gated behind tlsMonitor.enabled (default false). Runs inside runLeaderTasks. RBAC for secrets added.
--version flag prints version.Short() and exits.
Add docs for: pprof, escalation, CRD config, HPA monitor, TLS monitor, --version flag.
Extract shared owner resolution into correlation.ResolveOwnerName — used by controller buildSeenSet and handler healthy path. Replace ClearSeen on Handler interface with ClearSeenByOwner(namespace, owner). On lister error return '' and skip baseline (never guess).
Replace BuildTiers/escalationSeverity with crossedTier/escalateSeverity. Escalation fires on UPDATE (not CREATE), comparing prev vs new RestartCount. Severity escalated AFTER enricher.Enrich (which overwrites it). Track RestartCount on both CREATE and UPDATE paths to enable cross-tier detection.
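The cross-tier comparison might look like the sketch below. The name `crossedTier` appears in the commit, but its signature and return value here are assumptions: this version returns the highest threshold passed between the previous and the new restart count, or 0 when no threshold was crossed.

```go
package main

import "fmt"

// crossedTier returns the highest tier t with prev < t <= next,
// or 0 if the update did not cross any tier. Signature is hypothetical.
func crossedTier(prev, next int32, tiers []int32) int32 {
	var crossed int32
	for _, t := range tiers {
		if prev < t && next >= t && t > crossed {
			crossed = t
		}
	}
	return crossed
}

func main() {
	tiers := []int32{3, 10, 50} // thresholds quoted in the earlier escalation commit
	fmt.Println(crossedTier(2, 3, tiers))  // 3
	fmt.Println(crossedTier(3, 4, tiers))  // 0
	fmt.Println(crossedTier(2, 12, tiers)) // 10
}
```

Comparing against the previous count is what keeps escalation edge-triggered: an incident that stays at 4 restarts does not re-escalate on every update.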
Rework HPA detection: alert only when DesiredReplicas >= MaxReplicas AND CurrentReplicas < DesiredReplicas (still scaling). First-maxed timestamp tracked per HPA; sustainedMinutes window (default 0 = immediate) before alerting. Add SustainedMinutes to HpaMonitor config.
Replace LIST-API-based tlsmonitor with informer-backed TLS secret lister. Add SecretLister to handler, SweepTLSSecrets iterates cache. TLS secret informer uses field selector type=kubernetes.io/tls (separate factory). Runs 24h ticker + immediate sweep inside runLeaderTasks. RBAC unchanged.
Track active node incidents per NodeName in engine. Suppress all pod incidents on that node when an active node incident exists. Increment SuppressedPods counter on the node incident. Config inhibition: nodeSuppressesPods (default false).
Track create rate in sliding window. When threshold exceeded, buffer new
incidents into digestBuf and return ActionDigest (silent). checkLifecycle
flushes digest as ActionDigestFlush with Hint summary listing top 5
reason×ns combos. Config storm: {enabled, threshold, windowMinutes,
digestIntervalMinutes}.
…tured log
FIX-1: os.Expand -> regex ${VAR} only (bare $ preserved); bcrypt/password-safe
FIX-2: namespaces get,list,watch RBAC in chart ClusterRole + deploy.yaml
NIT: klog.Warning -> klog.InfoS for config-file-not-found
Tests: TestConfigEnvInterpolation covers ${VAR}, bare $, missing var, bcrypt $
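FIX-1 can be illustrated with a small sketch that expands only the braced `${VAR}` form. The function name and the injected lookup are hypothetical, and leaving an unset variable's placeholder untouched is an assumption, since the commit does not say how a missing variable is handled.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// envRef matches only the braced form ${NAME}; a bare $ never matches,
// so bcrypt hashes such as $2a$10$... pass through unchanged.
var envRef = regexp.MustCompile(`\$\{([A-Za-z_][A-Za-z0-9_]*)\}`)

// expandEnv replaces ${NAME} using lookup. Unset names are left as-is
// (an assumption of this sketch).
func expandEnv(s string, lookup func(string) (string, bool)) string {
	return envRef.ReplaceAllStringFunc(s, func(m string) string {
		name := m[2 : len(m)-1] // strip the leading "${" and trailing "}"
		if v, ok := lookup(name); ok {
			return v
		}
		return m
	})
}

func main() {
	os.Setenv("KWATCH_DEMO_TOKEN", "s3cret")
	in := "token=${KWATCH_DEMO_TOKEN} hash=$2a$10$abc"
	fmt.Println(expandEnv(in, os.LookupEnv)) // token=s3cret hash=$2a$10$abc
}
```

By contrast, `os.Expand` also treats `$NAME` as a reference, which is why a password or hash containing a bare `$` could be mangled.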
…proxy, BUG-IV threadMap leak
BUG-I: refreshNodeInhibition in MarkResolved + checkLifecycle pending-resolve finalize
BUG-II: delete dead ClearSeen (fires hook under lock)
BUG-III: url.Parse(appConfig.ProxyURL) instead of http.ProxyURL(nil) in client.go
BUG-IV: delete(s.threadMap, key) in Slack ActionResolved
…er,reason) Rule 1: notifSig/edgeAction replaces cooldown-gated ActionUpdate; StateStale/ActionStale removed; renotify uses incident fields directly Rule 2: IncidentKey drops container component (namespace:owner:reason); Containers map tracks affected containers; formatters show multi-container
DefaultConfig: all monitors ON, MaxRecentLogLines:50, ResyncSeconds:600, HealthCheck.Enabled:true, Storm ON (10/5m), Inhibition ON, ResolveHoldDown:30, Escalation ON
config.go: HealthCheck.Diagnostics field (default false, gates /incidents+/test-alert)
validate.go: Validate() for storm/escalation/pendingPod/pvc rules, called from LoadConfig
Chart: replicaCount, probes, LE auto-inject on >1 replica, RBAC gated (secrets/TLS, leases/LE, kwatchconfigs/CRD)
deploy.yaml: replicas:1, probes, commented-out optional RBAC
§5f: RenotifyInterval collapsed into RenotifyIntervalBySeverity["default"]; deprecated Interval merged in main.go
IMP-1: async notification delivery: per-provider buffered channel + worker goroutine; non-blocking send
IMP-2: internal/metrics package with atomic counters, Prometheus /metrics handler registered on health server
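IMP-1 follows a common Go pattern that can be sketched as below. The `providerQueue` type is hypothetical, and dropping the message when the buffer is full is one possible policy chosen for the sketch, not necessarily the one used in the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// providerQueue is a hypothetical per-provider delivery queue.
type providerQueue struct {
	queue chan string
	send  func(msg string)
	wg    sync.WaitGroup
}

func newProviderQueue(size int, send func(string)) *providerQueue {
	return &providerQueue{queue: make(chan string, size), send: send}
}

// start launches the single worker goroutine that drains the queue.
func (p *providerQueue) start() {
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		for msg := range p.queue {
			p.send(msg)
		}
	}()
}

// enqueue never blocks: it reports false when the buffer is full.
func (p *providerQueue) enqueue(msg string) bool {
	select {
	case p.queue <- msg:
		return true
	default:
		return false
	}
}

// stop closes the queue and waits for the worker to finish draining it.
func (p *providerQueue) stop() {
	close(p.queue)
	p.wg.Wait()
}

func main() {
	var delivered []string
	p := newProviderQueue(2, func(m string) { delivered = append(delivered, m) })
	// The worker is not running yet, so the third send finds the buffer full.
	fmt.Println(p.enqueue("a"), p.enqueue("b"), p.enqueue("c")) // true true false
	p.start()
	p.stop()
	fmt.Println(delivered) // [a b]
}
```

The `select` with a `default` branch is what keeps a slow or unreachable provider from stalling the caller.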
§5: config_test.go uses t.Setenv + t.TempDir instead of manual cleanup
IMP-7: truncateMsg() at 4096 chars in formatCreate/Update/ResolvedMessage
IMP-8: chart + deploy.yaml default memory 128Mi -> 256Mi
IMP-5: PeakResources tracks max resources across updates; Runbooks config maps reason->URL appended to hint
§5b: ErrImagePull -> ImagePullBackOff normalization in normalizeReason (same root cause, one incident)
…v0.11.0
IMP-3: harness_test.go with recordingAlertManager + 5 integration tests (crashloop, node, inhibition, baseline, grouping)
IMP-4: event.Signal type + h.signalEvent helper; process_node.go converted as example
§5f: deprecation warnings for ignoreContainerNames/LogPatterns/ContainerMessages/NodeReasons/NodeMessages
§6c: version bump v0.11.0, CHANGELOG.md with all changes since v0.10.x
- IMP-4: all handler files now use event.Signal via h.signalEvent() or PvcMonitor.reportSignal()
- README: upgrading section documents low-noise defaults + edge-triggered; config tables updated with new defaults
- deploy/chart/test_helm.sh: verifies replicaCount=1/2 template invariants
…w/time.Since in hpa, cronjob, tls_sweep
…provider credential validation
shameemshah
approved these changes
Jun 15, 2026
…lidation
- Add ActiveCount() to Engine (O(1) count, replaces O(n) Snapshot() in metrics hook)
- Add containerDisplayName() helper: fixes blank Container: line when single container + empty ContainerName in all three formatters
- Export ProviderNames() from alert package
- Validate unknown alert provider keys in Validate() / ValidateConfig() (startup failure + lint error)
…d benchmark, docs
HYST-1 (P2): PVC hysteresis with ClearThreshold (default 75)
FILTER-1 (P2): disruption filter skips phase=Failed, reason=Evicted
LOG-1 (P2): skip reasons lowered to klog.V(2)
SCAN-1: ActiveCount excludes StateResolved
SCAN-2: maxBackoff in extractRetry() on all delivery paths
SCAN-3: Init() calls shutdown() before reinitializing
NEW-2b: canonical provider-name set in config.KnownProviders
NEW-2 Change B: VerifiableProvider interface + VerifyAll() + Telegram/Discord/Slack Verify() kwatch lint --check wiring
NEW-4: storm collapse load test + bounded state test + BenchmarkProcessStorm
§4: silences consolidation: SilenceRule gets ContainerNames, LogPatterns, ContainerMessages, NodeReasons, NodeMessages; SuppressionIndex for unified detect-time filtering; deprecation shim maps ignore* fields into synthetic SilenceRules; CRD watcher maps new fields
§5: Upgrading, Guarantees, Recipes docs; CLI table with --strict/--check
FIX-6: lenient LoadConfig + LintStrict() + lint --strict flag
FIX-2 drain: Start(ctx) drains workers on cancellation
SKIP_UPGRADE_CHECK env var wired in NewUpgrader
SeverityByReason config field checked before owner-kind
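HYST-1 describes hysteresis, which can be sketched as a two-threshold state machine. The type, the method, and the 80% alert threshold in the example are hypothetical; only the ClearThreshold default of 75 comes from the commit.

```go
package main

import "fmt"

// pvcHysteresis alerts at alertAt percent and resolves only once usage
// falls below clearAt percent, so usage hovering near the alert line
// does not flap between alert and resolve.
type pvcHysteresis struct {
	alertAt, clearAt float64
	alerting         map[string]bool
}

// observe returns "alert", "resolve", or "" for a usage sample.
func (h *pvcHysteresis) observe(pvc string, usedPct float64) string {
	switch {
	case !h.alerting[pvc] && usedPct >= h.alertAt:
		h.alerting[pvc] = true
		return "alert"
	case h.alerting[pvc] && usedPct < h.clearAt:
		delete(h.alerting, pvc)
		return "resolve"
	default:
		return ""
	}
}

func main() {
	// alertAt=80 is an illustrative value; clearAt=75 is the quoted default.
	h := &pvcHysteresis{alertAt: 80, clearAt: 75, alerting: map[string]bool{}}
	for _, pct := range []float64{70, 85, 78, 74} {
		fmt.Printf("%v -> %q\n", pct, h.observe("data-0", pct))
	}
}
```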
CR-1 LifecycleHook data race (clone under lock)
CR-2 Node inhibition cleared during hold-down
PVC-3 Swallowed node error auto-resolves PVCs
BUG-1 CronJob false positives (split nil/staleness)
BUG-2 PodStatusFilter Added casing (EqualFold)
BUG-3 OOMKilled hint for containers with no limit
BUG-4 Per-signal IncludeEvents/IncludeLogs (dead code)
BUG-5 tls_sweep clock (expiry.Sub now)
BUG-6 Init container error hint
BUG-7 CrashLoop dropped on transient clean exit
F2 Drop-oldest channel send can block
SCAN-1 ActiveCount overcount
SCAN-3 AlertManager Init race (idempotent shutdown)
LOG-1 Container detection log level (V(2))
SQ-1 Remove startupQuiet
PVC-1 Kubelet ctx timeout threading
PVC-2 N+1 PVC API fan-out (single List)
PVC-4 Nil PodRef panic guard
PVC-5 Div by zero guard
HTTP-1 Discord/Telegram shared HTTP client
HTTP-2 Unbounded log fetch (cap at 500/1MB)
HB-1 Heartbeat ping tied to request ctx
MAIN-1/MAIN-2 Graceful shutdown (stop channel)
HEALTH-1 /readyz returns 503 until leader tasks ready
Severity default table + escalation fix
NEW-4 integration test for controller fault path
… startup baseline summary
HPA-2: Detect misconfigured HPAs via AbleToScale/ScalingActive conditions. Uses reason-specific MarkResolved (not blanket ResolveByResource) so HPAScalingError and HPAMaxedOut coexist independently on the same HPA.
BASE-1a: PVC firstScan seeds over-threshold volumes into notifiedPvc without alerting on first checkUsage, preventing re-alerts on restart.
BASE-1b: Node alerting conditions (NotReady, MemoryPressure, etc.) are seeded into the Seen baseline at startup for restart parity.
CHRONIC-1: Emits one PreExistingAtStartup notification at startup summarizing pre-existing breakage suppressed by the baseline, gated by ReportStartupBaseline config (default true).
- self-hosted LLM sidecar (Qwen2.5-Coder-1.5B via Ollama) appends
root-cause analysis to alerts; opt-in via llm.enabled (default off)
- internal/llm/ package: client, prompt, redact, selectRelevant
- circuit breaker (3-fail/60s cooldown), single-flight enrich channel
- CD-1..5: grounding, correlation info, signature hints, runbooks,
investigate commands + dashboard deep-link
- deploy/llm/: baked model image, Dockerfile, Makefile target, CI
- Helm: config.llm.enabled, llm.nativeSidecar, replica guard
- Raw manifests: GOMEMLIMIT, commented LLM sidecar + config
- Metrics: kwatch_llm_enrich_{total,failed,skipped}_total
- Documentation: README, chart README, CHANGELOG, helm test suite
…-bit affinity + CI split + trims
… race, marshal errors)
…ough ContainsKillingStoppingContainerEvents and handler filters
	return "", fmt.Errorf("failed to marshal feishu title: %w", err)
}
body := `{"msg_type":"interactive","card":{"config":{"wide_screen_mode":true},` +
	`"header":{"title":{"tag":"plain_text","content":` + string(titleJSON) +
No description provided.