✨ incident enrichment, lifecycle management, and Slack thread support #455

Merged
abahmed merged 3 commits into main from feat/incident-enrichment-lifecycle on Jun 8, 2026

@abahmed (Owner) commented Jun 8, 2026

✨ Incident Enrichment, Lifecycle & Slack Threads

🧠 Model

  • IncidentState: Active → Stale → Resolved
  • IncidentAction: Create, Update, Skip, Stale, Resolved
  • New fields: OwnerKind, ContainerName, RestartCount, Hint, State, LastUpdate
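
For reviewers, a minimal sketch of how the model above might look in Go. The state, action, and field names come from this description; the underlying types (`int`, `string`, `int32`), the `State*`/`Action*` constant names, and the `CanAdvanceTo` helper are assumptions for illustration, not the merged code. The sketch only models the forward direction of the lifecycle; whether a stale incident can become active again is not stated here.

```go
package main

import (
	"fmt"
	"time"
)

// IncidentState models the lifecycle Active → Stale → Resolved.
// The integer representation is an assumption of this sketch.
type IncidentState int

const (
	StateActive IncidentState = iota
	StateStale
	StateResolved
)

// String returns the state name used in the PR description.
func (s IncidentState) String() string {
	switch s {
	case StateActive:
		return "Active"
	case StateStale:
		return "Stale"
	case StateResolved:
		return "Resolved"
	default:
		return "Unknown"
	}
}

// CanAdvanceTo is a hypothetical helper: it allows only forward moves
// (Active→Stale, Active→Resolved, Stale→Resolved).
func (s IncidentState) CanAdvanceTo(next IncidentState) bool {
	return next > s && next <= StateResolved
}

// IncidentAction tells the alert layer what to do with an incident.
type IncidentAction string

const (
	ActionCreate   IncidentAction = "Create"
	ActionUpdate   IncidentAction = "Update"
	ActionSkip     IncidentAction = "Skip"
	ActionStale    IncidentAction = "Stale"
	ActionResolved IncidentAction = "Resolved"
)

// Incident carries only the new fields listed above; field types are assumed.
type Incident struct {
	OwnerKind     string
	ContainerName string
	RestartCount  int32
	Hint          string
	State         IncidentState
	LastUpdate    time.Time
}

func main() {
	inc := Incident{
		OwnerKind:     "Deployment",
		ContainerName: "web",
		RestartCount:  4,
		Hint:          "Memory",
		State:         StateActive,
		LastUpdate:    time.Now(),
	}
	fmt.Println(inc.State, inc.State.CanAdvanceTo(StateStale))
}
```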

📡 Event

  • Added OwnerKind, RestartCount

🧩 Enricher (new package)

  • Enricher interface + DefaultEnricher + hintForReason() (OOMKilled→Memory, ImagePullBackOff→Registry/Auth, etc.)
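
A rough sketch of the enricher shape described above, assuming a single-method interface and a trimmed-down `Incident` with only `Reason` and `Hint`. Only the two mappings named in this description are included; the `Enrich` signature and the empty-string default are assumptions, and the real mapping table is longer.

```go
package main

import "fmt"

// Incident is reduced to the two fields this sketch needs.
type Incident struct {
	Reason string
	Hint   string
}

// Enricher adds context to an incident before it is dispatched.
// The method signature is an assumption of this sketch.
type Enricher interface {
	Enrich(inc *Incident)
}

// DefaultEnricher fills Hint from the incident's reason.
type DefaultEnricher struct{}

// Enrich sets the hint in place; unknown reasons leave it empty.
func (DefaultEnricher) Enrich(inc *Incident) {
	inc.Hint = hintForReason(inc.Reason)
}

// hintForReason maps a failure reason to a short troubleshooting hint.
// Only the two mappings named in the PR description are shown.
func hintForReason(reason string) string {
	switch reason {
	case "OOMKilled":
		return "Memory"
	case "ImagePullBackOff":
		return "Registry/Auth"
	default:
		return ""
	}
}

func main() {
	var e Enricher = DefaultEnricher{}
	inc := &Incident{Reason: "OOMKilled"}
	e.Enrich(inc)
	fmt.Printf("%s -> %s\n", inc.Reason, inc.Hint)
}
```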

🧠 Correlation Engine

  • Key: namespace:owner:normalizeReason(reason):containerName
  • normalizeReason() strips numeric suffixes
  • Lifecycle ticker (Active→Stale), RemovePod, MarkResolved, LifecycleHook (outside mutex)

⚙️ Config

  • StaleThreshold (15m), LifecycleInterval (1m)

🧑‍💻 Handler

  • Populates OwnerKind and RestartCount, calls RemovePod on delete, and calls NotifyIncident based on the incident action

🚨 Alert Layer

  • ThreadProvider interface (15 existing providers unaffected), per-action formatting
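
One way existing providers can stay unaffected is an optional interface checked with a type assertion, sketched below. `Provider`, `SendMessage`, `SendThreaded`, `dispatch`, and `formatForAction` are hypothetical names and signatures; the message wording is invented for the example.

```go
package main

import "fmt"

// Provider stands in for the existing alert provider contract (assumed shape).
type Provider interface {
	Name() string
	SendMessage(msg string) error
}

// ThreadProvider is an optional extension; providers that do not implement
// it keep working through SendMessage.
type ThreadProvider interface {
	Provider
	SendThreaded(incidentKey, msg string) error
}

// formatForAction returns per-action text; "" means nothing should be sent.
func formatForAction(action, name string) string {
	switch action {
	case "Create":
		return "New incident: " + name
	case "Update":
		return "Incident updated: " + name
	case "Stale":
		return "Incident went stale: " + name
	case "Resolved":
		return "Incident resolved: " + name
	default: // "Skip" and anything unknown
		return ""
	}
}

// dispatch prefers threaded delivery when the provider supports it.
func dispatch(p Provider, incidentKey, msg string) (threaded bool, err error) {
	if tp, ok := p.(ThreadProvider); ok {
		return true, tp.SendThreaded(incidentKey, msg)
	}
	return false, p.SendMessage(msg)
}

// plainProvider is a fake provider without thread support.
type plainProvider struct{ sent []string }

func (p *plainProvider) Name() string { return "plain" }
func (p *plainProvider) SendMessage(msg string) error {
	p.sent = append(p.sent, msg)
	return nil
}

// threadedProvider is a fake provider that also implements ThreadProvider.
type threadedProvider struct {
	plainProvider
	threads map[string][]string
}

func (t *threadedProvider) SendThreaded(key, msg string) error {
	t.threads[key] = append(t.threads[key], msg)
	return nil
}

func main() {
	msg := formatForAction("Create", "prod/api")
	providers := []Provider{
		&plainProvider{},
		&threadedProvider{threads: map[string][]string{}},
	}
	for _, p := range providers {
		threaded, err := dispatch(p, "prod:api:OOMKilled:web", msg)
		fmt.Println(threaded, err)
	}
}
```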

💬 Slack

  • Threaded messages (root + reply), enriched blocks, webhook fallback, threadMap+mutex
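
The root-plus-reply flow can be sketched with an injected post function, loosely modelled on the way Slack's `chat.postMessage` returns a `ts` that later messages pass back as `thread_ts`. `threadTracker`, `postFn`, and `send` are hypothetical names; the provider in this PR uses its own `postBlocksFn`, whose signature is not shown here.

```go
package main

import (
	"fmt"
	"sync"
)

// postFn posts a message and returns its timestamp. An empty threadTS
// means "post as a new root message". The signature is assumed.
type postFn func(text, threadTS string) (ts string, err error)

// threadTracker remembers the root message timestamp per incident key.
type threadTracker struct {
	mu      sync.Mutex
	threads map[string]string // incident key -> root message ts
	post    postFn
}

func newThreadTracker(post postFn) *threadTracker {
	return &threadTracker{threads: make(map[string]string), post: post}
}

// send posts a root message the first time a key is seen and a threaded
// reply afterwards. The lock is not held during the network call, so two
// concurrent first sends for one key could both post roots; this sketch
// keeps whichever root was recorded first.
func (t *threadTracker) send(key, text string) (string, error) {
	t.mu.Lock()
	root := t.threads[key]
	t.mu.Unlock()

	ts, err := t.post(text, root)
	if err != nil {
		return "", err
	}
	if root == "" {
		t.mu.Lock()
		if _, exists := t.threads[key]; !exists {
			t.threads[key] = ts
		}
		t.mu.Unlock()
	}
	return ts, nil
}

func main() {
	n := 0
	tracker := newThreadTracker(func(text, threadTS string) (string, error) {
		n++
		fmt.Printf("post %q thread_ts=%q\n", text, threadTS)
		return fmt.Sprintf("ts-%d", n), nil
	})
	tracker.send("prod:api:OOMKilled:web", "incident created")
	tracker.send("prod:api:OOMKilled:web", "restart count increased")
}
```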

🔇 Noise Filter

  • Skip Normal/Scheduled/Pulled/Pulling pre-correlation
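
A sketch of the pre-correlation check, assuming `Normal` is matched against the Kubernetes event type while `Scheduled`, `Pulled`, and `Pulling` are matched against the event reason. The function name `isNoise` and the split between type and reason are assumptions about how the filter is wired.

```go
package main

import "fmt"

// noisyReasons lists event reasons that are dropped before correlation.
var noisyReasons = map[string]struct{}{
	"Scheduled": {},
	"Pulled":    {},
	"Pulling":   {},
}

// isNoise reports whether an event should be skipped: either its type is
// "Normal" or its reason is one of the routine scheduling/pull reasons.
func isNoise(eventType, reason string) bool {
	if eventType == "Normal" {
		return true
	}
	_, noisy := noisyReasons[reason]
	return noisy
}

func main() {
	fmt.Println(isNoise("Normal", "Started"))
	fmt.Println(isNoise("Warning", "Pulling"))
	fmt.Println(isNoise("Warning", "BackOff"))
}
```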

🧪 Tests

  • ✅ 38 files, +1543/−111, all passing, go build clean
  • ✅ Correlation, Slack threads, alert dispatch

🔒 Safety

  • Zero deps, in-memory only, no breaking changes, hook outside lock

abahmed added 3 commits June 9, 2026 00:25
…port

- Add IncidentState (Active/Stale/Resolved) and IncidentAction (Create/Update/Skip/Stale/Resolved)
- Add enriched fields to Incident: OwnerKind, ContainerName, RestartCount, Hint, State, LastUpdate
- Add Enricher interface + DefaultEnricher + hintForReason mapping
- Correlation engine: new key format (ns:owner:reason:container), normalizeReason(),
  lifecycle ticker, RemovePod/MarkResolved, LifecycleHook (called outside lock)
- Config: StaleThreshold (default 15m), LifecycleInterval (default 1m)
- Handler: populate OwnerKind/RestartCount, call RemovePod on delete
- Slack provider: threaded incident messages via postBlocksFn for all 5 actions,
  enriched blocks (OwnerKind, ContainerName, RestartCount, Hint), threadMap + mutex
- Alert layer: ThreadProvider interface, per-action formatting, webhook fallback
- NoiseFilter: skip Normal/Scheduled/Pulled/Pulling events pre-correlation
…er-wide

Removes 'configmaps' from the ClusterRole and adds a namespace-scoped
Role + RoleBinding for ConfigMap access (get/create/update/patch) in
the kwatch namespace. This follows the principle of least privilege
and prevents an attacker from manipulating ConfigMaps in other
namespaces (e.g. CoreDNS) if the kwatch pod is compromised.

Fixes #445
@abahmed abahmed merged commit a14c773 into main Jun 8, 2026
4 checks passed
