Tags: hookdeck/outpost
feat(alert): default/disable semantics for consecutive-failure & exhausted-retries alerts (#964)

* feat(alert): add Settings + enable gates for consecutive/exhausted alerts

Introduce alert.Settings (the resolved, operational alert config) plus two monitor gates: WithConsecutiveFailureEnabled and WithExhaustedRetriesEnabled. Both default to true, so behavior is unchanged until a caller opts out. When consecutive-failure alerting is gated off the monitor neither tracks failures nor auto-disables; when exhausted-retries is gated off it never emits, even with retries enabled.

Extracts the consecutive-failure path into a helper to keep the replay/ordering semantics identical on the enabled path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(config): resolve alert config to alert.Settings with unset/empty/value rule

AlertConfig.ConsecutiveFailureCount and ExhaustedRetriesWindowSeconds become *string so the parse layer can tell three states apart: unset uses the default (100 / 3600), an empty string disables that alert dimension, and any other value must parse to a non-negative integer.

AlertConfig.ToConfig resolves the raw values into the operational alert.Settings (domain-owned, so nothing downstream imports config). Validate rejects malformed values at startup. builder wires the resolved gates into the monitor and only builds the exhausted-retries suppression window when enabled with a positive window (0 = alert on every exhaustion).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(config): document alert default/disable behavior in config reference

Update the Alerts section of the self-hosting config reference for the new unset/empty/value rule: ALERT_CONSECUTIVE_FAILURE_COUNT defaults to 100 (empty disables), and document the previously-undocumented ALERT_EXHAUSTED_RETRIES_WINDOW_SECONDS (default 3600, empty disables, 0 = no suppression).

Also correct stale entries: drop the removed ALERT_CALLBACK_URL, fix the ALERT_AUTO_DISABLE_DESTINATION default (false, not true), and fix the YAML example key (alert, not alerts).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(openapi): describe alert behavior in ManagedConfig

Document the unset/empty/value behavior for ALERT_CONSECUTIVE_FAILURE_COUNT, ALERT_EXHAUSTED_RETRIES_WINDOW_SECONDS and ALERT_AUTO_DISABLE_DESTINATION in the ManagedConfig schema. Descriptions only: the properties are already typed as string. SDKs are regenerated from this schema at release time.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(config): make empty-string alert disable work on the env-var surface

The *string representation had two problems found by manual QA:

1. caarlos0/env ignores a present-but-empty env var, so `ALERT_..._COUNT=` resolved to the default instead of disabling; the empty=off rule only worked via YAML, not env vars (the primary surface for the cloud product).
2. caarlos0/env crashes ("expected a pointer to a Struct") on any non-nil *string it walks, so setting these in a YAML config file would crash startup (env.Parse runs after the YAML load).

Replace *string with an OptionalString value type that implements both TextUnmarshaler (bound by caarlos0/env as a scalar, so no crash) and yaml.Unmarshaler (so `key: ""` expresses the empty/off state). The one case caarlos0/env cannot surface, a present-but-empty env var, is handled explicitly via OSInterface.LookupEnv, which also gives env precedence over YAML.

Net: unset -> default, empty -> disabled, value -> value, identically on both env and YAML, with env > yaml. Adds a full parse-path test covering the matrix and precedence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(e2e): use OptionalString for ConsecutiveFailureCount in regression test

Missed call site when migrating AlertConfig fields to OptionalString; the raw int assignment broke the cmd/e2e build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

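
The unset/empty/value rule from #964 can be sketched in Go roughly as follows. This is an illustration only, not Outpost's code: `resolveAlertInt` and its signature are hypothetical, and the sketch calls `os.LookupEnv` directly where the commit goes through `OSInterface.LookupEnv` and an `OptionalString` type whose definitions are not shown here.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// resolveAlertInt is a hypothetical helper illustrating the three-state rule:
//
//	unset -> default value, alert enabled
//	empty -> alert disabled
//	value -> must parse as a non-negative integer, alert enabled
func resolveAlertInt(raw string, present bool, def int) (value int, enabled bool, err error) {
	if !present {
		return def, true, nil
	}
	if raw == "" {
		return 0, false, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, false, fmt.Errorf("invalid alert value %q: %w", raw, err)
	}
	if n < 0 {
		return 0, false, fmt.Errorf("alert value must be non-negative, got %d", n)
	}
	return n, true, nil
}

func main() {
	// os.LookupEnv distinguishes a variable that is unset from one set to "".
	raw, present := os.LookupEnv("ALERT_CONSECUTIVE_FAILURE_COUNT")
	count, enabled, err := resolveAlertInt(raw, present, 100)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("enabled=%v count=%d\n", enabled, count)
}
```

The key point is the second return value of `os.LookupEnv`: it is what makes a present-but-empty variable distinguishable from an unset one, which a plain `os.Getenv` call cannot do.
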
fix(destregistry): treat event format errors as failed deliveries, not DLQ (#957)

* fix(destregistry): treat event format errors as failed deliveries, not DLQ

A Format/key-template failure (e.g. an S3 key_template referencing a field absent from the event) returned a nil delivery, which the registry turned into a nil attempt and the deliverymq handler classified as a PreDeliveryError → nack → Pub/Sub DLQ. The failure was never logged, invisible to the customer, and paged us instead of surfacing as an actionable delivery error.

Add destregistry.NewFormatErrorDelivery, returning a non-nil failed Delivery plus an ErrDestinationPublishAttempt, so the registry records a failed attempt, acks the message, and retries via the scheduler. The customer-facing response is a generic message; the raw Go error stays on the error for logs/telemetry and is not persisted on the attempt.

Apply it across all providers with a Format step: s3, sqs, azure_servicebus, gcp_pubsub, webhook, webhook_standard (previously `return nil, err`) and kinesis, kafka (previously nil-delivery ErrDestinationPublishAttempt).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test(e2e): regression for format error delivered as failed attempt, not DLQ

Standalone e2e test reproducing the production incident: an aws_s3 destination whose key_template references a field missing from the event. Asserts the fixed behavior end to end: nothing is written to S3, each delivery is recorded as a failed attempt carrying the format error, and retries run on the normal schedule and exhaust their budget rather than being nacked/dead-lettered.

Verified as a real guard: reverting the destawss3 fix makes this test fail (0 attempts logged, message dead-lettered) instead of recording 3 attempts.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* refactor(destregistry): rename NewFormatErrorDelivery to NewFormatError

The helper returns the (*Delivery, error) pair a publisher returns on a format failure, not just a delivery, so name it accordingly. Behavior unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs(e2e): trim format-error regression test comment

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

chore: 🐝 Update SDK - Generate OUTPOST-TS 1.4.1 (#963)

* `outpost.configuration.getManagedConfig()`: `response` **Changed** (Breaking ⚠️)
* `outpost.configuration.updateManagedConfig()`:
  * `request` **Changed** (Breaking ⚠️)
  * `response` **Changed** (Breaking ⚠️)

Co-authored-by: speakeasybot <bot@speakeasyapi.dev>

chore: 🐝 Update SDK - Generate OUTPOST-PYTHON 1.4.1 (#962)

* `outpost.configuration.get_managed_config()`: `response` **Changed** (Breaking ⚠️)
* `outpost.configuration.update_managed_config()`:
  * `request` **Changed** (Breaking ⚠️)
  * `response` **Changed** (Breaking ⚠️)

Co-authored-by: speakeasybot <bot@speakeasyapi.dev>

chore: 🐝 Update SDK - Generate OUTPOST-GO 1.4.1 (#960)

* `Outpost.Configuration.GetManagedConfig()`: `response` **Changed** (Breaking ⚠️)
* `Outpost.Configuration.UpdateManagedConfig()`:
  * `request.Request` **Changed** (Breaking ⚠️)
  * `response` **Changed** (Breaking ⚠️)

Co-authored-by: speakeasybot <bot@speakeasyapi.dev>

ci(spec-sdk-tests-vs-release): test PR's SDK against latest released Outpost (#927)

* ci(spec-sdk-tests-vs-release): test PR's SDK against latest released Outpost

Closes #926.

Trigger: PRs touching sdks/outpost-typescript/** (where the Speakeasy bot regen PRs land). Resolves the latest non-prerelease Outpost tag dynamically via the GitHub releases API (the repo uses namespaced tags like sdks/outpost-typescript/v1.3.0 for SDK releases, so the bare vX.Y.Z pattern correctly picks out the Outpost release).

Question this answers: "Will the newly-regen'd SDK in this PR work against the version of Outpost that customers are already running?" Distinct from the existing spec-sdk-tests.yml workflow which asks "does this PR's spec match this PR's server"; both are needed, neither subsumes the other.

Job shape: pull hookdeck/outpost:<tag> as a docker image, run it alongside the same service containers as the sibling workflow (Postgres, redis-stack-server for RediSearch, RabbitMQ), build the SDK from the PR with no regen step (the regen IS the PR), run the contract suite.

Not dogfooded on this PR, because the trigger filter only matches SDK paths, which this PR doesn't touch. First real run will be on the next Speakeasy bot regen PR after this lands.

* ci(spec-sdk-tests-vs-release): support sdk_version + outpost_version dispatch overrides

Lets you trigger the workflow from the Actions UI with optional inputs for ad-hoc compat testing:

  * sdk_version pins the SDK to a specific release tag (or uses the dispatch branch's contents if empty).
  * outpost_version pins the server to a specific Outpost release (or resolves the latest non-prerelease release if empty).

Both accept "1.3.0" or "v1.3.0"; leading "v" is normalized. Inputs only affect workflow_dispatch runs; pull_request triggers ignore them, so the gate behaviour for bot regen PRs is unchanged.

Single workflow rather than a sibling file: the job body is ~95% identical between PR gate and compat testing; the only material differences are two variables (which SDK, which Outpost).

* ci(spec-sdk-tests-vs-release): guard inputs.* references with workflow_dispatch event check

Defensive pattern flagged by Copilot review on #927: inputs.* context is officially only populated on workflow_dispatch (and workflow_call). Practically this works on PR events too (inputs.x evaluates to null which compares as empty), but the explicit guard is unambiguous and costs almost nothing.

Two changes:

  * OVERRIDE env in the tag resolver uses the short-circuit ternary (github.event_name == 'workflow_dispatch' && inputs.x || '').
  * SDK override step's if: prepends event_name == 'workflow_dispatch' so the inputs.sdk_version check is only evaluated on dispatch runs.

* ci(spec-sdk-tests-vs-release): don't self-trigger on workflow file edits

PRs that touch only this workflow file would fire it against main's state (currently NEW tests + OLD SDK (regen still pending) + OLD released Outpost) and fail at TS compile with 'type does not exist in type DestinationUpdate'. That's predicted transitional-state noise, not a real bug, but it leaves a permanently-red dogfood result that future reviewers have to recognize as expected.

Drop the workflow file from its own trigger paths. The actual scenario this workflow exists for, Speakeasy bot regen PRs, always touches sdks/outpost-typescript/**, so the gate still catches them. Local iteration on the workflow file itself uses 'gh workflow run --ref'.

Spotted while inspecting failing PR runs on #927.

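
The tag-resolution logic described in #927 lives in a workflow file, but its shape can be sketched in Go. This is an illustration under assumptions: `normalizeVersion` and `pickOutpostTag` are hypothetical names, the sketch normalizes toward a leading "v" (the commit does not say which direction it normalizes), and it excludes prereleases by pattern where the real workflow relies on the GitHub releases API.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// bareSemver matches bare release tags such as v1.3.0, but not namespaced
// SDK tags such as sdks/outpost-typescript/v1.3.0 or suffixed prereleases.
var bareSemver = regexp.MustCompile(`^v\d+\.\d+\.\d+$`)

// normalizeVersion accepts "1.3.0" or "v1.3.0" and returns "v1.3.0".
// Normalizing toward the "v" form is an assumption made for this sketch.
func normalizeVersion(s string) string {
	return "v" + strings.TrimPrefix(strings.TrimSpace(s), "v")
}

// pickOutpostTag returns the normalized override when one is given,
// otherwise the first bare vX.Y.Z tag in a newest-first list.
func pickOutpostTag(override string, tagsNewestFirst []string) (string, bool) {
	if strings.TrimSpace(override) != "" {
		return normalizeVersion(override), true
	}
	for _, t := range tagsNewestFirst {
		if bareSemver.MatchString(t) {
			return t, true
		}
	}
	return "", false
}

func main() {
	// Example tags; only the namespaced form is taken from the commit message.
	tags := []string{"sdks/outpost-typescript/v1.3.0", "v1.4.0-rc.1", "v1.3.0"}

	tag, ok := pickOutpostTag("", tags)
	fmt.Println(tag, ok)

	tag, ok = pickOutpostTag("1.2.0", tags)
	fmt.Println(tag, ok)
}
```

An empty override falls through to the list scan, which mirrors how an unset dispatch input leaves the pull_request behavior unchanged.
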
fix: increase consumer error tolerance for transient infra outages (#900)

Previously the consumer gave up after 5 consecutive receive errors with a 5s backoff cap (~3s total tolerance), permanently killing the worker with no recovery path. A brief broker hiccup (e.g. GCP OAuth/DNS blip, managed broker restart) was enough to take down logmq/deliverymq workers across deployments until containers were manually restarted.

Mirrors the same fix applied to the retrymq scheduler in #881. Increase to 10 errors with 15s backoff cap (~1 min tolerance window).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>