fix(realtime): downgrade benign WS close codes (1000/1001/1006) from ERROR to WARNING#2038
fix(realtime): downgrade benign WS close codes (1000/1001/1006) from ERROR to WARNING#2038moshemorad wants to merge 2 commits into
Conversation
WebSocket close codes 1000 (normal), 1001 (going away), and 1006
(abnormal close) are routine operational events for the
ConversationsWorker realtime subscription — the worker's own
_channel_unhealthy + _full_reconnect loop already recovers from them
transparently. Logging them at ERROR via logging.exception causes
sentry_sdk's logging integration to surface them as actionable Sentry
issues, which is noise.
Adds _benign_ws_close_code() helper that classifies a websockets
ConnectionClosed* exception (walking __cause__/__context__) by close
code, returning a code only for {1000, 1001, 1006}. 1008 (policy
violation), 1011 (server error), and the 4xxx custom range are
intentionally excluded so auth/RLS misconfigs and server bugs still
surface.
Downgrades 6 logging.exception sites in realtime_manager.py to
logging.warning when the exception is a benign close:
- _thread_entry (defense in depth)
- _run main-loop catch (defense in depth)
- _full_reconnect failure
- _maybe_refresh_auth failure
- _connect_and_subscribe set_auth failure
- _shutdown_async failure
And one site (cleanup-after-failed-connect, already in a cleanup
path) is further downgraded to debug.
Non-benign exceptions continue to log at ERROR exactly as before, so
real bugs still page.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Mohse Morad <moshemorad12340@gmail.com>
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
📂 Previous Runs📜 #1 · Run @ __055f63e__ (#25865124409) — May 14, 14:23 UTC✅ Results of HolmesGPT evalsAutomatically triggered by commit 055f63e on branch Results of HolmesGPT evals
Benchmark comparison unavailable: No ci-benchmark experiments found Benchmark Comparison DetailsBaseline: latest ci-benchmark experiment on master Status: No ci-benchmark experiments found Comparison indicators:
✅ Results of HolmesGPT evalsAutomatically triggered by commit 00cf4c2 on branch Results of HolmesGPT evals
Benchmark comparison unavailable: No ci-benchmark experiments found Benchmark Comparison DetailsBaseline: latest ci-benchmark experiment on master Status: No ci-benchmark experiments found Comparison indicators:
📖 Legend
🔄 Re-run evals manually
Option 1: Comment on this PR with Or with more options (one per line): Run evals on a different branch (e.g., master) for comparison:
Quick re-run: Use Option 2: Trigger via GitHub Actions UI → "Run workflow" Option 3: Add PR labels to include extra evals (applies to both automatic runs and
Examples: 🏷️ Valid tags
🤖 Valid models
Commands: CLI: |
|
✅ Docker images ready for
Use these tags to pull the images for testing. 📋 Copy commandsgcloud auth configure-docker us-central1-docker.pkg.dev
docker pull us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:1854b34c
docker tag us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:1854b34c me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:1854b34c
docker push me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:1854b34c
docker pull us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes-operator:1854b34c
docker tag us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes-operator:1854b34c me-west1-docker.pkg.dev/robusta-development/development/holmes-operator-dev:1854b34c
docker push me-west1-docker.pkg.dev/robusta-development/development/holmes-operator-dev:1854b34cPatch Helm values in one line (choose the chart you use): HolmesGPT chart: helm upgrade --install holmesgpt ./helm/holmes \
--set registry=me-west1-docker.pkg.dev/robusta-development/development \
--set image=holmes-dev:1854b34c \
--set operator.registry=me-west1-docker.pkg.dev/robusta-development/development \
--set operator.image=holmes-operator-dev:1854b34cRobusta wrapper chart: helm upgrade --install robusta robusta/robusta \
--reuse-values \
--set holmes.registry=me-west1-docker.pkg.dev/robusta-development/development \
--set holmes.image=holmes-dev:1854b34c \
--set holmes.operator.registry=me-west1-docker.pkg.dev/robusta-development/development \
--set holmes.operator.image=holmes-operator-dev:1854b34c |
✅ Deploy Preview for holmes-docs ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
🔬 CLI Performance Benchmark🟡 Startup Time (no LLM)Measures
🟡 Full CLI with LLMMeasures
PR: |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
WalkthroughAdd a helper ChangesAbnormal WebSocket 1006 handling
Sequence DiagramsequenceDiagram
participant Handler as Exception handler site
participant Detector as _is_ws_abnormal_close
participant Logger as Logger
participant Reconnector as Reconnect/cleanup flow
Handler->>Detector: pass caught exception
Detector->>Detector: inspect type/message for ConnectionClosed / "no close frame"
alt Detector returns True (1006)
Detector-->>Handler: True
Handler->>Logger: log WARNING / debug (no exception stack)
Handler->>Reconnector: continue retry/reconnect/cleanup path
else Detector returns False
Detector-->>Handler: False
Handler->>Logger: log ERROR / exception stack
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Address review feedback that the helper was doing too much. Three
simplifications:
1. Scope to 1006 only. The Sentry issue is specifically 1006; the
1000/1001 cases were speculative.
2. Replace the typed close-code dispatch (rcvd/sent, deprecated .code,
frozenset lookup) with a plain `isinstance + "no close frame" in
str(exc)` check. The phrase is the stable websockets __str__ for
code 1006.
3. Drop __cause__/__context__ chain walking. The realtime library
surfaces ConnectionClosed* directly at every site we catch.
Helper shrinks from ~30 lines to 5; call sites shrink from ~10 lines
each to ~6. Net diff vs master drops from +258 to +155 lines.
Tests reduced to 5 (was 10): one positive, one for 1008 false-positive
guard, one for non-WS exception, plus the two caplog end-to-end checks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Mohse Morad <moshemorad12340@gmail.com>
What
WebSocket close codes 1000 (normal), 1001 (going away), and 1006 (abnormal close) are routine operational events for the
ConversationsWorkerrealtime subscription. The worker's own_channel_unhealthy+_full_reconnectloop already recovers from them transparently — logging them at ERROR vialogging.exceptioncausessentry_sdk's logging integration to surface them as actionable Sentry issues, which is noise.Sentry example:
https://robusta-eu.sentry.io/issues/118384168/Why these three codes are benign
Per RFC 6455:
websocketslibrary when the TCP/TLS connection drops without a Close frame (NAT idle-timeout, LB kill, network blip). This code is never sent on the wire.In all three cases the existing reconnect loop catches the dropped socket on its next health tick and rebuilds with exponential backoff.
Not included: 1008 (policy violation), 1011 (server error), and the 4xxx custom range — those typically indicate auth/RLS misconfig or server bugs and should continue to page.
Changes
holmes/core/conversations_worker/realtime_manager.py:_benign_ws_close_code(exc)that classifies awebsockets.ConnectionClosed*(walking__cause__/__context__) by close code, returning the code only if in{1000, 1001, 1006}, elseNone. Usesrcvd/sentdispatch to avoid thewebsockets ≥ 13.codedeprecation while still handling the rcvd-is-None (1006) case.logging.exceptionsites downgraded tologging.warningwhen the exception is benign:_thread_entry(defense in depth)_runmain-loop catch (defense in depth)_full_reconnectfailure_maybe_refresh_authfailure_connect_and_subscribe→set_authfailure_shutdown_asyncfailure_connect_and_subscribecleanup-after-failed-connect, already in a cleanup path) further downgraded todebug.Tests
tests/core/conversations_worker/test_realtime_manager.py— 10 new tests:.codeattribute.__cause__and__context__chains.Noneinput.caplog:_shutdown_asynclogs at WARNING for a 1006, and still at ERROR for aValueError.🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests