fix: per-channel dispatch state for cross-thread concurrent turns #48

Merged
yourbuddyconner merged 2 commits into main from fix/per-channel-dispatch-state
Jun 12, 2026

@yourbuddyconner

Summary

  • Reworks the abort / completion / dispatch path to be per-channel and per-row so that the concurrent thread turns enabled by PR #42 (Fix scheduled trigger prompt queue preservation) actually dispatch in parallel, the Stop button is scoped to one thread, and a class of "default to wipe everything" fallbacks no longer destroys sibling threads' state.
  • Adds channel context (channelType / channelId) to the runner→DO aborted frame so completion routes correctly under concurrent dispatch when messageId is missing (getProcessingChannelKey() returns null on 2+ processing rows; the stuck-processing watchdog only fires when the runner is disconnected).
  • Client now tracks per-thread status / pending follow-ups / abort sentinel / in-flight HTTP send, so cross-thread turns each render their own chrome and a Stop on one thread doesn't optimistically clear siblings.

Test plan

  • pnpm typecheck — clean
  • pnpm test — 748 worker tests pass (added regression coverage for aborted-frame channel routing, no-context drain skip, markCompletedById(undefined) no-op, markCompletedMostRecentByChannel, and watchdog action payloads)
  • cd packages/client && pnpm build — clean
  • Manual: trigger three manual automations and confirm they dispatch in parallel (the original symptom from PR #42, Fix scheduled trigger prompt queue preservation)
  • Manual: Stop button on one thread while siblings are running — only that thread aborts
  • Manual: Slack-channel-scoped stop with a threaded message in flight — abort fans out to the active thread, not session-wide
  • Manual: thread error followed by a sibling-thread completion — session chrome clears, errored thread chip stays red until next prompt

Deploy notes

  • Bump IMAGE_BUILD_VERSION in backend/images/base.py before make deploy-modal: packages/runner/ changed and the sandbox image is cache-keyed by that version. Without the bump, the DO ships expecting new aborted-frame fields that old runner images don't send (graceful fallback exists, but the R6-F2 fix won't actually activate).
  • Watch CF observability for the first ~24h after merge: grep for markCompletedById log lines and stuck-processing watchdog firings. A row that wedges from concurrent dispatch under a connected runner is invisible to the 5-min watchdog (symptom: thread silently stuck on 'thinking' until session reload).

Process

Six rounds of adversarial code review during the rewrite; each round found real regressions in the prior round's fixes. The recurring pattern was scoping mistakes at the abort/completion boundary under concurrent dispatch, plus latent "when in doubt, do it to everything" fallbacks that were harmless in the single-thread era and catastrophic with concurrent threads in flight. The single squashed commit is the converged form after all six rounds.

PR #42 enabled cross-thread concurrent dispatch (TKAI-65) but several
session-wide state slots and code paths still assumed only one prompt
could be in flight at a time. The result: three manual automations
triggered three sandbox threads but the prompts dispatched serially;
the Stop button on one thread aborted every active thread; errored
threads left the session chrome stuck on 'thinking' forever; and a
class of "default to wipe everything" fallbacks could destroy sibling
threads' state under the right race.

This branch reworks the dispatch / abort / completion path to be
per-channel and per-row, with channel-scoped fan-out and routing
fallbacks that never escalate to session-wide destruction.

Worker (SessionAgentDO + PromptQueue)
- prompt_queue rows now carry per-row dispatched_at and received_at,
  replacing the global lastPromptDispatchedAt / promptReceivedAt
  state-key slots. Watchdog reads min(dispatched_at) for stuck-
  processing detection.
- channel_state table tracks per-channel busy + per-channel
  idle_queued_since + per-channel error_safety_net_at. The watchdog's
  RecoveryAction discriminated union carries the channel_key (or
  messageId) for each variant so force_complete / revert_and_drain /
  clear_safety_net target the right row.
- sendNextQueuedPrompt drains every dispatchable cross-channel queued
  prompt in one pass (skippedBusyIds Set), instead of dispatching one
  and exiting. Per-channel busy guards prevent same-channel double
  dispatch; cross-channel concurrency is unlocked.
- handleAbort with channelType='thread' aborts only that thread.
  Channel-scoped aborts (e.g. {slack, C1}) fan out one frame per
  DISTINCT active thread on that channel. Lookup matches BOTH
  channel_key AND (reply_channel_type, reply_channel_id) so threaded
  Slack/Telegram messages (whose channel_key was rewritten to
  'thread:<tid>') are still found.
- aborted frame from the runner now carries channelType/channelId.
  The DO uses it as the second-choice channel resolver when
  messageId is missing — getProcessingChannelKey() returns null when
  there are 2+ processing rows (the exact concurrent-dispatch case),
  so without this the row would wedge forever (the stuck-processing
  watchdog only fires when the runner is disconnected).
- markCompletedById(undefined) is now a no-op. The previous escalation
  to unscoped markCompleted() ran DELETE FROM prompt_queue WHERE
  status='completed' after wiping every processing row, orphaning
  sibling channels' runner state. Replaced with
  markCompletedMostRecentByChannel(channelKey) so a missing-messageId
  ack completes exactly one row on the resolved channel.
- handlePromptComplete skips queue drain when neither messageId nor
  channel context is resolvable, so a confused-abort frame can't
  surprise the user by dispatching an unrelated channel's queued
  prompt.
- 'aborted' handler emits a scoped agentStatus:idle (via
  getChannelTargetById) — N fan-out frames no longer produce N
  unscoped idle broadcasts that flip session-wide chrome.
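
The completion-routing rules above can be sketched as follows. This is a minimal in-memory model, not the DO's code: the real prompt_queue is a SQL table, and the QueueRow shape, the AbortOutcome type, the handleAborted name, and the `type:id` channel-key format are assumptions made for illustration. Only getProcessingChannelKey and markCompletedMostRecentByChannel are names taken from this PR, and their bodies here are guesses at the described behavior.

```typescript
// Illustrative in-memory stand-in for prompt_queue rows (hypothetical shape).
type RowStatus = "queued" | "processing" | "completed";

interface QueueRow {
  id: string;
  channelKey: string; // assumed format, e.g. "thread:<tid>"
  status: RowStatus;
  dispatchedAt: number;
}

interface AbortedFrame {
  messageId?: string;
  channelType?: string;
  channelId?: string;
}

interface AbortOutcome {
  completedId: string | null;
  drainQueue: boolean; // whether the caller should go on to drain queued prompts
}

// Legacy resolver: only unambiguous when exactly one row is processing.
function getProcessingChannelKey(rows: QueueRow[]): string | null {
  const processing = rows.filter((r) => r.status === "processing");
  const [only] = processing;
  return processing.length === 1 && only ? only.channelKey : null;
}

// Completes at most one row: the most recently dispatched processing row
// on the given channel. Sibling channels are never touched.
function markCompletedMostRecentByChannel(rows: QueueRow[], channelKey: string): string | null {
  let target: QueueRow | null = null;
  for (const row of rows) {
    if (row.channelKey !== channelKey || row.status !== "processing") continue;
    if (target === null || row.dispatchedAt > target.dispatchedAt) target = row;
  }
  if (target === null) return null;
  target.status = "completed";
  return target.id;
}

function handleAborted(rows: QueueRow[], frame: AbortedFrame): AbortOutcome {
  // First choice: the exact row named by messageId.
  const byId =
    frame.messageId === undefined
      ? undefined
      : rows.find((r) => r.id === frame.messageId && r.status === "processing");
  if (byId) {
    byId.status = "completed";
    return { completedId: byId.id, drainQueue: true };
  }
  // Second choice: channel context on the frame. Last: the legacy lookup.
  const channelKey =
    frame.channelType && frame.channelId
      ? `${frame.channelType}:${frame.channelId}`
      : getProcessingChannelKey(rows);
  // Nothing resolvable: complete nothing and skip the drain.
  if (channelKey === null) return { completedId: null, drainQueue: false };
  return { completedId: markCompletedMostRecentByChannel(rows, channelKey), drainQueue: true };
}
```

The point of the ordering is that every fallback narrows to a single row or does nothing; no branch widens to the whole session.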

Runner (PromptHandler / AgentClient)
- Abort handler dispatched fire-and-forget (IIFE wrapper captures
  synchronous throws) so a channel-wide fan-out of N abort frames
  doesn't serialize the inbound WS loop behind N OpenCode /abort
  round-trips.
- sendAborted forwards channelType/channelId so the DO can route the
  completion under concurrent dispatch.
- Removed dead lastPromptSentAt / pendingReplyChannelType /
  pendingReplyChannelId fields from PromptHandler.
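
A fire-and-forget dispatch of the kind described in the first bullet might look like the sketch below. The dispatchAbort name, the AbortFrame shape, and the onError callback are hypothetical; the actual PromptHandler wiring is not shown in this PR. The async IIFE starts the handler without awaiting it, and its try/catch turns both synchronous throws and rejected promises into an onError call, so the caller's receive loop never blocks on the /abort round-trip and never sees an exception.

```typescript
// Hypothetical frame shape for illustration only.
type AbortFrame = { channelType?: string; channelId?: string };

function dispatchAbort(
  frame: AbortFrame,
  handler: (frame: AbortFrame) => Promise<void> | void,
  onError: (err: unknown) => void,
): void {
  // `void` marks the promise as intentionally not awaited.
  void (async () => {
    try {
      // A synchronous throw from handler() is caught here too,
      // because the call happens inside the async function body.
      await handler(frame);
    } catch (err) {
      onError(err);
    }
  })();
}
```

With this wrapper, N fan-out abort frames start N handlers back to back instead of queuing each one behind the previous round-trip.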

Client (use-chat / chat-container)
- threadStatuses: per-thread {status, detail} map replaces the
  single agentStatus slot for concurrent-thread rendering. Each
  thread renders its own Stop button.
- threadPendingFollowups: per-thread map of queued follow-up
  prompts, replacing the single pendingPrompt slot.
- abortingThreadsRef: 30s sentinel on the aborting thread suppresses
  late agentStatus events that left the runner before the abort
  frame arrived; cleared on the runner's confirmation 'idle' (or
  'error' — for teardown errors).
- pendingHttpSendsRef: per-thread AbortController cancels the
  in-flight HTTP POST when the user aborts, so a large-payload send
  in flight when /stop is clicked doesn't land in the DO as an
  orphaned prompt.
- Session-switch effect clears all three refs so cross-session
  thread-id collisions don't suppress lazy loading or apply stale
  optimistic state.
- 'error' status persists against the runner's trailing 'idle'
  (in both the agentStatus reducer AND the runnerBusy=false status
  reducer) so the error chip stays visible until the next prompt.
  anyOtherActive excludes 'error' so the session chrome can clear
  even when a sibling thread is in a sticky error state.
- isDispatchBusy uses per-thread state when activeThreadId is set,
  falling back to session-wide isSessionBusy otherwise — fixes
  command-bar state on threadless flows while keeping per-thread
  state on threaded ones.
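
The sticky-error rule and the anyOtherActive exclusion can be sketched as a pair of pure functions. This is an illustration under assumptions: the status set is reduced to the three values named in this PR ('idle', 'thinking', 'error'), and applyAgentStatus is a hypothetical name, not the reducer in use-chat.

```typescript
// Assumed status set; the real client may track more states.
type ThreadStatus = "idle" | "thinking" | "error";

interface ThreadState {
  status: ThreadStatus;
  detail?: string;
}

type ThreadStatuses = Record<string, ThreadState>;

// Hypothetical reducer: a trailing 'idle' from the runner does not overwrite
// 'error', so the error chip stays until a new prompt moves the thread on.
function applyAgentStatus(prev: ThreadStatuses, threadId: string, next: ThreadState): ThreadStatuses {
  const current = prev[threadId];
  if (current?.status === "error" && next.status === "idle") return prev;
  return { ...prev, [threadId]: next };
}

// A sibling in 'error' does not count as active, so session chrome can clear.
function anyOtherActive(statuses: ThreadStatuses, threadId: string): boolean {
  return Object.entries(statuses).some(
    ([id, state]) => id !== threadId && state.status !== "idle" && state.status !== "error",
  );
}
```

Keeping both rules as pure functions over the per-thread map makes the "thread error followed by a sibling-thread completion" case from the test plan easy to check in isolation.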

Protocol
- call-tool envelope carries opencodeSessionId; OpenCode tool wrapper
  sets x-opencode-session-id so the gateway routes tool calls to the
  originating session.
- aborted frame carries channelType?/channelId? (optional, backward-
  compatible with old runners — DO falls back to legacy resolution).
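
One way to read the backward-compatibility note: the new fields are simply absent from frames sent by old runners, and the DO treats channel context as present only when both fields arrive. The sketch below assumes a JSON wire format and a `type: "aborted"` discriminator; buildAbortedFrame and channelContextOf are hypothetical helper names, not functions from this PR.

```typescript
// Assumed wire shape; only channelType?/channelId? are confirmed by this PR.
interface AbortedFrame {
  type: "aborted";
  messageId?: string;
  channelType?: string;
  channelId?: string;
}

interface ChannelContext {
  channelType: string;
  channelId: string;
}

// Runner side (hypothetical): attach channel context only when it is known,
// so a frame without it serializes exactly like an old runner's frame.
function buildAbortedFrame(messageId?: string, channel?: ChannelContext): AbortedFrame {
  const frame: AbortedFrame = { type: "aborted" };
  if (messageId !== undefined) frame.messageId = messageId;
  if (channel) {
    frame.channelType = channel.channelType;
    frame.channelId = channel.channelId;
  }
  return frame;
}

// DO side (hypothetical): context counts only if both fields are strings;
// otherwise the caller falls back to legacy resolution.
function channelContextOf(frame: AbortedFrame): ChannelContext | null {
  if (typeof frame.channelType !== "string" || typeof frame.channelId !== "string") return null;
  return { channelType: frame.channelType, channelId: frame.channelId };
}
```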

Tests
- 748 worker tests pass; pnpm typecheck clean; client pnpm build clean.
- Added regression tests:
  - aborted{channelType,channelId} routes completion under 2+
    processing rows
  - aborted{} with no context does not wipe siblings or drain queued prompts
  - markCompletedById(undefined) is a no-op (does not escalate)
  - markCompletedMostRecentByChannel scopes completion to the channel
  - SessionHealthMonitor: force_complete carries channelKey;
    clear_safety_net carries channelKey
- prompt-queue test mock extended to handle WHERE channel_key = ?,
  SELECT DISTINCT thread_id, MIN/MAX(dispatched_at).

Six rounds of adversarial code review during the rewrite; each found
real regressions in the prior round's fixes. The pattern was almost
always the same: scoping mistakes at the abort/completion boundary
under concurrent dispatch, plus latent "when in doubt, do it to
everything" fallbacks that were harmless in the single-thread era
and catastrophic with concurrent threads in flight.
@github-actions

Preview deployment: https://pr-48.dev-valet-turnkey-client.pages.dev

packages/runner/src/agent-client.ts and prompt.ts changed in this branch,
both of which run inside the sandbox. Without this bump, the cached image
ships without the new aborted-frame channel context the DO now relies on
to route completion under concurrent dispatch.
@yourbuddyconner yourbuddyconner merged commit a31d7d6 into main Jun 12, 2026
4 checks passed