fix: per-channel dispatch state for cross-thread concurrent turns#48
Merged
Conversation
PR #42 enabled cross-thread concurrent dispatch (TKAI-65) but several session-wide state slots and code paths still assumed only one prompt could be in flight at a time. The result: three manual automations triggered three sandbox threads but the prompts dispatched serially; the Stop button on one thread aborted every active thread; errored threads left the session chrome stuck on 'thinking' forever; and a class of "default to wipe everything" fallbacks could destroy sibling threads' state under the right race. This branch reworks the dispatch / abort / completion path to be per-channel and per-row, with channel-scoped fan-out and routing fallbacks that never escalate to session-wide destruction. Worker (SessionAgentDO + PromptQueue) - prompt_queue rows now carry per-row dispatched_at and received_at, replacing the global lastPromptDispatchedAt / promptReceivedAt state-key slots. Watchdog reads min(dispatched_at) for stuck- processing detection. - channel_state table tracks per-channel busy + per-channel idle_queued_since + per-channel error_safety_net_at. The watchdog's RecoveryAction discriminated union carries the channel_key (or messageId) for each variant so force_complete / revert_and_drain / clear_safety_net target the right row. - sendNextQueuedPrompt drains every dispatchable cross-channel queued prompt in one pass (skippedBusyIds Set), instead of dispatching one and exiting. Per-channel busy guards prevent same-channel double dispatch; cross-channel concurrency is unlocked. - handleAbort with channelType='thread' aborts only that thread. Channel-scoped aborts (e.g. {slack, C1}) fan out one frame per DISTINCT active thread on that channel. Lookup matches BOTH channel_key AND (reply_channel_type, reply_channel_id) so threaded Slack/Telegram messages (whose channel_key was rewritten to 'thread:<tid>') are still found. - aborted frame from the runner now carries channelType/channelId. The DO uses it as the second-choice channel resolver when messageId is missing — getProcessingChannelKey() returns null when there are 2+ processing rows (the exact concurrent-dispatch case), so without this the row would wedge forever (the stuck-processing watchdog only fires when the runner is disconnected). - markCompletedById(undefined) is now a no-op. The previous escalation to unscoped markCompleted() ran DELETE FROM prompt_queue WHERE status='completed' after wiping every processing row, orphaning sibling channels' runner state. Replaced with markCompletedMostRecentByChannel(channelKey) so a missing-messageId ack completes exactly one row on the resolved channel. - handlePromptComplete skips queue drain when neither messageId nor channel context is resolvable, so a confused-abort frame can't surprise the user by dispatching an unrelated channel's queued prompt. - 'aborted' handler emits a scoped agentStatus:idle (via getChannelTargetById) — N fan-out frames no longer produce N unscoped idle broadcasts that flip session-wide chrome. Runner (PromptHandler / AgentClient) - Abort handler dispatched fire-and-forget (IIFE wrapper captures synchronous throws) so a channel-wide fan-out of N abort frames doesn't serialize the inbound WS loop behind N OpenCode /abort round-trips. - sendAborted forwards channelType/channelId so the DO can route the completion under concurrent dispatch. - Removed dead lastPromptSentAt / pendingReplyChannelType / pendingReplyChannelId fields from PromptHandler. Client (use-chat / chat-container) - threadStatuses: per-thread {status, detail} map replaces the single agentStatus slot for concurrent-thread rendering. Each thread renders its own Stop button. - threadPendingFollowups: per-thread map of queued follow-up prompts, replacing the single pendingPrompt slot. - abortingThreadsRef: 30s sentinel on the aborting thread suppresses late agentStatus events that left the runner before the abort frame arrived; cleared on the runner's confirmation 'idle' (or 'error' — for teardown errors). - pendingHttpSendsRef: per-thread AbortController cancels the in-flight HTTP POST when the user aborts, so a large-payload send in flight when /stop is clicked doesn't land in the DO as an orphaned prompt. - Session-switch effect clears all three refs so cross-session thread-id collisions don't suppress lazy loading or apply stale optimistic state. - 'error' status persists against the runner's trailing 'idle' (in both the agentStatus reducer AND the runnerBusy=false status reducer) so the error chip stays visible until the next prompt. anyOtherActive excludes 'error' so the session chrome can clear even when a sibling thread is in a sticky error state. - isDispatchBusy uses per-thread state when activeThreadId is set, falling back to session-wide isSessionBusy otherwise — fixes command-bar state on threadless flows while keeping per-thread state on threaded ones. Protocol - call-tool envelope carries opencodeSessionId; OpenCode tool wrapper sets x-opencode-session-id so the gateway routes tool calls to the originating session. - aborted frame carries channelType?/channelId? (optional, backward- compatible with old runners — DO falls back to legacy resolution). Tests - 748 worker tests pass; pnpm typecheck clean; client pnpm build clean. - Added regression tests: - aborted{channelType,channelId} routes completion under 2+ processing rows - aborted{} with no context does not wipe siblings or drain queued - markCompletedById(undefined) is a no-op (does not escalate) - markCompletedMostRecentByChannel scopes completion to the channel - SessionHealthMonitor: force_complete carries channelKey; clear_safety_net carries channelKey - prompt-queue test mock extended to handle WHERE channel_key = ?, SELECT DISTINCT thread_id, MIN/MAX(dispatched_at). Six rounds of adversarial code review during the rewrite; each found real regressions in the prior round's fixes. The pattern was almost always the same: scoping mistakes at the abort/completion boundary under concurrent dispatch, plus latent "when in doubt, do it to everything" fallbacks that were harmless in the single-thread era and catastrophic with concurrent threads in flight.
|
Preview deployment: https://pr-48.dev-valet-turnkey-client.pages.dev |
packages/runner/src/agent-client.ts and prompt.ts changed in this branch, both of which run inside the sandbox. Without this bump, the cached image ships without the new aborted-frame channel context the DO now relies on to route completion under concurrent dispatch.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
channelType/channelId) to the runner→DOabortedframe so completion routes correctly under concurrent dispatch whenmessageIdis missing (getProcessingChannelKey()returns null on 2+ processing rows; the stuck-processing watchdog only fires when the runner is disconnected).Test plan
pnpm typecheck— cleanpnpm test— 748 worker tests pass (added regression coverage for aborted-frame channel routing, no-context drain skip,markCompletedById(undefined)no-op,markCompletedMostRecentByChannel, and watchdog action payloads)cd packages/client && pnpm build— cleanDeploy notes
IMAGE_BUILD_VERSIONinbackend/images/base.pybeforemake deploy-modal—packages/runner/changed and the sandbox image is cache-keyed by that version. Without the bump, the DO ships expecting newaborted-frame fields that old runner images don't send (graceful fallback exists, but the R6-F2 fix won't actually activate).markCompletedByIdlog lines and stuck-processing watchdog firings. A row that wedges from concurrent dispatch under a connected runner is invisible to the 5-min watchdog (symptom: thread silently stuck on 'thinking' until session reload).Process
Six rounds of adversarial code review during the rewrite; each round found real regressions in the prior round's fixes. The recurring pattern was scoping mistakes at the abort/completion boundary under concurrent dispatch, plus latent "when in doubt, do it to everything" fallbacks that were harmless in the single-thread era and catastrophic with concurrent threads in flight. The single squashed commit is the converged form after all six rounds.