Tags: tomaioo/mux
🤖 feat: add OpenAI WebSocket transport opt-in (coder#3241)

## Summary

Adds an opt-in OpenAI WebSocket transport setting for the built-in OpenAI provider. When `webSocketTransportEnabled` is true and the effective OpenAI wire format is Responses, eligible streaming Responses API requests use `@vercel/ai-sdk-openai-websocket-fetch`; existing HTTP behavior remains the default.

## Background

OpenAI's Responses WebSocket transport can reduce setup overhead for streaming, multi-step workflows, but Mux previously had no first-class provider-level opt-in. This keeps the feature scoped to the built-in OpenAI provider and preserves the saved preference when users temporarily switch to Chat Completions.

## Implementation

- Adds `webSocketTransportEnabled` to provider config/status schemas and OpenAI provider settings.
- Shows the WebSocket control only in Responses wire format; hides it for Chat Completions without clearing the saved value.
- Composes the upstream WebSocket fetch through a small helper that preserves Mux's existing OpenAI fetch wrapper for non-eligible requests.
- Attaches per-model cleanup via a Mux-owned symbol and runs cleanup from main stream and workspace title generation paths.
- Updates provider factory, stream lifecycle, and settings tests for activation, gating, and cleanup behavior.

## Validation

- `make static-check`
- Focused tests for config/status, provider factory activation, helper behavior, stream cleanup, title cleanup, and Settings UI behavior.
- Dogfooded Settings UI with `agent-browser` for default/off, enabled, Chat Completions hidden, and Responses restored states.
- Created live test workspaces, sent OpenAI chat messages, and verified backend-side WebSocket open evidence: `wss://api.openai.com/v1/responses`.

## Risks

The main risk is provider transport composition regressions.
The implementation pre-filters non-eligible requests so Mux's existing fetch behavior remains responsible for non-WebSocket HTTP paths, and cleanup is scoped per model/run to avoid process-wide socket lifetime complexity.

---

<details>
<summary>📋 Implementation Plan</summary>

# Implementation Plan: OpenAI WebSocket Transport Opt-In

## Goal

Add a non-breaking, optional **OpenAI WebSocket Transport** setting for the **Built-in OpenAI Provider**. When `webSocketTransportEnabled` is persisted as `true` and the effective OpenAI wire format is Responses, eligible streaming Responses API requests use the published OpenAI WebSocket fetch transport. Existing HTTP behavior remains the default.

## Verified context and constraints

- Product/domain decisions are already captured in `CONTEXT.md` and `PRD.md`:
  - canonical setting name: `webSocketTransportEnabled`
  - provider config only; no request-level override
  - exposed in Settings → Providers → OpenAI near Wire Format
  - inactive/disabled for Chat Completions while preserving the saved flag
  - no custom base URL validation
  - no automatic HTTP fallback after WebSocket failures
  - use `@vercel/ai-sdk-openai-websocket-fetch`; do not implement the WebSocket protocol locally
  - per-stream connection lifecycle; explicit cleanup on completion/error/cancel
  - no ADR for this iteration
- Repo investigation found existing OpenAI-specific provider config/status/UI patterns to mirror:
  - `serviceTier`, `wireFormat`, and `store` in provider config/status/UI
  - OpenAI status values are validated before surfacing to the frontend
  - `ProvidersSection.tsx` already has adjacent OpenAI settings for Service tier, Wire format, and Response storage
- Repo investigation found the main runtime seams:
  - `providerModelFactory.ts` creates OpenAI models through `createOpenAI({ ..., fetch })`
  - the OpenAI branch already wraps fetch for Mux headers, DevTools capture/stripping, Codex OAuth normalization/routing, and custom fetch handling
  - `streamManager.ts` owns
    the main guaranteed stream cleanup `finally` path
  - `workspaceTitleGenerator.ts` is another `streamText` owner using `AIService.createModel()` models
- Upstream AI SDK docs confirm that OpenAI provider instances accept a custom `fetch`, `createWebSocketFetch()` is passed to `createOpenAI({ fetch })`, the package exposes `.close()`, and only streaming `POST /responses` requests use WebSocket while other requests fall through to standard fetch.

## Recommended approach

**Approach A: Provider-config opt-in + small WebSocket fetch composition module + language-model cleanup symbol**

Net product-code LoC estimate: **~230–360 LoC**

Estimated product-code breakdown:

- config/status schemas and provider service surfacing: ~20–35 LoC
- Settings UI control and helpers: ~55–90 LoC
- WebSocket fetch composition helper: ~55–90 LoC
- language-model cleanup helper: ~35–55 LoC
- provider factory integration: ~35–60 LoC
- stream-owner cleanup integration: ~20–30 LoC

Why this approach:

- keeps the existing `createModel()` return API stable
- isolates protocol package composition behind a small deep module
- preserves existing OpenAI fetch behavior instead of naively replacing fetch
- gives deterministic test seams for enablement and cleanup
- avoids process-wide socket caching, URL validation, fallback retries, or other speculative complexity

Rejected alternatives:

- **Process-wide cached WebSocket connections**: more latency upside across separate user messages but requires cache keys, config invalidation, key rotation handling, and app shutdown cleanup. Product-code estimate if chosen later: ~180–300 additional LoC.
- **Change `createModel()` to return `{ model, cleanup }`**: explicit but high-churn across call sites and tests. Product-code estimate: ~120–220 LoC plus broad type/test churn.
- **Implement the WebSocket protocol locally**: maximum control but duplicates upstream transport behavior and beta protocol maintenance.
  Product-code estimate: ~220–400 LoC plus higher maintenance risk.

## Implementation phases

### Phase 0 — Documentation alignment

1. Keep `CONTEXT.md` as the canonical glossary and decision summary for this feature.
   - Preserve the terms **Built-in OpenAI Provider**, **Direct OpenAI API Key Path**, **OpenAI WebSocket Transport**, and `webSocketTransportEnabled`.
   - If implementation uncovers a domain decision that changes the agreed semantics, update `CONTEXT.md` in the same change set rather than leaving the glossary stale.
2. Keep `PRD.md` aligned with the implemented scope.
   - It should continue to describe the feature as a non-breaking provider-config opt-in.
   - Update it if implementation materially changes accepted behavior, package name, acceptance criteria, or dogfooding requirements.
3. Do not create an ADR unless implementation introduces a hard-to-reverse architectural decision beyond the current per-stream cleanup-symbol approach.

Quality gate after Phase 0:

- Confirm `CONTEXT.md` and `PRD.md` mention the current package name, `@vercel/ai-sdk-openai-websocket-fetch`, before implementation begins.
- Confirm later implementation changes do not contradict the glossary or PRD acceptance criteria.

### Phase 1 — Dependency and schema/status plumbing

1. Add `@vercel/ai-sdk-openai-websocket-fetch` using Bun.
   - Use `bun add @vercel/ai-sdk-openai-websocket-fetch` so `package.json` and lockfile remain consistent.
   - Keep the dependency in normal dependencies, not dev dependencies, because runtime provider creation uses it.
2. Add `webSocketTransportEnabled: z.boolean().optional()` to the **Built-in OpenAI Provider** config schema.
   - Place it near existing OpenAI-only fields such as `serviceTier`, `defaultModel`, `apiVersion`, and other persisted OpenAI settings.
   - Do not add it to request/provider options schemas; this is intentionally provider config only.
3. Add `webSocketTransportEnabled?: boolean` to provider-status/oRPC schema output.
   - Place it near `wireFormat` and `store` because the settings UI consumes these together.
4. Surface valid persisted values from the provider service.
   - Mirror the `store` boolean pattern: only copy the value into provider status when `typeof config.webSocketTransportEnabled === "boolean"`.
   - Invalid persisted values should be omitted from status rather than surfaced to UI.

Quality gate after Phase 1:

- Run targeted config/provider tests that cover provider schema and provider service status.
- Expected tests to extend:
  - provider config schema tests
  - provider status/oRPC schema conformance tests
  - provider service tests for OpenAI-only fields

### Phase 2 — Settings UI control

1. Add the OpenAI provider settings control near Wire Format / Response storage.
   - Label: **WebSocket transport**.
   - Use risk-aware helper copy, e.g. "Experimental: uses OpenAI's Responses WebSocket transport for streaming Responses API requests. Unsupported endpoints may fail."
   - Avoid tests that assert exact prose; the prose can evolve.
2. Persist changes through the existing provider config mutation API.
   - Enable: set `keyPath: ["webSocketTransportEnabled"]`, `value: true`.
   - Disable: prefer setting `value: ""` to remove the field if existing provider config mutation semantics treat empty string as delete; otherwise set `false` only if that is the established boolean-toggle convention. Verify the current `setConfig` behavior before implementing this detail.
   - Optimistically update the local provider config state with the chosen value so the UI responds immediately.
3. Disable the control while effective OpenAI wire format is Chat Completions.
   - Use the same effective default as the existing Wire Format control: missing wire format means Responses.
   - Preserve the saved `webSocketTransportEnabled` value while disabled.
   - Show disabled helper text such as "Only available with Responses wire format."

Quality gate after Phase 2:

- Run targeted Settings UI tests.
- Verify behavior, not copy:
  - control is visible for the built-in OpenAI provider
  - control persists enable/disable through `setProviderConfig`
  - control is disabled when `wireFormat === "chatCompletions"`
  - selecting Chat Completions does not delete the saved WebSocket preference

### Phase 3 — Deep module: OpenAI WebSocket fetch composition

Create a small node-side helper module for WebSocket transport composition.

Responsibilities:

1. Accept the existing Mux OpenAI fetch as its base/fallback behavior.
2. Accept an `enabled` boolean that has already applied runtime eligibility (`webSocketTransportEnabled === true` and effective wire format is Responses).
3. When disabled, return the original fetch and a no-op close hook.
4. When enabled, create a WebSocket fetch via `createWebSocketFetch()` and return:
   - a fetch compatible with `createOpenAI({ fetch })`
   - a close hook that calls the WebSocket fetch's `.close()` exactly once
5. Preserve existing Mux OpenAI fetch behavior.
   - Existing request shaping/normalization must still run.
   - Existing HTTP fallthrough from the WebSocket package should still benefit from Mux's fetch behavior where possible.
   - If preserving the package's HTTP fallthrough requires a wrapper around global fetch, keep that wrapper local and heavily tested; do not reimplement the WebSocket protocol.
6. Do not catch WebSocket transport failures to retry over HTTP.
   - Let eligible request failures surface naturally.

Important implementation detail to verify while coding:

- The published package falls through to `globalThis.fetch` for non-WebSocket requests. If using it directly would bypass Mux's base fetch for HTTP fallthrough, compose a wrapper so non-eligible requests still call Mux's base fetch. Keep this wrapper simple and test it with mocked fetches.
Suggested public interface shape:

- `createOpenAIWebSocketTransportFetch({ enabled, baseFetch }): { fetch: typeof fetch; close: () => void }`
- The helper should assert that `close` is callable when enabled and should make cleanup idempotent.

Quality gate after Phase 3:

- Add direct unit tests for the helper using a mocked `@vercel/ai-sdk-openai-websocket-fetch` package.
- Assert externally observable behavior:
  - disabled returns base-fetch behavior and no-op close
  - enabled delegates eligible requests to the WebSocket fetch
  - non-eligible requests preserve base-fetch behavior
  - close is idempotent and does not throw on repeated calls

### Phase 4 — Deep module: language-model cleanup helper

Create a Mux-owned cleanup helper for provider-created language models.

Responsibilities:

1. Attach cleanup to a model object without changing the provider model factory return type.
2. Use a private Symbol so the attachment does not collide with AI SDK/provider fields.
3. Assert the attached cleanup is a function.
4. Run cleanup at most once per model.
5. Swallow/log cleanup exceptions so cleanup failures do not mask the original stream completion/error.
6. Clear the cleanup after running to avoid retaining closures longer than necessary.

Suggested public interface shape:

- `attachLanguageModelCleanup(model, cleanup): LanguageModel`
- `runLanguageModelCleanup(model): void`

Quality gate after Phase 4:

- Unit tests for the helper:
  - cleanup runs exactly once
  - repeated cleanup is a no-op
  - models without cleanup are safe
  - thrown cleanup errors are handled according to the chosen helper contract

### Phase 5 — Provider model factory integration

1. In the OpenAI branch, compute runtime eligibility:
   - persisted/provider config `webSocketTransportEnabled === true`
   - effective wire format is Responses
   - no request-level override support
2. Keep existing config-to-provider-options logic for `serviceTier`, `wireFormat`, and `store` unchanged.
3.
   Compose the existing OpenAI fetch with the WebSocket helper before passing `fetch` to `createOpenAI`.
   - Do not bypass existing `fetchWithOpenAICodexNormalization` behavior.
   - Do not add a special Codex OAuth guard beyond the agreed Responses-wire-format gating.
   - Do not validate custom base URLs.
4. After creating the model (`provider.responses(modelId)` or `provider.chat(modelId)`), attach the close hook only when the helper created an active WebSocket cleanup.
5. Ensure DevTools middleware wrapping does not discard cleanup.
   - If cleanup is attached before `wrapLanguageModel`, verify whether wrapping preserves object identity/metadata.
   - If wrapping loses the symbol, attach cleanup after final wrapping, or copy cleanup from inner to outer model.
   - Add a test for the DevTools-enabled path if this is ambiguous during implementation.

Quality gate after Phase 5:

- Provider model factory tests:
  - Responses + enabled activates WebSocket composition
  - Responses + missing/false setting does not activate it
  - Chat Completions + enabled does not activate it
  - invalid config value is not treated as enabled
  - custom base URL does not prevent activation when enabled + Responses
  - Codex OAuth is not specially guarded; the code path follows the same eligibility rule

### Phase 6 — Stream owner cleanup integration

1. Main streams (`streamManager`): call `runLanguageModelCleanup(streamInfo.request.model)` or equivalent model reference in the existing guaranteed cleanup `finally` block.
   - Prefer the actual `LanguageModel` object, not the model string.
   - Run cleanup before deleting stream state.
   - Make cleanup safe for retry paths: if a stream is reset for an internal retry, do not close the WebSocket before the final stream run completes unless a new stream/model is created.
2. Workspace title/name generation: wrap each candidate's `streamText` attempt in `try/finally` and call cleanup for that candidate's model.
   - Ensure cleanup runs when the model does not call the expected tool and the loop continues.
   - Ensure cleanup runs when `streamText` or `toolResults` throws and the loop tries the next candidate.
3. Search for any other `streamText` owners using provider-created models before finalizing.
   - Current exploration found main stream manager and workspace title generation.
   - If new owners appear, apply the same cleanup pattern.

Quality gate after Phase 6:

- Lifecycle tests:
  - main stream completion closes once
  - main stream error closes once
  - main stream cancellation closes once
  - title generation success closes once
  - title generation failure/retry closes once per candidate model
  - internal multi-step/tool-calling stream does not close between steps

### Phase 7 — Validation and full static checks

Run validation in increasing scope:

1. Targeted tests added/modified in phases 1–6.
2. Typecheck.
3. Lint/fmt checks.
4. Full static check if the targeted suite and typecheck pass.

Suggested commands:

- `bun test src/common/config/schemas/providersConfig.test.ts`
- `bun test src/common/orpc/schemas/api.test.ts`
- `bun test src/node/services/providerService.test.ts`
- `bun test src/node/services/providerModelFactory.test.ts`
- `bun test src/node/services/streamManager.test.ts`
- `bun test src/browser/features/Settings/Sections/ProvidersSection.test.tsx`
- `make typecheck`
- `make lint`
- `make static-check`

Use `run_and_report` when running multiple validation steps in one shell call, per repo guidance.

## Dogfooding plan

Dogfooding is required before claiming the feature is ready. Live OpenAI runtime dogfooding is optional if credentials/endpoints are unavailable, but UI dogfooding should still run.

### Dogfood setup

1. Start an isolated dev-server environment.
   - Prefer `make dev-server-sandbox` for web/settings dogfooding so the run uses an isolated `MUX_ROOT` and free ports instead of the default `make dev` state.
   - Use `make dev-desktop-sandbox` only if Electron-specific desktop behavior must be verified.
2. Configure a test OpenAI provider.
   - If a real OpenAI API key is available, use it for live streaming verification.
   - If not, use deterministic UI-only dogfooding plus automated tests/mocks for runtime behavior.
3. Use browser/Electron automation to open Settings → Providers → OpenAI.
   - Use `agent-browser` or the repo's Electron automation helper.

### Dogfood scenarios

1. **Default state**
   - Confirm WebSocket transport is shown as disabled/off by default.
   - Screenshot: OpenAI settings default state.
2. **Enable in Responses mode**
   - Ensure Wire Format is Responses.
   - Enable WebSocket transport.
   - Confirm the UI persists the setting after refresh/reopen.
   - Screenshot: enabled setting in Responses mode.
3. **Chat Completions gating**
   - Switch Wire Format to Chat Completions.
   - Confirm the WebSocket control is disabled while the saved preference remains preserved.
   - Screenshot: disabled control in Chat Completions mode.
4. **Return to Responses**
   - Switch Wire Format back to Responses.
   - Confirm the previously saved WebSocket preference reappears as enabled.
   - Screenshot: restored enabled setting.
5. **Live stream, if credentials are available**
   - Send a short prompt with an OpenAI Responses model.
   - Confirm the stream completes or a WebSocket endpoint/proxy failure surfaces clearly without automatic HTTP fallback.
   - Interrupt/cancel one stream and then start another to check cleanup does not block subsequent streams.
   - Record a short video covering enable → prompt → stream/visible failure → Chat Completions disablement.
### Dogfood artifacts

Attach or save:

- screenshots for default, enabled, Chat Completions-disabled, and restored states
- a short video recording for the end-to-end UI flow
- notes on whether live OpenAI credentials were available and whether runtime streaming was verified live or by automated mocks only

## Acceptance criteria

- Existing users see no behavior change unless `webSocketTransportEnabled` is explicitly set true.
- Provider config accepts optional boolean `webSocketTransportEnabled` for the **Built-in OpenAI Provider**.
- Provider status exposes valid boolean values and omits invalid persisted values.
- OpenAI settings UI exposes the control near Wire Format with risk-aware helper copy.
- UI disables the control for Chat Completions and preserves the saved value.
- Runtime WebSocket activation requires `webSocketTransportEnabled === true` and effective Responses wire format.
- Runtime does not validate custom base URLs for WebSocket support.
- Runtime does not retry eligible WebSocket failures over HTTP.
- Existing OpenAI fetch behavior is preserved around the WebSocket composition seam.
- WebSocket resources close on stream completion, error, and cancellation for all provider-created-model stream owners.
- Automated tests cover config/status, settings UI, provider factory activation/gating, helper behavior, and cleanup lifecycle.
- Dogfooding produces screenshots and, when feasible, a video recording.

## Risks and mitigations

- **Risk: WebSocket package HTTP fallthrough bypasses Mux fetch wrappers.**
  - Mitigation: test the composition helper with mocked eligible and non-eligible requests; ensure non-eligible/fallthrough paths use the Mux base fetch.
- **Risk: cleanup symbol is lost when models are wrapped by DevTools middleware.**
  - Mitigation: attach cleanup to the final returned model or explicitly preserve/copy cleanup through wrapping; add a focused test if needed.
- **Risk: cleanup runs too early during AI SDK multi-step streams.**
  - Mitigation: run cleanup only in outer stream-owner `finally`, not inside fetch response completion per step.
- **Risk: cleanup misses title generation or future stream owners.**
  - Mitigation: search all `streamText` call sites that use provider-created models and add a helper usage pattern; consider a short code comment at the helper call explaining the invariant.
- **Risk: UI tests become tautological.**
  - Mitigation: test behavior and state changes rather than exact prose.
- **Risk: optional live dogfood cannot run without credentials.**
  - Mitigation: make live streaming dogfood optional, but require automated mocked runtime tests and UI screenshots.

## Handoff notes for implementation

- Keep changes surgical; do not refactor unrelated provider config or settings UI code.
- Prefer small deep modules over spreading package-specific logic through provider factory and stream owners.
- Use defensive assertions in the helper modules for impossible assumptions, especially cleanup function type and idempotent close state.
- Do not add request-level `muxProviderOptions.openai.webSocketTransportEnabled` support in this iteration.
- Do not add an ADR unless the implementation discovers a hard-to-reverse architectural choice not covered by this plan.

</details>

---

_Generated with `mux` • Model: `openai:gpt-5.5` • Thinking: `high` • Cost: `$71.27`_

<!-- mux-attribution: model=openai:gpt-5.5 thinking=high costs=71.27 -->
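
The Phase 4 cleanup helper in the plan above (`attachLanguageModelCleanup` / `runLanguageModelCleanup`) could be sketched roughly as follows. This is a minimal illustration of the listed responsibilities, not the code that shipped: `ModelLike` is a hypothetical stand-in for the AI SDK's `LanguageModel` type, and the symbol description and `console.warn` logging are assumptions.

```typescript
// Hypothetical stand-in for the AI SDK LanguageModel type; only object identity matters here.
type ModelLike = object;

// Private symbol so the attachment cannot collide with AI SDK/provider fields.
// The description string is an assumption made for this sketch.
const CLEANUP = Symbol("mux.languageModelCleanup");

type WithCleanup = { [CLEANUP]?: () => void };

function attachLanguageModelCleanup<T extends ModelLike>(model: T, cleanup: () => void): T {
  // Defensive assertion: the attached cleanup must be callable.
  if (typeof cleanup !== "function") {
    throw new TypeError("cleanup must be a function");
  }
  (model as WithCleanup)[CLEANUP] = cleanup;
  return model;
}

function runLanguageModelCleanup(model: ModelLike): void {
  const holder = model as WithCleanup;
  const cleanup = holder[CLEANUP];
  // Nothing attached, or cleanup already ran: repeated calls are a no-op.
  if (cleanup === undefined) return;
  // Clear before running so the closure is released and re-entrant calls do nothing.
  delete holder[CLEANUP];
  try {
    cleanup();
  } catch (error) {
    // Log instead of rethrowing so a cleanup failure never masks the stream's own outcome.
    console.warn("language model cleanup failed", error);
  }
}
```

A stream owner would call `runLanguageModelCleanup(model)` from its outer `finally` block. Because the symbol lives on the model object itself, the `createModel()` return type stays unchanged, which is the reason the plan gives for preferring this shape over returning `{ model, cleanup }`.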
🤖 fix: stop scroll-up jitter at bottom + harden auto-scroll ownership (coder#3226)

## Summary

Fixes the small-but-noticeable jitter when the user starts scrolling up from the very bottom while the chat transcript bottom-lock is engaged. Eventually a large enough wheel/touch delta would "win" against the rAF settle tick, but the first 1–3 notches of a slow gesture used to be snapped back to the bottom — felt like the scrollport was fighting the user.

## Background

`useAutoScroll.handleScroll`'s user-intent branch decided lock state from a single 8 px geometric threshold. A small wheel notch (typical mousewheel notch is 3–7 px) landed within that window, so the hook re-engaged the lock and the 60-frame rAF settle loop wrote `scrollTop = scrollHeight − clientHeight` on the next frame, snapping the user back to the bottom. The user could only "win" by accumulating a single tick of more than 8 px before the next rAF.

## Implementation

Two commits, both behavior-preserving outside the targeted regression and validated against the existing 22 hook unit tests + the `bottomLayoutShift` integration suite.

### Commit 1 — `fix: stop scroll-up jitter from the very bottom`

Asymmetric thresholds in `handleScroll`'s user-intent branch:

- **Locked → release** on > 1 px drift (`BOTTOM_LOCK_EPSILON_PX`). The very first wheel-up notch releases without rAF snap-back.
- **Released → relock** only when the user is moving **toward** the bottom (`currentScrollTop > previousScrollTop`) AND within 8 px (`USER_BOTTOM_RELOCK_THRESHOLD_PX`).

Direction is tracked with a single new ref (`lastScrollTopRef`) updated at the top of every `handleScroll` call. The no-intent paths (1 px drift correction while locked, geometric relock at 8 px after the intent window expires) are unchanged. The existing "scroll back to bottom and the lock re-engages" UX is preserved.

### Commit 2 — `fix: harden auto-scroll user-intent ownership`

Audit follow-ups (best-of-5 read-only audit converged on these):

1. **Filter delta-0 wheel events.** Cmd-wheel zoom on macOS, Shift-wheel for horizontal-only, Bluetooth-mouse jitter, and pinch gestures all dispatch `wheel` events with `deltaY === 0` (and often `deltaX === 0`). Without filtering, every phantom wheel cleared `programmaticDisableRef` and refreshed the 750 ms intent window, weakening every downstream gate that relied on those refs. New `handleScrollContainerWheel` is exposed by the hook; ChatPane wires it in place of `markUserScrollIntent`.
2. **Seed `lastScrollTopRef`** inside `disableAutoScroll` and `jumpToBottom`. The released-branch direction check compares `scrollTop` against this ref, but neither path always emits a scroll event (`disableAutoScroll` never does; `jumpToBottom` skips the write when `scrollTop` is already max). Without the seed, a small wheel-up notch following an explicit programmatic disable could be misread as "moving toward bottom" (e.g. `895 > 0`) and spuriously relock the lock that was just disabled.

## Validation

- `bun test src/browser/hooks/useAutoScroll.test.tsx` — 25 / 25 pass (19 prior + 6 new regression tests covering the four scenarios above).
- `bun x jest tests/ui/chat/bottomLayoutShift.test.ts` — passes (drift-correction / pin-on-resize / send / workspace-switch contracts unchanged).
- `make static-check` — passes locally end-to-end.

## Risks

- Behavior near the 1 px drift epsilon under hi-DPI / browser zoom is unchanged from before; `BOTTOM_LOCK_EPSILON_PX` was already used for the no-intent drift correction. The fix uses the same value for the locked-intent release path, so any pre-existing subpixel sensitivity is consistent across paths.
- The wheel filter will not mark intent on a wheel event with both deltas equal to 0 — by design. Users on assistive input devices that emit `deltaY = 0` but expect intent marking are unaffected because such events also do not move the scrollport, and our intent window only matters when scroll motion follows.
- `lastScrollTopRef` seeding is purely additive — every code path that writes to it before now still writes to it now; we just close two narrow staleness windows (`disableAutoScroll` with no follow-up scroll event, `jumpToBottom` when already at bottom).

## Pains

The audit phase (5 read-only sub-agents in parallel) was the right call here: 4 of 5 audits independently flagged Findings #1 (`programmaticDisableRef` bypass) and #2 (`lastScrollTopRef` cold-start), which I would have likely missed reasoning forward from the initial fix alone. The audits also agreed that the new direction-aware logic is the right primitive — none recommended walking it back.

Deferred to follow-up PRs (out of scope for "scroll-up jitter"):

- Workspace-switch hydration race (`hasLoadedTranscriptRows` flip mid-read snaps to bottom).
- ResizeObserver disconnect/reconnect on every `autoScroll` toggle.
- Tab key not in `TRANSCRIPT_SCROLL_KEYS` (keyboard-nav focus-induced scroll snaps back).
- Parallel patterns in `OutputTab` / `BashToolCall` / `InitMessage` (different sub-views, each with its own bottom-lock heuristic).

---

_Generated with `mux` • Model: `anthropic:claude-opus-4-7` • Thinking: `max` • Cost: `$11.30`_

<!-- mux-attribution: model=anthropic:claude-opus-4-7 thinking=max costs=11.30 -->
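
The asymmetric thresholds and the delta-0 wheel filter described in this PR can be illustrated with two pure functions. This is a sketch under stated assumptions rather than the hook's actual code: the constant names and values come from the description, while `ScrollSample`, `nextLockState`, and `isRealWheelGesture` are hypothetical names invented for this example.

```typescript
// Threshold names and values are taken from the PR text; everything else is illustrative.
const BOTTOM_LOCK_EPSILON_PX = 1;
const USER_BOTTOM_RELOCK_THRESHOLD_PX = 8;

// Hypothetical snapshot of the scroll container during a user-intent scroll event.
interface ScrollSample {
  scrollTop: number;
  previousScrollTop: number; // the value a ref such as lastScrollTopRef would hold
  scrollHeight: number;
  clientHeight: number;
}

// Returns whether the bottom lock should be engaged after this scroll event.
function nextLockState(locked: boolean, s: ScrollSample): boolean {
  const distanceFromBottom = s.scrollHeight - s.clientHeight - s.scrollTop;
  if (locked) {
    // Locked: stay locked only within the 1 px epsilon, so the first wheel-up notch releases.
    return distanceFromBottom <= BOTTOM_LOCK_EPSILON_PX;
  }
  // Released: relock only when moving toward the bottom AND within the 8 px window.
  const movingTowardBottom = s.scrollTop > s.previousScrollTop;
  return movingTowardBottom && distanceFromBottom <= USER_BOTTOM_RELOCK_THRESHOLD_PX;
}

// Phantom wheel events (both deltas zero) should not count as user scroll intent.
function isRealWheelGesture(deltaX: number, deltaY: number): boolean {
  return deltaX !== 0 || deltaY !== 0;
}
```

With a 1000 px tall transcript in a 100 px viewport (maximum `scrollTop` of 900), a 5 px wheel-up notch from the bottom releases the lock in this sketch, and the lock only re-engages once the user scrolls back down to within 8 px. Passing a stale `previousScrollTop` of `0` with `scrollTop` at `895` makes the released branch report "moving toward bottom", which is the cold-start misread that seeding `lastScrollTopRef` is meant to close.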
🤖 perf: smooth text streaming (kill cascade re-renders, model-aware reveal) (coder#3219)

## Summary

Streamed assistant text (and reasoning) was visibly jittery — periodic catch-up jumps every few seconds, rate stuck at ~72 chars/sec regardless of what the model emitted, and a sub-frame of work for the entire chat list on every delta. This PR makes the cadence smooth in three ordered fixes plus a TPS-display fix discovered during review: leaf-subscribe the streaming-stats pill so it stops invalidating `WorkspaceState`, replace the smoothing engine's hard-snap with a model-aware soft catch-up, compact streaming parts on append, and floor the TPS calculator's time span so a new stream's first deltas don't spike the displayed rate.

## Background

The renderer has had a two-clock smoothing model (`SmoothTextEngine` + `useSmoothStreamingText`) for a while, but several regressions defeated it:

1. `WorkspaceState.streamingTokenCount` / `streamingTPS` were computed inside the `getWorkspaceState` snapshot using `Date.now()`. Every coalesced delta produced a new snapshot reference, which cascaded `WorkspaceShell → ChatPane → MessageRenderer` through every row. `useDeferredValue` was bypassed for the entire stream by `shouldBypassDeferredMessages`, so reconciliation ran at the ingestion rate.
2. `getAdaptiveRate(backlog)` ignored the model's actual emission rate. With a fast model (~120 cps) and `BASE_CHARS_PER_SEC=72`, the visible cursor fell behind by ~5 chars per ingestion cycle until backlog crossed `MAX_VISUAL_LAG_CHARS=120`, at which point `enforceMaxVisualLag` snapped `visible := full - 120` and zeroed the budget — that snap is exactly the visible "catch-up jump".
3. `requestIdleCallback({ timeout: 100 })` was used for streaming deltas. The smoothing engine should be the only pacing layer; idle batching just feeds (2).
4. `handleStreamDelta` appended a fresh `{ type: "text" }` part per chunk; `mergeAdjacentParts` re-merged on every render. For a 10k-char reply that's tens of thousands of merges per turn.
5. `calculateTPS` divided by `now - firstDelta.timestamp`. With one delta that span is typically a few milliseconds, so e.g. `50 tokens / 0.005s = 10000 t/s`. Phase 1's microtask cadence exposed this — where the prior idle-callback batching used to mask it by sampling later — and Phase 2 wired TPS into the smoothing engine, amplifying its visibility.

## Implementation

Four commits, ordered so each phase is verifiable in isolation:

**Phase 1 — leaf-subscribe streaming stats, microtask ingestion (`775e9023c`)**

- Removed `streamingTokenCount` / `streamingTPS` from `WorkspaceState`.
- Added `WorkspaceStreamingStats` + `streamingStatsStore` (`MapStore`) + `useWorkspaceStreamingStats(workspaceId)` leaf hook (mirrors the existing `useWorkspaceStatsSnapshot` pattern at `WorkspaceStore.ts:4127`).
- Replaced `scheduleIdleStateBump` with `scheduleStreamingStateBump` for streaming delta types (`stream-delta`, `tool-call-delta`, `reasoning-delta`). It coalesces on `queueMicrotask` instead of an idle callback. `init-output` and `bash-output` keep the idle path (terminal-style throughput).
- Wired `cancelPendingStreamingBump` into stream-end / stream-abort / replay reset / `removeWorkspace`.
- `StreamingBarrier` now reads via the leaf hook.

**Phase 2 — model-aware smoothing engine, soft catch-up (`85fb141da`)**

- `SmoothTextEngine.update()` accepts an optional `liveCharsPerSec`. `getAdaptiveRate(backlog, liveCps)` combines a steady-state floor (`max(BASE, liveCps)`), a soft catch-up ramp that drains lag over `SOFT_CATCHUP_DRAIN_MS` once it exceeds `SOFT_CATCHUP_LAG_CHARS=60`, and the legacy backlog-pressure ramp (kept as upper bound).
- Replaced the hard-snap discontinuity with the soft ramp. `MAX_VISUAL_LAG_CHARS` is now 1024 (was 120) — a defensive safety net for paused-tab pathological bursts that normal streams never hit.
- Bumped `MIN_FRAME_CHARS` from 1 to 2 so reveals coalesce to ~30 Hz at the BASE rate (half the markdown re-parse cost; humans can't see the difference). Tail-end reveal still works because the gate is now `min(MIN_FRAME_CHARS, backlog)`.
- `useSmoothStreamingText` and `TypewriterMarkdown` thread `liveCharsPerSec` through; `TypewriterMarkdown` accepts a new `workspaceId` prop, forwarded from `AssistantMessage` and `ReasoningMessage` (via `MessageRenderer`).

**Phase 3 — compact-on-append, clean prop surface (`0a945ed7b`)**

- `StreamingMessageAggregator.handleStreamDelta` / `handleReasoningDelta` append into the previous adjacent text/reasoning part in place. For a 10k-char reply this drops `parts.length` from thousands to one and `mergeAdjacentParts` cost from O(N) to O(1). Backend persistence (`partial.json`, `chat.jsonl`) is unaffected — those writers live backend-side; this aggregator's `parts` is pure display state.
- `TypewriterMarkdown`: dropped the `deltas: string[]` shape (always passed as `[content]` literal — defeated `React.memo`) for `content: string`. Removed the manual `React.memo` and the inner `useMemo` for the streaming-context value (React Compiler handles both).

**Phase 4 — TPS calculator floor + stream-error token cleanup (`a476613be`)**

- `calculateTPS` now floors the divisor at `MIN_TPS_TIME_SPAN_MS = 1000`. With one delta the rate becomes `tokens / 1s` instead of `tokens / 0.005s`. The reported TPS smoothly ramps up over the first second of a stream instead of spiking and "dropping abruptly". Slight under-statement during the settling window is the trade-off — strictly preferable to an order-of-magnitude over-statement.
- The `stream-error` branch in `applyWorkspaceChatEventToAggregator` now calls `clearTokenState`, matching `stream-end` and `stream-abort`. Without it, the errored message's `deltaHistory` entry leaks into a follow-up stream's TPS calculation.
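
To make the Phase 4 floor concrete, here is a small sketch of a floored tokens-per-second calculation. It is not the real `calculateTPS`: the function name `calculateTpsSketch`, the `DeltaRecord` shape, and the argument list are assumptions for illustration. Only `MIN_TPS_TIME_SPAN_MS = 1000`, the use of the first delta's timestamp, and the "negative span returns 0" behavior come from the description.

```typescript
// Floor value taken from the PR text.
const MIN_TPS_TIME_SPAN_MS = 1000;

// Hypothetical shape of one recorded streaming delta.
interface DeltaRecord {
  tokens: number;
  timestamp: number; // milliseconds
}

// Illustrative only: the real calculator's signature is not shown in the PR.
function calculateTpsSketch(deltas: readonly DeltaRecord[], now: number): number {
  const first = deltas[0];
  if (first === undefined) return 0;

  const rawSpanMs = now - first.timestamp;
  // A negative span indicates clock skew; report 0 rather than a negative rate.
  if (rawSpanMs < 0) return 0;

  // Floor the divisor so a single early delta cannot produce a huge rate.
  const spanMs = Math.max(rawSpanMs, MIN_TPS_TIME_SPAN_MS);
  const totalTokens = deltas.reduce((sum, d) => sum + d.tokens, 0);
  return totalTokens / (spanMs / 1000);
}
```

For the example in the Background section, one 50-token delta observed 5 ms after it arrived yields 50 t/s in this sketch instead of 10000 t/s. Once the raw span exceeds one second the floor no longer applies and the raw span is used.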

## Validation

- `make typecheck` ✅
- `make lint` ✅
- Targeted streaming surface: 1009+ tests pass / 0 fail across `SmoothTextEngine`, `useSmoothStreamingText`, `StreamingMessageAggregator`, `applyWorkspaceChatEventToAggregator`, `StreamingTPSCalculator`, `TypewriterMarkdown`, `ReasoningMessage`, `StreamingBarrier{,View}`, `PinnedTodoList`, `WorkspaceStore`, plus the broader `src/browser/utils/messages/`, `src/browser/features/Messages/`, `src/browser/stores/`, and `src/browser/hooks/` suites.
- New behavioral tests:
  - `SmoothTextEngine.test.ts`: rate tracks `liveCharsPerSec`; soft catch-up engaged for 60–1024 char lags without snap; hard snap still fires above the safety threshold.
  - `StreamingTPSCalculator.test.ts`: 1s floor applied for tiny / zero spans; raw span used once it exceeds the floor; negative spans (clock skew) return 0.
  - `applyWorkspaceChatEventToAggregator.test.ts`: `stream-error` calls `clearTokenState`.

## Risks

Localized to the streaming display path; no protocol or persistence changes.

- **Re-render shape (Phase 1).** Streaming deltas now bump `WorkspaceState` once per microtask drain instead of once per `requestIdleCallback`. Net effect under heavy load is *less* work because the snapshot stops invalidating per-delta TPS, but it's a behavioral shift, verified via the existing 106-test `WorkspaceStore` suite plus targeted `StreamingBarrier` tests.
- **Smoothing engine constants (Phase 2).** `MAX_VISUAL_LAG_CHARS` jumped 120 → 1024 and `MIN_FRAME_CHARS` 1 → 2. Existing test "caps visual lag when incoming text jumps ahead" still passes against the new soft-ramp behavior, and the new "hard-snaps when lag exceeds the safety threshold" test confirms the safety net still functions.
- **Compact-on-append (Phase 3).** Touches the in-memory `parts` array shape during streaming. The aggregator already had compaction at stream-end (`compactMessageParts`); we're just doing it eagerly. No on-disk format change. All `StreamingMessageAggregator` and `applyWorkspaceChatEventToAggregator` tests pass.
- **TPS floor (Phase 4).** The reported rate during the first second of a stream now under-counts versus the previous (mathematically broken) value. Backend `sessionTimingService` also calls `calculateTPS`; same floor applies there but the backend's window is broader so the visible effect is smaller. No risk to persisted usage / cost calculations: those use `usage.outputTokens / duration` from the API, not the streaming TPS estimator.

---

_Generated with `mux` • Model: `anthropic:claude-opus-4-7` • Thinking: `xhigh` • Cost: `$23.55`_

<!-- mux-attribution: model=anthropic:claude-opus-4-7 thinking=xhigh costs=23.55 -->
🤖 refactor: auto-cleanup (coder#3169)

## Summary

Long-lived auto-cleanup PR that accumulates low-risk, behavior-preserving refactors picked from recent `main` commits.

## Changes

### Use shared `isAbortError` utility in AuthTokenModal

Replace two inline `error instanceof DOMException && error.name === "AbortError"` checks in `AuthTokenModal.tsx` with the existing shared `isAbortError()` utility from `@/browser/utils/isAbortError`, deduplicating the abort detection logic.

### Extract `extractChunkDeltaText` helper to deduplicate advisor chunk parsing

Pull the repeated `switch` over `chunk.type` (extracting `chunk.textDelta` or `chunk.argsTextDelta`) into a single `extractChunkDeltaText()` helper in `advisorService.ts`, then call it from both `executeAdvisorStream` and `executeAdvisorStreamWithRetry`.

### Remove unnecessary exports from `skillFileUtils`

Un-export `parseSkillFile`, `serializeSkillFile`, and `SKILL_FILENAME` from `src/node/services/agentSkills/skillFileUtils.ts`; all three are only used within the same file, so the `export` keyword was unnecessary.

### Remove dead `getCancelledCompactionKey` storage helper

Remove the `getCancelledCompactionKey` function and its entry in the `EPHEMERAL_WORKSPACE_KEY_FUNCTIONS` array from `storage.ts`; the only consumer (`useResumeManager.ts`) was deleted, leaving this as dead code.

### Remove dead `quickReviewNotes` module

Remove `src/browser/utils/review/quickReviewNotes.ts` and its test file (482 lines). The `buildQuickLineReviewNote` and `buildQuickHunkReviewNote` functions were never imported by any production code since their introduction in PR coder#2448.

### Un-export `isBashOutputTool` in messageUtils

Remove the `export` keyword from `isBashOutputTool` in `src/browser/utils/messages/messageUtils.ts`; the type guard is only used within the same file by `computeBashOutputGroupInfos`, so the export was unnecessary.

### Deduplicate `hasErrorCode` in submoduleSync

Replace inline `NodeJS.ErrnoException`-like error-code checks in `submoduleSync.ts` with calls to the existing `hasErrorCode` helper, keeping a single canonical place where error-code narrowing lives.

### Simplify `hasCompletedDescendants` to reuse `listCompletedDescendantAgentTaskIds`

Rewrite `hasCompletedDescendants` to delegate to the existing `listCompletedDescendantAgentTaskIds` helper instead of re-implementing the traversal, collapsing the two code paths into one.

### Reuse `anthropicSupportsNativeXhigh` in Anthropic fetch wrapper

Replace the duplicated Opus 4.7+ regex inside `wrapFetchWithAnthropicCacheControl` (`src/node/services/providerModelFactory.ts`) with a call to the existing `anthropicSupportsNativeXhigh` helper from `src/common/types/thinking.ts`. The helper already performs the same regex check plus provider-prefix normalization (e.g., `anthropic/claude-opus-4-7` via the `ai-model-id` gateway header), keeping the wire-level detection and the policy-level detection in one place.

### Extract `getFetchInputUrl` helper to deduplicate URL extraction

The OpenAI/Codex and Copilot fetch wrappers in `providerModelFactory.ts` each contained an identical 15-line IIFE that extracted a URL string from the `fetch` `input` argument (handling string, `URL`, and `Request`-like shapes). Extract the logic into a single `getFetchInputUrl` helper so both wrappers share one implementation. Behavior-preserving: the helper returns the same empty-string fallback on unrecognized inputs, so callers continue to fall through to normal fetch behavior without throwing.

### Extract `clonePersistedToolModelUsage` helper in streamManager

The deep-clone pattern for `PersistedToolModelUsage` (spread event, fresh `usage` object, conditional `providerMetadata`) was duplicated between `recordToolModelUsage` and the stream-end tool-usage snapshot in `streamManager.ts`. Extract a single file-local helper so both sites share the same implementation. Behavior-preserving: both callsites continue to produce structurally identical clones.

### Reuse `getClosestTranscriptAncestor` in `getTranscriptContextMenuLink`

The new `getTranscriptContextMenuLink` helper (added in coder#3188) inlined the same "resolve event target → `element.closest(selector)` → require both to stay within the transcript root" pattern that `getClosestTranscriptAncestor` (defined a few lines above in the same file) already implements. Delegate to the shared helper so the null/contains guards live in one place. Behavior-preserving: the helper returns null for a null/outside-root target, then `element.closest("a[href]")`, then null again if the anchor is outside the transcript root, identical to the previous inline checks. All 22 `transcriptContextMenu` tests continue to pass.

### Remove duplicate `gpt-5.5-pro` thinking-policy test

When coder#3192 renamed `gpt-5.4-pro` → `gpt-5.5-pro` across `src/common/utils/thinking/policy.test.ts`, it accidentally introduced a third `returns medium/high/xhigh for gpt-5.5-pro` test that is byte-identical to the renamed first occurrence (the two remaining tests are the bare-prefix and `with version suffix` variants; the deleted block had no version suffix and no gateway prefix). Drop the duplicate so the suite has one canonical no-suffix test, one mux-gateway test, and one version-suffix test. Behavior-preserving: `getThinkingPolicyForModel` coverage for `gpt-5.5-pro` is unchanged; 63 / 63 tests in `policy.test.ts` continue to pass.

### Extract `getAppProxyBasePathFromRequestValue` helper in orpc server

The orpc server's public-base-path detection in `src/node/orpc/server.ts` repeated the pattern `parsePathnameFromRequestValue(value) → getAppProxyBasePathFromPathname(...)` across four callsites (forwarded headers, the `originalUrl` / `url` loop, the referer header, and the direct app-proxy handler-prefix calculator). Extract a single `getAppProxyBasePathFromRequestValue` helper that performs the two-step normalize-then-classify operation, then call it from every site. Behavior-preserving: each callsite still produces `null` when the value is absent or yields an invalid pathname, and otherwise returns the same parsed app-proxy base path. All 52 tests in `src/node/orpc/server.test.ts` continue to pass.

### Inline `getRoutePathnameForBaseHref` wrapper in orpc server

The new helper added in coder#3195 was a one-line shim that simply renamed `getPathnameFromRequestUrl(req.url)` to fit the surrounding "for base href" naming theme. It was used in only two adjacent functions (`shouldInjectSlashlessRootRedirect` and `getPublicBaseHref`), and the existing `getPathnameFromRequestUrl` already conveys the intent at the callsite. Inline both calls so the request-URL → pathname conversion lives at the points of use, removing one layer of indirection without changing behavior. All 52 tests in `src/node/orpc/server.test.ts` continue to pass.

### Remove dead `AdvisorToolResultSchema` definitions

`AdvisorToolResultSchema` and its three constituent schemas (`AdvisorToolAdviceResultSchema`, `AdvisorToolLimitResultSchema`, `AdvisorToolErrorResultSchema`) in `src/common/utils/tools/toolDefinitions.ts` were introduced alongside the experimental advisor tool in coder#3157 but were never imported anywhere: neither by `src/common/types/tools.ts` (which derives the public advisor result shape from a different type local to `AdvisorToolCall.tsx`), nor by the advisor tool implementation itself, nor by any test. Unlike the analogous `TaskToolResultSchema` / `TaskAwaitToolResultSchema` / `TaskApplyGitPatchToolResultSchema` / `TaskTerminateToolResultSchema` (all of which are imported via `z.infer` in `src/common/types/tools.ts`), the advisor variant had no consumer. Drop the four dead schemas; the file shrinks by ~32 lines and keeps `AdvisorToolInputSchema` (which is imported by `advisor.ts`) intact. Behavior-preserving.

### Reuse `getProviderPolicy()` in custom-provider `getConfig()` loop

`ProviderService.getConfig()`'s custom-provider branch inlined the same "if enforced, look up `providerAccess` entry → narrow to `{ forcedBaseUrl, allowedModels }`" lookup that the existing private `getProviderPolicy()` helper already implements (and that other callsites such as `addCustomOpenAICompatibleProvider` use). Replace the inline lookup with a call to `getProviderPolicy(providerId)` so the small policy-shape projection lives in one place. Behavior-preserving: the only structural difference is that, when policy is not enforced, `getProviderPolicy()` returns `{}` while the inline form passed `{ forcedBaseUrl: undefined, allowedModels: null }`, but `buildCustomProviderConfigInfo` normalizes both via `policy?.forcedBaseUrl ?? resolveConfigBaseUrl(...)` and `policy?.allowedModels ?? null`, so the resulting `ProviderConfigInfo` is byte-identical. All 74 tests in `providerService.test.ts` continue to pass.

### Collapse task-group parent rail offset into shared helper

After coder#3199 introduced `getTaskGroupMemberDepth` and set `TASK_GROUP_MEMBER_PARENT_RAIL_OFFSET_PX = SIDEBAR_LEADING_SLOT_CENTER_OFFSET_PX`, the `task-group-member` branch of `getSubAgentParentRailX` in `src/browser/components/sidebarItemLayout.ts` reduced to `getSidebarLeadingSlotCenterX(depth)`. Replace the inline `getSidebarItemPaddingLeft(depth) + TASK_GROUP_MEMBER_PARENT_RAIL_OFFSET_PX` arithmetic with a call to the existing helper and drop the now-redundant constant, leaving the leading-slot center offset defined exactly once. Behavior-preserving: `getSubAgentParentRailX` still returns `38` at `memberDepth = 2.5`, matching the pinned values in `sidebarItemLayout.test.ts` (and the equivalent `getSubAgentChildStatusCenterX` result). All 40 tests in `sidebarItemLayout.test.ts`, `AgentListItem.test.tsx`, and `ProjectSidebar.test.tsx` continue to pass.

### Remove unnecessary exports from inline-skill utilities

Un-export four interfaces in the new inline-skill helper files added in coder#3204: `InlineSkillSuggestionContext` and `InlineSkillSuggestionRefreshContext` in `src/browser/utils/agentSkills/inlineSkillSuggestions.ts`, plus `InlineSkillCursorMatch` and `InlineSkillResolveOptions` in `src/browser/utils/agentSkills/inlineSkillReferences.ts`. All four are only used as parameter types within their defining files: the test files import the value functions and pass object-literal arguments, and the consumer call-sites in `ChatInput/index.tsx` only import the exported functions, never the parameter type names. So the `export` keyword was unnecessary. Behavior-preserving and type-only: TypeScript compile passes for both browser and main configs, and the 49 tests in `inlineSkillSuggestions.test.ts` and `inlineSkillReferences.test.ts` continue to pass.

> The earlier "sync thinking-policy doc comments with gpt-5.5 regex"
> cleanup was dropped during rebase: coder#3192 superseded it by retiring
> `gpt-5.4` from those comments entirely, so the comment-only diff
> became redundant.

> The earlier "reuse `hasNonEmptyString` helper for apiKey checks"
> cleanup was dropped during rebase: coder#3202 restructured
> `resolveProviderCredentials` to delegate to a new
> `resolveApiKeyCandidate` helper (subsuming the inline check) and
> already updated `hasAnyConfiguredProvider` to use `hasNonEmptyString`
> directly, so the cleanup diff no longer applied cleanly and was no
> longer needed.

### Replace stale `system-1` reference in telemetry comment

The `ExperimentOverriddenPayload.experimentId` JSDoc in `src/common/telemetry/payload.ts` used `'system-1'` as an example experiment ID, but the System 1 feature was removed wholesale in coder#3207 and that experiment ID no longer exists. Swap the example for a current entry from `EXPERIMENT_IDS` (`'agent-browser'`) so the JSDoc points readers at a real experiment. Behavior-preserving: comment-only change.

### Extract `isLightThemeMode` helper for Shiki theme detection

Three callsites independently encoded `themeMode === "light" || themeMode.endsWith("-light")` to map a theme-mode string (including namespaced variants like `flexoki-light`) to the light Shiki theme:

- `highlightDiffChunk.ts` had a private `isLightTheme(theme: ThemeMode)` helper.
- `HighlightedCode.tsx` and `MarkdownComponents.tsx` had it inline (the latter with an intermediate `isLight` local).

Promote the predicate to `isLightThemeMode` in `src/browser/utils/highlighting/shiki-shared.ts` (next to `SHIKI_DARK_THEME` / `SHIKI_LIGHT_THEME` and `mapToShikiLang`) and route all three callsites through it. The suffix convention now has a single source of truth for the light/dark mapping. Behavior-preserving.

### Remove unnecessary exports from `fileRead`

After coder#3208 removed the file explorer / file viewer flow, the only external consumers of `src/browser/utils/fileRead.ts` are `ImmersiveReviewView` (`buildReadFileScript`, `processFileContents`) and the colocated test (`buildReadFileScript`, `EXIT_CODE_TOO_LARGE`, `processFileContents`). Un-export the helpers that are now only used inside the module itself (`MAX_FILE_SIZE`, `shellEscape`, `base64ToUint8Array`, `detectImageType`, `detectSvg`, `detectBinary`, `parseReadFileOutput`) so the module surface accurately reflects its public API. Behavior-preserving.

Auto-cleanup checkpoint: d1c0109

---

_Generated with `mux` • Model: `anthropic:claude-opus-4-7` • Thinking: `xhigh`_

<!-- mux-attribution: model=anthropic:claude-opus-4-7 thinking=xhigh -->

---------

Co-authored-by: mux-bot[bot] <264182336+mux-bot[bot]@users.noreply.github.com>
Co-authored-by: Mux <noreply@coder.com>
Co-authored-by: mux-bot <mux-bot@coder.com>
Co-authored-by: ammar-agent <ammar+ai@ammar.io>
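
The light-theme predicate from the `isLightThemeMode` cleanup in the auto-cleanup entry above can be sketched as follows. This is a hedged illustration, not the code in `shiki-shared.ts`: the parameter is typed as a plain `string` because the real `ThemeMode` type is not shown, and the two theme-name values and the `pickShikiTheme` helper are hypothetical placeholders.

```typescript
// Placeholder values; the real SHIKI_DARK_THEME / SHIKI_LIGHT_THEME constants are not shown in the PR.
const SHIKI_DARK_THEME = "placeholder-dark";
const SHIKI_LIGHT_THEME = "placeholder-light";

// Predicate as quoted in the PR description: plain "light" or any "-light" suffixed variant.
function isLightThemeMode(themeMode: string): boolean {
  return themeMode === "light" || themeMode.endsWith("-light");
}

// Hypothetical callsite helper showing how the three callsites could share the predicate.
function pickShikiTheme(themeMode: string): string {
  return isLightThemeMode(themeMode) ? SHIKI_LIGHT_THEME : SHIKI_DARK_THEME;
}
```

With this shape, namespaced variants such as `flexoki-light` map to the light theme without each callsite repeating the suffix check.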
🤖 fix: align variant sub-agent connectors (coder#3199)

## Summary

Fixes sidebar connector alignment for expanded variants/best-of sub-agent groups by rendering grouped members on the same indentation grid as the task-group header icon.

## Background

Grouped sub-agents use a task-group header row with an extra disclosure chevron before the group icon. Expanded members were only indented one depth level below that header, which left the parent-to-child connector rail visually offset from the grouped parent row.

## Implementation

- Added a shared task-group member depth helper in the sidebar layout utilities.
- Applied that helper when rendering expanded task-group members from `ProjectSidebar`.
- Added tests that assert grouped member depth/layout propagation and rendered connector geometry.

## Validation

- `bun test src/browser/components/sidebarItemLayout.test.ts src/browser/components/AgentListItem/AgentListItem.test.tsx src/browser/components/ProjectSidebar/ProjectSidebar.test.tsx`
- `make typecheck`
- `make fmt-check`
- `make lint`
- `make static-check`

## Risks

Low-to-medium risk, scoped to left-sidebar task-group sub-agent row indentation and connector geometry.

---

_Generated with `mux` • Model: `openai:gpt-5.5` • Thinking: `high` • Cost: `$8.04`_

<!-- mux-attribution: model=openai:gpt-5.5 thinking=high costs=8.04 -->
🤖 fix: increase advisor question limit (coder#3200)

## Summary

Increase the advisor tool `question` input limit from 500 to 2000 characters so agents can include enough context for strategic tradeoff questions while keeping the field bounded.

## Background

The advisor tool is meant for planning ambiguity and architectural decisions, where a short one-line prompt can omit important constraints. The previous 500-character cap was tighter than needed for a compact brief.

## Implementation

Updated the shared advisor tool input schema to allow up to 2000 characters and documented why the bound is intentionally roomier than before.

## Validation

- `make static-check`

## Risks

Low. This only changes validation for advisor tool input length; the field remains bounded and the added context is small relative to the advisor transcript.

---

_Generated with `mux` • Model: `openai:gpt-5.5` • Thinking: `xhigh` • Cost: `$0.39`_

<!-- mux-attribution: model=openai:gpt-5.5 thinking=xhigh costs=0.39 -->
🤖 bench: use GPT-5.5 for tbench (coder#3193)

> Mux working on behalf of Mike.

## Summary

Updates nightly Terminal-Bench defaults to run Opus 4.7 at xhigh thinking and GPT-5.5 at high thinking while dropping the older GPT Codex model from the default matrix. Adds leaderboard metadata for Opus 4.7 and GPT-5.5, and refreshes TBench workflow and skill examples.

## Background

GPT-5.5 xhigh runs were timing out in TBench, so the nightly workflow keeps GPT-5.5 at high while preserving xhigh for Opus 4.7.

## Validation

- `make static-check`
- `python3 -m py_compile benchmarks/terminal_bench/prepare_leaderboard_submission.py`
- `go run github.com/rhysd/actionlint/cmd/actionlint@v1.7.7 .github/workflows/nightly-terminal-bench.yml .github/workflows/terminal-bench.yml`
- `/home/coder/.local/bin/uvx ruff format --check benchmarks/terminal_bench/prepare_leaderboard_submission.py`
- `git diff --check`

---

_Generated with `mux` • Model: `openai:gpt-5.5` • Thinking: `xhigh` • Cost: `$16.42`_

<!-- mux-attribution: model=openai:gpt-5.5 thinking=xhigh costs=16.42 -->