feat: add page.cua (vision) and page.domCua (DOM-id) toolsets #114

Merged: SawyerHood merged 8 commits into main from dev-browser-cua on Jun 5, 2026

@SawyerHood (Owner)

Ports the interaction tiers from OpenAI Codex's chrome plugin (tab.cua / tab.dom_cua) into dev-browser as two new namespaces on every sandbox page, so agents can act by screenshot coordinates or by snapshot node ids when locators aren't enough.

page.cua — pixel/vision tier

  • click / doubleClick / drag / move / scroll / keypress / type / screenshot, each taking a camelCase options object.
  • screenshot() saves a JPEG whose pixels map 1:1 onto cua coordinates at any DPR and returns {path, width, height}. Playwright silently ignores scale:'css' on viewport:null pages (headed + connected Chrome, 2x on Retina), so the helper detects the mismatch by parsing the JPEG dimensions and downscales in-page via OffscreenCanvas.
  • click waits ~1s for a click-triggered main-frame navigation (then up to 10s for the load); waitForNavigation: false opts out.
  • Key chords use Codex's alias table (ctrl → ControlOrMeta, ctrl+y → ControlOrMeta+Shift+z, …); modifiers are tracked and released in reverse order in a finally block, so a failed chord can't leave keys stuck on the persistent page.
  • Scroll is delta-direct (mouse.move + mouse.wheel); buttons are left|middle|right with a clear error otherwise.
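
The modifier handling described above can be sketched roughly as follows. This is an illustration, not the code from the PR: `KeyboardLike`, `ALIASES`, and `pressChord` are hypothetical names, only the `ctrl` alias is taken from the description, and the keyboard is synchronous for brevity (Playwright's `keyboard.down` / `keyboard.up` return promises).

```typescript
// Hypothetical minimal keyboard surface, synchronous for brevity.
interface KeyboardLike {
  down(key: string): void;
  up(key: string): void;
}

// Assumed alias entries; only ctrl -> ControlOrMeta comes from the PR text.
const ALIASES: Record<string, string> = {
  ctrl: "ControlOrMeta",
  shift: "Shift",
  alt: "Alt",
};

function pressChord(keyboard: KeyboardLike, chord: string): void {
  const keys = chord.split("+").map((k) => ALIASES[k.toLowerCase()] || k);
  const main = keys.pop(); // the last key is the one actually pressed
  if (main === undefined || main === "") {
    throw new Error(`Empty key chord: "${chord}"`);
  }
  const held: string[] = [];
  try {
    for (const modifier of keys) {
      keyboard.down(modifier);
      held.push(modifier); // recorded only after a successful press
    }
    keyboard.down(main);
    keyboard.up(main);
  } finally {
    // Release in reverse order, even when a press above threw.
    for (const modifier of held.reverse()) keyboard.up(modifier);
  }
}
```

Because `held` only grows after a successful `down`, an invalid key in the middle of a chord releases exactly the modifiers that were actually pressed.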

page.domCua — DOM-id tier

  • getVisibleDom() runs a self-contained in-page walker (serialized via String(fn)) per frame: interactable + visible-in-viewport predicates, shadow DOM, pseudo-HTML output lines (<button node_id=42>Submit</button>), 200-line/20k-char/50-per-frame budgets with explicit truncation markers.
  • click / doubleClick / scroll / type / keypress act by nodeId (number or numeric string). Ids are sticky across snapshots, survive across CLI invocations on named pages, and are minted from a random high base keyed by a per-document token. A stale id therefore always fails fast with "DOM node N is stale or missing — re-run getVisibleDom()" instead of silently clicking whatever now owns that number after a navigation.
  • Acting resolves the element in its frame, scrolls it into view (3s cap), and clicks its center through the shared cua pipeline, so iframes (including frame offsets) work.
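
One way to picture the sticky-id scheme is the small registry below. It is a sketch under stated assumptions, not the walker from the PR: `NodeIdRegistry`, `idFor`, and `resolve` are hypothetical names, `elementKey` stands in for a real element reference, and the base range (100000 to 999999) is an arbitrary choice for illustration.

```typescript
interface NodeRecord {
  docToken: string;
  elementKey: string;
}

class NodeIdRegistry {
  private next: number;
  private ids = new Map<string, number>(); // "docToken\u0000elementKey" -> id
  private records = new Map<number, NodeRecord>(); // id -> owning document and element

  constructor(random: () => number = Math.random) {
    // Start from a random high base so ids from an earlier document
    // are unlikely to coincide with freshly minted ones.
    this.next = 100000 + Math.floor(random() * 900000);
  }

  // Returns the same id for the same element in the same document.
  idFor(docToken: string, elementKey: string): number {
    const key = docToken + "\u0000" + elementKey;
    let id = this.ids.get(key);
    if (id === undefined) {
      id = this.next++;
      this.ids.set(key, id);
      this.records.set(id, { docToken, elementKey });
    }
    return id;
  }

  // Fails fast when the id belongs to a document that is no longer live.
  resolve(id: number, liveDocToken: string): string {
    const record = this.records.get(id);
    if (record === undefined || record.docToken !== liveDocToken) {
      throw new Error(`DOM node ${id} is stale or missing`);
    }
    return record.elementKey;
  }
}
```

Keying by document token means a navigated frame mints fresh ids even for an element that looks identical, so an id captured before the navigation cannot resolve to the new element.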

Prerequisite fix

QuickJS Error.stack has no "Name: message" header, and formatError preferred the stack, so thrown script error messages were dropped entirely (throw new Error("boom") printed only frame lines). formatError now composes the header (extracted to format-error.ts for testability); reproduced with a failing test first.
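
A minimal sketch of the header-composition idea, assuming a simplified error shape. The real format-error.ts is not shown in this conversation, so the `ErrorLike` type and the exact rules below are illustrative only.

```typescript
// Simplified stand-in for whatever error shape the sandbox reports.
interface ErrorLike {
  name?: string;
  message?: string;
  stack?: string;
}

function formatError(error: ErrorLike): string {
  const name = error.name || "Error";
  const header = error.message ? `${name}: ${error.message}` : name;
  const stack = (error.stack || "").replace(/\s+$/, "");
  if (stack === "") return header;
  // V8-style stacks already start with the header; do not repeat it.
  if (stack.indexOf(header) === 0) return stack;
  // QuickJS-style stacks contain only frame lines, so prepend the header.
  return `${header}\n${stack}`;
}
```

With a QuickJS-style stack of only frame lines, the message is now kept in front of the frames; a V8-style stack passes through unchanged.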

Tests & docs

  • ~1,950 lines of new tests (cua.test.ts, dom-cua.test.ts, format-error.test.ts): exact-coordinate assertions, clip semantics verified while scrolled, iframe act-by-id, id-reuse-after-navigation regressions (same-origin, cross-origin, child-frame), isolated-realm walker serialization check, cross-invocation snapshot→act, and the Retina downscale path. Full suite green; tsc, prettier, both bundles, and cargo build clean.
  • cli/llm-guide.txt gains Vision and DOM-id workflow sections, the tier-preference ladder, and method-table rows; README + CHANGELOG updated.

Verified live

Tested end-to-end against a real running Chrome over --connect: domCua snapshot of google.com, click search box by node id, cua.type + Enter, results screenshot. The Retina coordinate-contract bug was found by exactly this test and fixed in the last commit.

Remaining manual checks before release: headed-mode Retina pass and a true cross-origin (OOPIF) iframe click.

🤖 Generated with Claude Code

SawyerHood and others added 8 commits June 4, 2026 22:19
QuickJS Error.stack contains only frame lines (no "Name: message"
header, unlike V8), so formatError's stack-first formatting dropped the
thrown message entirely from stderr. Compose the header in formatError
(extracted to format-error.ts), after #toError has applied its prefix,
and skip it when the stack already carries one. Also add Buffer.isBuffer
to the QuickJS Buffer polyfill.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- start the public id counter at a random high base whenever no inherited
  counter exists, so cross-origin navigations (empty sessionStorage) never
  reuse pre-navigation node ids
- key the sticky id map by a per-document token minted in the walker, so a
  navigated child frame gets fresh ids instead of recycling old ones
- track successfully-pressed modifier keys and release them in a finally
  covering the down loop, so an invalid key never leaves modifiers held on
  the persistent page
- pin clip screenshot semantics as viewport-relative by scrolling first

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Agents regex node ids out of the snapshot text, so they arrive as
strings; coerce digit-only strings instead of erroring.
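
The coercion described here might look roughly like the helper below. `coerceNodeId` is a hypothetical name, and the accepted forms (non-negative integers and digit-only strings) are an assumption based on the commit message.

```typescript
function coerceNodeId(value: number | string): number {
  if (typeof value === "number") {
    // Accept only non-negative whole numbers.
    if (isFinite(value) && Math.floor(value) === value && value >= 0) return value;
  } else if (/^\d+$/.test(value)) {
    // Digit-only strings, e.g. ids pulled out of snapshot text with a regex.
    return Number(value);
  }
  throw new TypeError(
    `nodeId must be a non-negative integer or digit-only string, got ${String(value)}`
  );
}
```

Rejecting anything that is not purely digits keeps inputs such as "42px" or " 42" from being silently truncated to a different node.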

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Playwright ignores scale:'css' on viewport:null pages (headed and
connected Chrome), returning device-pixel images that break the 1:1
cua coordinate contract on Retina displays. Detect the mismatch by
parsing the JPEG dimensions and rescale in-page via OffscreenCanvas.
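
To illustrate the detection step, the sketch below reads the pixel dimensions from a JPEG's start-of-frame (SOF) segment and compares them with the CSS viewport width. `jpegDimensions` and `downscaleFactor` are hypothetical names, and the actual helper in the PR may parse the file differently. The OffscreenCanvas rescale itself is not shown.

```typescript
function jpegDimensions(bytes: Uint8Array): { width: number; height: number } {
  if (bytes.length < 4 || bytes[0] !== 0xff || bytes[1] !== 0xd8) {
    throw new Error("Not a JPEG: missing SOI marker");
  }
  let offset = 2;
  while (offset + 4 <= bytes.length) {
    if (bytes[offset] !== 0xff) throw new Error(`Expected marker at byte ${offset}`);
    const marker = bytes[offset + 1];
    if (marker === 0xff) {
      offset += 1; // fill byte before a marker
      continue;
    }
    const length = (bytes[offset + 2] << 8) | bytes[offset + 3];
    // SOF0..SOF15 carry the frame size; 0xC4, 0xC8 and 0xCC are not SOF markers.
    const isSof =
      marker >= 0xc0 && marker <= 0xcf &&
      marker !== 0xc4 && marker !== 0xc8 && marker !== 0xcc;
    if (isSof) {
      if (offset + 9 > bytes.length) throw new Error("Truncated SOF segment");
      // Layout after the length field: precision (1), height (2), width (2).
      const height = (bytes[offset + 5] << 8) | bytes[offset + 6];
      const width = (bytes[offset + 7] << 8) | bytes[offset + 8];
      return { width, height };
    }
    if (length < 2) throw new Error("Invalid segment length");
    offset += 2 + length; // skip marker plus segment payload
  }
  throw new Error("No SOF marker found");
}

// Ratio of decoded pixels to CSS pixels; 1 means the image already maps 1:1.
function downscaleFactor(decodedWidth: number, cssViewportWidth: number): number {
  return decodedWidth / cssViewportWidth;
}
```

On a 2x Retina page with a 1280 CSS-pixel viewport, a device-pixel screenshot decodes as 2560 wide, giving a factor of 2 and signalling that a rescale is needed.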

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6d0d1bc30b


function frameKey(frame: Frame): string {
  const name = frame.name();
  if (name) return name;

[P2] Avoid frame-name collisions when resolving DOM IDs

When a page contains multiple iframes with the same non-empty name, this key collapses all of them to the same value, so #resolveNodeCenter() later picks the first matching frame and can click/type against the wrong iframe (or report the node stale) for IDs from the later frames. Include the index path or another per-frame discriminator even for named frames so snapshot IDs remain tied to the frame they came from.
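
One possible shape for the suggested fix is sketched below, assuming a minimal `FrameLike` interface that mirrors the `name()`, `parentFrame()`, and `childFrames()` methods on Playwright's Frame. The key format is an assumption, not what the repository adopted.

```typescript
// Minimal structural stand-in for the parts of a frame this sketch needs.
interface FrameLike {
  name(): string;
  parentFrame(): FrameLike | null;
  childFrames(): FrameLike[];
}

function frameKey(frame: FrameLike): string {
  // Walk up to the main frame, recording each frame's index among its siblings.
  const path: number[] = [];
  let current = frame;
  for (let parent = current.parentFrame(); parent; parent = current.parentFrame()) {
    path.unshift(parent.childFrames().indexOf(current));
    current = parent;
  }
  const indexPath = path.length > 0 ? path.join(".") : "main";
  const name = frame.name();
  // Keep the name for readability, but always include the index path.
  return name ? `${indexPath}:${name}` : indexPath;
}
```

Two sibling iframes that share a name then get distinct keys such as 0:ad and 1:ad. An index path is only stable while the frame tree is unchanged, so this sketch does not address frames being inserted or removed between a snapshot and an action.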

@SawyerHood SawyerHood merged commit a2cbafb into main Jun 5, 2026
6 checks passed