# Benchmarks and Comparison Gates

Resource safety:

- Run these commands one at a time. Do not parallelize `compare:static*`, `compare:tokens*`, or browser-backed tests.
- `compare:static*` may launch `agent-browser` and Chromium. Check for existing browser work before starting, and confirm processes are cleaned up afterward.
- Use `pnpm readiness:audit` before claiming the agent-readiness objective is complete. It checks that the single-worker validation rules, fixture gate, comparison gate, process checker, and README/docs split are still wired in.
- Use `pnpm readiness:real-page-smoke` for the smallest real-page check. It fetches `https://example.com` with `--agent-brief` and does not launch Puppeteer or `agent-browser`.
- Use `pnpm readiness:search-smoke` for the smallest real search check. It runs `--search "ax-grep npm" --engine auto --agent-brief`, verifies engine attempts, and does not launch Puppeteer or `agent-browser`.
- Use `pnpm readiness:agent-browser-smoke` for the smallest `agent-browser` comparison set. It checks `https://example.com`, `https://books.toscrape.com/`, `https://news.ycombinator.com`, and `https://www.gov.uk/foreign-travel-advice`; run `pnpm check:processes` before and after it.
- Use `pnpm readiness:agent-browser-text-heavy-smoke` only when checking the text-heavy document policy. It checks Korean Wikipedia separately from the main smoke because strict StaticText overlap is tracked apart from structural content readiness.
- Use `pnpm check:processes` before and after browser-backed comparison runs.
- If several target sets are needed, run them sequentially and save each output separately.

```sh
pnpm benchmark:agent-cost
pnpm benchmark:library-cost
pnpm compare:sample
pnpm compare:static:fixtures
pnpm compare:static:fixtures:gate
pnpm readiness:audit
pnpm readiness:real-page-smoke
pnpm readiness:search-smoke
pnpm readiness:agent-browser-smoke
pnpm readiness:agent-browser-text-heavy-smoke
pnpm check:processes
pnpm compare:static https://example.com https://news.ycombinator.com
pnpm compare:tokens https://example.com https://news.ycombinator.com
pnpm compare:static:agent
pnpm compare:static:korea-social
pnpm compare:tokens:korea-social
pnpm compare:static:china-japan
pnpm compare:tokens:china-japan
pnpm compare:gate /tmp/ax-grep-agent.json /tmp/ax-grep-tokens.json
```

The comparison scripts compare `ax-grep` output with `agent-browser snapshot` output and score the CLI `--agent` summary. The score covers `agent`, `pageCheck`, `searchResults`, structured evidence, readability, source link quality, verification status, recommended actions, and next steps. Token comparisons estimate prompt cost for compact tree text and agent JSON payloads.

See [comparison-baseline.md](./comparison-baseline.md) for the current baseline run.

## Agent Cost Benchmark

`pnpm benchmark:agent-cost` compares `ax-grep --agent-brief` with `agent-browser snapshot --compact` on local fixture pages. It runs cases sequentially, writes `tmp/benchmarks/agent-cost.json`, and closes the browser session after each case.

Latest local run:

| Case | ax-grep peak RSS | agent-browser peak RSS | RAM multiple | ax-grep decision tokens | agent-browser tokens | Token multiple |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| content-page | 80.0 MB | 906.4 MB | 11.3x | 631 | 1,903 | 3.0x |
| challenge-page | 72.1 MB | 1,404.7 MB | 19.5x | 910 | 241 | 0.3x |

Summary: average RAM reduction was 15.4x. The content fixture used 3.0x fewer decision tokens when the agent reads the compact handoff instead of a browser snapshot.
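
The multiples in the agent-cost table are plain ratios of its raw columns: `agent-browser` peak RSS divided by `ax-grep` peak RSS, and `agent-browser` tokens divided by `ax-grep` decision tokens. As a minimal sketch for re-deriving them, the snippet below feeds the two table rows through `awk`. The helper name `cost_multiples` is hypothetical and not part of the repository; the input values are copied from the table.

```sh
# Hypothetical helper: recompute RAM and token multiples from raw columns.
# Input columns: case, ax-grep peak RSS (MB), agent-browser peak RSS (MB),
#                ax-grep decision tokens, agent-browser tokens.
cost_multiples() {
  awk '{ printf "%s ram=%.1fx tokens=%.1fx\n", $1, $3 / $2, $5 / $4 }'
}

# Rows copied from the "Latest local run" table.
printf '%s\n' \
  'content-page 80.0 906.4 631 1903' \
  'challenge-page 72.1 1404.7 910 241' | cost_multiples
# prints:
#   content-page ram=11.3x tokens=3.0x
#   challenge-page ram=19.5x tokens=0.3x
```

The 15.4x summary figure is the mean of the two RAM multiples, (11.3 + 19.5) / 2.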
The challenge fixture is primarily a memory/browser-avoidance win: `ax-grep` detects hCaptcha markers and returns an explicit browser handoff without launching Chromium.

Search, social, challenge, and volatile targets may be diagnostic-only and excluded from gate averages. Check each run's `included` and `excluded` counts before treating an average as release-gating coverage.

## Library Cost Benchmark

`pnpm benchmark:library-cost` measures warm in-process `extract(html)` calls and writes `tmp/benchmarks/library-cost.json`. It does not fetch remote pages and does not launch a browser. This is the better metric for server integrations where a Node process is already running and the question is incremental RSS per library call, not total CLI process RSS.

The report includes:

- `incrementalRssKb`: RSS after extraction minus RSS before extraction.
- `estimatedTokens`: `cl100k_base` tokens for `formatSemanticTreeText(tree)`.
- `summary.nodeCount`: semantic tree size after compact extraction.

Run it with the package script so Node exposes GC before each measured case:

```sh
pnpm benchmark:library-cost
```

Use `benchmark:agent-cost` for CLI-vs-browser release claims. Use `benchmark:library-cost` for server SDK sizing and memory regression checks.

Latest local library-only run:

| Case | HTML bytes | Incremental RSS | Output tokens | Nodes |
| --- | ---: | ---: | ---: | ---: |
| content-page | 737 | 0 KB | 79 | 16 |
| challenge-page | 251 | 0 KB | 8 | 2 |
| large-list-page | 37,390 | 896 KB | 428 | 76 |

Summary: max incremental RSS was 896 KB, average incremental RSS was 299 KB.

`compare:static:fixtures:gate` is the non-browser smoke gate: it uses synthetic HTML fixtures only, so it should not fetch remote pages or launch `agent-browser`. Use `compare:static:fixtures` when you need the JSON report.

`readiness:real-page-smoke` is the smallest remote-page gate. It checks that `--agent-brief` can use fetched HTML on `https://example.com` without requesting browser capture.

`readiness:agent-browser-smoke` is the smallest browser-backed comparison gate. It runs `pnpm compare` for `https://example.com`, `https://books.toscrape.com/`, `https://news.ycombinator.com`, and `https://www.gov.uk/foreign-travel-advice`, requires `agent-browser` snapshots, and enforces per-target overlap/readiness floors. Treat it like other browser-backed work: one command at a time, with process checks before and after.

`readiness:agent-browser-text-heavy-smoke` is a separate browser-backed comparison for text-heavy document pages. It requires Korean Wikipedia to keep usable action/navigation/structural-content recall while still reporting strict text recall separately.

`compare:gate` checks saved JSON output from `compare:static*` and `compare:tokens*`. Static gates require executor, handoff, browser-advantage, search/page decision, and action-list scores to stay near 1.0 with no gate-included challenge, shell, or over-collected classifications. Token gates require the compact agent payload average to stay cheaper than the browser reference after thin browser snapshots are excluded.

Current suites include:

- static HTML vs browser snapshots
- fixture-only agent readiness smoke checks
- agent executor regression targets for `averageAgentExecutorScore`
- fixture-backed search open, search refine, and browser HTML retry recovery
- CLI agent summary scoring for `pageCheck`, sources, readability, and actions
- token-cost comparison for compact tree prompts and agent JSON prompts
- Korean forum/search/social targets
- Chinese and Japanese wiki/news/forum/search targets
- challenge and volatile-page diagnostics
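
The resource-safety rules at the top of this page (one command at a time, `check:processes` before and after browser-backed runs, one output file per target set) can be sketched as a small wrapper. This is a hypothetical helper, not part of the repository. It assumes each compare script prints its JSON report to stdout, which this page does not confirm; adjust the redirect if the scripts write their own files. `run_guarded`, `run_sequential`, and the `RUNNER` variable are illustrative names.

```sh
# Hypothetical wrapper, not part of the repository.
# Assumption: each compare script prints its JSON report to stdout.
RUNNER="${RUNNER:-pnpm}" # set RUNNER=echo for a dry run

# Run one script with `check:processes` before and after it.
# $1 = package script name, $2 = file that receives the script's stdout.
run_guarded() {
  "$RUNNER" check:processes || return 1
  status=0
  # --silent keeps pnpm's own script banner out of the saved file.
  "$RUNNER" --silent run "$1" > "$2" || status=$?
  # Re-check processes even when the comparison itself failed.
  "$RUNNER" check:processes || return 1
  return "$status"
}

# Run target sets strictly in sequence, saving each output separately.
# $1 = output directory, remaining arguments = package script names.
run_sequential() {
  out_dir="$1"
  shift
  mkdir -p "$out_dir" || return 1
  for script in "$@"; do
    # e.g. compare:static:korea-social -> compare-static-korea-social.json
    name=$(printf '%s' "$script" | tr ':' '-')
    run_guarded "$script" "$out_dir/$name.json" || return 1
  done
}
```

For example, `run_sequential /tmp/ax-grep-runs compare:static:korea-social compare:tokens:korea-social` would stop at the first failing step and otherwise leave two files that can be passed to `pnpm compare:gate`. Setting `RUNNER=echo` first prints the planned commands without launching anything.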