Skip to content

Releases: awizemann/harness

Harness v0.6.0

Choose a tag to compare

@awizemann awizemann released this 16 Jun 16:00

Harness 0.6.0 — Drive Harness from agents (MCP), agent-run visibility, and auto-update

0.5 gave Harness local inference and a dev-time CLI. 0.6 opens it up to agents: a new MCP server lets Claude (or any MCP client) create personas, stage credentials, and start/inspect runs against the same on-disk store the app uses — and the GUI now treats those agent-driven runs as first-class history, distinct from your own. Plus Sparkle auto-update, so the app keeps itself current.

Highlights

HarnessMCP — drive Harness from an agent

harness-mcp is a development-time stdio MCP server, built from the same Harness/ source as the app (minus the SwiftUI surface). It speaks JSON-RPC 2.0 over stdio and exposes ~16 tools: list/create Applications, Personas, Actions & chains; stage per-app credentials; and start_run / get_run_status / get_run_result / cancel_run / list_runs. Runs execute asynchronously under a supervisor with an idle watchdog that auto-cancels a wedged run after N seconds of silence — the backstop the step budget can't be.

It opens the GUI's on-disk SwiftData store, so anything an agent creates shows up in the app, and vice-versa. Register it in your MCP client and point the agent at your app.

Shipped alongside it: web-driver hardening — every per-step WKWebView call (settle, probe, JS eval, snapshot) is now bounded by a timeout race, so a navigating click to a page that never finishes loading can no longer wedge a run.

Agent runs are first-class in the GUI

Every run now carries an origin — You / Agent / CLI:

  • History badges non-user runs (a green ✦ Agent pill) and titles them by their goal.
  • A new Agent Sessions section in the sidebar shows live agent sessions with a running step counter, plus recent agent runs.
  • A global banner floats in while an agent is driving the app, so you always know when something else is at the wheel.

Agent runs thread into the normal per-Application History: an ad-hoc agent run (say, a raw URL) matches or auto-creates an Application for its target, so it lands in that app's History — badged, with the full summary / friction / action-path / replay — instead of living in a separate island. Because the app and the MCP server are separate processes sharing one store, the app watches a lightweight marker file the server writes per live run and refreshes History the moment a run finishes.

Sparkle auto-update

Harness now updates itself. Check for Updates… is in the app menu, and the app checks an appcast feed on a schedule (you're asked once whether to enable automatic checks). Updates are EdDSA-signed and delivered through the existing Developer-ID-signed, notarized pipeline; the release script signs each build and publishes the appcast to GitHub Pages automatically.


Architecture notes, gotchas, and the full tool schema live in the repo's memory tier and the wiki.

Harness v0.5.0

Choose a tag to compare

@awizemann awizemann released this 21 May 14:10

Harness 0.5.0 — Local Mac inference, Set-of-Mark on iOS + macOS, and a dev-time CLI

0.3 cracked the agent's targeting open on web. 0.5 finishes the job: Set-of-Mark on every platform, a new "Local Mac" provider that runs a vision LLM on your machine (screenshots never leave the Mac, runs cost $0, you can work offline), and harness-cli — a development-time command-line driver for iterating on prompts and models without rebuilding the Mac app.

Highlights

Local Mac inference (Ollama)

New Local Mac provider runs a vision LLM on your Mac via Ollama at http://127.0.0.1:11434. Screenshots never leave the machine, runs cost $0 at the API level, and you can work offline.

Curated model picker in Settings → Local Mac:

  • Qwen3-VL 8B (qwen3-vl:8b) — recommended; Alibaba's GUI-trained vision LLM. The model the rest of the local path is tuned against.
  • Gemma 4 Vision 9B (gemma4-vision:9b) — Google. Conservative tool emitter.
  • Llama 3.2 Vision 11B (llama3.2-vision:11b) — Meta. Older but battle-tested in Ollama.
  • Custom local model… — type-your-own tag. Sent verbatim to Ollama; useful for experimenting with new releases (qwen2.5-vl:7b, minicpm-v:8b, etc.) without a Harness update.

Server reachability surfaced in the UI. Settings shows a pill (reachable / unreachable) plus the base URL field; Compose Run's Start button is gated on the last probe being green. First-run wizard adds an "Or run fully local" card with copy-paste install commands.

Native Ollama /api/chat (not the OpenAI-compat shim). The compat endpoint silently drops options.num_ctx — 0.5 calls /api/chat directly so Qwen3-VL 8B actually gets the 16k context it needs to hold a multi-step run's history. 600s URL timeout (Qwen3-VL 8B's cold start is ~5s model load + 60-90s first-token on M2). Subsequent warm requests typically settle under 60s.

Honest trade-offs. The per-run picker labels local models as 5-10× slower per step with lower friction-event quality than cloud-class models — they get the job done, but Sonnet 4.6 will still beat them on synthesis. The Local-vs-Cloud-Models wiki page has a same-goal-same-site three-way head-to-head with concrete numbers.

The OpenAI-compatible path stays available for LM Studio users — AppState.localBaseURL is a free-form field. Standard 07 §12 documents the trade-offs between the two transports.

Set-of-Mark everywhere (iOS + macOS + web)

0.3 shipped Set-of-Mark badges on web. 0.5 brings the same scaffolding to iOS and macOS — every screenshot the LLM sees now has numbered green pills over interactive elements; the agent calls tap_mark(id) instead of tap(x, y) on all three platforms. The disk PNGs stay clean (the marked image lives in an in-memory ScreenshotMetadata.markedImageData channel and never lands on disk, same invariant as web).

iOS probe walks WebDriverAgent's /source?format=json accessibility tree, filters to actionable XCUI roles (Button, Cell, TextField, Switch, Slider, SegmentedControl, etc.), and resolves rects from the AX frame. WDA 12.x returns short role names (Button) on some builds and long names (XCUIElementTypeButton) on others — both accepted via parallel actionableIOSRolesShort + actionableIOSRolesLong sets. Cell label rollup: when a Cell's own label is empty, the probe walks descendants up to 3 levels for StaticText/Image labels and joins them with " — " so the LLM sees "Settings — General — About" instead of "(unlabeled)".

macOS probe walks the AX tree via AXUIElementCreateApplication(pid), filtered to actionable AX roles (AXButton, AXTextField, AXLink, AXSecureTextField, AXSearchField, AXCheckBox, AXRadioButton, AXPopUpButton, AXStepper, AXSwitch, AXMenuItem, AXTab). Container roles (AXWindow, AXGroup, AXScrollArea) are walked-into instead of marked. Bounded walk: max depth 24, max 1500 nodes — keeps the probe under 200ms even on dense windows. Coordinates convert from global screen space to window-local by subtracting windowOrigin.

Shared MarkRenderer (Harness/Platforms/MarkRenderer.swift) scales mark rects from point space to image space internally — iOS and macOS hand it point-space marks against pixel-resolution captures and the renderer does the math. Web's per-element overlay code now lives in the shared helper too; one annotation pipeline across all three platforms.

tap_mark is now in the tool schema for all three platforms. The cycle detector (AgentLoop.recordPostStep) gained equivalence rules for tap_mark (same id), scroll, navigate, back/forward/refresh, rightClick, keyShortcut, and fillCredential — same-id taps in a row trip the detector the same way same-coordinate taps did.

Smart settle gates on iOS + macOS

Fixed-sleep settle ("wait 150ms after every tool") routinely captured screens mid-animation, costing the agent a wasted step every time. 0.5 replaces the sleep with screenshot-stability gating on iOS and macOS — poll captures at 150ms cadence and accept the gate once two consecutive screenshots dHash within Hamming-distance 5.

Per-tool profiles balance latency vs. correctness:

  • Tap — idle 250ms, min 250ms, max 2000ms.
  • Swipe — idle 400ms, min 400ms, max 3000ms (longer max for momentum scroll).
  • Key shortcut / right click — same as tap.

The web platform already had a MutationObserver-based DOM-quietness gate from 0.3; it now also handles SPA route transitions correctly via a requireChildListMutation flag (React Suspense keeps the old DOM mounted during route changes, so "idle 200ms after click" was firing on a stale page).

HarnessCLI — development-time driver

New xcodegen target produces harness-cli, a development-time command-line driver that shares the entire Harness/ source root with the GUI app and runs against WebDriver / IOSPlatformAdapter / MacAppDriver end-to-end. The same RunCoordinator, RunLogger, and event stream the GUI consumes — just no SwiftUI.

harness-cli \
  --platform web \
  --url https://alanwizemann.com \
  --goal "Find Alan's most recent article and tell me what it's about in your own words" \
  --persona "A curious first-time user" \
  --provider local \
  --model qwen3-vl:8b \
  --output ./test-run \
  --max-steps 15

Web, iOS, and macOS all supported via --platform web|ios|macos. iOS needs --project-path + --scheme + --simulator-udid; macOS needs --app-path and runs Screen Recording + Accessibility preflight checks. Cloud credentials come from env vars first (ANTHROPIC_API_KEY / OPENAI_API_KEY / GOOGLE_API_KEY) with a system Keychain fallback — the GUI's saved keys work for the CLI binary too, so you don't have to re-stash them in your shell.

HARNESS_DUMP_MARKED=1 writes the marked PNG next to every step capture for debugging probe coverage.

harness-cli is development-only — not Developer-ID signed, not notarised, builds locally from this repo via xcodebuild -scheme HarnessCLI build. See the HarnessCLI wiki page for the full reference.

Fixes

  • WebContent log flood silenced. The off-screen (-10_000, -10_000) window placement triggered WebKit's aggressive volatile-layer scheduling — every snapshot tick failed to mark layers as volatile and emitted WebProcess::markAllLayersVolatile: Failed to the unified log multiple times per second. Window is now placed at (0, 0) with alphaValue = 0, ignoresMouseEvents = true, and level = NSWindow.Level.normal - 1 — visually invisible, but WebKit sees a "real" on-screen window and doesn't try to free its layers. Live-mirror poller also dropped from 3fps to 1fps.
  • simctl screenshot exit-code flakes tolerated. Rapid back-to-back captures occasionally produce a complete PNG on disk but exit non-zero. The driver now checks for a valid PNG header (89 50 4E 47) at the expected path and treats the capture as successful if the file looks right; only fails the screenshot if both the file is missing/short AND a 200ms-later retry also fails.
  • WDA waitForReady timeout bumped 45s → 120s. iOS 26.2 simulators occasionally take 60-90s to stand up the WebDriverAgent session on first run; the old timeout was tripping during steady-state warm boots.
  • Non-persistent WKWebsiteDataStore for web runs. Every run starts with a clean slate — no cookies, no localStorage, no IndexedDB. Two reasons: (1) reproducibility — SPAs storing theme/locale/dismissed-banner state would render differently across runs; (2) "what a fresh user sees" — Harness is a UX testing tool, the agent's screenshots are most informative when they reflect a first-time visitor's experience. Logged-in flows are still handled via fill_credential(field:).
  • NSAppearance bound to system Dark Mode preference, not the host app. The GUI binary may render itself in a different mode than macOS is set to; the agent's screenshots should match what the user's system is set to. AppleInterfaceStyle read from UserDefaults.standard to resolve.
  • Per-turn user-message reminder. LLMShared.currentTurnInstruction(annotation:) prefixes every step's user message with the three behavior rules ("tap a text field first to focus it before calling type", "prefer tap_mark over tap when marks are available", "call mark_goal_done when the goal is genuinely complete"). Used by all four LLM clients. Helps smaller cloud models AND local models stay on-track without bloating the system prompt.

Architecture notes

  • Shared MarkRenderer at Harness/Platforms/MarkRenderer.swift — single annotation pipeline across iOS, macOS, and web. InteractiveMark st...
Read more

Harness v0.3.1

Choose a tag to compare

@awizemann awizemann released this 08 May 11:03

Harness 0.3.1 — Clean replay screenshots and a paired Compose Run header

A point release that polishes two of the rough edges 0.3.0 left exposed.

Highlights

Set-of-Mark badges off-disk

0.3.0's web targeting overlay drew small green numbered badges over every focusable element in the screenshot — and saved that marked-up image to disk. The agent loved it (no more "tapped y=228 when the input was at y=242"). Everyone else had to explain the dev-tool clutter every time they shared a screenshot in a bug report or walked a designer through a run.

0.3.1 splits the pipeline:

  • Disk PNG = the clean rendered page. Replay, friction reports, and any exported screenshot show what a real user would see.
  • Agent payload = the marked-up copy. Rendered in-memory, never written to disk. The web driver returns the bytes via a new ScreenshotMetadata.markedImageData field; RunCoordinator routes them to the LLM call and only the LLM call.

The lastMarks cache (which tap_mark(id) resolves against) keeps doing its job — that's what makes id → element mapping possible across turns. iOS and macOS pass markedImageData = nil today and inherit the same split the moment they grow accessibility-tree-based SOM (tracked under "Set-of-Mark targeting on iOS + macOS" on the wiki Roadmap).

Standard 14 §6 documents the new invariant: no agent scaffolding on disk.

Compose Run pairs Persona + Credential

Both sections answer the same question — who's running this run? — so they now sit side-by-side instead of stacked. The form is one row shorter, and the pairing makes it visually obvious that the credential picker is part of the persona-shaping decision rather than a separate concern.

ViewThatFits falls back to a single column on narrow windows automatically. When no credentials are staged for the active Application, the credential pane self-hides as before and Persona expands to fill the row via .frame(maxWidth: .infinity) — no special-case branching.

Everything stays inside the existing HarnessDesign token system: matching PanelContainer headers, Theme.spacing.l between columns, top-aligned so a long persona blurb doesn't drag the credential picker down.

Maintenance

  • feat/web-mirror-redesign (a now-abandoned stacked-PR attempt left over from 0.3.0) cleaned up.

Compatibility

  • macOS 14+
  • Apple Silicon (universal)
  • Notarized + signed with Developer ID
  • Run-log schema version stays at 3 — the marked/unmarked split is purely a local-rendering concern and never lands in JSONL. Existing 0.3.0 run records, Applications, Personas, Action chains, and credentials load unchanged.
  • SwiftData stays at V5.

Harness v0.3.0

Choose a tag to compare

@awizemann awizemann released this 07 May 15:33

Harness 0.3.0 — Credentials, Set-of-Mark targeting, and a web mirror that fills the column

0.2 cracked the model path open across three providers. 0.3 cracks the agent's targeting and authentication open: pre-stage credentials against any Application, click form elements by numbered badge instead of pixel, and watch the live mirror render the full middle pane instead of an iPad-shaped device bezel.

Highlights

Per-Application credential storage

Pre-stage one or more (label, username, password) triples against an Application from its detail panel. Pick a credential at run time on Compose Run; the agent gets a new tool — fill_credential(field: "username" | "password") — on iOS, macOS, and web.

The contract on password handling, with no escape hatches:

  • No password in the JSONL. The agent's tool_call.input for fill_credential is exactly {"field": "password"} — never the value. The driver synthesises the typed text from a CredentialBinding it caches in memory and never serialises.
  • No password in the model's context. The system prompt's new {{CREDENTIALS}} block lists label + username only. The agent knows the credential exists and how to fill it; the password is invisible to it.
  • Screenshots rely on platform secure-text-entry. iOS SecureField, macOS NSSecureTextField, and HTML <input type="password"> all mask the value visually. We accept that an unusual SUT not using secure-text-entry could leak a password into a captured PNG; the create-credential UI documents the limit.

Storage split: SwiftData rows carry (id, applicationID, label, username, createdAt); passwords sit in macOS Keychain under service: "com.harness.credentials", account "<applicationID>:<credentialID>", via existing KeychainStoring extensions. Even an unencrypted backup of history.store carries no secret material.

New friction kind: auth_required. The agent emits this when it hits a login wall and has nothing to fill — the friction report sections it distinctly from dead_end so a "this surface needs auth-bypass" run is visible at a glance.

Set-of-Mark targeting (web)

Vision-language models miss small click targets by a handful of pixels. Inputs are typically 50px tall; the model picks y=228 when the input is at y=242–290; nothing happens; the agent retries; the run burns steps re-targeting.

Every screenshot now overlays a small numbered badge on each focusable element currently visible in the viewport — form fields, action buttons, dropdowns, checkboxes, custom-role widgets. The agent calls tap_mark(id) and the WebDriver resolves to the element's center via a cached (id → rect) map. Pixel guesswork eliminated for marked targets.

Selection is deliberately tight: marks go on things where pixel precision matters — not every link or generic [tabindex] element. Plain text links (<a href>) are skipped because they're typically large enough that coordinate-only tapping is reliable; the homepage doesn't need 60 numbered boxes.

Probe pierces open shadow roots so inputs nested inside custom elements (modern signin / payment widgets) get marks. Closed shadow roots and cross-origin iframes stay invisible; that's a platform limit.

The PNG saved to disk is the marked-up image — replay shows the agent's view exactly. A run is now readable by element identity ("agent tapped mark 4 → Small radio") instead of coordinate triangulation.

iOS and macOS get the same treatment via accessibility-tree probes (WDA and AX respectively) in a follow-up; tracked on the wiki Roadmap.

Web mirror — full-column rendering

Web runs no longer render inside an iPad-shaped device bezel. The mirror now shows a flat browser chrome at the top — URL pill, lock glyph, back / forward / refresh affordances, loading spinner — and the screenshot fills the rest of the column.

Two related changes lower per-run API spend:

  • Default viewport bumped to 1280×1600. Taller snapshots mean fewer scrolls per goal — measurable reduction in agent turns on long pages.
  • Dynamic viewport-height-to-canvas-aspect. The configured 1280 CSS-pixel width stays as the layout trigger (so the page renders desktop-wide), but the height scales to the canvas aspect at run time so the snapshot fills the column without letterbox AND without forcing a narrow / mobile responsive layout.

Both happen via a tiny MainActor LiveWebMirror registry — the live WebMirrorView measures its canvas, the WebDriver resizes the WKWebView to match. Replay reads the saved snapshot and renders it 1:1 in the chrome.

Loop & form correctness

Three classes of "the typed value disappeared" / "the run wedged" failure are gone:

  • React-aware dispatchType. Setting el.value = ... directly bypasses React's value tracker; React re-renders to its own internal state and the typed text vanishes. WebDriver now resolves the native setter via Object.getOwnPropertyDescriptor(prototype, 'value').set and calls it with .call(el, value) — the standard pattern every browser test framework uses to drive React inputs. Same fix applies to fill_credential.
  • Click-target focus routing. document.elementFromPoint returns the topmost element, which on modern signin forms is usually a wrapper <div> or styled <label> — not the <input>. div.focus() is a no-op; label.focus() doesn't focus the associated input. After dispatching click events, we now walk the click target to find the focusable input (direct match → <label> via htmlFor/contained input → querySelector descendant → closest() ancestor) and focus it explicitly. Subsequent type / fill_credential writes to the right activeElement.
  • Multi-tool emissions accepted. The system prompt always read "exactly one tool call ... optionally accompanied by one or more note_friction calls"; the three LLM-client parsers were rejecting any blocks.count > 1. Each parser now splits action vs note_friction, requires one action, and forwards frictions through LLMStepResponse.inlineFrictionAgentDecision.inlineFrictionRunCoordinator (which logs each one as a friction row). Cheaper models that naturally pair "I'm flagging this" with "and trying X" no longer wedge the run.

Architecture

  • SwiftData V5. New @Model Credential with @Relationship(deleteRule: .cascade) var credentials: [Credential] on Application. V4 frozen by copying its file-scope @Model types into the HarnessSchemaV4 enum's nested types — the established convention for a shape change. Lightweight v4→v5 migration; existing V4 stores reopen with credentials == [].
  • RunHistoryStore@ModelActor. The actor was constructing ModelContext(container) from a sync init, which Swift's strict concurrency correctly flagged at runtime: "ModelContexts are not Sendable. Consider using a ModelActor." The macro now generates a nonisolated let modelExecutor: any ModelExecutor and binds the actor's unownedExecutor to it, so every isolated method runs on the queue the ModelContext was created on. Migration-failure recovery (delete-and-retry) lifted to a private static helper outside the actor.
  • Run-log schema v3. RunStartedPayload gains optional credentialLabel + credentialUsername (decodeIfPresent so v2 logs round-trip cleanly). tool_call.input shape for fill_credential and tap_mark documented. Parser accepts v1, v2, v3.
  • Friction taxonomy. New FrictionKind.authRequired synced across the five sites the friction-vocab standard requires.

Tests

223 unit tests passing (was 218 in 0.2). New / extended suites:

  • SwiftDataMigrationTests — V4→V5 migration test (Applications gain empty credentials relation), V5 round-trip with two staged credentials, V5 cascade-delete from Application removes credentials.
  • KeychainStoreTests — credential password round-trip uses per-credential (applicationID:credentialID) account keying, empty / whitespace password write rejected.
  • AgentToolsSchemaTests — extended for fill_credential and tap_mark membership in iOS / web tool sets.
  • RunLogParserV2Tests — parser now accepts v3, throws schemaVersionUnsupported for v4+.

Known limits

  • Set-of-Mark is web-only today. iOS and macOS still rely on coordinate-only tap(x, y). Tracked on the wiki Roadmap as "Set-of-Mark targeting on iOS + macOS".
  • 2FA / human-in-the-loop interrupts are unsupported. Runs that hit SMS verification, CAPTCHAs, or push-approval flows block at that screen. The recommended path is to use test accounts without 2FA. Tracked on the Roadmap as a generic request_user_input(reason, secret) tool.
  • eBay-style hostile DOMs may still defeat probe-based marking. Closed shadow roots and cross-origin auth iframes are platform-impossible to introspect; the agent falls back to coordinate-based targeting in those cases.
  • Web is still WebKit-only. Chrome / CDP support remains on the roadmap.
  • macOS still needs Screen Recording permission. First run prompts; subsequent runs are silent.

Compatibility

  • macOS 14+
  • Apple Silicon (universal)
  • Notarized + signed with Developer ID
  • Existing 0.2 run records, Applications, Personas, and Action chains load unchanged. V5 migration runs once at first launch (one-shot, transparent).

Harness v0.2.1

Choose a tag to compare

@awizemann awizemann released this 06 May 17:12

Harness 0.2.1 — First-run WebDriverAgent build works for everyone

A patch release that fixes the first issue filed against Harness on GitHub: the first-run WebDriverAgent build pointed at /Users/alanwizemann/... on every machine that wasn't mine. Thanks to @impiri for the report (#1).

Fix

HarnessPaths.wdaSourceURL was resolved from a $SRCROOT path baked into the binary at build time. That works on the developer's own Mac and nowhere else — for everyone who installed from the 0.2.0 release zip, the first-run wizard's "Build WebDriverAgent" button errored out with a path that didn't exist on their disk.

0.2.1 ships the WebDriverAgent submodule inside the .app bundle (Contents/Resources/WebDriverAgent) and resolves it from Bundle.main.resourceURL first, falling back to the $SRCROOT path only for dev-mode runs from Xcode. The bundled WDA snapshot SHA is baked at app build time so the on-disk build cache stays valid across launches without needing git to resolve a HEAD on the user's machine.

Net effect for users:

  • First-run "Build WebDriverAgent" button works immediately after install.
  • Cache hits on the second launch — first-build cost is paid once per Harness app version, not once per launch.
  • No public API or contract changes; existing run records load unchanged.

Compatibility

  • macOS 14+
  • Apple Silicon (universal)
  • Notarized + signed with Developer ID
  • Existing 0.2 run records, Applications, Personas, and Action chains load unchanged.

Harness v0.2.0

Choose a tag to compare

@awizemann awizemann released this 06 May 11:01

Harness 0.2.0 — Multi-provider LLM, configurable budgets, persistent defaults

The 0.1 release shipped the platform plumbing (iOS Simulator, macOS apps, web) wired to a single Anthropic-only model path. 0.2 cracks that path open: seven supported models across three providers, with cost rails, persistence, and unlimited-step support so the agent loop can run as long as the goal needs.

Highlights

Multi-provider LLM support

Pick any of three providers in Settings, then a model from that provider. Compose Run can override per-run.

Provider Models Notes
Anthropic Opus 4.7, Sonnet 4.6, Haiku 4.5 (new) Same cache_control ephemeral caching as 0.1
OpenAI GPT-5 Mini, GPT-4.1 Nano (new) Automatic prompt caching at ≥1024 tokens (50% off)
Google Gemini 2.5 Flash, Gemini 2.5 Flash Lite (new) Implicit caching on 2.5+ models (90% off)

Each provider gets its own Keychain entry (com.harness.anthropic, com.harness.openai, com.harness.google). Add keys in Settings; the per-provider status indicator confirms what's wired up.

Per-model token budgets

The legacy model == .opus47 ? 250_000 : 1_000_000 ternary is gone. Every model has a justified default and a hard ceiling:

Model Default Max
Opus 4.7 250k 1M
Sonnet 4.6 1M 3M
Haiku 4.5 2M 10M
GPT-5 Mini, GPT-4.1 Nano 2M 10M
Gemini 2.5 Flash, Flash Lite 2M 10M

Override globally in Settings (applies to every run regardless of model) or per-run on Compose Run's Advanced section. The resolved value clamps to the active model's max so a 5M override on a cheap model can't follow you when you switch to Opus mid-form.

Unlimited steps

Toggle in Settings, Compose Run Advanced, or any Application's defaults. The token budget + cycle detector remain the safety rails — unlimited steps is not unlimited cost.

Persistent defaults

Settings now actually persist across launches. In 0.1, only the active Application id was saved; everything else (default model, mode, step budget, simulator visibility) reset every launch. 0.2 saves all of them via an extended PersistedSettings structure (legacy settings.json files decode cleanly).

Real screenshot thumbnails in the step feed

The step feed's right column previously showed a static gradient placeholder. Now it renders the captured screenshot for each step, sized aspect-aware so portrait iOS shots and landscape macOS / web shots both look right.

Loop hardening

Cheaper models (GPT-4.1 Nano, Gemini Flash Lite, sometimes Haiku) misbehave more than Opus did. Three changes keep the loop resilient:

  • Multi-tool responses — a model emitting >1 tool call now throws invalidToolCall instead of silently dropping the rest.
  • Zero-tool responses — a model punting to plain text instead of calling a tool now goes through the parse-retry path with a corrective hint, instead of failing the run.
  • Retry-detail propagation — on retry, the prior parse error is prepended to the user message so the model sees what went wrong. The previous loop retried blind and small models would loop on the same mistake until the cap.

Architecture

  • New LLMShared enum centralizes the canonical-tool-name → typed ToolCall decode used by every client (Anthropic / OpenAI / Gemini).
  • ToolSchema refactored to CanonicalTool + per-provider shape translators (anthropicShape, openAIShape, geminiShape). The Gemini translator uppercases JSON Schema types and strips additionalProperties so the strict OpenAPI parser accepts our schemas.
  • LLMClientFactory.client(for:keychain:) dispatches per request.model.provider. Each run gets a fresh client so token-usage accounting and the cycle-detector window reset correctly.
  • ClaudeError renamed to LLMError (provider-neutral messages). Existing call sites updated; no deprecation alias.
  • LLMStepRequest.platformKind is now plumbed through, so each client picks the right canonical tool set per platform — fixes a latent bug where macOS / web runs always advertised the iOS tool set.
  • TokenUsage.thinkingTokens (telemetry) — surfaces reasoning-token counts from GPT-5 / Gemini 2.5 / extended-thinking Claude calls without double-counting in cost math.

Tests

218 unit tests passing (was 175 in 0.1). New / extended suites:

  • ToolSchemaShapesTests — Anthropic + OpenAI + Gemini shape translators
  • LLMSharedToolCallTests — canonical decode coverage
  • OpenAIClientTests + OpenAIClientRequestShapeTests — wire-format round trip
  • GeminiClientTests + GeminiClientRequestShapeTests + ToolSchemaGeminiShapeTests — wire-format + OpenAPI gotchas
  • AgentLoopRetryHintTests — parse-failure detail propagation
  • RunCoordinatorReplayTests.unlimitedStepBudgetSkipsShortCircuit — drives 50 steps with unlimited budget
  • PersistedSettingsTests — round-trip + legacy-file + nil-roundtrip
  • AgentModelTokenBudgetTests — per-model lookup sanity, resolution, and clamping

Known limits / scoped out

  • Per-Application token budgetdefaultStepBudget exists at the per-Application level (in Application SwiftData @Model), but defaultTokenBudget doesn't yet. Adding it requires a SwiftData V4→V5 migration; deferred to a future release. Settings + per-run overrides cover most use cases in the meantime.
  • Web is still WebKit-only. Chrome / CDP support remains on the roadmap.
  • macOS still needs Screen Recording permission. First run prompts; subsequent runs are silent.

Standards updates

  • standards/07-ai-integration.md §7 — token-budget resolution chain + per-model table
  • standards/13-agent-loop.md §3stepBudget == 0 sentinel for unlimited
  • standards/14-run-logging-format.md — clarified stepBudget and tokenBudget semantics

Compatibility

  • macOS 14+
  • Apple Silicon (universal)
  • Notarized + signed with Developer ID
  • Existing 0.1 run records, Applications, Personas, and Action chains load unchanged. Existing settings.json decodes cleanly (the new fields default to nil and AppState falls back to its property initializers — Settings re-saves them on first edit).

Harness v0.1.0

Choose a tag to compare

@awizemann awizemann released this 05 May 18:15

Harness 0.1.0 — Multi-platform alpha

The first tagged release. All three target platforms — iOS Simulator, macOS apps, and web apps — are wired end-to-end. Per-app setting; same Compose-Run flow for all three; same replay + friction artefacts.

What it does

Write a goal in plain language ("Sign up and create my first list", "Find a vegetarian restaurant near me and save it"). Pick a persona ("first-time user", "returning power user", "keyboard-first"). Hit Start. An LLM agent reads screenshots, drives the UI, and reports:

  • Did the goal complete? (success / failure / blocked + summary)
  • What was the path? (replayable timeline of every screen + action)
  • Where was the friction? (timestamped events the agent flagged as confusing)

Targets

Kind How
iOS Simulator xcodebuild your project + scheme; simctl boot/install/launch; WebDriverAgent for input.
macOS app NSWorkspace launch (pre-built .app or xcodebuild macOS scheme); CGEvent for clicks / scroll / keyboard / shortcuts; CGWindowListCreateImage for window capture.
Web app Embedded WKWebView at any CSS-pixel viewport; JS-synthesised events for input; WKWebView.takeSnapshot for capture.

Architecture

  • Harness/Platforms/PlatformKind discriminator, UXDriving + PlatformAdapter protocols, per-platform adapters.
  • RunCoordinator dispatches through PlatformAdapterFactory.make(for:services:). The agent loop reads its tool schema from adapter.toolDefinitions(...); the system prompt's {{PLATFORM_CONTEXT}} block loads from docs/PROMPTS/platforms/<kind>.md.
  • Run history, replay, and friction events are platform-neutral — JSONL events are the same shape regardless of target.
  • SwiftData V4 schema: Application.platformKindRaw + per-platform optional fields (mac bundle path, web URL + viewport).

What's in this build

  • 175 unit tests passing.
  • macOS 14+ minimum.
  • Apple Silicon (universal).
  • Notarized + signed with Developer ID.

Known limits

  • Web is WebKit only. A future opt-in CDP-backed driver (Chrome) is on the roadmap. Browser-chrome shortcuts (Cmd+L, Cmd+T) won't fire — that's a runtime limit, not a UX problem to flag.
  • macOS needs Screen Recording permission. First run prompts; subsequent runs are silent.
  • iOS first build is 1–2 minutes while WebDriverAgent compiles for your simulator runtime. Cached after that.
  • Personas are shared across platforms. Built-in iOS / macOS / web personas exist (docs/PROMPTS/personas/<kind>-defaults.md); the picker doesn't filter by platform yet — pick anything sensible. Filtering is a follow-up if it gets cluttered.

Setup

  1. Install Homebrew.
  2. brew tap facebook/fb && brew install idb-companion (only needed for iOS).
  3. Get an Anthropic API key.
  4. Open Harness → first-run wizard walks you through API key + WebDriverAgent build.

Acknowledgements

Built with Claude (Anthropic) — agent loop runs against the public Messages API. Thanks to Appium for WebDriverAgent, the iOS responder-chain bridge that makes simulator input actually fire UIKit events.


Full source + docs: https://github.com/awizemann/harness
Wiki (architecture, services, agent loop): https://github.com/awizemann/harness/wiki
Issues: https://github.com/awizemann/harness/issues