Releases: awizemann/harness
Harness v0.6.0
Harness 0.6.0 — Drive Harness from agents (MCP), agent-run visibility, and auto-update
0.5 gave Harness local inference and a dev-time CLI. 0.6 opens it up to agents: a new MCP server lets Claude (or any MCP client) create personas, stage credentials, and start/inspect runs against the same on-disk store the app uses — and the GUI now treats those agent-driven runs as first-class history, distinct from your own. Plus Sparkle auto-update, so the app keeps itself current.
Highlights
HarnessMCP — drive Harness from an agent
harness-mcp is a development-time stdio MCP server, built from the same Harness/ source as the app (minus the SwiftUI surface). It speaks JSON-RPC 2.0 over stdio and exposes ~16 tools: list/create Applications, Personas, Actions & chains; stage per-app credentials; and start_run / get_run_status / get_run_result / cancel_run / list_runs. Runs execute asynchronously under a supervisor with an idle watchdog that auto-cancels a wedged run after N seconds of silence — the backstop the step budget can't be.
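As a rough illustration of what an MCP client writes to the server's stdin, the sketch below encodes a JSON-RPC 2.0 tools/call request for start_run as a single line. The tools/call method and its name / arguments parameters come from the MCP specification; the argument keys used here (url, goal) and the ToolCallRequest type are hypothetical placeholders, since the real tool schema lives in the repo and wiki.

```swift
import Foundation

// JSON-RPC 2.0 envelope for an MCP tool invocation.
// The type name and the argument keys are illustrative assumptions.
struct ToolCallRequest: Codable {
    struct Params: Codable {
        let name: String                 // tool name, e.g. "start_run"
        let arguments: [String: String]  // hypothetical argument keys
    }
    var jsonrpc = "2.0"
    let id: Int
    var method = "tools/call"
    let params: Params
}

// Encodes the request as one line of JSON, suitable for a
// newline-delimited stdio transport.
func encodeLine(_ request: ToolCallRequest) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]   // stable key order, no pretty-printing
    let data = try encoder.encode(request)
    return String(decoding: data, as: UTF8.self)
}

let startRun = ToolCallRequest(
    id: 1,
    params: .init(name: "start_run",
                  arguments: ["url": "https://example.com", "goal": "Sign up"])
)
let startRunLine = try encodeLine(startRun)
print(startRunLine)
```

A real client would follow this with get_run_status calls using the run identifier the server returns; that response shape is not shown because it is not described here.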
It opens the GUI's on-disk SwiftData store, so anything an agent creates shows up in the app, and vice-versa. Register it in your MCP client and point the agent at your app.
Shipped alongside it: web-driver hardening — every per-step WKWebView call (settle, probe, JS eval, snapshot) is now bounded by a timeout race, so a navigating click to a page that never finishes loading can no longer wedge a run.
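One common way to bound an async call with a timeout race in Swift is a throwing task group in which the operation and a sleeping timer compete. The sketch below shows that general pattern, not Harness's actual implementation; withTimeout and StepTimeout are hypothetical names.

```swift
import Foundation

// Hypothetical error type for the sketch.
struct StepTimeout: Error {}

// Runs `operation` and a timer concurrently; whichever finishes first wins.
// The loser is cancelled. Note the operation must cooperate with
// cancellation, otherwise the group still waits for it before returning.
func withTimeout<T: Sendable>(
    seconds: Double,
    operation: @escaping @Sendable () async throws -> T
) async throws -> T {
    try await withThrowingTaskGroup(of: T.self) { group in
        defer { group.cancelAll() }
        group.addTask { try await operation() }
        group.addTask {
            try await Task.sleep(nanoseconds: UInt64(seconds * 1_000_000_000))
            throw StepTimeout()
        }
        // The first child to complete either returns a value or throws StepTimeout.
        guard let first = try await group.next() else { throw StepTimeout() }
        return first
    }
}

// Usage: a call that finishes well inside its budget.
let settled = try await withTimeout(seconds: 2) { () async throws -> Int in 42 }
print(settled)
```

The design point is that a page that never finishes loading surfaces as a thrown timeout the run loop can handle, rather than an await that never resumes.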
Agent runs are first-class in the GUI
Every run now carries an origin — You / Agent / CLI:
- History badges non-user runs (a green ✦ Agent pill) and titles them by their goal.
- A new Agent Sessions section in the sidebar shows live agent sessions with a running step counter, plus recent agent runs.
- A global banner floats in while an agent is driving the app, so you always know when something else is at the wheel.
Agent runs thread into the normal per-Application History: an ad-hoc agent run (say, a raw URL) matches or auto-creates an Application for its target, so it lands in that app's History — badged, with the full summary / friction / action-path / replay — instead of living in a separate island. Because the app and the MCP server are separate processes sharing one store, the app watches a lightweight marker file the server writes per live run and refreshes History the moment a run finishes.
Sparkle auto-update
Harness now updates itself. Check for Updates… is in the app menu, and the app checks an appcast feed on a schedule (you're asked once whether to enable automatic checks). Updates are EdDSA-signed and delivered through the existing Developer-ID-signed, notarized pipeline; the release script signs each build and publishes the appcast to GitHub Pages automatically.
Architecture notes, gotchas, and the full tool schema live in the repo's memory tier and the wiki.
Harness v0.5.0
Harness 0.5.0 — Local Mac inference, Set-of-Mark on iOS + macOS, and a dev-time CLI
0.3 cracked the agent's targeting open on web. 0.5 finishes the job: Set-of-Mark on every platform, a new "Local Mac" provider that runs a vision LLM on your machine (screenshots never leave the Mac, runs cost $0, you can work offline), and harness-cli — a development-time command-line driver for iterating on prompts and models without rebuilding the Mac app.
Highlights
Local Mac inference (Ollama)
New Local Mac provider runs a vision LLM on your Mac via Ollama at http://127.0.0.1:11434. Screenshots never leave the machine, runs cost $0 at the API level, and you can work offline.
Curated model picker in Settings → Local Mac:
- Qwen3-VL 8B (qwen3-vl:8b) — recommended; Alibaba's GUI-trained vision LLM. The model the rest of the local path is tuned against.
- Gemma 4 Vision 9B (gemma4-vision:9b) — Google. Conservative tool emitter.
- Llama 3.2 Vision 11B (llama3.2-vision:11b) — Meta. Older but battle-tested in Ollama.
- Custom local model… — type-your-own tag. Sent verbatim to Ollama; useful for experimenting with new releases (qwen2.5-vl:7b, minicpm-v:8b, etc.) without a Harness update.
Server reachability surfaced in the UI. Settings shows a pill (reachable / unreachable) plus the base URL field; Compose Run's Start button is gated on the last probe being green. First-run wizard adds an "Or run fully local" card with copy-paste install commands.
Native Ollama /api/chat (not the OpenAI-compat shim). The compat endpoint silently drops options.num_ctx — 0.5 calls /api/chat directly so Qwen3-VL 8B actually gets the 16k context it needs to hold a multi-step run's history. 600s URL timeout (Qwen3-VL 8B's cold start is ~5s model load + 60-90s first-token on M2). Subsequent warm requests typically settle under 60s.
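For readers unfamiliar with the native endpoint, here is a minimal sketch of a request body for Ollama's /api/chat that sets options.num_ctx. The field names (model, messages, images, stream, options.num_ctx) follow Ollama's documented API; the type names, the prompt text, and reading "16k" as 16,384 tokens are assumptions made for illustration.

```swift
import Foundation

// Request body for Ollama's native /api/chat endpoint.
struct OllamaChatRequest: Encodable {
    struct Message: Encodable {
        let role: String
        let content: String
        let images: [String]   // base64-encoded screenshots
    }
    struct Options: Encodable {
        let numCtx: Int
        enum CodingKeys: String, CodingKey { case numCtx = "num_ctx" }
    }
    let model: String
    let messages: [Message]
    let stream: Bool
    let options: Options
}

// Builds the JSON body; 16_384 is an assumed reading of "16k context".
func makeChatBody(model: String, prompt: String, screenshotPNG: Data) throws -> Data {
    let request = OllamaChatRequest(
        model: model,
        messages: [.init(role: "user",
                         content: prompt,
                         images: [screenshotPNG.base64EncodedString()])],
        stream: false,
        options: .init(numCtx: 16_384)
    )
    return try JSONEncoder().encode(request)
}

let chatBody = try makeChatBody(
    model: "qwen3-vl:8b",
    prompt: "Describe the screen.",
    screenshotPNG: Data([0x89, 0x50, 0x4E, 0x47])  // placeholder bytes, not a real image
)
print(chatBody.count)
```

The body would be POSTed to http://127.0.0.1:11434/api/chat with a long request timeout (the notes above mention 600s) to cover the cold-start case.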
Honest trade-offs. The per-run picker labels local models as 5-10× slower per step with lower friction-event quality than cloud-class models — they get the job done, but Sonnet 4.6 will still beat them on synthesis. The Local-vs-Cloud-Models wiki page has a same-goal-same-site three-way head-to-head with concrete numbers.
The OpenAI-compatible path stays available for LM Studio users — AppState.localBaseURL is a free-form field. Standard 07 §12 documents the trade-offs between the two transports.
Set-of-Mark everywhere (iOS + macOS + web)
0.3 shipped Set-of-Mark badges on web. 0.5 brings the same scaffolding to iOS and macOS — every screenshot the LLM sees now has numbered green pills over interactive elements; the agent calls tap_mark(id) instead of tap(x, y) on all three platforms. The disk PNGs stay clean (the marked image lives in an in-memory ScreenshotMetadata.markedImageData channel and never lands on disk, same invariant as web).
iOS probe walks WebDriverAgent's /source?format=json accessibility tree, filters to actionable XCUI roles (Button, Cell, TextField, Switch, Slider, SegmentedControl, etc.), and resolves rects from the AX frame. WDA 12.x returns short role names (Button) on some builds and long names (XCUIElementTypeButton) on others — both accepted via parallel actionableIOSRolesShort + actionableIOSRolesLong sets. Cell label rollup: when a Cell's own label is empty, the probe walks descendants up to 3 levels for StaticText/Image labels and joins them with " — " so the LLM sees "Settings — General — About" instead of "(unlabeled)".
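The Cell label rollup can be pictured with a small tree walk. This is a sketch under simplifying assumptions: AXNode is a hypothetical stand-in for the parsed /source tree, roles use the short names, and labels are collected in pre-order.

```swift
// Hypothetical, simplified node from the accessibility tree.
struct AXNode {
    let role: String
    let label: String
    var children: [AXNode] = []
}

// Collects StaticText/Image labels from descendants, at most `maxDepth` levels down.
func collectLabels(from node: AXNode, depth: Int, maxDepth: Int, into labels: inout [String]) {
    guard depth <= maxDepth else { return }
    for child in node.children {
        if (child.role == "StaticText" || child.role == "Image") && !child.label.isEmpty {
            labels.append(child.label)
        }
        collectLabels(from: child, depth: depth + 1, maxDepth: maxDepth, into: &labels)
    }
}

// Uses the Cell's own label when present, otherwise rolls up descendant labels.
func displayLabel(for cell: AXNode, maxDepth: Int = 3) -> String {
    if !cell.label.isEmpty { return cell.label }
    var labels: [String] = []
    collectLabels(from: cell, depth: 1, maxDepth: maxDepth, into: &labels)
    return labels.isEmpty ? "(unlabeled)" : labels.joined(separator: " — ")
}

let aboutCell = AXNode(role: "Cell", label: "", children: [
    AXNode(role: "StaticText", label: "Settings"),
    AXNode(role: "Other", label: "", children: [
        AXNode(role: "StaticText", label: "General"),
        AXNode(role: "Other", label: "", children: [
            AXNode(role: "StaticText", label: "About")
        ])
    ])
])
print(displayLabel(for: aboutCell))  // prints "Settings — General — About"
```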
macOS probe walks the AX tree via AXUIElementCreateApplication(pid), filtered to actionable AX roles (AXButton, AXTextField, AXLink, AXSecureTextField, AXSearchField, AXCheckBox, AXRadioButton, AXPopUpButton, AXStepper, AXSwitch, AXMenuItem, AXTab). Container roles (AXWindow, AXGroup, AXScrollArea) are walked-into instead of marked. Bounded walk: max depth 24, max 1500 nodes — keeps the probe under 200ms even on dense windows. Coordinates convert from global screen space to window-local by subtracting windowOrigin.
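A bounded walk with the limits quoted above might look like the following sketch. It runs over a hypothetical in-memory tree rather than real AXUIElement calls, lists only a subset of the actionable roles, and the names Node, Point, and collectMarks are illustrative.

```swift
struct Point { var x: Double; var y: Double }

// Hypothetical stand-in for an AX element; `origin` is in global screen space.
struct Node {
    let role: String
    let origin: Point
    var children: [Node] = []
}

// Walks the tree with a depth cap and a node cap, returning window-local
// origins for actionable elements. Containers are walked into, not marked.
func collectMarks(root: Node, windowOrigin: Point,
                  maxDepth: Int = 24, maxNodes: Int = 1500) -> [Point] {
    let actionableRoles: Set<String> = ["AXButton", "AXTextField", "AXLink", "AXCheckBox"] // subset
    var marks: [Point] = []
    var visited = 0
    var stack: [(node: Node, depth: Int)] = [(node: root, depth: 0)]

    while let entry = stack.popLast(), visited < maxNodes {
        visited += 1
        if actionableRoles.contains(entry.node.role) {
            // Global screen space to window-local: subtract the window origin.
            marks.append(Point(x: entry.node.origin.x - windowOrigin.x,
                               y: entry.node.origin.y - windowOrigin.y))
        }
        if entry.depth < maxDepth {
            for child in entry.node.children {
                stack.append((node: child, depth: entry.depth + 1))
            }
        }
    }
    return marks
}

let window = Node(role: "AXWindow", origin: Point(x: 100, y: 50), children: [
    Node(role: "AXGroup", origin: Point(x: 100, y: 50), children: [
        Node(role: "AXButton", origin: Point(x: 120, y: 80))
    ])
])
let windowMarks = collectMarks(root: window, windowOrigin: Point(x: 100, y: 50))
print(windowMarks.count)
```

Capping both depth and node count keeps the worst case predictable even when an app exposes a pathologically deep or wide tree.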
Shared MarkRenderer (Harness/Platforms/MarkRenderer.swift) scales mark rects from point space to image space internally — iOS and macOS hand it point-space marks against pixel-resolution captures and the renderer does the math. Web's per-element overlay code now lives in the shared helper too; one annotation pipeline across all three platforms.
tap_mark is now in the tool schema for all three platforms. The cycle detector (AgentLoop.recordPostStep) gained equivalence rules for tap_mark (same id), scroll, navigate, back/forward/refresh, rightClick, keyShortcut, and fillCredential — same-id taps in a row trip the detector the same way same-coordinate taps did.
Smart settle gates on iOS + macOS
Fixed-sleep settle ("wait 150ms after every tool") routinely captured screens mid-animation, costing the agent a wasted step every time. 0.5 replaces the sleep with screenshot-stability gating on iOS and macOS — poll captures at 150ms cadence and accept the gate once two consecutive screenshots dHash within Hamming-distance 5.
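To make the stability check concrete, here is a sketch of a difference hash (dHash) and the Hamming-distance comparison. It assumes the screenshot has already been downsampled to a 9×8 grayscale grid (that step is omitted) and treats "within Hamming-distance 5" as inclusive; both are assumptions, and the function names are hypothetical.

```swift
// Difference hash over a 9x8 grayscale grid: one bit per horizontally
// adjacent pixel pair, 64 bits in total.
func dHash(_ pixels: [[UInt8]]) -> UInt64 {
    precondition(pixels.count == 8 && pixels.allSatisfy { $0.count == 9 })
    var hash: UInt64 = 0
    for row in pixels {
        for x in 0..<8 {
            hash <<= 1
            if row[x] > row[x + 1] { hash |= 1 }
        }
    }
    return hash
}

// Number of differing bits between two hashes.
func hammingDistance(_ a: UInt64, _ b: UInt64) -> Int {
    (a ^ b).nonzeroBitCount
}

// Two consecutive captures count as stable when their hashes are close enough.
func isSettled(previous: UInt64, current: UInt64, threshold: Int = 5) -> Bool {
    hammingDistance(previous, current) <= threshold
}

let frameA: [[UInt8]] = Array(repeating: [90, 80, 70, 60, 50, 40, 30, 20, 10], count: 8)
var frameB = frameA
frameB[0] = [10, 20, 30, 40, 50, 60, 70, 80, 90]   // one row flips mid-animation

print(hammingDistance(dHash(frameA), dHash(frameB)))  // prints 8
print(isSettled(previous: dHash(frameA), current: dHash(frameB)))  // prints false
```

Because the hash encodes brightness gradients rather than raw pixels, small rendering noise leaves it unchanged while a still-moving animation flips enough bits to keep the gate closed.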
Per-tool profiles balance latency vs. correctness:
- Tap — idle 250ms, min 250ms, max 2000ms.
- Swipe — idle 400ms, min 400ms, max 3000ms (longer max for momentum scroll).
- Key shortcut / right click — same as tap.
The web platform already had a MutationObserver-based DOM-quietness gate from 0.3; it now also handles SPA route transitions correctly via a requireChildListMutation flag (React Suspense keeps the old DOM mounted during route changes, so "idle 200ms after click" was firing on a stale page).
HarnessCLI — development-time driver
New xcodegen target produces harness-cli, a development-time command-line driver that shares the entire Harness/ source root with the GUI app and runs against WebDriver / IOSPlatformAdapter / MacAppDriver end-to-end. The same RunCoordinator, RunLogger, and event stream the GUI consumes — just no SwiftUI.
harness-cli \
--platform web \
--url https://alanwizemann.com \
--goal "Find Alan's most recent article and tell me what it's about in your own words" \
--persona "A curious first-time user" \
--provider local \
--model qwen3-vl:8b \
--output ./test-run \
--max-steps 15
Web, iOS, and macOS all supported via --platform web|ios|macos. iOS needs --project-path + --scheme + --simulator-udid; macOS needs --app-path and runs Screen Recording + Accessibility preflight checks. Cloud credentials come from env vars first (ANTHROPIC_API_KEY / OPENAI_API_KEY / GOOGLE_API_KEY) with a system Keychain fallback — the GUI's saved keys work for the CLI binary too, so you don't have to re-stash them in your shell.
HARNESS_DUMP_MARKED=1 writes the marked PNG next to every step capture for debugging probe coverage.
harness-cli is development-only — not Developer-ID signed, not notarised, builds locally from this repo via xcodebuild -scheme HarnessCLI build. See the HarnessCLI wiki page for the full reference.
Fixes
- WebContent log flood silenced. The off-screen (-10_000, -10_000) window placement triggered WebKit's aggressive volatile-layer scheduling — every snapshot tick failed to mark layers as volatile and emitted WebProcess::markAllLayersVolatile: Failed to the unified log multiple times per second. Window is now placed at (0, 0) with alphaValue = 0, ignoresMouseEvents = true, and level = NSWindow.Level.normal - 1 — visually invisible, but WebKit sees a "real" on-screen window and doesn't try to free its layers. Live-mirror poller also dropped from 3fps to 1fps.
- simctl screenshot exit-code flakes tolerated. Rapid back-to-back captures occasionally produce a complete PNG on disk but exit non-zero. The driver now checks for a valid PNG header (89 50 4E 47) at the expected path and treats the capture as successful if the file looks right; only fails the screenshot if both the file is missing/short AND a 200ms-later retry also fails.
- WDA waitForReady timeout bumped 45s → 120s. iOS 26.2 simulators occasionally take 60-90s to stand up the WebDriverAgent session on first run; the old timeout was tripping during steady-state warm boots.
- Non-persistent WKWebsiteDataStore for web runs. Every run starts with a clean slate — no cookies, no localStorage, no IndexedDB. Two reasons: (1) reproducibility — SPAs storing theme/locale/dismissed-banner state would render differently across runs; (2) "what a fresh user sees" — Harness is a UX testing tool, the agent's screenshots are most informative when they reflect a first-time visitor's experience. Logged-in flows are still handled via fill_credential(field:).
- NSAppearance bound to system Dark Mode preference, not the host app. The GUI binary may render itself in a different mode than macOS is set to; the agent's screenshots should match what the user's system is set to. AppleInterfaceStyle read from UserDefaults.standard to resolve.
- Per-turn user-message reminder. LLMShared.currentTurnInstruction(annotation:) prefixes every step's user message with the three behavior rules ("tap a text field first to focus it before calling type", "prefer tap_mark over tap when marks are available", "call mark_goal_done when the goal is genuinely complete"). Used by all four LLM clients. Helps smaller cloud models AND local models stay on-track without bloating the system prompt.
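The PNG-header tolerance described in the simctl screenshot fix above can be sketched as follows. The function names are hypothetical, the 200ms retry is left out, and treating a zero exit code as success without inspecting the file is an assumption of this sketch.

```swift
import Foundation

// True when the data starts with the four signature bytes 89 50 4E 47
// and is longer than the signature alone.
func looksLikePNG(_ data: Data) -> Bool {
    let magic: [UInt8] = [0x89, 0x50, 0x4E, 0x47]
    return data.count > magic.count && data.prefix(magic.count).elementsEqual(magic)
}

// Accepts a capture when the tool exited cleanly, or when it exited
// non-zero but still left a plausible PNG at the expected path.
func captureSucceeded(exitCode: Int32, fileURL: URL) -> Bool {
    if exitCode == 0 { return true }
    guard let data = try? Data(contentsOf: fileURL) else { return false }
    return looksLikePNG(data)
}

let sampleURL = FileManager.default.temporaryDirectory
    .appendingPathComponent("harness-sketch-\(UUID().uuidString).png")
try Data([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]).write(to: sampleURL)
print(captureSucceeded(exitCode: 1, fileURL: sampleURL))  // prints true
```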
Architecture notes
- Shared MarkRenderer at Harness/Platforms/MarkRenderer.swift — single annotation pipeline across iOS, macOS, and web. InteractiveMarkst...
Harness v0.3.1
Harness 0.3.1 — Clean replay screenshots and a paired Compose Run header
A point release that polishes two of the rough edges 0.3.0 left exposed.
Highlights
Set-of-Mark badges off-disk
0.3.0's web targeting overlay drew small green numbered badges over every focusable element in the screenshot — and saved that marked-up image to disk. The agent loved it (no more "tapped y=228 when the input was at y=242"). Everyone else had to explain the dev-tool clutter every time they shared a screenshot in a bug report or walked a designer through a run.
0.3.1 splits the pipeline:
- Disk PNG = the clean rendered page. Replay, friction reports, and any exported screenshot show what a real user would see.
- Agent payload = the marked-up copy. Rendered in-memory, never written to disk. The web driver returns the bytes via a new ScreenshotMetadata.markedImageData field; RunCoordinator routes them to the LLM call and only the LLM call.
The lastMarks cache (which tap_mark(id) resolves against) keeps doing its job — that's what makes id → element mapping possible across turns. iOS and macOS pass markedImageData = nil today and inherit the same split the moment they grow accessibility-tree-based SOM (tracked under "Set-of-Mark targeting on iOS + macOS" on the wiki Roadmap).
Standard 14 §6 documents the new invariant: no agent scaffolding on disk.
Compose Run pairs Persona + Credential
Both sections answer the same question — who's running this run? — so they now sit side-by-side instead of stacked. The form is one row shorter, and the pairing makes it visually obvious that the credential picker is part of the persona-shaping decision rather than a separate concern.
ViewThatFits falls back to a single column on narrow windows automatically. When no credentials are staged for the active Application, the credential pane self-hides as before and Persona expands to fill the row via .frame(maxWidth: .infinity) — no special-case branching.
Everything stays inside the existing HarnessDesign token system: matching PanelContainer headers, Theme.spacing.l between columns, top-aligned so a long persona blurb doesn't drag the credential picker down.
Maintenance
- feat/web-mirror-redesign (a now-abandoned stacked-PR attempt left over from 0.3.0) cleaned up.
Compatibility
- macOS 14+
- Apple Silicon (universal)
- Notarized + signed with Developer ID
- Run-log schema version stays at 3 — the marked/unmarked split is purely a local-rendering concern and never lands in JSONL. Existing 0.3.0 run records, Applications, Personas, Action chains, and credentials load unchanged.
- SwiftData stays at V5.
Harness v0.3.0
Harness 0.3.0 — Credentials, Set-of-Mark targeting, and a web mirror that fills the column
0.2 cracked the model path open across three providers. 0.3 cracks the agent's targeting and authentication open: pre-stage credentials against any Application, click form elements by numbered badge instead of pixel, and watch the live mirror render the full middle pane instead of an iPad-shaped device bezel.
Highlights
Per-Application credential storage
Pre-stage one or more (label, username, password) triples against an Application from its detail panel. Pick a credential at run time on Compose Run; the agent gets a new tool — fill_credential(field: "username" | "password") — on iOS, macOS, and web.
The contract on password handling, with no escape hatches:
- No password in the JSONL. The agent's tool_call.input for fill_credential is exactly {"field": "password"} — never the value. The driver synthesises the typed text from a CredentialBinding it caches in memory and never serialises.
- No password in the model's context. The system prompt's new {{CREDENTIALS}} block lists label + username only. The agent knows the credential exists and how to fill it; the password is invisible to it.
- Screenshots rely on platform secure-text-entry. iOS SecureField, macOS NSSecureTextField, and HTML <input type="password"> all mask the value visually. We accept that an unusual SUT not using secure-text-entry could leak a password into a captured PNG; the create-credential UI documents the limit.
Storage split: SwiftData rows carry (id, applicationID, label, username, createdAt); passwords sit in macOS Keychain under service: "com.harness.credentials", account "<applicationID>:<credentialID>", via existing KeychainStoring extensions. Even an unencrypted backup of history.store carries no secret material.
New friction kind: auth_required. The agent emits this when it hits a login wall and has nothing to fill — the friction report sections it distinctly from dead_end so a "this surface needs auth-bypass" run is visible at a glance.
Set-of-Mark targeting (web)
Vision-language models miss small click targets by a handful of pixels. Inputs are typically 50px tall; the model picks y=228 when the input is at y=242–290; nothing happens; the agent retries; the run burns steps re-targeting.
Every screenshot now overlays a small numbered badge on each focusable element currently visible in the viewport — form fields, action buttons, dropdowns, checkboxes, custom-role widgets. The agent calls tap_mark(id) and the WebDriver resolves to the element's center via a cached (id → rect) map. Pixel guesswork eliminated for marked targets.
Selection is deliberately tight: marks go on things where pixel precision matters — not every link or generic [tabindex] element. Plain text links (<a href>) are skipped because they're typically large enough that coordinate-only tapping is reliable; the homepage doesn't need 60 numbered boxes.
Probe pierces open shadow roots so inputs nested inside custom elements (modern signin / payment widgets) get marks. Closed shadow roots and cross-origin iframes stay invisible; that's a platform limit.
The PNG saved to disk is the marked-up image — replay shows the agent's view exactly. A run is now readable by element identity ("agent tapped mark 4 → Small radio") instead of coordinate triangulation.
iOS and macOS get the same treatment via accessibility-tree probes (WDA and AX respectively) in a follow-up; tracked on the wiki Roadmap.
Web mirror — full-column rendering
Web runs no longer render inside an iPad-shaped device bezel. The mirror now shows a flat browser chrome at the top — URL pill, lock glyph, back / forward / refresh affordances, loading spinner — and the screenshot fills the rest of the column.
Two related changes lower per-run API spend:
- Default viewport bumped to 1280×1600. Taller snapshots mean fewer scrolls per goal — measurable reduction in agent turns on long pages.
- Dynamic viewport-height-to-canvas-aspect. The configured 1280 CSS-pixel width stays as the layout trigger (so the page renders desktop-wide), but the height scales to the canvas aspect at run time so the snapshot fills the column without letterbox AND without forcing a narrow / mobile responsive layout.
Both happen via a tiny MainActor LiveWebMirror registry — the live WebMirrorView measures its canvas, the WebDriver resizes the WKWebView to match. Replay reads the saved snapshot and renders it 1:1 in the chrome.
Loop & form correctness
Three classes of "the typed value disappeared" / "the run wedged" failure are gone:
- React-aware dispatchType. Setting el.value = ... directly bypasses React's value tracker; React re-renders to its own internal state and the typed text vanishes. WebDriver now resolves the native setter via Object.getOwnPropertyDescriptor(prototype, 'value').set and calls it with .call(el, value) — the standard pattern every browser test framework uses to drive React inputs. Same fix applies to fill_credential.
- Click-target focus routing. document.elementFromPoint returns the topmost element, which on modern signin forms is usually a wrapper <div> or styled <label> — not the <input>. div.focus() is a no-op; label.focus() doesn't focus the associated input. After dispatching click events, we now walk the click target to find the focusable input (direct match → <label> via htmlFor / contained input → querySelector descendant → closest() ancestor) and focus it explicitly. Subsequent type / fill_credential writes to the right activeElement.
- Multi-tool emissions accepted. The system prompt always read "exactly one tool call ... optionally accompanied by one or more note_friction calls"; the three LLM-client parsers were rejecting any blocks.count > 1. Each parser now splits action vs note_friction, requires one action, and forwards frictions through LLMStepResponse.inlineFriction → AgentDecision.inlineFriction → RunCoordinator (which logs each one as a friction row). Cheaper models that naturally pair "I'm flagging this" with "and trying X" no longer wedge the run.
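The action-versus-friction split from the last item can be sketched roughly as below. ToolBlock, StepDecision, and ParseError are hypothetical simplifications of the real parser types, and tool inputs are reduced to string dictionaries.

```swift
// Hypothetical, simplified view of one tool-use block in a model response.
struct ToolBlock {
    let name: String
    let input: [String: String]
}

enum ParseError: Error, Equatable {
    case noAction
    case multipleActions(Int)
}

struct StepDecision {
    let action: ToolBlock
    let inlineFriction: [ToolBlock]
}

// Separates note_friction blocks from action blocks and requires exactly one action.
func splitBlocks(_ blocks: [ToolBlock]) throws -> StepDecision {
    let frictions = blocks.filter { $0.name == "note_friction" }
    let actions = blocks.filter { $0.name != "note_friction" }
    guard let action = actions.first else { throw ParseError.noAction }
    guard actions.count == 1 else { throw ParseError.multipleActions(actions.count) }
    return StepDecision(action: action, inlineFriction: frictions)
}

let stepDecision = try splitBlocks([
    ToolBlock(name: "note_friction", input: ["detail": "Label is ambiguous"]),
    ToolBlock(name: "tap_mark", input: ["id": "4"])
])
print(stepDecision.action.name, stepDecision.inlineFriction.count)  // prints "tap_mark 1"
```

Two action blocks in one response still fail parsing in this sketch, which matches the "requires one action" rule above.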
Architecture
- SwiftData V5. New @Model Credential with @Relationship(deleteRule: .cascade) var credentials: [Credential] on Application. V4 frozen by copying its file-scope @Model types into the HarnessSchemaV4 enum's nested types — the established convention for a shape change. Lightweight v4→v5 migration; existing V4 stores reopen with credentials == [].
- RunHistoryStore → @ModelActor. The actor was constructing ModelContext(container) from a sync init, which Swift's strict concurrency correctly flagged at runtime: "ModelContexts are not Sendable. Consider using a ModelActor." The macro now generates a nonisolated let modelExecutor: any ModelExecutor and binds the actor's unownedExecutor to it, so every isolated method runs on the queue the ModelContext was created on. Migration-failure recovery (delete-and-retry) lifted to a private static helper outside the actor.
- Run-log schema v3. RunStartedPayload gains optional credentialLabel + credentialUsername (decodeIfPresent so v2 logs round-trip cleanly). tool_call.input shape for fill_credential and tap_mark documented. Parser accepts v1, v2, v3.
- Friction taxonomy. New FrictionKind.authRequired synced across the five sites the friction-vocab standard requires.
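The decodeIfPresent approach mentioned for run-log schema v3 can be illustrated with a cut-down payload. Only credentialLabel and credentialUsername come from the notes above; the goal field and the sample JSON are hypothetical, and the real RunStartedPayload carries more fields.

```swift
import Foundation

// Cut-down sketch of a run-started payload that tolerates older logs.
struct RunStartedPayload: Decodable {
    let goal: String                  // hypothetical field for the sketch
    let credentialLabel: String?      // added in v3
    let credentialUsername: String?   // added in v3

    enum CodingKeys: String, CodingKey {
        case goal, credentialLabel, credentialUsername
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        goal = try container.decode(String.self, forKey: .goal)
        // decodeIfPresent yields nil when a v2 log omits the key entirely.
        credentialLabel = try container.decodeIfPresent(String.self, forKey: .credentialLabel)
        credentialUsername = try container.decodeIfPresent(String.self, forKey: .credentialUsername)
    }
}

let v2Line = #"{"goal": "Sign up"}"#
let v3Line = #"{"goal": "Sign in", "credentialLabel": "QA account", "credentialUsername": "qa@example.com"}"#

let v2Payload = try JSONDecoder().decode(RunStartedPayload.self, from: Data(v2Line.utf8))
let v3Payload = try JSONDecoder().decode(RunStartedPayload.self, from: Data(v3Line.utf8))
print(v2Payload.credentialLabel ?? "none", v3Payload.credentialLabel ?? "none")  // prints "none QA account"
```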
Tests
223 unit tests passing (was 218 in 0.2). New / extended suites:
- SwiftDataMigrationTests — V4→V5 migration test (Applications gain empty credentials relation), V5 round-trip with two staged credentials, V5 cascade-delete from Application removes credentials.
- KeychainStoreTests — credential password round-trip uses per-credential (applicationID:credentialID) account keying, empty / whitespace password write rejected.
- AgentToolsSchemaTests — extended for fill_credential and tap_mark membership in iOS / web tool sets.
- RunLogParserV2Tests — parser now accepts v3, throws schemaVersionUnsupported for v4+.
Known limits
- Set-of-Mark is web-only today. iOS and macOS still rely on coordinate-only tap(x, y). Tracked on the wiki Roadmap as "Set-of-Mark targeting on iOS + macOS".
- 2FA / human-in-the-loop interrupts are unsupported. Runs that hit SMS verification, CAPTCHAs, or push-approval flows block at that screen. The recommended path is to use test accounts without 2FA. Tracked on the Roadmap as a generic request_user_input(reason, secret) tool.
- eBay-style hostile DOMs may still defeat probe-based marking. Closed shadow roots and cross-origin auth iframes are platform-impossible to introspect; the agent falls back to coordinate-based targeting in those cases.
- Web is still WebKit-only. Chrome / CDP support remains on the roadmap.
- macOS still needs Screen Recording permission. First run prompts; subsequent runs are silent.
Compatibility
- macOS 14+
- Apple Silicon (universal)
- Notarized + signed with Developer ID
- Existing 0.2 run records, Applications, Personas, and Action chains load unchanged. V5 migration runs once at first launch (one-shot, transparent).
Harness v0.2.1
Harness 0.2.1 — First-run WebDriverAgent build works for everyone
A patch release that fixes the first issue filed against Harness on GitHub: the first-run WebDriverAgent build pointed at /Users/alanwizemann/... on every machine that wasn't mine. Thanks to @impiri for the report (#1).
Fix
HarnessPaths.wdaSourceURL was resolved from a $SRCROOT path baked into the binary at build time. That works on the developer's own Mac and nowhere else — for everyone who installed from the 0.2.0 release zip, the first-run wizard's "Build WebDriverAgent" button errored out with a path that didn't exist on their disk.
0.2.1 ships the WebDriverAgent submodule inside the .app bundle (Contents/Resources/WebDriverAgent) and resolves it from Bundle.main.resourceURL first, falling back to the $SRCROOT path only for dev-mode runs from Xcode. The bundled WDA snapshot SHA is baked at app build time so the on-disk build cache stays valid across launches without needing git to resolve a HEAD on the user's machine.
Net effect for users:
- First-run "Build WebDriverAgent" button works immediately after install.
- Cache hits on the second launch — first-build cost is paid once per Harness app version, not once per launch.
- No public API or contract changes; existing run records load unchanged.
Compatibility
- macOS 14+
- Apple Silicon (universal)
- Notarized + signed with Developer ID
- Existing 0.2 run records, Applications, Personas, and Action chains load unchanged.
Harness v0.2.0
Harness 0.2.0 — Multi-provider LLM, configurable budgets, persistent defaults
The 0.1 release shipped the platform plumbing (iOS Simulator, macOS apps, web) wired to a single Anthropic-only model path. 0.2 cracks that path open: seven supported models across three providers, with cost rails, persistence, and unlimited-step support so the agent loop can run as long as the goal needs.
Highlights
Multi-provider LLM support
Pick any of three providers in Settings, then a model from that provider. Compose Run can override per-run.
| Provider | Models | Notes |
|---|---|---|
| Anthropic | Opus 4.7, Sonnet 4.6, Haiku 4.5 (new) | Same cache_control ephemeral caching as 0.1 |
| OpenAI | GPT-5 Mini, GPT-4.1 Nano (new) | Automatic prompt caching at ≥1024 tokens (50% off) |
| Google | Gemini 2.5 Flash, Gemini 2.5 Flash Lite (new) | Implicit caching on 2.5+ models (90% off) |
Each provider gets its own Keychain entry (com.harness.anthropic, com.harness.openai, com.harness.google). Add keys in Settings; the per-provider status indicator confirms what's wired up.
Per-model token budgets
The legacy model == .opus47 ? 250_000 : 1_000_000 ternary is gone. Every model has a justified default and a hard ceiling:
| Model | Default | Max |
|---|---|---|
| Opus 4.7 | 250k | 1M |
| Sonnet 4.6 | 1M | 3M |
| Haiku 4.5 | 2M | 10M |
| GPT-5 Mini, GPT-4.1 Nano | 2M | 10M |
| Gemini 2.5 Flash, Flash Lite | 2M | 10M |
Override globally in Settings (applies to every run regardless of model) or per-run on Compose Run's Advanced section. The resolved value clamps to the active model's max so a 5M override on a cheap model can't follow you when you switch to Opus mid-form.
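The clamp behaviour can be expressed in a few lines. The defaults and ceilings below are copied from the table above; the AgentModel enum, the function name, and the precedence (per-run override, then global override, then model default) are assumptions made for this sketch.

```swift
// Hypothetical model enum carrying the defaults and ceilings from the table.
enum AgentModel {
    case opus47, sonnet46, haiku45, gpt5Mini, gpt41Nano, gemini25Flash, gemini25FlashLite

    var defaultTokenBudget: Int {
        switch self {
        case .opus47: return 250_000
        case .sonnet46: return 1_000_000
        default: return 2_000_000
        }
    }

    var maxTokenBudget: Int {
        switch self {
        case .opus47: return 1_000_000
        case .sonnet46: return 3_000_000
        default: return 10_000_000
        }
    }
}

// Picks the most specific override, then clamps to the active model's ceiling.
func resolvedTokenBudget(model: AgentModel, perRun: Int? = nil, global: Int? = nil) -> Int {
    let requested = perRun ?? global ?? model.defaultTokenBudget
    return min(requested, model.maxTokenBudget)
}

print(resolvedTokenBudget(model: .haiku45, global: 5_000_000))  // prints 5000000
print(resolvedTokenBudget(model: .opus47, global: 5_000_000))   // prints 1000000
```

Resolving at use time rather than at save time is what lets the same stored override behave safely after a model switch.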
Unlimited steps
Toggle in Settings, Compose Run Advanced, or any Application's defaults. The token budget + cycle detector remain the safety rails — unlimited steps is not unlimited cost.
Persistent defaults
Settings now actually persist across launches. In 0.1, only the active Application id was saved; everything else (default model, mode, step budget, simulator visibility) reset every launch. 0.2 saves all of them via an extended PersistedSettings structure (legacy settings.json files decode cleanly).
Real screenshot thumbnails in the step feed
The step feed's right column previously showed a static gradient placeholder. Now it renders the captured screenshot for each step, sized aspect-aware so portrait iOS shots and landscape macOS / web shots both look right.
Loop hardening
Cheaper models (GPT-4.1 Nano, Gemini Flash Lite, sometimes Haiku) misbehave more than Opus did. Three changes keep the loop resilient:
- Multi-tool responses — a model emitting >1 tool call now throws invalidToolCall instead of silently dropping the rest.
- Zero-tool responses — a model punting to plain text instead of calling a tool now goes through the parse-retry path with a corrective hint, instead of failing the run.
- Retry-detail propagation — on retry, the prior parse error is prepended to the user message so the model sees what went wrong. The previous loop retried blind and small models would loop on the same mistake until the cap.
Architecture
- New LLMShared enum centralizes the canonical-tool-name → typed ToolCall decode used by every client (Anthropic / OpenAI / Gemini).
- ToolSchema refactored to CanonicalTool + per-provider shape translators (anthropicShape, openAIShape, geminiShape). The Gemini translator uppercases JSON Schema types and strips additionalProperties so the strict OpenAPI parser accepts our schemas.
- LLMClientFactory.client(for:keychain:) dispatches per request.model.provider. Each run gets a fresh client so token-usage accounting and the cycle-detector window reset correctly.
- ClaudeError renamed to LLMError (provider-neutral messages). Existing call sites updated; no deprecation alias.
- LLMStepRequest.platformKind is now plumbed through, so each client picks the right canonical tool set per platform — fixes a latent bug where macOS / web runs always advertised the iOS tool set.
- TokenUsage.thinkingTokens (telemetry) — surfaces reasoning-token counts from GPT-5 / Gemini 2.5 / extended-thinking Claude calls without double-counting in cost math.
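The Gemini translation described above (uppercase the type names, drop additionalProperties) can be sketched as a recursive transform over a schema dictionary. This is an illustrative function that happens to reuse the geminiShape name; the real translator's signature and coverage of JSON Schema keywords are not documented here.

```swift
// Recursively rewrites a JSON Schema dictionary into the stricter shape:
// "type" values are uppercased and "additionalProperties" is removed.
func geminiShape(_ schema: [String: Any]) -> [String: Any] {
    var out: [String: Any] = [:]
    for (key, value) in schema {
        switch key {
        case "additionalProperties":
            continue   // dropped entirely
        case "type":
            if let type = value as? String {
                out[key] = type.uppercased()
            } else {
                out[key] = value
            }
        case "properties":
            // Keys here are property names, so only the values are translated.
            guard let props = value as? [String: Any] else { out[key] = value; continue }
            var translated: [String: Any] = [:]
            for (name, sub) in props {
                if let subSchema = sub as? [String: Any] {
                    translated[name] = geminiShape(subSchema)
                } else {
                    translated[name] = sub
                }
            }
            out[key] = translated
        case "items":
            if let item = value as? [String: Any] {
                out[key] = geminiShape(item)
            } else {
                out[key] = value
            }
        default:
            out[key] = value
        }
    }
    return out
}

let idSchema: [String: Any] = ["type": "integer"]
let tapMarkSchema: [String: Any] = [
    "type": "object",
    "additionalProperties": false,
    "properties": ["id": idSchema] as [String: Any],
    "required": ["id"]
]
let geminiTapMark = geminiShape(tapMarkSchema)
print(geminiTapMark["type"] ?? "")  // prints OBJECT
```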
Tests
218 unit tests passing (was 175 in 0.1). New / extended suites:
- ToolSchemaShapesTests — Anthropic + OpenAI + Gemini shape translators
- LLMSharedToolCallTests — canonical decode coverage
- OpenAIClientTests + OpenAIClientRequestShapeTests — wire-format round trip
- GeminiClientTests + GeminiClientRequestShapeTests + ToolSchemaGeminiShapeTests — wire-format + OpenAPI gotchas
- AgentLoopRetryHintTests — parse-failure detail propagation
- RunCoordinatorReplayTests.unlimitedStepBudgetSkipsShortCircuit — drives 50 steps with unlimited budget
- PersistedSettingsTests — round-trip + legacy-file + nil-roundtrip
- AgentModelTokenBudgetTests — per-model lookup sanity, resolution, and clamping
Known limits / scoped out
- Per-Application token budget — defaultStepBudget exists at the per-Application level (in Application SwiftData @Model), but defaultTokenBudget doesn't yet. Adding it requires a SwiftData V4→V5 migration; deferred to a future release. Settings + per-run overrides cover most use cases in the meantime.
- Web is still WebKit-only. Chrome / CDP support remains on the roadmap.
- macOS still needs Screen Recording permission. First run prompts; subsequent runs are silent.
Standards updates
- standards/07-ai-integration.md §7 — token-budget resolution chain + per-model table
- standards/13-agent-loop.md §3 — stepBudget == 0 sentinel for unlimited
- standards/14-run-logging-format.md — clarified stepBudget and tokenBudget semantics
Compatibility
- macOS 14+
- Apple Silicon (universal)
- Notarized + signed with Developer ID
- Existing 0.1 run records, Applications, Personas, and Action chains load unchanged. Existing settings.json decodes cleanly (the new fields default to nil and AppState falls back to its property initializers — Settings re-saves them on first edit).
Harness v0.1.0
Harness 0.1.0 — Multi-platform alpha
The first tagged release. All three target platforms — iOS Simulator, macOS apps, and web apps — are wired end-to-end. Per-app setting; same Compose-Run flow for all three; same replay + friction artefacts.
What it does
Write a goal in plain language ("Sign up and create my first list", "Find a vegetarian restaurant near me and save it"). Pick a persona ("first-time user", "returning power user", "keyboard-first"). Hit Start. An LLM agent reads screenshots, drives the UI, and reports:
- Did the goal complete? (success / failure / blocked + summary)
- What was the path? (replayable timeline of every screen + action)
- Where was the friction? (timestamped events the agent flagged as confusing)
Targets
| Kind | How |
|---|---|
| iOS Simulator | xcodebuild your project + scheme; simctl boot/install/launch; WebDriverAgent for input. |
| macOS app | NSWorkspace launch (pre-built .app or xcodebuild macOS scheme); CGEvent for clicks / scroll / keyboard / shortcuts; CGWindowListCreateImage for window capture. |
| Web app | Embedded WKWebView at any CSS-pixel viewport; JS-synthesised events for input; WKWebView.takeSnapshot for capture. |
Architecture
- Harness/Platforms/ — PlatformKind discriminator, UXDriving + PlatformAdapter protocols, per-platform adapters.
- RunCoordinator dispatches through PlatformAdapterFactory.make(for:services:). The agent loop reads its tool schema from adapter.toolDefinitions(...); the system prompt's {{PLATFORM_CONTEXT}} block loads from docs/PROMPTS/platforms/<kind>.md.
- Run history, replay, and friction events are platform-neutral — JSONL events are the same shape regardless of target.
- SwiftData V4 schema: Application.platformKindRaw + per-platform optional fields (mac bundle path, web URL + viewport).
What's in this build
- 175 unit tests passing.
- macOS 14+ minimum.
- Apple Silicon (universal).
- Notarized + signed with Developer ID.
Known limits
- Web is WebKit only. A future opt-in CDP-backed driver (Chrome) is on the roadmap. Browser-chrome shortcuts (Cmd+L, Cmd+T) won't fire — that's a runtime limit, not a UX problem to flag.
- macOS needs Screen Recording permission. First run prompts; subsequent runs are silent.
- iOS first build is 1–2 minutes while WebDriverAgent compiles for your simulator runtime. Cached after that.
- Personas are shared across platforms. Built-in iOS / macOS / web personas exist (docs/PROMPTS/personas/<kind>-defaults.md); the picker doesn't filter by platform yet — pick anything sensible. Filtering is a follow-up if it gets cluttered.
Setup
- Install Homebrew.
- brew tap facebook/fb && brew install idb-companion (only needed for iOS).
- Get an Anthropic API key.
- Open Harness → first-run wizard walks you through API key + WebDriverAgent build.
Acknowledgements
Built with Claude (Anthropic) — agent loop runs against the public Messages API. Thanks to Appium for WebDriverAgent, the iOS responder-chain bridge that makes simulator input actually fire UIKit events.
Full source + docs: https://github.com/awizemann/harness
Wiki (architecture, services, agent loop): https://github.com/awizemann/harness/wiki
Issues: https://github.com/awizemann/harness/issues