macOS 14+ · Apple Silicon & Intel · ~12 MB
awizemann.github.io/harness · Wiki · All releases
A native macOS developer tool that drives an iOS Simulator, a macOS app, or a web app with an AI agent so you can run user tests — not scripted UI tests, but real-user simulation.
You write a goal in plain language ("I want to sign up and create my first list", "delete my account", "find a vegetarian restaurant near me and save it") and a persona ("first-time user, never seen this app"). Harness builds (or just launches) your target, and an LLM agent reads screenshots, clicks/types/scrolls, and pursues the goal — narrating what it sees, flagging UX friction (dead ends, ambiguous labels, unresponsive controls), and stopping when it succeeds, fails, or would give up.
Three artifacts come out of every run:
- Did the goal complete? — success / failure / blocked + summary
- What was the path? — replayable sequence of screens + actions
- Where was the friction? — timestamped events the agent flagged as confusing
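The three artifacts can be pictured as one record per run. This is a hypothetical sketch — the field names are invented for illustration and are not Harness's actual schema:

```javascript
// Invented shape tying the three artifacts together: outcome + summary,
// a replayable path of screens and actions, and timestamped friction events.
const run = {
  outcome: "blocked", // success / failure / blocked
  summary: "Sign-up form demands a phone number the persona refuses to give.",
  path: [ // replayable sequence of screens + actions
    { screenshot: "001.png", action: { tool: "tap_mark", id: 4 } },
  ],
  friction: [ // events the agent flagged as confusing
    { at: "00:37", kind: "dead_end", note: "No way back from the phone step." },
  ],
};
console.log(run.outcome); // "blocked"
```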
| Kind | How Harness drives it |
|---|---|
| iOS Simulator | `xcodebuild` your project + scheme; `simctl` boot/install/launch; WebDriverAgent for input. |
| macOS app | `NSWorkspace` launch (pre-built `.app` or `xcodebuild` macOS scheme); `CGEvent` for input; `CGWindowListCreateImage` for capture. |
| Web app | Embedded `WKWebView` at a chosen viewport (default 1280×1600 tall desktop, or 375×812 mobile); JS-synthesised events for input; `WKWebView.takeSnapshot` for capture. The mirror shows a flat browser chrome (no device bezel) so the screenshot fills the full pane and one snapshot covers more page — fewer scrolls per goal, lower API cost. |
Per-app setting: each Application declares its kind once at create time. The agent's tool schema (clicks vs swipes vs key shortcuts vs navigate) and the system-prompt context block re-shape per platform. Run history, replay, and friction reporting are platform-neutral.
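As a rough picture of how a per-platform tool schema might be assembled — the structure below is invented for illustration (the tool names are drawn from this README, but Harness's real schema lives in the Swift app, not in code like this):

```javascript
// Invented sketch: platform-specific action tools plus platform-neutral
// bookkeeping tools, selected by the Application's declared kind.
const commonTools = ["note_friction", "fill_credential", "finish"];
const platformTools = {
  ios:   ["tap", "swipe", "type"],                          // WebDriverAgent input
  macos: ["click", "type", "key_shortcut"],                 // CGEvent input
  web:   ["tap_mark", "tap", "type", "scroll", "navigate"], // JS-synthesised events
};
const toolSchema = (kind) => [...platformTools[kind], ...commonTools];

console.log(toolSchema("web")); // web gets navigate; iOS gets swipe instead
```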
Status: v0.3.1 (alpha). All three platforms are wired end-to-end. Highlights: per-Application credential storage; Set-of-Mark targeting on web (numbered overlays on focusable elements, so the agent clicks by id instead of guessing pixels; the marked-up image is agent-only and never written to disk); multi-provider LLM support (Anthropic Opus 4.7 / Sonnet 4.6 / Haiku 4.5, OpenAI GPT-5 Mini / GPT-4.1 Nano, Google Gemini 2.5 Flash / Flash Lite); per-provider Keychain storage; configurable per-model token budgets; and an unlimited-step option. macOS needs the Screen Recording permission. Web is WebKit-only; Chrome via CDP is on the roadmap. See `docs/ROADMAP.md`.
- Set-of-Mark badges no longer leak into human-visible surfaces. The disk PNG is the clean rendered page — replay, friction reports, and exported screenshots show what a real user would see. The agent still receives the marked-up image (numbered green badges over focusable elements) via an in-memory `ScreenshotMetadata.markedImageData` channel; the on-disk artifact stays free of dev-tool clutter. Standard 14 §6 documents the new "no agent scaffolding on disk" invariant.
- Compose Run pairs Persona + Credential side-by-side. Both sections answer "who's running this?", so they read as one row instead of two stacked panels. Saves vertical scroll; falls back to a single column on narrow windows via `ViewThatFits`. When no credentials are staged, Persona expands to fill the row naturally.
- Per-Application credential storage. Pre-stage username/password pairs against an Application; pick one per run in Compose Run. The agent gets a new `fill_credential(field: "username" | "password")` tool for iOS, macOS, and web. Password bytes never enter the model's context, the JSONL log, or any prompt template — `tool_call.input` for password fills records `{"field": "password"}` and nothing else. New friction kind `auth_required` covers the "agent hit a login wall and has nothing to fill" case.
- Set-of-Mark targeting (web). Every screenshot now overlays small numbered badges on focusable elements (form fields, buttons, dropdowns, checkboxes). The agent calls `tap_mark(id)` and the WebDriver resolves it to the element's center — no more "agent picked y=228, input was at y=242" misses. Coordinate `tap(x, y)` stays available for unmarked content. The probe pierces open shadow roots, so inputs in modern sign-in / payment widgets get marks. iOS / macOS get the same treatment in a follow-up via accessibility-tree probes (tracked on the wiki Roadmap).
- Web mirror reworked. Replaced the iPad-shaped device bezel with a flat browser chrome (URL pill, lock glyph, back/forward/refresh affordances), so web runs use the full middle column. Default viewport bumped to 1280×1600 — taller snapshots mean fewer scroll turns, which translates directly to lower API spend per run.
- React-aware form fill. `dispatchType` now uses the native value setter via `Object.getOwnPropertyDescriptor`, so React's value tracker actually sees the change and re-renders won't reset typed text. The same fix applies to `fill_credential`. Click-target focus routing now walks `<label>`s, wrappers, and shadow children to focus the actual input, not the styled `<div>` on top of it.
- Multi-tool emissions accepted. The system prompt always allowed "exactly one tool call ... optionally accompanied by one or more `note_friction` calls"; the parsers were rejecting anything over one block. Each provider's parser now splits the action from `note_friction` and forwards inline frictions through `AgentDecision.inlineFriction` → JSONL friction rows.
- Run-log schema v3. The `run_started` payload gains optional `credentialLabel` + `credentialUsername` (decode-if-present, so v2 logs round-trip). Standards doc §5 documents the v2→v3 migration and the three credential-redaction invariants.
- `RunHistoryStore` adopts `@ModelActor`. Eliminates the "Unbinding from the main queue. ModelContexts are not Sendable" runtime warning that Swift's strict concurrency was right to flag.
- Seven supported models across three providers. Pick a provider in Settings, then a model. Compose Run can override per-run. Each provider has its own Keychain entry; swap mid-session without restart.
- Per-model token budgets. The legacy "Opus → 250k, else 1M" ternary is gone — every model has a justified default and a hard ceiling, configurable globally in Settings and per-run in Compose Run.
- Unlimited steps. Toggle in Settings, Compose Run, or Application defaults. The token budget + cycle detector remain the safety rails.
- Settings persist across launches. Default provider, model, mode, step + token budgets, and simulator visibility all survive a restart now (they didn't in 0.1).
- Real screenshot thumbnails in the step feed, sized to each platform's aspect ratio.
- Loop hardening for cheaper models. Multi-tool / zero-tool / parse-failure responses now surface a corrective hint to the model on retry instead of failing the run silently.
- 218 unit tests passing (was 175 in 0.1).
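The Set-of-Mark idea above can be sketched without a browser. This is a simplified, DOM-free model — in a real page the rects would come from `getBoundingClientRect()` on the marked elements, and the mark table would be rebuilt for every screenshot:

```javascript
// Simplified model of Set-of-Mark targeting: number the focusable elements,
// then resolve tap_mark(id) to the element's center instead of trusting a
// model-guessed pixel coordinate.
const marks = new Map(); // id → element rect, rebuilt per screenshot

function assignMarks(elements) {
  marks.clear();
  elements.forEach((el, i) => marks.set(i + 1, el));
}

function tapMark(id) {
  const el = marks.get(id);
  if (!el) throw new Error(`no mark ${id} on this screenshot`);
  // Center of the bounding box — immune to small off-by-a-few-pixels guesses.
  return { x: el.x + el.width / 2, y: el.y + el.height / 2 };
}

assignMarks([
  { label: "email",  x: 100, y: 220, width: 280, height: 36 },
  { label: "submit", x: 100, y: 270, width: 120, height: 40 },
]);
console.log(tapMark(1)); // { x: 240, y: 238 }
```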
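The native-value-setter trick behind the React-aware form fill is worth seeing in miniature. The sketch below is a DOM-free model of the problem: `FakeInput` stands in for `HTMLInputElement` (whose real `value` accessor lives on the prototype), and `attachTracker` mimics a React-style instance-level override that swallows direct writes. In a real page you would take the setter from `HTMLInputElement.prototype` and follow the write with a bubbling `input` event so the framework re-reads the value:

```javascript
// Stand-in for HTMLInputElement: the real value accessor lives on the prototype.
class FakeInput {
  #value = "";
  get value() { return this.#value; }
  set value(v) { this.#value = v; }
}

// Stand-in for a React-style value tracker: shadows `value` on the instance
// and drops writes it didn't make itself.
function attachTracker(input) {
  const proto = Object.getPrototypeOf(input);
  const nativeGet = Object.getOwnPropertyDescriptor(proto, "value").get;
  Object.defineProperty(input, "value", {
    get() { return nativeGet.call(this); },
    set(_v) { /* swallowed: the tracker expects to drive the value itself */ },
    configurable: true,
  });
}

// The fix: call the *prototype* setter directly, bypassing the instance shadow.
// (In a browser, follow with input.dispatchEvent(new Event("input", { bubbles: true })).)
function setNativeValue(input, text) {
  const setter = Object.getOwnPropertyDescriptor(
    Object.getPrototypeOf(input), "value").set;
  setter.call(input, text);
}

const input = new FakeInput();
attachTracker(input);
input.value = "lost";            // instance setter swallows it
setNativeValue(input, "hello");  // prototype setter lands
console.log(input.value);        // "hello"
```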
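The credential-redaction invariant can also be sketched. The function names and shapes below are invented for illustration; the point they demonstrate is the one the changelog states: the agent's tool input carries only a field name, the secret is resolved app-side, and the JSONL row echoes the agent's input verbatim, so password bytes have no path into the log or the model context:

```javascript
// Invented sketch of the fill_credential flow under the redaction invariant.
const stagedCredential = { username: "demo@example.com", password: "s3cret" }; // Keychain-backed in the real app

function fillCredential(input, typeInto, log) {
  const secret = stagedCredential[input.field]; // resolved app-side, never shown to the model
  typeInto(secret);                             // e.g. synthesised key events
  log.push({ tool: "fill_credential", input }); // input is {"field":"password"} and nothing else
}

const log = [];
let typed = "";
fillCredential({ field: "password" }, (s) => { typed = s; }, log);
console.log(JSON.stringify(log[0]));
// {"tool":"fill_credential","input":{"field":"password"}}
```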
Harness vendors appium/WebDriverAgent as a git submodule under `vendor/WebDriverAgent` (it's how we drive the iOS Simulator's responder chain). The Xcode project is generated from `project.yml` via `xcodegen`.
```sh
git clone https://github.com/awizemann/harness.git
cd harness
git submodule update --init --recursive
brew install xcodegen
xcodegen generate
open Harness.xcodeproj
```

You'll also need `idb_companion` for simulator control:

```sh
brew tap facebook/fb && brew install idb-companion
```

The first run builds WDA against your simulator's iOS runtime (~1–2 min). The result is cached under `~/Library/Application Support/Harness/wda-build/<iOS-version>/` and reused on subsequent runs.
Full setup: see Build-and-Run on the Wiki.
- `standards/INDEX.md` — development, code, and architecture standards. Read these before adding code.
- GitHub Wiki — "where things live, why, and how to extend them." Maintained per PR alongside code.
- `docs/ARCHITECTURE.md` — system architecture overview.
- `docs/ROADMAP.md` — build order and milestones.
- `docs/PROMPTS/` — canonical agent prompts (loaded as a bundle resource at runtime).
- `HarnessDesign/` — design system tokens, primitives, and screen layouts.
PRs welcome. Read CONTRIBUTING.md first — it covers setup, the architecture rules (MVVM-F, Swift 6 strict concurrency, single subprocess actor), and the public-surfaces sync rule (code changes that affect README / wiki / site update them in the same PR).
MIT — see LICENSE.