Verity tells you the truth about your app.
Model-agnostic AI-powered UI automation. Write tests in plain English. Trigger them with GitHub webhooks. Get comprehensive HTML reports with screenshots, AI reasoning, and pass/fail verdicts.
Works with Anthropic Claude, OpenAI GPT, and any OpenAI-compatible endpoint — including local models via Ollama, LM Studio, vLLM, or llama.cpp.
specs/login.spec.md ──▶ GitHub webhook / API / CLI
│
▼
LLM agent (vision + tool use)
Claude · GPT-4o · Ollama · ...
│
▼
Playwright (Chromium)
│
▼
HTML report + JSON + SQLite history
Most UI test frameworks force you to write brittle CSS selectors. This one lets the AI see the page (vision) and decide what to do — so your tests look like a tester's notebook, not a fragile XPath dump.
- Click the "Sign in" button.
- Type "alice@example.com" into the email field.
- Verify the dashboard shows "Welcome, Alice".

That's it. No selectors, no waits, no element IDs. The agent figures it out from a screenshot + accessibility tree.
- Plain-English specs: Markdown files with `## Steps` and `## Expectations` bullet lists.
- Vision-driven: Every action returns a screenshot to the model (Claude by default), which verifies the result before the next step.
- Webhook triggers: GitHub `push` and `pull_request` events route to spec sets via tags (`smoke`, `pr`).
- Manual API: `POST /runs {"specId": "..."}` to fire a run.
- CLI: `pnpm test specs/login.spec.md` for local runs.
- Comprehensive reports: Per-step screenshots, AI reasoning traces, expectation verification, token usage. Self-contained HTML.
- Persistent history: SQLite-backed run database with a built-in dashboard at `/`.
- Prompt caching: System prompt and tool definitions cached across runs for cost efficiency.
Requires Node.js 20+ and pnpm (corepack enable pnpm).
# 1. Install
pnpm install
pnpm install-browsers # downloads Chromium
# 2. Configure
cp .env.example .env
# Edit .env and set ANTHROPIC_API_KEY (or LLM_PROVIDER=openai with OPENAI_API_KEY)
# 3. Run a sample test
pnpm test specs/example-login.spec.md
# 4. Or start the server
pnpm server
# → Dashboard: http://localhost:3000

Drop a markdown file into `specs/`:
---
name: Checkout flow
baseUrl: https://shop.example.com
tags: [smoke, pr]
---
## Description
Verify a logged-in user can complete checkout with a saved card.
## Steps
- Open the home page.
- Click on the first product card.
- Click the "Add to cart" button.
- Click the cart icon in the header.
- Click "Checkout".
- Click "Place order".
## Expectations
- The order confirmation page should display an order number.
- The page should show the product that was added.
- A confirmation email notice should appear.

The frontmatter (`name`, `baseUrl`, `tags`) is optional. Steps and expectations are bullet lists in plain English.
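To make the spec format concrete, here is a minimal sketch of how such a file could be split into frontmatter, steps, and expectations. It is an illustration only: `parseSpec` and `ParsedSpec` are hypothetical names, frontmatter values are kept as raw strings instead of being parsed as YAML, and the real parser in `src/specs/loader.ts` may work differently.

```typescript
// Hypothetical shape; the project's own types live in src/types.ts.
interface ParsedSpec {
  meta: Record<string, string>; // raw frontmatter values, not parsed as YAML
  steps: string[];
  expectations: string[];
}

function parseSpec(markdown: string): ParsedSpec {
  const meta: Record<string, string> = {};
  let body = markdown;

  // Optional frontmatter: a leading block delimited by `---` lines.
  const fm = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?/.exec(markdown);
  if (fm) {
    body = markdown.slice(fm[0].length);
    for (const line of (fm[1] ?? "").split(/\r?\n/)) {
      const idx = line.indexOf(":");
      // Split on the first colon only, so URLs in values stay intact.
      if (idx > 0) meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
    }
  }

  const steps: string[] = [];
  const expectations: string[] = [];
  let target: string[] | null = null; // list that bullets are currently added to

  for (const raw of body.split(/\r?\n/)) {
    const line = raw.trim();
    const heading = /^##\s+(.+)$/.exec(line);
    if (heading) {
      const name = (heading[1] ?? "").trim().toLowerCase();
      target = name === "steps" ? steps : name === "expectations" ? expectations : null;
    } else if (target && /^[-*]\s+/.test(line)) {
      target.push(line.replace(/^[-*]\s+/, ""));
    }
  }
  return { meta, steps, expectations };
}
```

Bullets under any other heading (such as `## Description`) are ignored in this sketch, and a file without frontmatter parses the same way with an empty `meta`.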
pnpm test specs/login.spec.md # one spec
pnpm test list # list specs
pnpm test all --base-url https://staging.example.com # all specs

curl -X POST http://localhost:3000/runs \
-H 'Content-Type: application/json' \
-d '{"specId": "example-login"}'

In your GitHub repo: Settings → Webhooks → Add webhook

- Payload URL: `https://your-host/webhooks/github`
- Content type: `application/json`
- Secret: same value as `GITHUB_WEBHOOK_SECRET` in `.env`
- Events: `push` and `pull_request`
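When a secret is configured, GitHub signs each delivery and sends an HMAC-SHA256 of the raw request body in the `X-Hub-Signature-256` header, formatted as `sha256=<hex digest>`. The sketch below shows one way to check that header using Node's built-in `crypto` module. The function name `verifyGithubSignature` is hypothetical; the actual handler in `src/triggers/github.ts` may be structured differently.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Returns true only if the header matches the HMAC of the raw body.
// `rawBody` must be the exact bytes GitHub sent, not re-serialized JSON.
function verifyGithubSignature(
  rawBody: string,
  signatureHeader: string | undefined,
  secret: string,
): boolean {
  if (!signatureHeader || !signatureHeader.startsWith("sha256=")) return false;

  const expected =
    "sha256=" + createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");

  const received = Buffer.from(signatureHeader, "utf8");
  const wanted = Buffer.from(expected, "utf8");

  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === wanted.length && timingSafeEqual(received, wanted);
}
```

The constant-time comparison avoids leaking how many leading characters of a forged signature were correct.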
Spec selection by event:
- push to `main` → specs tagged `smoke`
- `pull_request` → specs tagged `pr`
- otherwise → all specs
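The routing rules above can be sketched as a small pure function. This is an illustration, not the project's code: `selectSpecs` and `SpecSummary` are hypothetical names, and it assumes the branch is identified by the `ref` field of GitHub's push payload (`refs/heads/main`).

```typescript
// Hypothetical minimal view of a loaded spec.
interface SpecSummary {
  id: string;
  tags: string[];
}

// Picks which specs to run for a GitHub event.
// `event` is the value of the X-GitHub-Event header; `ref` comes from the push payload.
function selectSpecs(
  event: string,
  ref: string | undefined,
  specs: SpecSummary[],
): SpecSummary[] {
  let tag: string | null = null;
  if (event === "push" && ref === "refs/heads/main") tag = "smoke";
  else if (event === "pull_request") tag = "pr";

  // Any other event (or a push to another branch) falls through to "all specs".
  if (tag === null) return specs;

  const wanted = tag;
  return specs.filter((spec) => spec.tags.includes(wanted));
}
```

A spec tagged `[smoke, pr]` is selected in both cases, which is why the checkout example earlier carries both tags.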
Four named profiles ship out of the box: anthropic, openai, ollama, groq. The agent's behavior, tool surface, and reports are identical across all four — only the backend changes.
Provider resolution order, highest priority first:

- CLI flag, wins everything: `pnpm test specs/foo.md --provider ollama`
- Spec frontmatter, per-test override: `provider: groq` in the YAML header
- `LLM_PROVIDER` env var, your default
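As a rough sketch of that precedence, assuming the three sources arrive as optional strings: `resolveProvider` is a hypothetical helper name, and the fallback to `anthropic` mirrors the documented `LLM_PROVIDER` default.

```typescript
type Provider = "anthropic" | "openai" | "ollama" | "groq";

const PROVIDERS: readonly string[] = ["anthropic", "openai", "ollama", "groq"];

function isProvider(value: string): value is Provider {
  return PROVIDERS.includes(value);
}

// CLI flag beats spec frontmatter, which beats the LLM_PROVIDER env var.
function resolveProvider(
  cliFlag?: string,
  frontmatter?: string,
  envVar?: string,
): Provider {
  const chosen = cliFlag ?? frontmatter ?? envVar ?? "anthropic";
  if (!isProvider(chosen)) {
    throw new Error(`Unknown provider "${chosen}"; expected one of ${PROVIDERS.join(", ")}`);
  }
  return chosen;
}
```

Failing loudly on an unknown name is a design choice of this sketch; silently falling back to the default would hide typos such as `--provider olama`.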
# Run one spec on Ollama for a quick local check
pnpm test specs/login.spec.md --provider ollama
# Run another on Groq for sub-second cloud inference
pnpm test specs/checkout.spec.md --provider groq
# Per-spec override in frontmatter (no flag needed)
# ---
# name: Pricing page
# provider: anthropic # always run this one on Claude
# ---

Best vision quality, prompt caching across runs, adaptive thinking. Use this for CI signal you trust.
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
ANTHROPIC_MODEL=claude-opus-4-7 # or claude-sonnet-4-6, claude-haiku-4-5
ANTHROPIC_EFFORT=high # low | medium | high | max

LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4o

The same `openai` profile works for any OpenAI-compatible endpoint (LM Studio, vLLM, llama.cpp server, OpenRouter, Together): just change `OPENAI_BASE_URL` and `OPENAI_MODEL`.
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434/v1
OLLAMA_MODEL=qwen2.5vl
OLLAMA_API_KEY=ollama # any string; Ollama ignores it

Setup:
brew install ollama # or download from https://ollama.com
ollama pull qwen2.5vl
ollama serve # exposes http://localhost:11434
pnpm test specs/login.spec.md --provider ollama

Sub-second responses, generous free tier. Best for high-volume smoke runs where speed beats marginal quality.
LLM_PROVIDER=groq
GROQ_API_KEY=gsk_...
GROQ_BASE_URL=https://api.groq.com/openai/v1
GROQ_MODEL=llama-3.3-70b-versatile

Get a key at https://console.groq.com.
| Use case | Recommended provider |
|---|---|
| CI on every PR, want trusted signal | anthropic (Sonnet 4.6 or Opus 4.7) |
| Local dev, no internet, no spend | ollama (qwen2.5vl) |
| High-volume smoke runs, latency matters | groq (llama-3.3-70b-versatile) |
| Need OpenAI-specific feature or already have OpenAI billing | openai (gpt-4o) |
| Switch per-spec: heavy specs on Claude, smoke specs on Groq | mix via frontmatter provider: |
Note on local / open-weight models: UI testing requires both vision and reliable structured tool use. As of late 2026, `qwen2.5vl` (7B+) is the best free path for vision-capable tool use. For tool use only (no vision), Groq's `llama-3.3-70b-versatile` is cheap and very fast. Hosted Claude or GPT-4o remains the most reliable for production CI.
- Spec parser reads the markdown file and extracts steps + expectations.
- Browser session boots a Chromium instance via Playwright.
- Agent loop sends each turn to the configured LLM. The agent sees the current screenshot and decides which tool to call:
  - `navigate(url)`
  - `click(target, role?)`
  - `fill(target, value)`
  - `press_key(key)`
  - `wait(ms | for_text)`
  - `scroll(direction)`
  - `observe()`: refresh screenshot
  - `finish_test(passed, summary, expectations_checked, failure_reason?)`: verdict
- Each tool result includes a fresh screenshot fed back to the agent so it can verify what just happened.
- Run record is persisted to SQLite; the HTML report is written to `reports/<run_id>/index.html` with all screenshots inlined.
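The agent loop described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the `Agent`, `Browser`, `Observation`, and `Verdict` interfaces are hypothetical stand-ins for the LLM provider and the Playwright session, only three of the tools are modeled, and the real orchestration in `src/runner/orchestrator.ts` and `src/agent/` may differ.

```typescript
// A reduced tool surface; the real agent exposes more tools.
type ToolCall =
  | { name: "navigate"; url: string }
  | { name: "click"; target: string }
  | { name: "finish_test"; passed: boolean; summary: string };

// What the agent sees after each action.
interface Observation {
  screenshot: string; // e.g. a base64-encoded PNG
  note: string;
}

// Stand-in for the configured LLM: picks the next tool from the latest observation.
interface Agent {
  next(observation: Observation): Promise<ToolCall>;
}

// Stand-in for the browser session: runs a tool and returns a fresh screenshot.
interface Browser {
  run(call: ToolCall): Promise<Observation>;
}

interface Verdict {
  passed: boolean;
  summary: string;
  turns: number;
}

async function runAgentLoop(
  agent: Agent,
  browser: Browser,
  maxSteps = 40, // mirrors the documented MAX_AGENT_STEPS default
): Promise<Verdict> {
  let observation: Observation = { screenshot: "", note: "start" };

  for (let turn = 1; turn <= maxSteps; turn++) {
    const call = await agent.next(observation);

    // finish_test ends the run with the agent's own verdict.
    if (call.name === "finish_test") {
      return { passed: call.passed, summary: call.summary, turns: turn };
    }

    // Every other tool yields a new screenshot for the next turn.
    observation = await browser.run(call);
  }

  // The agent never called finish_test: treat it as a failure.
  return { passed: false, summary: `Aborted after ${maxSteps} agent turns`, turns: maxSteps };
}
```

Keeping the LLM and the browser behind two small interfaces is what lets a loop like this be exercised with scripted stubs, without a model or a real Chromium instance.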
src/
├── server.ts # Express server + dashboard
├── cli.ts # Command-line runner
├── config.ts # Env-backed configuration
├── types.ts # Shared types
├── specs/loader.ts # Markdown spec parser
├── runner/
│ ├── browser.ts # Playwright session wrapper
│ └── orchestrator.ts # Coordinates one run end-to-end
├── agent/
│ ├── prompts.ts # System + user prompt builders
│ ├── tools.ts # Browser tool definitions + dispatcher
│ ├── executor.ts # Provider dispatcher (LLM-agnostic entry point)
│ └── providers/
│ ├── types.ts # LLMProvider interface
│ ├── anthropic.ts # Native Claude provider (vision + caching + thinking)
│ └── openai-compat.ts # OpenAI / Ollama / LM Studio / vLLM / ...
├── reports/generator.ts # HTML + JSON report generator
├── storage/db.ts # SQLite persistence
└── triggers/github.ts # GitHub webhook handler + signature verify
specs/ # Plain-English test specs (markdown)
data/ # Runtime state: SQLite DB + screenshots
reports/ # Generated HTML reports
| Env var | Default | Description |
|---|---|---|
| `LLM_PROVIDER` | `anthropic` | `anthropic`, `openai`, `ollama`, or `groq` |
| `ANTHROPIC_API_KEY` | (required if anthropic) | Claude API key |
| `ANTHROPIC_MODEL` | `claude-opus-4-7` | Claude model ID |
| `ANTHROPIC_EFFORT` | `high` | low / medium / high / max |
| `OPENAI_API_KEY` | (required if openai cloud) | OpenAI key (or any string for local servers) |
| `OPENAI_BASE_URL` | `https://api.openai.com/v1` | OpenAI-compatible endpoint |
| `OPENAI_MODEL` | `gpt-4o` | Model name on the chosen endpoint |
| `PORT` | `3000` | Server port |
| `HOST` | `0.0.0.0` | Server bind host |
| `GITHUB_WEBHOOK_SECRET` | (empty) | If set, webhooks must include a valid HMAC signature |
| `TARGET_BASE_URL` | (empty) | Default base URL for specs without one |
| `HEADLESS` | `true` | Run browser headless |
| `BROWSER_TIMEOUT_MS` | `30000` | Default Playwright timeout |
| `MAX_AGENT_STEPS` | `40` | Max agent turns before forced abort |
| `DATA_DIR` | `./data` | DB + screenshots |
| `REPORTS_DIR` | `./reports` | Generated reports |
| `SPECS_DIR` | `./specs` | Test specs |
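For illustration, here is one way a handful of these variables could be read with their documented defaults. This is a sketch, not the contents of `src/config.ts`: `loadConfig`, `AppConfig`, and the field names are hypothetical, and only a subset of the variables is covered.

```typescript
// Hypothetical config shape covering a subset of the documented variables.
interface AppConfig {
  llmProvider: string;
  port: number;
  host: string;
  headless: boolean;
  browserTimeoutMs: number;
  maxAgentSteps: number;
  dataDir: string;
  reportsDir: string;
  specsDir: string;
}

type Env = Record<string, string | undefined>;

// Reads an integer variable, falling back when it is unset or blank.
function intFrom(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number.parseInt(raw, 10);
  if (Number.isNaN(parsed)) throw new Error(`${key} must be an integer, got "${raw}"`);
  return parsed;
}

function loadConfig(env: Env): AppConfig {
  return {
    llmProvider: env["LLM_PROVIDER"] ?? "anthropic",
    port: intFrom(env, "PORT", 3000),
    host: env["HOST"] ?? "0.0.0.0",
    // Anything other than the literal "false" keeps the browser headless.
    headless: (env["HEADLESS"] ?? "true").toLowerCase() !== "false",
    browserTimeoutMs: intFrom(env, "BROWSER_TIMEOUT_MS", 30000),
    maxAgentSteps: intFrom(env, "MAX_AGENT_STEPS", 40),
    dataDir: env["DATA_DIR"] ?? "./data",
    reportsDir: env["REPORTS_DIR"] ?? "./reports",
    specsDir: env["SPECS_DIR"] ?? "./specs",
  };
}
```

In a Node process this would typically be called as `loadConfig(process.env)`; taking the environment as a parameter keeps the function easy to test.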
pnpm typecheck # tsc --noEmit
pnpm check # Biome: lint + format + import-organize check
pnpm check:fix # Biome: auto-fix everything safe
pnpm lint # Biome: lint only
pnpm format # Biome: format only
pnpm lint:md # markdownlint-cli2 over all *.md
pnpm lint:md:fix # markdownlint-cli2 with --fix
pnpm verify # Full local CI: typecheck + biome check + markdownlint

Tooling is wired to:

- Biome: single binary for lint, format, and import organization. Replaces ESLint + Prettier. Config in `biome.json`.
- markdownlint-cli2: markdown style consistency. Config in `.markdownlint-cli2.jsonc`.
- pnpm: fast, disk-efficient package manager. Locked at `pnpm@9.x` via the `packageManager` field; `corepack` will pick the right version automatically.
- Be specific in expectations. "The page should look right" is checkable but vague — "The header should show 'Welcome, Alice'" is better.
- Use the language a human would use. "Click the Sign in button" beats "Click button.btn-primary".
- Keep tests focused. One flow per spec. Add a second spec for an alternative path.
- Tag for routing. `tags: [smoke]` runs on every push to main; `tags: [pr]` runs on every PR.
Each test step costs ~1 LLM API call (input: prompt + screenshot ≈ 2K tokens, output ≈ 200 tokens). Approximate cost of a 10-step test by provider:
| Provider | Model | ~Cost per 10-step test |
|---|---|---|
| Anthropic | Opus 4.7 | $0.05 – $0.15 |
| Anthropic | Sonnet 4.6 | $0.03 – $0.09 |
| Anthropic | Haiku 4.5 | $0.01 – $0.03 |
| OpenAI | GPT-4o | $0.04 – $0.12 |
| OpenAI | GPT-4o-mini | $0.005 – $0.015 |
| Ollama (local) | qwen2.5vl, llava | $0 (compute only) |
Prompt caching on Anthropic further reduces cost on repeated runs.
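The per-test figures follow from simple arithmetic: a 10-step test at roughly 2K input and 200 output tokens per step uses about 20,000 input tokens and 2,000 output tokens. The sketch below turns that into a cost estimate. The helper `estimateRunCost` is hypothetical, and the prices in the example are placeholders chosen for easy arithmetic, not any provider's real rate card; substitute current per-million-token prices before relying on the result.

```typescript
// Prices in USD per million tokens.
interface Price {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Estimates the cost of one test run, assuming one LLM call per step
// and the per-step token counts quoted above as defaults.
function estimateRunCost(
  steps: number,
  price: Price,
  inputTokensPerStep = 2000,
  outputTokensPerStep = 200,
): number {
  const inputCost = ((steps * inputTokensPerStep) / 1e6) * price.inputPerMTok;
  const outputCost = ((steps * outputTokensPerStep) / 1e6) * price.outputPerMTok;
  return inputCost + outputCost;
}

// Placeholder prices for illustration only (NOT a real rate card).
const placeholderPrice: Price = { inputPerMTok: 1, outputPerMTok: 4 };

// 20,000 input tokens at $1/M plus 2,000 output tokens at $4/M: about $0.028.
const exampleCost = estimateRunCost(10, placeholderPrice);
console.log(exampleCost.toFixed(3));
```

This ignores prompt caching and any growth in input size as conversation history accumulates, so treat it as a rough estimate.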
MIT