CodematicBench (cmb) is a local-first framework for evaluating AI coding agents on repository-scale tasks. It runs agents against real repos, captures outcomes, measures code changes and test results, and stores runs for later comparison.
The project is implemented in Go and currently supports opencode, claude-code, codex, aider, and kiro through their respective CLIs.
- Runs one agent against a task definition
- Runs the same task multiple times to measure variance
- Runs multiple agents against the same task for side-by-side comparison
- Executes work in isolated git worktrees by default
- Stores results in SQLite for later inspection
This repo is usable as an alpha-stage benchmarking harness. The core runner, task/config formats, sandboxing, metrics, and result storage are implemented. Some surrounding docs still describe older command shapes; the sections below reflect the current CLI.
- Go 1.24+
- Git
- At least one supported agent CLI installed and available in PATH
Depending on which agent you use, you may also need provider credentials such as ANTHROPIC_API_KEY, OPENAI_API_KEY, or AWS Bedrock credentials.
Build the CLI:
    go build -o cmb ./cmd/cmb
Run a single task:
    ./cmb run --agent codex --task task/test-simple.yaml
Run one agent multiple times:
    ./cmb run --agent codex --task task/test-simple.yaml --runs 3
Compare multiple agents on the same task:
    ./cmb run \
      --agent codex \
      --agent claude-code \
      --task task/test-simple.yaml \
      --runs 3
View saved results:
    ./cmb results --last 10
The current CLI has two primary commands:
- cmb run: execute one or more agents on a task
- cmb results: query previously saved results
Comparison is handled through cmb run by passing multiple --agent flags and/or a --runs count.
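For readers curious how a repeatable flag such as --agent can be collected, here is a minimal sketch built on the standard library's `flag.Value` interface. It is not cmb's actual argument parsing, which may use a different library. `stringList`, `parseRunArgs`, and the default of one run are assumptions made for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// stringList collects every occurrence of a repeated flag.
// It implements flag.Value.
type stringList []string

func (s *stringList) String() string { return strings.Join(*s, ",") }

func (s *stringList) Set(v string) error {
	*s = append(*s, v)
	return nil
}

// parseRunArgs parses --agent (repeatable), --task, and --runs.
// The default of 1 for --runs is an assumption for this sketch.
func parseRunArgs(args []string) (agents []string, task string, runs int, err error) {
	fs := flag.NewFlagSet("run", flag.ContinueOnError)
	var list stringList
	fs.Var(&list, "agent", "agent to run (repeatable)")
	fs.StringVar(&task, "task", "", "path to the task YAML file")
	fs.IntVar(&runs, "runs", 1, "number of runs per agent")
	if err = fs.Parse(args); err != nil {
		return nil, "", 0, err
	}
	if len(list) == 0 || task == "" {
		return nil, "", 0, fmt.Errorf("at least one --agent and a --task are required")
	}
	if runs < 1 {
		return nil, "", 0, fmt.Errorf("--runs must be >= 1, got %d", runs)
	}
	return list, task, runs, nil
}

func main() {
	args := []string{"--agent", "codex", "--agent", "claude-code",
		"--task", "task/test-simple.yaml", "--runs", "3"}
	agents, task, runs, err := parseRunArgs(args)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, a := range agents {
		fmt.Printf("agent=%s task=%s runs=%d\n", a, task, runs)
	}
}
```

Collecting the flag into a slice keeps the single-agent and multi-agent cases on the same code path, which matches the idea that comparison is a mode of run and not a separate command.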
cmd/cmb/       CLI entry point
pkg/agent/     Agent integrations
pkg/runner/    Task execution and sandboxing
pkg/task/      Task loading and validation
pkg/config/    Agent configuration loading and defaults
pkg/metrics/   Aggregation and reporting
pkg/storage/   SQLite persistence
task/          Example benchmark tasks
config/        Example agent configurations
Tasks are YAML files that point at a repo, provide instructions, and define evaluation commands.
    name: "add-pagination"
    language: "go"
    repo: "./test-repos/chi"
    instructions: |
      Add cursor-based pagination to the relevant API endpoints.
    evaluation:
      run_tests: "go test ./..."
      check_diff: true
    timeout: 600s
See task/README.md for the full task format.
Agent configs live in config/ and let you tune model choice, prompts, and agent-specific settings.
    ./cmb run --agent codex --task task/test-simple.yaml --config config/codex-default.yaml
See config/README.md for details.
An Agent Execution Contract (AEC) is the practical set of rules that governs how an agent operates in a given environment:
- Permitted capabilities
- Observable state
- Required preconditions
- State transitions
- Invariants
CodematicBench is intended to help you test those constraints empirically by changing the setup, rerunning the task, and observing what changed.
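One simple way to observe what changed between two setups is to compare pass rates per setup, as in the sketch below. The `RunOutcome` type, the setup labels, and the outcomes are invented for illustration and do not reflect cmb's stored schema or any real results.

```go
package main

import (
	"fmt"
	"sort"
)

// RunOutcome is a hypothetical record of one run under a named setup.
type RunOutcome struct {
	Setup  string // label for the environment or constraint set used
	Passed bool   // whether the task's evaluation succeeded
}

// PassRates returns the fraction of passing runs for each setup.
func PassRates(outcomes []RunOutcome) map[string]float64 {
	total := make(map[string]int)
	passed := make(map[string]int)
	for _, o := range outcomes {
		total[o.Setup]++
		if o.Passed {
			passed[o.Setup]++
		}
	}
	rates := make(map[string]float64, len(total))
	for setup, n := range total {
		rates[setup] = float64(passed[setup]) / float64(n)
	}
	return rates
}

func main() {
	// Made-up outcomes purely to exercise the function.
	outcomes := []RunOutcome{
		{Setup: "network-allowed", Passed: true},
		{Setup: "network-allowed", Passed: true},
		{Setup: "network-blocked", Passed: true},
		{Setup: "network-blocked", Passed: false},
	}
	rates := PassRates(outcomes)

	// Sort the labels so the report order is stable.
	setups := make([]string, 0, len(rates))
	for s := range rates {
		setups = append(setups, s)
	}
	sort.Strings(setups)
	for _, s := range setups {
		fmt.Printf("%s: %.0f%% passed\n", s, rates[s]*100)
	}
}
```

Changing one constraint at a time between setups makes a difference in pass rate easier to attribute to that constraint.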