Open-source eval runtime for coding agents.
marginlab.ai · Documentation
Margin Eval is an orchestrator for running evals against CLI agents such as Claude Code, Codex, Gemini CLI, OpenCode, and Pi. It measures accuracy, token usage, and runtime, and captures full execution traces, all in a standardized, reproducible local format.
- Test any configuration: agents, models, MCPs, skills, prompting strategies
- Compare side-by-side: unified CLI, config format, and output across all agents
- Reproduce any run: every run is compiled into an immutable, self-contained bundle
- Resume on failure: automatically retry infra failures without re-running completed cases
- Extend: define your own agents, test suites, and configurations
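The resume feature can also be driven from a script once Margin is installed. The sketch below starts a run with `--output` and then calls `margin run --resume-from` on the same directory until the run succeeds or the attempts run out. It assumes that `margin run` exits non-zero when a run does not complete, which these docs do not confirm, and `run_with_resume` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: start a run once, then resume it until it succeeds or attempts run out.
# Assumption (not confirmed by the docs): `margin run` exits non-zero when a
# run does not complete. run_with_resume is a hypothetical helper name.
run_with_resume() {
  out_dir=$1        # run directory, passed to --output and --resume-from
  max_resumes=$2    # how many times to resume after the initial attempt
  shift 2           # remaining arguments go to the first `margin run` unchanged

  if margin run "$@" --output "$out_dir"; then
    return 0
  fi

  attempt=1
  while [ "$attempt" -le "$max_resumes" ]; do
    echo "run incomplete; resuming (attempt $attempt of $max_resumes)" >&2
    if margin run --resume-from "$out_dir"; then
      return 0
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Example (needs margin installed and credentials configured):
# run_with_resume ./runs/codex-terminal-bench-2 3 \
#   --suite git::https://github.com/Margin-Lab/swe-suites.git//terminal-bench-2 \
#   --agent-config ~/.margin/configs/example-agent-configs/codex-unified \
#   --eval ~/.margin/configs/example-eval-configs/default.toml
```

Capping the number of resumes keeps a persistent failure from looping forever.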
Prerequisites:
- Docker installed and running
- An API key or OAuth credentials for your agent provider
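Both prerequisites can be checked from a shell before installing. This is a minimal sketch: `docker_ready` and `have_key` are hypothetical helper names, and the two variables checked are the ones used in the examples in this README. A missing key is not necessarily a problem, since OAuth credentials are an alternative.

```shell
#!/bin/sh
# Sketch of a preflight check for the two prerequisites listed above.
# docker_ready and have_key are hypothetical helper names.

# Succeeds only if the docker CLI exists and the daemon answers.
docker_ready() {
  command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1
}

# Succeeds if the named environment variable is exported and non-empty.
have_key() {
  [ -n "$(printenv "$1")" ]
}

if docker_ready; then echo "docker: ok"; else echo "docker: not running"; fi
for name in ANTHROPIC_API_KEY GEMINI_API_KEY; do
  if have_key "$name"; then echo "$name: set"; else echo "$name: not set"; fi
done
```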
Install the latest stable release:
curl -fsSL https://raw.githubusercontent.com/Margin-Lab/evals/main/scripts/install.sh | bash

Check that your installation is ready to run an eval:
margin --version
margin check

Update an installer-managed binary:
margin update

Dry-run your first eval (no token usage; tests still run):
margin run \
--suite git::https://github.com/Margin-Lab/test-suites.git//swe-minimal-test-suite \
--agent-config ~/.margin/configs/example-agent-configs/codex-unified \
--eval ~/.margin/configs/example-eval-configs/default.toml \
--dry-run

Run your first eval using an API key:
export ANTHROPIC_API_KEY="<API_KEY>"
margin run \
--suite git::https://github.com/Margin-Lab/test-suites.git//swe-minimal-test-suite \
--agent-config ~/.margin/configs/example-agent-configs/claude-code-default \
--eval ~/.margin/configs/example-eval-configs/default.toml

Run Gemini CLI with the unified config:
export GEMINI_API_KEY="<API_KEY>"
margin run \
--suite git::https://github.com/Margin-Lab/test-suites.git//swe-minimal-test-suite \
--agent-config ~/.margin/configs/example-agent-configs/gemini-cli-unified \
--eval ~/.margin/configs/example-eval-configs/default.toml

Run your first eval using your agent's OAuth credentials. Margin auto-detects your OAuth file:
margin run \
--suite git::https://github.com/Margin-Lab/test-suites.git//swe-minimal-test-suite \
--agent-config ~/.margin/configs/example-agent-configs/codex-unified \
--eval ~/.margin/configs/example-eval-configs/default.toml

Or run with a specific OAuth file:
margin run \
--suite git::https://github.com/Margin-Lab/test-suites.git//swe-minimal-test-suite \
--agent-config ~/.margin/configs/example-agent-configs/codex-unified \
--eval ~/.margin/configs/example-eval-configs/default.toml \
--auth-file-path /path/to/credentials.json

Run Claude Code on SWE-Bench Pro:
margin run \
--suite git::https://github.com/Margin-Lab/swe-suites.git//swe-bench-pro \
--agent-config ~/.margin/configs/example-agent-configs/claude-code-default \
--eval ~/.margin/configs/example-eval-configs/default.toml

Run Codex on terminal-bench-2 with the unified config:
margin run \
--suite git::https://github.com/Margin-Lab/swe-suites.git//terminal-bench-2 \
--agent-config ~/.margin/configs/example-agent-configs/codex-unified \
--eval ~/.margin/configs/example-eval-configs/default.toml \
--output ./runs/codex-terminal-bench-2

Resume a run:
margin run --resume-from ./runs/codex-terminal-bench-2

Resume with updated suite/config inputs:
margin run \
--resume-from ./runs/codex-terminal-bench-2 \
--suite ./suites/smoke \
--agent-config ./configs/my-agent-configs/claude-sonnet \
--eval ./configs/my-evals/local.toml

Scaffold a new agent config:
margin init agent-config \
--agent-config ./configs/my-agent-configs/claude-sonnet \
--definition ./configs/agent-definitions/claude-code

Official SWE eval suites are hosted at https://github.com/Margin-Lab/swe-suites.git
| Suite | Example --suite value |
|---|---|
| swe-bench-verified | git::https://github.com/Margin-Lab/swe-suites.git//swe-bench-verified |
| swe-bench-pro | git::https://github.com/Margin-Lab/swe-suites.git//swe-bench-pro |
| swe-bench-pro-curated-50 | git::https://github.com/Margin-Lab/swe-suites.git//swe-bench-pro-curated-50 |
| terminal-bench-2 | git::https://github.com/Margin-Lab/swe-suites.git//terminal-bench-2 |
More built-in evals are coming soon.
See Creating Your Own Eval for a guide to building your own eval suite.
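The `--suite` values for the official suites all follow one pattern: `git::`, the repository URL, `//`, then the suite name. A small helper can build them so the URL is written only once. `suite_ref` and `SWE_SUITES_REPO` are hypothetical names used for illustration.

```shell
#!/bin/sh
# Official suites live in one repository; the part after `//` selects the suite.
SWE_SUITES_REPO="https://github.com/Margin-Lab/swe-suites.git"

# suite_ref is a hypothetical helper: suite name in, --suite value out.
suite_ref() {
  printf 'git::%s//%s\n' "$SWE_SUITES_REPO" "$1"
}

suite_ref swe-bench-verified
suite_ref terminal-bench-2
```

The result can be passed directly on the command line, for example `--suite "$(suite_ref swe-bench-pro)"`.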
| Agent | Config examples |
|---|---|
| Claude Code | claude-code-default, claude-code-unified |
| Codex | codex-default, codex-unified |
| Gemini CLI | gemini-cli-default, gemini-cli-unified |
| OpenCode | opencode-default, opencode-unified |
| Pi | pi-default, pi-unified |
Agent configs support two modes: direct (full agent-specific control) and unified (one config format that works across all supported agents).
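The example config names in the table appear to follow an `<agent>-<suffix>` pattern, where the suffix is `default` or `unified`. Assuming that pattern holds and the examples live under `~/.margin/configs/example-agent-configs` as in the commands in this README, a path for `--agent-config` could be built like this. `agent_config_path` is a hypothetical helper name.

```shell
#!/bin/sh
# Example configs are referenced under this directory in the README commands.
EXAMPLE_CONFIG_DIR="$HOME/.margin/configs/example-agent-configs"

# agent_config_path is a hypothetical helper. It assumes the <agent>-<suffix>
# naming seen in the table, where the suffix is "default" or "unified".
agent_config_path() {
  case $2 in
    default|unified) printf '%s/%s-%s\n' "$EXAMPLE_CONFIG_DIR" "$1" "$2" ;;
    *) echo "unknown suffix: $2 (expected default or unified)" >&2; return 1 ;;
  esac
}

agent_config_path codex unified
agent_config_path claude-code default
```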
More built-in agents are coming soon. See Adding a New Agent for a guide to adding a new agent, and Configuring Your Agent for a guide to configuring an existing one.
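Because the unified configs share one format, a side-by-side comparison can be scripted as a loop over config names. The sketch below runs one suite against each named config and gives every agent its own `--output` directory. `compare_agents` is a hypothetical helper; it uses only the flags shown in this README and assumes a failed run should not stop the remaining ones.

```shell
#!/bin/sh
# Sketch: run one suite against several agents, one output directory per agent.
# compare_agents is a hypothetical helper; the flags are the ones shown earlier.
CONFIG_DIR="$HOME/.margin/configs"

compare_agents() {
  suite=$1
  shift
  for config in "$@"; do
    margin run \
      --suite "$suite" \
      --agent-config "$CONFIG_DIR/example-agent-configs/$config" \
      --eval "$CONFIG_DIR/example-eval-configs/default.toml" \
      --output "./runs/$config" \
      || echo "run failed for $config" >&2
  done
}

# Example (needs margin installed and credentials for each agent):
# compare_agents \
#   git::https://github.com/Margin-Lab/test-suites.git//swe-minimal-test-suite \
#   claude-code-unified codex-unified gemini-cli-unified
```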
cli/ CLI tool — compiles configs into run bundles
runner/runner-core/ Shared engine — state machine, worker pool, run store
runner/runner-local/ Local runner — Docker executor, filesystem persistence
agent-server/ In-container runtime — agent lifecycle, trajectory capture
configs/ Agent definitions and example configs
suites/ Local custom suites during development
docs/ Documentation
- What is Margin?
- Installation
- Running Your First Eval
- Configuring Your Agent
- Configuring Your Eval
- Creating Your Own Eval
- Adding a New Agent
This project is licensed under the GNU Affero General Public License v3.0.
For questions, contact us at hello@marginlab.ai.