Skill Eval Harness is a Python CLI for testing whether an Agent Skill changes observable output. It reads evals/shared-benchmark.json, emits answer-key-safe task rows, grades files under eval-runs/, and writes benchmark reports you can diff across variants.
The main question is narrow: when the same case runs with and without the skill, what changed, what passed, and did the eval itself leak the answer?
1. Describe cases in `evals/shared-benchmark.json`: prompt, split, fixture files, variants, assertions, and ablations.
2. Prepare tasks with `skill-benchmark prepare`; generation rows omit `expected_behavior` and judge rubrics unless you explicitly request them.
3. Run tasks with Pi, Claude Code, Jetty, or another runner; each run writes `output.md` and optional `metadata.json`.
4. Grade outputs with deterministic assertions: string, regex, file, JSON field, and opt-in `script` oracles.
5. Inspect the report for pass rates, flaky repeated runs, no-lift cases, saturated assertions, judge tasks, and trigger/no-trigger results.

- Variant pairing: `with_skill`, `without_skill`, optional `old_skill`, and `ablation:<id>`.
- Split discipline: `tune`, `holdout`, and `holdback` stay separate.
- Local grading: deterministic assertions run without model calls.
- Eval hygiene: leakage lint, manifest audit, trigger checks, repeated-run stats, and fixture recommendations.
- Interop: Anthropic-style exports, static HTML review pages, Pi trigger evals, and Jetty runbook-mode import/export.
- Judge plumbing: `judge`/`rubric` assertions can be exported or run through a user-supplied `--judge-cmd`; the harness does not choose a model for you.
- Quick start
- Installation
- Manifest format
- Assertions
- Run output contract
- Commands
- Jetty adapter
- Contributing
Requires Python 3.10+ and uv. Install from GitHub first:
uv tool install git+https://github.com/adewale/skill-eval-harness.git@v0.4.1
Run these from a skill repo that has evals/shared-benchmark.json:
# 1. Check manifest shape and fixture paths.
skill-benchmark validate evals/shared-benchmark.json
# 2. Emit answer-key-safe task rows for a runner.
skill-benchmark prepare evals/shared-benchmark.json \
--split tune \
--runs-per-variant 3 \
--out /tmp/tasks.jsonl
# 3. Run each task with your agent runner and save:
# eval-runs/latest/<case_id>/<variant>/run-<n>/output.md
# eval-runs/latest/<case_id>/<variant>/run-<n>/metadata.json
# 4. Grade saved outputs. Add --allow-scripts only if you trust repo-owned oracles.
skill-benchmark benchmark evals/shared-benchmark.json \
--runs eval-runs/latest \
--split tune \
--allow-scripts \
--out benchmark.json
# 5. Open a static review page.
skill-benchmark render-viewer \
--benchmark benchmark.json \
--runs eval-runs/latest \
--out review.html

Expected landmarks:
validate -> OK: <skill-name> — <case-count> cases, <ablation-count> ablations
prepare -> /tmp/tasks.jsonl, one JSON object per case/variant/run
benchmark -> benchmark.json with summary, results, and case_flags
viewer -> review.html with assertion evidence and output previews
benchmark.json records one row per case/variant/run, plus aggregate pass rates, timing/token summaries, and flags for saturated, no-lift, flaky, or with-skill-failed cases.
uv tool install git+https://github.com/adewale/skill-eval-harness.git@v0.4.1
skill-benchmark --help
skill-pi-trigger-eval --help
# One-shot without installing globally:
uvx --from git+https://github.com/adewale/skill-eval-harness.git@v0.4.1 skill-benchmark --help

The installed commands are:

| Command | What it does |
|---|---|
| `skill-benchmark` | Validate manifests, prepare tasks, grade outputs, compare variants, run judges, and import/export runner formats. |
| `skill-pi-trigger-eval` | Runs Pi without forced `--skill` and checks whether the model loads the skill from stream events. |
git clone https://github.com/adewale/skill-eval-harness.git
cd skill-eval-harness
uv tool install --editable .
skill-benchmark --help

| File | Use it for |
|---|---|
| `README.md` | Manifest shape, run layout, and command contracts. |
| `CHANGELOG.md` | Release history and unreleased repo-surface changes. |
| `CONTRIBUTING.md` | Local setup, validation commands, and eval-safety rules. |
| `LESSONS_LEARNED.md` | Design lessons from the multi-skill saturation work. |
| `docs/jetty-support-spec.md` | Jetty payload/import contract and live-token unknowns. |
| `docs/trace-aware-eval-spec.md` | Trace artifact contract, shipped v0.4.1 runner support, process/efficiency assertions, and remaining trace work. |
| `docs/repo-effectiveness-audit.md` | `good-repo` audit, score, package metadata fixes, and manual GitHub settings checklist. |
| `TODO.md` | Remaining Jetty work: streaming/concurrency, live API validation, materialized ablations, judge export, and per-variant overrides. |
| `examples/adewale-workspace/` | Adewale-specific runners for Pi smoke, trigger, ablation, and aggregate reports. |
| `tests/test_skill_benchmark.py` | Executable examples for grading, leakage lint, script assertions, judge commands, Jetty export/import, trace artifacts, and trigger detection. |
Each skill repo owns an evals/shared-benchmark.json manifest. Add a harness block so readers know which external harness/version to install.
{
"version": 1,
"skill_name": "good-pr",
"harness": {
"name": "skill-eval-harness",
"url": "https://github.com/adewale/skill-eval-harness",
"version": ">=0.4.1"
},
"skill_paths": ["skills/good-pr/SKILL.md"],
"variants": ["with_skill", "without_skill"],
"optional_variants": ["old_skill"],
"split_policy": {
"tune": "Visible cases used during iteration.",
"holdout": "Hidden cases scored only at end-of-round or merge.",
"holdback": "Examples not exposed in skill/docs/eval descriptions until after scoring."
},
"cases": [
{
"id": "pos-security-meaningless-test",
"split": "tune",
"kind": "pr-review",
"domain": "pull-request-quality",
"difficulty": "core",
"trigger_type": "explicit",
"success_goals": ["outcome", "style"],
"prompt": "Security fix PR includes `expect(result).toBeDefined()` as the only auth-bypass test...",
"files": ["fixtures/security-pr/diff.patch"],
"expected_behavior": ["Flag the weak test and require regression proof."],
"assertions": [
{"name": "detect-weak-test", "type": "contains_any", "values": ["weak", "toBeDefined"]},
{"name": "qualitative-review", "type": "judge", "rubric": ["Specific", "maintainer-friendly"]}
],
"tags": ["security", "testing"]
}
],
"ablations": [
{
"id": "no-regression-proof",
"removed_component": "regression-proof requirement",
"expected_regressions": ["Accepts weak tests that still pass without the fix"]
}
]
}

| Split | Purpose | Prompt storage |
|---|---|---|
| `tune` | Visible cases used while editing the skill and evals. | Inline prompt is fine. |
| `holdout` | Hidden cases scored at end-of-round or merge. | Prefer private `prompt_ref`. |
| `holdback` | Not shown in skill/docs/evals until after scoring; detects memorization. | Prefer private `prompt_ref` and ignored answer keys. |
`prepare` fails on missing hidden prompts unless you pass `--allow-missing-prompts` for dry-run planning.
Use the optional `files` field for fixture-backed evals. Paths are relative to the manifest's `evals/` directory, checked by `validate`, and emitted by `prepare` as absolute `input_files` for the runner.
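The path rule above can be sketched in a few lines of Python. This is an illustration only, not the harness's code: `resolve_input_files` is a hypothetical helper name, and the sketch assumes the manifest file sits directly inside `evals/`.

```python
from pathlib import Path


def resolve_input_files(manifest_path: str, files: list[str]) -> list[str]:
    """Turn manifest-relative fixture paths into absolute paths.

    Hypothetical helper: assumes the manifest lives directly in evals/,
    so its parent directory is the base for every entry in `files`.
    """
    evals_dir = Path(manifest_path).resolve().parent
    return [str((evals_dir / rel).resolve()) for rel in files]


def missing_fixtures(manifest_path: str, files: list[str]) -> list[str]:
    """Return the resolved paths that do not exist on disk."""
    return [p for p in resolve_input_files(manifest_path, files) if not Path(p).exists()]
```

A runner that receives `input_files` can then open each path directly without knowing where the manifest lives.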
Objective assertion types:
| Type | Checks |
|---|---|
| `contains` | One substring is present. |
| `contains_any` | At least one substring is present. |
| `contains_all` | Every listed substring is present. |
| `excludes_any` | No listed substring is present. |
| `regex` | Regex matches output. |
| `not_regex` | Regex does not match output. |
| `file_exists` | A file exists relative to the run directory. |
| `json_field_equals` | A JSON field equals an expected value. |
| `script` | Opt-in deterministic oracle command against the output directory. |
| `skill_invoked` | Trace/process check that the runner loaded the skill, or did not, as expected. |
| `command_ran` / `command_not_ran` | Trace/process checks over normalized command events. |
| `command_order` | Trace/process check that commands appeared in a required order. |
| `tool_count_le` / `no_repeated_command_loop` | Trace/process budgets for tool use and thrashing. |
| `total_tokens_le` / `elapsed_seconds_le` / `command_count_le` | Efficiency checks over `metrics.json`, `metadata.json`, or normalized events. |
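To make the string and regex rows concrete, here is a minimal sketch of how such checks could be evaluated against a saved `output.md`. It is not the harness implementation. The `values` key comes from the manifest example; the `pattern` key, case-sensitive matching, and `re.search` semantics are assumptions made for illustration.

```python
import re


def check_assertion(assertion: dict, output: str) -> bool:
    """Evaluate one string/regex assertion against output text.

    Assumptions: list-based types read `values`; regex types read a
    hypothetical `pattern` key; matching is case-sensitive.
    """
    kind = assertion["type"]
    values = assertion.get("values", [])
    if kind == "contains_any":
        return any(v in output for v in values)
    if kind == "contains_all":
        return all(v in output for v in values)
    if kind == "excludes_any":
        return not any(v in output for v in values)
    if kind == "regex":
        return re.search(assertion["pattern"], output) is not None
    if kind == "not_regex":
        return re.search(assertion["pattern"], output) is None
    raise ValueError(f"unsupported assertion type in this sketch: {kind}")
```

Because every branch is a pure function of the output text, the same run graded twice gives the same result, which is what lets these checks run without model calls.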
Use script when a keyword check is too weak for the property you care about. The command sees the candidate run directory, so it can inspect output.md, generated files under outputs/, or metadata. Script assertions are blocked unless you pass --allow-scripts to grade, benchmark, aggregate, or export-anthropic:
{
"name": "oracle-pass",
"type": "script",
"command": ["python3", "oracles/oracle.py", "{output_dir}"],
"pass_exit_code": 0,
"timeout_s": 30
}

`command` runs with `cwd` set to the manifest directory. `{output_dir}` is replaced with the absolute run directory. The assertion passes when the command exits with `pass_exit_code` (default `0`); stdout and stderr are stored as evidence.
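A script oracle can be any program that honors this exit-code contract. The sketch below shows one possible shape for `oracles/oracle.py`. The artifact name `outputs/report.json` and the `findings` field are hypothetical and stand in for whatever structured property your skill is supposed to produce.

```python
import json
import sys
from pathlib import Path


def check_run(output_dir: Path) -> tuple[bool, str]:
    """Pass only if a structured artifact exists and has findings.

    `outputs/report.json` and its `findings` list are hypothetical
    names chosen for this example.
    """
    report = output_dir / "outputs" / "report.json"
    if not report.is_file():
        return False, f"missing artifact: {report}"
    try:
        data = json.loads(report.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return False, f"report.json is not valid JSON: {exc}"
    findings = data.get("findings") if isinstance(data, dict) else None
    if not isinstance(findings, list) or not findings:
        return False, "report.json has no non-empty 'findings' list"
    return True, f"{len(findings)} finding(s) recorded"


def main(argv: list[str]) -> int:
    """Return 0 on pass, 1 on fail, 2 on misuse."""
    if len(argv) != 2:
        print("usage: oracle.py <output_dir>", file=sys.stderr)
        return 2
    passed, reason = check_run(Path(argv[1]))
    print(reason)  # captured by the harness as stdout evidence
    return 0 if passed else 1
```

To use it as a script, call `sys.exit(main(sys.argv))` under the usual `if __name__ == "__main__":` guard. Returning a distinct code for misuse keeps a broken invocation from being mistaken for a pass.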
Trace/process/efficiency assertions are optional and fail closed when declared evidence is missing. For example, command_not_ran cannot pass without events.json, and total_tokens_le cannot pass without token telemetry.
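"Fail closed" means missing evidence counts as a failure, never as a pass. A minimal sketch of that rule for a token budget follows. It assumes `total_tokens` is a top-level integer in `metrics.json` or `metadata.json`; the `metadata.json` example in this README uses that key, but the `metrics.json` shape is an assumption.

```python
import json
from pathlib import Path


def total_tokens_le(run_dir: Path, limit: int) -> tuple[bool, str]:
    """Pass only when token telemetry exists and is within the limit.

    Assumes a top-level integer `total_tokens` key; the metrics.json
    shape is a guess made for this sketch.
    """
    for name in ("metrics.json", "metadata.json"):
        path = run_dir / name
        if not path.is_file():
            continue
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            continue  # unreadable telemetry is treated as absent
        total = data.get("total_tokens") if isinstance(data, dict) else None
        if isinstance(total, int) and not isinstance(total, bool):
            return total <= limit, f"{name}: total_tokens={total}, limit={limit}"
    return False, "no token telemetry found; failing closed"
```

The final `return False` is the important line: a run with no telemetry cannot satisfy an efficiency budget by omission.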
Assertions can be scoped to variants when the expected process differs by arm:
{"name":"with-skill-loaded","type":"skill_invoked","expected":true,"variants":["with_skill"]}
{"name":"without-skill-clean","type":"skill_invoked","expected":false,"variants":["without_skill"]}

Use this for process checks such as `skill_invoked`; otherwise a with-skill requirement would incorrectly penalize the no-skill baseline.
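Variant scoping amounts to a filter applied before grading. The sketch below assumes that an assertion without a `variants` key applies to every variant; that default is an assumption, not documented behavior.

```python
def assertions_for_variant(assertions: list[dict], variant: str) -> list[dict]:
    """Keep assertions that apply to `variant`.

    Assumption: an assertion with no `variants` key is unscoped and
    applies to all variants.
    """
    return [a for a in assertions if "variants" not in a or variant in a["variants"]]


example = [
    {"name": "detect-weak-test", "type": "contains_any", "values": ["weak"]},
    {"name": "with-skill-loaded", "type": "skill_invoked", "expected": True,
     "variants": ["with_skill"]},
    {"name": "without-skill-clean", "type": "skill_invoked", "expected": False,
     "variants": ["without_skill"]},
]
baseline_names = [a["name"] for a in assertions_for_variant(example, "without_skill")]
```

Here `baseline_names` holds the unscoped content check plus `without-skill-clean`, so the baseline arm is never graded against `with-skill-loaded`.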
Qualitative assertion types:
| Type | Behavior |
|---|---|
| `judge` | Deferred into `judge-tasks.jsonl`; merge results with `--judge-results`. |
| `rubric` | Same deferred qualitative flow. |
Judge results are keyed by judge_task_id:
{"judge_task_id":"case::with_skill::run-1::qualitative-review","passed":true,"score":4,"evidence":"Specific evidence from output"}

The harness grades either the legacy layout:
runs/<case_id>/<variant>/output.md
runs/<case_id>/<variant>/metadata.json
or the repeated/artifact layout:
runs/<case_id>/<variant>/run-1/output.md
runs/<case_id>/<variant>/run-1/metadata.json
runs/<case_id>/<variant>/run-2/outputs/<artifact files>
Trace-aware runners may also write:
runs/<case_id>/<variant>/run-1/trace.jsonl # raw runner event stream
runs/<case_id>/<variant>/run-1/events.json # normalized events used by process assertions
runs/<case_id>/<variant>/run-1/metrics.json # tokens, commands, tool calls, elapsed time, retries
runs/<case_id>/<variant>/run-1/environment.json # runner/model/sandbox details where available
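A custom runner only has to produce this directory shape. The following sketch writes the repeated-run layout shown above; `write_run` is a hypothetical helper, not part of the harness.

```python
import json
from pathlib import Path
from typing import Optional


def write_run(runs_root: str, case_id: str, variant: str, run_number: int,
              output_text: str, metadata: Optional[dict] = None) -> Path:
    """Write output.md (and metadata.json, if given) for one run.

    Hypothetical helper that targets
    <runs_root>/<case_id>/<variant>/run-<n>/.
    """
    run_dir = Path(runs_root) / case_id / variant / f"run-{run_number}"
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "output.md").write_text(output_text, encoding="utf-8")
    if metadata is not None:
        (run_dir / "metadata.json").write_text(
            json.dumps(metadata, indent=2), encoding="utf-8"
        )
    return run_dir
```

Calling `write_run("eval-runs/latest", "pos-security-meaningless-test", "with_skill", 1, answer_text)` after each agent invocation is enough for `grade` and `benchmark` to find the output, assuming `answer_text` holds the agent's final answer.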
metadata.json is optional, but include what your runner can capture:
{
"elapsed_ms": 12345,
"input_tokens": 1000,
"output_tokens": 500,
"total_tokens": 1500,
"model": "anthropic/claude-sonnet-4"
}

skill-benchmark validate ../repo/evals/shared-benchmark.json
skill-benchmark validate ../repo/evals/shared-benchmark.json --strict-holdback
skill-benchmark validate ../repo/evals/shared-benchmark.json --strict-leakage

`validate` checks manifest shape, fixture paths, regex syntax, script oracle paths, and hidden-prompt refs. It also warns when a `contains*` assertion value appears literally in the prompt:
WARN pos-ui-no-screenshot: assertion 'detect-ui-no-screenshot' value 'screenshot' appears in prompt (leakage; case may saturate)
That warning means a weak answer can pass by echoing the task. Use --strict-leakage only after you have replaced noisy keyword checks with scoped regexes, fixture-backed checks, script oracles, or judge assertions.
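The leakage rule is simple enough to sketch. The function below flags every `contains*` value that appears verbatim in the case prompt. It is an approximation for illustration: it reads only the `values` key and matches case-sensitively, both of which are assumptions about how the real lint behaves.

```python
def find_leaked_values(case: dict) -> list[tuple[str, str]]:
    """Return (assertion name, value) pairs whose value appears in the prompt.

    Assumptions: only the `values` key is inspected and matching is
    case-sensitive.
    """
    prompt = case.get("prompt", "")
    leaks = []
    for assertion in case.get("assertions", []):
        if not assertion.get("type", "").startswith("contains"):
            continue
        for value in assertion.get("values", []):
            if value and value in prompt:
                leaks.append((assertion["name"], value))
    return leaks


case = {
    "prompt": "Security fix PR includes `expect(result).toBeDefined()` as the only auth-bypass test...",
    "assertions": [
        {"name": "detect-weak-test", "type": "contains_any", "values": ["weak", "toBeDefined"]},
        {"name": "qualitative-review", "type": "judge", "rubric": ["Specific"]},
    ],
}
leaks = find_leaked_values(case)
```

In this sketch `leaks` contains the pair `("detect-weak-test", "toBeDefined")`: an answer that merely quotes the prompt would satisfy that `contains_any` check.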
skill-benchmark prepare ../repo/evals/shared-benchmark.json --split tune --out tasks.jsonl
skill-benchmark prepare ../repo/evals/shared-benchmark.json --runs-per-variant 5 --out tasks.jsonl
skill-benchmark prepare ../repo/evals/shared-benchmark.json --include-ablations --out ablation-tasks.jsonl

Use `--include-answer-key` only for judge/debug tasks, never for generation runs.
Normalize a raw JSONL trace into events.json and metrics.json for process and efficiency assertions:
skill-benchmark import-trace \
--source codex \
--trace ../repo/eval-runs/latest/case/with_skill/run-1/trace.jsonl \
--run-dir ../repo/eval-runs/latest/case/with_skill/run-1 \
--write-metadata

`run-codex` executes prepared rows through a command compatible with `codex exec --json`, saves `trace.jsonl`, normalizes events and metrics, extracts the final answer into `output.md`, and records nonzero exits and timeouts as failed run artifacts:
skill-benchmark prepare ../repo/evals/shared-benchmark.json --split tune --out tasks.jsonl
skill-benchmark run-codex --tasks tasks.jsonl --runs ../repo/eval-runs/codex-tune

Override `--codex-cmd` for local wrappers or tests. A concrete Codex smoke command is:
skill-benchmark run-codex \
--tasks tasks.jsonl \
--runs ../repo/eval-runs/codex-trace \
--codex-cmd 'codex exec --json --sandbox read-only --skip-git-repo-check --ephemeral'

The Adewale Pi smoke example writes the trace-aware run layout directly:
python3 examples/adewale-workspace/run_pi_smoke.py \
--run-name trace-smoke \
--selection /tmp/selection.json

The runner uses an isolated temporary workspace. `with_skill` receives copied skill files and fixtures. `without_skill` receives fixtures only and runs with `--no-skills`, so grep/find/read cannot discover the source repo's `skills/*/SKILL.md` or public eval manifests.
skill-pi-trigger-eval can also write per-query trace artifacts:
skill-pi-trigger-eval ../repo/evals/shared-benchmark.json \
--eval-set trigger-queries.json \
--out trigger-results.json \
--trace-runs trigger-traces

`grade` produces per-run grading rows and can emit pending judge tasks:
skill-benchmark grade ../repo/evals/shared-benchmark.json \
--runs ../repo/eval-runs/latest \
--out grade-report.json \
--judge-tasks judge-tasks.jsonl

Write Anthropic-compatible `grading.json` files into each run directory:
skill-benchmark grade ../repo/evals/shared-benchmark.json \
--runs ../repo/eval-runs/latest \
--write-grading-files

`benchmark` aggregates graded rows into variant summaries, paired deltas, slice summaries, telemetry availability, and case flags. Add `--allow-scripts` only when you trust the repo-owned oracle commands in the manifest.
skill-benchmark benchmark ../repo/evals/shared-benchmark.json \
--runs ../repo/eval-runs/latest \
--allow-scripts \
--judge-results judge-results.jsonl \
--out benchmark.json

Run deferred `judge`/`rubric` assertions with a command that reads one grading prompt from stdin and emits JSON on stdout. The prompt contains the original case prompt, `expected_behavior`, `review_rubric`, the assertion, and the saved candidate output.
skill-benchmark judge ../repo/evals/shared-benchmark.json \
--runs ../repo/eval-runs/latest \
--judge-cmd 'claude -p' \
--transcripts judge-transcripts \
--out judge-results.jsonl
skill-benchmark benchmark ../repo/evals/shared-benchmark.json \
--runs ../repo/eval-runs/latest \
--judge-results judge-results.jsonl \
--out benchmark.json

The judge command should return JSON like `{"passed": true, "score": 4, "rationale": "..."}`. Bare or fenced JSON is accepted using `json.raw_decode` scanning rather than brace counting. `--transcripts` saves the exact prompt, stdout, stderr, and parsed result for each judge task.
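Scanning with `raw_decode` matters because judge models often wrap their verdict in prose or a code fence, and rationales can contain stray braces that defeat brace counting. A minimal sketch of the technique, using the standard library's `json.JSONDecoder.raw_decode`, is below. The function name is hypothetical and the harness's actual parser may differ in detail.

```python
import json
from typing import Optional


def extract_first_json_object(text: str) -> Optional[dict]:
    """Return the first decodable JSON object embedded in `text`.

    Tries raw_decode at each '{' until one parses, so surrounding
    prose, code fences, and unbalanced braces are tolerated.
    """
    decoder = json.JSONDecoder()
    idx = text.find("{")
    while idx != -1:
        try:
            obj, _end = decoder.raw_decode(text, idx)
            return obj
        except json.JSONDecodeError:
            idx = text.find("{", idx + 1)  # not JSON here; try the next brace
    return None
```

`raw_decode` stops at the end of the first complete value and ignores trailing text, which is why a closing fence or a sign-off sentence after the object does not cause a parse error.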
skill-benchmark audit-manifest ../repo/evals/shared-benchmark.json \
--format markdown \
--out eval-audit.md

Add `--runs ../repo/eval-runs/latest` to include saturated-case, no-lift, flaky repeated-run, and per-assertion discrimination analysis.
The audit reports:
- missing positive, negative, and adversarial eval coverage,
- missing holdout/holdback split coverage,
- missing trigger/no-trigger coverage,
- missing domain/difficulty/success-goal taxonomy for slice summaries,
- ablation-plan suggestions from major skill sections,
- saturated and no-lift cases when run data is available,
- assertions with identical with/without pass rates, and
- recommended fixture repos/files.
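The "identical with/without pass rates" item can be illustrated with a small sketch. The row shape used here (a `variant` string plus an `assertions` list of `name`/`passed` entries) is hypothetical and is not the harness's report schema.

```python
from collections import defaultdict


def non_discriminating_assertions(rows: list[dict]) -> list[str]:
    """Names of assertions whose pass rate is equal with and without the skill.

    Hypothetical row shape:
    {"variant": str, "assertions": [{"name": str, "passed": bool}]}.
    """
    counts = defaultdict(lambda: [0, 0])  # (name, variant) -> [passed, total]
    for row in rows:
        for result in row["assertions"]:
            key = (result["name"], row["variant"])
            counts[key][0] += int(result["passed"])
            counts[key][1] += 1
    flagged = []
    for name in sorted({name for name, _variant in counts}):
        with_skill = counts.get((name, "with_skill"))
        without_skill = counts.get((name, "without_skill"))
        if not with_skill or not without_skill:
            continue  # need both arms to compare
        # Cross-multiply to compare the two fractions without float error.
        if with_skill[0] * without_skill[1] == without_skill[0] * with_skill[1]:
            flagged.append(name)
    return flagged
```

An assertion flagged this way tells you nothing about the skill: it either passes for everyone (saturated) or fails for everyone, and is a candidate for replacement with a sharper check.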
skill-benchmark profile-skill ../repo/evals/shared-benchmark.json \
--format markdown \
--out skill-profile.md

`profile-skill` reports `SKILL.md` token estimates, reference-file counts/sizes, heading/module counts, and warnings for overly broad or oversized skills. These warnings are advisory; focused 2–3-module skills are often easier for agents to apply, but large skills can be justified when references are conditional.
token-overhead combines static skill profile data with paired runtime traces. It reports the static SKILL.md/reference footprint, with_skill - without_skill token deltas, objective lift, and objective lift per 1k extra total tokens when paired metrics.json files exist.
skill-benchmark token-overhead ../repo/evals/shared-benchmark.json \
--runs-subdir eval-runs/latest \
--format markdown \
--out token-overhead.md
skill-benchmark token-overhead \
../skill-a/evals/shared-benchmark.json \
../skill-b/evals/shared-benchmark.json \
--runs-subdir eval-runs/trace-smoke \
--out token-overhead.json

If a repo has no paired trace metrics, the report still includes the static footprint and shows 0 runtime pairs.
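The "lift per 1k extra total tokens" figure is a ratio of two deltas. The sketch below shows one plausible way to compute it. It assumes that objective lift is the `with_skill` pass rate minus the `without_skill` pass rate, and that the ratio is undefined when the skill adds no tokens; neither definition is confirmed by the harness.

```python
from typing import Optional


def lift_per_1k_tokens(with_pass_rate: float, without_pass_rate: float,
                       with_tokens: int, without_tokens: int) -> Optional[float]:
    """Pass-rate lift divided by extra tokens, in units of 1,000 tokens.

    Assumed definitions: lift = with - without pass rate; returns None
    when the skill adds no tokens, since the ratio is then undefined.
    """
    lift = with_pass_rate - without_pass_rate
    extra_tokens = with_tokens - without_tokens
    if extra_tokens <= 0:
        return None
    return lift / (extra_tokens / 1000)
```

For example, a pass rate that rises from 0.6 to 0.9 while total tokens grow from 1,500 to 3,500 gives a lift of 0.3 over 2,000 extra tokens, or about 0.15 per 1k tokens.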
skill-benchmark aggregate \
$(cat examples/adewale-workspace/all-manifests.txt) \
--runs-root .. \
--runs-subdir eval-runs/latest \
--out aggregate-benchmark.json

skill-benchmark export-anthropic ../repo/evals/shared-benchmark.json \
--runs ../repo/eval-runs/latest \
--out benchmark.anthropic.json

skill-benchmark compare-tasks ../repo/evals/shared-benchmark.json \
--runs ../repo/eval-runs/latest \
--out compare-tasks.jsonl \
--truth-out compare-truth.json
skill-benchmark compare-results \
--truth compare-truth.json \
--results compare-results.jsonl \
--out compare-summary.json

skill-benchmark render-viewer \
--benchmark benchmark.json \
--runs ../repo/eval-runs/latest \
--out review.html

skill-pi-trigger-eval ../repo/evals/shared-benchmark.json \
--split tune \
--runs-per-query 3 \
--out trigger-report.json

This creates a temporary `PI_CODING_AGENT_DIR`, copies the skill under `skills/`, runs Pi without forced `--skill`, and detects whether the model loaded the skill from JSON stream events.
Jetty support is optional. The harness exports runbook-mode chat-completion payloads, Jetty executes them, and import-jetty-results copies output.md, artifacts, and metadata back into the normal run layout.
# Export runbook-mode Jetty chat-completion payloads. No network calls.
skill-benchmark export-jetty ../repo/evals/shared-benchmark.json \
--split tune \
--out jetty-payloads.jsonl
# Dry-run payload loading without a token.
skill-benchmark run-jetty \
--payloads jetty-payloads.jsonl \
--dry-run \
--out jetty-dry-run.jsonl
# Live execution requires JETTY_API_TOKEN.
export JETTY_API_TOKEN=...
skill-benchmark run-jetty \
--payloads jetty-payloads.jsonl \
--out jetty-runs.jsonl
# Import Jetty artifacts into the normal run layout, then grade locally.
skill-benchmark import-jetty-results \
--manifest ../repo/evals/shared-benchmark.json \
--jetty-runs jetty-runs.jsonl \
--runs ../repo/eval-runs/jetty
skill-benchmark benchmark ../repo/evals/shared-benchmark.json \
--runs ../repo/eval-runs/jetty \
--out jetty-benchmark.json

Defaults follow Jetty docs and `jettyio/jettyio-skills`: `claude-code`, `claude-sonnet-4-6`, `model_provider=anthropic`, and `snapshot=python312-uv`. The runbook is the system message. Runtime values go in `jetty.template_variables`. Uploaded files go in `jetty.file_paths`. Use `JETTY_BASE_URL` to override `https://flows-api.jetty.io`.
Ablations are opt-in variants that simulate removing part of a skill. Add entries under manifest.ablations, then prepare with --include-ablations.
skill-benchmark prepare ../repo/evals/shared-benchmark.json \
--split tune \
--include-ablations \
--out ablation-tasks.jsonl

Ablation task variants are named `ablation:<id>`. Trigger cases are skipped for ablation tasks because trigger behavior depends on the description/frontmatter rather than the body component being ablated.
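The naming rule can be sketched as a small function over the manifest. `task_variants` is a hypothetical helper, and the sketch leaves out the trigger-case skip because the manifest fields that mark a trigger case are not specified here.

```python
def task_variants(manifest: dict, include_ablations: bool = False) -> list[str]:
    """List the variants to prepare, optionally adding ablation:<id> entries.

    Hypothetical helper; it does not model the trigger-case skip.
    """
    variants = list(manifest.get("variants", []))
    if include_ablations:
        variants += [f"ablation:{a['id']}" for a in manifest.get("ablations", [])]
    return variants


manifest = {
    "variants": ["with_skill", "without_skill"],
    "ablations": [{"id": "no-regression-proof"}],
}
all_variants = task_variants(manifest, include_ablations=True)
```

With the example manifest from this README, `all_variants` is `["with_skill", "without_skill", "ablation:no-regression-proof"]`, so each ablation is graded as one more arm alongside the paired baseline.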
- Anthropic skill-creator: use `grade --write-grading-files` and `export-anthropic` for compatible `grading.json`/`benchmark.json` shapes.
- Pi: use `examples/adewale-workspace/run_pi_smoke.py` for the Adewale multi-repo smoke workflow and `skill-pi-trigger-eval` for autonomous trigger checks.
- Other runners: use `prepare` JSONL as the import format and write results back to the run output contract.
- Jetty: use `export-jetty`, `run-jetty`, and `import-jetty-results` for REST runbook-mode execution. Live response shapes still need token-backed smoke validation before treating Jetty runs as production evidence.
See CONTRIBUTING.md for local setup, validation commands, and eval-safety rules. The short version:
python3 -m py_compile *.py examples/adewale-workspace/*.py
python3 -m unittest discover tests -v

For manifest or grading changes, add or update `tests/test_skill_benchmark.py`. For docs-only changes, still run the same commands so CLI examples stay tied to current behavior.

- Local grading does not call a model. Model execution happens outside the harness, except for explicit runner commands such as `run-jetty`.
- The harness does not decide qualitative truth by itself; it emits judge prompts, runs an opt-in judge command, and merges the returned JSON.
- Hidden prompts are not protected if you pass `--include-answer-key` to generation jobs.
- A passing answer benchmark does not prove autonomous skill loading; run `skill-pi-trigger-eval` for that.
skill-eval-harness/
├── README.md
├── CHANGELOG.md
├── CONTRIBUTING.md
├── pyproject.toml
├── skill_benchmark.py
├── run_pi_trigger_eval.py
├── .github/
│   ├── PULL_REQUEST_TEMPLATE.md
│   ├── ISSUE_TEMPLATE/
│   └── workflows/ci.yml
├── examples/
│   └── adewale-workspace/
│       ├── all-manifests.txt
│       ├── generate_shared_manifests.py
│       ├── run_pi_smoke.py
│       └── smoke_report.py
└── tests/
    └── test_skill_benchmark.py
python3 -m py_compile *.py examples/adewale-workspace/*.py
python3 -m unittest discover tests -v

The test suite covers repeated runs, artifact outputs, answer-key omission, leakage lint, script assertions, judge-command parsing, Anthropic export shape, Jetty export/import, trace normalization, variant-scoped process assertions, Codex JSONL runs, Pi trigger traces, and Pi smoke workspace isolation.
This README was written against:
- `skill_benchmark.py` CLI and assertion implementation
- `run_pi_trigger_eval.py` trigger runner
- `pyproject.toml` package metadata
- `docs/repo-effectiveness-audit.md` for the current `good-repo` audit
- `tests/test_skill_benchmark.py` behavior coverage
- `CHANGELOG.md`, `CONTRIBUTING.md`, and `.github/` contribution/CI surfaces
- `anti-slop-writing/skills/anti-slop-writing/SKILL.md` for the v0.4.1 docs cleanup and consistency pass
- the `good-readme` skill guidance from `https://www.skills.sh/adewale/good-readme/good-readme`
- the `good-repo` skill guidance from `good-repo/skills/good-repo/references/quality-checklist.md`