Skill Eval Harness

Skill Eval Harness is a Python CLI for testing whether an Agent Skill changes observable output. It reads evals/shared-benchmark.json, emits answer-key-safe task rows, grades files under eval-runs/, and writes benchmark reports you can diff across variants.

The main question is narrow: when the same case runs with and without the skill, what changed, what passed, and did the eval itself leak the answer?

Core loop

Describe cases in evals/shared-benchmark.json: prompt, split, fixture files, variants, assertions, and ablations.
Prepare tasks with skill-benchmark prepare; generation rows omit expected_behavior and judge rubrics unless you explicitly request them.
Run tasks with Pi, Claude Code, Jetty, or another runner; each run writes output.md and optional metadata.json.
Grade outputs with deterministic assertions: string, regex, file, JSON field, and opt-in script oracles.
Inspect the report for pass rates, flaky repeated runs, no-lift cases, saturated assertions, judge tasks, and trigger/no-trigger results.

What the CLI owns

Variant pairing: with_skill, without_skill, optional old_skill, and ablation:<id>.
Split discipline: tune, holdout, and holdback stay separate.
Local grading: deterministic assertions run without model calls.
Eval hygiene: leakage lint, manifest audit, trigger checks, repeated-run stats, and fixture recommendations.
Interop: Anthropic-style exports, static HTML review pages, Pi trigger evals, and Jetty runbook-mode import/export.
Judge plumbing: judge/rubric assertions can be exported or run through a user-supplied --judge-cmd; the harness does not choose a model for you.

Quick start

Requires Python 3.10+ and uv. Install from GitHub first:
uv tool install git+https://github.com/adewale/skill-eval-harness.git@v0.4.1

Run these from a skill repo that has evals/shared-benchmark.json:

# 1. Check manifest shape and fixture paths.
skill-benchmark validate evals/shared-benchmark.json

# 2. Emit answer-key-safe task rows for a runner.
skill-benchmark prepare evals/shared-benchmark.json \
  --split tune \
  --runs-per-variant 3 \
  --out /tmp/tasks.jsonl

# 3. Run each task with your agent runner and save:
# eval-runs/latest/<case_id>/<variant>/run-<n>/output.md
# eval-runs/latest/<case_id>/<variant>/run-<n>/metadata.json

# 4. Grade saved outputs. Add --allow-scripts only if you trust repo-owned oracles.
skill-benchmark benchmark evals/shared-benchmark.json \
  --runs eval-runs/latest \
  --split tune \
  --allow-scripts \
  --out benchmark.json

# 5. Open a static review page.
skill-benchmark render-viewer \
  --benchmark benchmark.json \
  --runs eval-runs/latest \
  --out review.html

Expected landmarks:

validate  -> OK: <skill-name> — <case-count> cases, <ablation-count> ablations
prepare   -> /tmp/tasks.jsonl, one JSON object per case/variant/run
benchmark -> benchmark.json with summary, results, and case_flags
viewer    -> review.html with assertion evidence and output previews

benchmark.json records one row per case/variant/run, plus aggregate pass rates, timing/token summaries, and flags for saturated, no-lift, flaky, or with-skill-failed cases.
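
The flag names above can be made concrete with a small sketch. The definitions below are assumptions chosen for illustration; this README does not state the thresholds the harness applies, and `case_flags` is a hypothetical helper, not part of the CLI.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of graded runs that passed."""
    if not results:
        raise ValueError("need at least one graded run")
    return sum(results) / len(results)


def case_flags(with_skill: list[bool], without_skill: list[bool]) -> list[str]:
    """Derive illustrative flags for one case from per-run pass/fail results.

    Assumed definitions (not taken from the harness source):
      saturated          every run passes in both variants
      no-lift            with_skill pass rate does not exceed without_skill
      flaky              repeated runs of one variant disagree
      with-skill-failed  no with_skill run passes
    """
    rate_with = pass_rate(with_skill)
    rate_without = pass_rate(without_skill)
    flags = []
    if rate_with == 1.0 and rate_without == 1.0:
        flags.append("saturated")
    if rate_with <= rate_without:
        flags.append("no-lift")
    if 0.0 < rate_with < 1.0 or 0.0 < rate_without < 1.0:
        flags.append("flaky")
    if rate_with == 0.0:
        flags.append("with-skill-failed")
    return flags
```

Under these assumed definitions a saturated case is also a no-lift case, which is why a case that passes everywhere says little about the skill.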

Installation

From GitHub

uv tool install git+https://github.com/adewale/skill-eval-harness.git@v0.4.1
skill-benchmark --help
skill-pi-trigger-eval --help

# One-shot without installing globally:
uvx --from git+https://github.com/adewale/skill-eval-harness.git@v0.4.1 skill-benchmark --help

The installed commands are:

Command	What it does
`skill-benchmark`	Validate manifests, prepare tasks, grade outputs, compare variants, run judges, and import/export runner formats.
`skill-pi-trigger-eval`	Runs Pi without forced `--skill` and checks whether the model loads the skill from stream events.

Local development

git clone https://github.com/adewale/skill-eval-harness.git
cd skill-eval-harness
uv tool install --editable .
skill-benchmark --help

Documentation map

File	Use it for
`README.md`	Manifest shape, run layout, and command contracts.
`CHANGELOG.md`	Release history and unreleased repo-surface changes.
`CONTRIBUTING.md`	Local setup, validation commands, and eval-safety rules.
`LESSONS_LEARNED.md`	Design lessons from the multi-skill saturation work.
`docs/jetty-support-spec.md`	Jetty payload/import contract and live-token unknowns.
`docs/trace-aware-eval-spec.md`	Trace artifact contract, shipped v0.4.1 runner support, process/efficiency assertions, and remaining trace work.
`docs/repo-effectiveness-audit.md`	`good-repo` audit, score, package metadata fixes, and manual GitHub settings checklist.
`TODO.md`	Remaining Jetty work: streaming/concurrency, live API validation, materialized ablations, judge export, and per-variant overrides.
`examples/adewale-workspace/`	Adewale-specific runners for Pi smoke, trigger, ablation, and aggregate reports.
`tests/test_skill_benchmark.py`	Executable examples for grading, leakage lint, script assertions, judge commands, Jetty export/import, trace artifacts, and trigger detection.

Manifest format

Each skill repo owns an evals/shared-benchmark.json manifest. Add a harness block so readers know which external harness/version to install.

{
  "version": 1,
  "skill_name": "good-pr",
  "harness": {
    "name": "skill-eval-harness",
    "url": "https://github.com/adewale/skill-eval-harness",
    "version": ">=0.4.1"
  },
  "skill_paths": ["skills/good-pr/SKILL.md"],
  "variants": ["with_skill", "without_skill"],
  "optional_variants": ["old_skill"],
  "split_policy": {
    "tune": "Visible cases used during iteration.",
    "holdout": "Hidden cases scored only at end-of-round or merge.",
    "holdback": "Examples not exposed in skill/docs/eval descriptions until after scoring."
  },
  "cases": [
    {
      "id": "pos-security-meaningless-test",
      "split": "tune",
      "kind": "pr-review",
      "domain": "pull-request-quality",
      "difficulty": "core",
      "trigger_type": "explicit",
      "success_goals": ["outcome", "style"],
      "prompt": "Security fix PR includes `expect(result).toBeDefined()` as the only auth-bypass test...",
      "files": ["fixtures/security-pr/diff.patch"],
      "expected_behavior": ["Flag the weak test and require regression proof."],
      "assertions": [
        {"name": "detect-weak-test", "type": "contains_any", "values": ["weak", "toBeDefined"]},
        {"name": "qualitative-review", "type": "judge", "rubric": ["Specific", "maintainer-friendly"]}
      ],
      "tags": ["security", "testing"]
    }
  ],
  "ablations": [
    {
      "id": "no-regression-proof",
      "removed_component": "regression-proof requirement",
      "expected_regressions": ["Accepts weak tests that still pass without the fix"]
    }
  ]
}
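
A quick way to sanity-check a manifest before handing it to the CLI is to load it and count cases per split. This is a minimal sketch that relies only on the fields shown in the example above (`skill_name`, `cases[].id`, `cases[].split`, `ablations[].id`); `summarize_manifest` is a hypothetical helper and does not replace `skill-benchmark validate`.

```python
from collections import Counter

ALLOWED_SPLITS = {"tune", "holdout", "holdback"}


def summarize_manifest(manifest: dict) -> dict:
    """Count cases per split and list the ablation variant names."""
    cases = manifest.get("cases", [])
    bad = [c.get("id", "<no id>") for c in cases if c.get("split") not in ALLOWED_SPLITS]
    if bad:
        raise ValueError(f"cases with unknown split: {bad}")
    return {
        "skill_name": manifest["skill_name"],
        "cases_per_split": dict(Counter(c["split"] for c in cases)),
        # Ablation variants are named ablation:<id>.
        "ablation_variants": [f"ablation:{a['id']}" for a in manifest.get("ablations", [])],
    }
```

To use it on a real file, pass the result of `json.load` on evals/shared-benchmark.json.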

Splits

Split	Purpose	Prompt storage
`tune`	Visible cases used while editing the skill and evals.	Inline `prompt` is fine.
`holdout`	Hidden cases scored at end-of-round or merge.	Prefer private `prompt_ref`.
`holdback`	Not shown in skill/docs/evals until after scoring; detects memorization.	Prefer private `prompt_ref` and ignored answer keys.

prepare fails on missing hidden prompts unless --allow-missing-prompts is used for dry-run planning.

Use the optional files field for fixture-backed evals. Paths are relative to the manifest's evals/ directory, validated by validate, and emitted by prepare as absolute input_files for the runner.

Assertions

Objective assertion types:

Type	Checks
`contains`	One substring is present.
`contains_any`	At least one substring is present.
`contains_all`	Every listed substring is present.
`excludes_any`	No listed substring is present.
`regex`	Regex matches output.
`not_regex`	Regex does not match output.
`file_exists`	A file exists relative to the run directory.
`json_field_equals`	A JSON field equals an expected value.
`script`	Opt-in deterministic oracle command against the output directory.
`skill_invoked`	Trace/process check that the runner loaded the skill, or did not, as expected.
`command_ran` / `command_not_ran`	Trace/process checks over normalized command events.
`command_order`	Trace/process check that commands appeared in a required order.
`tool_count_le` / `no_repeated_command_loop`	Trace/process budgets for tool use and thrashing.
`total_tokens_le` / `elapsed_seconds_le` / `command_count_le`	Efficiency checks over `metrics.json`, `metadata.json`, or normalized events.
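
The text-based types in the table can be read as simple predicates over output.md. The sketch below is an illustration, not the harness implementation: it assumes case-sensitive matching, and the `value` and `pattern` keys are assumed names (only `values` appears in the manifest example above).

```python
import re


def check_text_assertion(assertion: dict, output: str) -> bool:
    """Evaluate one text assertion against the saved output text."""
    kind = assertion["type"]
    values = assertion.get("values", [])
    if kind == "contains":
        return assertion["value"] in output  # "value" key is an assumption
    if kind == "contains_any":
        return any(v in output for v in values)
    if kind == "contains_all":
        return all(v in output for v in values)
    if kind == "excludes_any":
        return not any(v in output for v in values)
    if kind == "regex":
        return re.search(assertion["pattern"], output) is not None  # "pattern" is an assumption
    if kind == "not_regex":
        return re.search(assertion["pattern"], output) is None
    raise ValueError(f"unsupported assertion type in this sketch: {kind}")
```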

Use script when a keyword check is too weak for the property you care about. The command sees the candidate run directory, so it can inspect output.md, generated files under outputs/, or metadata. Script assertions are blocked unless you pass --allow-scripts to grade, benchmark, aggregate, or export-anthropic:

{
  "name": "oracle-pass",
  "type": "script",
  "command": ["python3", "oracles/oracle.py", "{output_dir}"],
  "pass_exit_code": 0,
  "timeout_s": 30
}

command runs with cwd set to the manifest directory. {output_dir} is replaced with the absolute run directory. The assertion passes when the command exits with pass_exit_code (default 0); stdout and stderr are stored as evidence.
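
A script oracle only needs to accept the run directory as its argument and signal the verdict through its exit code. The following is a minimal sketch of what oracles/oracle.py could look like; the property it checks (the review names toBeDefined and asks for a regression test) is a hypothetical example, not a rule shipped with the harness.

```python
import sys
from pathlib import Path


def oracle_passes(output_dir: Path) -> bool:
    """Hypothetical property: the review names the weak matcher and asks for regression proof."""
    text = (output_dir / "output.md").read_text(encoding="utf-8").lower()
    return "tobedefined" in text and "regression" in text


def main(argv: list[str]) -> int:
    """Return 0 on pass, 1 on fail, 2 on bad usage."""
    if len(argv) != 2:
        print("usage: oracle.py <output_dir>", file=sys.stderr)
        return 2
    output_dir = Path(argv[1])
    if not (output_dir / "output.md").is_file():
        print("output.md missing", file=sys.stderr)
        return 1
    ok = oracle_passes(output_dir)
    print("PASS" if ok else "FAIL")  # stdout is stored as evidence
    return 0 if ok else 1


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Keeping the check in a separate function makes the oracle easy to unit-test without spawning a process.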

Trace/process/efficiency assertions are optional and fail closed when declared evidence is missing. For example, command_not_ran cannot pass without events.json, and total_tokens_le cannot pass without token telemetry.

Assertions can be scoped to variants when the expected process differs by arm:

{"name":"with-skill-loaded","type":"skill_invoked","expected":true,"variants":["with_skill"]}
{"name":"without-skill-clean","type":"skill_invoked","expected":false,"variants":["without_skill"]}

Use this for process checks such as skill_invoked; otherwise a with-skill requirement would incorrectly penalize the no-skill baseline.
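
Variant scoping amounts to filtering the assertion list before grading each arm. The sketch below assumes that an assertion without a `variants` key applies to every variant, which matches the unscoped assertions in the manifest example but is not confirmed here as the harness rule.

```python
def assertions_for_variant(assertions: list[dict], variant: str) -> list[dict]:
    """Keep assertions that are unscoped or explicitly list this variant."""
    return [a for a in assertions if "variants" not in a or variant in a["variants"]]
```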

Qualitative assertion types:

Type	Behavior
`judge`	Deferred into `judge-tasks.jsonl`; merge results with `--judge-results`.
`rubric`	Same deferred qualitative flow.

Judge results are keyed by judge_task_id:

{"judge_task_id":"case::with_skill::run-1::qualitative-review","passed":true,"score":4,"evidence":"Specific evidence from output"}
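
When producing judge results outside the harness, it helps to build and index the IDs consistently. This sketch infers the ID shape (case, variant, run-<n>, assertion name joined by `::`) from the single example above; treat that format as an assumption and prefer the IDs found in judge-tasks.jsonl.

```python
import json


def judge_task_id(case_id: str, variant: str, run_number: int, assertion_name: str) -> str:
    """Build an ID in the shape shown in the example row (inferred, not specified)."""
    return "::".join([case_id, variant, f"run-{run_number}", assertion_name])


def index_judge_results(lines: list[str]) -> dict[str, dict]:
    """Index JSONL judge results by judge_task_id, skipping blank lines."""
    results = {}
    for line in lines:
        if line.strip():
            row = json.loads(line)
            results[row["judge_task_id"]] = row
    return results
```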

Run output contract

The harness grades either the legacy layout:

runs/<case_id>/<variant>/output.md
runs/<case_id>/<variant>/metadata.json

or the repeated/artifact layout:

runs/<case_id>/<variant>/run-1/output.md
runs/<case_id>/<variant>/run-1/metadata.json
runs/<case_id>/<variant>/run-2/outputs/<artifact files>

Trace-aware runners may also write:

runs/<case_id>/<variant>/run-1/trace.jsonl       # raw runner event stream
runs/<case_id>/<variant>/run-1/events.json       # normalized events used by process assertions
runs/<case_id>/<variant>/run-1/metrics.json      # tokens, commands, tool calls, elapsed time, retries
runs/<case_id>/<variant>/run-1/environment.json  # runner/model/sandbox details where available

metadata.json is optional, but include what your runner can capture:

{
  "elapsed_ms": 12345,
  "input_tokens": 1000,
  "output_tokens": 500,
  "total_tokens": 1500,
  "model": "anthropic/claude-sonnet-4"
}
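
A custom runner can satisfy this contract with a few lines. The sketch below writes output.md and metadata.json into the repeated-run layout; `write_run` and its parameters are hypothetical names, and only `elapsed_ms` and `model` are recorded to keep the example short.

```python
import json
from pathlib import Path


def write_run(runs_root: Path, case_id: str, variant: str, run_number: int,
              output_text: str, elapsed_ms: int, model: str = "") -> Path:
    """Write one run as <runs_root>/<case_id>/<variant>/run-<n>/ and return that directory."""
    run_dir = Path(runs_root) / case_id / variant / f"run-{run_number}"
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "output.md").write_text(output_text, encoding="utf-8")
    metadata = {"elapsed_ms": elapsed_ms}
    if model:
        metadata["model"] = model
    (run_dir / "metadata.json").write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return run_dir
```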

Commands

Validate

skill-benchmark validate ../repo/evals/shared-benchmark.json
skill-benchmark validate ../repo/evals/shared-benchmark.json --strict-holdback
skill-benchmark validate ../repo/evals/shared-benchmark.json --strict-leakage

validate checks manifest shape, fixture paths, regex syntax, script oracle paths, and hidden-prompt refs. It also warns when a contains* assertion value appears literally in the prompt:

WARN pos-ui-no-screenshot: assertion 'detect-ui-no-screenshot' value 'screenshot' appears in prompt (leakage; case may saturate)

That warning means a weak answer can pass by echoing the task. Use --strict-leakage only after you have replaced noisy keyword checks with scoped regexes, fixture-backed checks, script oracles, or judge assertions.
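
The leakage rule can be approximated in a few lines when you want to pre-check a case by hand. This is a sketch under stated assumptions: matching is a case-sensitive substring test, and values are read from the `values` list (plus a single `value` key, which is an assumed field name). The lint inside validate may differ.

```python
def leaked_values(case: dict) -> list[tuple[str, str]]:
    """Return (assertion name, value) pairs whose value appears literally in the prompt."""
    prompt = case.get("prompt", "")
    hits = []
    for assertion in case.get("assertions", []):
        if not assertion.get("type", "").startswith("contains"):
            continue
        values = list(assertion.get("values", []))
        if "value" in assertion:  # assumed key for the single-substring type
            values.append(assertion["value"])
        for value in values:
            if value in prompt:
                hits.append((assertion["name"], value))
    return hits
```

Under these assumptions, a prompt that quotes `toBeDefined` together with a contains_any value of "toBeDefined" would be reported, because an answer that merely echoes the prompt satisfies the assertion.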

Prepare tasks

skill-benchmark prepare ../repo/evals/shared-benchmark.json --split tune --out tasks.jsonl
skill-benchmark prepare ../repo/evals/shared-benchmark.json --runs-per-variant 5 --out tasks.jsonl
skill-benchmark prepare ../repo/evals/shared-benchmark.json --include-ablations --out ablation-tasks.jsonl

Use --include-answer-key only for judge/debug tasks, never for generation runs.
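
Before sending a task file to a generation runner, a small guard can confirm that no answer-key fields slipped in. The sketch below scans each JSONL row for the keys named in this README (`expected_behavior`, `rubric`, `review_rubric`); the shape of the task rows is otherwise not assumed, and `check_tasks_jsonl` is a hypothetical helper.

```python
import json

ANSWER_KEY_FIELDS = {"expected_behavior", "rubric", "review_rubric"}


def find_answer_key_fields(value, path: str = "$") -> list[str]:
    """Recursively collect JSON paths whose key is an answer-key field."""
    found = []
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}"
            if key in ANSWER_KEY_FIELDS:
                found.append(child_path)
            found.extend(find_answer_key_fields(child, child_path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            found.extend(find_answer_key_fields(child, f"{path}[{index}]"))
    return found


def check_tasks_jsonl(lines: list[str]) -> list[str]:
    """Return one message per answer-key field found in any task row."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if line.strip():
            for field_path in find_answer_key_fields(json.loads(line)):
                problems.append(f"row {number}: {field_path}")
    return problems
```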

Import runner traces

Normalize a raw JSONL trace into events.json and metrics.json for process and efficiency assertions:

skill-benchmark import-trace \
  --source codex \
  --trace ../repo/eval-runs/latest/case/with_skill/run-1/trace.jsonl \
  --run-dir ../repo/eval-runs/latest/case/with_skill/run-1 \
  --write-metadata

Run Codex JSONL tasks

run-codex executes prepared rows through a command compatible with codex exec --json, saves trace.jsonl, normalizes events/metrics, extracts the final answer into output.md, and records nonzero exits and timeouts as failed run artifacts:

skill-benchmark prepare ../repo/evals/shared-benchmark.json --split tune --out tasks.jsonl
skill-benchmark run-codex --tasks tasks.jsonl --runs ../repo/eval-runs/codex-tune

Override --codex-cmd for local wrappers or tests. A concrete Codex smoke command is:

skill-benchmark run-codex \
  --tasks tasks.jsonl \
  --runs ../repo/eval-runs/codex-trace \
  --codex-cmd 'codex exec --json --sandbox read-only --skip-git-repo-check --ephemeral'

Pi trace runners

The Adewale Pi smoke example writes the trace-aware run layout directly:

python3 examples/adewale-workspace/run_pi_smoke.py \
  --run-name trace-smoke \
  --selection /tmp/selection.json

The runner uses an isolated temporary workspace. with_skill receives copied skill files and fixtures. without_skill receives fixtures only and runs with --no-skills, so grep/find/read cannot discover the source repo's skills/*/SKILL.md or public eval manifests.

skill-pi-trigger-eval can also write per-query trace artifacts:

skill-pi-trigger-eval ../repo/evals/shared-benchmark.json \
  --eval-set trigger-queries.json \
  --out trigger-results.json \
  --trace-runs trigger-traces

Grade

grade produces per-run grading rows and can emit pending judge tasks:

skill-benchmark grade ../repo/evals/shared-benchmark.json \
  --runs ../repo/eval-runs/latest \
  --out grade-report.json \
  --judge-tasks judge-tasks.jsonl

Write Anthropic-compatible grading.json files into each run directory:

skill-benchmark grade ../repo/evals/shared-benchmark.json \
  --runs ../repo/eval-runs/latest \
  --write-grading-files

Benchmark

benchmark aggregates graded rows into variant summaries, paired deltas, slice summaries, telemetry availability, and case flags. Add --allow-scripts only when you trust the repo-owned oracle commands in the manifest.

skill-benchmark benchmark ../repo/evals/shared-benchmark.json \
  --runs ../repo/eval-runs/latest \
  --allow-scripts \
  --judge-results judge-results.jsonl \
  --out benchmark.json

Judge command backend

Run deferred judge/rubric assertions with a command that reads one grading prompt from stdin and emits JSON on stdout. The prompt contains the original case prompt, expected_behavior, review_rubric, the assertion, and the saved candidate output.

skill-benchmark judge ../repo/evals/shared-benchmark.json \
  --runs ../repo/eval-runs/latest \
  --judge-cmd 'claude -p' \
  --transcripts judge-transcripts \
  --out judge-results.jsonl

skill-benchmark benchmark ../repo/evals/shared-benchmark.json \
  --runs ../repo/eval-runs/latest \
  --judge-results judge-results.jsonl \
  --out benchmark.json

The judge command should return JSON like {"passed": true, "score": 4, "rationale": "..."}. Bare or fenced JSON is accepted using json.JSONDecoder.raw_decode scanning rather than brace counting. --transcripts saves the exact prompt, stdout, stderr, and parsed result for each judge task.
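
Scanning with raw_decode is more robust than brace counting because the decoder handles braces inside strings and stops at the end of the first valid value. The following is a minimal sketch of the technique using the standard library's `json.JSONDecoder.raw_decode`; it is an illustration, not the harness's parser, and `first_json_object` is a hypothetical name.

```python
import json


def first_json_object(text: str) -> dict:
    """Return the first JSON object embedded in text, ignoring fences and chatter."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pass  # not valid JSON at this brace; try the next one
        else:
            if isinstance(value, dict):
                return value
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found in judge output")
```

A judge that wraps its verdict in a fenced block or prefixes it with commentary still parses, as long as one complete JSON object appears in stdout.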

Audit manifest quality

skill-benchmark audit-manifest ../repo/evals/shared-benchmark.json \
  --format markdown \
  --out eval-audit.md

Add --runs ../repo/eval-runs/latest to include saturated-case, no-lift, flaky repeated-run, and per-assertion discrimination analysis.

The audit reports:

missing positive, negative, and adversarial eval coverage,
missing holdout/holdback split coverage,
missing trigger/no-trigger coverage,
missing domain/difficulty/success-goal taxonomy for slice summaries,
ablation-plan suggestions from major skill sections,
saturated and no-lift cases when run data is available,
assertions with identical with/without pass rates, and
recommended fixture repos/files.

Profile skill size and references

skill-benchmark profile-skill ../repo/evals/shared-benchmark.json \
  --format markdown \
  --out skill-profile.md

profile-skill reports SKILL.md token estimates, reference-file counts/sizes, heading/module counts, and warnings for overly broad or oversized skills. These warnings are advisory; focused 2–3-module skills are often easier for agents to apply, but large skills can be justified when references are conditional.

Token overhead

token-overhead combines static skill profile data with paired runtime traces. It reports the static SKILL.md/reference footprint, with_skill - without_skill token deltas, objective lift, and objective lift per 1k extra total tokens when paired metrics.json files exist.

skill-benchmark token-overhead ../repo/evals/shared-benchmark.json \
  --runs-subdir eval-runs/latest \
  --format markdown \
  --out token-overhead.md

skill-benchmark token-overhead \
  ../skill-a/evals/shared-benchmark.json \
  ../skill-b/evals/shared-benchmark.json \
  --runs-subdir eval-runs/trace-smoke \
  --out token-overhead.json

If a repo has no paired trace metrics, the report still includes the static footprint and shows 0 runtime pairs.

Aggregate many skills

skill-benchmark aggregate \
  $(cat examples/adewale-workspace/all-manifests.txt) \
  --runs-root .. \
  --runs-subdir eval-runs/latest \
  --out aggregate-benchmark.json

Export Anthropic-compatible benchmark

skill-benchmark export-anthropic ../repo/evals/shared-benchmark.json \
  --runs ../repo/eval-runs/latest \
  --out benchmark.anthropic.json

Blind comparison

skill-benchmark compare-tasks ../repo/evals/shared-benchmark.json \
  --runs ../repo/eval-runs/latest \
  --out compare-tasks.jsonl \
  --truth-out compare-truth.json

skill-benchmark compare-results \
  --truth compare-truth.json \
  --results compare-results.jsonl \
  --out compare-summary.json

Static viewer

skill-benchmark render-viewer \
  --benchmark benchmark.json \
  --runs ../repo/eval-runs/latest \
  --out review.html

Pi trigger evals

skill-pi-trigger-eval ../repo/evals/shared-benchmark.json \
  --split tune \
  --runs-per-query 3 \
  --out trigger-report.json

This creates a temporary PI_CODING_AGENT_DIR, copies the skill under skills/, runs Pi without forced --skill, and detects whether the model loaded the skill from JSON stream events.

Jetty adapter

Jetty support is optional. The harness exports runbook-mode chat-completion payloads, Jetty executes them, and import-jetty-results copies output.md, artifacts, and metadata back into the normal run layout.

# Export runbook-mode Jetty chat-completion payloads. No network calls.
skill-benchmark export-jetty ../repo/evals/shared-benchmark.json \
  --split tune \
  --out jetty-payloads.jsonl

# Dry-run payload loading without a token.
skill-benchmark run-jetty \
  --payloads jetty-payloads.jsonl \
  --dry-run \
  --out jetty-dry-run.jsonl

# Live execution requires JETTY_API_TOKEN.
export JETTY_API_TOKEN=...
skill-benchmark run-jetty \
  --payloads jetty-payloads.jsonl \
  --out jetty-runs.jsonl

# Import Jetty artifacts into the normal run layout, then grade locally.
skill-benchmark import-jetty-results \
  --manifest ../repo/evals/shared-benchmark.json \
  --jetty-runs jetty-runs.jsonl \
  --runs ../repo/eval-runs/jetty

skill-benchmark benchmark ../repo/evals/shared-benchmark.json \
  --runs ../repo/eval-runs/jetty \
  --out jetty-benchmark.json

Defaults follow Jetty docs and jettyio/jettyio-skills: claude-code, claude-sonnet-4-6, model_provider=anthropic, and snapshot=python312-uv. The runbook is the system message. Runtime values go in jetty.template_variables. Uploaded files go in jetty.file_paths. Use JETTY_BASE_URL to override https://flows-api.jetty.io.

Ablations

Ablations are opt-in variants that simulate removing part of a skill. Add entries under manifest.ablations, then prepare with --include-ablations.

skill-benchmark prepare ../repo/evals/shared-benchmark.json \
  --split tune \
  --include-ablations \
  --out ablation-tasks.jsonl

Ablation task variants are named ablation:<id>. Trigger cases are skipped for ablation tasks because trigger behavior depends on the description/frontmatter rather than the body component being ablated.

Compatibility notes

Anthropic skill-creator: use grade --write-grading-files and export-anthropic for compatible grading.json/benchmark.json shapes.
Pi: use examples/adewale-workspace/run_pi_smoke.py for the Adewale multi-repo smoke workflow and skill-pi-trigger-eval for autonomous trigger checks.
Other runners: read the JSONL emitted by prepare as task input and write results back in the run output contract layout.
Jetty: use export-jetty, run-jetty, and import-jetty-results for REST runbook-mode execution. Live response shapes still need token-backed smoke validation before treating Jetty runs as production evidence.

Contributing

See CONTRIBUTING.md for local setup, validation commands, and eval-safety rules. The short version:

python3 -m py_compile *.py examples/adewale-workspace/*.py
python3 -m unittest discover tests -v

For manifest or grading changes, add or update tests/test_skill_benchmark.py. For docs-only changes, still run the same commands so CLI examples stay tied to current behavior.

Non-goals

Local grading does not call a model. Model execution happens outside the harness, except for explicit runner commands such as run-jetty.
The harness does not decide qualitative truth by itself; it emits judge prompts, runs an opt-in judge command, and merges the returned JSON.
Hidden prompts are not protected if you pass --include-answer-key to generation jobs.
A passing answer benchmark does not prove autonomous skill loading; run skill-pi-trigger-eval for that.

Repository layout

skill-eval-harness/
├── README.md
├── CHANGELOG.md
├── CONTRIBUTING.md
├── pyproject.toml
├── skill_benchmark.py
├── run_pi_trigger_eval.py
├── .github/
│   ├── PULL_REQUEST_TEMPLATE.md
│   ├── ISSUE_TEMPLATE/
│   └── workflows/ci.yml
├── examples/
│   └── adewale-workspace/
│       ├── all-manifests.txt
│       ├── generate_shared_manifests.py
│       ├── run_pi_smoke.py
│       └── smoke_report.py
└── tests/
    └── test_skill_benchmark.py

Development

python3 -m py_compile *.py examples/adewale-workspace/*.py
python3 -m unittest discover tests -v

The test suite covers repeated runs, artifact outputs, answer-key omission, leakage lint, script assertions, judge-command parsing, Anthropic export shape, Jetty export/import, trace normalization, variant-scoped process assertions, Codex JSONL runs, Pi trigger traces, and Pi smoke workspace isolation.

Source checked

This README was written against:

skill_benchmark.py CLI and assertion implementation
run_pi_trigger_eval.py trigger runner
pyproject.toml package metadata
docs/repo-effectiveness-audit.md for the current good-repo audit
tests/test_skill_benchmark.py behavior coverage
CHANGELOG.md, CONTRIBUTING.md, and .github/ contribution/CI surfaces
anti-slop-writing/skills/anti-slop-writing/SKILL.md for the v0.4.1 docs cleanup and consistency pass
the good-readme skill guidance from https://www.skills.sh/adewale/good-readme/good-readme
the good-repo skill guidance from good-repo/skills/good-repo/references/quality-checklist.md
