AKM Eval runs real benchmark packs through authoritative upstream harnesses and normalizes the outputs.
Part of the akm ecosystem — see also akm-stash, akm-plugins, akm-registry, and akm-bench.
Trust policy:
- no synthetic or heuristic success metrics
- no silent fallback when an official harness or evaluator is unavailable
- baseline and future AKM variants both use real model providers
For normal `bin/...` usage:
- `bash`
- `docker` with a running daemon
- `uv`
- one real model-provider setup:
  - `opencode`: `config/opencode.json` plus required env such as `OPENCODE_API_KEY`
  - or `openai-compatible`: a reachable endpoint plus its required env/config

Extra pack requirements still apply:
- `beam`: local `vendor/BEAM` checkout, prepared official datasets, and judge configuration
- `terminal-bench`: `opencode` provider path only

`bun` is only required for repo development tasks.
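
The requirements above can be spot-checked before running anything. The following is a minimal sketch, not part of the repository: `missing_tools` and `missing_env` are hypothetical helper names, and `bin/doctor` remains the authoritative check.

```bash
# Hypothetical preflight helpers (illustrative only; not shipped with AKM Eval).

# Print each named tool that is not found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Print each named environment variable that is unset or empty, one per line.
missing_env() {
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}
```

For example, `missing_tools bash docker uv` prints nothing when all three tools are installed, and `missing_env OPENCODE_API_KEY` prints the variable name when it is not set. Whether the Docker daemon is actually running can be checked separately with `docker info`.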
- `bin/build-image`
- `bin/doctor --pack locomo`
- `bin/eval --pack locomo --variant baseline --config config/common/locomo-smoke.json`

Common runnable configs live under `config/common/`; see `docs/running-evals.md` for the current list.
- `config/common/locomo-smoke.json`
- `config/common/longmemeval-smoke.json`
- `config/common/beam-smoke.json`
- `config/common/swe-bench-smoke.json`
- `config/common/swe-bench-smoke-openai-compatible.json`
- `config/common/tau-bench-smoke.json`
- `config/common/terminal-bench-smoke.json`
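
Most of the smoke configs listed above share the pattern `config/common/<pack>-smoke.json`, so the `bin/eval` invocation for a pack can be derived from its name. The sketch below assumes that pattern and assumes the `baseline` variant applies to every pack, as in the `locomo` example; `smoke_config` and `smoke_command` are hypothetical helper names. It only prints the command and does not run it.

```bash
# Hypothetical helpers (illustrative only; not shipped with AKM Eval).

# Print the smoke config path for a pack, assuming the <pack>-smoke.json pattern.
smoke_config() {
  printf 'config/common/%s-smoke.json\n' "$1"
}

# Print (do not execute) the bin/eval command for a pack's smoke config.
smoke_command() {
  printf 'bin/eval --pack %s --variant baseline --config %s\n' \
    "$1" "$(smoke_config "$1")"
}
```

A dry run over several packs could then look like `for pack in locomo longmemeval beam; do smoke_command "$pack"; done`. The provider-specific file `swe-bench-smoke-openai-compatible.json` does not follow the simple pattern and would need to be passed explicitly.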

Benchmark packs:
- `locomo`
- `longmemeval`
- `beam`
- `swe-bench`
- `terminal-bench`
- `tau-bench`
- `akm-bench` remains intentionally blocked

| Pack | `opencode` | `openai-compatible` |
|---|---|---|
| `locomo` | Yes | Yes |
| `longmemeval` | Partial | Yes |
| `beam` | Yes | Yes |
| `swe-bench` | Yes | Yes |
| `tau-bench` | No | Yes |
| `terminal-bench` | Yes | No |
| `akm-bench` | No | No |
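
The provider matrix above can be encoded as a lookup so a wrapper script can refuse unsupported combinations instead of attempting them. This is a sketch under the assumption that the table is current; `provider_support` is a hypothetical function name, and `docs/benchmark-packs.md` remains the reference for pack constraints.

```bash
# Hypothetical lookup (illustrative only; values copied from the table above).

# Print Yes, Partial, or No for a pack/provider pair.
# Unknown pairs print an error to stderr and return status 1.
provider_support() {
  case "$1:$2" in
    locomo:opencode|locomo:openai-compatible) echo Yes ;;
    longmemeval:opencode) echo Partial ;;
    longmemeval:openai-compatible) echo Yes ;;
    beam:opencode|beam:openai-compatible) echo Yes ;;
    swe-bench:opencode|swe-bench:openai-compatible) echo Yes ;;
    tau-bench:opencode) echo No ;;
    tau-bench:openai-compatible) echo Yes ;;
    terminal-bench:opencode) echo Yes ;;
    terminal-bench:openai-compatible) echo No ;;
    akm-bench:opencode|akm-bench:openai-compatible) echo No ;;
    *) echo "unknown pack/provider: $1/$2" >&2; return 1 ;;
  esac
}
```

For instance, `provider_support tau-bench opencode` prints `No`, so a caller could stop early rather than start a run that cannot complete, in keeping with the no-silent-fallback policy.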

- command flow: `docs/running-evals.md`
- operator caveats and exceptions: `docs/operator-guide.md`
- pack constraints: `docs/benchmark-packs.md`
- remaining external blockers: `docs/operator-blockers.md`
- normalized result contract: `docs/result-schema.md`
- contributor guide: `docs/contributing.md`