0.5.0-alpha.1: subagents, evo wait, hook-drain fix, skill rewrites#54

Merged: alokwhitewolf merged 67 commits into main from feat/model-update on Jun 6, 2026

@alokwhitewolf

Collaborator

Cuts 0.5.0-alpha.1. Tag v0.5.0-alpha.1 already pushed at the tip; publish workflow fires on the same SHA.

Plugin subagents (new plugins/evo/agents/)

Adds an agents/ directory next to skills/. Claude Code's plugin loader picks up agents/<name>.md automatically -- no .claude-plugin/plugin.json change needed. Invokable from skills via Task(subagent_type="evo:<name>", ...).

  • evo:benchmark-reviewer -- read-only pre-flight audit of a freshly-constructed benchmark before its first evo run. Checks per-task instrumentation, eval-set leakage (direct + transitive), Goodhart gate coverage, basic plumbing. Returns {passed, findings[]}. evo:discover step 10d now gates the baseline run on passed=true.
  • evo:verifier -- migrated from skills/verifier/. Read-only audit catching the five false-progress patterns (held-out training, eval items in synthetic data, reverse-engineered verifier, instruct-model substitution, train↔verifier objective mismatch). Skill stub at the old path redirects to the agent.
  • evo:ideator -- migrated from skills/ideator/. Web-research-heavy proposal generator; one brief per invocation (failure_analysis | literature | frontier_extrapolation); appends JSONL proposals the orchestrator reconciles. Skill stub redirects.

Refs #50. The third migration -- skills/subagent/ -> agents/experiment-runner.md -- is not in this PR; that's a larger refactor touching the spawn protocol in evo:optimize.

Hook-drain: lazy-register on PostToolUse

Fixes #49, where a claude --print session whose .evo/ is created from inside the session never registers in inject/sessions/. evo direct then reports fanout=0 and queued directives are undeliverable for the rest of the session.

bin/evo-hook-drain-rs/src/main.rs already had a UserPromptSubmit lazy-register branch with the comment "Lazy-register on first prompt so the session can recover if SessionStart fired before .evo existed." --print mode never emits UserPromptSubmit, so the recovery path never fired in batch sessions. Adds a parallel PostToolUse branch (engage=false); the first tool call after evo init registers the session.

Source change is in this PR. A binary rebuild + release is needed for plugin installs to pick up the fix; tagging v0.5.0-alpha.1 should trigger the publish workflow that builds the binaries. Until that lands, host-install falls back to releases/latest (see below).

evo wait extensions

Implements #52. --for is now repeatable. New targets:

  • --for process=<pid> -- exits when the PID dies. Liveness is checked via os.kill(pid, 0). exit_code is returned as null because evo wait is rarely the PID's parent and so cannot reap its exit status; documented in the help text.
  • --for log-growth=<path> -- exits when the file stops growing for --stall-threshold seconds.
  • --for gpu-active / --for gpu-idle -- reads nvidia-smi --query-gpu=utilization.gpu; skips cleanly with {"note":"nvidia-smi unavailable"} when the tool is absent.
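
The process target's liveness probe can be sketched as follows. This is a minimal POSIX-only illustration of the os.kill(pid, 0) technique named above; the function name pid_alive is hypothetical and is not taken from the evo source.

```python
import os


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only).

    Hypothetical helper for illustration; not the evo implementation.
    """
    try:
        # Signal 0 delivers nothing; the kernel only checks existence and permissions.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but is owned by another user
    return True
```

Because the probe only answers "does this PID exist", it cannot recover an exit status, which is consistent with exit_code being null for non-child processes.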

Shared options: --timeout (duration string; default 1h, max 24h), --stall-threshold (default 2m), --poll-interval (default 5s), --json (structured exit reason with exit_reason, triggered_by, per-watch state).
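
How --stall-threshold, --poll-interval, and --timeout might interact for the log-growth target is sketched below. This is an assumption-laden illustration, not the shipped code: the function name wait_for_stall, the return strings, and the use of seconds as floats are all hypothetical.

```python
import os
import time


def _size(path: str) -> int:
    """File size in bytes, or 0 if the file does not exist yet."""
    return os.path.getsize(path) if os.path.exists(path) else 0


def wait_for_stall(path: str, stall_threshold: float,
                   poll_interval: float, timeout: float) -> str:
    """Block until the file has not grown for stall_threshold seconds.

    Hypothetical sketch: returns "stalled" or "timeout" as the exit reason.
    """
    start = time.monotonic()
    last_size = _size(path)
    last_growth = start
    while True:
        now = time.monotonic()
        if now - last_growth >= stall_threshold:
            return "stalled"
        if now - start >= timeout:
            return "timeout"
        time.sleep(poll_interval)
        size = _size(path)
        if size > last_size:
            # Growth observed: reset the stall clock.
            last_size = size
            last_growth = time.monotonic()
```

A monotonic clock is used so that wall-clock adjustments during a long wait cannot shorten or extend the stall window.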

Multiple --for flags combine; returns on the first matching condition. Existing --for experiments and --for ideators preserved.

Follow-up #53: the workspace check fires at the CLI entry, but the new process/log-growth/gpu-* targets don't need workspace context. Move that check into the experiments/ideators branches only.

Diff capture

Fixes #51. attempts/<n>/diff.patch files were always 0 bytes. The capture in cli.py ran render_git_diff(relative_target(config)) -- scoped to the configured target file -- and the typical workflow of committing all changes to the experiment branch before evo run left the working tree clean for the capture site. Any source change outside the target file was also silently dropped even when the working tree was dirty.

Replaced with capture_experiment_diff(root, exp_id, attempt, parent_ref, worktree, executor, exclude_patterns) in core.py. Range diff (parent_branch..exp_branch) repo-wide. Default DEFAULT_DIFF_EXCLUDE_PATTERNS filters safetensors, optimizer/scheduler state, tokenizer blobs, and checkpoint-*/**. Overridable per workspace via config["diff_exclude_patterns"]. Wired into the existing capture site in cli.py.
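
One way a repo-wide range diff with exclusions can be expressed is through git's exclude pathspecs. The sketch below only builds the command line; build_range_diff_cmd and the sample pattern are hypothetical, and the real capture_experiment_diff may assemble its diff differently.

```python
def build_range_diff_cmd(parent_ref: str, exp_branch: str,
                         exclude_patterns: list[str]) -> list[str]:
    """Build a `git diff` argv for parent..experiment, skipping excluded paths.

    Hypothetical helper; assumes patterns are git glob pathspecs.
    """
    # "." keeps the diff repo-wide relative to the current directory.
    cmd = ["git", "diff", f"{parent_ref}..{exp_branch}", "--", "."]
    # ":(exclude,glob)<pattern>" is git pathspec magic that drops matching paths.
    cmd += [f":(exclude,glob){pattern}" for pattern in exclude_patterns]
    return cmd


# Illustrative pattern only; the actual defaults live in DEFAULT_DIFF_EXCLUDE_PATTERNS.
example = build_range_diff_cmd("main", "exp_0001", ["**/*.safetensors"])
```

Diffing a commit range rather than the working tree is what makes the capture independent of whether the tree is clean at capture time.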

PostToolUse wait-hint hook

plugins/evo/hooks/wait_hint.sh registered in hooks.json under PostToolUse with matcher: "Bash". Inspects the command, matches against a long-running-command pattern table (python.*train, vllm serve, accelerate launch, nohup .* &, while true ... sleep, sleep ... tail), emits a one-line [evo-hint] to stdout suggesting evo wait instead of tail-loop polling. Dedup per session via marker file under $TMPDIR. Cold runtime ~35ms.
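
The matching logic can be illustrated in Python, although the actual hook is a shell script. In this sketch the regexes for the two "..." entries, the hint wording, and the already_hinted flag (standing in for the marker file) are assumptions.

```python
import re
from typing import Optional

# Pattern table from the description; ".*" for the "..." entries is an assumption.
LONG_RUNNING_PATTERNS = [
    r"python.*train",
    r"vllm serve",
    r"accelerate launch",
    r"nohup .* &",
    r"while true.*sleep",
    r"sleep.*tail",
]


def wait_hint(command: str, already_hinted: bool = False) -> Optional[str]:
    """Return a one-line hint for long-running commands, at most once per session."""
    if already_hinted:
        return None  # dedup: the session has already seen the hint
    for pattern in LONG_RUNNING_PATTERNS:
        if re.search(pattern, command):
            # Hypothetical wording; only the [evo-hint] prefix comes from the PR text.
            return "[evo-hint] long-running command detected; consider `evo wait`"
    return None
```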

Companion to the skill-text polling discipline (below). Closes the gap where the skill text describes the right pattern but agents may not read it.

Host-install: releases/latest fallback

host_install/_hook_drain.py fetched releases/download/v<version>/<asset> and gave up on 404. Pre-release versions (0.5.0-alpha.1, etc.) don't get a corresponding GitHub Release until the publish workflow fires on a stable bump. Result: every alpha install of evo printed a 404 warning and skipped staging the binary, leaving evo direct mid-run inject silently broken.

Adds a second URL to the fetch chain: releases/latest/download/<asset>. The hook-drain wire protocol is stable across minor versions, so a slightly older binary still works for the current alpha CLI -- same rationale used in our Modal image setup. Prints a NOTE when the fallback is used so users know they're running against the latest stable binary rather than a version-pinned one.
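
The resulting fetch chain amounts to an ordered list of candidate URLs tried in sequence. A minimal sketch, assuming a hypothetical helper name hook_drain_urls and the evo-hq/evo repository path:

```python
def hook_drain_urls(version: str, asset: str, repo: str = "evo-hq/evo") -> list[str]:
    """Candidate download URLs, version-pinned first, latest stable second.

    Hypothetical helper; the real fetch loop lives in host_install/_hook_drain.py.
    """
    base = f"https://github.com/{repo}/releases"
    return [
        f"{base}/download/v{version}/{asset}",  # exact version; 404s for unpublished alphas
        f"{base}/latest/download/{asset}",      # redirects to the most recent stable release
    ]


urls = hook_drain_urls("0.5.0-alpha.1", "evo-hook-drain-linux-amd64")
```

Keeping the pinned URL first means a published pre-release binary still wins over the fallback whenever it exists.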

Skill rewrites

evo:finetuning

  • f24fbc3 refactor(skill/finetuning): rewrote the ~1200-word body into ~600 words structured for skim. Description names every covered technique (SFT/LoRA/DPO/KTO/ORPO/RFT/GRPO/PPO/RLOO/RLHF) for the host's trigger-phrase loader. Reward-shape decision tree moved from prose to a four-row table. Smoke-run section added as a precondition (~10 examples, ~1 minute, must produce a loadable checkpoint and a non-zero eval). Five false-progress patterns surfaced as a top-level list pointing at references/false-progress.md.
  • 8ad213c feat(skill/finetuning): added a fourth diagnostic: stuck at the same non-zero score across 3+ experiments spanning distinct techniques (SFT, GRPO, RFT). Bottleneck is not the training method; spot-check 3 training examples and 3 eval-prompt examples for train↔verifier output format alignment (e.g. \boxed{X} vs ANSWER: X).
  • ee4a6ac feat(skill/finetuning): documents EVO_PARENT_POLICY env var as the default warm-start path for non-root experiments. evo run already populates it; the skill now explicitly tells the training script to consume it.

evo:discover and evo:optimize

  • 34d11e8 feat(skills): polling discipline section in both skills. Bans while true; do sleep N; tail; done because a crashed process stops producing output and the loop reads the empty delta as "still working." Documents a bounded for i in $(seq 1 N); do ...; done poll pattern that checks process liveness, log size delta, and GPU activity. Forward-compatible note pointing at evo wait --for (implemented in this PR for #52, "Extend evo wait to watch processes, log growth, and GPU activity (replace while true agent poll loops)").
  • 8ab82be fix(skill/discover): runner-library wrapper anti-pattern. When a benchmark wraps a runner that hides the per-item loop (inspect_evals, lm-eval-harness, evals), the wrapper must parse the runner's per-sample output and emit one log_task(item_id, score=...) per item. The aggregate-only emission case was the canonical failure of the per-task instrumentation discipline; it left the dashboard's per-task panel and the verifier's spot-check capability silently empty.

evo:infra-setup

  • 0b78240 fix(skill/infra-setup): removed disable-model-invocation: true from the frontmatter. With the flag set, the skill couldn't be loaded via the Skill tool, which evo:discover step 0 ("internalize every skill before any action") explicitly requires. Same fix that was previously applied to skills/subagent/ in 680336e.

Dashboard

c8f5e62 feat(dashboard): the experiment node drawer now has a Logs tab and a trackio link/sparkline in the Summary tab.

  • New Logs tab tails *.log and *.out files in the latest attempt dir. File picker, auto-refresh toggle (2s poll), byte-offset incremental polling so the panel appends new bytes instead of re-rendering.
  • New Metrics row in the Summary tab when an experiment's traces dir contains a .trackio_url marker (written by posttrainbench-evo's trl_trackio_callback.py). Renders a clickable HF Space link and up to 3 inline SVG sparklines (loss / lr / reward / kl / grad_norm preferred). No Plotly dependency.

Backend endpoints:

  • GET /api/node/<exp>/logs -- lists *.log and *.out files in the latest attempt with size + mtime.
  • GET /api/node/<exp>/log/<file>?tail=N&offset=M -- existing endpoint gained line-tail and byte-offset modes; sets X-Log-Size for resumable polling.
  • GET /api/node/<exp>/trackio -- reads the .trackio_url marker; optionally scrapes recent scalars via huggingface_hub + pandas + pyarrow (lazy imports; degrades to link-only when any are unavailable).
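
The byte-offset mode of the log endpoint can be pictured as "read everything past the offset the client last saw, and report the new size." The sketch below shows that idea on a plain file; read_from_offset is a hypothetical name, and resetting to 0 on truncation is an assumption rather than documented behavior.

```python
import os


def read_from_offset(path: str, offset: int = 0) -> tuple[bytes, int]:
    """Return (new_bytes, next_offset) for incremental log polling.

    Hypothetical helper; next_offset plays the role of the X-Log-Size header.
    """
    if offset > os.path.getsize(path):
        offset = 0  # assumption: file was truncated or rotated, so start over
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Derive the next offset from what was actually read, not a separate stat call.
    return data, offset + len(data)
```

A client would pass the returned offset back on its next poll, so each response carries only the bytes appended since the previous one.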

Files moved

  • plugins/evo/skills/verifier/SKILL.md and plugins/evo/skills/ideator/SKILL.md -- replaced with redirect stubs. Bodies live in plugins/evo/agents/verifier.md and plugins/evo/agents/ideator.md.
  • plugins/evo/skills/optimize/SKILL.md and plugins/evo/skills/subagent/SKILL.md -- callers updated to invoke Task(subagent_type="evo:verifier" | "evo:ideator") instead of loading the old skill bodies.

Versions

  • pyproject.toml, plugins/evo/pyproject.toml, plugins/evo/.claude-plugin/plugin.json, and every skill's evo_version field at 0.5.0-alpha.1.
  • plugins/evo/uv.lock resynced (was drifting at 0.5.0a4).
  • bin/evo-hook-drain-rs/Cargo.toml stays at 0.1.0. The Rust binary doesn't ride the CLI version.

Issues

Platform-agnostic method judgment (triggers, what matters, the rules that are
not laws) plus references: trace schema, glue contract, diagnostics, and
provider recipes (sft/tinker, rl/art, serving/vllm). Judgment lives in the
skill; guardrails are evo gates. evo ships no per-framework training adapter.

Path.home() / ".claude" hardcoded in 7 spots silently misses the plugin
cache when Claude Code is configured via CLAUDE_CONFIG_DIR (cloud
containers with a persistent volume, sandbox setups, custom configs).
The downstream symptom: ensure_hook_drain_binary skipped at install,
hook fires with exit 127 at runtime, evo direct delivery permanently
broken with no clear error.

Centralize on _claude_config_dir() that reads CLAUDE_CONFIG_DIR with
~/.claude as the fallback. Apply across _latest_cache_dir, update --force
cache wipe, and the doctor command.

Regression test in tests/unit/test_claude_code_install_paths.py covers
both the override and the no-leak-to-home invariant.
…ride

Workspaces now declare their wall-clock budget per experiment up front
at `evo init`. Stored in config.json as `per_exp_timeout`; consumed by
`evo run` as the default for `subprocess.run(..., timeout=...)`. Per-call
override is `evo run --timeout N`.

Was: evo run silently used a hardcoded 1800s default whenever the agent
forgot --timeout. SFT-style benchmarks (~40-60min) systematically got
killed at step 1 with no per-workspace tunability.

API change (breaking):
  - `evo init` now requires --per-exp-timeout <seconds>
  - core.init_workspace() gains per_exp_timeout: int | None parameter
  - `evo config get|set per-exp-timeout` mirrors the existing field pattern

Backward compat:
  - Workspaces initialized before this change (no `per_exp_timeout` in
    config.json) fall back to legacy 1800s with a one-line stderr warning
    nudging the user to run `evo config set per-exp-timeout <N>`.
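
The resulting precedence (per-call override, then workspace value, then legacy fallback) can be sketched as below. The function name resolve_timeout and the warning text are hypothetical; the 1800s constant and the per_exp_timeout key come from the commit message.

```python
import sys
from typing import Optional

LEGACY_TIMEOUT_S = 1800  # the old hardcoded default


def resolve_timeout(cli_timeout: Optional[int], config: dict) -> int:
    """Pick the effective per-experiment timeout in seconds.

    Hypothetical sketch of: --timeout override > workspace config > legacy fallback.
    """
    if cli_timeout is not None:
        return cli_timeout
    workspace = config.get("per_exp_timeout")
    if workspace is not None:
        return int(workspace)
    # Pre-change workspace: warn once on stderr and keep the legacy behavior.
    print("warning: per_exp_timeout not set; falling back to 1800s "
          "(run `evo config set per-exp-timeout <N>`)", file=sys.stderr)
    return LEGACY_TIMEOUT_S
```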

Discover skill (step 7) + cli-quick-reference both updated to include
the required flag, with guidance on picking the value.

Version bump 0.4.4 -> 0.5.0-alpha.1 (minor: required-arg-add at init).

Tests:
  - tests/unit/test_per_exp_timeout.py covers persistence, precedence
    (override > workspace > legacy fallback), value validation, argparse
    enforcement of required=True.
  - 57 existing test-suite evo-init call sites updated with the new flag
    (matched only lists containing both "init" and "--target" to avoid
    false-matching git init / git commit -m "init").
  - Full unit suite green (582 passed, 1 skipped).

Two new skills, each addressing a distinct failure mode the optimize
loop didn't catch before:

evo:verifier (SIA-style sequential)
  Audit a single experiment for cheating/validity. Two phases:
    --phase pre   : static analysis before evo run (cheap, ~30s).
                    Catches test-set leakage in training data, deliberately
                    subsetted eval, missing artifact gates, vague hypotheses,
                    resource-profile conflicts.
    --phase post  : result audit before commit. Duration sanity vs cohort,
                    artifact reality (real model files), score reproducibility
                    spot-check (re-eval 2 random tasks), gate compliance.
  Writes verdict as `evo annotation --type verification`. Exit 1 on FAIL
  so the subagent's caller can branch.

evo:ideator (Cursor parallel-then-reconcile)
  Generate experiment proposals via three briefs spawnable in parallel:
    --brief failure_analysis     : cluster discards, propose alternatives
    --brief literature           : web/arxiv scan for untried techniques
    --brief frontier_extrapolation: deeper variants of the steepest gradient
  All briefs append to .evo/run_<id>/ideator/proposals.jsonl. Orchestrator
  reconciles at next evo new (rank by uplift x confidence, dedupe vs graph).

Integration:
  subagent skill: new step 4 (pre-verifier) + step 6 (post-verifier).
                  Original "Run / Analyze / Annotate / Decide" renumbered
                  to 5/7/8/9. FAIL verdict triggers evo discard with the
                  verifier's one-line reason; PASS proceeds to evo run.
  optimize skill: new step 6b (spawn parallel ideators on stall / every
                  5 commits / failure clusters) + step 6c (reconcile
                  proposals at next brief-writing time). Non-blocking --
                  orchestrator continues its loop while ideators run.

Patterns synthesized from:
  - hexo-ai/sia       : single Feedback-Agent reading trajectories.
                        Sequential. Inspired verifier's pre/post split.
  - huggingface/ml-intern: doom-loop detector as lightweight watchdog
                        (not a separate agent). Informs follow-up work.
  - cursor 2.4 subagents : parallel-then-reconcile pipeline (lint+security+
                        tests merging into one review). Direct template
                        for ideator's three-brief fan-out.

Verifier's pre-phase static cheating check is novel relative to all three
prior systems -- evo is more autonomous (no human-in-loop per experiment)
so it needs to detect design-time gaming itself.

Version bump 0.5.0-alpha.1 -> 0.5.0-alpha.2.
bump-version.py SKILLS tuple extended to include verifier+ideator.

Tests: full unit suite green (582 passed, 1 skipped).
… proposals

evo wait used to only watch experiment-dir transitions. The
verifier+ideator skill (35c313f) said "fire ideators non-blocking, read
proposals at next round" -- but ideators take 5-10 min and the next
round is 1-2 min away, so proposals consistently miss the round they
were spawned for.

Adds --for {experiments,ideators} (default: experiments) plus --count N
(ideators only). The ideator path snapshots proposals.jsonl mtime +
line count at wait start, returns 0 when N additional lines have been
appended since baseline; partial counts surface on timeout (e.g.
"timed out with 1/3 proposals (partial)") so the orchestrator can
proceed with what's available.
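
The baseline-then-count behavior can be sketched as follows. This simplified version tracks only the line count (the commit also snapshots mtime), and the function names and (satisfied, new_lines) return shape are hypothetical.

```python
import os
import time


def count_lines(path: str) -> int:
    """Number of lines in the file, or 0 if it does not exist yet."""
    if not os.path.exists(path):
        return 0
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def wait_for_proposals(path: str, count: int, timeout: float,
                       poll_interval: float = 1.0) -> tuple[bool, int]:
    """Wait until `count` lines beyond the baseline have been appended.

    Returns (satisfied, new_lines) so a timeout can still report partial progress.
    """
    baseline = count_lines(path)  # lines present at wait start never count
    deadline = time.monotonic() + timeout
    while True:
        new = count_lines(path) - baseline
        if new >= count:
            return True, new
        if time.monotonic() >= deadline:
            return False, new  # e.g. 1 of 3 proposals arrived
        time.sleep(poll_interval)
```

Returning the partial count alongside the boolean is what lets a caller print a message such as "timed out with 1/3 proposals (partial)" and proceed.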

Optimize skill step 6b now distinguishes the trigger:
  - STALL or FAILURE CLUSTER -> block here briefly (next round depends
    on fresh ideas): `evo wait --for ideators --count 3 --timeout 900`
  - Periodic (every-N-commits) -> fire and continue; the next round
    can run on in-graph signal, the round after picks up proposals

Step 6c adds a short fallback wait for the fire-and-continue path:
`evo wait --for ideators --count 1 --timeout 120`.

Ideator skill clarifies:
  - --count counts LINES not ideator completions
  - Append-at-end discipline is for failure atomicity, not wait semantics

Tests: 6 new in TestEvoWaitForIdeators (timeout, single-arrival, --count 3
satisfies after 3rd, partial timeout w/ 1/3 surface, baseline doesn't
satisfy, baseline+new wakes). Full unit suite: 588 passed, 1 skipped.

Version bump 0.5.0-alpha.2 -> 0.5.0-alpha.3.

evo wait used to require --for {experiments|ideators} with experiments
as default. The split was a reflexive copy of an existing pattern. The
common case is "wake me when anything interesting happens" -- both
sources qualify. Make that the default; --for becomes a filter for the
narrow case (orchestrator blocking specifically for proposals after a
stall, doesn't want incidental experiment activity to wake it).

  Old: evo wait                  # only experiments
       evo wait --for ideators   # only ideators
  New: evo wait                  # both (default; one command, no flags)
       evo wait --for ideators   # filter to just ideators
       evo wait --for experiments # filter to just experiments

--count now requires --for (otherwise ambiguous which kind to count);
--count > 1 without --for returns exit 2 with a usage error pointing
at the right invocation.

Tests: 5 new in TestEvoWaitDefaultWatchesBoth covering:
  - default wakes on experiment outcome (existing behavior preserved)
  - default also wakes on ideator proposal
  - default timeout message names both watched sources
  - --count > 1 without --for is rejected
  - --count = 1 without --for is allowed (no-op pairing)
Existing TestEvoWait and TestEvoWaitForIdeators classes pass unchanged
(their explicit args still satisfy the new resolution).

Full unit suite: 593 passed, 1 skipped.
Version bump 0.5.0-alpha.3 -> 0.5.0-alpha.4.

Patterned after huggingface/ml-intern's tool surface (HF Papers + HF Hub
+ HF Docs + GitHub code/repos/files + web search): the literature brief
now scans arXiv, HF Papers, HF Hub, GitHub code search, GitHub issues,
GitHub PRs, and recent blog posts in one pass. Each source surfaces a
different kind of signal (papers = methodology, repos = runnable code,
issues = practitioner anecdotes, blogs = honest writeups).

Added structured procedure:
  1. Frame the search from project.md + evo show root
  2. Multi-source parallel scan (5-8 queries, source matrix provided)
  3. Due diligence per candidate (WebFetch the paper/repo/issue, check
     last commit, confirm claim is in headline results not appendix)
  4. Cross-check evo graph -- skip duplicates of prior experiments
  5. Rank by has_runnable_code > replication > specificity > recency
  6. Write 2-4 proposals with full provenance

Proposal schema extended with optional `sources[]` (kind+url+claim) and
`confidence_signals` (has_runnable_code, replicated_across_sources,
specificity, recency_months) -- only required for the literature brief.
Orchestrator's reconciler can use these to weight proposals.
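
A reconciler could turn the ranking order from step 5 into a lexicographic sort key over confidence_signals. The sketch below is one possible reading: the value types (booleans for the first two signals, a number for specificity, months for recency) and the id field are assumptions, since the commit names the fields but not their types.

```python
def rank_key(proposal: dict) -> tuple:
    """Sort key for: has_runnable_code > replication > specificity > recency.

    Hypothetical; assumes smaller recency_months means more recent.
    """
    s = proposal.get("confidence_signals", {})
    return (
        bool(s.get("has_runnable_code", False)),
        bool(s.get("replicated_across_sources", False)),
        float(s.get("specificity", 0.0)),
        -float(s.get("recency_months", float("inf"))),  # missing recency sorts last
    )


# Illustrative records only; not real proposals.
proposals = [
    {"id": "a", "confidence_signals": {"has_runnable_code": False,
        "replicated_across_sources": True, "specificity": 0.9, "recency_months": 1}},
    {"id": "b", "confidence_signals": {"has_runnable_code": True,
        "replicated_across_sources": False, "specificity": 0.1, "recency_months": 24}},
    {"id": "c", "confidence_signals": {"has_runnable_code": True,
        "replicated_across_sources": False, "specificity": 0.1, "recency_months": 3}},
]
ranked = sorted(proposals, key=rank_key, reverse=True)
```

With a tuple key, a later signal only breaks ties among proposals that agree on every earlier one, which matches the strict ">" ordering in the procedure.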

Added "Generic web-research mode" subsection: the literature brief
doubles as a focused web-research agent when the orchestrator passes
a specific question instead of the broad "what could we try."

Also stripped domain-specific examples (AIME / Qwen / NuminaMath /
GRPO) from ideator + verifier skills -- these are platform-agnostic
skills, examples should be generic.

Version 0.5.0-alpha.4 -> 0.5.0-alpha.5.
Full unit suite green (593 passed, 1 skipped).

Bumped from 0.5.0-alpha.1 -> a2 -> a3 -> a4 -> a5 across consecutive
commits, but no alpha has been published. Keep all work-in-progress
changes under 0.5.0-alpha.1 until we actually cut a release; future
alpha numbers should be reserved for things that go through the
release/publish flow, not every commit on the branch.

Two simplifications across the verifier/ideator/subagent skill bodies:

1. Ideator skill: drop the Python pseudocode block that implied the
   orchestrator constructs a Task() call with a multi-line f-string.
   Replace with the same prompt-template style the optimize skill
   already uses for spawning experiment subagents ("Each subagent's
   prompt MUST start with the literal sentence..."). The orchestrator
   knows only: skill name + brief arg. The spawned subagent loads the
   skill itself.

2. Subagent skill: rewrite the verifier-call snippets from
   `evo:verifier --phase pre --target <exp_id>` (which looks like a
   shell command but isn't one) to "load the evo verifier skill ...
   with args `--phase pre --target <exp_id>`" -- matching how the
   existing subagent body refers to skill invocations.

Also flag the known architectural gap on post-phase verifier:
evo run auto-commits, so the subagent can't intervene pre-commit.
The clean fix (register the verifier as a workspace gate) is still
pending; until then, use `evo prune` for committed nodes that fail
verification instead of `evo discard`.

No code changes. Skill body text only. Version stays at 0.5.0-alpha.1.
… pre-phase

Two corrections per user feedback:

1. Descriptions said "Use when the user invokes /evo:verifier" /
   "Use when the user invokes /evo:ideator", which is wrong -- these are
   internal-only skills, the user never loads them. Both descriptions
   now match the existing evo:subagent pattern: "Internal protocol for X.
   Loaded by Y. Not user-invokable."

2. Subagent flow only calls verifier pre-`evo run`. The post-phase step
   (formerly subagent step 6) is removed -- evo run auto-commits before
   the subagent can intervene, so a post-call there can't actually
   prevent commit. Steps 6/7/7b/8 renumbered (was 7/8/8b/9). The verifier
   skill body still documents both phases: pre is the primary path called
   by subagents; post is marked advisory-only, useful for ad-hoc audits
   via evo prune, until/unless we re-architect it as a workspace gate.

No version bump.

Two changes addressing a real failure: a running agent picked subagents=5
on an 8-core latency benchmark based on the optimize skill's inline
"~5-8" summary, then realized after the fact that sizing-the-round.md's
actual rule for latency-sensitive measurement is width=1 (contention
corrupts the metric).

1. optimize/SKILL.md Configuration section: drop the "Short version"
   bullet list that gave agents enough to act on without reading the
   reference doc. Replace with a hard read-the-doc instruction + the
   common ways agents skim past the right answer (latency, worktree
   "no cap", harness softeners). Require citing the binding-resource
   case BY NAME from the doc in the opening message -- if the agent
   can't cite it, it didn't read the doc.

2. sizing-the-round.md table: add an explicit row for "latency / timing
   / throughput measurement" → width 1, with reasoning about
   contention BIAS (not just noise). Reframe the CPU-light row as
   "measurement not timing-based" and add a follow-up paragraph
   explicitly listing the cases that look CPU-light but are actually
   contention-corrupted.

Affects: future /optimize sessions on feat/model-update will pick width
correctly for latency benchmarks. No code/CLI change; docs only.
…rescription

Previous commit added a "latency/timing/throughput → width 1" row and
made the optimize body say measurement at >1 width "CORRUPTS RESULTS."
That's a rule, not a prior -- evo skills should encode guardrails as
firm gates and method/hyperparameter choices as overridable priors
(feedback_skill_priors_not_rules.md). Some latency benchmarks tolerate
modest parallelism just fine; the rule wrongly prescribes width 1
across the board.

Reframe:
- sizing-the-round.md drops the dedicated latency row from the table.
  Adds a follow-up paragraph that names the case as a judgment call,
  lists the four things to weigh (effect size vs. jitter, timed-section
  share of wall-clock, solo-confirm gate viability, harness contention
  filtering), and notes width 1 is the safe default for UNKNOWN
  timing-sensitive benchmarks -- not a reflex.
- optimize/SKILL.md body drops "CORRUPTS RESULTS" framing for the
  latency case. Says the doc has judgment framing; up to the agent
  to apply it.

Net: the doc still tells the agent the right questions to ask. It
doesn't pretend to know the answer.

This run's agent decided "since this is a serial GPU workload, I'll
drive experiments directly to avoid subagent overhead" -- skipped
evo:optimize entirely and ran exp_0001 itself. That loses every piece
of the optimize loop's structure (scan-subagent cross-cutting analysis,
verifier pre/post hooks, ideator spawning on stall, frontier reconciliation,
stop-hook discipline) for no actual benefit -- the agent conflated
"subagents=1" with "don't run optimize."

Two skill edits to disambiguate:

1. discover/SKILL.md step 13: explicit "do not run experiments outside
   /evo:optimize" callout. Even for serial workloads, the answer is
   `subagents=1`, NOT bypassing the loop.

2. optimize/SKILL.md preamble: state the loop's value is the STRUCTURE
   around each experiment, not parallelism alone. Pass subagents=1 for
   serial; never skip optimize entirely.

No code change. Skill text only. Version stays at 0.5.0-alpha.1.
…examples

This run's agent (and the previous run's agent) wrote benchmark wrappers
that emit ONE aggregate `log_task("eval_total", score)` call instead of
one per AIME problem. The result: dashboard's per-task panel is useless,
verifier's reproducibility spot-check has nothing to spot-check, ideator's
failure-clustering can't cluster. The diagnostic value of the run is
permanently lost from that point.

Root cause: the inline_instrumentation reference files (.py and .js)
contained only the helper definitions, no usage example. Agents read the
API contract but had no demonstration of the per-item loop pattern, so
they took the path of least resistance and rolled up themselves. The SDK
references showed the loop but didn't strongly nudge against the
aggregate-only pattern.

Updates:

1. inline_instrumentation.py: top-of-file docstring now says per-task
   emission is the LOAD-BEARING discipline + cites which evo features
   depend on it. New USAGE EXAMPLE block at the bottom shows a real
   per-item loop (per-AIME-problem). New ANTI-PATTERN block shows what
   NOT to do (aggregate-only).

2. inline_instrumentation.js: same shape for Node -- header callout +
   USAGE + ANTI-PATTERN blocks.

3. sdk_python.py + sdk_node.js: top-of-file docstring adds the same
   load-bearing callout. New ANTI-PATTERN block at the bottom showing
   the aggregate-only failure mode.

4. discover/SKILL.md step 10c: new paragraph after the mode picker
   stating per-task emission is mandatory when the benchmark has natural
   sub-units. Cites the USAGE EXAMPLE / ANTI-PATTERN blocks in the
   reference files and names the three evo features it unblocks
   (dashboard, verifier, ideator).
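
The per-item pattern these updates describe can be sketched as below. The log_task here is a stand-in that only records calls so the example is self-contained; the real helper is defined in the reference files, and run_benchmark, the item fields, and the exact-match scoring are hypothetical.

```python
records: list[dict] = []


def log_task(item_id: str, score: float) -> None:
    """Stand-in for the evo helper; it just collects what would be emitted."""
    records.append({"item_id": item_id, "score": score})


def run_benchmark(items: list[dict], solve) -> float:
    """Score each item, emit one log_task per item, and return the mean score."""
    total = 0.0
    for item in items:
        score = 1.0 if solve(item["prompt"]) == item["answer"] else 0.0
        log_task(item["id"], score=score)  # per-item emission: the required pattern
        total += score
    # Anti-pattern (do not do this instead): log_task("eval_total", score=mean)
    return total / len(items)


items = [
    {"id": "q1", "prompt": "2+2", "answer": "4"},
    {"id": "q2", "prompt": "3+3", "answer": "6"},
]
mean = run_benchmark(items, solve=lambda prompt: "4")
```

The aggregate is still available to the caller as the return value; the point is that it is derived from per-item records rather than replacing them.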

No code change. Skill text + reference examples. Version stays 0.5.0-alpha.1.

"When to do what" listed SFT first with description "install a capability
the base lacks: format, tone, chat" -- which reads as "any base that
needs format = SFT." The verifiable-reward case was buried in "What
actually matters." No diagnostic for pivoting away from SFT when it
plateaus on a verifiable benchmark.

Reorganize the decision tree to lead with reward shape:
1. Verifiable reward (exact-match integer, unit-test, parser-decidable)
   -> GRPO/RLOO/PPO. Names AIME, GSM8K, HumanEval as the textbook cases.
   Calls out that the reward shapes correctness AND output-format
   simultaneously, so trained models pass the verifier instead of
   reasoning well but emitting unparseable answers.
2. Preference pairs -> DPO/KTO/ORPO.
3. Demonstrations only -> SFT.
4. Want SFT stability with reward signal -> RFT (rejection-sampling).

Add diagnostic: 2+ committed SFT experiments at 0.0 on a verifiable
benchmark indicates technique-class mismatch, not recipe tuning. Pivot
to verifier-as-reward RL.

Promote "RL-from-base works for strong base models" from the "Not laws"
footnote into actionable main-section guidance.
…erns

Previous SKILL.md was ~1200 words of mixed-density prose. The frontmatter
description ("Load when planning or diagnosing a train move") was passive
and didn't trigger reliably; agents that did load the body skimmed to the
first concrete-looking section and missed diagnostics further down.

Rewrite (~600 words) structured for skim:

- Description now names every covered technique (SFT/LoRA/DPO/KTO/ORPO/
  RFT/GRPO/PPO/RLOO/RLHF) and the trigger phrases, so the host loads it
  on any post-training mention.
- Reward-shape technique selection moved from prose to a four-row table.
- Smoke-run section added as a precondition: ~10 examples, ~1 minute,
  must produce a loadable checkpoint AND a non-zero eval before scaling.
  Catches dtype mismatches, tokenizer/template drift, OOM at this batch
  size, and "loss fell but artifacts dir is empty" in minutes.
- Three diagnostics grouped together: stuck-at-0 on a verifiable
  benchmark after 2+ SFT runs, base below random on a knowledge-heavy
  benchmark, delta <= 0 across several committed train moves.
- New "What never counts as progress" section lists five patterns that
  produce a score number without model improvement: training on the
  held-out set (direct + transitive), embedding eval items in
  "synthetic" data, generating training data conditioned on per-eval-
  item failure logs, submitting a checkpoint that wasn't trained, and
  training a different objective than the verifier scores.
- "Surviving session compaction" -- methodlog.md guidance for decisions
  that fall out of context on long sessions.

Cuts: standalone "Not laws" section (the load-bearing item -- "SFT
before RL is not a law" -- folded into the technique table's caption);
long "Reading a run" paragraph (folded into the references list).

Adds references/false-progress.md with examples and detection per
pattern, including transitive contamination (public instruction-tuning
datasets that contain eval-derived items) and AST-normalized matching
for paraphrased code items.

The node drawer had three tabs (Summary / Diff / Tasks) and surfaced
benchmark results, but nothing about an experiment's training-side
state: training script stdout was reachable only via filesystem, and
external metric trackers (trackio HF Spaces) showed up nowhere.

Adds two surfaces:

1. Logs tab (new). Lists *.log and *.out files in the latest attempt
   dir (and one level under logs/). File picker, auto-refresh checkbox
   (2s poll), manual refresh. Tail is byte-offset based: first load
   pulls the last 500 lines; subsequent polls pull only bytes past the
   prior X-Log-Size, so the panel appends without re-rendering.

2. Trackio link + inline sparklines in the Summary tab. After the
   Artifacts meta row, if the experiment's traces dir contains a
   .trackio_url marker (written by the training callback that pushes
   metrics to a HuggingFace Space), the drawer renders:
   - "Metrics" row with a clickable link to the Space
   - up to 3 series (loss / lr / reward / kl / grad_norm preferred)
     as 140x24 SVG polylines with last-value labels
   Scalars come from the trackio HF Dataset companion (<space>-dataset),
   read via huggingface_hub + pandas; if any of those imports fail,
   sparklines are silently omitted and the link still renders.

Backend changes (dashboard.py):
- GET /api/node/<exp>/log/<file>?tail=N&offset=M -- existing endpoint
  gains tail-by-line-count and append-by-byte-offset modes. Sets
  X-Log-Size header so clients can resume cleanly.
- GET /api/node/<exp>/logs -- lists candidate log files in the latest
  attempt with size + mtime.
- GET /api/node/<exp>/trackio -- reads .trackio_url marker, optionally
  scrapes the last 60 rows of the run's parquet for sparkline data.

Frontend changes (app.js):
- 'logs' added to drawer tab whitelist; setSidebarTab tears down the
  log poll on navigation; closeSidebar tears it down on drawer close.
- state gains logsSelected / logsOffset / logsAutoRefresh.
- loadTrackioPreview fires after Summary renders, so the panel paints
  immediately and the metrics row appears progressively.
- renderSparklines does inline SVG polylines, no Plotly dependency.

Backend optional deps (huggingface_hub, pandas, pyarrow) are imported
lazily and any ImportError degrades the response to a link-only payload.
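
A minimal sketch of the two log-read modes described above (tail by line count on first load, append by byte offset on later polls). The function name `read_log_chunk` and its return shape are hypothetical, not taken from dashboard.py; the returned size stands in for the value a server would send as X-Log-Size.

```python
import os


def read_log_chunk(path, tail=None, offset=None):
    """Return (text, size) for one poll of a log file.

    Hypothetical helper: `size` is the byte position the client should send
    back as `offset` on its next poll.
    """
    with open(path, "rb") as fh:
        if offset is not None:
            # Append mode: return only bytes written past the prior offset.
            fh.seek(0, os.SEEK_END)
            end = fh.tell()
            # If the file shrank (rotated or truncated), restart from zero.
            start = offset if offset <= end else 0
            fh.seek(start)
            data = fh.read()
            size = start + len(data)
        else:
            # Tail mode: read everything, keep the last `tail` lines.
            data = fh.read()
            size = len(data)
            if tail is not None:
                lines = data.splitlines(keepends=True)
                data = b"".join(lines[-tail:]) if tail > 0 else b""
    return data.decode("utf-8", errors="replace"), size
```

A client would call it once with `tail=500`, remember the returned size, and pass that as `offset` on each later poll so only new bytes are transferred.
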
Pre-release versions (0.5.0-alpha.1, 0.6.0-beta.2, etc.) frequently
don't have a corresponding GitHub Release tagged yet -- the release
build only fires on stable bumps. When evo CLI 0.5.0-alpha.1 calls
ensure_hook_drain_binary, it tries
  https://github.com/evo-hq/evo/releases/download/v0.5.0-alpha.1/evo-hook-drain-linux-amd64
which 404s. Mid-run inject (evo direct) silently breaks until the
binary gets staged manually.

Add a second URL to the fetch loop:
  https://github.com/evo-hq/evo/releases/latest/download/evo-hook-drain-linux-amd64
GitHub redirects /releases/latest/download/<asset> to the asset on the
most recent stable release. The binary's wire protocol is stable
across minor versions, so a slightly older binary still works for the
current alpha CLI -- same rationale we use in the Modal image setup
(modal_app.py fetches /releases/latest/ at build time).

On success via the fallback, prints a NOTE clarifying which path was
used so the user understands they're running a slightly older binary
against a newer CLI -- still wire-compatible, just not identical.
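
The fetch order above can be sketched as follows. The helper names `hook_drain_urls` and `fetch_first` are hypothetical and the `fetch` callable is injected, so this only illustrates the ordering and the fallback flag, not the actual body of ensure_hook_drain_binary.

```python
def hook_drain_urls(version, asset, repo="evo-hq/evo"):
    """Candidate URLs in priority order: exact version tag, then latest stable."""
    base = f"https://github.com/{repo}/releases"
    return [
        f"{base}/download/v{version}/{asset}",
        f"{base}/latest/download/{asset}",
    ]


def fetch_first(urls, fetch):
    """Try each URL in order; return (url, payload, used_fallback).

    `fetch` is any callable that returns the payload or raises on failure
    (for example on a 404). It is an assumption of this sketch.
    """
    errors = []
    for index, url in enumerate(urls):
        try:
            return url, fetch(url), index > 0
        except Exception as exc:  # sketch: treat any failure as "try next URL"
            errors.append((url, exc))
    raise RuntimeError(f"all candidate URLs failed: {errors}")
```

When `used_fallback` comes back true, the caller can print the NOTE about running an older, wire-compatible binary.
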
…n it

Add the first plugin-shipped subagent to evo: benchmark-reviewer.
Lives at plugins/evo/agents/benchmark-reviewer.md, invokable via
Task(subagent_type="evo:benchmark-reviewer", ...). Read-only audit of
a freshly-constructed benchmark harness before its first `evo run`.

Checklist the subagent enforces:

1. Per-task instrumentation. Aggregate-only emission ({"score": X}
   without per-item traces) is the canonical bug -- breaks the
   dashboard's per-task panel, the verifier's reproducibility check,
   and the ideator's failure clustering. Common shape: a wrapper
   script around a runner library (inspect_evals, evals,
   lm-eval-harness) writes only the runner's aggregate, never
   converting per-sample data into log_task calls. Block-severity.
2. Eval-set / held-out leakage in training data sources, including
   transitive sources (public instruction-tuning datasets that
   contain eval-derived items).
3. Goodhart gates: constructed benchmarks must have a real
   pass/fail gate; bare benchmark-rerun gates are decorative.
4. Basic plumbing: result.json gets written, traces hit
   $EVO_TRACES_DIR, errors crash rather than write zero scores.
5. Determinism note (not a block).

Output is structured JSON {passed, findings[]} for programmatic gating.

Wire into evo:discover at a new step 10d (between instrumentation +
the cheap validation run). The orchestrator MUST invoke the reviewer
before `evo run` and address every block-severity finding before
proceeding. Renumber the trailing subsections (10e cheap-validation,
10f commit) accordingly.

This is the first plugin agent; plugins/evo/agents/ is a new directory.
Claude Code's plugin loader picks up agents/<name>.md alongside
skills/<name>/SKILL.md without changes to .claude-plugin/plugin.json.
The hook registers a session into .evo/run_*/inject/sessions/ on
SessionStart (eager) and on UserPromptSubmit (lazy recovery when
SessionStart fired before .evo existed, e.g. the agent runs `evo init`
mid-session).

`claude --print` (single-turn batch mode) never fires UserPromptSubmit,
so the UserPromptSubmit recovery path misses any batch session whose
.evo/ is created after SessionStart. `evo direct <text>` then reports
fanout=0 and queued directives are undeliverable.

Add a PostToolUse branch alongside the existing UserPromptSubmit branch.
PostToolUse fires on every tool call, so the first tool the agent runs
after `evo init` registers the session. engage=false: PostToolUse is a
tool callback, not an orchestrator engagement signal.

Fixes #49
…scores

Adds a fourth diagnostic for the case where 3+ committed experiments across
structurally distinct techniques (SFT, GRPO, RFT) all land at the same
non-zero score. Names the train-verifier objective mismatch as the most
common cause and prescribes a spot-check of training vs eval examples
before trying a fourth technique variant.
Unbounded `while true; do sleep N; tail file; done` loops block indefinitely
when the underlying process crashes -- the tail keeps reading a dead file
and the absence of log growth reads as "still working." Adds a bounded poll
pattern to discover and optimize skills that checks three independent
signals (process alive, log growth delta, GPU activity) per iteration with
a hard loop-bound timeout. Forward-compatible note points at the planned
`evo wait` CLI that will replace the loop.
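
A minimal sketch of a bounded poll along those lines, assuming a POSIX host. The function names, return labels, and the `stall_polls` threshold are hypothetical; the GPU check is an injected callable because the skill's actual probe is not shown here.

```python
import os
import time


def process_alive(pid):
    """POSIX liveness probe: signal 0 checks existence without killing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def bounded_poll(pid, log_path, gpu_active=lambda: True, interval=30.0,
                 timeout=3600.0, stall_polls=3,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll three independent signals per iteration, with a hard deadline."""
    deadline = clock() + timeout
    last_size = -1
    stalled = 0
    while clock() < deadline:
        # Signal 1: is the process still there?
        if not process_alive(pid):
            return "process-exited"
        # Signal 2: did the log grow since the previous iteration?
        size = os.path.getsize(log_path) if os.path.exists(log_path) else 0
        stalled = 0 if size != last_size else stalled + 1
        last_size = size
        # Signal 3: only call it a stall if the GPU is idle as well.
        if stalled >= stall_polls and not gpu_active():
            return "stalled"
        sleep(interval)
    return "timeout"
```

Every exit path is explicit, so a crashed process or a silent log is reported instead of being read as "still working".
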
When the benchmark wraps a runner library (inspect_evals, lm-eval-harness,
evals), the per-item loop is hidden inside the runner and the natural
wrapper shape collapses to a single aggregate `{"score": X}`. The runner
typically already writes per-sample data; the wrapper just isn't
forwarding it. Without per-item forwarding, the dashboard's Tasks tab is
empty, the verifier can't spot-check, and there's no input for
RL-on-failures or curriculum strategies. Reinforces the existing per-task
emission rule with the specific runner-wrapper case and points at the
reference files with worked examples.
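
As an illustration of per-item forwarding, the sketch below turns a runner's per-sample records into one trace file per task plus a `tasks` entry in result.json. It does not use evo's real log_task/write_result helpers; `forward_samples`, the sample fields (`id`, `score`), and the dict shape of `tasks` are assumptions made for this example.

```python
import json
from pathlib import Path


def forward_samples(samples, traces_dir, result_path):
    """Write one task_<id>.json trace per sample and an aggregated result.

    Assumes each sample is a dict with a filesystem-safe "id" and a numeric
    "score". Raises instead of writing a zero score when nothing was produced.
    """
    if not samples:
        raise ValueError("runner produced no per-sample data")
    traces = Path(traces_dir)
    traces.mkdir(parents=True, exist_ok=True)
    tasks = {}
    for sample in samples:
        task_id = str(sample["id"])
        (traces / f"task_{task_id}.json").write_text(json.dumps(sample))
        tasks[task_id] = float(sample["score"])
    score = sum(tasks.values()) / len(tasks)
    # The aggregate and the per-task scores land in the same result file.
    Path(result_path).write_text(json.dumps({"score": score, "tasks": tasks}))
    return score
```
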
For any non-root experiment, the training script must warm-start from the
parent's checkpoint (exposed via EVO_PARENT_POLICY) rather than re-training
from base. Re-training from base each generation burns budget on
duplicated work and prevents the experiment tree from accumulating
capability across generations. Adds the concrete pattern (branch on
os.path.exists to handle root, where the value is a base model id) to the
skill and expands references/glue.md with the same contract.
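
A minimal sketch of that branch. EVO_PARENT_POLICY and the os.path.exists check come from the text above; the function name `resolve_init_policy` and the returned labels are hypothetical.

```python
import os


def resolve_init_policy(env=None):
    """Return (model_ref, mode) for the training script's starting weights.

    If EVO_PARENT_POLICY points at an existing path, treat it as the parent's
    checkpoint and warm-start from it. Otherwise (the root experiment) treat
    the value as a base model id.
    """
    env = os.environ if env is None else env
    parent = env.get("EVO_PARENT_POLICY")
    if not parent:
        raise RuntimeError("EVO_PARENT_POLICY is not set")
    if os.path.exists(parent):
        return parent, "warm-start"
    return parent, "base"
```
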
…oad it

Same fix as 680336e for skills/subagent. With disable-model-invocation set,
the Skill tool cannot load infra-setup, but the discover skill's STEP 0
"internalize every skill" instruction requires it, and discover/optimize
both reference infra-setup's provider-matrix.md for remote backend setup.
Removing the field lets the skill be invoked normally; the
"Non-user-invocable" framing in the description still signals intent.
`attempts/<n>/diff.patch` was scoped to `relative_target(config)` -- the
single configured target path. Any source edit outside that path
(training configs, prompt files, helper scripts) dropped from the
patch, so the dashboard rendered the file as empty even when the
experiment branch had real changes.

`capture_experiment_diff` runs `git diff <parent_ref>` against the
worktree with no path scope, covering both the pre-commit workflow
(agent commits to the experiment branch then invokes `evo run`) and
the dirty-worktree workflow (agent leaves edits for
`maybe_commit_worktree`).

Binary / large artifacts (safetensors, optimizer state, tokenizer
blobs, checkpoint dirs) are excluded via pathspec to keep diff.patch
dashboard-renderable. Override per workspace via
`config["diff_exclude_patterns"]`.
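
One way to express an unscoped diff with exclusions is git's `:(exclude)` pathspec magic. The sketch below only builds the argument list; `diff_command` and the default pattern list are hypothetical, and only the `diff_exclude_patterns` config key is taken from the text above.

```python
DEFAULT_EXCLUDES = ["*.safetensors", "*.bin", "*.pt"]  # illustrative defaults


def diff_command(parent_ref, config=None):
    """Build `git diff <parent_ref>` over the whole worktree minus excludes."""
    config = config or {}
    patterns = config.get("diff_exclude_patterns", DEFAULT_EXCLUDES)
    # "." is the positive pathspec; each exclude pattern is subtracted from it.
    return ["git", "diff", parent_ref, "--", "."] + [
        f":(exclude){pattern}" for pattern in patterns
    ]
```

The resulting list can be passed to `subprocess.run(..., cwd=worktree)` to produce the patch text.
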
…mands

The polling-discipline skill text added in 34d11e8 documents the bounded-
poll pattern but only fires if the agent reads the skill section. Agents
that skip straight to tail-loop polling never get the nudge.

Adds plugins/evo/hooks/wait_hint.sh, a small bash script wired as a
second PostToolUse entry in hooks.json (alongside the existing
evo-hook-drain handler). On every Bash tool call it greps the command
string against patterns that indicate a long-running workload:

  - python.*train, accelerate launch, vllm serve, trl vllm-serve
  - python eval / evaluate.py / run_eval.py
  - nohup ... &
  - sleep <N> ... tail and while true ... sleep (the anti-pattern)

On match it prints a one-line [evo-hint] to stdout (which Claude Code
routes into the next-turn context as a non-blocking system reminder)
pointing at `evo wait --for process=<pid> --for log-growth=<path>
--for gpu-active --timeout 60m --json`.

Dedup is a touch-file under $TMPDIR keyed by session_id, so the hint
fires once per session even if the agent restarts the same command.
Non-matching commands, non-Bash tool calls, and empty payloads fast-
exit 0. Runtime is ~35ms.
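
The actual hook is a bash script; the Python sketch below only mirrors its decision logic (match, dedup, fast-exit) for illustration. The regexes approximate the patterns listed above, and the function name, marker file name, and `tmpdir` parameter are assumptions.

```python
import re
import tempfile
from pathlib import Path

# Approximations of the listed patterns, not the script's exact expressions.
LONG_RUNNING = [re.compile(p) for p in (
    r"python.*train", r"accelerate launch", r"vllm serve", r"trl vllm-serve",
    r"python.*eval|evaluate\.py|run_eval\.py",
    r"nohup\b.*&", r"sleep\s+\d+.*tail", r"while true.*sleep",
)]

HINT = ("[evo-hint] long-running command detected; consider "
        "`evo wait --for process=<pid> --for log-growth=<path> "
        "--for gpu-active --timeout 60m --json`")


def wait_hint(tool_name, command, session_id, tmpdir=None):
    """Return the hint once per session for matching Bash commands, else None."""
    if tool_name != "Bash" or not command:
        return None  # non-Bash tool calls and empty payloads exit fast
    if not any(p.search(command) for p in LONG_RUNNING):
        return None
    marker = Path(tmpdir or tempfile.gettempdir()) / f"evo-wait-hint-{session_id}"
    if marker.exists():
        return None  # already hinted in this session
    marker.touch()
    return HINT
```
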

Pairs with the polling-discipline section in plugins/evo/skills/
{discover,optimize}/SKILL.md and the planned `evo wait` extensions.
The verifier audits one experiment for design-time cheating (pre-phase)
or result-time validity (post-phase). It is read-only, returns a single
JSON artifact, and the orchestrator gates compute/commit on its verdict
-- a subagent fits that shape better than a skill that gets loaded
into the orchestrator's context.

plugins/evo/agents/verifier.md is the new subagent. Same audit checks
as the old skill (test-set leakage, benchmark subsetting, gate
coverage, hypothesis specificity, resource conflicts for pre; duration
sanity, artifact reality, score reproducibility, gate compliance for
post) plus an explicit pass over the five false-progress patterns
(test-set ingestion, eval items in synthetic data, reverse-engineered
verifier, instruct-model substitution, training-objective mismatch).
Output shape mirrors benchmark-reviewer: `{passed, findings: [{category,
severity, what, where, fix}]}` with severity `block` flipping `passed`
to false; the verdict is also persisted as an `evo annotation` on the
target experiment so the dashboard and later runs see it.
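
The verdict rule can be sketched in a few lines. The field names follow the shape quoted above; `build_verdict` and the non-block severity value used in the usage note are hypothetical.

```python
def build_verdict(findings):
    """Assemble {passed, findings}: any block-severity finding fails the audit."""
    passed = not any(f.get("severity") == "block" for f in findings)
    return {"passed": passed, "findings": list(findings)}
```

For example, a single finding such as `{"category": "test-set leakage", "severity": "block", "what": "...", "where": "...", "fix": "..."}` yields `passed == False`, while a list containing only lower-severity findings (say, `"warn"`) still passes.
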

plugins/evo/skills/verifier/SKILL.md is now a stub that points callers
to the subagent invocation. Kept rather than deleted so any callsite
that resolves `evo:verifier` through the skill loader gets a clear
pointer instead of a not-found.

plugins/evo/skills/subagent/SKILL.md step 4 now spawns
`Task(subagent_type="evo:verifier", ...)` instead of loading the
verifier skill. No other callers reference the verifier skill
directly.

Refs #50.
The ideator generates ranked experiment proposals via cross-graph
analysis and web research (arXiv, HF Hub, GitHub, blogs). The research
work is high-token and not directly actionable -- pulling it out of
the orchestrator's context into a subagent isolates research notes
from planning state, and the orchestrator only sees the structured
proposal output.

plugins/evo/agents/ideator.md is the new subagent. One brief per
invocation (`failure_analysis`, `literature`, `frontier_extrapolation`)
matching the old skill's parallel-spawn model -- the orchestrator
fires three Task calls in parallel, each subagent runs one brief in
its own context, all three append to the shared
`.evo/run_<id>/ideator/proposals.jsonl`. Tools include `WebFetch` +
`WebSearch` for the literature brief; the other two briefs are pure
local analysis. Proposal schema is the same as the old skill (renamed
`mechanism` -> `rationale`, `sources` -> `references_consulted`, added
`title` and `technique` for orchestrator-side ranking).

plugins/evo/skills/ideator/SKILL.md is now a stub that points callers
to the subagent invocation.

plugins/evo/skills/optimize/SKILL.md step 6b now spawns three
`Task(subagent_type="evo:ideator", ...)` calls instead of telling each
subagent to load the ideator skill. `evo wait --for ideators` still
works as the synchronization primitive; the file-based reconciliation
contract is unchanged.

Refs #50.
The evo wait extension (#52) broke 8 unit tests in three clusters,
which kept the v0.5.0-alpha.1 and v0.5.0-alpha.2 publish workflows
failing at preflight. Fix all three:

1. test_timeout_capped_at_3600 expected the old 1-hour cap. The cap
   was raised to 24h alongside the extension so long external waits
   (a 10-hour training run) are expressible. Rename the test to
   test_timeout_capped_at_24h and expect 86400. Also assert
   _WAIT_TIMEOUT_CAP == 24 * 3600 directly so a future cap change
   trips the test deliberately.

2. test_count_without_for_is_rejected asserted the substring
   "--count > 1 requires --for". The new error is "--count > 1
   requires exactly one --for experiments|ideators (otherwise
   ambiguous which kind to count)". Update the substring to
   "--count > 1 requires exactly one --for" (present in the new
   message; the rest of the wording is informational).

3. TestEvoWaitForIdeators._run_wait passed wait_for="ideators"
   (a string). --for is now action="append" in argparse, so the
   namespace value is a list. Given a bare string, the new parser
   iterates it character-by-character, yielding 'i', 'd', 'e', ... and
   fails with "unknown form". Pass a list. A comment in the helper explains
   why.
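
The third failure is easy to reproduce with argparse alone. In the sketch below, `dest="wait_for"` matches the attribute name used by the test helper; `build_parser` and the defensive `normalize_conditions` guard are hypothetical and are not part of the fix described above (which was to pass a list).

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="evo wait")
    # action="append" collects repeated --for flags into a list.
    parser.add_argument("--for", dest="wait_for", action="append", default=None)
    return parser


def normalize_conditions(wait_for):
    """Accept None, a bare string, or a list; always return a list."""
    if wait_for is None:
        return []
    if isinstance(wait_for, str):
        return [wait_for]  # avoid iterating the string character by character
    return list(wait_for)
```

A namespace built by hand with `wait_for="ideators"` behaves differently from a parsed one: iterating the string yields single characters, while the parsed value is `["ideators"]`.
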

Plus one assertion-text update in test_timeout_when_no_proposals_arrive:
the timeout summary changed from "no new ideator proposal" to "no
ideators activity"; loosened assertIn to "ideators" which matches
both old and new wording.

All 18 tests pass locally (pytest tests/unit/test_evo_wait.py).
…workspace check

Fixes #53.

`cmd_wait` called `repo_root()` and the `.evo/` existence check at the
top of the function, before any `--for` parsing. That bailed any
invocation outside an evo workspace, even when the user only requested
external-state watches (`--for process=<pid>`, `--for log-growth=<path>`,
`--for gpu-active`, `--for gpu-idle`) -- none of which read or write
anything under `.evo/`.

Refactor:

1. Parse `--for` first.
2. Determine `needs_workspace` from the parsed conditions: true iff any
   condition is `experiments` or `ideators` (the workspace-anchored
   targets). The legacy default (no `--for`) implies both, so it still
   needs a workspace.
3. Only call `repo_root()` + `.evo/` existence check when
   `needs_workspace` is true. Otherwise `root` stays `None`.
4. The `run_dir` lookup later is now guarded on `root is not None`.
5. The workspace error message gains a clarifying suffix:
   `(required for --for experiments|ideators)`.
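
Step 2 of the refactor can be sketched as a small predicate. The target names and the "no --for implies both" rule come from the list above; the function name and the `name=value` splitting are assumptions about how conditions are represented.

```python
WORKSPACE_TARGETS = {"experiments", "ideators"}


def needs_workspace(conditions):
    """True iff any condition is workspace-anchored.

    An empty or missing list is the legacy default (experiments + ideators),
    which still needs a workspace.
    """
    if not conditions:
        return True
    # Conditions such as "process=1234" carry a value after "=".
    return any(c.split("=", 1)[0] in WORKSPACE_TARGETS for c in conditions)
```
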

Regression test: `test_non_workspace_watches_work_outside_workspace`.
Uses `--for log-growth` rather than `--for process=<pid>` to avoid the
zombie/reaping subtlety where a `subprocess.Popen` child of the test
process stays kill(pid,0)-alive until reaped, which would make a
process-watch test flaky.

Verified locally:
- `evo wait --for process=$PID --timeout 10s --json` in a non-git tmpdir
  returns clean JSON with `exit_reason: process-exited`.
- `evo wait --timeout 1s` (legacy default) in the same tmpdir still
  errors with `ERROR: not in an evo workspace (required for --for
  experiments|ideators)`.
- All 19 unit tests in `tests/unit/test_evo_wait.py` pass.
New section after parent-warm-start covering reusable artifacts -- LoRA
adapters, curated/tokenized datasets, embeddings, retrieval indexes,
precomputed eval generations.

Convention is .evo/cache/, sibling to run_<NNNN>/. Already gitignored
via the workspace's .evo/ exclude. Survives evo new / evo run /
evo reset because none of those touch siblings of run_<id>/.

Pattern: workspace-root lookup, key embeds every recipe input that
changes the artifact, read-or-compute. Calls out the worktree-local
anti-pattern (artifact disappears on gc) and acknowledges the HF Hub
cache to avoid duplication.
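
A minimal read-or-compute sketch under .evo/cache/. The helper names, the JSON serialization, and the 16-character key length are illustrative assumptions; real artifacts (adapters, indexes) would be stored as files or directories rather than JSON.

```python
import hashlib
import json
from pathlib import Path


def cache_key(recipe):
    """Hash every recipe input that changes the artifact."""
    blob = json.dumps(recipe, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]


def read_or_compute(workspace_root, name, recipe, compute):
    """Return the cached value for (name, recipe), computing it on a miss."""
    path = Path(workspace_root) / ".evo" / "cache" / f"{name}-{cache_key(recipe)}.json"
    if path.exists():
        return json.loads(path.read_text())
    value = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file first so a crash never leaves a partial entry.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(value))
    tmp.replace(path)
    return value
```

Because the path is resolved from the workspace root rather than the experiment worktree, sibling and descendant experiments hit the same entry.
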

Defers the named-asset registry to #55.
Domain-agnostic counterpart to the finetuning skill's caching section.
Subagent skill applies to every experiment-running agent regardless of
domain, so this is where the general "check before recomputing" prior
belongs.

Lists the artifact shape (slow to produce, stable across siblings/
descendants) and the cache path (.evo/cache/, gitignored, untouched by
new/run/reset). Calls out the worktree-local anti-pattern. Defers the
concrete read-or-compute pattern to the finetuning skill rather than
duplicating it.
Carries:
- docs(skill/finetuning): cache expensive intermediates under .evo/cache/
- docs(skill/subagent): domain-agnostic counterpart pointing at finetuning
…ator-readable

ideator and verifier migrated from skills to subagents in d69f35d /
0127244 (plugins/evo/agents/{ideator,verifier}.md). The skill paths
were kept as 6-line stub redirects so any caller resolving
Skill("evo:ideator") got a clear pointer rather than a not-found
error.

In practice the stubs are net-negative: an LLM reading the stub still
has to manually translate to Task(subagent_type="evo:ideator", ...).
Without the stub, Skill("evo:ideator") returns "not found" and the
optimize-skill roster (and any subagent-tool listing) can be the
canonical source naming them as subagents.

evo:subagent: drop the "Not for orchestrator use" gate. The description
now positions the skill as legitimately invokable by the orchestrator
to understand the brief shape its dispatched subagents expect.
Two-line preface in the body tells an orchestrator-as-reader to stop
at "Host conventions" -- the rest of the body is subagent-as-reader.
The brief-writing logic in /optimize gets a place to learn the
subagent's input/output contract without having to grep the source.
…ng references

The agent has had no map of evo's invocable surface, only individual
skill descriptions. Result, observed in a recent run: agent stuck in SFT
for 5 experiments without ever invoking evo:finetuning, never read
sizing-the-round.md, hand-wrote a stripped-down log_task/write_result
that produced empty Tasks panels in the dashboard.

Each entry-point skill now opens with an "Evo surface" tree sized to
its reader:

- discover/SKILL.md: full tree -- main thread (orchestrator) + subagent
  thread + complete references catalogue. First contact with the surface.
- optimize/SKILL.md: loop-relevant subset -- what gets pulled/dispatched
  during the loop. Cross-references discover for the full tree.
- subagent/SKILL.md: subagent's slice -- skills it pulls (finetuning),
  subagents it dispatches (verifier, mandatory pre/post), references
  it reads. Cross-references discover for the full tree.
- finetuning/SKILL.md: replaces the flat ## References list with a
  categorized tree (core contracts / rl/ / sft/ / serving/) + a brief
  cross-skill section.

Each line in every tree is a triggering condition ("if you're about to
do X, pull this"), not a description of what each piece does. The skill
bodies themselves carry the content; the tree is the index.

Also moved sizing-the-round.md from optimize/references/ to
discover/references/ (decision happens at the discover->optimize handoff,
which is during discover; reading it after /evo:optimize is invoked is
too late -- subagents=N is already chosen). Updated the path reference
in optimize/SKILL.md body accordingly.
Carries:
- skills: drop ideator/verifier stub redirects; make evo:subagent
  orchestrator-readable
- skills: add Evo surface tree to discover/optimize/subagent +
  finetuning references (categorized by rl/sft/serving)
- sizing-the-round.md: moved from optimize/ to discover/references/
  (decision happens at the discover->optimize handoff)
…, discard-time diff + dashboard discard_reason (#57)

#49 -- hook-drain: engage on evo invocations seen via PostToolUse
  Previously: SessionStart engaged the orchestrator, but only when
  .evo/ already existed. For claude --print batch mode that created
  the workspace mid-session via `evo init`, SessionStart's engage=true
  call no-op'd (no workspace yet); first PostToolUse registered with
  engage=false; later `evo` invocations never had a chance to upgrade
  engagement. Net: directives from `evo direct` got skipped_unengaged=1
  for the entire session lifetime -- mid-run injection non-functional.

  Now: PostToolUse detects when its Bash invocation is an `evo ...`
  command and (a) uses engage=true on first-time registration, or
  (b) upgrades has_evo_engaged: false -> true if the session was
  registered earlier via a non-evo PostToolUse. Subagent context
  (agent_id present, EVO_EXP_ID set) still suppresses engagement so
  spawned subagents don't engage the workspace loop on their own.

  Helpers: is_evo_invocation() substring-matches "evo " in the tool
  command (handles shell-snapshot wrappers); is_session_engaged()
  reads the session record and checks has_evo_engaged.

#56 -- evo run --check + evo run: assert tasks-in-result when N>1 traces
  Catches rolled-own log_task/write_result that emit per-task traces
  to traces/ but omit the `tasks` array from result.json. Dashboard's
  per-task panel reads outcome.benchmark.result.tasks for committed
  experiments (no fallback to the traces dir), so these benchmarks
  silently render "No benchmark task results recorded" even with 30
  traces on disk.

  Now: both _cmd_run_check and _cmd_run_impl, after load_result(),
  count traces/task_*.json files; if N>1 and parsed.tasks is empty,
  raise with a pointer at the canonical inline_instrumentation.py
  reference. Forces the agent to paste the file (which has the
  _SCORES accumulator + tasks aggregation built in) instead of
  hand-rolling a stripped-down version.

#57 -- discard observability: capture diff + render discard_reason
  Two halves:

  (a) Backend: cmd_discard now calls capture_experiment_diff against
      the experiment branch vs its parent's commit BEFORE
      delete_discarded_experiment removes the worktree+branch. Writes
      to <experiment_dir>/diff.patch (sibling to result.json, NOT
      under attempts/ since the discard may happen with current_attempt=0).
      Best-effort: log+continue on any failure (missing parent commit,
      orphaned worktree, etc.) so the discard itself never blocks.

  (b) Frontend: app.js drawer SUMMARY tab now renders
      node.discard_reason for discarded experiments (parallel to
      the existing pruned_reason rendering). Backend already returned
      it in the node API; frontend just had no code reading it.

  Together: discarded experiments now leave a visible trace of (i) the
  reason in the dashboard summary and (ii) the actual code changes
  via diff.patch the dashboard's Diff tab renders.

Closes #49 (engagement-half; registration-half already landed).
Closes #56.
Closes #57.
#49 (Rust hook-drain) -- 18 new unit tests in main.rs:
  - contains_evo_word: 8 tests covering direct, chained, shell-snapshot
    wrappers, tab separators, start-of-string, AND adversarial cases
    (servo init, levo build, evolution.py, evolved.sh, sevo --help,
    cargo install evo-hq-cli, python evo.py, vim evo_helper.py, lone
    `evo` with no trailing whitespace, empty/short input).
  - is_evo_invocation: 5 tests covering JSON payload parsing (Bash
    tool name, command field present/absent), rejection of non-Bash
    tools (Read, Write), rejection of evo-as-substring in commands,
    rejection of evo paths in file-path arguments.
  - is_session_engaged: 4 tests covering true/false/missing-file +
    a documented limitation case (whitespace-around-colon JSON variants).

  Also tightened contains_evo_word from naive substring match to
  word-boundary check (start-of-string OR non-word char before, space
  OR tab after). The prior `command.contains("evo ")` had real false
  positives: "servo init", "levo build", any path or argument ending
  in "...evo" followed by a space. Pure-stdlib byte scan, no regex dep.
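
The boundary rule can be illustrated with a Python port of the check (the real implementation is Rust, in main.rs). Treating "word char" as ASCII alphanumeric or underscore is an assumption of this sketch.

```python
def _is_word_char(ch):
    # Assumption: word characters are ASCII letters, digits, and underscore.
    return ch.isascii() and (ch.isalnum() or ch == "_")


def contains_evo_word(command):
    """True if `evo` appears as a command word followed by a space or tab."""
    start = command.find("evo")
    while start != -1:
        end = start + 3
        before_ok = start == 0 or not _is_word_char(command[start - 1])
        after_ok = end < len(command) and command[end] in " \t"
        if before_ok and after_ok:
            return True
        start = command.find("evo", start + 1)
    return False
```

Under this rule `evo init` and `cd repo && evo run` match, while `servo init`, `python evo.py`, and a lone `evo` with no trailing whitespace do not.
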

#56 -- 12 new unit tests in test_check_tasks_assertion.py:
  - Positive: N traces + tasks array present, zero traces, exactly
    one trace, missing traces dir, parsed=None with no traces, dict
    populated with at least one task.
  - Negative (assertion fires): 2 traces + no tasks, 30 traces + no
    tasks (PostTrainBench shape), tasks=empty dict, tasks=empty list,
    parsed=None with N>1 traces.
  - Edge: non-task files in traces dir (.DS_Store, README.md,
    summary.json) don't inflate the count; `taskoutcome_*.json`
    doesn't match the glob.

  Refactor: extracted `_assert_tasks_aggregated(traces_dir, parsed)`
  helper. Both `_cmd_run_check` and `_cmd_run_impl` call into it
  instead of inlining the same conditional. Single source of truth +
  unit-testable.

#57 -- 7 new unit tests in test_discard_diff_capture.py:
  - Positive: diff lands for branch with commits (file additions),
    captures modifications-not-just-additions (verifies +/- lines),
    diff path is the canonical experiment_dir/diff.patch.
  - Skip cases (returns None, no raise): worktree missing, parent_ref
    missing on graph entry, node has no parent_id.
  - Failure isolation: unexpected git failure (broken parent SHA)
    swallowed; discard never blocks.

  Refactor: extracted `_capture_discard_time_diff(root, exp_id, node,
  graph)` helper. cmd_discard calls into it instead of inlining the
  try/except wrapped capture. Same testability benefit.

Test results:
  - 18/18 Rust tests pass
  - 20/20 new Python tests pass (#56 + #57)
  - 646/646 existing Python unit tests still pass (1 known-flaky test
    `test_gate_check_runs_gates_without_changing_node_status` passes
    in isolation; intermittent failure in bulk sweep is pre-existing
    and not caused by these changes)
Carries:
- skills: Evo surface trees in discover/optimize/subagent + categorized
  references tree (rl/sft/serving) in finetuning
- skills: drop ideator/verifier stub redirects; make evo:subagent
  orchestrator-readable
- hook-drain (#49 follow-up): word-boundary `evo` detection in
  PostToolUse + engagement upgrade so mid-run `evo direct` actually
  fans out to claude --print batch sessions whose .evo/ was created
  via `evo init` mid-session
- evo run --check + evo run (#56): assert tasks-in-result when N>1
  per-task traces exist (catches rolled-own write_result)
- evo discard (#57): capture parent..exp diff into diff.patch before
  delete_discarded_experiment wipes the worktree+branch; dashboard
  drawer renders discard_reason for discarded experiments
- testing: 38 new unit tests (18 Rust + 20 Python); helpers refactored
  out of cli.py for testability
…ences principle

Replace the verbose "Skills before references, always" lecture in
discover with one principle line near the top of the Evo surface
section: "Always have a sense of the skill before jumping into its
references." Keep the tree itself unchanged (skills + references at
their normal hierarchy); the principle handles the ordering.

Why: agent in last run read finetuning/references/glue.md directly
(after seeing it in the tree) and used the I/O contract as license to
write SFT+LoRA without ever invoking evo:finetuning. The reward-shape
decision tree (only in the skill body) got skipped, agent defaulted
to SFT, 5 regressions followed. The reference looked authoritative on
its own because the tree placed it alongside skills with no ordering
hint.

glue.md: short pointer at the top noting the technique pick lives in
the parent skill body, not in this file. One sentence; doesn't
moralize.
…tion

Previous description -- "Run the evo optimization loop with parallel
subagents until interrupted" -- read like ceremony if width is 1. Agent
in a recent run reasoned itself out of invoking the skill: "instead
of going through the full evo:optimize ceremony (which adds subagent
overhead), I'll directly drive experiments using evo for tracking."
Never loaded the skill body, never saw the "applies to serial
workloads too" guidance inside it. Result: ad-hoc training, no
frontier reasoning, no scan-subagent cross-cutting analysis, no
annotation discipline -- the whole loop bypassed.

New description names the structural value up front: cross-cutting
analysis, frontier-based parent selection, ideator dispatch, verifier
pre/post hooks, annotation discipline. Positions optimize as the
natural successor to evo:discover + baseline commit. Frames width as
a configurable knob (subagents=N for any N) rather than parallelism
as the value prop.

The description is the only thing the agent sees in the slash-command
listing before deciding whether to invoke. Loading the body via Skill
is gated behind that decision; once the agent decides bypass is fine,
the body's "don't bypass" warnings are unreachable.
Existing test_hook_drain.py uses a fake evo-drain shim that prints {} --
it verifies the Rust binary handoff trigger, not what drain.py actually
does with queued events. The real drain code path had zero coverage.

6 new tests in tests/unit/test_directive_delivery_e2e.py exercising
the full pipeline: cmd_direct (queue) -> real Rust hook-drain binary
(handoff) -> real Python drain_session (delivery):

  Queue side (validates alpha.7 #49 wiring):
  - cmd_direct writes event to workspace.jsonl
  - cmd_direct touches marker for engaged session

  Delivery side (the bug suspected in #58):
  - hook fires after directive queued -> EVO DIRECTIVE banner emitted
  - delivered/<event_id>-<sid>.json record written
  - marker unlinked after successful delivery (prevents re-fire)
  - unengaged session does NOT receive directive (engagement filter)

Result: all 6 tests PASS. The delivery pipeline works correctly when
invoked in isolation. Implication: the live failure observed in the
PostTrainBench run (delivered/ empty after `evo direct` reported
fanout=1) is not a bug in the queue/handoff/drain chain. Most likely
cause is that the agent's blocking Bash tool call (training process
running synchronously) delayed PostToolUse firing -- and therefore
the next drain invocation -- until after observation ended.

These tests serve as regression coverage for the queue+delivery
contract going forward; they would catch a real break in either half.
- benchmark-reviewer: add `mode=review-experiment` for post-commit
  per-task failure analysis. Reads per-task traces + eval-runner log,
  classifies failures into a 9-category taxonomy (truncated,
  wrong-format, wrong-answer, hallucination, refusal, language-drift,
  prompt-misread, eval-error, unknown), writes per-task annotations
  via `evo annotate <exp> --task K`, returns JSON with
  failure_breakdown + top_failure_pattern + next_step_signal.
  `mode=audit` (existing) stays the default.

- subagent: new step 6b. After COMMITTED, before annotating, spawn
  benchmark-reviewer in review-experiment mode. Skipped for
  EVALUATED/FAILED/DISCARDED. By the time the orchestrator picks the
  next frontier, per-task diagnosis is on disk.

- finetuning: add `## Stream training metrics live` section + new
  references/observability.md. Env-driven detection prior
  (WANDB_API_KEY / TRACKIO_SPACE_ID / MLFLOW_TRACKING_URI), TRL
  report_to one-liners, custom-loop init+log pattern, EVO_EXPERIMENT_ID
  as run name. Closes the observability-blind window during long
  training runs (separate from in-flight evo state, tracked in #59).
- finetuning: add `## Long training: checkpoint, mid-eval, early-stop
  in-script` -- for trainings > 30 min, build a TrainerCallback that
  periodically saves a checkpoint, runs a mini-eval on a 5-10 item
  held-out subset, and stops on stalled/regressing trajectory. Commits
  the best mid-eval checkpoint, not the last. Pattern B from the
  multi-stage tradeoff: one evo node, verification inside the script.

- finetuning: add `## Cap retries at training scale` -- recommend
  `evo config set max-attempts 1` for training-heavy workspaces.
  The default of 3 retries was designed for second-scale benchmarks; at
  hour-scale, retry-with-tweak burns more compute than a fresh
  hypothesis would. One shot per node; regression -> discard, branch.

- subagent Evo surface: add evo:benchmark-reviewer to dispatch list
  (was missing -- only verifier was there); add observability.md to
  finetuning references tree; expand evo:finetuning trigger line with
  scope hint (technique choice, training recipe, observability, retry
  discipline) so the subagent knows it has the right skill without
  re-reading the whole body.
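
The env-driven detection in the `Stream training metrics live` item above could look roughly like the sketch below. This is not the code shipped in references/observability.md. The function name `detect_tracker`, the precedence order of the three variables, the `"none"` fallback, and the `"unnamed-run"` default are all assumptions made for illustration; only the env var names and the use of EVO_EXPERIMENT_ID as run name come from the commit message.

```python
import os
from typing import Mapping, Optional

# Env var -> `report_to` value. The precedence order is an assumption;
# the commit message lists the variables but not which one wins.
_TRACKERS = (
    ("WANDB_API_KEY", "wandb"),
    ("TRACKIO_SPACE_ID", "trackio"),
    ("MLFLOW_TRACKING_URI", "mlflow"),
)


def detect_tracker(env: Optional[Mapping[str, str]] = None) -> dict:
    """Pick a metrics backend from the environment and name the run."""
    env = os.environ if env is None else env
    report_to = "none"  # hypothetical fallback: no live reporting
    for var, backend in _TRACKERS:
        if env.get(var):  # set and non-empty
            report_to = backend
            break
    return {
        "report_to": report_to,
        # EVO_EXPERIMENT_ID doubles as the run name, per the commit message.
        "run_name": env.get("EVO_EXPERIMENT_ID", "unnamed-run"),
    }
```

The returned dict is shaped so it could be passed as keyword arguments to a TRL/Transformers training config, which accepts `report_to` and `run_name`; whether the installed version supports every backend listed here should be checked before relying on it.
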
Drawer Summary tab gains a Diagnoses section (global annotations from
benchmark-reviewer/verifier/agents); Tasks tab shows the per-task
annotation inline under each failing task, plus target + full model
output on expand. Logs tab replaces the file dropdown with inline file
tabs, strips ANSI escape codes, and unifies the card styling. Tab strip
grid changed from 3 to 4 columns (Logs no longer wraps), with brighter
active-tab contrast for low-contrast monitors.
…loop

Opt-in, Claude-Code-only workflow driver alongside the prose loop, selected by
default-orchestrator (prose|workflow; default prose). When host=claude-code and
the flag resolves to workflow, optimize/SKILL.md launches
skills/optimize/workflows/evo-optimize.js via the Workflow tool instead of
driving the loop turn-by-turn; otherwise the prose loop runs unchanged. Both
drive the same evo CLI -- gates, frontier, dashboard, state are identical.
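
The selection rule above reduces to a small decision function. A minimal sketch, assuming the host is available as a plain string; `pick_orchestrator` is a hypothetical name, not a function in the evo codebase.

```python
def pick_orchestrator(host: str, default_orchestrator: str = "prose") -> str:
    """Return which driver runs the optimize loop: 'prose' or 'workflow'."""
    if default_orchestrator not in ("prose", "workflow"):
        raise ValueError(f"unknown orchestrator: {default_orchestrator!r}")
    # The workflow driver is opt-in and Claude-Code-only; every other
    # combination falls back to the unchanged prose loop.
    if host == "claude-code" and default_orchestrator == "workflow":
        return "workflow"
    return "prose"
```
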

The workflow encodes the loop control deterministically: orient, mandatory scan
+ cross-history axis check, ideator dispatch on stall/periodic with proposal
reconciliation, brief writing + diversity dedupe, per-brief parallel lanes
(implement -> pre-verify<->revise -> run -> post-audit, deepened to budget),
collect/prune, and stall that resets only on a verified committed score beating
the prior best. args are coerced (object or JSON string); schemas stay flat
because the StructuredOutput validator rejects allOf/if/then.
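
The stall rule at the end of that loop is the easiest part to get subtly wrong, so here is a sketch of it in Python (the real workflow is JavaScript). `StallTracker` and its fields are hypothetical names, and the sketch assumes higher scores are better.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StallTracker:
    best: Optional[float] = None  # best verified committed score so far
    stalled_rounds: int = 0

    def update(self, score: float, committed: bool, verified: bool) -> int:
        """Record one round's result and return the current stall count."""
        # Only a committed AND verified score that beats the prior best
        # counts as progress; anything else extends the stall.
        improved = committed and verified and (
            self.best is None or score > self.best
        )
        if improved:
            self.best = score
            self.stalled_rounds = 0
        else:
            self.stalled_rounds += 1
        return self.stalled_rounds
```

An unverified high score neither resets the counter nor replaces `best`, so a later verified score is still compared against the last trusted value.
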

cli.py: default-orchestrator config field (set/get/show + argparse choices).
.gitignore: ignore .claude/worktrees/.
Runs a self-paced Opus observer alongside the optimize round loop via
Promise.all. It checks host and cross-history signals during rounds
(zombie GPU, buried stderr, stuck experiment, saturated axis, dead
direction) and folds work-quality hints into the next round's brief;
runtime and host issues surface as alerts.

The observer is advisory and isolated from the optimizer:
- Interruptible wait: optimize drops a sentinel (.evo/.wf_optimize_done)
  on exit and the in-flight tick polls it in 15s hops, so the thread
  stops within a hop instead of blocking the run for the full interval.
- A failed tick is swallowed (per-tick try/catch + thread .catch) so it
  can never reject Promise.all and abort the run.
- Self-disables after 3 consecutive failed ticks, so a tick that fails
  before its pacing wait can't hot-spin agents.
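
The three isolation properties above can be sketched together. The real observer is JavaScript running under Promise.all; the version below uses Python's asyncio as a stand-in. `observer`, `interruptible_wait`, and the return strings are hypothetical names; the 15s hop and the limit of 3 consecutive failures come from the commit message.

```python
import asyncio
from pathlib import Path
from typing import Awaitable, Callable


async def interruptible_wait(sentinel: Path, interval_s: float,
                             hop_s: float = 15.0) -> bool:
    """Wait up to interval_s in hop_s steps; True once the sentinel exists."""
    remaining = interval_s
    while remaining > 0:
        if sentinel.exists():
            return True
        step = min(hop_s, remaining)
        await asyncio.sleep(step)
        remaining -= step
    return sentinel.exists()


async def observer(tick: Callable[[], Awaitable[None]], sentinel: Path,
                   interval_s: float, hop_s: float = 15.0,
                   max_failures: int = 3) -> str:
    """Run ticks until the sentinel appears or the observer disables itself."""
    failures = 0
    while not sentinel.exists():
        try:
            await tick()
            failures = 0
        except Exception:
            # Swallow the failure so it can never abort the optimize run.
            failures += 1
            if failures >= max_failures:
                return "disabled"  # stop instead of hot-spinning
            continue  # the failed tick never reached its pacing wait
        if await interruptible_wait(sentinel, interval_s, hop_s):
            break
    return "stopped"
```

Because `observer` catches every `Exception` itself, running it next to the optimizer with `asyncio.gather` would not let a bad tick fail the gathered result, which mirrors the per-tick try/catch plus thread `.catch` described above.
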

Orient state-read moves from haiku to sonnet.
…ore evo run

The workflow splits a subagent into separate implement and run agents.
implementPrompt loads the subagent skill in full; runPrompt was a fresh
agent told only "Run `evo run` to evaluate and commit" -- no skill load,
no `--check`, no train-then-eval ordering. For a finetune the run agent
called the real `evo run` before its background train.py wrote
final_model/, producing a spurious "final_model not found" failure that
consumed the attempt.

runPrompt now loads the subagent skill, names `evo run --check` for
non-committing wiring validation, and requires the artifact (e.g.
final_model/) to exist before the real run, warm-starting from
EVO_PARENT_POLICY when the experiment trains.
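
The ordering the fixed runPrompt asks for can be sketched as a gate in front of the real run. This is an illustration of the rule, not the prompt's actual mechanism: `plan_run` is a hypothetical helper, and it only builds the command lists rather than executing them. `evo run` and `evo run --check` are the commands named above.

```python
from pathlib import Path
from typing import List


def plan_run(workdir: Path, artifact: str = "final_model") -> List[List[str]]:
    """Return the commands the run agent may issue right now, in order."""
    # Wiring validation is non-committing, so it is always allowed.
    commands = [["evo", "run", "--check"]]
    # The real run consumes an attempt; only allow it once the training
    # script has written its output directory.
    if (workdir / artifact).is_dir():
        commands.append(["evo", "run"])
    return commands
```

While training is still in the background, the plan contains only the `--check` call, so the attempt is not spent on a "final_model not found" failure.
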
@alokwhitewolf alokwhitewolf merged commit 06eae04 into main Jun 6, 2026
41 of 42 checks passed