0.5.0-alpha.1: subagents, evo wait, hook-drain fix, skill rewrites by alokwhitewolf · Pull Request #54 · evo-hq/evo

alokwhitewolf · 2026-06-01T20:17:51Z

Cuts 0.5.0-alpha.1. Tag v0.5.0-alpha.1 already pushed at the tip; publish workflow fires on the same SHA.

Plugin subagents (new `plugins/evo/agents/`)

Adds an agents/ directory next to skills/. Claude Code's plugin loader picks up agents/<name>.md automatically -- no .claude-plugin/plugin.json change needed. Invokable from skills via Task(subagent_type="evo:<name>", ...).

evo:benchmark-reviewer -- read-only pre-flight audit of a freshly-constructed benchmark before its first evo run. Checks per-task instrumentation, eval-set leakage (direct + transitive), Goodhart gate coverage, basic plumbing. Returns {passed, findings[]}. evo:discover step 10d now gates the baseline run on passed=true.
evo:verifier -- migrated from skills/verifier/. Read-only audit catching the five false-progress patterns (held-out training, eval items in synthetic data, reverse-engineered verifier, instruct-model substitution, train↔verifier objective mismatch). Skill stub at the old path redirects to the agent.
evo:ideator -- migrated from skills/ideator/. Web-research-heavy proposal generator; one brief per invocation (failure_analysis | literature | frontier_extrapolation); appends JSONL proposals the orchestrator reconciles. Skill stub redirects.

Refs #50. The third migration -- skills/subagent/ -> agents/experiment-runner.md -- is not in this PR; that's a larger refactor touching the spawn protocol in evo:optimize.

Hook-drain: lazy-register on `PostToolUse`

Fixes #49, where a claude --print session whose .evo/ is created from inside the session never registers in inject/sessions/. evo direct then reports fanout=0 and queued directives are undeliverable for the rest of the session.

bin/evo-hook-drain-rs/src/main.rs already had a UserPromptSubmit lazy-register branch with the comment "Lazy-register on first prompt so the session can recover if SessionStart fired before .evo existed." --print mode never emits UserPromptSubmit, so the recovery path never fired in batch sessions. Adds a parallel PostToolUse branch (engage=false); the first tool call after evo init registers the session.

Source change is in this PR. A binary rebuild + release is needed for plugin installs to pick up the fix; tagging v0.5.0-alpha.1 should trigger the publish workflow that builds the binaries. Until that lands, host-install falls back to releases/latest (see below).

`evo wait` extensions

Implements #52. --for is now repeatable. New targets:

--for process=<pid> -- exits when the PID dies. Liveness via os.kill(pid, 0). exit_code is returned as null because evo wait is rarely the pid's parent and so cannot reap its status; documented in the help text.
--for log-growth=<path> -- exits when the file stops growing for --stall-threshold seconds.
--for gpu-active / --for gpu-idle -- reads nvidia-smi --query-gpu=utilization.gpu; skips cleanly with {"note":"nvidia-smi unavailable"} when the tool is absent.
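A minimal sketch of the liveness probe behind `--for process=<pid>`. Only the `os.kill(pid, 0)` call comes from this PR; the function name `pid_alive` and the handling of `PermissionError` are assumptions for illustration, and the sketch targets POSIX systems.

```python
import os


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        # Signal 0 delivers nothing; the kernel only checks existence and permissions.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but is owned by another user
    return True
```

Because the watcher is usually not the parent of the PID, it can observe that the process is gone but cannot collect its exit status, which matches the null exit_code described above.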

Shared options: --timeout (duration string; default 1h, max 24h), --stall-threshold (default 2m), --poll-interval (default 5s), --json (structured exit reason with exit_reason, triggered_by, per-watch state).

Multiple --for flags combine; returns on the first matching condition. Existing --for experiments and --for ideators preserved.
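The log-growth target can be sketched as a bounded poll on file size. This is an illustration, not the code in cli.py: the function name `wait_for_stall` and the exit_reason values "stalled" and "timeout" are hypothetical, while the defaults mirror the documented --stall-threshold (2m), --poll-interval (5s), and --timeout (1h).

```python
import os
import time


def wait_for_stall(path, stall_threshold=120.0, poll_interval=5.0, timeout=3600.0):
    """Block until `path` has not grown for `stall_threshold` seconds, or until timeout."""
    start = time.monotonic()
    last_size = os.path.getsize(path)
    last_growth = start
    while True:
        now = time.monotonic()
        size = os.path.getsize(path)
        if size > last_size:
            # The file grew since the previous poll: reset the stall clock.
            last_size, last_growth = size, now
        if now - last_growth >= stall_threshold:
            return {"exit_reason": "stalled", "triggered_by": f"log-growth={path}"}
        if now - start >= timeout:
            return {"exit_reason": "timeout", "triggered_by": None}
        time.sleep(poll_interval)
```

Unlike a `while true` tail loop, every iteration checks both the stall clock and the overall timeout, so a crashed writer ends the wait instead of extending it.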

Follow-up #53: the workspace check fires at the CLI entry, but the new process/log-growth/gpu-* targets don't need workspace context. Move that check into the experiments/ideators branches only.

Diff capture

Fixes #51. attempts/<n>/diff.patch files were always 0 bytes. The capture in cli.py ran render_git_diff(relative_target(config)) -- scoped to the configured target file -- and the typical workflow of committing all changes to the experiment branch before evo run left the working tree clean for the capture site. Any source change outside the target file was also silently dropped even when the working tree was dirty.

Replaced with capture_experiment_diff(root, exp_id, attempt, parent_ref, worktree, executor, exclude_patterns) in core.py. Range diff (parent_branch..exp_branch) repo-wide. Default DEFAULT_DIFF_EXCLUDE_PATTERNS filters safetensors, optimizer/scheduler state, tokenizer blobs, and checkpoint-*/**. Overridable per workspace via config["diff_exclude_patterns"]. Wired into the existing capture site in cli.py.
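One way to express a repo-wide range diff with exclusions is git's `:(exclude)` pathspec magic. The sketch below only builds the command line; the helper name `build_range_diff_cmd`, the branch names, and the two sample patterns are hypothetical, and the actual capture_experiment_diff may assemble its diff differently.

```python
def build_range_diff_cmd(parent_ref, exp_ref, exclude_patterns):
    """Build a `git diff` argv for parent..experiment, skipping excluded paths."""
    cmd = ["git", "diff", f"{parent_ref}..{exp_ref}", "--", "."]
    # Each pattern becomes an exclude pathspec evaluated by git itself.
    cmd += [f":(exclude){pattern}" for pattern in exclude_patterns]
    return cmd


# Hypothetical refs and patterns, for illustration only.
print(build_range_diff_cmd("main", "evo/exp_0001", ["*.safetensors", "checkpoint-*/**"]))
```

Diffing a commit range rather than the working tree is what makes the capture independent of whether changes were committed before evo run.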

PostToolUse wait-hint hook

plugins/evo/hooks/wait_hint.sh registered in hooks.json under PostToolUse with matcher: "Bash". Inspects the command, matches against a long-running-command pattern table (python.*train, vllm serve, accelerate launch, nohup .* &, while true ... sleep, sleep ... tail), emits a one-line [evo-hint] to stdout suggesting evo wait instead of tail-loop polling. Dedup per session via marker file under $TMPDIR. Cold runtime ~35ms.
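The matching logic can be sketched in Python, although the real hook is a shell script. The pattern strings come from the table above; reading "..." as "anything in between", the hint wording, and the in-memory `seen_sessions` set (standing in for the marker file under $TMPDIR) are assumptions.

```python
import re

# Pattern table from the description; ".*" for the elided middle is an assumption.
LONG_RUNNING_PATTERNS = [
    r"python.*train",
    r"vllm serve",
    r"accelerate launch",
    r"nohup .* &",
    r"while true.*sleep",
    r"sleep.*tail",
]


def wait_hint(command, seen_sessions, session_id="default"):
    """Return a one-line hint for a long-running command, at most once per session."""
    if session_id in seen_sessions:
        return None  # already hinted in this session
    if not any(re.search(pattern, command) for pattern in LONG_RUNNING_PATTERNS):
        return None
    seen_sessions.add(session_id)
    # Hypothetical wording; only the "[evo-hint]" prefix is taken from the PR.
    return "[evo-hint] long-running command detected; consider `evo wait` over a tail loop"
```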

Companion to the skill-text polling discipline (below). Closes the gap where the skill text describes the right pattern but agents may not read it.

Host-install: `releases/latest` fallback

host_install/_hook_drain.py fetched releases/download/v<version>/<asset> and gave up on 404. Pre-release versions (0.5.0-alpha.1, etc.) don't get a corresponding GitHub Release until the publish workflow fires on a stable bump. Result: every alpha install of evo printed a 404 warning and skipped staging the binary, leaving evo direct mid-run inject silently broken.

Adds a second URL to the fetch chain: releases/latest/download/<asset>. The hook-drain wire protocol is stable across minor versions, so a slightly older binary still works for the current alpha CLI -- same rationale used in our Modal image setup. Prints a NOTE when the fallback is used so users know they're running against the latest stable binary rather than a version-pinned one.
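The fetch chain can be sketched as an ordered list of candidate URLs tried in turn. The URL shapes match the ones described in this PR; the helper names `hook_drain_urls` and `fetch_first`, the injected `fetch` callable, and the NOTE wording are hypothetical.

```python
def hook_drain_urls(version, asset, repo="evo-hq/evo"):
    """Candidate download URLs: version-pinned first, then the latest stable release."""
    base = f"https://github.com/{repo}/releases"
    return [
        f"{base}/download/v{version}/{asset}",
        f"{base}/latest/download/{asset}",
    ]


def fetch_first(urls, fetch):
    """Try each URL in order; `fetch(url)` returns bytes, or None on a miss such as a 404."""
    for index, url in enumerate(urls):
        data = fetch(url)
        if data is not None:
            if index > 0:
                # Tell the user the binary is not version-pinned.
                print(f"NOTE: using fallback binary from {url}")
            return data, url
    return None, None
```

Passing the downloader in as a callable keeps the ordering logic testable without network access.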

Skill rewrites

`evo:finetuning`

f24fbc3 refactor(skill/finetuning): rewrote the ~1200-word body into ~600 words structured for skim. Description names every covered technique (SFT/LoRA/DPO/KTO/ORPO/RFT/GRPO/PPO/RLOO/RLHF) for the host's trigger-phrase loader. Reward-shape decision tree moved from prose to a four-row table. Smoke-run section added as a precondition (~10 examples, ~1 minute, must produce a loadable checkpoint and a non-zero eval). Five false-progress patterns surfaced as a top-level list pointing at references/false-progress.md.
8ad213c feat(skill/finetuning): added a fourth diagnostic: stuck at the same non-zero score across 3+ experiments spanning distinct techniques (SFT, GRPO, RFT). Bottleneck is not the training method; spot-check 3 training examples and 3 eval-prompt examples for train↔verifier output format alignment (e.g. \boxed{X} vs ANSWER: X).
ee4a6ac feat(skill/finetuning): documents EVO_PARENT_POLICY env var as the default warm-start path for non-root experiments. evo run already populates it; the skill now explicitly tells the training script to consume it.

`evo:discover` and `evo:optimize`

34d11e8 feat(skills): polling discipline section in both skills. Bans while true; do sleep N; tail; done because a crashed process stops producing output and the loop reads the empty delta as "still working." Documents a bounded for i in $(seq 1 N); do ...; done poll pattern that checks process liveness, log size delta, and GPU activity. Forward-compatible note pointing at evo wait --for (#52, implemented in this PR).
8ab82be fix(skill/discover): runner-library wrapper anti-pattern. When a benchmark wraps a runner that hides the per-item loop (inspect_evals, lm-eval-harness, evals), the wrapper must parse the runner's per-sample output and emit one log_task(item_id, score=...) per item. The aggregate-only emission case was the canonical failure of the per-task instrumentation discipline; it left the dashboard's per-task panel and the verifier's spot-check capability silently empty.

`evo:infra-setup`

0b78240 fix(skill/infra-setup): removed disable-model-invocation: true from the frontmatter. Without removing it, the skill couldn't be loaded via the Skill tool, which evo:discover step 0 ("internalize every skill before any action") explicitly requires. Same fix that was previously applied to skills/subagent/ in 680336e.

Dashboard

c8f5e62 feat(dashboard): experiment node drawer now has a Logs tab + a trackio link/sparkline in the Summary tab.

New Logs tab tails *.log and *.out files in the latest attempt dir. File picker, auto-refresh toggle (2s poll), byte-offset incremental polling so the panel appends new bytes instead of re-rendering.
New Metrics row in the Summary tab when an experiment's traces dir contains a .trackio_url marker (written by posttrainbench-evo's trl_trackio_callback.py). Renders a clickable HF Space link and up to 3 inline SVG sparklines (loss / lr / reward / kl / grad_norm preferred). No Plotly dependency.

Backend endpoints:

GET /api/node/<exp>/logs -- lists *.log and *.out files in the latest attempt with size + mtime.
GET /api/node/<exp>/log/<file>?tail=N&offset=M -- existing endpoint gained line-tail and byte-offset modes; sets X-Log-Size for resumable polling.
GET /api/node/<exp>/trackio -- reads the .trackio_url marker; optionally scrapes recent scalars via huggingface_hub + pandas + pyarrow (lazy imports; degrades to link-only when any are unavailable).
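The byte-offset mode of the log endpoint can be sketched as a small read helper. This is an illustration rather than the dashboard.py handler: the name `read_log_chunk` and the reset-on-truncation behavior are assumptions, and the returned offset plays the role of the X-Log-Size value a client would send back on its next poll.

```python
import os


def read_log_chunk(path, offset=0):
    """Return (new_bytes, next_offset) for everything written past `offset`."""
    size = os.path.getsize(path)
    if offset > size:
        # Assumed behavior: the file was truncated or rotated, so start over.
        offset = 0
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    return data, offset + len(data)
```

A client that stores `next_offset` between polls receives only appended bytes, which is what lets the panel append instead of re-rendering.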

Files moved

plugins/evo/skills/verifier/SKILL.md and plugins/evo/skills/ideator/SKILL.md -- replaced with redirect stubs. Bodies live in plugins/evo/agents/verifier.md and plugins/evo/agents/ideator.md.
plugins/evo/skills/optimize/SKILL.md and plugins/evo/skills/subagent/SKILL.md -- callers updated to invoke Task(subagent_type="evo:verifier" | "evo:ideator") instead of loading the old skill bodies.

Versions

pyproject.toml, plugins/evo/pyproject.toml, plugins/evo/.claude-plugin/plugin.json, and every skill's evo_version field at 0.5.0-alpha.1.
plugins/evo/uv.lock resynced (was drifting at 0.5.0a4).
bin/evo-hook-drain-rs/Cargo.toml stays at 0.1.0. The Rust binary doesn't ride the CLI version.

Issues

Fixes #49 (hook-drain: session not registered when .evo/ is created mid-session under claude --print) and #51 (attempts/<n>/diff.patch written as 0 bytes despite real commits on experiment branch)
Refs #50, Add subagents to the evo plugin (plugins/evo/agents/): 2 of 3 planned migrations done; subagent -> experiment-runner deferred
Refs #52, Extend evo wait to watch processes, log growth, and GPU activity (replace while true agent poll loops): #53 filed as the workspace-check follow-up
Refs #53, evo wait --for process/log-growth/gpu-* requires an evo workspace, but should not (filed)

Platform-agnostic method judgment (triggers, what matters, the rules that are not laws) plus references: trace schema, glue contract, diagnostics, and provider recipes (sft/tinker, rl/art, serving/vllm). Judgment lives in the skill; guardrails are evo gates. evo ships no per-framework training adapter.

Path.home() / ".claude" hardcoded in 7 spots silently misses the plugin cache when Claude Code is configured via CLAUDE_CONFIG_DIR (cloud containers with a persistent volume, sandbox setups, custom configs). The downstream symptom: ensure_hook_drain_binary skipped at install, hook fires with exit 127 at runtime, evo direct delivery permanently broken with no clear error. Centralize on _claude_config_dir() that reads CLAUDE_CONFIG_DIR with ~/.claude as the fallback. Apply across _latest_cache_dir, update --force cache wipe, and the doctor command. Regression test in tests/unit/test_claude_code_install_paths.py covers both the override and the no-leak-to-home invariant.
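The centralized lookup might look like the sketch below. The env var name and the ~/.claude fallback come from the commit message; the public name `claude_config_dir`, the injectable `env` mapping, and the `expanduser()` call are assumptions made to keep the example testable.

```python
import os
from pathlib import Path


def claude_config_dir(env=None):
    """Resolve the Claude Code config dir, honoring CLAUDE_CONFIG_DIR when it is set."""
    env = os.environ if env is None else env
    override = env.get("CLAUDE_CONFIG_DIR")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".claude"
```

Routing every call site through one function is what makes the no-leak-to-home invariant checkable: with the override set, no returned path should sit under the home directory unless the override itself points there.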

…ride

Workspaces now declare their wall-clock budget per experiment up front at `evo init`. Stored in config.json as `per_exp_timeout`; consumed by `evo run` as the default for `subprocess.run(..., timeout=...)`. Per-call override is `evo run --timeout N`.

Was: evo run silently used a hardcoded 1800s default whenever the agent forgot --timeout. SFT-style benchmarks (~40-60min) systematically got killed at step 1 with no per-workspace tunability.

API change (breaking):
- `evo init` now requires --per-exp-timeout <seconds>
- core.init_workspace() gains per_exp_timeout: int | None parameter
- `evo config get|set per-exp-timeout` mirrors the existing field pattern

Backward compat:
- Workspaces initialized before this change (no `per_exp_timeout` in config.json) fall back to legacy 1800s with a one-line stderr warning nudging the user to run `evo config set per-exp-timeout <N>`.

Discover skill (step 7) + cli-quick-reference both updated to include the required flag, with guidance on picking the value.

Version bump 0.4.4 -> 0.5.0-alpha.1 (minor: required-arg-add at init).

Tests:
- tests/unit/test_per_exp_timeout.py covers persistence, precedence (override > workspace > legacy fallback), value validation, argparse enforcement of required=True.
- 57 existing test-suite evo-init call sites updated with the new flag (matched only lists containing both "init" and "--target" to avoid false-matching git init / git commit -m "init").
- Full unit suite green (582 passed, 1 skipped).
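The precedence rule (override > workspace > legacy fallback) can be sketched as follows. The config key `per_exp_timeout`, the 1800s legacy value, and the suggested `evo config set` command come from the commit message; the function name `resolve_timeout` and the exact warning text are hypothetical.

```python
import sys

LEGACY_TIMEOUT_S = 1800  # hardcoded default used before this change


def resolve_timeout(cli_timeout, config):
    """Pick the experiment timeout: CLI override, then workspace config, then legacy."""
    if cli_timeout is not None:
        return cli_timeout  # `evo run --timeout N`
    workspace_timeout = config.get("per_exp_timeout")
    if workspace_timeout is not None:
        return workspace_timeout  # value stored at `evo init`
    # Pre-change workspace: fall back and nudge the user on stderr.
    print(
        "warning: no per_exp_timeout in config.json; using legacy 1800s. "
        "Run `evo config set per-exp-timeout <N>`.",
        file=sys.stderr,
    )
    return LEGACY_TIMEOUT_S
```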

Two new skills, each addressing a distinct failure mode the optimize loop didn't catch before:

evo:verifier (SIA-style sequential)
Audit a single experiment for cheating/validity. Two phases:
- --phase pre: static analysis before evo run (cheap, ~30s). Catches test-set leakage in training data, deliberately subsetted eval, missing artifact gates, vague hypotheses, resource-profile conflicts.
- --phase post: result audit before commit. Duration sanity vs cohort, artifact reality (real model files), score reproducibility spot-check (re-eval 2 random tasks), gate compliance.
Writes verdict as `evo annotation --type verification`. Exit 1 on FAIL so the subagent's caller can branch.

evo:ideator (Cursor parallel-then-reconcile)
Generate experiment proposals via three briefs spawnable in parallel:
- --brief failure_analysis: cluster discards, propose alternatives
- --brief literature: web/arxiv scan for untried techniques
- --brief frontier_extrapolation: deeper variants of the steepest gradient
All briefs append to .evo/run_<id>/ideator/proposals.jsonl. Orchestrator reconciles at next evo new (rank by uplift x confidence, dedupe vs graph).

Integration:
- subagent skill: new step 4 (pre-verifier) + step 6 (post-verifier). Original "Run / Analyze / Annotate / Decide" renumbered to 5/7/8/9. FAIL verdict triggers evo discard with the verifier's one-line reason; PASS proceeds to evo run.
- optimize skill: new step 6b (spawn parallel ideators on stall / every 5 commits / failure clusters) + step 6c (reconcile proposals at next brief-writing time). Non-blocking -- orchestrator continues its loop while ideators run.

Patterns synthesized from:
- hexo-ai/sia: single Feedback-Agent reading trajectories. Sequential. Inspired verifier's pre/post split.
- huggingface/ml-intern: doom-loop detector as lightweight watchdog (not a separate agent). Informs follow-up work.
- cursor 2.4 subagents: parallel-then-reconcile pipeline (lint+security+tests merging into one review). Direct template for ideator's three-brief fan-out.

Verifier's pre-phase static cheating check is novel relative to all three prior systems -- evo is more autonomous (no human-in-loop per experiment) so it needs to detect design-time gaming itself.

Version bump 0.5.0-alpha.1 -> 0.5.0-alpha.2. bump-version.py SKILLS tuple extended to include verifier+ideator.

Tests: full unit suite green (582 passed, 1 skipped).

… proposals evo wait used to only watch experiment-dir transitions. The verifier+ideator skill (35c313f) said "fire ideators non-blocking, read proposals at next round" -- but ideators take 5-10 min and the next round is 1-2 min away, so proposals consistently miss the round they were spawned for. Adds --for {experiments,ideators} (default: experiments) plus --count N (ideators only). The ideator path snapshots proposals.jsonl mtime + line count at wait start, returns 0 when N additional lines have been appended since baseline; partial counts surface on timeout (e.g. "timed out with 1/3 proposals (partial)") so the orchestrator can proceed with what's available. Optimize skill step 6b now distinguishes the trigger: - STALL or FAILURE CLUSTER -> block here briefly (next round depends on fresh ideas): `evo wait --for ideators --count 3 --timeout 900` - Periodic (every-N-commits) -> fire and continue; the next round can run on in-graph signal, the round after picks up proposals Step 6c adds a short fallback wait for the fire-and-continue path: `evo wait --for ideators --count 1 --timeout 120`. Ideator skill clarifies: - --count counts LINES not ideator completions - Append-at-end discipline is for failure atomicity, not wait semantics Tests: 6 new in TestEvoWaitForIdeators (timeout, single-arrival, --count 3 satisfies after 3rd, partial timeout w/ 1/3 surface, baseline doesn't satisfy, baseline+new wakes). Full unit suite: 588 passed, 1 skipped. Version bump 0.5.0-alpha.2 -> 0.5.0-alpha.3.
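The line-count baseline behind `--for ideators --count N` can be sketched like this. The partial-timeout message format is quoted from the commit message; the helper names are hypothetical, and the sketch tracks only the line count, whereas the real command also snapshots the file's mtime.

```python
def count_lines(path):
    """Number of lines in the proposals file; a missing file counts as zero."""
    try:
        with open(path, "rb") as handle:
            return sum(1 for _ in handle)
    except FileNotFoundError:
        return 0


def new_proposals(path, baseline):
    """Lines appended since the baseline snapshot taken at wait start."""
    return max(0, count_lines(path) - baseline)


def timeout_message(arrived, count):
    """Surface partial progress so the orchestrator can proceed with what exists."""
    return f"timed out with {arrived}/{count} proposals (partial)"
```

Counting against a baseline rather than the absolute line total is what keeps proposals from earlier rounds from satisfying a fresh wait.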

evo wait used to require --for {experiments|ideators} with experiments as default. The split was a reflexive copy of an existing pattern.

The common case is "wake me when anything interesting happens" -- both sources qualify. Make that the default; --for becomes a filter for the narrow case (orchestrator blocking specifically for proposals after a stall, doesn't want incidental experiment activity to wake it).

Old:
    evo wait                      # only experiments
    evo wait --for ideators       # only ideators

New:
    evo wait                      # both (default; one command, no flags)
    evo wait --for ideators       # filter to just ideators
    evo wait --for experiments    # filter to just experiments

--count now requires --for (otherwise ambiguous which kind to count); --count > 1 without --for returns exit 2 with a usage error pointing at the right invocation.

Tests: 5 new in TestEvoWaitDefaultWatchesBoth covering:
- default wakes on experiment outcome (existing behavior preserved)
- default also wakes on ideator proposal
- default timeout message names both watched sources
- --count > 1 without --for is rejected
- --count = 1 without --for is allowed (no-op pairing)

Existing TestEvoWait and TestEvoWaitForIdeators classes pass unchanged (their explicit args still satisfy the new resolution). Full unit suite: 593 passed, 1 skipped.

Version bump 0.5.0-alpha.3 -> 0.5.0-alpha.4.

Patterned after huggingface/ml-intern's tool surface (HF Papers + HF Hub + HF Docs + GitHub code/repos/files + web search): the literature brief now scans arXiv, HF Papers, HF Hub, GitHub code search, GitHub issues, GitHub PRs, and recent blog posts in one pass. Each source surfaces a different kind of signal (papers = methodology, repos = runnable code, issues = practitioner anecdotes, blogs = honest writeups). Added structured procedure: 1. Frame the search from project.md + evo show root 2. Multi-source parallel scan (5-8 queries, source matrix provided) 3. Due diligence per candidate (WebFetch the paper/repo/issue, check last commit, confirm claim is in headline results not appendix) 4. Cross-check evo graph -- skip duplicates of prior experiments 5. Rank by has_runnable_code > replication > specificity > recency 6. Write 2-4 proposals with full provenance Proposal schema extended with optional `sources[]` (kind+url+claim) and `confidence_signals` (has_runnable_code, replicated_across_sources, specificity, recency_months) -- only required for the literature brief. Orchestrator's reconciler can use these to weight proposals. Added "Generic web-research mode" subsection: the literature brief doubles as a focused web-research agent when the orchestrator passes a specific question instead of the broad "what could we try." Also stripped domain-specific examples (AIME / Qwen / NuminaMath / GRPO) from ideator + verifier skills -- these are platform-agnostic skills, examples should be generic. Version 0.5.0-alpha.4 -> 0.5.0-alpha.5. Full unit suite green (593 passed, 1 skipped).
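The ranking order in step 5 (has_runnable_code > replication > specificity > recency) maps naturally onto a lexicographic sort key. The field names come from the proposal schema above; their types (booleans, a numeric specificity, recency in months) and the function names are assumptions, and the actual reconciler may weight the signals differently.

```python
def rank_key(proposal):
    """Lexicographic key: runnable code first, then replication, specificity, recency."""
    signals = proposal.get("confidence_signals", {})
    return (
        bool(signals.get("has_runnable_code", False)),
        bool(signals.get("replicated_across_sources", False)),
        signals.get("specificity", 0),
        # Fewer months old ranks higher; a missing value sorts last.
        -signals.get("recency_months", float("inf")),
    )


def rank_proposals(proposals):
    """Return proposals ordered from strongest to weakest evidence."""
    return sorted(proposals, key=rank_key, reverse=True)
```

With a tuple key, a later signal only matters when every earlier one ties, which is the strict priority the step describes.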

Bumped from 0.5.0-alpha.1 -> a2 -> a3 -> a4 -> a5 across consecutive commits, but no alpha has been published. Keep all work-in-progress changes under 0.5.0-alpha.1 until we actually cut a release; future alpha numbers should be reserved for things that go through the release/publish flow, not every commit on the branch.

Two simplifications across the verifier/ideator/subagent skill bodies: 1. Ideator skill: drop the Python pseudocode block that implied the orchestrator constructs a Task() call with a multi-line f-string. Replace with the same prompt-template style the optimize skill already uses for spawning experiment subagents ("Each subagent's prompt MUST start with the literal sentence..."). The orchestrator knows only: skill name + brief arg. The spawned subagent loads the skill itself. 2. Subagent skill: rewrite the verifier-call snippets from `evo:verifier --phase pre --target <exp_id>` (which looks like a shell command but isn't one) to "load the evo verifier skill ... with args `--phase pre --target <exp_id>`" -- matching how the existing subagent body refers to skill invocations. Also flag the known architectural gap on post-phase verifier: evo run auto-commits, so the subagent can't intervene pre-commit. The clean fix (register the verifier as a workspace gate) is still pending; until then, use `evo prune` for committed nodes that fail verification instead of `evo discard`. No code changes. Skill body text only. Version stays at 0.5.0-alpha.1.

… pre-phase Two corrections per user feedback: 1. Descriptions said "Use when the user invokes /evo:verifier" / "Use when the user invokes /evo:ideator", which is wrong -- these are internal-only skills, the user never loads them. Both descriptions now match the existing evo:subagent pattern: "Internal protocol for X. Loaded by Y. Not user-invokable." 2. Subagent flow only calls verifier pre-`evo run`. The post-phase step (formerly subagent step 6) is removed -- evo run auto-commits before the subagent can intervene, so a post-call there can't actually prevent commit. Steps 6/7/7b/8 renumbered (was 7/8/8b/9). The verifier skill body still documents both phases: pre is the primary path called by subagents; post is marked advisory-only, useful for ad-hoc audits via evo prune, until/unless we re-architect it as a workspace gate. No version bump.

Two changes addressing a real failure: a running agent picked subagents=5 on an 8-core latency benchmark based on the optimize skill's inline "~5-8" summary, then realized after the fact that sizing-the-round.md's actual rule for latency-sensitive measurement is width=1 (contention corrupts the metric). 1. optimize/SKILL.md Configuration section: drop the "Short version" bullet list that gave agents enough to act on without reading the reference doc. Replace with a hard read-the-doc instruction + the common ways agents skim past the right answer (latency, worktree "no cap", harness softeners). Require citing the binding-resource case BY NAME from the doc in the opening message -- if the agent can't cite it, it didn't read the doc. 2. sizing-the-round.md table: add an explicit row for "latency / timing / throughput measurement" → width 1, with reasoning about contention BIAS (not just noise). Reframe the CPU-light row as "measurement not timing-based" and add a follow-up paragraph explicitly listing the cases that look CPU-light but are actually contention-corrupted. Affects: future /optimize sessions on feat/model-update will pick width correctly for latency benchmarks. No code/CLI change; docs only.

…rescription Previous commit added a "latency/timing/throughput → width 1" row and made the optimize body say measurement at >1 width "CORRUPTS RESULTS." That's a rule, not a prior -- evo skills should encode guardrails as firm gates and method/hyperparameter choices as overridable priors (feedback_skill_priors_not_rules.md). Some latency benchmarks tolerate modest parallelism just fine; the rule wrongly prescribes width 1 across the board. Reframe: - sizing-the-round.md drops the dedicated latency row from the table. Adds a follow-up paragraph that names the case as a judgment call, lists the four things to weigh (effect size vs. jitter, timed-section share of wall-clock, solo-confirm gate viability, harness contention filtering), and notes width 1 is the safe default for UNKNOWN timing-sensitive benchmarks -- not a reflex. - optimize/SKILL.md body drops "CORRUPTS RESULTS" framing for the latency case. Says the doc has judgment framing; up to the agent to apply it. Net: the doc still tells the agent the right questions to ask. It doesn't pretend to know the answer.

This run's agent decided "since this is a serial GPU workload, I'll drive experiments directly to avoid subagent overhead" -- skipped evo:optimize entirely and ran exp_0001 itself. That loses every piece of the optimize loop's structure (scan-subagent cross-cutting analysis, verifier pre/post hooks, ideator spawning on stall, frontier reconciliation, stop-hook discipline) for no actual benefit -- the agent conflated "subagents=1" with "don't run optimize." Two skill edits to disambiguate: 1. discover/SKILL.md step 13: explicit "do not run experiments outside /evo:optimize" callout. Even for serial workloads, the answer is `subagents=1`, NOT bypassing the loop. 2. optimize/SKILL.md preamble: state the loop's value is the STRUCTURE around each experiment, not parallelism alone. Pass subagents=1 for serial; never skip optimize entirely. No code change. Skill text only. Version stays at 0.5.0-alpha.1.

…examples This run's agent (and the previous run's agent) wrote benchmark wrappers that emit ONE aggregate `log_task("eval_total", score)` call instead of one per AIME problem. The result: dashboard's per-task panel is useless, verifier's reproducibility spot-check has nothing to spot-check, ideator's failure-clustering can't cluster. The diagnostic value of the run is permanently lost from that point. Root cause: the inline_instrumentation reference files (.py and .js) contained only the helper definitions, no usage example. Agents read the API contract but had no demonstration of the per-item loop pattern, so they took the path of least resistance and rolled up themselves. The SDK references showed the loop but didn't strongly nudge against the aggregate-only pattern. Updates: 1. inline_instrumentation.py: top-of-file docstring now says per-task emission is the LOAD-BEARING discipline + cites which evo features depend on it. New USAGE EXAMPLE block at the bottom shows a real per-item loop (per-AIME-problem). New ANTI-PATTERN block shows what NOT to do (aggregate-only). 2. inline_instrumentation.js: same shape for Node -- header callout + USAGE + ANTI-PATTERN blocks. 3. sdk_python.py + sdk_node.js: top-of-file docstring adds the same load-bearing callout. New ANTI-PATTERN block at the bottom showing the aggregate-only failure mode. 4. discover/SKILL.md step 10c: new paragraph after the mode picker stating per-task emission is mandatory when the benchmark has natural sub-units. Cites the USAGE EXAMPLE / ANTI-PATTERN blocks in the reference files and names the three evo features it unblocks (dashboard, verifier, ideator). No code change. Skill text + reference examples. Version stays 0.5.0-alpha.1.
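The per-item pattern can be sketched with a stand-in for the helper. The `log_task(item_id, score=...)` call shape follows the description elsewhere in this PR; the stub body, the `TRACE` list, the item fields, and `run_benchmark` are hypothetical and are not the contents of inline_instrumentation.py.

```python
TRACE = []  # stand-in sink; the real helper writes per-task traces for evo


def log_task(item_id, score):
    """Hypothetical stub mirroring the log_task(item_id, score=...) call shape."""
    TRACE.append({"item_id": item_id, "score": score})


def run_benchmark(items, solve):
    """Score each item, emit one log_task per item, and return the mean score."""
    scores = []
    for item in items:
        score = 1.0 if solve(item["prompt"]) == item["answer"] else 0.0
        log_task(item["id"], score=score)  # per-item emission, not a single roll-up
        scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0
```

The anti-pattern is computing the same mean and emitting only one `log_task("eval_total", score)` call, which leaves nothing per item for the dashboard, verifier, or ideator to inspect.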

"When to do what" listed SFT first with description "install a capability the base lacks: format, tone, chat" -- which reads as "any base that needs format = SFT." The verifiable-reward case was buried in "What actually matters." No diagnostic for pivoting away from SFT when it plateaus on a verifiable benchmark. Reorganize the decision tree to lead with reward shape: 1. Verifiable reward (exact-match integer, unit-test, parser-decidable) -> GRPO/RLOO/PPO. Names AIME, GSM8K, HumanEval as the textbook cases. Calls out that the reward shapes correctness AND output-format simultaneously, so trained models pass the verifier instead of reasoning well but emitting unparseable answers. 2. Preference pairs -> DPO/KTO/ORPO. 3. Demonstrations only -> SFT. 4. Want SFT stability with reward signal -> RFT (rejection-sampling). Add diagnostic: 2+ committed SFT experiments at 0.0 on a verifiable benchmark indicates technique-class mismatch, not recipe tuning. Pivot to verifier-as-reward RL. Promote "RL-from-base works for strong base models" from the "Not laws" footnote into actionable main-section guidance.

…erns Previous SKILL.md was ~1200 words of mixed-density prose. The frontmatter description ("Load when planning or diagnosing a train move") was passive and didn't trigger reliably; agents that did load the body skimmed to the first concrete-looking section and missed diagnostics further down. Rewrite (~600 words) structured for skim: - Description now names every covered technique (SFT/LoRA/DPO/KTO/ORPO/ RFT/GRPO/PPO/RLOO/RLHF) and the trigger phrases, so the host loads it on any post-training mention. - Reward-shape technique selection moved from prose to a four-row table. - Smoke-run section added as a precondition: ~10 examples, ~1 minute, must produce a loadable checkpoint AND a non-zero eval before scaling. Catches dtype mismatches, tokenizer/template drift, OOM at this batch size, and "loss fell but artifacts dir is empty" in minutes. - Three diagnostics grouped together: stuck-at-0 on a verifiable benchmark after 2+ SFT runs, base below random on a knowledge-heavy benchmark, delta <= 0 across several committed train moves. - New "What never counts as progress" section lists five patterns that produce a score number without model improvement: training on the held-out set (direct + transitive), embedding eval items in "synthetic" data, generating training data conditioned on per-eval- item failure logs, submitting a checkpoint that wasn't trained, and training a different objective than the verifier scores. - "Surviving session compaction" -- methodlog.md guidance for decisions that fall out of context on long sessions. Cuts: standalone "Not laws" section (the load-bearing item -- "SFT before RL is not a law" -- folded into the technique table's caption); long "Reading a run" paragraph (folded into the references list). Adds references/false-progress.md with examples and detection per pattern, including transitive contamination (public instruction-tuning datasets that contain eval-derived items) and AST-normalized matching for paraphrased code items.

The node drawer had three tabs (Summary / Diff / Tasks) and surfaced benchmark results, but nothing about an experiment's training-side state: training script stdout was reachable only via filesystem, and external metric trackers (trackio HF Spaces) showed up nowhere.

Adds two surfaces:

1. Logs tab (new). Lists *.log and *.out files in the latest attempt dir (and one level under logs/). File picker, auto-refresh checkbox (2s poll), manual refresh. Tail is byte-offset based: first load pulls the last 500 lines; subsequent polls pull only bytes past the prior X-Log-Size, so the panel appends without re-rendering.
2. Trackio link + inline sparklines in the Summary tab. After the Artifacts meta row, if the experiment's traces dir contains a .trackio_url marker (written by the training callback that pushes metrics to a HuggingFace Space), the drawer renders:
   - "Metrics" row with a clickable link to the Space
   - up to 3 series (loss / lr / reward / kl / grad_norm preferred) as 140x24 SVG polylines with last-value labels

   Scalars come from the trackio HF Dataset companion (<space>-dataset), read via huggingface_hub + pandas; if any of those imports fail, sparklines are silently omitted and the link still renders.

Backend changes (dashboard.py):
- GET /api/node/<exp>/log/<file>?tail=N&offset=M -- existing endpoint gains tail-by-line-count and append-by-byte-offset modes. Sets X-Log-Size header so clients can resume cleanly.
- GET /api/node/<exp>/logs -- lists candidate log files in the latest attempt with size + mtime.
- GET /api/node/<exp>/trackio -- reads .trackio_url marker, optionally scrapes the last 60 rows of the run's parquet for sparkline data.

Frontend changes (app.js):
- 'logs' added to drawer tab whitelist; setSidebarTab tears down the log poll on navigation; closeSidebar tears it down on drawer close.
- state gains logsSelected / logsOffset / logsAutoRefresh.
- loadTrackioPreview fires after Summary renders, so the panel paints immediately and the metrics row appears progressively.
- renderSparklines does inline SVG polylines, no Plotly dependency.

Backend optional deps (huggingface_hub, pandas, pyarrow) are imported lazily and any ImportError degrades the response to a link-only payload.
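The two tail modes described above can be sketched as follows. This is a minimal illustration, not the code in dashboard.py: the function name `read_log_chunk` and its return shape are assumptions, and the returned offset stands in for the value the endpoint exposes via X-Log-Size.

```python
import os


def read_log_chunk(path, offset=None, tail=500):
    """Hypothetical helper: return (text, next_offset) for a log file.

    offset=None -> first load: the last `tail` lines of the file.
    offset=int  -> append mode: only the bytes written past `offset`.
    """
    with open(path, "rb") as fh:
        if offset is None:
            data = fh.read()
            next_offset = len(data)
            chunk = b"".join(data.splitlines(keepends=True)[-tail:])
        else:
            fh.seek(0, os.SEEK_END)
            if offset > fh.tell():
                # File shrank (truncated or rotated): restart from the top.
                offset = 0
            fh.seek(offset)
            chunk = fh.read()
            next_offset = offset + len(chunk)
    return chunk.decode("utf-8", errors="replace"), next_offset
```

A client would keep the returned offset and pass it back on the next poll, which is what lets the panel append new text instead of re-rendering the whole file.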

Pre-release versions (0.5.0-alpha.1, 0.6.0-beta.2, etc.) frequently don't have a corresponding GitHub Release tagged yet -- the release build only fires on stable bumps. When evo CLI 0.5.0-alpha.1 calls ensure_hook_drain_binary, it tries https://github.com/evo-hq/evo/releases/download/v0.5.0-alpha.1/evo-hook-drain-linux-amd64 which 404s. Mid-run inject (evo direct) silently breaks until the binary gets staged manually. Add a second URL to the fetch loop: https://github.com/evo-hq/evo/releases/latest/download/evo-hook-drain-linux-amd64 GitHub redirects /releases/latest/download/<asset> to the asset on the most recent stable release. The binary's wire protocol is stable across minor versions, so a slightly older binary still works for the current alpha CLI -- same rationale we use in the Modal image setup (modal_app.py fetches /releases/latest/ at build time). On success via the fallback, prints a NOTE clarifying which path was used so the user understands they're running a slightly older binary against a newer CLI -- still wire-compatible, just not identical.
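A minimal sketch of the two-URL fetch order, assuming a hypothetical `fetch` callable that returns the payload or `None` on a 404. The helper names are illustrative and are not the ones used by ensure_hook_drain_binary; only the two URL shapes come from the text above.

```python
def hook_drain_urls(version, asset="evo-hook-drain-linux-amd64", repo="evo-hq/evo"):
    """Candidate download URLs, most specific first."""
    base = f"https://github.com/{repo}/releases"
    return [
        f"{base}/download/v{version}/{asset}",  # exact tag; may 404 for pre-releases
        f"{base}/latest/download/{asset}",      # redirects to the latest stable release
    ]


def fetch_first(urls, fetch):
    """Try each URL in order; return (url, payload, used_fallback).

    `fetch` is a hypothetical callable: url -> bytes, or None when unavailable.
    """
    for index, url in enumerate(urls):
        payload = fetch(url)
        if payload is not None:
            return url, payload, index > 0
    return None, None, False
```

The `used_fallback` flag is what would drive the NOTE about running a slightly older binary against a newer CLI.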

…n it

Add the first plugin-shipped subagent to evo: benchmark-reviewer. Lives at plugins/evo/agents/benchmark-reviewer.md, invokable via Task(subagent_type="evo:benchmark-reviewer", ...). Read-only audit of a freshly-constructed benchmark harness before its first `evo run`.

Checklist the subagent enforces:

1. Per-task instrumentation. Aggregate-only emission ({"score": X} without per-item traces) is the canonical bug -- breaks the dashboard's per-task panel, the verifier's reproducibility check, and the ideator's failure clustering. Common shape: a wrapper script around a runner library (inspect_evals, evals, lm-eval-harness) writes only the runner's aggregate, never converting per-sample data into log_task calls. Block-severity.
2. Eval-set / held-out leakage in training data sources, including transitive sources (public instruction-tuning datasets that contain eval-derived items).
3. Goodhart gates: constructed benchmarks must have a real pass/fail gate; bare benchmark-rerun gates are decorative.
4. Basic plumbing: result.json gets written, traces hit $EVO_TRACES_DIR, errors crash rather than write zero scores.
5. Determinism note (not a block).

Output is structured JSON {passed, findings[]} for programmatic gating.

Wire into evo:discover at a new step 10d (between instrumentation + the cheap validation run). The orchestrator MUST invoke the reviewer before `evo run` and address every block-severity finding before proceeding. Renumber the trailing subsections (10e cheap-validation, 10f commit) accordingly.

This is the first plugin agent; plugins/evo/agents/ is a new directory. Claude Code's plugin loader picks up agents/<name>.md alongside skills/<name>/SKILL.md without changes to .claude-plugin/plugin.json.

The hook registers a session into .evo/run_*/inject/sessions/ on SessionStart (eager) and on UserPromptSubmit (lazy recovery when SessionStart fired before .evo existed, e.g. the agent runs `evo init` mid-session). `claude --print` (single-turn batch mode) never fires UserPromptSubmit, so the UserPromptSubmit recovery path misses any batch session whose .evo/ is created after SessionStart. `evo direct <text>` then reports fanout=0 and directives queue undeliverable. Add a PostToolUse branch alongside the existing UserPromptSubmit branch. PostToolUse fires on every tool call, so the first tool the agent runs after `evo init` registers the session. engage=false: PostToolUse is a tool callback, not an orchestrator engagement signal. Fixes #49

…scores Adds a fourth diagnostic for the case where 3+ committed experiments across structurally distinct techniques (SFT, GRPO, RFT) all land at the same non-zero score. Names the train-verifier objective mismatch as the most common cause and prescribes a spot-check of training vs eval examples before trying a fourth technique variant.

Unbounded `while true; do sleep N; tail file; done` loops block indefinitely when the underlying process crashes — the tail keeps reading a dead file and absence of log growth reads as "still working." Adds a bounded poll pattern to discover and optimize skills that checks three independent signals (process alive, log growth delta, GPU activity) per iteration with a hard loop-bound timeout. Forward-compatible note points at the planned `evo wait` CLI that will replace the loop.
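A bounded poll along these lines might look like the sketch below. The three probes are passed in as hypothetical callables, and both the way they are combined (a flat log plus an idle GPU counts as a stall) and the "stalled" / "timeout" labels are assumptions, not the skill's exact wording.

```python
import time


def bounded_wait(process_alive, log_size, gpu_active, timeout_s,
                 interval_s=30, sleep=time.sleep, clock=time.monotonic):
    """Poll three independent liveness signals under a hard deadline.

    process_alive / log_size / gpu_active are hypothetical probes supplied
    by the caller. Returns a short reason string instead of looping forever.
    """
    deadline = clock() + timeout_s
    last_size = log_size()
    while clock() < deadline:
        sleep(interval_s)
        if not process_alive():
            return "process-exited"
        size = log_size()
        grew = size > last_size
        last_size = size
        if not grew and not gpu_active():
            # Neither the log nor the GPU shows progress: treat as a stall.
            return "stalled"
    return "timeout"
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting. The key property is that every exit path is reachable even when the watched process has died.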

When the benchmark wraps a runner library (inspect_evals, lm-eval-harness, evals), the per-item loop is hidden inside the runner and the natural wrapper shape collapses to a single aggregate `{"score": X}`. The runner typically already writes per-sample data; the wrapper just isn't forwarding it. Without per-item forwarding, the dashboard's Tasks tab is empty, the verifier can't spot-check, and there's no input for RL-on-failures or curriculum strategies. Reinforces the existing per-task emission rule with the specific runner-wrapper case and points at the reference files with worked examples.

For any non-root experiment, the training script must warm-start from the parent's checkpoint (exposed via EVO_PARENT_POLICY) rather than re-training from base. Re-training from base each generation burns budget on duplicated work and prevents the experiment tree from accumulating capability across generations. Adds the concrete pattern (branch on os.path.exists to handle root, where the value is a base model id) to the skill and expands references/glue.md with the same contract.
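The branch on `os.path.exists` can be sketched like this. Only `EVO_PARENT_POLICY` and the exists-check come from the text above; the function name and the returned dict shape are illustrative assumptions.

```python
import os


def resolve_init_weights(env=None):
    """Decide where a training script should load its starting weights from.

    For a non-root experiment EVO_PARENT_POLICY is a checkpoint path on disk;
    for the root it is a base model id, which does not exist as a path.
    """
    env = os.environ if env is None else env
    parent = env["EVO_PARENT_POLICY"]
    if os.path.exists(parent):
        return {"source": "parent_checkpoint", "value": parent}
    return {"source": "base_model_id", "value": parent}
```

Either way the training script loads from `value`, so the same loading call serves both the root and its descendants.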

…oad it Same fix as 680336e for skills/subagent. With disable-model-invocation set, the Skill tool cannot load infra-setup, but the discover skill's STEP 0 "internalize every skill" instruction requires it, and discover/optimize both reference infra-setup's provider-matrix.md for remote backend setup. Removing the field lets the skill be invoked normally; the "Non-user-invocable" framing in the description still signals intent.

`attempts/<n>/diff.patch` was scoped to `relative_target(config)` -- the single configured target path. Any source edit outside that path (training configs, prompt files, helper scripts) dropped from the patch, so the dashboard rendered the file as empty even when the experiment branch had real changes. `capture_experiment_diff` runs `git diff <parent_ref>` against the worktree with no path scope, covering both the pre-commit workflow (agent commits to the experiment branch then invokes `evo run`) and the dirty-worktree workflow (agent leaves edits for `maybe_commit_worktree`). Binary / large artifacts (safetensors, optimizer state, tokenizer blobs, checkpoint dirs) are excluded via pathspec to keep diff.patch dashboard-renderable. Override per workspace via `config["diff_exclude_patterns"]`.
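As an illustration of the pathspec exclusion, the sketch below builds the `git diff` argument list. The function name and the example patterns are assumptions; the real defaults and the `capture_experiment_diff` signature are not shown in this message. `:(exclude)` is git's standard pathspec magic.

```python
def diff_command(parent_ref, exclude_patterns):
    """Build a git-diff argv that covers the whole worktree minus excluded paths.

    Intended to be run from the worktree root, so "." means "everything".
    """
    cmd = ["git", "diff", parent_ref, "--", "."]
    cmd += [f":(exclude){pattern}" for pattern in exclude_patterns]
    return cmd


# Hypothetical patterns standing in for the binary / large-artifact excludes.
EXAMPLE_EXCLUDES = ["*.safetensors", "*.bin", "checkpoint-*"]
```

Passing the list to `subprocess.run(cmd, cwd=worktree, capture_output=True)` would yield the patch text. Keeping the argv construction separate makes the exclusion list easy to override per workspace.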

…mands

The polling-discipline skill text added in 34d11e8 documents the bounded-poll pattern but only fires if the agent reads the skill section. Agents that skip straight to tail-loop polling never get the nudge.

Adds plugins/evo/hooks/wait_hint.sh, a small bash script wired as a second PostToolUse entry in hooks.json (alongside the existing evo-hook-drain handler). On every Bash tool call it greps the command string against patterns that indicate a long-running workload:

- python.*train, accelerate launch, vllm serve, trl vllm-serve
- python eval / evaluate.py / run_eval.py
- nohup ... &
- sleep <N> ... tail and while true ... sleep (the anti-pattern)

On match it prints a one-line [evo-hint] to stdout (which Claude Code routes into the next-turn context as a non-blocking system reminder) pointing at `evo wait --for process=<pid> --for log-growth=<path> --for gpu-active --timeout 60m --json`.

Dedup is a touch-file under $TMPDIR keyed by session_id, so the hint fires once per session even if the agent restarts the same command.

Non-matching commands, non-Bash tool calls, and empty payloads fast-exit 0. Runtime is ~35ms.

Pairs with the polling-discipline section in plugins/evo/skills/{discover,optimize}/SKILL.md and the planned `evo wait` extensions.

The verifier audits one experiment for design-time cheating (pre-phase) or result-time validity (post-phase). It is read-only, returns a single JSON artifact, and the orchestrator gates compute/commit on its verdict -- a subagent fits that shape better than a skill that gets loaded into the orchestrator's context. plugins/evo/agents/verifier.md is the new subagent. Same audit checks as the old skill (test-set leakage, benchmark subsetting, gate coverage, hypothesis specificity, resource conflicts for pre; duration sanity, artifact reality, score reproducibility, gate compliance for post) plus an explicit pass over the five false-progress patterns (test-set ingestion, eval items in synthetic data, reverse-engineered verifier, instruct-model substitution, training-objective mismatch). Output shape mirrors benchmark-reviewer: `{passed, findings: [{category, severity, what, where, fix}]}` with severity `block` flipping `passed` to false; the verdict is also persisted as an `evo annotation` on the target experiment so the dashboard and later runs see it. plugins/evo/skills/verifier/SKILL.md is now a stub that points callers to the subagent invocation. Kept rather than deleted so any callsite that resolves `evo:verifier` through the skill loader gets a clear pointer instead of a not-found. plugins/evo/skills/subagent/SKILL.md step 4 now spawns `Task(subagent_type="evo:verifier", ...)` instead of loading the verifier skill. No other callers reference the verifier skill directly. Refs #50.

The ideator generates ranked experiment proposals via cross-graph analysis and web research (arXiv, HF Hub, GitHub, blogs). The research work is high-token and not directly actionable -- pulling it out of the orchestrator's context into a subagent isolates research notes from planning state, and the orchestrator only sees the structured proposal output. plugins/evo/agents/ideator.md is the new subagent. One brief per invocation (`failure_analysis`, `literature`, `frontier_extrapolation`) matching the old skill's parallel-spawn model -- the orchestrator fires three Task calls in parallel, each subagent runs one brief in its own context, all three append to the shared `.evo/run_<id>/ideator/proposals.jsonl`. Tools include `WebFetch` + `WebSearch` for the literature brief; the other two briefs are pure local analysis. Proposal schema is the same as the old skill (renamed `mechanism` -> `rationale`, `sources` -> `references_consulted`, added `title` and `technique` for orchestrator-side ranking). plugins/evo/skills/ideator/SKILL.md is now a stub that points callers to the subagent invocation. plugins/evo/skills/optimize/SKILL.md step 6b now spawns three `Task(subagent_type="evo:ideator", ...)` calls instead of telling each subagent to load the ideator skill. `evo wait --for ideators` still works as the synchronization primitive; the file-based reconciliation contract is unchanged. Refs #50.

The evo wait extension (#52) broke 8 unit tests in three clusters, which kept the v0.5.0-alpha.1 and v0.5.0-alpha.2 publish workflows failing at preflight. Fix all three:

1. test_timeout_capped_at_3600 expected the old 1-hour cap. The cap was raised to 24h alongside the extension so long external waits (a 10-hour training run) are expressible. Rename to test_timeout_capped_at_24h, expect 86400, also assert _WAIT_TIMEOUT_CAP == 24 * 3600 directly so a future cap change trips the test deliberately.
2. test_count_without_for_is_rejected asserted the substring "--count > 1 requires --for". The new error is "--count > 1 requires exactly one --for experiments|ideators (otherwise ambiguous which kind to count)". Update the substring to "--count > 1 requires exactly one --for" (present in the new message; the rest of the wording is informational).
3. TestEvoWaitForIdeators._run_wait passed wait_for="ideators" (a string). --for is now action="append" in argparse, so the namespace value is a list. The new parser iterates the string character-by-character, yielding 'i', 'd', 'e', ... and failing with "unknown form". Pass a list. Comment in the helper explains why.

Plus one assertion-text update in test_timeout_when_no_proposals_arrive: the timeout summary changed from "no new ideator proposal" to "no ideators activity"; loosened assertIn to "ideators" which matches both old and new wording.

All 18 tests pass locally (pytest tests/unit/test_evo_wait.py).
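The third cluster is a general argparse pitfall, reproduced below in isolation. The parser is a stand-in with a hypothetical `prog` name; only `--for` with `action="append"` and the `wait_for` destination mirror the description above.

```python
import argparse

# Stand-in parser: --for may be repeated, so argparse collects a list.
parser = argparse.ArgumentParser(prog="wait-demo")
parser.add_argument("--for", dest="wait_for", action="append")

ns = parser.parse_args(["--for", "ideators", "--for", "gpu-active"])
parsed_conditions = list(ns.wait_for)   # one entry per --for flag

# A test helper that builds the namespace by hand and passes a bare string
# gets iterated character by character by any code that loops over it.
wrong = argparse.Namespace(wait_for="ideators")
split_into_chars = list(wrong.wait_for)

right = argparse.Namespace(wait_for=["ideators"])
kept_whole = list(right.wait_for)
```

Here `split_into_chars` starts with `'i', 'd', 'e'`, which is exactly the shape that produced the "unknown form" failure, while `kept_whole` holds the single condition.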

…workspace check

Fixes #53.

`cmd_wait` called `repo_root()` and the `.evo/` existence check at the top of the function, before any `--for` parsing. That bailed any invocation outside an evo workspace, even when the user only requested external-state watches (`--for process=<pid>`, `--for log-growth=<path>`, `--for gpu-active`, `--for gpu-idle`) -- none of which read or write anything under `.evo/`.

Refactor:

1. Parse `--for` first.
2. Determine `needs_workspace` from the parsed conditions: true iff any condition is `experiments` or `ideators` (the workspace-anchored targets). The legacy default (no `--for`) implies both, so it still needs a workspace.
3. Only call `repo_root()` + `.evo/` existence check when `needs_workspace` is true. Otherwise `root` stays `None`.
4. The `run_dir` lookup later is now guarded on `root is not None`.
5. The workspace error message gains a clarifying suffix: `(required for --for experiments|ideators)`.

Regression test: `test_non_workspace_watches_work_outside_workspace`. Uses `--for log-growth` rather than `--for process=<pid>` to avoid the zombie/reaping subtlety where a `subprocess.Popen` child of the test process stays kill(pid,0)-alive until reaped, which would make a process-watch test flaky.

Verified locally:

- `evo wait --for process=$PID --timeout 10s --json` in a non-git tmpdir returns clean JSON with `exit_reason: process-exited`.
- `evo wait --timeout 1s` (legacy default) in the same tmpdir still errors with `ERROR: not in an evo workspace (required for --for experiments|ideators)`.
- All 19 unit tests in `tests/unit/test_evo_wait.py` pass.
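The `needs_workspace` rule can be sketched as a small pure function. The constant name and the way a condition string is split on `=` are assumptions about the parser; the rule itself follows the refactor described above.

```python
# The two workspace-anchored targets named in the refactor.
WORKSPACE_TARGETS = {"experiments", "ideators"}


def needs_workspace(conditions):
    """True when any --for condition reads state under .evo/.

    `conditions` is the list collected from repeated --for flags, e.g.
    ["process=1234", "gpu-active"]. An empty or missing list is the legacy
    default, which implies experiments + ideators.
    """
    if not conditions:
        return True
    return any(c.split("=", 1)[0] in WORKSPACE_TARGETS for c in conditions)
```

Only when this returns true would the caller go on to resolve the repo root and check for `.evo/`.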

New section after parent-warm-start covering reusable artifacts -- LoRA adapters, curated/tokenized datasets, embeddings, retrieval indexes, precomputed eval generations. Convention is .evo/cache/, sibling to run_<NNNN>/. Already gitignored via the workspace's .evo/ exclude. Survives evo new / evo run / evo reset because none of those touch siblings of run_<id>/. Pattern: workspace-root lookup, key embeds every recipe input that changes the artifact, read-or-compute. Calls out the worktree-local anti-pattern (artifact disappears on gc) and acknowledges the HF Hub cache to avoid duplication. Defers the named-asset registry to #55.
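A minimal read-or-compute sketch under these conventions. The function names, the JSON payload, and the key length are illustrative assumptions; the `.evo/cache/` location and the "key embeds every recipe input" rule come from the section described above.

```python
import hashlib
import json
from pathlib import Path


def cache_key(recipe):
    """Stable key over every recipe input that changes the artifact."""
    blob = json.dumps(recipe, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]


def read_or_compute(workspace_root, name, recipe, compute):
    """Return the cached artifact if present, else compute and store it.

    Anchored at the workspace root, not the experiment worktree, so the
    artifact outlives worktree garbage collection.
    """
    path = Path(workspace_root) / ".evo" / "cache" / f"{name}-{cache_key(recipe)}.json"
    if path.exists():
        return json.loads(path.read_text())
    value = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(value))
    return value
```

Real artifacts such as adapters or tokenized datasets would be directories rather than JSON, but the lookup order is the same: derive the key, check the path, compute only on a miss.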

Domain-agnostic counterpart to the finetuning skill's caching section. Subagent skill applies to every experiment-running agent regardless of domain, so this is where the general "check before recomputing" prior belongs. Lists the artifact shape (slow to produce, stable across siblings/descendants) and the cache path (.evo/cache/, gitignored, untouched by new/run/reset). Calls out the worktree-local anti-pattern. Defers the concrete read-or-compute pattern to the finetuning skill rather than duplicating it.

Carries: - docs(skill/finetuning): cache expensive intermediates under .evo/cache/ - docs(skill/subagent): domain-agnostic counterpart pointing at finetuning

…ator-readable ideator and verifier migrated from skills to subagents in d69f35d / 0127244 (plugins/evo/agents/{ideator,verifier}.md). The skill paths were kept as 6-line stub redirects so any caller resolving Skill("evo:ideator") got a clear pointer rather than a not-found error. In practice the stubs are net-negative: an LLM reading the stub still has to manually translate to Task(subagent_type="evo:ideator", ...). Without the stub, Skill("evo:ideator") returns "not found" and the optimize-skill roster (and any subagent-tool listing) can be the canonical source naming them as subagents. evo:subagent: drop the "Not for orchestrator use" gate. The description now positions the skill as legitimately invokable by the orchestrator to understand the brief shape its dispatched subagents expect. Two-line preface in the body tells an orchestrator-as-reader to stop at "Host conventions" -- the rest of the body is subagent-as-reader. The brief-writing logic in /optimize gets a place to learn the subagent's input/output contract without having to grep the source.

…ng references The agent has had no map of evo's invocable surface, only individual skill descriptions. Result, observed in a recent run: agent stuck in SFT for 5 experiments without ever invoking evo:finetuning, never read sizing-the-round.md, hand-wrote a stripped-down log_task/write_result that produced empty Tasks panels in the dashboard. Each entry-point skill now opens with an "Evo surface" tree sized to its reader: - discover/SKILL.md: full tree -- main thread (orchestrator) + subagent thread + complete references catalogue. First contact with the surface. - optimize/SKILL.md: loop-relevant subset -- what gets pulled/dispatched during the loop. Cross-references discover for the full tree. - subagent/SKILL.md: subagent's slice -- skills it pulls (finetuning), subagents it dispatches (verifier, mandatory pre/post), references it reads. Cross-references discover for the full tree. - finetuning/SKILL.md: replaces the flat ## References list with a categorized tree (core contracts / rl/ / sft/ / serving/) + a brief cross-skill section. Each line in every tree is a triggering condition ("if you're about to do X, pull this"), not a description of what each piece does. The skill bodies themselves carry the content; the tree is the index. Also moved sizing-the-round.md from optimize/references/ to discover/references/ (decision happens at the discover->optimize handoff, which is during discover; reading it after /evo:optimize is invoked is too late -- subagents=N is already chosen). Updated the path reference in optimize/SKILL.md body accordingly.

Carries: - skills: drop ideator/verifier stub redirects; make evo:subagent orchestrator-readable - skills: add Evo surface tree to discover/optimize/subagent + finetuning references (categorized by rl/sft/serving) - sizing-the-round.md: moved from optimize/ to discover/references/ (decision happens at the discover->optimize handoff)

…, discard-time diff + dashboard discard_reason (#57)

#49 -- hook-drain: engage on evo invocations seen via PostToolUse

Previously: SessionStart engaged the orchestrator, but only when .evo/ already existed. For claude --print batch mode that created the workspace mid-session via `evo init`, SessionStart's engage=true call no-op'd (no workspace yet); first PostToolUse registered with engage=false; later `evo` invocations never had a chance to upgrade engagement. Net: directives from `evo direct` got skipped_unengaged=1 for the entire session lifetime -- mid-run injection non-functional.

Now: PostToolUse detects when its Bash invocation is an `evo ...` command and (a) uses engage=true on first-time registration, or (b) upgrades has_evo_engaged: false -> true if the session was registered earlier via a non-evo PostToolUse. Subagent context (agent_id present, EVO_EXP_ID set) still suppresses engagement so spawned subagents don't engage the workspace loop on their own.

Helpers: is_evo_invocation() substring-matches "evo " in the tool command (handles shell-snapshot wrappers); is_session_engaged() reads the session record and checks has_evo_engaged.

#56 -- evo run --check + evo run: assert tasks-in-result when N>1 traces

Catches rolled-own log_task/write_result that emit per-task traces to traces/ but omit the `tasks` array from result.json. Dashboard's per-task panel reads outcome.benchmark.result.tasks for committed experiments (no fallback to the traces dir), so these benchmarks silently render "No benchmark task results recorded" even with 30 traces on disk.

Now: both _cmd_run_check and _cmd_run_impl, after load_result(), count traces/task_*.json files; if N>1 and parsed.tasks is empty, raise with a pointer at the canonical inline_instrumentation.py reference. Forces the agent to paste the file (which has the _SCORES accumulator + tasks aggregation built in) instead of hand-rolling a stripped-down version.

#57 -- discard observability: capture diff + render discard_reason

Two halves:

(a) Backend: cmd_discard now calls capture_experiment_diff against the experiment branch vs its parent's commit BEFORE delete_discarded_experiment removes the worktree+branch. Writes to <experiment_dir>/diff.patch (sibling to result.json, NOT under attempts/ since the discard may happen with current_attempt=0). Best-effort: log+continue on any failure (missing parent commit, orphaned worktree, etc.) so the discard itself never blocks.

(b) Frontend: app.js drawer SUMMARY tab now renders node.discard_reason for discarded experiments (parallel to the existing pruned_reason rendering). Backend already returned it in the node API; frontend just had no code reading it.

Together: discarded experiments now leave a visible trace of (i) the reason in the dashboard summary and (ii) the actual code changes via diff.patch the dashboard's Diff tab renders.

Closes #49 (engagement-half; registration-half already landed). Closes #56. Closes #57.
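The #56 check can be sketched as below. This is an illustration of the rule (more than one `task_*.json` trace plus an empty `tasks` means the result is rejected), not the code in cli.py: the function name, the exception type, and treating `parsed` as a dict are assumptions.

```python
from pathlib import Path


def check_tasks_aggregated(traces_dir, parsed):
    """Raise if per-task traces exist on disk but result.json has no tasks.

    `parsed` is assumed to be the loaded result.json as a dict, or None.
    Returns the number of task traces found.
    """
    traces_dir = Path(traces_dir)
    count = len(list(traces_dir.glob("task_*.json"))) if traces_dir.is_dir() else 0
    tasks = (parsed or {}).get("tasks")
    if count > 1 and not tasks:
        raise ValueError(
            f"{count} per-task traces found but result.json has no 'tasks'; "
            "see the inline_instrumentation.py reference"
        )
    return count
```

The glob deliberately ignores files such as `summary.json` or `taskoutcome_1.json`, so only real per-task traces count toward N.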

#49 (Rust hook-drain) -- 18 new unit tests in main.rs:

- contains_evo_word: 8 tests covering direct, chained, shell-snapshot wrappers, tab separators, start-of-string, AND adversarial cases (servo init, levo build, evolution.py, evolved.sh, sevo --help, cargo install evo-hq-cli, python evo.py, vim evo_helper.py, lone `evo` with no trailing whitespace, empty/short input).
- is_evo_invocation: 5 tests covering JSON payload parsing (Bash tool name, command field present/absent), rejection of non-Bash tools (Read, Write), rejection of evo-as-substring in commands, rejection of evo paths in file-path arguments.
- is_session_engaged: 4 tests covering true/false/missing-file + a documented limitation case (whitespace-around-colon JSON variants).

Also tightened contains_evo_word from naive substring match to word-boundary check (start-of-string OR non-word char before, space OR tab after). The prior `command.contains("evo ")` had real false positives: "servo init", "levo build", any path or argument ending in "...evo" followed by a space. Pure-stdlib byte scan, no regex dep.

#56 -- 12 new unit tests in test_check_tasks_assertion.py:

- Positive: N traces + tasks array present, zero traces, exactly one trace, missing traces dir, parsed=None with no traces, dict populated with at least one task.
- Negative (assertion fires): 2 traces + no tasks, 30 traces + no tasks (PostTrainBench shape), tasks=empty dict, tasks=empty list, parsed=None with N>1 traces.
- Edge: non-task files in traces dir (.DS_Store, README.md, summary.json) don't inflate the count; `taskoutcome_*.json` doesn't match the glob.

Refactor: extracted `_assert_tasks_aggregated(traces_dir, parsed)` helper. Both `_cmd_run_check` and `_cmd_run_impl` call into it instead of inlining the same conditional. Single source of truth + unit-testable.

#57 -- 7 new unit tests in test_discard_diff_capture.py:

- Positive: diff lands for branch with commits (file additions), captures modifications-not-just-additions (verifies +/- lines), diff path is the canonical experiment_dir/diff.patch.
- Skip cases (returns None, no raise): worktree missing, parent_ref missing on graph entry, node has no parent_id.
- Failure isolation: unexpected git failure (broken parent SHA) swallowed; discard never blocks.

Refactor: extracted `_capture_discard_time_diff(root, exp_id, node, graph)` helper. cmd_discard calls into it instead of inlining the try/except wrapped capture. Same testability benefit.

Test results:

- 18/18 Rust tests pass
- 20/20 new Python tests pass (#56 + #57)
- 646/646 existing Python unit tests still pass (1 known-flaky test `test_gate_check_runs_gates_without_changing_node_status` passes in isolation; intermittent failure in bulk sweep is pre-existing and not caused by these changes)
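The word-boundary rule for #49 (start-of-string or a non-word byte before, space or tab after) can be illustrated with a Python transliteration. The real helper is Rust in main.rs; this sketch only mirrors the rule as described, and treating "word byte" as ASCII alphanumeric or underscore is an assumption.

```python
def _is_word_byte(byte):
    # Assumption: a "word" byte is ASCII alphanumeric or underscore.
    return bytes([byte]).isalnum() or byte == ord("_")


def contains_evo_word(command):
    """True if `evo` appears as its own word followed by a space or tab."""
    data = command.encode("utf-8")
    start = data.find(b"evo")
    while start != -1:
        before_ok = start == 0 or not _is_word_byte(data[start - 1])
        after = data[start + 3:start + 4]   # empty when `evo` ends the string
        if before_ok and after in (b" ", b"\t"):
            return True
        start = data.find(b"evo", start + 1)
    return False
```

Scanning every occurrence matters for chained commands such as `cd repo && evo run`, where an earlier non-matching `evo` substring must not end the search.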

Carries: - skills: Evo surface trees in discover/optimize/subagent + categorized references tree (rl/sft/serving) in finetuning - skills: drop ideator/verifier stub redirects; make evo:subagent orchestrator-readable - hook-drain (#49 follow-up): word-boundary `evo` detection in PostToolUse + engagement upgrade so mid-run `evo direct` actually fans out to claude --print batch sessions whose .evo/ was created via `evo init` mid-session - evo run --check + evo run (#56): assert tasks-in-result when N>1 per-task traces exist (catches rolled-own write_result) - evo discard (#57): capture parent..exp diff into diff.patch before delete_discarded_experiment wipes the worktree+branch; dashboard drawer renders discard_reason for discarded experiments - testing: 38 new unit tests (18 Rust + 20 Python); helpers refactored out of cli.py for testability

…ences principle Replace the verbose "Skills before references, always" lecture in discover with one principle line near the top of the Evo surface section: "Always have a sense of the skill before jumping into its references." Keep the tree itself unchanged (skills + references at their normal hierarchy); the principle handles the ordering. Why: agent in last run read finetuning/references/glue.md directly (after seeing it in the tree) and used the I/O contract as license to write SFT+LoRA without ever invoking evo:finetuning. The reward-shape decision tree (only in the skill body) got skipped, agent defaulted to SFT, 5 regressions followed. The reference looked authoritative on its own because the tree placed it alongside skills with no ordering hint. glue.md: short pointer at the top noting the technique pick lives in the parent skill body, not in this file. One sentence; doesn't moralize.

…tion Previous description -- "Run the evo optimization loop with parallel subagents until interrupted" -- read like ceremony if width is 1. Agent in a recent run reasoned itself out of invoking the skill: "instead of going through the full evo:optimize ceremony (which adds subagent overhead), I'll directly drive experiments using evo for tracking." Never loaded the skill body, never saw the "applies to serial workloads too" guidance inside it. Result: ad-hoc training, no frontier reasoning, no scan-subagent cross-cutting analysis, no annotation discipline -- the whole loop bypassed. New description names the structural value up front: cross-cutting analysis, frontier-based parent selection, ideator dispatch, verifier pre/post hooks, annotation discipline. Positions optimize as the natural successor to evo:discover + baseline commit. Frames width as a configurable knob (subagents=N for any N) rather than parallelism as the value prop. The description is the only thing the agent sees in the slash-command listing before deciding whether to invoke. Loading it via Skill is gated behind that decision; once the agent decides bypass is fine, the body's "don't bypass" warnings are unreachable.

Existing test_hook_drain.py uses a fake evo-drain shim that prints {} -- verifies the Rust binary handoff trigger, not what drain.py actually does with queued events. Real drain code path had zero coverage.

6 new tests in tests/unit/test_directive_delivery_e2e.py exercising the full pipeline: cmd_direct (queue) -> real Rust hook-drain binary (handoff) -> real Python drain_session (delivery):

Queue side (validates alpha.7 #49 wiring):
- cmd_direct writes event to workspace.jsonl
- cmd_direct touches marker for engaged session

Delivery side (the bug suspected in #58):
- hook fires after directive queued -> EVO DIRECTIVE banner emitted
- delivered/<event_id>-<sid>.json record written
- marker unlinked after successful delivery (prevents re-fire)
- unengaged session does NOT receive directive (engagement filter)

Result: all 6 tests PASS. The delivery pipeline works correctly when invoked in isolation.

Implication: the live failure observed in the PostTrainBench run (delivered/ empty after `evo direct` reported fanout=1) is not a bug in the queue/handoff/drain chain. Most likely cause is that the agent's blocking Bash tool call (training process running synchronously) delayed PostToolUse firing -- and therefore the next drain invocation -- until after observation ended.

These tests serve as regression coverage for the queue+delivery contract going forward; they would catch a real break in either half.

- benchmark-reviewer: add `mode=review-experiment` for post-commit per-task failure analysis. Reads per-task traces + eval-runner log, classifies failures into a 9-category taxonomy (truncated, wrong-format, wrong-answer, hallucination, refusal, language-drift, prompt-misread, eval-error, unknown), writes per-task annotations via `evo annotate <exp> --task K`, returns JSON with failure_breakdown + top_failure_pattern + next_step_signal. `mode=audit` (existing) stays the default. - subagent: new step 6b. After COMMITTED, before annotating, spawn benchmark-reviewer in review-experiment mode. Skipped for EVALUATED/FAILED/DISCARDED. By the time the orchestrator picks the next frontier, per-task diagnosis is on disk. - finetuning: add `## Stream training metrics live` section + new references/observability.md. Env-driven detection prior (WANDB_API_KEY / TRACKIO_SPACE_ID / MLFLOW_TRACKING_URI), TRL report_to one-liners, custom-loop init+log pattern, EVO_EXPERIMENT_ID as run name. Closes the observability-blind window during long training runs (separate from in-flight evo state, tracked in #59).
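The env-driven detection prior might be sketched as follows. The three environment variables and `EVO_EXPERIMENT_ID` come from the section described above; the function names, the priority order, the fallback run name, and the exact tracker strings to hand to a trainer's `report_to` are assumptions to verify against the installed TRL / transformers version.

```python
import os

# (environment variable, tracker name) pairs; the order is an assumption.
TRACKER_ENV = [
    ("WANDB_API_KEY", "wandb"),
    ("TRACKIO_SPACE_ID", "trackio"),
    ("MLFLOW_TRACKING_URI", "mlflow"),
]


def detect_trackers(env=None):
    """Return the tracker names whose credentials are present in the env."""
    env = os.environ if env is None else env
    return [name for var, name in TRACKER_ENV if env.get(var)]


def run_name(env=None, default="local-run"):
    """Use the experiment id as the run name so tracker runs map to evo nodes."""
    env = os.environ if env is None else env
    return env.get("EVO_EXPERIMENT_ID") or default
```

An empty list means no tracker is configured, in which case the training script would skip metric streaming rather than fail.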

- finetuning: add `## Long training: checkpoint, mid-eval, early-stop in-script` -- for trainings > 30 min, build a TrainerCallback that periodically saves a checkpoint, runs a mini-eval on a 5-10 item held-out subset, and stops on stalled/regressing trajectory. Commits the best mid-eval checkpoint, not the last. Pattern B from the multi-stage tradeoff: one evo node, verification inside the script. - finetuning: add `## Cap retries at training scale` -- recommend `evo config set max-attempts 1` for training-heavy workspaces. Default 3 retries was designed for second-scale benchmarks; at hour-scale, retry-with-tweak burns more compute than a fresh hypothesis would. One shot per node; regression -> discard, branch. - subagent Evo surface: add evo:benchmark-reviewer to dispatch list (was missing -- only verifier was there); add observability.md to finetuning references tree; expand evo:finetuning trigger line with scope hint (technique choice, training recipe, observability, retry discipline) so the subagent knows it has the right skill without re-reading the whole body.
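The stop-and-keep-best logic behind the long-training pattern can be sketched independently of any trainer. The class name, the `patience` and `min_delta` parameters, and the rule for "stalled" are all assumptions; a real implementation would call something like this from a TrainerCallback after each mini-eval.

```python
class MidEvalTracker:
    """Track mini-eval scores, remember the best checkpoint, signal early stop."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience      # mini-evals tolerated without improvement
        self.min_delta = min_delta    # improvement needed to count as progress
        self.best_score = None
        self.best_checkpoint = None
        self.stale = 0

    def update(self, score, checkpoint):
        """Record one mini-eval; return True when training should stop."""
        if self.best_score is None or score > self.best_score + self.min_delta:
            self.best_score = score
            self.best_checkpoint = checkpoint
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience
```

After the loop ends, `best_checkpoint` is the artifact to commit, which may be earlier than the last checkpoint written.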

Drawer Summary tab gains a Diagnoses section (global annotations from benchmark-reviewer/verifier/agents); Tasks tab shows the per-task annotation inline under each failing task, plus target + full model output on expand. Logs tab replaces the file dropdown with inline file tabs, strips ANSI escape codes, and unifies the card styling. Tab strip fixed from 3-col to 4-col grid (Logs no longer wraps) with brighter active-tab contrast for low-contrast monitors.

# Conflicts:
# plugins/evo/.claude-plugin/plugin.json
# plugins/evo/.codex-plugin/plugin.json
# plugins/evo/npm/package.json
# plugins/evo/npm/skills/discover/SKILL.md
# plugins/evo/npm/skills/infra-setup/SKILL.md
# plugins/evo/npm/skills/optimize/SKILL.md
# plugins/evo/npm/skills/report/SKILL.md
# plugins/evo/npm/skills/subagent/SKILL.md
# plugins/evo/pyproject.toml
# plugins/evo/skills/discover/SKILL.md
# plugins/evo/skills/infra-setup/SKILL.md
# plugins/evo/skills/optimize/SKILL.md
# plugins/evo/skills/report/SKILL.md
# plugins/evo/skills/subagent/SKILL.md
# plugins/evo/src/evo/__init__.py
# sdk/node/package.json
# sdk/python/pyproject.toml
# sdk/python/src/evo_agent/__init__.py

…loop

Opt-in, Claude-Code-only workflow driver alongside the prose loop, selected by default-orchestrator (prose|workflow; default prose). When host=claude-code and the flag resolves to workflow, optimize/SKILL.md launches skills/optimize/workflows/evo-optimize.js via the Workflow tool instead of driving the loop turn-by-turn; otherwise the prose loop runs unchanged. Both drive the same evo CLI -- gates, frontier, dashboard, state are identical.

The workflow encodes the loop control deterministically: orient, mandatory scan + cross-history axis check, ideator dispatch on stall/periodic with proposal reconciliation, brief writing + diversity dedupe, per-brief parallel lanes (implement -> pre-verify<->revise -> run -> post-audit, deepened to budget), collect/prune, and stall that resets only on a verified committed score beating the prior best.

args are coerced (object or JSON string); schemas stay flat because the StructuredOutput validator rejects allOf/if/then.

cli.py: default-orchestrator config field (set/get/show + argparse choices).
.gitignore: ignore .claude/worktrees/.
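The selection rule and the args coercion described above can be sketched as follows. This is an illustration in Python, not the code in cli.py or evo-optimize.js: `resolve_orchestrator` and `coerce_args` are hypothetical names, and the non-Claude host value used in the example is an assumption.

```python
import json
from collections.abc import Mapping
from typing import Any, Dict, Union

ORCHESTRATORS = ("prose", "workflow")


def resolve_orchestrator(host: str, default_orchestrator: str = "prose") -> str:
    """Pick the loop driver: workflow only on Claude Code with the flag set."""
    if default_orchestrator not in ORCHESTRATORS:
        raise ValueError(f"unknown orchestrator: {default_orchestrator!r}")
    if host == "claude-code" and default_orchestrator == "workflow":
        return "workflow"
    return "prose"  # every other combination keeps the prose loop


def coerce_args(args: Union[str, Mapping]) -> Dict[str, Any]:
    """Accept workflow args as either a mapping or a JSON-encoded object."""
    if isinstance(args, str):
        args = json.loads(args)
    if not isinstance(args, Mapping):
        raise TypeError("workflow args must be an object or a JSON object string")
    return dict(args)
```

For example, `resolve_orchestrator("claude-code", "workflow")` returns `"workflow"`, while any other host falls back to `"prose"` even when the flag is set.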

Runs a self-paced Opus observer alongside the optimize round loop via Promise.all. It checks host and cross-history signals during rounds (zombie GPU, buried stderr, stuck experiment, saturated axis, dead direction) and folds work-quality hints into the next round's brief; runtime and host issues surface as alerts.

The observer is advisory and isolated from the optimizer:
- Interruptible wait: optimize drops a sentinel (.evo/.wf_optimize_done) on exit and the in-flight tick polls it in 15s hops, so the thread stops within a hop instead of blocking the run for the full interval.
- A failed tick is swallowed (per-tick try/catch + thread .catch) so it can never reject Promise.all and abort the run.
- Self-disables after 3 consecutive failed ticks, so a tick that fails before its pacing wait can't hot-spin agents.

Orient state-read moves from haiku to sonnet.
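The three isolation properties above (sentinel-polled wait, swallowed tick failures, self-disable after consecutive failures) can be sketched with asyncio. The actual workflow is JavaScript built on Promise.all; this Python version is only an illustration, and `interruptible_wait`, `observe`, and their parameters are hypothetical names. Unlike the description, the pacing wait here sits outside the tick, which keeps the sketch short.

```python
import asyncio
from pathlib import Path
from typing import Awaitable, Callable


async def interruptible_wait(sentinel: Path, interval_s: float, hop_s: float = 15.0) -> bool:
    """Sleep up to interval_s in hop_s slices; return True once the sentinel exists."""
    waited = 0.0
    while waited < interval_s:
        if sentinel.exists():
            return True
        step = min(hop_s, interval_s - waited)
        await asyncio.sleep(step)
        waited += step
    return sentinel.exists()


async def observe(
    tick: Callable[[], Awaitable[None]],
    sentinel: Path,
    interval_s: float,
    hop_s: float = 15.0,
    max_failures: int = 3,
) -> str:
    """Run advisory ticks until the optimizer finishes or the observer disables itself."""
    failures = 0
    while True:
        if sentinel.exists():
            return "done"
        try:
            await tick()
            failures = 0
        except Exception:
            # Swallow the failure so it never propagates to the optimizer.
            failures += 1
            if failures >= max_failures:
                return "disabled"
        if await interruptible_wait(sentinel, interval_s, hop_s):
            return "done"
```

Because `observe` returns a status instead of raising, running it next to the main loop with `asyncio.gather` cannot abort the run when a tick fails.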

…ore evo run

The workflow splits a subagent into separate implement and run agents. implementPrompt loads the subagent skill in full; runPrompt was a fresh agent told only "Run `evo run` to evaluate and commit" -- no skill load, no `--check`, no train-then-eval ordering. For a finetune the run agent called the real `evo run` before its background train.py wrote final_model/, producing a spurious "final_model not found" failure that consumed the attempt.

runPrompt now loads the subagent skill, names `evo run --check` for non-committing wiring validation, and requires the artifact (e.g. final_model/) to exist before the real run, warm-starting from EVO_PARENT_POLICY when the experiment trains.
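The ordering that runPrompt now asks for can be expressed as a small gate. The real change is prompt text for an agent, so this is only a sketch of the rule; `plan_run` is a hypothetical helper that returns the commands that are safe to issue and does not execute anything.

```python
from pathlib import Path
from typing import List


def plan_run(workdir: Path, artifact: str = "final_model") -> List[List[str]]:
    """Return the evo commands that are safe to issue for this workspace right now."""
    # Non-committing wiring validation is always safe.
    plan = [["evo", "run", "--check"]]
    # The real, attempt-consuming run is allowed only once training wrote its artifact.
    if (workdir / artifact).is_dir():
        plan.append(["evo", "run"])
    return plan
```

While train.py is still running and final_model/ does not exist, the plan contains only the `--check` invocation, so no attempt is consumed.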

alokwhitewolf added 30 commits May 31, 2026 19:27

feat(dashboard): EVO_DASHBOARD_HOST env to bind 0.0.0.0 (Modal/cloud)

b39e9ed

alokwhitewolf added 29 commits June 2, 2026 02:31

chore: bump 0.5.0-alpha.2 → 0.5.0-alpha.3, sync npm/

cb3b3d7

chore: bump 0.5.0-alpha.3 → 0.5.0-alpha.4, sync npm/

6fb7699

chore: bump 0.5.0-alpha.4 → 0.5.0-alpha.5, sync npm/

a6e70c4

Carries:
- docs(skill/finetuning): cache expensive intermediates under .evo/cache/
- docs(skill/subagent): domain-agnostic counterpart pointing at finetuning

Merge branch 'feat/model-update' into feat/optimize-workflow

3890380

feat(optimize): configurable scan batch size + compact scan-batch labels

026692c

chore: bump 0.5.0-alpha.7 → 0.5.0-alpha.8, sync npm/

ca14911

chore: bump 0.5.0-alpha.8 → 0.5.0-alpha.9, sync npm/

a24ca6b

chore: bump 0.5.0-alpha.9 → 0.5.0-alpha.10, sync npm/

aac0f72

chore: bump 0.5.0-alpha.10 → 0.5.0-alpha.11, sync npm/

06eae04

alokwhitewolf merged commit 06eae04 into main Jun 6, 2026
41 of 42 checks passed

alokwhitewolf merged 67 commits into main from feat/model-update

alokwhitewolf commented Jun 1, 2026


Plugin subagents (new plugins/evo/agents/)

Hook-drain: lazy-register on PostToolUse

evo wait extensions

Diff capture

PostToolUse wait-hint hook

Host-install: releases/latest fallback

Skill rewrites

evo:finetuning

evo:discover and evo:optimize

evo:infra-setup

Dashboard

Files moved

Versions

Issues
