dev: reh3376_dev01 -> main #466
…alth review (Epic 0) EVENTGRAPH-004 federates the last unfederated Hebbian write — the ApplyNegativeFeedback contradict action — into reinforcement_events (trigger_path=apply_negative_feedback_contradict). Data-decided scope: reuse the existing V0022 sink (zero CONTRADICTS edges exist anywhere; no producer calls /v1/learning/negative-feedback — instrument before the producer arrives, the inverse of the dormancy pattern). Also closes the EVENTGRAPH-003 follow-up: 30h post-fix health review of the revived CoactivateSession path — no tuning needed, textbook session cliques, pre-fix orphans stay as historical record (operator decision). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…inforcement_events (Epic 1) The contradict action (no co-activation edge → MERGE CONTRADICTS) was the last unfederated Hebbian write. The CONTRADICTS MERGE lived inside a FOREACH, where the edge variable is invisible to RETURN — so the original single statement is split into two statements in the SAME ExecuteWrite transaction: (a) weaken (EVENTGRAPH-003 telemetry, RETURN unchanged) and (b) contradict with a per-pair RETURN. Classification is identical: weaken never deletes edges, so contradict's NOT EXISTS sees the same edge set the original OPTIONAL MATCH did. Contradict rows land with trigger_path=apply_negative_feedback_contradict. created_new_edge detected via `c.updated_at IS NULL` (ON MATCH always sets it; ON CREATE never does — invariant pinned by comment). delta_weight is the CONTRADICTS edge's OWN weight delta (+negWeight on create, 0 on re-match); negative-feedback semantics are carried by trigger_path, not the sign. Both statements EXPLAIN-validated against live Neo4j. Tier 1: 2 new parser tests (create/re-match branches); learning suite green; lint clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…-match + weaken unchanged (Epic 2) Live against the restarted Epic-1 binary: contradict create row (+0.15, created_new_edge=true), re-match row (delta=0, evidence=2), weaken row byte-equivalent to pre-split behavior (negative delta, floor at 0). Federation CLI surfaces the new trigger_path with no read-side change. UATS learning_negative_feedback 5/5 PASS. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…c 3) Feature doc: 5-path trigger_path table + delta-semantics consumer warning (contradict delta is the CONTRADICTS edge's own weight delta — semantics live in trigger_path, not the sign). UATS spec extended: zero-count equals assertions on nonexistent nodes (hash refreshed, 5/5 live). CLAUDE.md architecture note + producer-gap disclosure. Sprint close in post.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Squash-merge workflow leaves a stale merge-base: PR #418's squash (b408bbc) rewrote the same CHANGELOG/CLAUDE.md/service.go regions this branch then extended in EVENTGRAPH-004. Verified before resolving: main == dev01@36377a2 + .github/workflows/codeql.yml exactly (git diff b408bbc 36377a2 shows only codeql.yml), so taking this branch's side of every content conflict is lossless; codeql.yml comes in from main. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Squash merges never advance the dev branch's merge-base, so every sprint touching CHANGELOG.md/CLAUDE.md hit CONFLICTING on its next PR (first bitten: PR #419). New sync-dev-after-merge.yml merges main back into the source *_dev* branch after each merged PR; the GITHUB_TOKEN push triggers no other workflows, so it can never spawn an empty auto-PR (the PR #420 failure mode). Conflicts fail loudly for manual resolution; workflow_dispatch enables manual runs/live testing. auto-pr.yml additionally skips PR creation when branch content is identical to main — guards MANUAL sync pushes, verified against the live repo state (current dev01 ≡ main → empty=true → skip). actionlint clean (untrusted refs passed via env, not inline). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…deep-dive Full-codebase review vs MDEMG's purpose (cognitive substrate / connection layer): 19 map agents (3 vision + 16 subsystem), 3 cross-cutting assessors, synthesizer + adversarial completeness critic (19 revisions applied). Verdict: server-side substrate is mature, but the system is not currently functioning as the assistant's internal dialogue — the per-prompt delivery channel silently no-ops (hook reads .user_prompt, Claude Code sends .prompt), 100% of GENERALIZES edges have NULL weight (22,170/22,170, live-verified), scheduled decay/prune has been a permanent dry-run, RSIC validates 16/17 actions vacuously, and supervision covers 3 of ~14 background loops. Every defect is the same disease: wired-looking seams with no caller, wrong contract, or no reader. 4 phases ≈ 75 days committed: (1) reconnect the loop ends, (2) close the learning loops, (3) survivability + class-ending forcing functions, (4) FT frontier + release hygiene. Top-10 ranked; deferrals explicit. Orchestrator spot-verification annex included (5 claims re-verified live). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…per-prompt channel (Epic 0) Roadmap Q3 Phase 1 rank #1. Audit of all 6 hooks vs the actual Claude Code stdin schemas: prompt-context.sh reads .user_prompt (CC sends .prompt) → channel exits silently on every prompt; post-tool-observe.py reads tool_output (CC sends tool_response) → false "Build/test succeeded" observations with empty output; guidance wrongly coupled to RESULT_COUNT>0; minor pre-compact transcript jq. session-start / pre-bash-check / pre-write-check verified correct. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rompt channel (Epic 1) Claude Code's UserPromptSubmit stdin field is `prompt`; the hook read `.user_prompt`, which is always empty → exit 0 → per-prompt CMS recall, Jiminy guidance, /strict reformulation, the warm trigger, and the retrieve-time Hebbian reinforcement have NEVER fired in any session. Now reads `.prompt // .user_prompt` (legacy fallback kept). Also decouples guidance from recall: the RESULT_COUNT=0 branch no longer exits; previously it printed its notice, then skipped guidance + warm + retrieval reinforcement, coupling independent deliveries. Both copies (live + installer template). Tier 1 simulated stdin: real .prompt payload → first-ever guidance delivery (J17 T1 bootstrap + DICT, 5363 guidance bytes vs 0 forever); legacy fallback works; short/empty/malformed payloads exit silently (fail-open preserved). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…"succeeded" observations (Epic 2)
Claude Code's PostToolUse stdin field is `tool_response` (string or
object); the hook read `tool_output`, which is always absent → output_str
empty → error indicators never matched → every go build/go test/pytest
Bash call was recorded as "Build/test succeeded" sight-unseen, and real
errors were never observed.
Now reads tool_response (fallback tool_output) via _response_text(),
normalizing string|dict|list (stdout/stderr join). Success classification
requires NON-EMPTY clean output — a silent success records nothing rather
than fabricating; failures land as error observations with real stderr.
Both copies (template regenerated from fixed live, {{SPACE_ID}}
placeholder preserved, verified identical modulo placeholder). Tier 1
against real CMS: failing build → error obs with stderr; passing →
progress; empty → no record.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ine shape (Epic 3)
Transcript lines are {type, message:{content:[{type, text|name, ...}]}};
the old top-level `.content` read always yielded empty, so pre-compaction
snapshots never carried recent-activity context. New jq walks
.message.content[] extracting .text/.name. Verified against this
session's real transcript (old: nothing; new: real activity). Both
copies, placeholders preserved.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
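For readers who do not use jq, the same walk can be sketched in Python. This assumes the transcript is one JSON object per line with the shape quoted in the commit; `recent_activity` and its `limit` parameter are hypothetical names, and the real hook does this in jq.

```python
import json


def recent_activity(transcript_lines, limit=5):
    """Collect .text / .name values from .message.content[] of each transcript line."""
    items = []
    for raw in transcript_lines:
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than failing the hook
        if not isinstance(entry, dict):
            continue
        message = entry.get("message")
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, list):
            continue  # the old top-level .content shape yields nothing here
        for block in content:
            if isinstance(block, dict):
                value = block.get("text") or block.get("name")
                if value:
                    items.append(value)
    return items[-limit:]
```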
…act pin + close (Epics 4-5) Live in the real session: first-ever guidance delivery (J17 T1 bootstrap + DICT, 5363 bytes vs 0 forever); real failing build → error observation with actual compiler output in CMS. PostToolUse success-only firing documented as a limitation. Hook stdin contract pinned in CLAUDE.md. Drift + clique-semantics findings logged for HOOKSYNC-001 / Phase 2. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…channel (Epic 0) Roadmap Q3 Phase 1 rank #2. Investigation grounded all five findings: template→live drift severed alert delivery (50-entry file actively rotating today, never shown); no Cleared lifecycle (nothing sets the field; no /v1/alert* endpoints); no absence detection for the channel that just had a months-long silent outage; compose publishes 9999 on 0.0.0.0; neural sidecar binds 0.0.0.0:8101 via a 39-day-old process serving pre-J17-fix code. 8 epics: reconcile, CI parity gate, clear lifecycle, hook_events absence rule (reuses V0024 via jobhealth), hooks doctor, PORT-TRUTH rider, Tier 3, docs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…y restored to live (Epic 1)
Live hooks adopted from templates (SPACE_ID substituted): restores the
alert-display blocks (all-pending per prompt; critical/high + degraded
healthz at session start) that the live copies lacked — the NOSILENT
last mile. Reverse drift caught during reconcile: the live hook's T1/T2
bootstrap-detection block (MAX_TIER → /v1/jiminy/bootstrap → ACTIVE
CONSTRAINTS header) never existed in the template and was nearly lost —
now single-sourced in the template and regenerated into live.
Live-verified: one prompt now renders alerts (50 pending incl. live
CRITICALs) + recall + J17:INIT bootstrap + guidance + synergy footer,
coexisting. All 6 hooks byte-identical to templates modulo {{SPACE_ID}}.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…emplates (Epic 2)
Mirrors the compose/launchd parity pattern: every *.sh/*.py template
must byte-match .claude/hooks/ modulo the {{SPACE_ID}} placeholder.
Proven locally: passes clean, fails (with a bounded diff dump) on
deliberate drift. Ends the bidirectional-drift class that severed
alert delivery and nearly lost the T1 bootstrap block.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
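A parity check of this kind can be sketched as below. It is an illustration under stated assumptions, not the CI step itself: it assumes the live space id is known and substituted into the template before a byte comparison, and `hook_matches_template` and `find_drift` are hypothetical names.

```python
from pathlib import Path

PLACEHOLDER = b"{{SPACE_ID}}"


def hook_matches_template(template: bytes, live: bytes, space_id: str) -> bool:
    """True when the live hook equals the template with the placeholder filled in."""
    return template.replace(PLACEHOLDER, space_id.encode()) == live


def find_drift(template_dir: Path, live_dir: Path, space_id: str) -> list[str]:
    """Return one message per template whose live counterpart is missing or differs."""
    problems = []
    for template in sorted(template_dir.iterdir()):
        if template.suffix not in (".sh", ".py"):
            continue
        live = live_dir / template.name
        if not live.is_file():
            problems.append(f"MISSING live hook: {live}")
        elif not hook_matches_template(template.read_bytes(), live.read_bytes(), space_id):
            problems.append(f"DRIFT: {template.name}")
    return problems
```

A caller would exit non-zero when `find_drift` returns a non-empty list, which is what makes drift in either direction fail loudly.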
…vered (Epic 3) Alert.Cleared existed but nothing ever set it: once hooks rendered the file, the same entries would re-render every prompt forever. New: FileBackend.Clear (ids and/or all_before cutoff, idempotent, under the existing lock) → Dispatcher.ClearAlerts → POST /v1/alerts/clear. Hooks now clear exactly what they displayed (fire-and-forget, fail-open); cleared = delivered-to-operator, not resolved — persisting conditions re-fire via the evaluator. Alert IDs now CUIDv2 per the identifier standard (was UnixNano; old ids remain valid opaque strings). Live-verified lifecycle: prompt 1 → "50 pending, showing 10" + 10 cleared; prompt 2 → "40 pending, showing 10" (next batch, no re-render) → 20 cleared. Tier 1: Clear by-id/by-time/idempotent/no-backend. UATS alerts_clear 3/3 live (runner falsy-body inheritance discovered: variant bodies must be non-empty objects). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
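The clear semantics (by id and/or by cutoff, idempotent) can be sketched as follows. This is a Python illustration of behaviour the commit describes for the Go `FileBackend.Clear`; the `Alert` fields and `clear_alerts` are hypothetical, and the file lock the real backend holds is omitted.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass
class Alert:
    id: str
    created_at: datetime
    cleared: bool = False


def clear_alerts(
    alerts: List[Alert],
    ids: Iterable[str] = (),
    all_before: Optional[datetime] = None,
) -> int:
    """Mark matching alerts as cleared and return how many changed state."""
    wanted = set(ids)
    changed = 0
    for alert in alerts:
        if alert.cleared:
            continue  # already cleared: repeating a call is a no-op (idempotent)
        by_id = alert.id in wanted
        by_time = all_before is not None and alert.created_at < all_before
        if by_id or by_time:
            alert.cleared = True
            changed += 1
    return changed
```

Because cleared means "delivered to the operator" rather than "resolved", nothing here prevents the evaluator from raising a fresh alert for a condition that persists.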
…self-reports outages (Epic 4) POST /v1/hooks/event records heartbeats into V0024 scheduled_job_events via the jobhealth policy point (job_name hook:<name>; no new sink). Two independent heartbeats: prompt-context fires per delivery (the monitored channel); post-tool-observe fires throttled (HOOK_HEARTBEAT_COOLDOWN_SEC, default 300; proves sessions ACTIVE). Evaluator rule hook_channel_silent (distinct service per the NOSILENT cooldown rule): sessions active + zero prompt-context fires in HOOK_SILENT_LOOKBACK_HOURS (24) → high alert. This is the "job never ran" guarantee applied to the channel whose months-long outage HOOKWIRE-001 found only by manual audit, so the next contract drift self-reports. Config: HOOK_HEALTH_ALERT_ENABLED (true), HOOK_SILENT_LOOKBACK_HOURS (24), HOOK_ACTIVITY_MIN_EVENTS (5). Live-verified: real hook fires land rows (session metadata, latency); throttle holds; rule SQL positive + negative branches proven against the real table; UATS hooks_event 3/3. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
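The absence rule can be sketched as a pure function over heartbeat rows. This is an illustration only: the real rule is SQL over scheduled_job_events, the job name `hook:post-tool-observe` is inferred from the `hook:<name>` pattern, and reading HOOK_ACTIVITY_MIN_EVENTS as the "sessions active" threshold is an assumption.

```python
from datetime import datetime, timedelta


def hook_channel_silent(events, now, lookback_hours=24, min_activity_events=5):
    """events: iterable of (job_name, fired_at) pairs.

    Fires when sessions look active (enough post-tool-observe heartbeats)
    but the prompt-context channel delivered nothing inside the lookback window.
    """
    cutoff = now - timedelta(hours=lookback_hours)
    recent = [name for name, fired_at in events if fired_at >= cutoff]
    active = recent.count("hook:post-tool-observe") >= min_activity_events
    delivered = recent.count("hook:prompt-context")
    return active and delivered == 0
```

Requiring evidence of activity first is what keeps the rule quiet on an idle machine, where zero prompt-context fires is expected.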
… (Epic 5)
11 checks: per-hook template parity (the CI gate's local twin),
settings registration, server healthz, a stdin-contract self-test
piping a real-shape UserPromptSubmit payload through the installed
hook (asserts the always-present synergy footer), alert-file state
(pending/total), and the last hook:prompt-context heartbeat age from
scheduled_job_events (SKIP when TSDB unreachable). Table or --json;
non-zero exit on any FAIL.
Live: 11/11 PASS on this machine ("last fire 5s ago" — fed by the
doctor's own self-test); correctly fails (exit 1) on deliberate drift.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ie replaced (Epic 6)
Compose published the API on 0.0.0.0 (unauthenticated admin/destructive
routes exposed off-host): now
"${MDEMG_BIND_ADDR:-127.0.0.1}:${MDEMG_PORT}:9999"; wide bind is an explicit
opt-in (both compose copies, CI-synced).
Neural sidecar bound 0.0.0.0:8101 via config.py default AND the plist
arg: both now 127.0.0.1 (both plist copies, CI-synced; SIDECAR_HOST env
overrides). Operational: the 39-day-old sidecar process (started
2026-05-02, serving pre-J17-fix code) replaced — fresh process verified
on 127.0.0.1:8101, both models loaded, health 200.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…lose (Epics 7-8) Live-verified across the sprint: alert backlog drained 50→2 on real prompts (display-then-clear); evaluator rules 15→16 (hook_channel_silent loaded); doctor 11/11 + correct failure mode; sidecar fresh on 127.0.0.1:8101 (NLI 234ms). Feature doc docs/features/hook-channel-health.md (config table incl. MDEMG_BIND_ADDR + SIDECAR_HOST). Findings: packaging plists are templates (raw copy → launchd exit 78; service install is canonical); UATS falsy-variant-body inheritance pinned. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…sis latency Caught in the HOOKSYNC-001 full-suite regression: the synchronous /v1/jiminy/guide includes local-model synthesis (~43s observed quiet, ~50s typical per GUIDANCE-SYNTH-001) — the spec's 30s timeout has been silently erroring since synthesis latency grew. Aligned with the JIMINY_WARM_COMPUTE_TIMEOUT_MS budget (90s); hash refreshed; passes live. Pre-existing — not a HOOKSYNC regression (Guide path untouched). The other 3 suite errors were load-induced flakes (pass individually): suite-vs-llama-server slot contention, noted for UXTS-CI-001. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…passes
Root cause: the new 'Verify live hooks match hook templates' CI step
(HOOKSYNC-001) diffs every internal/cli/hook_templates/*.{sh,py} against
.claude/hooks/<name>, but the .gitignore allowlist only un-ignored the
5 original hooks. pre-write-check.py gained a template in this sprint
while its live counterpart stayed gitignored, so CI checked out a tree
without it and failed with 'MISSING live hook:
.claude/hooks/pre-write-check.py'.
Fix: add '!.claude/hooks/pre-write-check.py' to the allowlist and commit
the live hook (already byte-identical to its template modulo SPACE_ID),
preserving the full parity guarantee instead of weakening the CI step.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…n hierarchy (Epic 0) Roadmap Q3 Phase 1 rank #3. Live investigation: point.distance() returns NULL on embedding lists (proven: NULL where vector.similarity.cosine returns 0.627 on the same pair); 3 creation sites affected incl. an ABSTRACTS_TO site the audit missed. Scale worse than audited and growing: 28,332/28,332 GENERALIZES + 36,110/37,996 ABSTRACTS_TO = 64,442 NULL-weight abstraction edges. Neo4j cosine returns [0,1] directly — drop-in. Plan: fix sites (+ CUIDv2 edge ids), LIMIT-5-then-batched backfill, null-weight gauge + alert rule via the existing graph-stats → metric_samples path, UVTS-quick regression guard. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…cosine replaces point.distance (Epic 1) point.distance() is a spatial-Point function: on embedding lists it returns NULL, so every weight at the 3 abstraction-edge creation sites was never set (100% of GENERALIZES + 95% of ABSTRACTS_TO weightless; the CASE guards passed on good embeddings, then the THEN expr evaluated NULL — edges with good embeddings got nothing while embedding-less ones got the 0.5 fallback). vector.similarity.cosine returns [0,1] directly (live-verified: identical=1.0, orthogonal=0.5, opposite=0.0). Site 1 (theme GENERALIZES) gains the null-guard it never had. Also: edge_id randomUUID() → CUIDv2 per the identifier standard, minted Go-side via memberEdgePairs (Cypher can't generate CUIDv2) and zipped with member ids for UNWIND. All 3 statements EXPLAIN-validated live. Tier 1: pair-builder tests (uniqueness, CUID format, empty input). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
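The [0,1] mapping is consistent with a normalized cosine, (1 + cos θ) / 2, which reproduces the three live-verified values quoted in this commit. The sketch below shows that arithmetic plus the null-guard in plain Python; `normalized_cosine` and `edge_weight` are hypothetical names, and the real weights are computed inside Cypher.

```python
import math


def normalized_cosine(a, b):
    """Map cosine similarity from [-1, 1] onto [0, 1]; None if either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return None
    return (1 + dot / (norm_a * norm_b)) / 2


def edge_weight(a, b, fallback=0.5):
    """Weight for an abstraction edge, with the 0.5 fallback for unusable embeddings."""
    if not a or not b or len(a) != len(b):
        return fallback
    weight = normalized_cosine(a, b)
    return fallback if weight is None else weight
```

The guard matters because the original defect was a value silently evaluating to NULL; here every path returns a number.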
2nd training-integrity remediation sprint. The distill gate (x9_distill_capture_v2.py:360: kept = mean(reward_vector) >= 0.8, global) drops spec-correct-but-terse answers because coverage_score/explanation_quality/coherence_score reward length over correctness, gutting ape.reflect (largest target) + summarize + synthesize; balanced_sampler then amplifies the verbose-skew. Principle: inclusion selects for CORRECTNESS not length. Fix: length-neutral correctness rewards + per-task inclusion thresholds + a forcing-function test (each of the 12 covered tasks' known-correct golden rows clear their gate) + distribution check. Closes with the eval-integrity-deferred GGUF serving + honest baseline recompute. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…c 1) The distill inclusion gate (mean(reward_vector) >= 0.8) selected for LENGTH, not correctness — the corpus-skew mechanism behind 3 discarded retrains. Four reward functions used length/count ladders that dropped spec-correct-but-terse answers below the gate and rewarded verbosity upward: - coverage_score: <20 words→0.4 / <50→0.7 / then rising → now substantive content scores 0.9 flat (length-neutral); empty→0.0, pure-repetition→0.3. - explanation_quality: <20 words→0.6 cliff → now substantive→0.9. - coherence_score: required >=2 sentences + 10 words → now any coherent non-repetitive response→0.9; pure repetition→0.4. - insight_count: rewarded bullet COUNT (>=5→1.0) → now >=1 genuine insight→0.9 (no upward count reward; stops bullet-spam, stops dropping single-insight reflections to 0.5 — ape.reflect, the largest target). Verified: terse-correct now clears the 0.8 gate; verbosity/bullet-count no longer rewarded above concise; varied detailed content unaffected (0.9); empty/repetition still rejected. Tests rewritten to pin the new semantics (78 pass). Subtler keyword-bag functions (specificity/actionability) left for the continuation — they reward content signals, not raw length. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
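A length-neutral reward of the kind described for coverage_score might look like the sketch below. The three score levels (0.0, 0.3, 0.9) come from the commit; how the real function decides "substantive" versus "pure repetition" is not shown, so `is_pure_repetition` and its thresholds are assumptions made for illustration.

```python
def is_pure_repetition(words, max_unique_ratio=0.2, min_words=5):
    """Hypothetical heuristic: many words, very few distinct ones."""
    if len(words) < min_words:
        return False
    unique = {word.lower() for word in words}
    return len(unique) / len(words) <= max_unique_ratio


def coverage_score(response: str) -> float:
    """Score content, not length: terse and verbose correct answers score the same."""
    words = response.split()
    if not words:
        return 0.0
    if is_pure_repetition(words):
        return 0.3
    return 0.9
```

With this shape a five-word correct answer and a five-hundred-word one both clear a 0.8 gate, and there is no score step that rewards adding words.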
…ndings (Epic 2-3)
Epic 2: --reward-threshold-map JSON ({"task": float}) overrides the global
--reward-threshold per task in x9_distill_capture_v2.py, so tasks whose reward
arrays have a different natural ceiling can gate at the right bar. Records the
per-task gate in each row + the manifest. Live-verified end-to-end: real
OpenAI + TSDB run with {"consulting.classify": 0.6} applied the override
(3/3 captured, manifest per_task reward_threshold=0.6).
Epic 3 (live Tier 3, docs/development/reward-correctness-001/live_findings.md):
scored REAL production llm_interactions at the 0.8 gate, old vs new rewards.
Validated Epic 1: hidden.summarize recovered 69/72 real concise summaries the
old length ladder dropped. Surfaced three larger correctness issues the
length fix does NOT close (the real dominant suppressors for the big tasks):
1. ape.reflect (54k, largest target): json_valid mean 0.133 — ~87% of recent
responses TRUNCATED mid-JSON (prompt ~5800 + ~3000 output > 8192 per-slot
KV bound). Production serving/capture defect; gate correctly rejects.
Recommended own-sprint follow-up (raise output budget, re-capture).
2. jiminy.evaluate: explanation_quality=0.0 on correct {violations,warnings}
responses — wrong reward for the schema (no top-level explanation key).
Reward-array fix, operator-gated (changes a ULTS array + re-grades).
3. jiminy.synthesize: keyword-bag follow_rate/specificity just below gate —
the deferred Epic 1 continuation.
Also fixed 2 pre-existing lint nits in the touched file (F541, E741).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
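The per-task override in Epic 2 amounts to a dictionary lookup with the global threshold as the default. A minimal sketch, assuming the map has already been parsed from the `--reward-threshold-map` JSON; `gate_for` and `keep_row` are hypothetical names, not functions from x9_distill_capture_v2.py.

```python
import json
from statistics import mean


def gate_for(task, global_threshold, threshold_map):
    """Per-task threshold if one is configured, else the global one."""
    return threshold_map.get(task, global_threshold)


def keep_row(task, reward_vector, global_threshold=0.8, threshold_map=None):
    """Return (kept, gate) so the applied gate can be recorded with the row."""
    gate = gate_for(task, global_threshold, threshold_map or {})
    if not reward_vector:
        return False, gate  # nothing to average: never keep
    return mean(reward_vector) >= gate, gate


# Usage: the map as it would arrive from the command line.
overrides = json.loads('{"consulting.classify": 0.6}')
print(keep_row("consulting.classify", [0.6, 0.7], threshold_map=overrides))
```

Returning the gate alongside the decision mirrors the commit's point that the per-task gate is recorded in each row and in the manifest.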
Closes REWARD-CORRECTNESS-001 at Epic 1+2+3. Epic 4 (baseline recompute) explicitly deferred behind the ape.reflect truncation fix per operator sequencing — recomputing over a known-truncated corpus would bake in the corruption. Next sprint: ape.reflect truncation. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Root-caused the ape.reflect ~87% truncation (largest training target) to a structurally-unbounded prompt: live-measured 7489 tokens (Current Assessment ~3895 + 5-cycle history ~2693), leaving only ~700 of the 8192 per-slot KV budget for output — 191/200 invalid responses cluster at 490-520 tokens_out, truncating mid-JSON at the ceiling. Compression already on; not a max_tokens cap. Plan: bound the prompt to a configurable token budget (gate verbose TSDB dataset fields, cap history cycles, final drop-oldest guard) so output always has ~4000-token headroom, with an optional serving-slot increase as the safety margin. Lever A (structural prompt budget) + Lever B (KV slot) proposed, picked at execution. Tier 3 proof: fresh ape.reflect json_valid recovers 0.13 → ~1.0. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ut (Epic 1-2) Implements the structural fix for the ape.reflect ~87% mid-JSON cutoff. The prompt was unbounded (live 7489 tok), leaving only ~700 of the 8192 per-slot KV budget for output, so the largest training target's responses were cut off mid-array. buildUserPrompt now enforces a token budget: - gate the verbose TSDB dataset fields (LLMPerformance x17 / Retrieval / Embedding / TrainingReadiness, ~3895 of 7489 prompt tokens) behind RSIC_LLM_REFLECT_INCLUDE_DATASETS (default false); scalar health metrics the detectors use are always kept; - cap history cycles via RSIC_LLM_REFLECT_HISTORY_CYCLES (default 3, was hardcoded 5); - final budget guard (RSIC_LLM_REFLECT_PROMPT_BUDGET_TOKENS default 3500, 0 disables): drops history oldest-first, then trims the assessment tail, logging loudly what was dropped (never silent). estimateTokens calibrated to the measured 2.3 chars/tok ratio, slightly conservative. 3 config fields (range-validated, no hardcoding) wired config -> LLMReflectorConfig -> server.go. 6 Tier-1 tests: dataset gating, history cap, drop-history-under-budget, trim-assessment-under-budget, under-budget-unchanged, estimator. Full ape suite + lint + config scanner (687/687) clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
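The budget guard's ordering (cap history, drop history oldest-first, then trim the assessment tail) can be sketched as follows. This is a Python illustration of logic the commit describes for the Go `buildUserPrompt`; `fit_prompt` is a hypothetical name, history is assumed to be ordered oldest first, and the estimator uses integer arithmetic for the 2.3 chars/token ratio.

```python
import logging

log = logging.getLogger("reflect_prompt")


def estimate_tokens(text: str) -> int:
    """ceil(len / 2.3), computed with integers to avoid float rounding."""
    return (len(text) * 10 + 22) // 23


def fit_prompt(assessment, history, budget_tokens=3500, history_cycles=3):
    """Return (assessment, history) trimmed to fit budget_tokens; 0 disables the guard."""
    history = list(history[-history_cycles:]) if history_cycles > 0 else []
    if budget_tokens <= 0:
        return assessment, history

    def total():
        return estimate_tokens(assessment) + sum(estimate_tokens(h) for h in history)

    # Step 1: drop whole history cycles, oldest first.
    while history and total() > budget_tokens:
        dropped = history.pop(0)
        log.warning("prompt over budget: dropped history cycle (%d tokens)",
                    estimate_tokens(dropped))

    # Step 2: if the assessment alone is still too large, trim its tail.
    if total() > budget_tokens:
        keep_chars = budget_tokens * 23 // 10
        log.warning("prompt over budget: trimmed assessment from %d to %d chars",
                    len(assessment), keep_chars)
        assessment = assessment[:keep_chars]
    return assessment, history
```

Every drop is logged, which matches the commit's "never silent" requirement.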
…md (Epic 5) Live Tier 3 result documented: 3/3 fresh post-restart ape.reflect rows valid JSON (100%, up from ~13%), tokens_in ~2575 (from ~7489). Corrected the stale CLAUDE.md ape.reflect prompt-size figure (~5800 -> ~7489 live-measured) and added the per-slot KV "prompt+output share the budget" guidance. Closes APE-PROMPT-BUDGET-001. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… import shift The Epic 1-2 import (log/slog) + struct fields shifted llmReflectSystemPrompt from line 74 to 80. The ULTS hash verifier reads from line-2 and grabs the first backtick string; at the stale :74 the search region now included the `"` backtick in `quoted[i] = `"` + a + `"`` → wrong hash. The system prompt TEXT is unchanged (hash 39b2bc… still matches at :80). Updated system_prompt_source :74 → :80. Local glob verify: ape.reflect 11/11 PASS. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ta (Epic 0-1) Operator-chosen audit-first prune phase. Read-only enumeration of every non-conforming TSDB/file target with exact counts + a backup/small-batch/verify prune plan (each category operator-gated). Findings — PRUNE TARGETS: (A) 2,111 invalid-JSON rows in object/array tasks (ape.reflect 1890 in the 06-11..06-13 truncation window, rerank_cross 184, evaluate_llm 18, query_classify 18, classify 1); (B) 21,135 error/empty rows (mdemg data clean target); (C) rerank mislabeled archive 6,894 events/21M + valid_golden 108 leaked + ~14 stale April baselines. ~23,246 TSDB rows total (~22.7%). NOT prune targets (schema/reward mismatch, data is fine, fix the definition): hidden.summarize 72 (prose vs object schema), string-schema tasks the jsonb check false-flags (intent_translate/codegen/synthesize emit valid bare strings), jiminy.evaluate. Audit pitfall recorded: never run a jsonb-validity prune predicate against string-schema tasks. Corrects the "87% of 54k" assumption: ape.reflect corruption is 1,890 rows in the recent truncation window (forward-fixed by APE-PROMPT-BUDGET-001), not corpus-wide. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…LOG (Epic 2 close) Pruned 1,898 genuinely-corrupt invalid-JSON rows from llm_interactions (ape.reflect 1,879 / jiminy.evaluate_llm 18 / consulting.classify 1), backup-first to .mdemg-backup-20260613_195431/dataprune/ (reversible). 102,415 -> 100,517, remaining_corrupt=0, live healthy, recent ape.reflect 14/14 valid. Small-batch verify caught that the raw pg_input_is_valid predicate over-counted by 213 (valid JSON behind markdown fences / think-tags that production SanitizeResponse strips); validated all candidates through a faithful replica of llmclient.SanitizeResponse and spared the recoverable 213. Categories B (error rows) + C (file artifacts) deferred. The backup dir is untracked (not committed). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
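The over-count arises when validity is checked on the raw text instead of the sanitized text. The sketch below shows the safer ordering; `sanitize` is a simplified, hypothetical stand-in for `llmclient.SanitizeResponse` (whose real behaviour is not shown here), and the `<think>…</think>` tag shape is an assumption.

```python
import json
import re

TICKS = "`" * 3  # markdown code-fence marker, built this way to keep the example readable

_THINK = re.compile(r"<think>.*?</think>", re.DOTALL)
_FENCE = re.compile(r"^" + TICKS + r"[\w-]*[ \t]*\n(.*?)\n?" + TICKS + r"\s*$", re.DOTALL)


def sanitize(raw: str) -> str:
    """Strip think-tags and one surrounding markdown fence, if present."""
    text = _THINK.sub("", raw).strip()
    match = _FENCE.match(text)
    if match:
        text = match.group(1).strip()
    return text


def is_prunable(raw: str) -> bool:
    """True only when the response is still invalid JSON after sanitizing.

    Intended for object/array-schema tasks; string-schema tasks emit valid
    bare strings and must not be run through this predicate.
    """
    try:
        json.loads(sanitize(raw))
    except ValueError:
        return True
    return False
```

Rows that fail the raw check but pass `is_prunable` correspond to the recoverable group the commit spared.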
B: 21,254 error/silent-failure rows removed via mdemg data clean (4 spaces), backed up first. C: rerank prefix-archive (6,894 events/21M, no refs) moved to backup; valid_golden + ~14 baselines RETAINED (load-bearing — leak source + regression harness; retire during baseline recompute). Final: llm_interactions 79,461 rows, 0 non-conforming. Verification catch documented: data clean dry-run per-task table is surviving-rows, not the delete set. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ace data
Two bugs surfaced during the space-hygiene cleanup (removing 24 junk/test/demo
spaces for live testing):
1. `mdemg space delete` gated its pre-check on `count(MemoryNode {space_id})`
but the delete itself is label-agnostic (`MATCH (n {space_id})`). A space
holding only SymbolNodes/Observations (e.g. e2e-test = 10,918 SymbolNodes)
reported "no nodes. Nothing to delete." and silently survived. Pre-check now
counts all labels, matching the delete.
2. `ListSpaces` (`mdemg space list`) panicked — `interface conversion:
interface {} is nil, not string` at the `sid.(string)` assertion — when any
MemoryNode had a null space_id (orphaned/infra artifacts). The query now
excludes null/empty space_id (such nodes are not a "space"), and the
assertion is nil-guarded defensively.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
24 junk/test/demo spaces removed (~143k nodes, backed up); blank-space resolved (global infra kept null, 155 test MemoryNodes staged for delete); 2 space-tool bugs fixed (delete pre-check, list panic). Record in space_cleanup.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…xes (Epic 0) The REWARD-CORRECTNESS-001 follow-ups: (1) hidden.summarize schema object->string (production emits prose; 72 rows mis-flagged invalid-JSON); (2) explanation_quality schema-aware for nested violations[].reasoning (fixes jiminy.evaluate + evaluate_llm scoring correct responses 0.0); (3) keyword-bag specificity/actionability substantive-floored (jiminy.synthesize valid guidance dropped for lacking magic words). Makes the 4 tasks' grading correct before the baseline recompute. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… + live validation Three reward/schema mismatches that scored CORRECT responses wrong: 1. hidden.summarize ULTS schema object->string. Production emits bare prose (cluster_summarizer.go), so the object schema mis-flagged 72 valid summaries as invalid-JSON. (Reward already fixed in RC-001; this corrects the spec.) 2. explanation_quality made schema-aware: jiminy.evaluate / evaluate_llm nest reasoning in violations[].reasoning, not a top-level field, so the flat lookup scored every correct response 0.0. Now credits nested reasoning and treats a valid no-violation verdict as a correct "no issues" answer (nothing to explain). Falls back to the flat path. 3. specificity_score / actionability_score substantive-floored (0.7 floor, keyword presence a bounded bonus, hedging/empty/repetition low) — the keyword-bag dropped valid concise guidance below the gate for lacking ~6 magic words. follow_rate inherits it. Live Tier 3 (real production rows, old->new kept@0.8): jiminy.evaluate 0/60->60/60 (mean 0.667->0.967), jiminy.synthesize 3/60->59/60 (0.725->0.879), ape.reflect 47/60->60/60 (0.848->0.956); evaluate_llm unchanged 60/60. New means 0.88-0.97 = correct production output scoring correctly, no over-inflation. 87 unit tests + 609 neural tests + ruff green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ort 8101->8102 (Epic 1) The capstone of the training-integrity arc: recompute the frozen 0.8338 baseline through the fixed harness (valid_clean + RC-001/002 rewards + GGUF :8102). Epic 1 fixes the stale rl_phase11.yaml mlx_port (8101 mlx_lm.server decommissioned → 8102 llama-server GGUF), flagged by EVAL-INTEGRITY-001. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…d harness (Epic 2-4) Recomputed the adapter-promotion baseline through the fixed harness (valid_clean leak-free eval + RC-001/002 corrected rewards + GGUF llama-server :8102) = 0.8655, replacing the stale frozen 0.8338 (valid_golden-leaked, old length-biased rewards, decommissioned MLX serving — not comparable). evaluate_gate_5a now derives the target from the loaded baseline REPORT (single source of truth); the constant is retained only as a >5pp drift tripwire. status ok, 12 tasks, 50 samples/task, 0 zero-call. ape.reflect 0.696 is an eval-harness artifact (stored ~7.5k-token prompts bypass the runtime prompt budget and get cut off mid-JSON), not a model regression. Closes the training-integrity arc: trustworthy gate (EVAL-INTEGRITY-001), correct rewards (REWARD-CORRECTNESS-001/002), sound corpus (APE-PROMPT-BUDGET-001 + DATAPRUNE), honest baseline (this sprint). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
BASELINE-RECOMPUTE-001: honest adapter-promotion baseline (training-integrity capstone)
The promotion gate's baseline was a stale frozen constant 0.8338 (99%-leaked …
Not comparable to 0.8338 (different eval + rewards + serving); future retrains compare against 0.8655.
Live recompute (Tier 3): …
Closes the training-integrity arc: trustworthy gate → correct rewards → sound corpus → honest baseline.
Disclosed follow-up: …
(Note: an identical summary was accidentally posted to the now-merged #465 first; this #466 is the correct PR.)
Summary
Development branch changes from reh3376_dev01.
Commits
mdemg model CLI + pluggable Fetcher interface
Auto-generated PR from reh3376_dev01 push