dev: reh3376_dev01 -> main #465
…rcement_events (Epic 3) The weaken path (existing CO_ACTIVATED_WITH edge weakened by negWeight) now emits trigger_path=apply_negative_feedback rows with a NEGATIVE delta_weight and created_new_edge=false. The FOREACH writes (weaken SET + contradict MERGE) are untouched; only the RETURN changed from aggregated `action,count(*)` to per-pair rows — the Go side counts rows (sum = grouped count, NegativeFeedbackResult preserved) and emits reinforcement events for weaken rows only. prevWeight is captured before the FOREACH SET. Contradict path deliberately not emitted (CONTRADICTS isn't traversed by the federation walk; deferred). EXPLAIN-validated; build + lint clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lly runs
Discovered via EVENTGRAPH-003 live smoke: session co-activation
(CO_ACTIVATED_WITH edges between same-session conversation observations) had
NEVER fired — 0 such edges ever in mdemg-dev across 5495 conversation
observations. Root cause: conversation.NewServiceWithConfig sets
learningService=nil ("set via SetLearningService to avoid circular dependency"),
but SetLearningService had NO caller, so the `if s.learningService != nil` guard
in Observe() always skipped CoactivateSession. The function + its Cypher were
correct (verified by running it directly: 3 pairs, proper Hebbian weights) —
it was just never invoked.
Fix: convSvc.SetLearningService(lea) at construction. Live-verified: 3 distinct
observations in a session now create 6 CO_ACTIVATED_WITH edges + emit
coactivate_session reinforcement events. Standalone fix-commit per the
live-smoke precedent (surprise bugs don't get rolled into the sprint commit).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
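The dormancy described in this commit comes down to an optional dependency whose setter had no caller. The real code is Go; the following is only a minimal Python sketch of the pattern, and every class and method name in it is an illustrative stand-in rather than the repo's API.

```python
class LearningService:
    """Hypothetical stand-in that records which sessions were co-activated."""

    def __init__(self):
        self.sessions = []

    def coactivate_session(self, session_id):
        self.sessions.append(session_id)


class ConversationService:
    """Hypothetical stand-in for the conversation service."""

    def __init__(self):
        # Left unset at construction to avoid a circular dependency.
        self.learning_service = None

    def set_learning_service(self, svc):
        self.learning_service = svc

    def observe(self, session_id):
        # The guard silently skips co-activation when the setter was never called.
        if self.learning_service is not None:
            self.learning_service.coactivate_session(session_id)
            return True
        return False


learning = LearningService()
conv = ConversationService()
dormant = conv.observe("s1")         # setter never called: nothing fires
conv.set_learning_service(learning)  # the one-line fix at construction
live = conv.observe("s1")
```

The guard is what makes the failure silent: the skipped branch raises nothing, so only a count of the resulting edges (zero) reveals it.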
… close (Epic 4)
All four trigger_paths live-verified (apply_coactivation 50, apply_symbol_coactivation 1000, apply_negative_feedback 1 negative-delta, coactivate_session 4 after the dormancy fix); federation CLI surfaces them. Feature doc updated to all-four-paths + the trigger_path table; CHANGELOG Added (EVENTGRAPH-003) + Fixed (CoactivateSession never-invoked); CLAUDE.md note + correction (CoactivateSession was dead, not "writing via sidecar paths"); verification.md + post.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…alth review (Epic 0) EVENTGRAPH-004 federates the last unfederated Hebbian write — the ApplyNegativeFeedback contradict action — into reinforcement_events (trigger_path=apply_negative_feedback_contradict). Data-decided scope: reuse the existing V0022 sink (zero CONTRADICTS edges exist anywhere; no producer calls /v1/learning/negative-feedback — instrument before the producer arrives, the inverse of the dormancy pattern). Also closes the EVENTGRAPH-003 follow-up: 30h post-fix health review of the revived CoactivateSession path — no tuning needed, textbook session cliques, pre-fix orphans stay as historical record (operator decision). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…inforcement_events (Epic 1) The contradict action (no co-activation edge → MERGE CONTRADICTS) was the last unfederated Hebbian write. The CONTRADICTS MERGE lived inside a FOREACH, where the edge variable is invisible to RETURN — so the original single statement is split into two statements in the SAME ExecuteWrite transaction: (a) weaken (EVENTGRAPH-003 telemetry, RETURN unchanged) and (b) contradict with a per-pair RETURN. Classification is identical: weaken never deletes edges, so contradict's NOT EXISTS sees the same edge set the original OPTIONAL MATCH did. Contradict rows land with trigger_path=apply_negative_feedback_contradict. created_new_edge detected via `c.updated_at IS NULL` (ON MATCH always sets it; ON CREATE never does — invariant pinned by comment). delta_weight is the CONTRADICTS edge's OWN weight delta (+negWeight on create, 0 on re-match); negative-feedback semantics are carried by trigger_path, not the sign. Both statements EXPLAIN-validated against live Neo4j. Tier 1: 2 new parser tests (create/re-match branches); learning suite green; lint clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
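The row-to-event mapping for the contradict path can be sketched as follows. This is a Python illustration of the semantics stated in the commit, not the Go implementation; the `row` dict shape and the function name are assumptions.

```python
def contradict_event(row, neg_weight):
    """Map one per-pair contradict row to a reinforcement event (illustrative).

    `row` is a hypothetical dict with an `updated_at` key. None means the
    MERGE took the ON CREATE branch, which never sets updated_at.
    """
    created = row.get("updated_at") is None
    return {
        "trigger_path": "apply_negative_feedback_contradict",
        "created_new_edge": created,
        # The CONTRADICTS edge's own weight delta:
        # +neg_weight on create, 0 on re-match.
        "delta_weight": neg_weight if created else 0.0,
    }
```

A consumer should therefore branch on `trigger_path` rather than on the sign of `delta_weight` to recognize negative feedback.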
…-match + weaken unchanged (Epic 2) Live against the restarted Epic-1 binary: contradict create row (+0.15, created_new_edge=true), re-match row (delta=0, evidence=2), weaken row byte-equivalent to pre-split behavior (negative delta, floor at 0). Federation CLI surfaces the new trigger_path with no read-side change. UATS learning_negative_feedback 5/5 PASS. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…c 3) Feature doc: 5-path trigger_path table + delta-semantics consumer warning (contradict delta is the CONTRADICTS edge's own weight delta — semantics live in trigger_path, not the sign). UATS spec extended: zero-count equals assertions on nonexistent nodes (hash refreshed, 5/5 live). CLAUDE.md architecture note + producer-gap disclosure. Sprint close in post.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Squash-merge workflow leaves a stale merge-base: PR #418's squash (b408bbc) rewrote the same CHANGELOG/CLAUDE.md/service.go regions this branch then extended in EVENTGRAPH-004. Verified before resolving: main == dev01@36377a2 + .github/workflows/codeql.yml exactly (git diff b408bbc 36377a2 shows only codeql.yml), so taking this branch's side of every content conflict is lossless; codeql.yml comes in from main. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Squash merges never advance the dev branch's merge-base, so every sprint touching CHANGELOG.md/CLAUDE.md hit CONFLICTING on its next PR (first bitten: PR #419). New sync-dev-after-merge.yml merges main back into the source *_dev* branch after each merged PR; the GITHUB_TOKEN push triggers no other workflows, so it can never spawn an empty auto-PR (the PR #420 failure mode). Conflicts fail loudly for manual resolution; workflow_dispatch enables manual runs/live testing. auto-pr.yml additionally skips PR creation when branch content is identical to main — guards MANUAL sync pushes, verified against the live repo state (current dev01 ≡ main → empty=true → skip). actionlint clean (untrusted refs passed via env, not inline). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…deep-dive Full-codebase review vs MDEMG's purpose (cognitive substrate / connection layer): 19 map agents (3 vision + 16 subsystem), 3 cross-cutting assessors, synthesizer + adversarial completeness critic (19 revisions applied). Verdict: server-side substrate is mature, but the system is not currently functioning as the assistant's internal dialogue — the per-prompt delivery channel silently no-ops (hook reads .user_prompt, Claude Code sends .prompt), 100% of GENERALIZES edges have NULL weight (22,170/22,170, live-verified), scheduled decay/prune has been a permanent dry-run, RSIC validates 16/17 actions vacuously, and supervision covers 3 of ~14 background loops. Every defect is the same disease: wired-looking seams with no caller, wrong contract, or no reader. 4 phases ≈ 75 days committed: (1) reconnect the loop ends, (2) close the learning loops, (3) survivability + class-ending forcing functions, (4) FT frontier + release hygiene. Top-10 ranked; deferrals explicit. Orchestrator spot-verification annex included (5 claims re-verified live). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…per-prompt channel (Epic 0) Roadmap Q3 Phase 1 rank #1. Audit of all 6 hooks vs the actual Claude Code stdin schemas: prompt-context.sh reads .user_prompt (CC sends .prompt) → channel exits silently on every prompt; post-tool-observe.py reads tool_output (CC sends tool_response) → false "Build/test succeeded" observations with empty output; guidance wrongly coupled to RESULT_COUNT>0; minor pre-compact transcript jq. session-start / pre-bash-check / pre-write-check verified correct. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rompt channel (Epic 1)
Claude Code's UserPromptSubmit stdin field is `prompt`; the hook read `.user_prompt`, which is always empty → exit 0 → per-prompt CMS recall, Jiminy guidance, /strict reformulation, the warm trigger, and the retrieve-time Hebbian reinforcement have NEVER fired in any session. Now reads `.prompt // .user_prompt` (legacy fallback kept).
Also decouples guidance from recall: the RESULT_COUNT=0 branch no longer exits. Previously it printed its notice and then skipped guidance + warm + retrieval reinforcement, coupling independent deliveries. Both copies (live + installer template).
Tier 1 simulated stdin: real .prompt payload → first-ever guidance delivery (J17 T1 bootstrap + DICT, 5363 guidance bytes vs 0 forever); legacy fallback works; short/empty/malformed payloads exit silently (fail-open preserved).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
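The field fallback and fail-open behavior can be sketched in Python. The actual hook is a shell script using jq, so this is an equivalent under assumptions, and `extract_prompt` is a hypothetical name.

```python
import json


def extract_prompt(raw_stdin):
    """Return the prompt text, or "" so the caller can exit silently (fail-open)."""
    try:
        payload = json.loads(raw_stdin)
    except (TypeError, ValueError):
        return ""
    if not isinstance(payload, dict):
        return ""
    # Mirrors jq's `.prompt // .user_prompt`: fall back when the first
    # field is null or absent.
    value = payload.get("prompt")
    if value is None:
        value = payload.get("user_prompt")
    return value if isinstance(value, str) else ""
```

Returning an empty string for every malformed case keeps the hook from ever blocking a prompt, which is the fail-open property the commit preserves.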
…"succeeded" observations (Epic 2)
Claude Code's PostToolUse stdin field is `tool_response` (string or
object); the hook read `tool_output`, which is always absent → output_str
empty → error indicators never matched → every go build/go test/pytest
Bash call was recorded as "Build/test succeeded" sight-unseen, and real
errors were never observed.
Now reads tool_response (fallback tool_output) via _response_text(),
normalizing string|dict|list (stdout/stderr join). Success classification
requires NON-EMPTY clean output — a silent success records nothing rather
than fabricating; failures land as error observations with real stderr.
Both copies (template regenerated from fixed live, {{SPACE_ID}}
placeholder preserved, verified identical modulo placeholder). Tier 1
against real CMS: failing build → error obs with stderr; passing →
progress; empty → no record.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
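A minimal sketch of the normalization and classification described in this commit. The `stdout`/`stderr` keys for the dict form, the marker list, and both function names are assumptions for illustration; the real `_response_text()` is not shown in the source.

```python
ERROR_MARKERS = ("error", "FAIL", "panic:")  # assumed indicators


def response_text(payload):
    """Normalize tool_response (string, dict, or list) to plain text."""
    resp = payload.get("tool_response", payload.get("tool_output"))
    if isinstance(resp, str):
        return resp
    if isinstance(resp, dict):
        parts = [resp.get("stdout"), resp.get("stderr")]
        return "\n".join(p for p in parts if isinstance(p, str) and p)
    if isinstance(resp, list):
        return "\n".join(str(item) for item in resp)
    return ""


def classify(payload):
    """Return "error", "progress", or None (record nothing on empty output)."""
    text = response_text(payload).strip()
    if not text:
        return None  # silent success: do not fabricate an observation
    if any(marker in text for marker in ERROR_MARKERS):
        return "error"
    return "progress"
```

Returning `None` on empty output is the key change: absence of evidence no longer becomes a "succeeded" record.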
…ine shape (Epic 3)
Transcript lines are {type, message:{content:[{type, text|name, ...}]}};
the old top-level `.content` read always yielded empty, so pre-compaction
snapshots never carried recent-activity context. New jq walks
.message.content[] extracting .text/.name. Verified against this
session's real transcript (old: nothing; new: real activity). Both
copies, placeholders preserved.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
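The nested walk can be illustrated in Python. The hook itself uses jq, so this is an equivalent sketch based on the line shape given in the commit, and `recent_activity` is a hypothetical name.

```python
import json


def recent_activity(transcript_lines):
    """Collect text and tool names from JSONL transcript lines."""
    items = []
    for line in transcript_lines:
        try:
            entry = json.loads(line)
        except ValueError:
            continue  # skip malformed lines
        message = entry.get("message") if isinstance(entry, dict) else None
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, list):
            continue  # a top-level `content` key is ignored, as in the new shape
        for block in content:
            if isinstance(block, dict):
                value = block.get("text") or block.get("name")
                if value:
                    items.append(value)
    return items
```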
…act pin + close (Epics 4-5) Live in the real session: first-ever guidance delivery (J17 T1 bootstrap + DICT, 5363 bytes vs 0 forever); real failing build → error observation with actual compiler output in CMS. PostToolUse success-only firing documented as a limitation. Hook stdin contract pinned in CLAUDE.md. Drift + clique-semantics findings logged for HOOKSYNC-001 / Phase 2. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…channel (Epic 0) Roadmap Q3 Phase 1 rank #2. Investigation grounded all five findings: template→live drift severed alert delivery (50-entry file actively rotating today, never shown); no Cleared lifecycle (nothing sets the field; no /v1/alert* endpoints); no absence detection for the channel that just had a months-long silent outage; compose publishes 9999 on 0.0.0.0; neural sidecar binds 0.0.0.0:8101 via a 39-day-old process serving pre-J17-fix code. 8 epics: reconcile, CI parity gate, clear lifecycle, hook_events absence rule (reuses V0024 via jobhealth), hooks doctor, PORT-TRUTH rider, Tier 3, docs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…y restored to live (Epic 1)
Live hooks adopted from templates (SPACE_ID substituted): restores the
alert-display blocks (all-pending per prompt; critical/high + degraded
healthz at session start) that the live copies lacked — the NOSILENT
last mile. Reverse drift caught during reconcile: the live hook's T1/T2
bootstrap-detection block (MAX_TIER → /v1/jiminy/bootstrap → ACTIVE
CONSTRAINTS header) never existed in the template and was nearly lost —
now single-sourced in the template and regenerated into live.
Live-verified: one prompt now renders alerts (50 pending incl. live
CRITICALs) + recall + J17:INIT bootstrap + guidance + synergy footer,
coexisting. All 6 hooks byte-identical to templates modulo {{SPACE_ID}}.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…emplates (Epic 2)
Mirrors the compose/launchd parity pattern: every *.sh/*.py template
must byte-match .claude/hooks/ modulo the {{SPACE_ID}} placeholder.
Proven locally: passes clean, fails (with a bounded diff dump) on
deliberate drift. Ends the bidirectional-drift class that severed
alert delivery and nearly lost the T1 bootstrap block.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
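The parity rule (byte-match modulo the placeholder) reduces to a substitute-then-compare check. A minimal Python sketch, assuming the live hook was produced by plain substitution of `{{SPACE_ID}}`; the function name and the example space id are illustrative, and the CI step's actual implementation is not shown in the source.

```python
PLACEHOLDER = b"{{SPACE_ID}}"


def matches_template(template_bytes, live_bytes, space_id):
    """True when live equals the template with the placeholder substituted."""
    rendered = template_bytes.replace(PLACEHOLDER, space_id.encode("utf-8"))
    return rendered == live_bytes
```

Comparing bytes rather than decoded text means whitespace and line-ending drift also fail the check.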
…vered (Epic 3) Alert.Cleared existed but nothing ever set it: once hooks rendered the file, the same entries would re-render every prompt forever. New: FileBackend.Clear (ids and/or all_before cutoff, idempotent, under the existing lock) → Dispatcher.ClearAlerts → POST /v1/alerts/clear. Hooks now clear exactly what they displayed (fire-and-forget, fail-open); cleared = delivered-to-operator, not resolved — persisting conditions re-fire via the evaluator. Alert IDs now CUIDv2 per the identifier standard (was UnixNano; old ids remain valid opaque strings). Live-verified lifecycle: prompt 1 → "50 pending, showing 10" + 10 cleared; prompt 2 → "40 pending, showing 10" (next batch, no re-render) → 20 cleared. Tier 1: Clear by-id/by-time/idempotent/no-backend. UATS alerts_clear 3/3 live (runner falsy-body inheritance discovered: variant bodies must be non-empty objects). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
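The clear semantics (by ids and/or a time cutoff, idempotent) can be sketched as below. This is a Python illustration, not the Go FileBackend.Clear; the alert dict shape and parameter names are assumptions, and locking is omitted.

```python
def clear_alerts(alerts, ids=None, all_before=None, now=0.0):
    """Mark matching alerts as cleared; return how many were newly cleared.

    `alerts` is a list of dicts with `id`, `created_at`, and `cleared` keys
    (an assumed shape). Re-clearing is a no-op, so the call is idempotent.
    """
    wanted = set(ids or [])
    newly = 0
    for alert in alerts:
        hit = alert["id"] in wanted or (
            all_before is not None and alert["created_at"] < all_before
        )
        if hit and alert.get("cleared") is None:
            alert["cleared"] = now
            newly += 1
    return newly
```

Because already-cleared entries are skipped, a hook can safely retry the same clear request after a timeout.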
…self-reports outages (Epic 4)
POST /v1/hooks/event records heartbeats into V0024 scheduled_job_events via the jobhealth policy point (job_name hook:<name>; no new sink). Two independent heartbeats: prompt-context fires per delivery (the monitored channel); post-tool-observe fires throttled (HOOK_HEARTBEAT_COOLDOWN_SEC, default 300; proves sessions ACTIVE).
Evaluator rule hook_channel_silent (distinct service per the NOSILENT cooldown rule): sessions active + zero prompt-context fires in HOOK_SILENT_LOOKBACK_HOURS (24) → high alert. This is the "job never ran" guarantee applied to the channel whose months-long outage HOOKWIRE-001 found only by manual audit: the next contract drift self-reports.
Config: HOOK_HEALTH_ALERT_ENABLED (true), HOOK_SILENT_LOOKBACK_HOURS (24), HOOK_ACTIVITY_MIN_EVENTS (5).
Live-verified: real hook fires land rows (session metadata, latency); throttle holds; rule SQL positive + negative branches proven against the real table; UATS hooks_event 3/3.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
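The absence rule can be sketched in Python. The real rule is SQL over scheduled_job_events; here the `(job_name, unix_ts)` event shape is an assumption, as is the reading of HOOK_ACTIVITY_MIN_EVENTS as a minimum count of post-tool-observe heartbeats.

```python
def hook_channel_silent(events, now, lookback_hours=24, min_activity=5):
    """Return True when sessions look active but the prompt channel is silent.

    `events` is a list of (job_name, unix_ts) tuples, an assumed shape.
    """
    cutoff = now - lookback_hours * 3600
    recent = [name for name, ts in events if ts >= cutoff]
    activity = recent.count("hook:post-tool-observe")
    prompt_fires = recent.count("hook:prompt-context")
    return activity >= min_activity and prompt_fires == 0
```

Requiring an independent activity signal is what keeps the rule quiet on idle machines while still catching a channel that stopped firing.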
… (Epic 5)
11 checks: per-hook template parity (the CI gate's local twin),
settings registration, server healthz, a stdin-contract self-test
piping a real-shape UserPromptSubmit payload through the installed
hook (asserts the always-present synergy footer), alert-file state
(pending/total), and the last hook:prompt-context heartbeat age from
scheduled_job_events (SKIP when TSDB unreachable). Table or --json;
non-zero exit on any FAIL.
Live: 11/11 PASS on this machine ("last fire 5s ago" — fed by the
doctor's own self-test); correctly fails (exit 1) on deliberate drift.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ie replaced (Epic 6)
Compose published the API on 0.0.0.0 (unauthenticated admin/destructive
routes exposed off-host): now "${MDEMG_BIND_ADDR:-127.0.0.1}:${MDEMG_PORT}
:9999" — wide bind is an explicit opt-in (both compose copies, CI-synced).
Neural sidecar bound 0.0.0.0:8101 via config.py default AND the plist
arg: both now 127.0.0.1 (both plist copies, CI-synced; SIDECAR_HOST env
overrides). Operational: the 39-day-old sidecar process (started
2026-05-02, serving pre-J17-fix code) replaced — fresh process verified
on 127.0.0.1:8101, both models loaded, health 200.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…lose (Epics 7-8)
Live-verified across the sprint: alert backlog drained 50→2 on real prompts (display-then-clear); evaluator rules 15→16 (hook_channel_silent loaded); doctor 11/11 + correct failure mode; sidecar fresh on 127.0.0.1:8101 (NLI 234ms). Feature doc docs/features/hook-channel-health.md (config table incl. MDEMG_BIND_ADDR + SIDECAR_HOST). Findings: packaging plists are templates (raw copy → launchd exit 78; service install is canonical); UATS falsy-variant-body inheritance pinned.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…sis latency Caught in the HOOKSYNC-001 full-suite regression: the synchronous /v1/jiminy/guide includes local-model synthesis (~43s observed quiet, ~50s typical per GUIDANCE-SYNTH-001) — the spec's 30s timeout has been silently erroring since synthesis latency grew. Aligned with the JIMINY_WARM_COMPUTE_TIMEOUT_MS budget (90s); hash refreshed; passes live. Pre-existing — not a HOOKSYNC regression (Guide path untouched). The other 3 suite errors were load-induced flakes (pass individually): suite-vs-llama-server slot contention, noted for UXTS-CI-001. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…passes
Root cause: the new 'Verify live hooks match hook templates' CI step
(HOOKSYNC-001) diffs every internal/cli/hook_templates/*.{sh,py} against
.claude/hooks/<name>, but the .gitignore allowlist only un-ignored the
5 original hooks. pre-write-check.py gained a template in this sprint
while its live counterpart stayed gitignored, so CI checked out a tree
without it and failed with 'MISSING live hook:
.claude/hooks/pre-write-check.py'.
Fix: add '!.claude/hooks/pre-write-check.py' to the allowlist and commit
the live hook (already byte-identical to its template modulo SPACE_ID),
preserving the full parity guarantee instead of weakening the CI step.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2nd training-integrity remediation sprint. The distill gate (x9_distill_capture_v2.py:360: kept = mean(reward_vector) >= 0.8, global) drops spec-correct-but-terse answers because coverage_score/explanation_quality/coherence_score reward length over correctness, gutting ape.reflect (largest target) + summarize + synthesize; balanced_sampler then amplifies the verbose-skew. Principle: inclusion selects for CORRECTNESS not length. Fix: length-neutral correctness rewards + per-task inclusion thresholds + a forcing-function test (each of the 12 covered tasks' known-correct golden rows clear their gate) + distribution check. Closes with the eval-integrity-deferred GGUF serving + honest baseline recompute.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…c 1)
The distill inclusion gate (mean(reward_vector) >= 0.8) selected for LENGTH, not correctness: the corpus-skew mechanism behind 3 discarded retrains. Four reward functions used length/count ladders that dropped spec-correct-but-terse answers below the gate and rewarded verbosity upward:
- coverage_score: <20 words→0.4 / <50→0.7 / then rising → now substantive content scores 0.9 flat (length-neutral); empty→0.0, pure-repetition→0.3.
- explanation_quality: <20 words→0.6 cliff → now substantive→0.9.
- coherence_score: required >=2 sentences + 10 words → now any coherent non-repetitive response→0.9; pure repetition→0.4.
- insight_count: rewarded bullet COUNT (>=5→1.0) → now >=1 genuine insight→0.9 (no upward count reward; stops bullet-spam, stops dropping single-insight reflections to 0.5, which hit ape.reflect, the largest target).
Verified: terse-correct now clears the 0.8 gate; verbosity/bullet-count no longer rewarded above concise; varied detailed content unaffected (0.9); empty/repetition still rejected. Tests rewritten to pin the new semantics (78 pass). Subtler keyword-bag functions (specificity/actionability) left for the continuation: they reward content signals, not raw length.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
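A length-neutral reward and the inclusion gate can be sketched as follows. The three score values come from the commit; the repetition heuristic (one distinct word repeated) is an assumption, since the real detector is not shown, and both function bodies are illustrative.

```python
def coverage_score(text):
    """Length-neutral coverage: empty 0.0, pure repetition 0.3, otherwise 0.9.

    "Pure repetition" is approximated here as one distinct word repeated.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    if len(words) > 1 and len(set(words)) == 1:
        return 0.3
    return 0.9


def kept(reward_vector, threshold=0.8):
    """Inclusion gate: mean reward at or above the threshold."""
    return sum(reward_vector) / len(reward_vector) >= threshold
```

Under the old ladder a correct answer of fewer than 20 words scored 0.4 on coverage alone; with a flat 0.9 the word count no longer moves the mean in either direction.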
…ndings (Epic 2-3)
Epic 2: --reward-threshold-map JSON ({"task": float}) overrides the global
--reward-threshold per task in x9_distill_capture_v2.py, so tasks whose reward
arrays have a different natural ceiling can gate at the right bar. Records the
per-task gate in each row + the manifest. Live-verified end-to-end: real
OpenAI + TSDB run with {"consulting.classify": 0.6} applied the override
(3/3 captured, manifest per_task reward_threshold=0.6).
Epic 3 (live Tier 3, docs/development/reward-correctness-001/live_findings.md):
scored REAL production llm_interactions at the 0.8 gate, old vs new rewards.
Validated Epic 1: hidden.summarize recovered 69/72 real concise summaries the
old length ladder dropped. Surfaced three larger correctness issues the
length fix does NOT close (the real dominant suppressors for the big tasks):
1. ape.reflect (54k, largest target): json_valid mean 0.133 — ~87% of recent
responses TRUNCATED mid-JSON (prompt ~5800 + ~3000 output > 8192 per-slot
KV bound). Production serving/capture defect; gate correctly rejects.
Recommended own-sprint follow-up (raise output budget, re-capture).
2. jiminy.evaluate: explanation_quality=0.0 on correct {violations,warnings}
responses — wrong reward for the schema (no top-level explanation key).
Reward-array fix, operator-gated (changes a ULTS array + re-grades).
3. jiminy.synthesize: keyword-bag follow_rate/specificity just below gate —
the deferred Epic 1 continuation.
Also fixed 2 pre-existing lint nits in the touched file (F541, E741).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
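The per-task override resolution from Epic 2 amounts to a map lookup with a global fallback. A minimal Python sketch; `threshold_for` is a hypothetical helper, not the function in x9_distill_capture_v2.py.

```python
import json


def threshold_for(task, global_threshold, threshold_map_json=None):
    """Resolve the gate for one task: per-task override, else the global value."""
    overrides = json.loads(threshold_map_json) if threshold_map_json else {}
    return float(overrides.get(task, global_threshold))
```

With the map from the live run, `consulting.classify` gates at 0.6 while every unlisted task keeps the global 0.8.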
Closes REWARD-CORRECTNESS-001 at Epic 1+2+3. Epic 4 (baseline recompute) explicitly deferred behind the ape.reflect truncation fix per operator sequencing — recomputing over a known-truncated corpus would bake in the corruption. Next sprint: ape.reflect truncation. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Root-caused the ape.reflect ~87% truncation (largest training target) to a structurally-unbounded prompt: live-measured 7489 tokens (Current Assessment ~3895 + 5-cycle history ~2693), leaving only ~700 of the 8192 per-slot KV budget for output — 191/200 invalid responses cluster at 490-520 tokens_out, truncating mid-JSON at the ceiling. Compression already on; not a max_tokens cap. Plan: bound the prompt to a configurable token budget (gate verbose TSDB dataset fields, cap history cycles, final drop-oldest guard) so output always has ~4000-token headroom, with an optional serving-slot increase as the safety margin. Lever A (structural prompt budget) + Lever B (KV slot) proposed, picked at execution. Tier 3 proof: fresh ape.reflect json_valid recovers 0.13 → ~1.0. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ut (Epic 1-2)
Implements the structural fix for the ape.reflect ~87% mid-JSON cutoff. The prompt was unbounded (live 7489 tok), leaving only ~700 of the 8192 per-slot KV budget for output, so the largest training target's responses were cut off mid-array. buildUserPrompt now enforces a token budget:
- gate the verbose TSDB dataset fields (LLMPerformance x17 / Retrieval / Embedding / TrainingReadiness, ~3895 of 7489 prompt tokens) behind RSIC_LLM_REFLECT_INCLUDE_DATASETS (default false); scalar health metrics the detectors use are always kept;
- cap history cycles via RSIC_LLM_REFLECT_HISTORY_CYCLES (default 3, was hardcoded 5);
- final budget guard (RSIC_LLM_REFLECT_PROMPT_BUDGET_TOKENS default 3500, 0 disables): drops history oldest-first, then trims the assessment tail, logging loudly what was dropped (never silent).
estimateTokens calibrated to the measured 2.3 chars/tok ratio, slightly conservative. 3 config fields (range-validated, no hardcoding) wired config -> LLMReflectorConfig -> server.go. 6 Tier-1 tests: dataset gating, history cap, drop-history-under-budget, trim-assessment-under-budget, under-budget-unchanged, estimator. Full ape suite + lint + config scanner (687/687) clean.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
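The final budget guard can be sketched in Python. The real code is Go (buildUserPrompt and estimateTokens); the function names, return shape, and integer rounding below are assumptions. Only the 2.3 chars/token ratio, the drop order, and the "0 disables" rule come from the commit. For scale, 8192 - 7489 = 703 tokens of headroom, which matches the "~700" figure.

```python
def estimate_tokens(text):
    """ceil(len / 2.3), done in integer arithmetic to avoid float rounding."""
    return -((-len(text) * 10) // 23)


def enforce_budget(assessment, history, budget_tokens=3500):
    """Drop history oldest-first, then trim the assessment tail.

    `history` is ordered oldest to newest. Returns (assessment, history,
    dropped) so the caller can log what was removed. 0 disables the guard.
    """
    dropped = []
    history = list(history)
    if budget_tokens <= 0:
        return assessment, history, dropped

    def total():
        return estimate_tokens(assessment) + sum(estimate_tokens(h) for h in history)

    while history and total() > budget_tokens:
        dropped.append(("history", history.pop(0)))
    if total() > budget_tokens:
        keep_chars = budget_tokens * 23 // 10
        dropped.append(("assessment_tail", assessment[keep_chars:]))
        assessment = assessment[:keep_chars]
    return assessment, history, dropped
```

Returning the dropped pieces rather than discarding them is what lets the caller log loudly instead of trimming silently.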
…md (Epic 5) Live Tier 3 result documented: 3/3 fresh post-restart ape.reflect rows valid JSON (100%, up from ~13%), tokens_in ~2575 (from ~7489). Corrected the stale CLAUDE.md ape.reflect prompt-size figure (~5800 -> ~7489 live-measured) and added the per-slot KV "prompt+output share the budget" guidance. Closes APE-PROMPT-BUDGET-001. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… import shift The Epic 1-2 import (log/slog) + struct fields shifted llmReflectSystemPrompt from line 74 to 80. The ULTS hash verifier reads from line-2 and grabs the first backtick string; at the stale :74 the search region now included the `"` backtick in `quoted[i] = `"` + a + `"`` → wrong hash. The system prompt TEXT is unchanged (hash 39b2bc… still matches at :80). Updated system_prompt_source :74 → :80. Local glob verify: ape.reflect 11/11 PASS. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ta (Epic 0-1) Operator-chosen audit-first prune phase. Read-only enumeration of every non-conforming TSDB/file target with exact counts + a backup/small-batch/verify prune plan (each category operator-gated). Findings — PRUNE TARGETS: (A) 2,111 invalid-JSON rows in object/array tasks (ape.reflect 1890 in the 06-11..06-13 truncation window, rerank_cross 184, evaluate_llm 18, query_classify 18, classify 1); (B) 21,135 error/empty rows (mdemg data clean target); (C) rerank mislabeled archive 6,894 events/21M + valid_golden 108 leaked + ~14 stale April baselines. ~23,246 TSDB rows total (~22.7%). NOT prune targets (schema/reward mismatch, data is fine, fix the definition): hidden.summarize 72 (prose vs object schema), string-schema tasks the jsonb check false-flags (intent_translate/codegen/synthesize emit valid bare strings), jiminy.evaluate. Audit pitfall recorded: never run a jsonb-validity prune predicate against string-schema tasks. Corrects the "87% of 54k" assumption: ape.reflect corruption is 1,890 rows in the recent truncation window (forward-fixed by APE-PROMPT-BUDGET-001), not corpus-wide. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…LOG (Epic 2 close) Pruned 1,898 genuinely-corrupt invalid-JSON rows from llm_interactions (ape.reflect 1,879 / jiminy.evaluate_llm 18 / consulting.classify 1), backup-first to .mdemg-backup-20260613_195431/dataprune/ (reversible). 102,415 -> 100,517, remaining_corrupt=0, live healthy, recent ape.reflect 14/14 valid. Small-batch verify caught that the raw pg_input_is_valid predicate over-counted by 213 (valid JSON behind markdown fences / think-tags that production SanitizeResponse strips); validated all candidates through a faithful replica of llmclient.SanitizeResponse and spared the recoverable 213. Categories B (error rows) + C (file artifacts) deferred. The backup dir is untracked (not committed). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
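The over-count arises when a raw JSON-validity predicate is applied to text that production would sanitize first. The sketch below illustrates validating after sanitizing; it is a hypothetical approximation and not a replica of llmclient.SanitizeResponse, whose exact rules are not shown in the source. The fence and `<think>` tag patterns are assumptions.

```python
import json
import re

FENCE = "`" * 3  # built programmatically to keep this example fence-safe
FENCE_RE = re.compile(
    r"^" + FENCE + r"[A-Za-z0-9]*\s*\n(.*?)\n?" + FENCE + r"\s*$", re.DOTALL
)
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def sanitize(text):
    """Strip think-tags and one surrounding markdown fence (assumed rules)."""
    text = THINK_RE.sub("", text).strip()
    match = FENCE_RE.match(text)
    return match.group(1).strip() if match else text


def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


def is_recoverable_json(text):
    """Validity as production would see it: sanitize first, then parse."""
    return is_valid_json(sanitize(text))
```

A prune candidate is a row where `is_recoverable_json` is false; rows that fail only the raw check are the recoverable ones that should be spared.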
B: 21,254 error/silent-failure rows removed via mdemg data clean (4 spaces), backed up first. C: rerank prefix-archive (6,894 events/21M, no refs) moved to backup; valid_golden + ~14 baselines RETAINED (load-bearing — leak source + regression harness; retire during baseline recompute). Final: llm_interactions 79,461 rows, 0 non-conforming. Verification catch documented: data clean dry-run per-task table is surviving-rows, not the delete set. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ace data
Two bugs surfaced during the space-hygiene cleanup (removing 24 junk/test/demo
spaces for live testing):
1. `mdemg space delete` gated its pre-check on `count(MemoryNode {space_id})`
but the delete itself is label-agnostic (`MATCH (n {space_id})`). A space
holding only SymbolNodes/Observations (e.g. e2e-test = 10,918 SymbolNodes)
reported "no nodes. Nothing to delete." and silently survived. Pre-check now
counts all labels, matching the delete.
2. `ListSpaces` (`mdemg space list`) panicked — `interface conversion:
interface {} is nil, not string` at the `sid.(string)` assertion — when any
MemoryNode had a null space_id (orphaned/infra artifacts). The query now
excludes null/empty space_id (such nodes are not a "space"), and the
assertion is nil-guarded defensively.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
24 junk/test/demo spaces removed (~143k nodes, backed up); blank-space resolved (global infra kept null, 155 test MemoryNodes staged for delete); 2 space-tool bugs fixed (delete pre-check, list panic). Record in space_cleanup.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…xes (Epic 0)
The REWARD-CORRECTNESS-001 follow-ups: (1) hidden.summarize schema object->string (production emits prose; 72 rows mis-flagged invalid-JSON); (2) explanation_quality schema-aware for nested violations[].reasoning (fixes jiminy.evaluate + evaluate_llm scoring correct responses 0.0); (3) keyword-bag specificity/actionability substantive-floored (jiminy.synthesize valid guidance dropped for lacking magic words). Makes the 4 tasks' grading correct before the baseline recompute.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… + live validation
Three reward/schema mismatches that scored CORRECT responses wrong:
1. hidden.summarize ULTS schema object->string. Production emits bare prose (cluster_summarizer.go), so the object schema mis-flagged 72 valid summaries as invalid-JSON. (Reward already fixed in RC-001; this corrects the spec.)
2. explanation_quality made schema-aware: jiminy.evaluate / evaluate_llm nest reasoning in violations[].reasoning, not a top-level field, so the flat lookup scored every correct response 0.0. Now credits nested reasoning and treats a valid no-violation verdict as a correct "no issues" answer (nothing to explain). Falls back to the flat path.
3. specificity_score / actionability_score substantive-floored (0.7 floor, keyword presence a bounded bonus, hedging/empty/repetition low): the keyword-bag dropped valid concise guidance below the gate for lacking ~6 magic words. follow_rate inherits it.
Live Tier 3 (real production rows, old->new kept@0.8): jiminy.evaluate 0/60->60/60 (mean 0.667->0.967), jiminy.synthesize 3/60->59/60 (0.725->0.879), ape.reflect 47/60->60/60 (0.848->0.956); evaluate_llm unchanged 60/60. New means 0.88-0.97 = correct production output scoring correctly, no over-inflation. 87 unit tests + 609 neural tests + ruff green.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
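The schema-aware lookup in item 2 can be sketched as below. The 0.9 value is borrowed from the RC-001 "substantive→0.9" rule and is an assumption here, as are the `explanation` key name for the flat path and the function body itself; the production reward function is not shown in the source.

```python
def explanation_quality(response):
    """Score explanation presence for flat or nested-verdict responses."""
    if not isinstance(response, dict):
        return 0.0
    violations = response.get("violations")
    if isinstance(violations, list):
        if not violations:
            return 0.9  # valid "no issues" verdict: nothing to explain
        reasons = [v.get("reasoning") for v in violations if isinstance(v, dict)]
        has_reason = any(isinstance(r, str) and r.strip() for r in reasons)
        return 0.9 if has_reason else 0.0
    # Flat fallback: a top-level explanation field (assumed key name).
    flat = response.get("explanation")
    return 0.9 if isinstance(flat, str) and flat.strip() else 0.0
```

The nested branch is checked first so that a `{violations, warnings}` response is never routed to the flat lookup that scored it 0.0.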
REWARD-CORRECTNESS-002: schema/reward-mismatch fixes
The REWARD-CORRECTNESS-001 live-findings follow-ups. Three reward/schema mismatches that graded correct responses wrong (same "reward not measuring correctness" class as the length bias):
Live Tier 3 (real production rows, old → new kept@0.8)
New means 0.88–0.97 = correct output scoring correctly, no over-inflation. 87 reward unit tests + 609 neural tests + ruff green. Unblocks the honest baseline recompute: corpora are now sound (corrupt rows pruned, truncation fixed, reward grading correct).
BASELINE-RECOMPUTE-001: honest adapter-promotion baseline (training-integrity capstone)
The promotion gate's baseline was a stale frozen constant 0.8338 (computed on the 99%-leaked …
Not comparable to 0.8338 (different eval + rewards + serving). Future retrains compare against 0.8655.
Live recompute (Tier 3):
Closes the training-integrity arc: trustworthy gate (EVAL-INTEGRITY-001) → correct rewards (REWARD-CORRECTNESS-001/002) → sound corpus (APE-PROMPT-BUDGET-001 + DATAPRUNE) → honest baseline (this).
Disclosed follow-up:
Summary
Development branch changes from reh3376_dev01.
Commits
mdemg model CLI + pluggable Fetcher interface
Auto-generated PR from reh3376_dev01 push