Implement Learning Loop - ApplyCoactivation#2
Conversation
…eshold filtering - Add configurable LearningMinActivation config parameter (env: LEARNING_MIN_ACTIVATION) - Update learning service to use configurable threshold instead of hardcoded 0.20 - Add validation for threshold to be in range [0, 1] - Add fallback default of 0.20 if config value is not set - Improved code comments explaining clique spam prevention Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…te cap - Sort pairs by activation product (ai * aj) in descending order - Select top-K pairs to prioritize strongest co-activations - Prevents clique spam per 04_Activation_and_Learning.md guidelines Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… eta * a_i * a_j - mu * w_ij - Add configurable Hebbian learning parameters to config (LEARNING_ETA, LEARNING_MU, LEARNING_WMIN, LEARNING_WMAX) - Update learning/service.go to use config values with sensible fallback defaults - Add HebbianWeightUpdate() pure function for unit testing the formula - Document the formula: new_w = (1-μ)*w + η*a_i*a_j, clamped to [wmin, wmax] Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ERGE pattern - Enhanced Cypher query with comprehensive comments explaining MERGE pattern - Added last_activated_at property to both forward and reverse edges - ON CREATE SET: initializes all edge properties including last_activated_at - ON MATCH SET: updates timestamps, last_activated_at, and evidence_count - Symmetric edges: both directions updated with matching weight values - Per 02_Graph_Schema.md, last_activated_at is required for all relationships Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…imestamps) Added `version` property to CO_ACTIVATED_WITH edges per 02_Graph_Schema.md: - ON CREATE SET: version=1 (initialize) - ON MATCH SET: version=coalesce(version,0)+1 (increment) This follows the same optimistic concurrency pattern used in: - observations/service.go (MemoryNode.version, HAS_OBSERVATION.version) - retrieval/service.go (MemoryNode.version) Edge metadata is now complete per schema specification: - edge_id, space_id, created_at, updated_at, version, status - weight, evidence_count, last_activated_at, decay_rate - dim_coactivation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…logic Add comprehensive unit tests for the learning service's pair selection and filtering functionality: - TestClamp01: validates value clamping to [0,1] range - TestPairsToMaps: verifies pair to map conversion for Cypher - TestFilterByActivationThreshold: tests node filtering by minimum activation - TestPairGeneration: validates O(n^2) pair generation from nodes - TestPairActivationProducts: confirms activation product calculation - TestTopKPairSelection: tests top-K selection by activation product - TestTopKSelectionPreservesHighestProducts: ensures highest products are kept - TestEdgeCapEnforcement: validates edge cap is properly enforced - TestCombinedFilteringAndSelection: tests full pipeline integration All tests pass with go test -v ./internal/learning/... Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added comprehensive unit tests for HebbianWeightUpdate function: - TestHebbianWeightUpdateBasic: basic formula verification with default params - TestHebbianWeightUpdateClamping: bounds enforcement (wmin, wmax) - TestHebbianWeightUpdateDecayBehavior: mu parameter effects on decay - TestHebbianWeightUpdateLearningRate: eta parameter effects on strengthening - TestHebbianWeightUpdateActivationProduct: ai*aj proportionality - TestHebbianWeightUpdateMultipleIterations: cumulative weight changes - TestHebbianWeightUpdateDecayWithoutActivation: decay behavior over time - TestHebbianWeightUpdateFormulaDerivation: step-by-step formula verification - TestHebbianWeightUpdateSymmetry: order independence of activations - TestHebbianWeightUpdateEdgeCases: boundary conditions and edge cases Total: 12 new test functions with 54+ subtests (93 total tests in package). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive integration tests for the learning service that test actual edge creation and update behavior against Neo4j: - TestApplyCoactivationCreatesEdges: verifies new edges are created with correct metadata (edge_id, weight, evidence_count, version, status, timestamps, dim_coactivation, decay_rate) - TestApplyCoactivationUpdatesEdges: verifies existing edges are updated with incremented evidence_count, version, and adjusted weight using Hebbian formula - TestApplyCoactivationEdgeSymmetry: verifies both forward and reverse edges have identical weights and metadata - TestApplyCoactivationBelowThreshold: verifies nodes below activation threshold don't create edges - TestApplyCoactivationEmptyResults: verifies graceful handling of empty/insufficient results - TestApplyCoactivationMultipleIterations: verifies cumulative learning with monotonic weight increase - TestApplyCoactivationWeightBounds: verifies weights stay within configured [wmin, wmax] bounds Tests use build tag 'integration' and skip when Neo4j is unavailable. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add integration tests for edge cap enforcement: - TestApplyCoactivationEdgeCapEnforcement: Tests with 10 nodes (45 potential pairs) and cap=10, verifies exactly 20 edges created (10 pairs × 2 directions) - TestApplyCoactivationEdgeCapWithLargeInput: Tests with 20 nodes (190 potential pairs) and cap=50, verifies exactly 100 edges created Both tests verify: - Correct number of edges written to database - Highest activation product pairs are selected (top-K selection) - Lowest activation product pairs are excluded Adds countEdges helper function for verifying edge counts in Neo4j. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
📝 WalkthroughWalkthroughThe pull request extends the learning service to support configurable Hebbian learning parameters (learning rate, decay, weight bounds, activation threshold) via environment variables, implements top-K pair selection based on activation products, and updates the Neo4j integration to compute and persist weight-updated edges with enhanced properties using Cypher-native Hebbian calculations. Changes
Sequence DiagramsequenceDiagram
participant Config
participant LearningService
participant NodeMemory
participant Neo4jDB
LearningService->>Config: Read LearningMinActivation,<br/>LearningEta, LearningMu,<br/>LearningWMin, LearningWMax
Config-->>LearningService: Return learning parameters
LearningService->>NodeMemory: Fetch activated nodes
NodeMemory-->>LearningService: Return node list with<br/>activation values
LearningService->>LearningService: Filter nodes by<br/>LearningMinActivation threshold
LearningService->>LearningService: Generate O(n²) candidate pairs<br/>and compute a_i × a_j
alt Pairs exceed LearningEdgeCapPerRequest
LearningService->>LearningService: Sort pairs by product,<br/>truncate to top-K
end
LearningService->>LearningService: For each pair,<br/>compute Hebbian update:<br/>w_new = (1-μ)×w + η×(a_i×a_j)<br/>clamp to [wmin, wmax]
LearningService->>Neo4jDB: MERGE forward and reverse<br/>CO_ACTIVATED_WITH edges<br/>with updated weight & properties
Neo4jDB->>Neo4jDB: On MATCH: increment<br/>evidence_count & version
Neo4jDB-->>LearningService: Confirm updates
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Poem
📜 Recent review detailsConfiguration used: defaults Review profile: CHILL Plan: Free 📒 Files selected for processing (4)
✏️ Tip: You can disable this entire section by setting Note 🎁 Summarized by CodeRabbit FreeYour organization is on the Free plan. CodeRabbit will generate a high-level summary and a walkthrough for each pull request. For a comprehensive line-by-line review, please upgrade your subscription to CodeRabbit Pro by visiting https://app.coderabbit.ai/login. Comment |
The reasoning module and reranker were replacing results, losing the normalized_confidence that was set earlier in the scoring pipeline. Changes: - Add ApplyNormalizedConfidenceToResults() function in scoring.go - Call it AFTER all post-processing (reasoning, reranking, truncation) - Add FileFilter struct for code_only and extension filtering - Add comprehensive tests for new functionality This fixes the issue where API responses were missing normalized_confidence and confidence_level fields despite Task #2 being marked complete. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…RPO trainer + DPO pair generator + dual regression harness, code-complete; compute pass operator-gated) Exit criterion (Option B refined): Epics 1-5 code-complete, Epic 6 three-tier tests green (73 total), Epic 7 in-repo deferrals done, Epic 8 docs landed. Actual MLX GRPO run + gate 5a/5b execution are explicit operator tasks — code path is wired, tested with mocked rollouts, ready to attach to an MLX optimizer step (~100 LOC adapter, follow-up task #227). What ships: - neural/training/rl/ — trainer.py (MLX-agnostic orchestrator with injectable RolloutFn/OptimizerStepFn/EvalFn/CheckpointFn Protocol callables + SQL sidecar persistence matching Phase 10's persist.py), grpo_loss.py (clipped surrogate + KL + entropy, log-ratio clamp at ±20, hand-computed fixture asserted at TOL=1e-6), advantage.py (per-task normalization + 3 zero-stddev policies: intra_batch_only default / widen / drop), reward_sampler.py (reads benchmark_results, 3 sampling strategies), preflight.py (5 gates), regression.py (dual gate 5a vs Phase 5 SFT baseline 0.8338 × 1.02 = 0.8505 target + 5b vs fresh-merge ≤ 0.5pp delta) - neural/training/dpo/pair_generator.py — reads benchmark_results, buckets by (task_id, prompt_hash), chosen/rejected by scalar reward delta ≥ 0.15. End-to-end tested against live Phase 10 TSDB: 5 pairs across 2 tasks, SHA256 bbe7bb9a… - internal/tsdb/migrations/013_rl_training.sql — APPLIED LIVE, schema_meta 12→13. Additive rl_training_runs (PK CUIDv2 + FK to benchmark_runs + gate_verdict CHECK) + rl_training_steps (hypertable on recorded_at, 30-day chunks, per-step loss components + advantage stats + clip frac + n_samples/n_dropped). Reverse-tested offline before apply. - configs/rl_phase11.yaml + configs/dpo_phase12_pairs.yaml — zero hardcoded values; every MEMORY knob explicit Decision forks (Plan §10): - Risk #1: Option B (custom in-repo trainer) over Option A (vendor mlx-lm-lora). mlx_lm==0.31.2 has no native GRPO. Orchestrator came in at ~330 LOC, below plan's 400-600 LOC estimate. Isolated MLX coupling behind Protocol callables → unit tests use pure-Python mocks with no MLX import. - Risk #2: default zero_stddev_policy: intra_batch_only. 9/16 Phase 10 tasks have historical σ=0 from deterministic rewards; batch rollouts have real sampling-temperature variance so within-batch σ is meaningful. widen (mean × 0.05) + drop fallbacks config-selectable. Tests: 73 total, all green. - Tier 1 unit 37 (grpo_loss 8 + advantage 13 + reward_sampler 16) - Tier 2 integration 36 (trainer 8 + regression 12 + pair_generator 16) - Tier 3 e2e live (V0013 applied, trainer sidecar round-trip via psql -f, DPO from live Phase 10 TSDB) Deferred, operator-gated: MLX adapter wiring (task #227); real GRPO run (~4-8 hrs MLX); gate 5a/5b execution ($10-20 judge spend); adapter promotion sandbox → .local-models/qwen3-14b-mdemg-v1-rl/; --scorer=registry default flip (gates on 5a PASS); stagnation auto-exit log (gates on benchmark_runs .count ≥ 2). Policy compliance: - epoch cap = 3 (training.epochs: 3 explicit) - n_epochs=auto disallowed (trainer reads integer, no auto branch) - early-stop val_reward < best × 0.95 × 2 evals (wired + test_trainer_ integration verifies fires at step 6) - CUIDv2 for run_id + step_id (reuses neural/benchmarks/_ids.new_run_id) - no hardcoded values (every knob in yaml with CLI override) - sequential epics (no parallelism; docs before implementation) - 3-tier testing (unit 37 + integration 36 + e2e live) - single batched commit at sprint close - sprint summary on PR per MEMORY feedback_sprint_summary_on_pr - max_tokens ≥ 3000 (rollout 4000, judge 4000) - latency_budget_ms ≥ 15000 (rollout 30000, judge 30000) - plan-options pattern disclosed (Risk #1 + Risk #2) Phase 12 (HITL DPO) unblocked. training_data/dpo/phase11/pairs.jsonl is the curation input; trainer/loss/advantage modules directly reusable for Phase 12's DPO training loop. Sprint chain: A (#335) → B (#336) → C (#338/#339/#340) → D (#343) → E (14cd2b3) → DATA (#346 234baec) → PHASE5 (#347 c0be250) → PHASE10 (#348 b81c5fb) → PHASE11 (this commit) → operator compute pass (#227) → Phase 12 HITL DPO. Docs: sprint_plan_ft_lora_phase11.md (12-section v1.0 plan), phase_11_rl_ post.md (12-section post-run report), 00_README_v2.md v5.7 → v5.8, 03_IMPLEMENTATION_PLAN_v2.md §Phase 5.11 EXECUTED banner, 04_BENCHMARK_RL_ v2.md §Phase 11 EXECUTED banner with code-complete/compute-deferred delineation, AGENT_HANDOFF.md top entry, CHANGELOG.md [Unreleased] ### Added, CLAUDE.md Testing section expanded. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…fail + degraded-mode Phase 11.6.3 — MLX Watchdog (Operational Hygiene #2). Eliminates the retry-storm cascade observed in Phase 12 (1642% CPU when 16 LLM call sites each independently retried 6× on a dead mlx_lm.server). Auto-restart via launchd, fast-fail in llmclient when probe says down, alerts + Prometheus metrics + operator CLI for visibility. Phase 13 (Column-Voting Retrieval) unblocked for sustained live A/B testing. Shipped: - internal/mlxprobe/ (~250 LOC) — goroutine, state machine (up→degraded→down with 3-failure / 2-success hysteresis), atomic state, supervisor lifecycle, SetDefault/SetFastFailEnabled/SetFastFailObserver singletons - internal/llmclient/client.go:471 — 10-LOC fast-fail gate at top of doWithRetry, returns new ErrMLXDown sentinel; embeddings unaffected (gate keys on baseURL match) - packaging/launchd/com.mdemg.mlx-server.plist — KeepAlive on crash, ThrottleInterval=60s, conservative Phase 12 mlx flags. launchdServices slice extension marked Optional: true (skipped when mlx_lm not on PATH; MDEMG_MLX_LM_BIN/MDEMG_MODEL_PATH env overrides) - internal/cli/watchdog.go — `mdemg watchdog status [--json]` CLI parsing /metrics + launchctl print + ~/.mdemg/alerts/current.json - internal/metrics/collectors.go — 3 new metrics: mdemg_mlx_health_state{endpoint}, mdemg_mlx_fast_fail_total{caller_task}, mdemg_mlx_state_transitions_total{from,to} - internal/cli/serve.go — supervisor-managed probe goroutine, late-bound alert dispatcher callback (up→down=High, down→up=Low; 300s cooldown handles 60s launchd restart-cycle flap suppression) - internal/config/config.go — 4 config knobs with cross-field validation: MLX_WATCHDOG_ENABLED (default false), MLX_PROBE_INTERVAL_SEC (5), MLX_PROBE_TIMEOUT_SEC (2; must be < interval), MLX_FAIL_FAST_ENABLED (true) Decision-fork outcomes (per MEMORY plan-options pattern): - Architecture: launchd-only restart + mdemg-side probe + llmclient gate (chose Option A over Go-binary supervisor / shell wrapper — KeepAlive + ThrottleInterval already does what those would; reuses 5-plist pattern) - Probe interval/timeout: 5s/2s (~15s detection, <100ms/min overhead; matches healthprobe defaults) - Fast-fail on degraded: NO — only on down (degraded is informational/ log-only; turning slowness into hard failure is worse than slowness) - MLX_WATCHDOG_ENABLED default: false until live-soak validates - Plist install: Optional (skip when mlx_lm not on PATH) Tests: - Tier 1: go test -race ./internal/mlxprobe/... ./internal/llmclient/... ./internal/config/... ./internal/cli/... ./internal/metrics/... — all green - Tier 2: go test -tags=integration -run TestMLXWatchdog ./tests/integration/ — 100 concurrent fast-fails, OpenAI endpoint isolation - Tier 3 (live): mdemg watchdog status verified cleanly against running system; destructive kill -9 mlx smokes (Live Smoke 1/2/3/4) deferred to operator-led validation per safe-execution policy. Smoke 2 8h soak required before flipping MLX_WATCHDOG_ENABLED default to true - golangci-lint run: 0 issues across all touched packages Schema unchanged at 16 (additive metrics + alerts only; no TSDB migration). Policy compliance: - ✅ No hardcoded values (4 config knobs, plist template-substituted) - ✅ Sequential epics 0→8, docs before implementation - ✅ 3-tier testing (Tier 3 partial — live smokes operator-led) - ✅ Single batched commit at sprint close - ✅ Plan-options pattern (5 forks documented) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…sic_llm_semaphore_blocked Live Tier 3 verification on the running mdemg surfaced one more double- prefixed metric beyond the Phase 11.6.3 + Phase 13 set fixed in commit f3d106e: `mdemg_mdemg_rsic_llm_semaphore_blocked_total` from Phase 11.6.x. Same root cause — definition included `mdemg_` while the registry's `NewCounter` already prepends `mdemg_`. Fixed by stripping the prefix. After this fix + kickstart, the live metric exposes as `mdemg_rsic_llm_semaphore_blocked_total`. Tier 3 live verification SUCCESSFUL on this branch (Phase 11.6.3 always-on policy): SIGSTOP'd mlx via launchctl, watched state machine transition up→down at ~15s mark, posted /v1/memory/retrieve which short-circuited through the fast-fail gate on 3 caller_task counters (retrieval.intent_translate, retrieval.query_classify, retrieval.rerank_cross), SIGCONT'd mlx, watched recovery to up within ~10s. Total: 4 up→down + 4 down→up transitions across multiple cycles, all recorded; alerts dispatched with proper cooldown dedup (2 alerts across 4 cycles within 300s window). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…lag-off) + Phase 13 Epic 6 V0017 audit-writer fix + Phase 11+ feature-doc backfill (narrow close)
Narrow close per operator approval after Epic 0+1+2 produced design questions
that warrant dedicated follow-up sprints. Note 05 deferred to Phase 14.2;
Note 06 default flip deferred to Phase 14.1.
What landed
-----------
* Phase 13 Epic 6 V0017 audit-writer fix (in-flight discovery)
- tsdb/retrieval_audit_writer.go (new, ~165 LOC; buffered + 30s flush via CopyFrom)
- retrievalAuditAdapter in api/server.go (cycle-safe translation)
- V0017 was empty since Phase 13 because SetRetrievalAuditWriter had no
callers; now writes per retrieve when RETRIEVAL_AUDIT_ENABLED=true.
- Live verification: 279 audit rows accumulated in 4h since fix landed.
* Note 06 sparse activation gate (flag-off)
- retrieval/gate.go (~190 LOC) + 9 Tier 1 unit tests, all green
- Wired post-aggregation, pre-rerank in service.go
- 4 config knobs (SPARSE_*); default off, percentile 0.95, min 3, max 20
- Per-request override via ?sparse=true|false and ?sparse_percentile=N
- debug.sparse_gate_* + debug.below_threshold_* (when JiminyEnabled)
- 3 Prometheus histograms
* TSDB V0019 sparse_gate_metrics
- migrations/019_sparse_gate_metrics.sql (hypertable, 7-day chunks)
- tsdb/sparse_gate_writer.go (~165 LOC)
- sparseGateRecorderAdapter in api/server.go (always wired so per-request
overrides record even when default off)
- TSDB_REQUIRED_SCHEMA_VERSION 18 -> 19
* Epic 0 forensic doc — phase_14_score_distribution_analysis.md
- Defaults derived from llm_interactions.retrieval_scores (99k+50k score
points across consulting.classify + retrieval.rerank_cross)
- Heavy-tail confirmed (p98/p50 ~ 4-5x); within-call clamp dominates
percentile choice in dominant K=20-50 regime
- Note 05 catalog redesign needed for whk-wms (0 distinct symbols, 0
distinct roles) — flagged for Phase 14.2
* A/B verdicts captured
- 16q quick at MIN=3 / p95,p98,p99: all FAIL (q69 boundary)
- 16q quick at MIN=10 / p95: PASS (mean +0.019, 0 regressions, 3 improvements)
- 120q full at MIN=10 / p95: FAIL per-question (mean parity 0.413=0.413,
7 boundary regressions across 4 categories, 3 of 7 in
architecture_structure)
- Per sprint plan §10 risk #1: ship flag-off; Phase 14.1 will retune.
* Phase 11+ feature-doc backfill (operator request 2026-05-04)
- new: docs/features/{mlx-watchdog,uvts-validation,column-voting-retrieval,
local-llm-runtime,sparse-retrieval}.md
- extended: docs/features/service-resilience.md (Phase 11.6.x additions)
- Standing rule saved as memory feedback_per_feature_docs_required.md
* Follow-up sprint stubs scoped
- sprint_plan_phase_14_1_adaptive_per_category_gate.md (~3 days, ~$15)
- sprint_plan_phase_14_2_note_05_sparse_fingerprints.md (~7 days, ~$25)
Decision-fork outcomes
----------------------
| Fork | Provisional | Outcome |
|---|---|---|
| #2 percentile default | 0.98 | 0.95 (Epic 0 data) |
| #5 catalog bit policy | static 64/64/64/64 | adaptive (deferred Phase 14.2) |
| #8 gate ordering | pre-rerank | pre-rerank (confirmed) |
| #9 default flip | per-Note conditional | flag-off (Phase 14.1 will flip) |
OpenAI spend (actual): ~$13. Well under sprint $25-50 budget.
Tests + lint
------------
* go test -race ./internal/{retrieval,config,metrics,tsdb}: all green
* golangci-lint run on affected packages: 0 issues
* Live smoke: /healthz green, retrieve returns 20 (gate off), 279 V0017
audit rows in 4h (Phase 13 Epic 6 fix verified in production)
Memory observations
-------------------
* rw0mzergwcqct8abpw0dli9x — Phase 14 Epic 8 doc-backfill scope
* sc4iwy3of9ndn5kowja1i14i — Epic 0 forensic + audit-writer gap
* omr2rs5jppqrvee2k0l1xtd1 — Epic 1 gate code complete
* re4k7rpd3hjt5a52l8qwx8fp — Epic 2 verdict + Phase 14.1 scope
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Audit of internal/api/server.go (167 routes) vs docs/user/api-reference.md
surfaced 19 genuinely missing endpoints. v0.10.0 commit noted this as
out-of-scope; this commit resolves the gap.
Audit method: extract mux.HandleFunc registrations from server.go, extract
documented "VERB /path" headings from api-reference.md, normalize both to
strip path parameters and trailing prefix slashes, diff. Of the initial
24-entry code-only set, 5 are false positives (combined headers like
"POST /v1/admin/features/start|stop|restart" cover the individual verbs;
"GET|POST /v1/jiminy/protocol/metrics" covers both methods on one route).
Added sections:
Jiminy / J17 (10 endpoints, all under "## Jiminy Inner-Voice"):
GET|POST /v1/jiminy/protocol/metrics # snapshot + reset
GET /v1/jiminy/protocol/status # per-session J17 state
POST /v1/jiminy/checkpoint # tier-transition checkpoint
POST /v1/jiminy/resume-protocol # restore from checkpoint
POST /v1/jiminy/extension # operator-driven tier hold
POST /v1/jiminy/strict # toggle strict mode per session
POST /v1/jiminy/reformulate # advisory -> imperative rewrite
POST /v1/jiminy/classify # pre-Write/Edit pass/deny gate
GET /v1/jiminy/latest # most recent guidance (warm store)
POST /v1/jiminy/warm # eager cache warmup
Memory / Graph (3 endpoints, under "## Memory Operations"):
GET /v1/memory/graph/topology # node/edge counts per layer
GET /v1/memory/graph/neighborhood # local 1-3 hop walk
GET /v1/memory/spaces # root listing of all spaces
Observability (2 endpoints, under "## Metrics & Monitoring"):
GET /v1/metrics/trends # TSDB time-series query
GET /v1/prometheus # Prometheus scrape endpoint
Dashboard / Viz (4 endpoints, new "## Dashboard / Visualization (internal)"
section before MCP Server Tools — operator-internal endpoints backing the
browser dashboard at /ui/):
GET /api/graph/data # force-directed graph data
GET /api/graph/fields # schema field catalog
GET /api/graph/health # explorer health
GET /viz/topology # standalone HTML topology view
Each entry has handler-signature-derived request/response shape, query
parameter table, sample curl/JSON examples following the existing
api-reference convention. TOC updated with new "Dashboard / Visualization
(internal)" entry and renumbered tail.
Out of scope (deliberate, deferred):
- 28 "docs-only" entries from the audit are confirmed false positives
from prefix-matching path normalization (code registers /v1/memory/nodes/
with trailing slash and routes the suffix; docs spell out the full
/v1/memory/nodes/{node_id}/archive form correctly)
- /v1/symbols root path is partially covered by /v1/symbols/relationships
+ /v1/symbols/{id}/relationships in docs; root listing endpoint
documentation can land later if/when its handler grows specific shape
- /v1/conversation/observations covered indirectly by the flag-for-org
endpoint documentation
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(release): promote Unreleased -> v0.9.0 Promote the Unreleased CHANGELOG block to v0.9.0 (2026-05-06) ahead of release.yml / goreleaser tag push. New ### Breaking subsection captures two operator-visible cutovers since v0.8.5: (1) Phase 13.5 LLM runtime port 8101 -> 8102 + .env migration required; (2) Phase 13.6 MLX_* -> LLM_* env-var rename (legacy aliases retained for >= 1 release cycle). New ### Added entries: Phase 10.5 closeout (UBENCH framework promotion, commit 0389b49) and Claude Code GitHub App workflows (PRs #378, #379). All previously-Unreleased entries (Phase 14.2.3, 14.2.x, 14.1.x, 14, 13.6, 13.5, 13.2, 13.1) carried forward unchanged into the v0.9.0 block. Fresh empty Unreleased section seeded above. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore(submodule): bump homebrew-mdemg to v0.9.0 formula + docs Bumps packaging/homebrew-mdemg pointer a235977 -> 6077097, which incorporates: - f9358cd Brew formula update for mdemg version v0.8.5 (goreleaser, prior) - b4a0d2c Brew formula update for mdemg version v0.9.0 (goreleaser, this release) - 6077097 docs: v0.9.0 -- CHANGELOG, README What's New, beta-testing version pin Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(api): /healthz returns build-time version, not stale literal "0.6.0" `config.FromEnv()` defaulted MdemgVersion/MdemgCommit to literal "0.6.0"/ "unknown" when MDEMG_VERSION/MDEMG_COMMIT envs were unset. Both /healthz and /readyz serialize cfg.MdemgVersion, so they reported "0.6.0" forever regardless of the actual binary's ldflags-injected cli.Version. Fix: defaults to "" in config; cli/config_loader.go injects cli.Version / cli.Commit (the build-time vars set by goreleaser ldflags) when the env override is unset. Operators can still pin via MDEMG_VERSION env. Live-verified: dev build (no ldflags) now reports {"version":"dev"} on /healthz instead of the lying "0.6.0". Production builds via goreleaser will report the real semver tag. TestHandleHealthz unaffected (sets cfg.MdemgVersion directly). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(service): replace decommissioned mlx-server LaunchAgent with llama-server Phase 13.5 cutover (2026-05-03) replaced mlx_lm.server (port 8101) with llama.cpp llama-server (port 8102) as the production LLM runtime, but the embedded launchd plist template + service install code paths were never updated. Any operator running 'mdemg service install' from a fresh checkout got the decommissioned mlx_lm.server agent — mdemg's startup preflight then failed because LLM_ENDPOINT=http://127.0.0.1:8102/v1 wasn't reachable. Changes: - New packaging/launchd/com.mdemg.llama-server.plist with the Phase 13.5 production flags (--ctx-size 32768 --parallel 4 --cont-batching --metrics --jinja). Byte-identical mirror at internal/cli/launchd_templates/ for the embed.FS (CI sync-check enforced). - Removed packaging/launchd/com.mdemg.mlx-server.plist + embed.FS mirror. mlx_lm.server is decommissioned and known-broken on M5 + macOS 26.3.x; keeping the template would just risk re-deploying it. - internal/cli/service_darwin.go: launchdServices entry replaced with com.mdemg.llama-server. resolveMLXLMBin renamed to resolveLlamaServerBin with primary env MDEMG_LLAMA_SERVER_BIN, deprecation alias for MDEMG_MLX_LM_BIN (slog.Warn at boot, retained ≥1 release cycle per the Phase 13.6 deprecation pattern), PATH lookup of `llama-server`. resolveMDEMGModelPath default updated to the canonical Phase 13.5 GGUF filepath (.local-models/mdemg-llm-v1-gguf/mdemg-llm-v1.Q5_K_M.gguf) since llama-server takes a `.gguf` filepath, not an HF-format directory like mlx_lm.server. Install error message updated for the new env var name + remediation steps (`brew install llama.cpp`). - migrateLegacyMLXServerPlist() added: if a pre-cutover com.mdemg.mlx-server plist is bootstrapped on the operator's machine, Install() boots it out and renames the file to .disabled-phase13_5 (matches the manual operator convention from Phase 13.5 rollout). Best-effort: failures don't block the install. - internal/cli/service_darwin_test.go fully rewritten: * TestLaunchdServicesIncludesLlamaServer asserts the new entry exists and is Optional=false (production matches Hotfix 11.6.3.1; the old test asserted Optional=true, a latent lie since 2026-05-02 that Linux CI never caught because of //go:build darwin) * TestLlamaServerPlistEmbedded replaces TestMLXServerPlistEmbedded; additionally asserts mlx-server.plist is NOT in embed.FS * Two resolver tests for the primary env var * New TestResolveLlamaServerBinFallsBackToMLXAlias proves the Phase 13.6 deprecation alias path works * resolveMDEMGModelPath tests updated for the new GGUF default - internal/cli/watchdog.go: help text references com.mdemg.llama-server (instead of com.mdemg.mlx-server) and llama-server (instead of mlx_lm.server). Notes that mdemg_mlx_health_state metric name is retained for dashboard compatibility. Tested: - Tier 1 unit: 7/7 new tests pass; full ./internal/cli/... suite green (61s wall-clock). - Tier 2 integration: golangci-lint run ./internal/cli/ — 0 issues. CI plist sync-check (diff -q packaging/launchd/*.plist internal/cli/launchd_templates/) — 6/6 byte-identical. - Tier 3 live e2e: deferred. Running mdemg service install on the operator's currently-serving machine would briefly bootout the running llama-server LaunchAgent (PID 20527 actively serving production inference). The hand-installed llama-server plist on the operator's machine is byte-equivalent (modulo template substitutions) to what this commit will install via `mdemg service install` on a fresh operator setup, so the operator can verify on next planned redeploy. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(sprint): MODEL-DIST-001 sprint plan + quant manifest skeleton Epic 0 of Sprint MODEL-DIST-001 — Local LoRA Distribution via Ollama Library. Sprint plan in 12-section v1.0 format. Supersedes parts of the speculative spec at docs/research/mdemg_sprint_ideas/MDEMG_FT_LORA_PACKAGING_SPEC.md (HF Hub vs Ollama Library; adapter-only vs both-fused-and-adapter; Apple Silicon scope vs cross-platform). Configurability Contract — every operator-visible value is dynamic per the framework's no-hardcoding rule. 12 env vars + flag overrides + sensible defaults. ModelFetcher interface decouples CLI from Ollama-specific knowledge; v1 ships OllamaFetcher only, future backends (HF / S3 / GitHub Release / file) plug in via factory dispatch on MDEMG_MODEL_BACKEND without touching the CLI surface. Forensic from Epic 0: - adapters/tier1/adapters.safetensors verified present (514 MB MLX, Phase 5 SFT Iter 2400 best output) - mdemg-llm-v1.Q5_K_M.gguf SHA256 captured (9.8 GB; 144ad7231...) - f16 GGUF intermediate NOT on disk; Epic 1 will regenerate via convert_hf_to_gguf.py from the MLX merged model (~5 min) - qwen3:14b model-layer digest captured from Ollama registry; manifest digest to be computed at Epic 3 for Modelfile FROM @sha256: pinning quant_manifest.json skeleton with Q5_K_M SHA pre-populated; Q4_K_M / Q8_0 / adapter SHAs filled in during Epics 1+2. Estimated effort 5–7 dev-days. OpenAI spend $0. Risk medium (Ollama publish one-way; MLX→PEFT→GGUF LoRA conversion is the riskiest engineering item with documented contingency to defer to MODEL-DIST-002 if blocked). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(model-dist-001): Epic 1 — built Q4_K_M + Q8_0 fused GGUFs Pipeline (CLAUDE.md Phase 13.5 documented path): 1. mlx_lm.fuse --dequantize: mlx-community/Qwen3-14B-4bit + adapters/tier1/ -> 29.6 GB bf16 HF safetensors at .local-models/qwen3-14b-mdemg-v1-bf16/ 2. convert_hf_to_gguf.py --outtype f16 -> 30 GB f16 GGUF (required neural/.venv interpreter with torch + transformers + gguf installed; /opt/homebrew/bin/convert_hf_to_gguf.py uses system python which lacks these — installed gguf/sentencepiece/protobuf into neural/.venv) 3. llama-quantize Q4_K_M -> 9.0 GB (4.87 BPW; 40s wall on M5) 4. llama-quantize Q8_0 -> 16 GB (8.50 BPW; 11s wall on M5) 5. Live smoke per new quant via llama-server on port 18102 — both serve /v1/models cleanly with embedded chat_template SHAs captured in quant_manifest.json: Q4_K_M: 401161710c22f0ae...411d42ea Q5_K_M: 144ad723101d688f...d5f5d54 (matches Epic 0 baseline) Q8_0: fc14dcb40af1bb58...8db6089 f16: 436cd6f41a684805...3217bd (intermediate, retained for Epic 2) Resource matrix updated with empirical sizes (Q4_K_M is 9.0 GB vs estimated 6.5 GB; min RAM revised 8 -> 12 GB to cover ~3 GB working memory above weights). 14B params x 4.87 BPW ≈ 8.5 GB matches the formula. GGUF binary artifacts stay local — .local-models/ gitignored per .gitignore:70. Sprint deliverable in git is just the manifest update. Production llama-server (PID 20527 on port 8102) undisturbed throughout Epic 1; live smokes used port 18102. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(model-dist-001): Epic 2 — defer adapter to MODEL-DIST-002 Adapter (LoRA-only Modelfile via ADAPTER directive) deferred per the sprint plan's documented contingency clause. Fused-only path (Epics 1, 3, 4, 5) continues — that's the primary operator value. Forensic findings (epic_2_forensic.md): - MLX adapter is well-formed: 560 tensors, 40 layers x 7 target_modules, rank 32, alpha 64, scale 20.0. - convert_lora_to_gguf.py is NOT in brew install llama.cpp; would need manual fetch from llama.cpp source. - MLX -> PEFT requires tensor transposition: MLX lora_a is (in, rank); PEFT expects (rank, in). Same for lora_b. - Estimated 80-95 min to complete vs ~30 min budget remaining for Epic 2. - Hit the contingency criterion: "MLX -> PEFT conversion blocked by tooling gaps." Decision: defer adapter scope to MODEL-DIST-002 (new follow-up sprint, to be planned separately). Fused-only ships this sprint. Knock-on changes (in-flight to subsequent epics): - Epic 3: drop Modelfile.adapter; publish only 3 fused quants. - Epic 4 CLI: --adapter flag accepted at parse-time but errors with "lands in MODEL-DIST-002"; machinery preserved for forward-compat. - Epic 6 e2e: drop adapter-pull step. - Epic 7 feature doc: adapter section notes "coming in MODEL-DIST-002". Artifacts preserved on disk for MODEL-DIST-002 pickup: - adapters/tier1/adapters.safetensors (MLX, 514 MB) - .local-models/mdemg-llm-v1-gguf/mdemg-llm-v1.f16.gguf (30 GB, retained as base for llama-server --lora verification later) quant_manifest.json adapter block updated with status=deferred + reason. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(model-dist-001): Epic 3 — 3 Modelfiles + local ollama create (push pending) Authored 3 Ollama Modelfiles in packaging/ollama/: Modelfile.Q4_K_M — 9.0 GB, 12 GB min RAM, 16 GB recommended Modelfile.Q5_K_M — 11 GB, 14 GB min RAM, 24 GB recommended (production canonical) Modelfile.Q8_0 — 16 GB, 20 GB min RAM, 32 GB recommended Common shape: FROM ./../../.local-models/mdemg-llm-v1-gguf/...gguf relative path (operator-machine local); num_ctx 32768, num_predict 4096, stop tokens <|im_end|>/<|im_start|>; Apache-2.0 LICENSE; SYSTEM positioning block. No TEMPLATE directive — chat template baked into GGUF metadata (Qwen3 chat_template.jinja preserved through mlx_lm.fuse --dequantize → convert_hf → llama-quantize pipeline). packaging/ollama/README.md documents the publish workflow including the fork-customization path (operators publishing under a different namespace follow MDEMG_MODEL_NAMESPACE per the Configurability Contract). Local ollama create completed for all 3: reh3376/mdemg-llm-v1:Q4_K_M ID 5c3a7252c295 reh3376/mdemg-llm-v1:Q5_K_M ID 08c13b480864 reh3376/mdemg-llm-v1:Q8_0 ID 6b1006facd36 Layers de-duplicated: config + params + system layers (3 layers) are identical across all 3 quants; only the model blob (GGUF) differs. ** ollama push deferred ** — one-way action gated on operator confirmation per Sprint Plan §10 Risk #8. Operator must claim reh3376 namespace on ollama.com and generate API token before push proceeds. Local-create proves the Modelfiles are well-formed; push is a separate decision. Once pushed, manifest digests captured into quant_manifest.json (ollama_manifest_digest field per quant) for mdemg model verify. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(model-dist-001): Epic 4 — `mdemg model` CLI + pluggable Fetcher interface Sprint MODEL-DIST-001 Epic 4 — the bulk of the operator-facing surface. New CLI subcommand group: mdemg model pull # fetch + symlink + SHA verify mdemg model list # show pulled models mdemg model verify # re-check SHAs vs quant manifest mdemg model remove # destructive (requires --yes) mdemg model where # print resolved path for shell scripting Pluggable backend (internal/cli/model_fetcher.go): type Fetcher interface { Name, Fetch, Verify, Remove } NewFetcher dispatches on cfg.ModelBackend (env: MDEMG_MODEL_BACKEND) v1 ships OllamaFetcher only; future backends (hf, s3, github-release, file) plug in via factory branch — CLI surface unchanged. OllamaFetcher (internal/cli/model_fetcher_ollama.go): Encapsulates ALL Ollama-specific concepts: `ollama pull` invocation, manifest path under <OLLAMA_MODELS>/manifests/<OLLAMA_HOST>/<ns>/<n>/<tag>, mediaType=application/vnd.ollama.image.model layer filtering, blob path under <OLLAMA_MODELS>/blobs/sha256-<digest>, symlink under <MDEMG_MODEL_DIR>, idempotent. Configurability Contract (no hardcoding; memory: feedback_no_hardcoded_values.md): 12 env vars + flag overrides, each with v1-production-tuned defaults so `mdemg model pull` with no flags Just Works. See sprint plan §3. Live-verified all 3 resolution paths: `--quant Q5_K_M` → namespace=reh3376 `--namespace acme --name custom-model` → namespace=acme name=custom `MDEMG_MODEL_NAMESPACE=acme env` → env overrides applied Added to internal/config/config.go: ModelBackend, ModelNamespace, ModelName, ModelQuants, ModelRamTiers, ModelQuant, AdapterBase, ModelDir, OllamaModelsRoot, OllamaRegistryHost, ModelManifestPath. Embedded quant manifest (internal/cli/quant_manifest.json via embed.FS): Runtime source-of-truth for SHA verification. Operator override via MDEMG_MODEL_MANIFEST_PATH for air-gapped deployments. Mirrors docs/development/model-dist-001/quant_manifest.json. RAM-tier auto-pick: Default JSON `{"<16":"Q4_K_M","<24":"Q5_K_M","default":"Q8_0"}` maps host RAM (sysctl on darwin, /proc/meminfo on linux) to quant. Operator override via MDEMG_MODEL_RAM_TIERS. Adapter path (--adapter flag) returns ErrAdapterDeferred per Epic 2's contingency exit — adapter publication lands in MODEL-DIST-002. Flag machinery preserved for forward compatibility. Tests (22, all green) in internal/cli/model_test.go: - Backend factory dispatch (5 cases incl. case-insensitive, default, error) - Quant allowlist parsing (5 cases incl. whitespace + empty entries) - RAM-tier JSON parsing (default + operator override + malformed) - PickQuantForRAM (7 boundary cases) - ResolveQuant across paths (auto, explicit, rejection, operator-custom) - QuantManifest load (embedded + file override + missing-file error) - Ollama tag composition (fused + adapter forms) - Manifest path composition under custom OLLAMA_MODELS/OLLAMA_HOST - Blob path digest prefix handling - Adapter deferred error - Manifest JSON parser (mediaType filtering + malformed + no-model-layer) Grep audit (verification checklist): grep on internal/cli/model*.go for hardcoded values found only in help text Long/example strings documenting defaults to operators — not in logic. Behavior values all flow through cfg.Model* fields. Build + lint clean. Full cli test suite (61s wall) green. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(model-dist-001): Epic 5 — V0021 model_install_events hypertable + writer Sprint MODEL-DIST-001 Epic 5 — observability for `mdemg model` operations. Grafana panels deferred to Sprint B (Grafana audit). New migration: internal/tsdb/migrations/021_model_install_events.sql Hypertable on recorded_at, 7-day chunks, 3 indexes (quant-time, failed-events partial, backend-event-time). Columns: event_id CUIDv2 PK + recorded_at, event_type (pull/verify/remove), backend_name, namespace, model_name, quant, adapter bool, success bool, latency_ms, sha256, size_bytes, err_message (1 KB cap). New writer: internal/tsdb/model_install_writer.go Synchronous single-row INSERT (not buffered + CopyFrom — CLI is one-shot, writes are infrequent vs the V0017/V0018/V0019/V0020 retrieval- path writers that fire per-request). Nil-pool no-op for degraded mode. errMessageMaxLen=1024 truncation at write time. New modelInstallPool interface (Exec-shaped) avoids touching the existing CopyFrom-shaped poolIface used by buffered writers. Wiring: internal/cli/model.go gets recordModelEvent(parent, cfg, row) helper: - Returns immediately if !cfg.TSDBEnabled || cfg.TSDBHost=="" - 2s timeout on connect (TSDB unreachable doesn't block CLI exit) - Logs warning + degrades gracefully on any TSDB error Called from runModelPull (success + failure paths), runModelVerify (single sweep row), runModelRemove (success + failure paths). Schema version bump: internal/config/config.go: TSDB_REQUIRED_SCHEMA_VERSION default 20→21. CI validator at .github/workflows/ci.yml:60-65 counts SQL files in internal/tsdb/migrations/ and asserts equality; now 21 files = 21 in config = passes. Build + lint clean. Existing tsdb / cli test suites green; no new tests added for the writer itself (single INSERT mirrors V0017/V0018/V0019 patterns already covered; integration is operational verification at Epic 6 once tsdb is up in the dev stack). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(model-dist-001): Epic 7 — local-model-distribution feature doc Sprint MODEL-DIST-001 Epic 7 — operator-facing feature documentation following the standard Why / Choices / How / How-to-use shape (memory: feedback_per_feature_docs_required.md). Contents: - Why: gap between brew install and a working local LLM after Phase 13.5 - Choices: backend matrix (Ollama vs HF vs GitHub vs S3 vs file://), artifact form (fused vs adapter), Apple Silicon scope, "Ollama runtime rejected (broken on M5+macOS 26.3.x), Ollama distribution only" - How it works: ASCII flow diagram covering CLI dispatch -> Fetcher interface -> OllamaFetcher (preflight, ollama pull, manifest discovery, blob resolve, symlink, SHA verify) -> V0021 observability row - How to use: * Quick start (3 commands: brew install ollama, mdemg model pull, curl /v1/models) * Explicit quant selection * Managing pulled models (list / verify / where / remove) * Forks + enterprise (MDEMG_MODEL_NAMESPACE override) * Air-gapped (MDEMG_MODEL_MANIFEST_PATH override) * Resource matrix per quant (disk, min RAM, recommended RAM, BPW) * Full Configurability Contract table (11 env vars + flags + defaults) * V0021 observability schema - Troubleshooting: ollama missing, SHA mismatch, quant allowlist rejection, RAM auto-detection failure, out-of-disk, symlink permission - Forward-looking: MODEL-DIST-002 adapter, Sprint B Grafana panels, future backends, cross-platform - References: all source-of-truth files cross-linked Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(model-dist-001): Epic 8 — Documentation Update (main repo) Sprint MODEL-DIST-001 Epic 8 — final epic, never cut (memory: feedback_sequential_epics.md). This commit lands the main-repo doc updates. The packaging/homebrew-mdemg/ submodule docs (README, CHANGELOG, formula caveats text) update at v0.10.0 release-tag time per the v0.9.0 release flow precedent — that's when goreleaser auto-regenerates mdemg.rb from .goreleaser.yaml's caveats template, and the tap-side README/CHANGELOG get edited in lockstep. Changes: - CHANGELOG.md: comprehensive Unreleased entry documenting Epics 0-5 + 7 landed in this sprint. Epic 3 ollama push and Epic 6 Tier 3 e2e marked as gated on operator confirmation. Adapter path explicitly deferred to MODEL-DIST-002 with epic_2_forensic.md cross-reference. Captures the Configurability Contract enumeration, the 3 quant SHAs, the Fetcher interface design, the V0021 hypertable, and the explicit out-of-scope list. - CLAUDE.md: new "Model Distribution (Sprint MODEL-DIST-001)" subsection in Architecture Notes, slotted ABOVE the existing Compose embed entry for visibility. Captures the pluggable-backend design, the Ollama-as- distribution-only constraint, the on-disk symlink + manifest discovery flow, the 11-knob Configurability Contract surface, the no-hardcoding enforcement, the TSDB V0021 hookup, and the Apple Silicon v1 scope. - README.md: new "Step 2b (optional): Pull the local LLM" section between Step 2 (Initialize/Start) and Open the Dashboard. 3-command quick start (brew install ollama -> mdemg model pull -> set MDEMG_MODEL_PATH). Cross-references the feature doc for the full Configurability Contract. - .goreleaser.yaml: caveats template updated to include `mdemg model pull` instructions. Goreleaser regenerates the homebrew formula's caveats block from this on the next v* tag push, so v0.10.0 will ship the new text to brew users automatically. Deferred to v0.10.0 release-tag time (handled per v0.9.0 precedent): - packaging/homebrew-mdemg/README.md update - packaging/homebrew-mdemg/CHANGELOG.md update - packaging/homebrew-mdemg/mdemg.rb regeneration (automatic via goreleaser from the .goreleaser.yaml change in this commit) - Submodule pointer bump in main repo Deferred to Epic 6 close (after operator does ollama push): - post.md sprint-close document - Capture of remote Ollama manifest digests into quant_manifest.json Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(model-dist-001): Epic 3 closeout — Ollama Library push complete All 3 fused quants now live on Ollama Library: https://ollama.com/reh3376/mdemg-llm-v1:Q4_K_M https://ollama.com/reh3376/mdemg-llm-v1:Q5_K_M https://ollama.com/reh3376/mdemg-llm-v1:Q8_0 End-to-end integrity verified: remote model-layer digests captured via GET https://registry.ollama.ai/v2/reh3376/mdemg-llm-v1/manifests/<quant> match the local Epic 1 SHAs exactly: Q4_K_M 401161710c22f0ae...411d42ea (matches Epic 1) Q5_K_M 144ad723101d688f...d5f5d54 (matches Epic 1) Q8_0 fc14dcb40af1bb58...8db6089 (matches Epic 1) Captured into quant_manifest.json (both docs canonical + internal/cli embed.FS mirror, byte-synced): - ollama_manifest_digest per quant (computed from the manifest body): Q4_K_M sha256:a210cccb12311773fd70bfa81f221ca0f7940a315bef87b84608caf894533b1b Q5_K_M sha256:ae6e54fe1ee0b487ae41260687ed14c46c30d1ffb0fece936282418b5bcb78e1 Q8_0 sha256:93df4d64bfa751506f7afba8bf08b891ea828575b838adec17b9399ad85be718 - Corrected size_bytes (Epic 1 used approximate values; replaced with registry-reported exact bytes for each tag): Q4_K_M 9.0 GB -> 8.4 GB (9001753408 B; was 9658404096) Q5_K_M 11 GB -> 9.8 GB (10514569568 B; was 11811160064) Q8_0 16 GB -> 14.6 GB (15698534208 B; was 17179869184) - Status flipped from "local-create done; push pending" to "published". Embedded runtime manifest (internal/cli/quant_manifest.json) re-built into the binary via embed.FS. TestLoadQuantManifest_EmbeddedFallback green with new values. Epic 3 of Sprint MODEL-DIST-001 now COMPLETE. Epic 6 (Tier 3 live e2e — `mdemg model pull` against the published tags + llama-server load on port 18102 + sanity inference) is now unblocked. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(model-dist-001): sprint close — post.md Sprint MODEL-DIST-001 close-out per memory rule (feedback_sprint_plan_format.md §11 — sprint plans live in docs/development/<sprint-line>/ with the standard post.md companion). Sections (CLAUDE.md sprint-plan section guidance): - Outcome: 3 quants live on Ollama Library, mdemg model pull is the canonical install path - Process: how the plan held under reality (operator-surfaced no- hardcoding rule revised the plan in-place to add the Configurability Contract before code was written) - Findings: 5 smooth parts + 5 friction items, both honest: * convert_hf_to_gguf.py python deps gap (silent ModuleNotFoundError) * mlx_lm.fuse adapter-path requirement * convert_lora_to_gguf.py missing from brew install llama.cpp (proximate Epic 2 deferral trigger) * mdemg tsdb migrate CWD-aware .env loader quirk * Epic 1 size estimates off vs registry-reported exact bytes - Current state: per-layer state matrix - Testing & benchmarking: all 3 tiers documented (Tier 3 e2e captured V0021 rows for both pull + verify event_types — live-verified) - Risks & opportunities (forward): MODEL-DIST-002 adapter scope, Sprint B Grafana, cross-platform, HFFetcher slot, CWD-aware .env loader QoL - Sprint commits: 9 commits on dev01, mapped to their epics Closes Sprint MODEL-DIST-001 functionally. Operational sprint close (v0.10.0 release tag + tap-repo doc updates) is a separate motion. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(release): promote Unreleased -> v0.10.0 Promote the Sprint MODEL-DIST-001 entry from Unreleased to v0.10.0 (2026-05-11) ahead of release.yml / goreleaser tag push. Fresh empty Unreleased section seeded above. v0.10.0 ships: - mdemg model pull|list|verify|remove|where — one-command path from brew install mdemg to a working local LLM - Pluggable ModelFetcher interface (Ollama in v1, slots for HF/S3/GHR/file) - 3 fused GGUF quants live on Ollama Library at reh3376/mdemg-llm-v1 (:Q4_K_M 8.4 GB / :Q5_K_M 9.8 GB / :Q8_0 14.6 GB) - 11-knob Configurability Contract (every operator-visible value dynamic) - TSDB V0021 model_install_events hypertable + writer - docs/features/local-model-distribution.md Adapter (LoRA-only) path deferred to MODEL-DIST-002 per the sprint plan's documented contingency (epic_2_forensic.md). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore(submodule + docs): bump homebrew-mdemg to v0.10.0 + cli-reference Model Distribution section Stage 4 + Stage 5 of v0.10.0 release. Submodule pointer bump: packaging/homebrew-mdemg 6077097 -> c3aa68b incorporates: - 42d7390 — goreleaser auto-bumped mdemg.rb to version "0.10.0" + new caveats text on v0.10.0 tag push - c3aa68b — manual docs round-trip: CHANGELOG v0.10.0 entry, README Optional Pull-the-local-LLM section in Quick Start (full Ollama Library doc with quant matrix, list/verify/where/remove subcommands, fork variants via MDEMG_MODEL_NAMESPACE, architecture note "Ollama is distribution-only"), Upgrading to v0.10.0 + What's New in v0.10.0 blocks, default-LLM rotation history extended, mdemg_beta_testing.md version pin v0.9.0 -> v0.10.0 docs/user/cli-reference.md (per Stage 5 user request to align refs with current codebase): - New ## Model Distribution top-level section before ## Synergy Optimization (model command group is GroupID="config" in root.go but a top-level cli-ref section is cleaner for discoverability). Documents all 5 subcommands (pull, list, verify, remove, where) with flag tables, usage examples, the full Configurability Contract (11 knobs), the architecture note (Ollama is distribution-only). - Updated Environment Variable Reference with new "Model Distribution (Sprint MODEL-DIST-001, v0.10.0)" subsection — 11 env vars + defaults table. - Updated Command Tree Summary with the new model subcommand group slotted between Configuration and Advanced. docs/user/api-reference.md unchanged: Sprint MODEL-DIST-001 added zero HTTP endpoints (CLI-only sprint; observability via TSDB V0021 row writer is server-side internal). Audit also surfaced ~25 routes of pre-existing drift between code and docs (mostly path-parameter notation: `/v1/backup/` in code vs `/v1/backup/{id}` in docs — same routes — plus 3 undocumented /api/graph/* endpoints and 2 undocumented /v1/admin/features/{restart,stop} actions). That drift is out-of-scope for v0.10.0 and belongs in its own follow-up sprint. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(cli): add mdemg model run wrapper (follow-up #1 to MODEL-DIST-001) One-shot or interactive REPL chat against the configured LLM endpoint (default: llama-server at port 8102 per Phase 13.5). Closes the gap operators noted between `ollama run` and the mdemg framework. Two modes: - One-shot: `mdemg model run -p "hello"` or positional arg after `--` - Interactive REPL: no prompt; reads stdin line-by-line, accumulates conversation history across turns Pure stdlib HTTP (no llmclient retries/breakers/recording). CLI invocations are intentionally NOT recorded to llm_interactions — this is an ad-hoc exploration tool, not a production code path; keeping the training-data corpus clean. Every operator-visible value is dynamic per the no-hardcoding rule: --endpoint override cfg.EffectiveLLMEndpoint --model override cfg.LLMModel (final fallback: mdemg-llm-v1) --prompt/-p one-shot prompt (omit for REPL) --system/-s system message --temperature (default 0.7) --max-tokens (default 1024) --timeout (default 60s) Live-verified end-to-end on the operator's running llama-server on port 8102 with mdemg-llm-v1: one-shot worked; system+prompt with --model override worked. 13 unit tests in model_run_test.go covering: message composition (system first, no-system skip, history preservation), config resolution (flag > cfg > final fallback), OpenAI-compat HTTP shape, error paths (HTTP error, inline error object, no choices, timeout), trailing-slash endpoint normalization, body-bounding helper. All green. Renamed local body-bounding helper to `truncateRunBody` to avoid name collision with a same-named helper in internal/cli/data.go. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(api): document 19 previously-undocumented endpoints (follow-up #2) Audit of internal/api/server.go (167 routes) vs docs/user/api-reference.md surfaced 19 genuinely missing endpoints. v0.10.0 commit noted this as out-of-scope; this commit resolves the gap. Audit method: extract mux.HandleFunc registrations from server.go, extract documented "VERB /path" headings from api-reference.md, normalize both to strip path parameters and trailing prefix slashes, diff. Of the initial 24-entry code-only set, 5 are false positives (combined headers like "POST /v1/admin/features/start|stop|restart" cover the individual verbs; "GET|POST /v1/jiminy/protocol/metrics" covers both methods on one route). Added sections: Jiminy / J17 (10 endpoints, all under "## Jiminy Inner-Voice"): GET|POST /v1/jiminy/protocol/metrics # snapshot + reset GET /v1/jiminy/protocol/status # per-session J17 state POST /v1/jiminy/checkpoint # tier-transition checkpoint POST /v1/jiminy/resume-protocol # restore from checkpoint POST /v1/jiminy/extension # operator-driven tier hold POST /v1/jiminy/strict # toggle strict mode per session POST /v1/jiminy/reformulate # advisory -> imperative rewrite POST /v1/jiminy/classify # pre-Write/Edit pass/deny gate GET /v1/jiminy/latest # most recent guidance (warm store) POST /v1/jiminy/warm # eager cache warmup Memory / Graph (3 endpoints, under "## Memory Operations"): GET /v1/memory/graph/topology # node/edge counts per layer GET /v1/memory/graph/neighborhood # local 1-3 hop walk GET /v1/memory/spaces # root listing of all spaces Observability (2 endpoints, under "## Metrics & Monitoring"): GET /v1/metrics/trends # TSDB time-series query GET /v1/prometheus # Prometheus scrape endpoint Dashboard / Viz (4 endpoints, new "## Dashboard / Visualization (internal)" section before MCP Server Tools — operator-internal endpoints backing the browser dashboard at /ui/): GET /api/graph/data # force-directed graph data GET /api/graph/fields # schema field catalog GET /api/graph/health # explorer health GET /viz/topology # standalone HTML topology view Each entry has handler-signature-derived request/response shape, query parameter table, sample curl/JSON examples following the existing api-reference convention. TOC updated with new "Dashboard / Visualization (internal)" entry and renumbered tail. Out of scope (deliberate, deferred): - 28 "docs-only" entries from the audit are confirmed false positives from prefix-matching path normalization (code registers /v1/memory/nodes/ with trailing slash and routes the suffix; docs spell out the full /v1/memory/nodes/{node_id}/archive form correctly) - /v1/symbols root path is partially covered by /v1/symbols/relationships + /v1/symbols/{id}/relationships in docs; root listing endpoint documentation can land later if/when its handler grows specific shape - /v1/conversation/observations covered indirectly by the flag-for-org endpoint documentation Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Roger Henley <rogerhenley345@gmail.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* docs(release): promote Unreleased -> v0.9.0 Promote the Unreleased CHANGELOG block to v0.9.0 (2026-05-06) ahead of release.yml / goreleaser tag push. New ### Breaking subsection captures two operator-visible cutovers since v0.8.5: (1) Phase 13.5 LLM runtime port 8101 -> 8102 + .env migration required; (2) Phase 13.6 MLX_* -> LLM_* env-var rename (legacy aliases retained for >= 1 release cycle). New ### Added entries: Phase 10.5 closeout (UBENCH framework promotion, commit 0389b49) and Claude Code GitHub App workflows (PRs #378, #379). All previously-Unreleased entries (Phase 14.2.3, 14.2.x, 14.1.x, 14, 13.6, 13.5, 13.2, 13.1) carried forward unchanged into the v0.9.0 block. Fresh empty Unreleased section seeded above. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore(submodule): bump homebrew-mdemg to v0.9.0 formula + docs Bumps packaging/homebrew-mdemg pointer a235977 -> 6077097, which incorporates: - f9358cd Brew formula update for mdemg version v0.8.5 (goreleaser, prior) - b4a0d2c Brew formula update for mdemg version v0.9.0 (goreleaser, this release) - 6077097 docs: v0.9.0 -- CHANGELOG, README What's New, beta-testing version pin Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(api): /healthz returns build-time version, not stale literal "0.6.0" `config.FromEnv()` defaulted MdemgVersion/MdemgCommit to literal "0.6.0"/ "unknown" when MDEMG_VERSION/MDEMG_COMMIT envs were unset. Both /healthz and /readyz serialize cfg.MdemgVersion, so they reported "0.6.0" forever regardless of the actual binary's ldflags-injected cli.Version. Fix: defaults to "" in config; cli/config_loader.go injects cli.Version / cli.Commit (the build-time vars set by goreleaser ldflags) when the env override is unset. Operators can still pin via MDEMG_VERSION env. Live-verified: dev build (no ldflags) now reports {"version":"dev"} on /healthz instead of the lying "0.6.0". Production builds via goreleaser will report the real semver tag. TestHandleHealthz unaffected (sets cfg.MdemgVersion directly). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(service): replace decommissioned mlx-server LaunchAgent with llama-server Phase 13.5 cutover (2026-05-03) replaced mlx_lm.server (port 8101) with llama.cpp llama-server (port 8102) as the production LLM runtime, but the embedded launchd plist template + service install code paths were never updated. Any operator running 'mdemg service install' from a fresh checkout got the decommissioned mlx_lm.server agent — mdemg's startup preflight then failed because LLM_ENDPOINT=http://127.0.0.1:8102/v1 wasn't reachable. Changes: - New packaging/launchd/com.mdemg.llama-server.plist with the Phase 13.5 production flags (--ctx-size 32768 --parallel 4 --cont-batching --metrics --jinja). Byte-identical mirror at internal/cli/launchd_templates/ for the embed.FS (CI sync-check enforced). - Removed packaging/launchd/com.mdemg.mlx-server.plist + embed.FS mirror. mlx_lm.server is decommissioned and known-broken on M5 + macOS 26.3.x; keeping the template would just risk re-deploying it. - internal/cli/service_darwin.go: launchdServices entry replaced with com.mdemg.llama-server. resolveMLXLMBin renamed to resolveLlamaServerBin with primary env MDEMG_LLAMA_SERVER_BIN, deprecation alias for MDEMG_MLX_LM_BIN (slog.Warn at boot, retained ≥1 release cycle per the Phase 13.6 deprecation pattern), PATH lookup of `llama-server`. resolveMDEMGModelPath default updated to the canonical Phase 13.5 GGUF filepath (.local-models/mdemg-llm-v1-gguf/mdemg-llm-v1.Q5_K_M.gguf) since llama-server takes a `.gguf` filepath, not an HF-format directory like mlx_lm.server. Install error message updated for the new env var name + remediation steps (`brew install llama.cpp`). - migrateLegacyMLXServerPlist() added: if a pre-cutover com.mdemg.mlx-server plist is bootstrapped on the operator's machine, Install() boots it out and renames the file to .disabled-phase13_5 (matches the manual operator convention from Phase 13.5 rollout). Best-effort: failures don't block the install. - internal/cli/service_darwin_test.go fully rewritten: * TestLaunchdServicesIncludesLlamaServer asserts the new entry exists and is Optional=false (production matches Hotfix 11.6.3.1; the old test asserted Optional=true, a latent lie since 2026-05-02 that Linux CI never caught because of //go:build darwin) * TestLlamaServerPlistEmbedded replaces TestMLXServerPlistEmbedded; additionally asserts mlx-server.plist is NOT in embed.FS * Two resolver tests for the primary env var * New TestResolveLlamaServerBinFallsBackToMLXAlias proves the Phase 13.6 deprecation alias path works * resolveMDEMGModelPath tests updated for the new GGUF default - internal/cli/watchdog.go: help text references com.mdemg.llama-server (instead of com.mdemg.mlx-server) and llama-server (instead of mlx_lm.server). Notes that mdemg_mlx_health_state metric name is retained for dashboard compatibility. Tested: - Tier 1 unit: 7/7 new tests pass; full ./internal/cli/... suite green (61s wall-clock). - Tier 2 integration: golangci-lint run ./internal/cli/ — 0 issues. CI plist sync-check (diff -q packaging/launchd/*.plist internal/cli/launchd_templates/) — 6/6 byte-identical. - Tier 3 live e2e: deferred. Running mdemg service install on the operator's currently-serving machine would briefly bootout the running llama-server LaunchAgent (PID 20527 actively serving production inference). The hand-installed llama-server plist on the operator's machine is byte-equivalent (modulo template substitutions) to what this commit will install via `mdemg service install` on a fresh operator setup, so the operator can verify on next planned redeploy. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(sprint): MODEL-DIST-001 sprint plan + quant manifest skeleton Epic 0 of Sprint MODEL-DIST-001 — Local LoRA Distribution via Ollama Library. Sprint plan in 12-section v1.0 format. Supersedes parts of the speculative spec at docs/research/mdemg_sprint_ideas/MDEMG_FT_LORA_PACKAGING_SPEC.md (HF Hub vs Ollama Library; adapter-only vs both-fused-and-adapter; Apple Silicon scope vs cross-platform). Configurability Contract — every operator-visible value is dynamic per the framework's no-hardcoding rule. 12 env vars + flag overrides + sensible defaults. ModelFetcher interface decouples CLI from Ollama-specific knowledge; v1 ships OllamaFetcher only, future backends (HF / S3 / GitHub Release / file) plug in via factory dispatch on MDEMG_MODEL_BACKEND without touching the CLI surface. Forensic from Epic 0: - adapters/tier1/adapters.safetensors verified present (514 MB MLX, Phase 5 SFT Iter 2400 best output) - mdemg-llm-v1.Q5_K_M.gguf SHA256 captured (9.8 GB; 144ad7231...) - f16 GGUF intermediate NOT on disk; Epic 1 will regenerate via convert_hf_to_gguf.py from the MLX merged model (~5 min) - qwen3:14b model-layer digest captured from Ollama registry; manifest digest to be computed at Epic 3 for Modelfile FROM @sha256: pinning quant_manifest.json skeleton with Q5_K_M SHA pre-populated; Q4_K_M / Q8_0 / adapter SHAs filled in during Epics 1+2. Estimated effort 5–7 dev-days. OpenAI spend $0. Risk medium (Ollama publish one-way; MLX→PEFT→GGUF LoRA conversion is the riskiest engineering item with documented contingency to defer to MODEL-DIST-002 if blocked). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(model-dist-001): Epic 1 — built Q4_K_M + Q8_0 fused GGUFs Pipeline (CLAUDE.md Phase 13.5 documented path): 1. mlx_lm.fuse --dequantize: mlx-community/Qwen3-14B-4bit + adapters/tier1/ -> 29.6 GB bf16 HF safetensors at .local-models/qwen3-14b-mdemg-v1-bf16/ 2. convert_hf_to_gguf.py --outtype f16 -> 30 GB f16 GGUF (required neural/.venv interpreter with torch + transformers + gguf installed; /opt/homebrew/bin/convert_hf_to_gguf.py uses system python which lacks these — installed gguf/sentencepiece/protobuf into neural/.venv) 3. llama-quantize Q4_K_M -> 9.0 GB (4.87 BPW; 40s wall on M5) 4. llama-quantize Q8_0 -> 16 GB (8.50 BPW; 11s wall on M5) 5. Live smoke per new quant via llama-server on port 18102 — both serve /v1/models cleanly with embedded chat_template SHAs captured in quant_manifest.json: Q4_K_M: 401161710c22f0ae...411d42ea Q5_K_M: 144ad723101d688f...d5f5d54 (matches Epic 0 baseline) Q8_0: fc14dcb40af1bb58...8db6089 f16: 436cd6f41a684805...3217bd (intermediate, retained for Epic 2) Resource matrix updated with empirical sizes (Q4_K_M is 9.0 GB vs estimated 6.5 GB; min RAM revised 8 -> 12 GB to cover ~3 GB working memory above weights). 14B params x 4.87 BPW ≈ 8.5 GB matches the formula. GGUF binary artifacts stay local — .local-models/ gitignored per .gitignore:70. Sprint deliverable in git is just the manifest update. Production llama-server (PID 20527 on port 8102) undisturbed throughout Epic 1; live smokes used port 18102. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(model-dist-001): Epic 2 — defer adapter to MODEL-DIST-002 Adapter (LoRA-only Modelfile via ADAPTER directive) deferred per the sprint plan's documented contingency clause. Fused-only path (Epics 1, 3, 4, 5) continues — that's the primary operator value. Forensic findings (epic_2_forensic.md): - MLX adapter is well-formed: 560 tensors, 40 layers x 7 target_modules, rank 32, alpha 64, scale 20.0. - convert_lora_to_gguf.py is NOT in brew install llama.cpp; would need manual fetch from llama.cpp source. - MLX -> PEFT requires tensor transposition: MLX lora_a is (in, rank); PEFT expects (rank, in). Same for lora_b. - Estimated 80-95 min to complete vs ~30 min budget remaining for Epic 2. - Hit the contingency criterion: "MLX -> PEFT conversion blocked by tooling gaps." Decision: defer adapter scope to MODEL-DIST-002 (new follow-up sprint, to be planned separately). Fused-only ships this sprint. Knock-on changes (in-flight to subsequent epics): - Epic 3: drop Modelfile.adapter; publish only 3 fused quants. - Epic 4 CLI: --adapter flag accepted at parse-time but errors with "lands in MODEL-DIST-002"; machinery preserved for forward-compat. - Epic 6 e2e: drop adapter-pull step. - Epic 7 feature doc: adapter section notes "coming in MODEL-DIST-002". Artifacts preserved on disk for MODEL-DIST-002 pickup: - adapters/tier1/adapters.safetensors (MLX, 514 MB) - .local-models/mdemg-llm-v1-gguf/mdemg-llm-v1.f16.gguf (30 GB, retained as base for llama-server --lora verification later) quant_manifest.json adapter block updated with status=deferred + reason. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(model-dist-001): Epic 3 — 3 Modelfiles + local ollama create (push pending) Authored 3 Ollama Modelfiles in packaging/ollama/: Modelfile.Q4_K_M — 9.0 GB, 12 GB min RAM, 16 GB recommended Modelfile.Q5_K_M — 11 GB, 14 GB min RAM, 24 GB recommended (production canonical) Modelfile.Q8_0 — 16 GB, 20 GB min RAM, 32 GB recommended Common shape: FROM ./../../.local-models/mdemg-llm-v1-gguf/...gguf relative path (operator-machine local); num_ctx 32768, num_predict 4096, stop tokens <|im_end|>/<|im_start|>; Apache-2.0 LICENSE; SYSTEM positioning block. No TEMPLATE directive — chat template baked into GGUF metadata (Qwen3 chat_template.jinja preserved through mlx_lm.fuse --dequantize → convert_hf → llama-quantize pipeline). packaging/ollama/README.md documents the publish workflow including the fork-customization path (operators publishing under a different namespace follow MDEMG_MODEL_NAMESPACE per the Configurability Contract). Local ollama create completed for all 3: reh3376/mdemg-llm-v1:Q4_K_M ID 5c3a7252c295 reh3376/mdemg-llm-v1:Q5_K_M ID 08c13b480864 reh3376/mdemg-llm-v1:Q8_0 ID 6b1006facd36 Layers de-duplicated: config + params + system layers (3 layers) are identical across all 3 quants; only the model blob (GGUF) differs. ** ollama push deferred ** — one-way action gated on operator confirmation per Sprint Plan §10 Risk #8. Operator must claim reh3376 namespace on ollama.com and generate API token before push proceeds. Local-create proves the Modelfiles are well-formed; push is a separate decision. Once pushed, manifest digests captured into quant_manifest.json (ollama_manifest_digest field per quant) for mdemg model verify. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(model-dist-001): Epic 4 — `mdemg model` CLI + pluggable Fetcher interface Sprint MODEL-DIST-001 Epic 4 — the bulk of the operator-facing surface. New CLI subcommand group: mdemg model pull # fetch + symlink + SHA verify mdemg model list # show pulled models mdemg model verify # re-check SHAs vs quant manifest mdemg model remove # destructive (requires --yes) mdemg model where # print resolved path for shell scripting Pluggable backend (internal/cli/model_fetcher.go): type Fetcher interface { Name, Fetch, Verify, Remove } NewFetcher dispatches on cfg.ModelBackend (env: MDEMG_MODEL_BACKEND) v1 ships OllamaFetcher only; future backends (hf, s3, github-release, file) plug in via factory branch — CLI surface unchanged. OllamaFetcher (internal/cli/model_fetcher_ollama.go): Encapsulates ALL Ollama-specific concepts: `ollama pull` invocation, manifest path under <OLLAMA_MODELS>/manifests/<OLLAMA_HOST>/<ns>/<n>/<tag>, mediaType=application/vnd.ollama.image.model layer filtering, blob path under <OLLAMA_MODELS>/blobs/sha256-<digest>, symlink under <MDEMG_MODEL_DIR>, idempotent. Configurability Contract (no hardcoding; memory: feedback_no_hardcoded_values.md): 12 env vars + flag overrides, each with v1-production-tuned defaults so `mdemg model pull` with no flags Just Works. See sprint plan §3. Live-verified all 3 resolution paths: `--quant Q5_K_M` → namespace=reh3376 `--namespace acme --name custom-model` → namespace=acme name=custom `MDEMG_MODEL_NAMESPACE=acme env` → env overrides applied Added to internal/config/config.go: ModelBackend, ModelNamespace, ModelName, ModelQuants, ModelRamTiers, ModelQuant, AdapterBase, ModelDir, OllamaModelsRoot, OllamaRegistryHost, ModelManifestPath. Embedded quant manifest (internal/cli/quant_manifest.json via embed.FS): Runtime source-of-truth for SHA verification. Operator override via MDEMG_MODEL_MANIFEST_PATH for air-gapped deployments. Mirrors docs/development/model-dist-001/quant_manifest.json. RAM-tier auto-pick: Default JSON `{"<16":"Q4_K_M","<24":"Q5_K_M","default":"Q8_0"}` maps host RAM (sysctl on darwin, /proc/meminfo on linux) to quant. Operator override via MDEMG_MODEL_RAM_TIERS. Adapter path (--adapter flag) returns ErrAdapterDeferred per Epic 2's contingency exit — adapter publication lands in MODEL-DIST-002. Flag machinery preserved for forward compatibility. Tests (22, all green) in internal/cli/model_test.go: - Backend factory dispatch (5 cases incl. case-insensitive, default, error) - Quant allowlist parsing (5 cases incl. whitespace + empty entries) - RAM-tier JSON parsing (default + operator override + malformed) - PickQuantForRAM (7 boundary cases) - ResolveQuant across paths (auto, explicit, rejection, operator-custom) - QuantManifest load (embedded + file override + missing-file error) - Ollama tag composition (fused + adapter forms) - Manifest path composition under custom OLLAMA_MODELS/OLLAMA_HOST - Blob path digest prefix handling - Adapter deferred error - Manifest JSON parser (mediaType filtering + malformed + no-model-layer) Grep audit (verification checklist): grep on internal/cli/model*.go for hardcoded values found only in help text Long/example strings documenting defaults to operators — not in logic. Behavior values all flow through cfg.Model* fields. Build + lint clean. Full cli test suite (61s wall) green. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(model-dist-001): Epic 5 — V0021 model_install_events hypertable + writer Sprint MODEL-DIST-001 Epic 5 — observability for `mdemg model` operations. Grafana panels deferred to Sprint B (Grafana audit). New migration: internal/tsdb/migrations/021_model_install_events.sql Hypertable on recorded_at, 7-day chunks, 3 indexes (quant-time, failed-events partial, backend-event-time). Columns: event_id CUIDv2 PK + recorded_at, event_type (pull/verify/remove), backend_name, namespace, model_name, quant, adapter bool, success bool, latency_ms, sha256, size_bytes, err_message (1 KB cap). New writer: internal/tsdb/model_install_writer.go Synchronous single-row INSERT (not buffered + CopyFrom — CLI is one-shot, writes are infrequent vs the V0017/V0018/V0019/V0020 retrieval- path writers that fire per-request). Nil-pool no-op for degraded mode. errMessageMaxLen=1024 truncation at write time. New modelInstallPool interface (Exec-shaped) avoids touching the existing CopyFrom-shaped poolIface used by buffered writers. Wiring: internal/cli/model.go gets recordModelEvent(parent, cfg, row) helper: - Returns immediately if !cfg.TSDBEnabled || cfg.TSDBHost=="" - 2s timeout on connect (TSDB unreachable doesn't block CLI exit) - Logs warning + degrades gracefully on any TSDB error Called from runModelPull (success + failure paths), runModelVerify (single sweep row), runModelRemove (success + failure paths). Schema version bump: internal/config/config.go: TSDB_REQUIRED_SCHEMA_VERSION default 20→21. CI validator at .github/workflows/ci.yml:60-65 counts SQL files in internal/tsdb/migrations/ and asserts equality; now 21 files = 21 in config = passes. Build + lint clean. Existing tsdb / cli test suites green; no new tests added for the writer itself (single INSERT mirrors V0017/V0018/V0019 patterns already covered; integration is operational verification at Epic 6 once tsdb is up in the dev stack). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(model-dist-001): Epic 7 — local-model-distribution feature doc Sprint MODEL-DIST-001 Epic 7 — operator-facing feature documentation following the standard Why / Choices / How / How-to-use shape (memory: feedback_per_feature_docs_required.md). Contents: - Why: gap between brew install and a working local LLM after Phase 13.5 - Choices: backend matrix (Ollama vs HF vs GitHub vs S3 vs file://), artifact form (fused vs adapter), Apple Silicon scope, "Ollama runtime rejected (broken on M5+macOS 26.3.x), Ollama distribution only" - How it works: ASCII flow diagram covering CLI dispatch -> Fetcher interface -> OllamaFetcher (preflight, ollama pull, manifest discovery, blob resolve, symlink, SHA verify) -> V0021 observability row - How to use: * Quick start (3 commands: brew install ollama, mdemg model pull, curl /v1/models) * Explicit quant selection * Managing pulled models (list / verify / where / remove) * Forks + enterprise (MDEMG_MODEL_NAMESPACE override) * Air-gapped (MDEMG_MODEL_MANIFEST_PATH override) * Resource matrix per quant (disk, min RAM, recommended RAM, BPW) * Full Configurability Contract table (11 env vars + flags + defaults) * V0021 observability schema - Troubleshooting: ollama missing, SHA mismatch, quant allowlist rejection, RAM auto-detection failure, out-of-disk, symlink permission - Forward-looking: MODEL-DIST-002 adapter, Sprint B Grafana panels, future backends, cross-platform - References: all source-of-truth files cross-linked Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(model-dist-001): Epic 8 — Documentation Update (main repo) Sprint MODEL-DIST-001 Epic 8 — final epic, never cut (memory: feedback_sequential_epics.md). This commit lands the main-repo doc updates. The packaging/homebrew-mdemg/ submodule docs (README, CHANGELOG, formula caveats text) update at v0.10.0 release-tag time per the v0.9.0 release flow precedent — that's when goreleaser auto-regenerates mdemg.rb from .goreleaser.yaml's caveats template, and the tap-side README/CHANGELOG get edited in lockstep. Changes: - CHANGELOG.md: comprehensive Unreleased entry documenting Epics 0-5 + 7 landed in this sprint. Epic 3 ollama push and Epic 6 Tier 3 e2e marked as gated on operator confirmation. Adapter path explicitly deferred to MODEL-DIST-002 with epic_2_forensic.md cross-reference. Captures the Configurability Contract enumeration, the 3 quant SHAs, the Fetcher interface design, the V0021 hypertable, and the explicit out-of-scope list. - CLAUDE.md: new "Model Distribution (Sprint MODEL-DIST-001)" subsection in Architecture Notes, slotted ABOVE the existing Compose embed entry for visibility. Captures the pluggable-backend design, the Ollama-as- distribution-only constraint, the on-disk symlink + manifest discovery flow, the 11-knob Configurability Contract surface, the no-hardcoding enforcement, the TSDB V0021 hookup, and the Apple Silicon v1 scope. - README.md: new "Step 2b (optional): Pull the local LLM" section between Step 2 (Initialize/Start) and Open the Dashboard. 3-command quick start (brew install ollama -> mdemg model pull -> set MDEMG_MODEL_PATH). Cross-references the feature doc for the full Configurability Contract. - .goreleaser.yaml: caveats template updated to include `mdemg model pull` instructions. Goreleaser regenerates the homebrew formula's caveats block from this on the next v* tag push, so v0.10.0 will ship the new text to brew users automatically. Deferred to v0.10.0 release-tag time (handled per v0.9.0 precedent): - packaging/homebrew-mdemg/README.md update - packaging/homebrew-mdemg/CHANGELOG.md update - packaging/homebrew-mdemg/mdemg.rb regeneration (automatic via goreleaser from the .goreleaser.yaml change in this commit) - Submodule pointer bump in main repo Deferred to Epic 6 close (after operator does ollama push): - post.md sprint-close document - Capture of remote Ollama manifest digests into quant_manifest.json Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(model-dist-001): Epic 3 closeout — Ollama Library push complete All 3 fused quants now live on Ollama Library: https://ollama.com/reh3376/mdemg-llm-v1:Q4_K_M https://ollama.com/reh3376/mdemg-llm-v1:Q5_K_M https://ollama.com/reh3376/mdemg-llm-v1:Q8_0 End-to-end integrity verified: remote model-layer digests captured via GET https://registry.ollama.ai/v2/reh3376/mdemg-llm-v1/manifests/<quant> match the local Epic 1 SHAs exactly: Q4_K_M 401161710c22f0ae...411d42ea (matches Epic 1) Q5_K_M 144ad723101d688f...d5f5d54 (matches Epic 1) Q8_0 fc14dcb40af1bb58...8db6089 (matches Epic 1) Captured into quant_manifest.json (both docs canonical + internal/cli embed.FS mirror, byte-synced): - ollama_manifest_digest per quant (computed from the manifest body): Q4_K_M sha256:a210cccb12311773fd70bfa81f221ca0f7940a315bef87b84608caf894533b1b Q5_K_M sha256:ae6e54fe1ee0b487ae41260687ed14c46c30d1ffb0fece936282418b5bcb78e1 Q8_0 sha256:93df4d64bfa751506f7afba8bf08b891ea828575b838adec17b9399ad85be718 - Corrected size_bytes (Epic 1 used approximate values; replaced with registry-reported exact bytes for each tag): Q4_K_M 9.0 GB -> 8.4 GB (9001753408 B; was 9658404096) Q5_K_M 11 GB -> 9.8 GB (10514569568 B; was 11811160064) Q8_0 16 GB -> 14.6 GB (15698534208 B; was 17179869184) - Status flipped from "local-create done; push pending" to "published". Embedded runtime manifest (internal/cli/quant_manifest.json) re-built into the binary via embed.FS. TestLoadQuantManifest_EmbeddedFallback green with new values. Epic 3 of Sprint MODEL-DIST-001 now COMPLETE. Epic 6 (Tier 3 live e2e — `mdemg model pull` against the published tags + llama-server load on port 18102 + sanity inference) is now unblocked. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(model-dist-001): sprint close — post.md Sprint MODEL-DIST-001 close-out per memory rule (feedback_sprint_plan_format.md §11 — sprint plans live in docs/development/<sprint-line>/ with the standard post.md companion). Sections (CLAUDE.md sprint-plan section guidance): - Outcome: 3 quants live on Ollama Library, mdemg model pull is the canonical install path - Process: how the plan held under reality (operator-surfaced no- hardcoding rule revised the plan in-place to add the Configurability Contract before code was written) - Findings: 5 smooth parts + 5 friction items, both honest: * convert_hf_to_gguf.py python deps gap (silent ModuleNotFoundError) * mlx_lm.fuse adapter-path requirement * convert_lora_to_gguf.py missing from brew install llama.cpp (proximate Epic 2 deferral trigger) * mdemg tsdb migrate CWD-aware .env loader quirk * Epic 1 size estimates off vs registry-reported exact bytes - Current state: per-layer state matrix - Testing & benchmarking: all 3 tiers documented (Tier 3 e2e captured V0021 rows for both pull + verify event_types — live-verified) - Risks & opportunities (forward): MODEL-DIST-002 adapter scope, Sprint B Grafana, cross-platform, HFFetcher slot, CWD-aware .env loader QoL - Sprint commits: 9 commits on dev01, mapped to their epics Closes Sprint MODEL-DIST-001 functionally. Operational sprint close (v0.10.0 release tag + tap-repo doc updates) is a separate motion. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(release): promote Unreleased -> v0.10.0 Promote the Sprint MODEL-DIST-001 entry from Unreleased to v0.10.0 (2026-05-11) ahead of release.yml / goreleaser tag push. Fresh empty Unreleased section seeded above. v0.10.0 ships: - mdemg model pull|list|verify|remove|where — one-command path from brew install mdemg to a working local LLM - Pluggable ModelFetcher interface (Ollama in v1, slots for HF/S3/GHR/file) - 3 fused GGUF quants live on Ollama Library at reh3376/mdemg-llm-v1 (:Q4_K_M 8.4 GB / :Q5_K_M 9.8 GB / :Q8_0 14.6 GB) - 11-knob Configurability Contract (every operator-visible value dynamic) - TSDB V0021 model_install_events hypertable + writer - docs/features/local-model-distribution.md Adapter (LoRA-only) path deferred to MODEL-DIST-002 per the sprint plan's documented contingency (epic_2_forensic.md). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore(submodule + docs): bump homebrew-mdemg to v0.10.0 + cli-reference Model Distribution section Stage 4 + Stage 5 of v0.10.0 release. Submodule pointer bump: packaging/homebrew-mdemg 6077097 -> c3aa68b incorporates: - 42d7390 — goreleaser auto-bumped mdemg.rb to version "0.10.0" + new caveats text on v0.10.0 tag push - c3aa68b — manual docs round-trip: CHANGELOG v0.10.0 entry, README Optional Pull-the-local-LLM section in Quick Start (full Ollama Library doc with quant matrix, list/verify/where/remove subcommands, fork variants via MDEMG_MODEL_NAMESPACE, architecture note "Ollama is distribution-only"), Upgrading to v0.10.0 + What's New in v0.10.0 blocks, default-LLM rotation history extended, mdemg_beta_testing.md version pin v0.9.0 -> v0.10.0 docs/user/cli-reference.md (per Stage 5 user request to align refs with current codebase): - New ## Model Distribution top-level section before ## Synergy Optimization (model command group is GroupID="config" in root.go but a top-level cli-ref section is cleaner for discoverability). Documents all 5 subcommands (pull, list, verify, remove, where) with flag tables, usage examples, the full Configurability Contract (11 knobs), the architecture note (Ollama is distribution-only). - Updated Environment Variable Reference with new "Model Distribution (Sprint MODEL-DIST-001, v0.10.0)" subsection — 11 env vars + defaults table. - Updated Command Tree Summary with the new model subcommand group slotted between Configuration and Advanced. docs/user/api-reference.md unchanged: Sprint MODEL-DIST-001 added zero HTTP endpoints (CLI-only sprint; observability via TSDB V0021 row writer is server-side internal). Audit also surfaced ~25 routes of pre-existing drift between code and docs (mostly path-parameter notation: `/v1/backup/` in code vs `/v1/backup/{id}` in docs — same routes — plus 3 undocumented /api/graph/* endpoints and 2 undocumented /v1/admin/features/{restart,stop} actions). That drift is out-of-scope for v0.10.0 and belongs in its own follow-up sprint. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(cli): add mdemg model run wrapper (follow-up #1 to MODEL-DIST-001) One-shot or interactive REPL chat against the configured LLM endpoint (default: llama-server at port 8102 per Phase 13.5). Closes the gap operators noted between `ollama run` and the mdemg framework. Two modes: - One-shot: `mdemg model run -p "hello"` or positional arg after `--` - Interactive REPL: no prompt; reads stdin line-by-line, accumulates conversation history across turns Pure stdlib HTTP (no llmclient retries/breakers/recording). CLI invocations are intentionally NOT recorded to llm_interactions — this is an ad-hoc exploration tool, not a production code path; keeping the training-data corpus clean. Every operator-visible value is dynamic per the no-hardcoding rule: --endpoint override cfg.EffectiveLLMEndpoint --model override cfg.LLMModel (final fallback: mdemg-llm-v1) --prompt/-p one-shot prompt (omit for REPL) --system/-s system message --temperature (default 0.7) --max-tokens (default 1024) --timeout (default 60s) Live-verified end-to-end on the operator's running llama-server on port 8102 with mdemg-llm-v1: one-shot worked; system+prompt with --model override worked. 13 unit tests in model_run_test.go covering: message composition (system first, no-system skip, history preservation), config resolution (flag > cfg > final fallback), OpenAI-compat HTTP shape, error paths (HTTP error, inline error object, no choices, timeout), trailing-slash endpoint normalization, body-bounding helper. All green. Renamed local body-bounding helper to `truncateRunBody` to avoid name collision with a same-named helper in internal/cli/data.go. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(api): document 19 previously-undocumented endpoints (follow-up #2) Audit of internal/api/server.go (167 routes) vs docs/user/api-reference.md surfaced 19 genuinely missing endpoints. v0.10.0 commit noted this as out-of-scope; this commit resolves the gap. Audit method: extract mux.HandleFunc registrations from server.go, extract documented "VERB /path" headings from api-reference.md, normalize both to strip path parameters and trailing prefix slashes, diff. Of the initial 24-entry code-only set, 5 are false positives (combined headers like "POST /v1/admin/features/start|stop|restart" cover the individual verbs; "GET|POST /v1/jiminy/protocol/metrics" covers both methods on one route). Added sections: Jiminy / J17 (10 endpoints, all under "## Jiminy Inner-Voice"): GET|POST /v1/jiminy/protocol/metrics # snapshot + reset GET /v1/jiminy/protocol/status # per-session J17 state POST /v1/jiminy/checkpoint # tier-transition checkpoint POST /v1/jiminy/resume-protocol # restore from checkpoint POST /v1/jiminy/extension # operator-driven tier hold POST /v1/jiminy/strict # toggle strict mode per session POST /v1/jiminy/reformulate # advisory -> imperative rewrite POST /v1/jiminy/classify # pre-Write/Edit pass/deny gate GET /v1/jiminy/latest # most recent guidance (warm store) POST /v1/jiminy/warm # eager cache warmup Memory / Graph (3 endpoints, under "## Memory Operations"): GET /v1/memory/graph/topology # node/edge counts per layer GET /v1/memory/graph/neighborhood # local 1-3 hop walk GET /v1/memory/spaces # root listing of all spaces Observability (2 endpoints, under "## Metrics & Monitoring"): GET /v1/metrics/trends # TSDB time-series query GET /v1/prometheus # Prometheus scrape endpoint Dashboard / Viz (4 endpoints, new "## Dashboard / Visualization (internal)" section before MCP Server Tools — operator-internal endpoints backing the browser dashboard at /ui/): GET /api/graph/data # force-directed graph data GET /api/graph/fields # schema field catalog GET /api/graph/health # explorer health GET /viz/topology # standalone HTML topology view Each entry has handler-signature-derived request/response shape, query parameter table, sample curl/JSON examples following the existing api-reference convention. TOC updated with new "Dashboard / Visualization (internal)" entry and renumbered tail. Out of scope (deliberate, deferred): - 28 "docs-only" entries from the audit are confirmed false positives from prefix-matching path normalization (code registers /v1/memory/nodes/ with trailing slash and routes the suffix; docs spell out the full /v1/memory/nodes/{node_id}/archive form correctly) - /v1/symbols root path is partially covered by /v1/symbols/relationships + /v1/symbols/{id}/relationships in docs; root listing endpoint documentation can land later if/when its handler grows specific shape - /v1/conversation/observations covered indirectly by the flag-for-org endpoint documentation Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(grafana-audit): Epic 0 — sprint plan + audit harness Sprint GRAFANA-AUDIT-001 Epic 0. Builds the per-panel audit harness: walks every panel in deploy/docker/grafana/dashboards/*.json, extracts rawSql/sql targets, substitutes Grafana macros (\$__timeFilter, \$__timeFrom/To, \$__interval, \$__unixEpoch*) + template variables (\$space_id, \$instance + multi-value variants like \${space_id:raw}), executes via docker exec mdemg-timescaledb-1 psql, classifies each panel target as PASS / EMPTY / FAIL / SKIP. Tier 1 unit tests (17 tests, all green): - Template-variable substitution: time_filter / from-to / unix epoch / interval / interval_ms / space_id (3 syntaxes) / instance (3 syntaxes) / multi-macro composite query - Table extraction (FROM/JOIN with alias, case-insensitive, no-table) - Panel walking (flat, nested rows, targets-with-sql vs no-sql) Smoke test against mdemg-overview.json IMMEDIATELY validated the operator's "diminished observability" report — 5 of 13 panels FAIL, 1 EMPTY, 7 PASS on the front-page dashboard: FAIL Request Rate FAIL Error Rate FAIL Circuit Breakers FAIL Requests by Status FAIL Rate Limit Rejections EMPTY Request Latency Distribution (t0; t1/t2 PASS) The original 11-panel sample missed these because it sampled different panels. Lesson: trust the rigorous audit, not the sample. Sprint proceeds to Epic 1 (full audit across all 146 panels) immediately. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(grafana-audit): Epic 1 + 2 — full audit + findings Sprint GRAFANA-AUDIT-001 Epics 1 + 2. Per-panel rigorous audit of all 165 target executions across 146 panels in 8 dashboards. Headline: PASS 125 (76%) — executes, returns rows in 24h window EMPTY 19 (12%) — executes, 0 rows FAIL 3 (2%) — SQL error SKIP 18 (11%) — non-SQL panel types Harness fix mid-Epic-1: \$__interval substitution was wrapping the value in quotes, but Grafana convention has panel SQL provide its own outer quotes — producing doubled quotes and 18 false-positive FAILs. Fixed: substitute bare value. Verified by re-run: 20→3 FAILs. Real failures (Epic 2 findings): (a) 3 SQL bugs on mdemg-llm-routing.json — all three panels hardcoded `mdemg-dev` (unquoted) in WHERE clauses instead of '\$space_id' template variable. PG parses `mdemg-dev` as subtraction. (b) 5 schema-drift EMPTYs — panel filter expects metric_type or labels shape that doesn't match server emission: - mdemg_j17_events_total: panel 'counter', server 'gauge' - mdemg_rsic_action_total: panel status='success', server status='completed' - 2 more suspected pending full-SQL inspection. (c) 2 missing-server-side metrics — mdemg_rate_limit_rejected_total and mdemg_http_request_duration_seconds_p50 not emitted. Will be documented; server emission is follow-up. (d) ~11 sparse-data EMPTYs — panel SQL correct, no rows in 24h window. Widening time-range in Epic 4. Projected post-Epic-3/4: 133 PASS, ≤11 EMPTY, 0 FAIL. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(grafana): Epic 3 — 5 panels recovered (3 FAIL + 2 schema-drift) Sprint GRAFANA-AUDIT-001 Epic 3. Minimum-change JSON edits to fix category (a) SQL bugs and category (b) schema-drift EMPTYs identified in Epic 1/2. mdemg-llm-routing.json (3 panels, all category-a SQL bugs): - LLM call distribution by model_name (24h) - LLM latency p50 / p95 / p99 by task × model - LLM error rate % by task_name (selected range) Bug: WHERE clause was `(\$space_id = '' OR space_id = '\$space_id')` — the first \$space_id was unquoted, so PG parsed `mdemg-dev = ''` as `column "mdemg-dev"` which doesn't exist. Also breached the no-hardcoding rule (memory: feedback_no_hardcoded_values.md). Fix: wrap the first variable reference in quotes → `('\$space_id' = '' OR space_id = '\$space_id')` — a proper string-literal comparison that also serves as the All-spaces guard the panel author intended. Verdict: 3 FAIL -> 3 PASS. mdemg-llm-routing is now 4/4 PASS. mdemg-j17.json :: Total Events (1 panel, category-b drift): Panel filtered `metric_type = 'counter'` (Prometheus naming convention because metric is `mdemg_j17_events_total`). Server actually emits `metric_type = 'gauge'`. 6,393 rows in 7d; 0 panel matches. Fix: align panel filter to `'gauge'`. Verdict: EMPTY -> PASS. mdemg-rsic.json :: Action Success Rate t0 (1 panel target, category-b drift): Panel filtered `labels->>'status' = 'success'`. Server actually emits `'completed'` (181 rows in 24h; 0 panel matches). Fix: align panel filter to `'completed'`. The t1 'failed' target retained unchanged — its EMPTY result is now accurate observation (server emits no `'failed'` actions; 0 = legitimate zero). Verdict: 1/2 EMPTY -> PASS, 1/2 EMPTY accurate-zero. Audit verdict counts: Before: 125 PASS, 19 EMPTY, 3 FAIL, 18 SKIP After: 130 PASS, 17 EMPTY, 0 FAIL, 18 SKIP Remaining 17 EMPTYs (Epic 4 disposition): - 5 category-c emission regression — 4 rsic metrics stopped at 2026-05-07/08 (server-side investigation queued as follow-up) - 2 category-c never-emitted — Rate Limit Rejections, p50 latency - 8 category-d sparse-data on ft-training — widen time-range - 1 mdemg-jiminy :: Effectiveness Trends — CTE pending inspection - 1 mdemg-rsic :: Action Success Rate t1 (accurate-zero) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(grafana-audit): Epic 4 + 7 — feature doc + sprint close Sprint GRAFANA-AUDIT-001 closeout (Epics 4 + 5 + 6 + 7 combined as a single doc-only commit; Epic 5 deferred and Epic 6 deferred-to-operator as documented in post.md). New: docs/features/observability-dashboards.md (286 lines) — full operator-facing inventory of the 8 dashboards with: - Per-dashboard purpose + panel count + primary tables - Audit verdict table (130/17/0/18 post-Epic-3) - Epic 3 fix log: 3 SQL bugs + 2 schema-drift filters - Known gaps in 3 buckets: (c) emission regression (4 May-7-8 metrics, current codebase has zero refs — server removed emission), (c) never-emitted (mdemg_rate_limit_rejected_total + mdemg_http_request_duration_seconds_p50), (d) sparse/zero data on this dev TSDB (ft-training tables) - Refresh expectations per table - Operator playbook for re-running scripts/grafana_panel_audit.py - Forward-looking: CI integration, coverage expansion, server-side emission restore New: docs/development/grafana-audit-001/post.md — sprint close per memory rule, covers process / smooth-parts / friction / sprint-plan vs reality / current state / risks-opportunities / commits. Epic deferrals (documented in post.md): - Epic 5 (coverage expansion for 11 unused TSDB tables): deferred because most target tables are zero on this dev TSDB. Adding panels would create more EMPTYs, defeating the goal. - Epic 6 (Tier 3 browser e2e): deferred to operator; not blocking. CHANGELOG Unreleased entry covers the sprint at high level + cross- references the feature doc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Roger Henley <rogerhenley345@gmail.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* docs(release): promote Unreleased -> v0.9.0 Promote the Unreleased CHANGELOG block to v0.9.0 (2026-05-06) ahead of release.yml / goreleaser tag push. New ### Breaking subsection captures two operator-visible cutovers since v0.8.5: (1) Phase 13.5 LLM runtime port 8101 -> 8102 + .env migration required; (2) Phase 13.6 MLX_* -> LLM_* env-var rename (legacy aliases retained for >= 1 release cycle). New ### Added entries: Phase 10.5 closeout (UBENCH framework promotion, commit 0389b49) and Claude Code GitHub App workflows (PRs #378, #379). All previously-Unreleased entries (Phase 14.2.3, 14.2.x, 14.1.x, 14, 13.6, 13.5, 13.2, 13.1) carried forward unchanged into the v0.9.0 block. Fresh empty Unreleased section seeded above. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore(submodule): bump homebrew-mdemg to v0.9.0 formula + docs Bumps packaging/homebrew-mdemg pointer a235977 -> 6077097, which incorporates: - f9358cd Brew formula update for mdemg version v0.8.5 (goreleaser, prior) - b4a0d2c Brew formula update for mdemg version v0.9.0 (goreleaser, this release) - 6077097 docs: v0.9.0 -- CHANGELOG, README What's New, beta-testing version pin Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(api): /healthz returns build-time version, not stale literal "0.6.0" `config.FromEnv()` defaulted MdemgVersion/MdemgCommit to literal "0.6.0"/ "unknown" when MDEMG_VERSION/MDEMG_COMMIT envs were unset. Both /healthz and /readyz serialize cfg.MdemgVersion, so they reported "0.6.0" forever regardless of the actual binary's ldflags-injected cli.Version. Fix: defaults to "" in config; cli/config_loader.go injects cli.Version / cli.Commit (the build-time vars set by goreleaser ldflags) when the env override is unset. Operators can still pin via MDEMG_VERSION env. Live-verified: dev build (no ldflags) now reports {"version":"dev"} on /healthz instead of the lying "0.6.0". Production builds via goreleaser will report the real semver tag. TestHandleHealthz unaffected (sets cfg.MdemgVersion directly). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(service): replace decommissioned mlx-server LaunchAgent with llama-server Phase 13.5 cutover (2026-05-03) replaced mlx_lm.server (port 8101) with llama.cpp llama-server (port 8102) as the production LLM runtime, but the embedded launchd plist template + service install code paths were never updated. Any operator running 'mdemg service install' from a fresh checkout got the decommissioned mlx_lm.server agent — mdemg's startup preflight then failed because LLM_ENDPOINT=http://127.0.0.1:8102/v1 wasn't reachable. Changes: - New packaging/launchd/com.mdemg.llama-server.plist with the Phase 13.5 production flags (--ctx-size 32768 --parallel 4 --cont-batching --metrics --jinja). Byte-identical mirror at internal/cli/launchd_templates/ for the embed.FS (CI sync-check enforced). - Removed packaging/launchd/com.mdemg.mlx-server.plist + embed.FS mirror. mlx_lm.server is decommissioned and known-broken on M5 + macOS 26.3.x; keeping the template would just risk re-deploying it. - internal/cli/service_darwin.go: launchdServices entry replaced with com.mdemg.llama-server. resolveMLXLMBin renamed to resolveLlamaServerBin with primary env MDEMG_LLAMA_SERVER_BIN, deprecation alias for MDEMG_MLX_LM_BIN (slog.Warn at boot, retained ≥1 release cycle per the Phase 13.6 deprecation pattern), PATH lookup of `llama-server`. resolveMDEMGModelPath default updated to the canonical Phase 13.5 GGUF filepath (.local-models/mdemg-llm-v1-gguf/mdemg-llm-v1.Q5_K_M.gguf) since llama-server takes a `.gguf` filepath, not an HF-format directory like mlx_lm.server. Install error message updated for the new env var name + remediation steps (`brew install llama.cpp`). - migrateLegacyMLXServerPlist() added: if a pre-cutover com.mdemg.mlx-server plist is bootstrapped on the operator's machine, Install() boots it out and renames the file to .disabled-phase13_5 (matches the manual operator convention from Phase 13.5 rollout). Best-effort: failures don't block the install. - internal/cli/service_darwin_test.go fully rewritten: * TestLaunchdServicesIncludesLlamaServer asserts the new entry exists and is Optional=false (production matches Hotfix 11.6.3.1; the old test asserted Optional=true, a latent lie since 2026-05-02 that Linux CI never caught because of //go:build darwin) * TestLlamaServerPlistEmbedded replaces TestMLXServerPlistEmbedded; additionally asserts mlx-server.plist is NOT in embed.FS * Two resolver tests for the primary env var * New TestResolveLlamaServerBinFallsBackToMLXAlias proves the Phase 13.6 deprecation alias path works * resolveMDEMGModelPath tests updated for the new GGUF default - internal/cli/watchdog.go: help text references com.mdemg.llama-server (instead of com.mdemg.mlx-server) and llama-server (instead of mlx_lm.server). Notes that mdemg_mlx_health_state metric name is retained for dashboard compatibility. Tested: - Tier 1 unit: 7/7 new tests pass; full ./internal/cli/... suite green (61s wall-clock). - Tier 2 integration: golangci-lint run ./internal/cli/ — 0 issues. CI plist sync-check (diff -q packaging/launchd/*.plist internal/cli/launchd_templates/) — 6/6 byte-identical. - Tier 3 live e2e: deferred. Running mdemg service install on the operator's currently-serving machine would briefly bootout the running llama-server LaunchAgent (PID 20527 actively serving production inference). The hand-installed llama-server plist on the operator's machine is byte-equivalent (modulo template substitutions) to what this commit will install via `mdemg service install` on a fresh operator setup, so the operator can verify on next planned redeploy. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(sprint): MODEL-DIST-001 sprint plan + quant manifest skeleton Epic 0 of Sprint MODEL-DIST-001 — Local LoRA Distribution via Ollama Library. Sprint plan in 12-section v1.0 format. Supersedes parts of the speculative spec at docs/research/mdemg_sprint_ideas/MDEMG_FT_LORA_PACKAGING_SPEC.md (HF Hub vs Ollama Library; adapter-only vs both-fused-and-adapter; Apple Silicon scope vs cross-platform). Configurability Contract — every operator-visible value is dynamic per the framework's no-hardcoding rule. 12 env vars + flag overrides + sensible defaults. ModelFetcher interface decouples CLI from Ollama-specific knowledge; v1 ships OllamaFetcher only, future backends (HF / S3 / GitHub Release / file) plug in via factory dispatch on MDEMG_MODEL_BACKEND without touching the CLI surface. Forensic from Epic 0: - adapters/tier1/adapters.safetensors verified present (514 MB MLX, Phase 5 SFT Iter 2400 best output) - mdemg-llm-v1.Q5_K_M.gguf SHA256 captured (9.8 GB; 144ad7231...) - f16 GGUF intermediate NOT on disk; Epic 1 will regenerate via convert_hf_to_gguf.py from the MLX merged model (~5 min) - qwen3:14b model-layer digest captured from Ollama registry; manifest digest to be computed at Epic 3 for Modelfile FROM @sha256: pinning quant_manifest.json skeleton with Q5_K_M SHA pre-populated; Q4_K_M / Q8_0 / adapter SHAs filled in during Epics 1+2. Estimated effort 5–7 dev-days. OpenAI spend $0. Risk medium (Ollama publish one-way; MLX→PEFT→GGUF LoRA conversion is the riskiest engineering item with documented contingency to defer to MODEL-DIST-002 if blocked). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(model-dist-001): Epic 1 — built Q4_K_M + Q8_0 fused GGUFs Pipeline (CLAUDE.md Phase 13.5 documented path): 1. mlx_lm.fuse --dequantize: mlx-community/Qwen3-14B-4bit + adapters/tier1/ -> 29.6 GB bf16 HF safetensors at .local-models/qwen3-14b-mdemg-v1-bf16/ 2. convert_hf_to_gguf.py --outtype f16 -> 30 GB f16 GGUF (required neural/.venv interpreter with torch + transformers + gguf installed; /opt/homebrew/bin/convert_hf_to_gguf.py uses system python which lacks these — installed gguf/sentencepiece/protobuf into neural/.venv) 3. llama-quantize Q4_K_M -> 9.0 GB (4.87 BPW; 40s wall on M5) 4. llama-quantize Q8_0 -> 16 GB (8.50 BPW; 11s wall on M5) 5. Live smoke per new quant via llama-server on port 18102 — both serve /v1/models cleanly with embedded chat_template SHAs captured in quant_manifest.json: Q4_K_M: 401161710c22f0ae...411d42ea Q5_K_M: 144ad723101d688f...d5f5d54 (matches Epic 0 baseline) Q8_0: fc14dcb40af1bb58...8db6089 f16: 436cd6f41a684805...3217bd (intermediate, retained for Epic 2) Resource matrix updated with empirical sizes (Q4_K_M is 9.0 GB vs estimated 6.5 GB; min RAM revised 8 -> 12 GB to cover ~3 GB working memory above weights). 14B params x 4.87 BPW ≈ 8.5 GB matches the formula. GGUF binary artifacts stay local — .local-models/ gitignored per .gitignore:70. Sprint deliverable in git is just the manifest update. Production llama-server (PID 20527 on port 8102) undisturbed throughout Epic 1; live smokes used port 18102. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(model-dist-001): Epic 2 — defer adapter to MODEL-DIST-002 Adapter (LoRA-only Modelfile via ADAPTER directive) deferred per the sprint plan's documented contingency clause. Fused-only path (Epics 1, 3, 4, 5) continues — that's the primary operator value. Forensic findings (epic_2_forensic.md): - MLX adapter is well-formed: 560 tensors, 40 layers x 7 target_modules, rank 32, alpha 64, scale 20.0. - convert_lora_to_gguf.py is NOT in brew install llama.cpp; would need manual fetch from llama.cpp source. - MLX -> PEFT requires tensor transposition: MLX lora_a is (in, rank); PEFT expects (rank, in). Same for lora_b. - Estimated 80-95 min to complete vs ~30 min budget remaining for Epic 2. - Hit the contingency criterion: "MLX -> PEFT conversion blocked by tooling gaps." Decision: defer adapter scope to MODEL-DIST-002 (new follow-up sprint, to be planned separately). Fused-only ships this sprint. Knock-on changes (in-flight to subsequent epics): - Epic 3: drop Modelfile.adapter; publish only 3 fused quants. - Epic 4 CLI: --adapter flag accepted at parse-time but errors with "lands in MODEL-DIST-002"; machinery preserved for forward-compat. - Epic 6 e2e: drop adapter-pull step. - Epic 7 feature doc: adapter section notes "coming in MODEL-DIST-002". Artifacts preserved on disk for MODEL-DIST-002 pickup: - adapters/tier1/adapters.safetensors (MLX, 514 MB) - .local-models/mdemg-llm-v1-gguf/mdemg-llm-v1.f16.gguf (30 GB, retained as base for llama-server --lora verification later) quant_manifest.json adapter block updated with status=deferred + reason. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(model-dist-001): Epic 3 — 3 Modelfiles + local ollama create (push pending) Authored 3 Ollama Modelfiles in packaging/ollama/: Modelfile.Q4_K_M — 9.0 GB, 12 GB min RAM, 16 GB recommended Modelfile.Q5_K_M — 11 GB, 14 GB min RAM, 24 GB recommended (production canonical) Modelfile.Q8_0 — 16 GB, 20 GB min RAM, 32 GB recommended Common shape: FROM ./../../.local-models/mdemg-llm-v1-gguf/...gguf relative path (operator-machine local); num_ctx 32768, num_predict 4096, stop tokens <|im_end|>/<|im_start|>; Apache-2.0 LICENSE; SYSTEM positioning block. No TEMPLATE directive — chat template baked into GGUF metadata (Qwen3 chat_template.jinja preserved through mlx_lm.fuse --dequantize → convert_hf → llama-quantize pipeline). packaging/ollama/README.md documents the publish workflow including the fork-customization path (operators publishing under a different namespace follow MDEMG_MODEL_NAMESPACE per the Configurability Contract). Local ollama create completed for all 3: reh3376/mdemg-llm-v1:Q4_K_M ID 5c3a7252c295 reh3376/mdemg-llm-v1:Q5_K_M ID 08c13b480864 reh3376/mdemg-llm-v1:Q8_0 ID 6b1006facd36 Layers de-duplicated: config + params + system layers (3 layers) are identical across all 3 quants; only the model blob (GGUF) differs. ** ollama push deferred ** — one-way action gated on operator confirmation per Sprint Plan §10 Risk #8. Operator must claim reh3376 namespace on ollama.com and generate API token before push proceeds. Local-create proves the Modelfiles are well-formed; push is a separate decision. Once pushed, manifest digests captured into quant_manifest.json (ollama_manifest_digest field per quant) for mdemg model verify. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(model-dist-001): Epic 4 — `mdemg model` CLI + pluggable Fetcher interface Sprint MODEL-DIST-001 Epic 4 — the bulk of the operator-facing surface. New CLI subcommand group: mdemg model pull # fetch + symlink + SHA verify mdemg model list # show pulled models mdemg model verify # re-check SHAs vs quant manifest mdemg model remove # destructive (requires --yes) mdemg model where # print resolved path for shell scripting Pluggable backend (internal/cli/model_fetcher.go): type Fetcher interface { Name, Fetch, Verify, Remove } NewFetcher dispatches on cfg.ModelBackend (env: MDEMG_MODEL_BACKEND) v1 ships OllamaFetcher only; future backends (hf, s3, github-release, file) plug in via factory branch — CLI surface unchanged. OllamaFetcher (internal/cli/model_fetcher_ollama.go): Encapsulates ALL Ollama-specific concepts: `ollama pull` invocation, manifest path under <OLLAMA_MODELS>/manifests/<OLLAMA_HOST>/<ns>/<n>/<tag>, mediaType=application/vnd.ollama.image.model layer filtering, blob path under <OLLAMA_MODELS>/blobs/sha256-<digest>, symlink under <MDEMG_MODEL_DIR>, idempotent. Configurability Contract (no hardcoding; memory: feedback_no_hardcoded_values.md): 12 env vars + flag overrides, each with v1-production-tuned defaults so `mdemg model pull` with no flags Just Works. See sprint plan §3. Live-verified all 3 resolution paths: `--quant Q5_K_M` → namespace=reh3376 `--namespace acme --name custom-model` → namespace=acme name=custom `MDEMG_MODEL_NAMESPACE=acme env` → env overrides applied Added to internal/config/config.go: ModelBackend, ModelNamespace, ModelName, ModelQuants, ModelRamTiers, ModelQuant, AdapterBase, ModelDir, OllamaModelsRoot, OllamaRegistryHost, ModelManifestPath. Embedded quant manifest (internal/cli/quant_manifest.json via embed.FS): Runtime source-of-truth for SHA verification. Operator override via MDEMG_MODEL_MANIFEST_PATH for air-gapped deployments. Mirrors docs/development/model-dist-001/quant_manifest.json. RAM-tier auto-pick: Default JSON `{"<16":"Q4_K_M","<24":"Q5_K_M","default":"Q8_0"}` maps host RAM (sysctl on darwin, /proc/meminfo on linux) to quant. Operator override via MDEMG_MODEL_RAM_TIERS. Adapter path (--adapter flag) returns ErrAdapterDeferred per Epic 2's contingency exit — adapter publication lands in MODEL-DIST-002. Flag machinery preserved for forward compatibility. Tests (22, all green) in internal/cli/model_test.go: - Backend factory dispatch (5 cases incl. case-insensitive, default, error) - Quant allowlist parsing (5 cases incl. whitespace + empty entries) - RAM-tier JSON parsing (default + operator override + malformed) - PickQuantForRAM (7 boundary cases) - ResolveQuant across paths (auto, explicit, rejection, operator-custom) - QuantManifest load (embedded + file override + missing-file error) - Ollama tag composition (fused + adapter forms) - Manifest path composition under custom OLLAMA_MODELS/OLLAMA_HOST - Blob path digest prefix handling - Adapter deferred error - Manifest JSON parser (mediaType filtering + malformed + no-model-layer) Grep audit (verification checklist): grep on internal/cli/model*.go for hardcoded values found only in help text Long/example strings documenting defaults to operators — not in logic. Behavior values all flow through cfg.Model* fields. Build + lint clean. Full cli test suite (61s wall) green. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(model-dist-001): Epic 5 — V0021 model_install_events hypertable + writer Sprint MODEL-DIST-001 Epic 5 — observability for `mdemg model` operations. Grafana panels deferred to Sprint B (Grafana audit). New migration: internal/tsdb/migrations/021_model_install_events.sql Hypertable on recorded_at, 7-day chunks, 3 indexes (quant-time, failed-events partial, backend-event-time). Columns: event_id CUIDv2 PK + recorded_at, event_type (pull/verify/remove), backend_name, namespace, model_name, quant, adapter bool, success bool, latency_ms, sha256, size_bytes, err_message (1 KB cap). New writer: internal/tsdb/model_install_writer.go Synchronous single-row INSERT (not buffered + CopyFrom — CLI is one-shot, writes are infrequent vs the V0017/V0018/V0019/V0020 retrieval- path writers that fire per-request). Nil-pool no-op for degraded mode. errMessageMaxLen=1024 truncation at write time. New modelInstallPool interface (Exec-shaped) avoids touching the existing CopyFrom-shaped poolIface used by buffered writers. Wiring: internal/cli/model.go gets recordModelEvent(parent, cfg, row) helper: - Returns immediately if !cfg.TSDBEnabled || cfg.TSDBHost=="" - 2s timeout on connect (TSDB unreachable doesn't block CLI exit) - Logs warning + degrades gracefully on any TSDB error Called from runModelPull (success + failure paths), runModelVerify (single sweep row), runModelRemove (success + failure paths). Schema version bump: internal/config/config.go: TSDB_REQUIRED_SCHEMA_VERSION default 20→21. CI validator at .github/workflows/ci.yml:60-65 counts SQL files in internal/tsdb/migrations/ and asserts equality; now 21 files = 21 in config = passes. Build + lint clean. Existing tsdb / cli test suites green; no new tests added for the writer itself (single INSERT mirrors V0017/V0018/V0019 patterns already covered; integration is operational verification at Epic 6 once tsdb is up in the dev stack). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(model-dist-001): Epic 7 — local-model-distribution feature doc Sprint MODEL-DIST-001 Epic 7 — operator-facing feature documentation following the standard Why / Choices / How / How-to-use shape (memory: feedback_per_feature_docs_required.md). Contents: - Why: gap between brew install and a working local LLM after Phase 13.5 - Choices: backend matrix (Ollama vs HF vs GitHub vs S3 vs file://), artifact form (fused vs adapter), Apple Silicon scope, "Ollama runtime rejected (broken on M5+macOS 26.3.x), Ollama distribution only" - How it works: ASCII flow diagram covering CLI dispatch -> Fetcher interface -> OllamaFetcher (preflight, ollama pull, manifest discovery, blob resolve, symlink, SHA verify) -> V0021 observability row - How to use: * Quick start (3 commands: brew install ollama, mdemg model pull, curl /v1/models) * Explicit quant selection * Managing pulled models (list / verify / where / remove) * Forks + enterprise (MDEMG_MODEL_NAMESPACE override) * Air-gapped (MDEMG_MODEL_MANIFEST_PATH override) * Resource matrix per quant (disk, min RAM, recommended RAM, BPW) * Full Configurability Contract table (11 env vars + flags + defaults) * V0021 observability schema - Troubleshooting: ollama missing, SHA mismatch, quant allowlist rejection, RAM auto-detection failure, out-of-disk, symlink permission - Forward-looking: MODEL-DIST-002 adapter, Sprint B Grafana panels, future backends, cross-platform - References: all source-of-truth files cross-linked Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(model-dist-001): Epic 8 — Documentation Update (main repo) Sprint MODEL-DIST-001 Epic 8 — final epic, never cut (memory: feedback_sequential_epics.md). This commit lands the main-repo doc updates. The packaging/homebrew-mdemg/ submodule docs (README, CHANGELOG, formula caveats text) update at v0.10.0 release-tag time per the v0.9.0 release flow precedent — that's when goreleaser auto-regenerates mdemg.rb from .goreleaser.yaml's caveats template, and the tap-side README/CHANGELOG get edited in lockstep. Changes: - CHANGELOG.md: comprehensive Unreleased entry documenting Epics 0-5 + 7 landed in this sprint. Epic 3 ollama push and Epic 6 Tier 3 e2e marked as gated on operator confirmation. Adapter path explicitly deferred to MODEL-DIST-002 with epic_2_forensic.md cross-reference. Captures the Configurability Contract enumeration, the 3 quant SHAs, the Fetcher interface design, the V0021 hypertable, and the explicit out-of-scope list. - CLAUDE.md: new "Model Distribution (Sprint MODEL-DIST-001)" subsection in Architecture Notes, slotted ABOVE the existing Compose embed entry for visibility. Captures the pluggable-backend design, the Ollama-as- distribution-only constraint, the on-disk symlink + manifest discovery flow, the 11-knob Configurability Contract surface, the no-hardcoding enforcement, the TSDB V0021 hookup, and the Apple Silicon v1 scope. - README.md: new "Step 2b (optional): Pull the local LLM" section between Step 2 (Initialize/Start) and Open the Dashboard. 3-command quick start (brew install ollama -> mdemg model pull -> set MDEMG_MODEL_PATH). Cross-references the feature doc for the full Configurability Contract. - .goreleaser.yaml: caveats template updated to include `mdemg model pull` instructions. Goreleaser regenerates the homebrew formula's caveats block from this on the next v* tag push, so v0.10.0 will ship the new text to brew users automatically. Deferred to v0.10.0 release-tag time (handled per v0.9.0 precedent): - packaging/homebrew-mdemg/README.md update - packaging/homebrew-mdemg/CHANGELOG.md update - packaging/homebrew-mdemg/mdemg.rb regeneration (automatic via goreleaser from the .goreleaser.yaml change in this commit) - Submodule pointer bump in main repo Deferred to Epic 6 close (after operator does ollama push): - post.md sprint-close document - Capture of remote Ollama manifest digests into quant_manifest.json Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(model-dist-001): Epic 3 closeout — Ollama Library push complete All 3 fused quants now live on Ollama Library: https://ollama.com/reh3376/mdemg-llm-v1:Q4_K_M https://ollama.com/reh3376/mdemg-llm-v1:Q5_K_M https://ollama.com/reh3376/mdemg-llm-v1:Q8_0 End-to-end integrity verified: remote model-layer digests captured via GET https://registry.ollama.ai/v2/reh3376/mdemg-llm-v1/manifests/<quant> match the local Epic 1 SHAs exactly: Q4_K_M 401161710c22f0ae...411d42ea (matches Epic 1) Q5_K_M 144ad723101d688f...d5f5d54 (matches Epic 1) Q8_0 fc14dcb40af1bb58...8db6089 (matches Epic 1) Captured into quant_manifest.json (both docs canonical + internal/cli embed.FS mirror, byte-synced): - ollama_manifest_digest per quant (computed from the manifest body): Q4_K_M sha256:a210cccb12311773fd70bfa81f221ca0f7940a315bef87b84608caf894533b1b Q5_K_M sha256:ae6e54fe1ee0b487ae41260687ed14c46c30d1ffb0fece936282418b5bcb78e1 Q8_0 sha256:93df4d64bfa751506f7afba8bf08b891ea828575b838adec17b9399ad85be718 - Corrected size_bytes (Epic 1 used approximate values; replaced with registry-reported exact bytes for each tag): Q4_K_M 9.0 GB -> 8.4 GB (9001753408 B; was 9658404096) Q5_K_M 11 GB -> 9.8 GB (10514569568 B; was 11811160064) Q8_0 16 GB -> 14.6 GB (15698534208 B; was 17179869184) - Status flipped from "local-create done; push pending" to "published". Embedded runtime manifest (internal/cli/quant_manifest.json) re-built into the binary via embed.FS. TestLoadQuantManifest_EmbeddedFallback green with new values. Epic 3 of Sprint MODEL-DIST-001 now COMPLETE. Epic 6 (Tier 3 live e2e — `mdemg model pull` against the published tags + llama-server load on port 18102 + sanity inference) is now unblocked. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(model-dist-001): sprint close — post.md Sprint MODEL-DIST-001 close-out per memory rule (feedback_sprint_plan_format.md §11 — sprint plans live in docs/development/<sprint-line>/ with the standard post.md companion). Sections (CLAUDE.md sprint-plan section guidance): - Outcome: 3 quants live on Ollama Library, mdemg model pull is the canonical install path - Process: how the plan held under reality (operator-surfaced no- hardcoding rule revised the plan in-place to add the Configurability Contract before code was written) - Findings: 5 smooth parts + 5 friction items, both honest: * convert_hf_to_gguf.py python deps gap (silent ModuleNotFoundError) * mlx_lm.fuse adapter-path requirement * convert_lora_to_gguf.py missing from brew install llama.cpp (proximate Epic 2 deferral trigger) * mdemg tsdb migrate CWD-aware .env loader quirk * Epic 1 size estimates off vs registry-reported exact bytes - Current state: per-layer state matrix - Testing & benchmarking: all 3 tiers documented (Tier 3 e2e captured V0021 rows for both pull + verify event_types — live-verified) - Risks & opportunities (forward): MODEL-DIST-002 adapter scope, Sprint B Grafana, cross-platform, HFFetcher slot, CWD-aware .env loader QoL - Sprint commits: 9 commits on dev01, mapped to their epics Closes Sprint MODEL-DIST-001 functionally. Operational sprint close (v0.10.0 release tag + tap-repo doc updates) is a separate motion. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(release): promote Unreleased -> v0.10.0 Promote the Sprint MODEL-DIST-001 entry from Unreleased to v0.10.0 (2026-05-11) ahead of release.yml / goreleaser tag push. Fresh empty Unreleased section seeded above. v0.10.0 ships: - mdemg model pull|list|verify|remove|where — one-command path from brew install mdemg to a working local LLM - Pluggable ModelFetcher interface (Ollama in v1, slots for HF/S3/GHR/file) - 3 fused GGUF quants live on Ollama Library at reh3376/mdemg-llm-v1 (:Q4_K_M 8.4 GB / :Q5_K_M 9.8 GB / :Q8_0 14.6 GB) - 11-knob Configurability Contract (every operator-visible value dynamic) - TSDB V0021 model_install_events hypertable + writer - docs/features/local-model-distribution.md Adapter (LoRA-only) path deferred to MODEL-DIST-002 per the sprint plan's documented contingency (epic_2_forensic.md). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore(submodule + docs): bump homebrew-mdemg to v0.10.0 + cli-reference Model Distribution section Stage 4 + Stage 5 of v0.10.0 release. Submodule pointer bump: packaging/homebrew-mdemg 6077097 -> c3aa68b incorporates: - 42d7390 — goreleaser auto-bumped mdemg.rb to version "0.10.0" + new caveats text on v0.10.0 tag push - c3aa68b — manual docs round-trip: CHANGELOG v0.10.0 entry, README Optional Pull-the-local-LLM section in Quick Start (full Ollama Library doc with quant matrix, list/verify/where/remove subcommands, fork variants via MDEMG_MODEL_NAMESPACE, architecture note "Ollama is distribution-only"), Upgrading to v0.10.0 + What's New in v0.10.0 blocks, default-LLM rotation history extended, mdemg_beta_testing.md version pin v0.9.0 -> v0.10.0 docs/user/cli-reference.md (per Stage 5 user request to align refs with current codebase): - New ## Model Distribution top-level section before ## Synergy Optimization (model command group is GroupID="config" in root.go but a top-level cli-ref section is cleaner for discoverability). Documents all 5 subcommands (pull, list, verify, remove, where) with flag tables, usage examples, the full Configurability Contract (11 knobs), the architecture note (Ollama is distribution-only). - Updated Environment Variable Reference with new "Model Distribution (Sprint MODEL-DIST-001, v0.10.0)" subsection — 11 env vars + defaults table. - Updated Command Tree Summary with the new model subcommand group slotted between Configuration and Advanced. docs/user/api-reference.md unchanged: Sprint MODEL-DIST-001 added zero HTTP endpoints (CLI-only sprint; observability via TSDB V0021 row writer is server-side internal). Audit also surfaced ~25 routes of pre-existing drift between code and docs (mostly path-parameter notation: `/v1/backup/` in code vs `/v1/backup/{id}` in docs — same routes — plus 3 undocumented /api/graph/* endpoints and 2 undocumented /v1/admin/features/{restart,stop} actions). That drift is out-of-scope for v0.10.0 and belongs in its own follow-up sprint. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(cli): add mdemg model run wrapper (follow-up #1 to MODEL-DIST-001) One-shot or interactive REPL chat against the configured LLM endpoint (default: llama-server at port 8102 per Phase 13.5). Closes the gap operators noted between `ollama run` and the mdemg framework. Two modes: - One-shot: `mdemg model run -p "hello"` or positional arg after `--` - Interactive REPL: no prompt; reads stdin line-by-line, accumulates conversation history across turns Pure stdlib HTTP (no llmclient retries/breakers/recording). CLI invocations are intentionally NOT recorded to llm_interactions — this is an ad-hoc exploration tool, not a production code path; keeping the training-data corpus clean. Every operator-visible value is dynamic per the no-hardcoding rule: --endpoint override cfg.EffectiveLLMEndpoint --model override cfg.LLMModel (final fallback: mdemg-llm-v1) --prompt/-p one-shot prompt (omit for REPL) --system/-s system message --temperature (default 0.7) --max-tokens (default 1024) --timeout (default 60s) Live-verified end-to-end on the operator's running llama-server on port 8102 with mdemg-llm-v1: one-shot worked; system+prompt with --model override worked. 13 unit tests in model_run_test.go covering: message composition (system first, no-system skip, history preservation), config resolution (flag > cfg > final fallback), OpenAI-compat HTTP shape, error paths (HTTP error, inline error object, no choices, timeout), trailing-slash endpoint normalization, body-bounding helper. All green. Renamed local body-bounding helper to `truncateRunBody` to avoid name collision with a same-named helper in internal/cli/data.go. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(api): document 19 previously-undocumented endpoints (follow-up #2) Audit of internal/api/server.go (167 routes) vs docs/user/api-reference.md surfaced 19 genuinely missing endpoints. v0.10.0 commit noted this as out-of-scope; this commit resolves the gap. Audit method: extract mux.HandleFunc registrations from server.go, extract documented "VERB /path" headings from api-reference.md, normalize both to strip path parameters and trailing prefix slashes, diff. Of the initial 24-entry code-only set, 5 are false positives (combined headers like "POST /v1/admin/features/start|stop|restart" cover the individual verbs; "GET|POST /v1/jiminy/protocol/metrics" covers both methods on one route). Added sections: Jiminy / J17 (10 endpoints, all under "## Jiminy Inner-Voice"): GET|POST /v1/jiminy/protocol/metrics # snapshot + reset GET /v1/jiminy/protocol/status # per-session J17 state POST /v1/jiminy/checkpoint # tier-transition checkpoint POST /v1/jiminy/resume-protocol # restore from checkpoint POST /v1/jiminy/extension # operator-driven tier hold POST /v1/jiminy/strict # toggle strict mode per session POST /v1/jiminy/reformulate # advisory -> imperative rewrite POST /v1/jiminy/classify # pre-Write/Edit pass/deny gate GET /v1/jiminy/latest # most recent guidance (warm store) POST /v1/jiminy/warm # eager cache warmup Memory / Graph (3 endpoints, under "## Memory Operations"): GET /v1/memory/graph/topology # node/edge counts per layer GET /v1/memory/graph/neighborhood # local 1-3 hop walk GET /v1/memory/spaces # root listing of all spaces Observability (2 endpoints, under "## Metrics & Monitoring"): GET /v1/metrics/trends # TSDB time-series query GET /v1/prometheus # Prometheus scrape endpoint Dashboard / Viz (4 endpoints, new "## Dashboard / Visualization (internal)" section before MCP Server Tools — operator-internal endpoints backing the browser dashboard at /ui/): GET /api/graph/data # force-directed graph data GET /api/graph/fields # schema field catalog GET /api/graph/health # explorer health GET /viz/topology # standalone HTML topology view Each entry has handler-signature-derived request/response shape, query parameter table, sample curl/JSON examples following the existing api-reference convention. TOC updated with new "Dashboard / Visualization (internal)" entry and renumbered tail. Out of scope (deliberate, deferred): - 28 "docs-only" entries from the audit are confirmed false positives from prefix-matching path normalization (code registers /v1/memory/nodes/ with trailing slash and routes the suffix; docs spell out the full /v1/memory/nodes/{node_id}/archive form correctly) - /v1/symbols root path is partially covered by /v1/symbols/relationships + /v1/symbols/{id}/relationships in docs; root listing endpoint documentation can land later if/when its handler grows specific shape - /v1/conversation/observations covered indirectly by the flag-for-org endpoint documentation Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(grafana-audit): Epic 0 — sprint plan + audit harness Sprint GRAFANA-AUDIT-001 Epic 0. Builds the per-panel audit harness: walks every panel in deploy/docker/grafana/dashboards/*.json, extracts rawSql/sql targets, substitutes Grafana macros (\$__timeFilter, \$__timeFrom/To, \$__interval, \$__unixEpoch*) + template variables (\$space_id, \$instance + multi-value variants like \${space_id:raw}), executes via docker exec mdemg-timescaledb-1 psql, classifies each panel target as PASS / EMPTY / FAIL / SKIP. Tier 1 unit tests (17 tests, all green): - Template-variable substitution: time_filter / from-to / unix epoch / interval / interval_ms / space_id (3 syntaxes) / instance (3 syntaxes) / multi-macro composite query - Table extraction (FROM/JOIN with alias, case-insensitive, no-table) - Panel walking (flat, nested rows, targets-with-sql vs no-sql) Smoke test against mdemg-overview.json IMMEDIATELY validated the operator's "diminished observability" report — 5 of 13 panels FAIL, 1 EMPTY, 7 PASS on the front-page dashboard: FAIL Request Rate FAIL Error Rate FAIL Circuit Breakers FAIL Requests by Status FAIL Rate Limit Rejections EMPTY Request Latency Distribution (t0; t1/t2 PASS) The original 11-panel sample missed these because it sampled different panels. Lesson: trust the rigorous audit, not the sample. Sprint proceeds to Epic 1 (full audit across all 146 panels) immediately. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(grafana-audit): Epic 1 + 2 — full audit + findings Sprint GRAFANA-AUDIT-001 Epics 1 + 2. Per-panel rigorous audit of all 165 target executions across 146 panels in 8 dashboards. Headline: PASS 125 (76%) — executes, returns rows in 24h window EMPTY 19 (12%) — executes, 0 rows FAIL 3 (2%) — SQL error SKIP 18 (11%) — non-SQL panel types Harness fix mid-Epic-1: \$__interval substitution was wrapping the value in quotes, but Grafana convention has panel SQL provide its own outer quotes — producing doubled quotes and 18 false-positive FAILs. Fixed: substitute bare value. Verified by re-run: 20→3 FAILs. Real failures (Epic 2 findings): (a) 3 SQL bugs on mdemg-llm-routing.json — all three panels hardcoded `mdemg-dev` (unquoted) in WHERE clauses instead of '\$space_id' template variable. PG parses `mdemg-dev` as subtraction. (b) 5 schema-drift EMPTYs — panel filter expects metric_type or labels shape that doesn't match server emission: - mdemg_j17_events_total: panel 'counter', server 'gauge' - mdemg_rsic_action_total: panel status='success', server status='completed' - 2 more suspected pending full-SQL inspection. (c) 2 missing-server-side metrics — mdemg_rate_limit_rejected_total and mdemg_http_request_duration_seconds_p50 not emitted. Will be documented; server emission is follow-up. (d) ~11 sparse-data EMPTYs — panel SQL correct, no rows in 24h window. Widening time-range in Epic 4. Projected post-Epic-3/4: 133 PASS, ≤11 EMPTY, 0 FAIL. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(grafana): Epic 3 — 5 panels recovered (3 FAIL + 2 schema-drift) Sprint GRAFANA-AUDIT-001 Epic 3. Minimum-change JSON edits to fix category (a) SQL bugs and category (b) schema-drift EMPTYs identified in Epic 1/2. mdemg-llm-routing.json (3 panels, all category-a SQL bugs): - LLM call distribution by model_name (24h) - LLM latency p50 / p95 / p99 by task × model - LLM error rate % by task_name (selected range) Bug: WHERE clause was `(\$space_id = '' OR space_id = '\$space_id')` — the first \$space_id was unquoted, so PG parsed `mdemg-dev = ''` as `column "mdemg-dev"` which doesn't exist. Also breached the no-hardcoding rule (memory: feedback_no_hardcoded_values.md). Fix: wrap the first variable reference in quotes → `('\$space_id' = '' OR space_id = '\$space_id')` — a proper string-literal comparison that also serves as the All-spaces guard the panel author intended. Verdict: 3 FAIL -> 3 PASS. mdemg-llm-routing is now 4/4 PASS. mdemg-j17.json :: Total Events (1 panel, category-b drift): Panel filtered `metric_type = 'counter'` (Prometheus naming convention because metric is `mdemg_j17_events_total`). Server actually emits `metric_type = 'gauge'`. 6,393 rows in 7d; 0 panel matches. Fix: align panel filter to `'gauge'`. Verdict: EMPTY -> PASS. mdemg-rsic.json :: Action Success Rate t0 (1 panel target, category-b drift): Panel filtered `labels->>'status' = 'success'`. Server actually emits `'completed'` (181 rows in 24h; 0 panel matches). Fix: align panel filter to `'completed'`. The t1 'failed' target retained unchanged — its EMPTY result is now accurate observation (server emits no `'failed'` actions; 0 = legitimate zero). Verdict: 1/2 EMPTY -> PASS, 1/2 EMPTY accurate-zero. Audit verdict counts: Before: 125 PASS, 19 EMPTY, 3 FAIL, 18 SKIP After: 130 PASS, 17 EMPTY, 0 FAIL, 18 SKIP Remaining 17 EMPTYs (Epic 4 disposition): - 5 category-c emission regression — 4 rsic metrics stopped at 2026-05-07/08 (server-side investigation queued as follow-up) - 2 category-c never-emitted — Rate Limit Rejections, p50 latency - 8 category-d sparse-data on ft-training — widen time-range - 1 mdemg-jiminy :: Effectiveness Trends — CTE pending inspection - 1 mdemg-rsic :: Action Success Rate t1 (accurate-zero) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(grafana-audit): Epic 4 + 7 — feature doc + sprint close Sprint GRAFANA-AUDIT-001 closeout (Epics 4 + 5 + 6 + 7 combined as a single doc-only commit; Epic 5 deferred and Epic 6 deferred-to-operator as documented in post.md). New: docs/features/observability-dashboards.md (286 lines) — full operator-facing inventory of the 8 dashboards with: - Per-dashboard purpose + panel count + primary tables - Audit verdict table (130/17/0/18 post-Epic-3) - Epic 3 fix log: 3 SQL bugs + 2 schema-drift filters - Known gaps in 3 buckets: (c) emission regression (4 May-7-8 metrics, current codebase has zero refs — server removed emission), (c) never-emitted (mdemg_rate_limit_rejected_total + mdemg_http_request_duration_seconds_p50), (d) sparse/zero data on this dev TSDB (ft-training tables) - Refresh expectations per table - Operator playbook for re-running scripts/grafana_panel_audit.py - Forward-looking: CI integration, coverage expansion, server-side emission restore New: docs/development/grafana-audit-001/post.md — sprint close per memory rule, covers process / smooth-parts / friction / sprint-plan vs reality / current state / risks-opportunities / commits. Epic deferrals (documented in post.md): - Epic 5 (coverage expansion for 11 unused TSDB tables): deferred because most target tables are zero on this dev TSDB. Adding panels would create more EMPTYs, defeating the goal. - Epic 6 (Tier 3 browser e2e): deferred to operator; not blocking. CHANGELOG Unreleased entry covers the sprint at high level + cross- references the feature doc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(model-dist-002): Epic 0 — sprint plan + workspace prep Sprint MODEL-DIST-002 picks up the adapter-only path deferred from MODEL-DIST-001 Epic 2. Resolves the tooling gap documented in epic_2_forensic.md. Workspace prep: - Vendored convert_lora_to_gguf.py from llama.cpp source (master, pinned 2026-05-21) into scripts/vendor/llama_cpp/ with MIT LICENSE attribution and a README documenting refresh policy. brew install llama.cpp ships convert_hf_to_gguf.py but NOT convert_lora_to_gguf.py; vendoring is the cleanest path (vs requiring operators to clone llama.cpp source). - pip install peft==0.19.1 + accelerate==1.13.0 + psutil==7.2.2 into neural/.venv (the same venv that has torch + transformers + gguf from MODEL-DIST-001 Epic 1). PEFT is needed for PEFT-format schema validation + as a dependency of convert_lora_to_gguf.py. - Inspected convert_lora_to_gguf.py — expects directory with adapter_config.json + adapter_model.safetensors in PEFT layout. Confirms the MLX → PEFT direction is `lora_A: (rank, input)` and `lora_B: (output, rank)` (script line 41-42 docstring). Sprint plan in 12-section v1.0 format. 7 epics, 1-2 dev-day estimate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(model-dist-002): Epic 1-3 — MLX adapter → PEFT → GGUF LoRA + live verify Sprint MODEL-DIST-002 Epics 1, 2, 3 (combined commit). Epic 1 — MLX → PEFT converter (scripts/mlx_adapter_to_peft.py + 14 unit tests): Reads adapters/tier1/adapters.safetensors (514 MB MLX format, 560 tensors, Phase 5 SFT Iter 2400 best). Per the analysis in MODEL-DIST-001 epic_2_forensic.md: Key rename: model.layers.<N>.<module>.lora_a -> base_model.model.model.layers.<N>.<module>.lora_A.weight Tensor transpose: lora_a (input,rank) -> (rank,input) lora_b (rank,output) -> (output,rank) Emits PEFT-format adapter_config.json + adapter_model.safetensors. Single-adapter PEFT layout (.lora_A.weight, not .lora_A.default.weight) required by convert_lora_to_gguf.py. Epic 2 — PEFT → GGUF LoRA (scripts/vendor/llama_cpp/convert_lora_to_gguf.py): Pinned to llama.cpp release b9000 (self-contained version; upstream master refactored to a conversion/ Python package with 30+ model files, excessive vendoring scope). README documents refresh policy. Output: .local-models/mdemg-llm-v1-adapter.gguf SHA256: 0cfaf4bae3215a4aea664a8d28ae9a41d73ee740cbcce5c2eef950232cfe1de5 Size: 257 MB (vs ~9 GB fused Q5_K_M; ~35x smaller download) Tensor count: 560 (matches expected 40 layers x 7 target_modules x 2) Epic 3 — Live verification (docs/development/model-dist-002/verification.md): Side-port llama-server on 127.0.0.1:18103 with f16 base + adapter; sanity prompt vs production 8102 fused model returns semantically-aligned outputs on the same prompt — both describe MDEMG as a knowledge-graph memory system. Confirms the MLX-PEFT-GGUF chain is structurally correct. Iteration during Epic 2 (worth noting): - Initial vendored convert_lora_to_gguf.py from upstream master failed with ImportError (refactored to use conversion/ package). Pinned to b9000 release which is self-contained. - Initial PEFT keys used .default.weight suffix (multi-adapter layout); convert_lora_to_gguf.py rejected with \"Not a lora_A or lora_B tensor.\" Switched to single-adapter layout (.weight) which the script accepts. Test results: 14/14 Tier 1 tests green; PEFT output loads via peft.PeftConfig.from_pretrained; GGUF emission completes with all 560 tensors; runtime adapter application produces coherent outputs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(model-dist-002): Epic 4 local — Modelfile.adapter + ollama create Authored packaging/ollama/Modelfile.adapter: FROM qwen3:14b ADAPTER ../../.local-models/mdemg-llm-v1-adapter.gguf PARAMETER num_ctx 32768 num_predict 4096 stop "<|im_end|>" stop "<|im_start|>" SYSTEM (Qwen3-14B mdemg fine-tune positioning) LICENSE Apache 2.0 (inherits from base) Local ollama create succeeded: reh3376/mdemg-llm-v1-adapter:latest Local ID dda290492091 Layers: qwen3:14b base (a8cc1361...) + adapter blob (0cfaf4ba...) + template + license + params + system quant_manifest.json adapter block updated: status: "deferred to MODEL-DIST-002" -> "local-create done; push pending" sha256, size_bytes, ollama_local_id captured pipeline field added (MLX -> PEFT -> GGUF LoRA chain) Push is operator-gated per MODEL-DIST-001 pattern (one-way action). After push, ollama_manifest_digest will be captured and embedded quant_manifest.json will be updated alongside. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(cli): enable mdemg model pull --adapter (MODEL-DIST-002 Epic 5+6) Lifts the ErrAdapterDeferred guard from MODEL-DIST-001's deferred adapter path now that reh3376/mdemg-llm-v1-adapter:latest is published. CLI changes: - model_fetcher_ollama.go: removed deferral guard from Fetch; switched readModelBlobDigest to target application/vnd.ollama.image.adapter mediaType for adapter pulls; added destFilename() helper so adapter symlinks land at <name>-adapter.gguf (no quant suffix). - model.go: SHA verify in runModelPull now branches on req.Adapter to look up mf.Adapter when pulling the adapter form; tag printout shows <ns>/<name>-adapter:latest for adapter pulls instead of the resolved fused quant. - model_fetcher.go: ErrAdapterDeferred sentinel retained for future non-Ollama backends that ship fused-only first; not currently returned. QuantManifest gained Adapter *QuantRecord field. Manifest updates (both embedded + canonical): - adapter SHA256 0cfaf4bae3215a4aea664a8d28ae9a41d73ee740cbcce5c2eef950232cfe1de5 - Ollama manifest digest sha256:57b98b97ede0e340e8c530aabf579136616ba670281fe04b14777164e655c278 - ollama_media_type application/vnd.ollama.image.adapter Tests: - Removed TestOllamaFetcher_AdapterDeferred. - Added TestDestFilename_FusedQuantAndAdapter (6 cases). - Added TestOllamaFetcher_ReadAdapterBlobDigest_FiltersOnAdapterMediaType. Tier 3 live e2e: mdemg model pull --adapter completed in 987 ms, SHA verify ok, symlink at ~/.mdemg/models/mdemg-llm-v1-adapter.gguf, and llama-server --lora produced coherent inference against the symlinked adapter ("MDEMG is a knowledge graph memory system..."). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(model-dist-002): flip adapter section to shipped + sprint close Epic 7 (Documentation Update — never cut). - docs/features/local-model-distribution.md: adapter section flipped from "deferred to MODEL-DIST-002" to "shipped 2026-05-25"; status header updated; Configurability Contract table adds --adapter flag row. - CHANGELOG.md: Unreleased gains "Sprint MODEL-DIST-002 — Adapter-only distribution path shipped" entry with full pipeline + verification + SHA + Ollama manifest digest. - CLAUDE.md Model Distribution architecture note: replaces "adapter-only deferred to MODEL-DIST-002+" with the operator-facing recipe and the pinned-toolchain pointer. - docs/development/model-dist-002/post.md: sprint close with epic-by-epic outcomes, acceptance criteria check-off, surprise log, and forward- looking notes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Roger Henley <rogerhenley345@gmail.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* docs(release): promote Unreleased -> v0.9.0
Promote the Unreleased CHANGELOG block to v0.9.0 (2026-05-06) ahead of
release.yml / goreleaser tag push.
New ### Breaking subsection captures two operator-visible cutovers since
v0.8.5: (1) Phase 13.5 LLM runtime port 8101 -> 8102 + .env migration
required; (2) Phase 13.6 MLX_* -> LLM_* env-var rename (legacy aliases
retained for >= 1 release cycle).
New ### Added entries: Phase 10.5 closeout (UBENCH framework promotion,
commit 0389b49) and Claude Code GitHub App workflows (PRs #378, #379).
All previously-Unreleased entries (Phase 14.2.3, 14.2.x, 14.1.x, 14, 13.6,
13.5, 13.2, 13.1) carried forward unchanged into the v0.9.0 block. Fresh
empty Unreleased section seeded above.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* chore(submodule): bump homebrew-mdemg to v0.9.0 formula + docs
Bumps packaging/homebrew-mdemg pointer a235977 -> 6077097, which incorporates:
- f9358cd Brew formula update for mdemg version v0.8.5 (goreleaser, prior)
- b4a0d2c Brew formula update for mdemg version v0.9.0 (goreleaser, this release)
- 6077097 docs: v0.9.0 -- CHANGELOG, README What's New, beta-testing version pin
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(api): /healthz returns build-time version, not stale literal "0.6.0"
`config.FromEnv()` defaulted MdemgVersion/MdemgCommit to literal "0.6.0"/
"unknown" when MDEMG_VERSION/MDEMG_COMMIT envs were unset. Both /healthz
and /readyz serialize cfg.MdemgVersion, so they reported "0.6.0" forever
regardless of the actual binary's ldflags-injected cli.Version.
Fix: defaults to "" in config; cli/config_loader.go injects cli.Version /
cli.Commit (the build-time vars set by goreleaser ldflags) when the env
override is unset. Operators can still pin via MDEMG_VERSION env.
Live-verified: dev build (no ldflags) now reports {"version":"dev"} on
/healthz instead of the lying "0.6.0". Production builds via goreleaser
will report the real semver tag.
TestHandleHealthz unaffected (sets cfg.MdemgVersion directly).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(service): replace decommissioned mlx-server LaunchAgent with llama-server
Phase 13.5 cutover (2026-05-03) replaced mlx_lm.server (port 8101) with
llama.cpp llama-server (port 8102) as the production LLM runtime, but the
embedded launchd plist template + service install code paths were never
updated. Any operator running 'mdemg service install' from a fresh checkout
got the decommissioned mlx_lm.server agent — mdemg's startup preflight then
failed because LLM_ENDPOINT=http://127.0.0.1:8102/v1 wasn't reachable.
Changes:
- New packaging/launchd/com.mdemg.llama-server.plist with the Phase 13.5
production flags (--ctx-size 32768 --parallel 4 --cont-batching --metrics
--jinja). Byte-identical mirror at internal/cli/launchd_templates/ for
the embed.FS (CI sync-check enforced).
- Removed packaging/launchd/com.mdemg.mlx-server.plist + embed.FS mirror.
mlx_lm.server is decommissioned and known-broken on M5 + macOS 26.3.x;
keeping the template would just risk re-deploying it.
- internal/cli/service_darwin.go: launchdServices entry replaced with
com.mdemg.llama-server. resolveMLXLMBin renamed to resolveLlamaServerBin
with primary env MDEMG_LLAMA_SERVER_BIN, deprecation alias for
MDEMG_MLX_LM_BIN (slog.Warn at boot, retained ≥1 release cycle per the
Phase 13.6 deprecation pattern), PATH lookup of `llama-server`.
resolveMDEMGModelPath default updated to the canonical Phase 13.5 GGUF
filepath (.local-models/mdemg-llm-v1-gguf/mdemg-llm-v1.Q5_K_M.gguf) since
llama-server takes a `.gguf` filepath, not an HF-format directory like
mlx_lm.server. Install error message updated for the new env var name +
remediation steps (`brew install llama.cpp`).
- migrateLegacyMLXServerPlist() added: if a pre-cutover com.mdemg.mlx-server
plist is bootstrapped on the operator's machine, Install() boots it out
and renames the file to .disabled-phase13_5 (matches the manual operator
convention from Phase 13.5 rollout). Best-effort: failures don't block
the install.
- internal/cli/service_darwin_test.go fully rewritten:
* TestLaunchdServicesIncludesLlamaServer asserts the new entry exists
and is Optional=false (production matches Hotfix 11.6.3.1; the old
test asserted Optional=true, a latent lie since 2026-05-02 that
Linux CI never caught because of //go:build darwin)
* TestLlamaServerPlistEmbedded replaces TestMLXServerPlistEmbedded;
additionally asserts mlx-server.plist is NOT in embed.FS
* Two resolver tests for the primary env var
* New TestResolveLlamaServerBinFallsBackToMLXAlias proves the
Phase 13.6 deprecation alias path works
* resolveMDEMGModelPath tests updated for the new GGUF default
- internal/cli/watchdog.go: help text references com.mdemg.llama-server
(instead of com.mdemg.mlx-server) and llama-server (instead of
mlx_lm.server). Notes that mdemg_mlx_health_state metric name is
retained for dashboard compatibility.
Tested:
- Tier 1 unit: 7/7 new tests pass; full ./internal/cli/... suite green
(61s wall-clock).
- Tier 2 integration: golangci-lint run ./internal/cli/ — 0 issues.
CI plist sync-check (diff -q packaging/launchd/*.plist
internal/cli/launchd_templates/) — 6/6 byte-identical.
- Tier 3 live e2e: deferred. Running mdemg service install on the
operator's currently-serving machine would briefly bootout the running
llama-server LaunchAgent (PID 20527 actively serving production
inference). The hand-installed llama-server plist on the operator's
machine is byte-equivalent (modulo template substitutions) to what
this commit will install via `mdemg service install` on a fresh
operator setup, so the operator can verify on next planned redeploy.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(sprint): MODEL-DIST-001 sprint plan + quant manifest skeleton
Epic 0 of Sprint MODEL-DIST-001 — Local LoRA Distribution via Ollama Library.
Sprint plan in 12-section v1.0 format. Supersedes parts of the speculative spec
at docs/research/mdemg_sprint_ideas/MDEMG_FT_LORA_PACKAGING_SPEC.md (HF Hub vs
Ollama Library; adapter-only vs both-fused-and-adapter; Apple Silicon scope vs
cross-platform).
Configurability Contract — every operator-visible value is dynamic per the
framework's no-hardcoding rule. 12 env vars + flag overrides + sensible
defaults. ModelFetcher interface decouples CLI from Ollama-specific knowledge;
v1 ships OllamaFetcher only, future backends (HF / S3 / GitHub Release / file)
plug in via factory dispatch on MDEMG_MODEL_BACKEND without touching the CLI
surface.
Forensic from Epic 0:
- adapters/tier1/adapters.safetensors verified present (514 MB MLX, Phase 5
SFT Iter 2400 best output)
- mdemg-llm-v1.Q5_K_M.gguf SHA256 captured (9.8 GB; 144ad7231...)
- f16 GGUF intermediate NOT on disk; Epic 1 will regenerate via
convert_hf_to_gguf.py from the MLX merged model (~5 min)
- qwen3:14b model-layer digest captured from Ollama registry; manifest digest
to be computed at Epic 3 for Modelfile FROM @sha256: pinning
quant_manifest.json skeleton with Q5_K_M SHA pre-populated; Q4_K_M / Q8_0 /
adapter SHAs filled in during Epics 1+2.
Estimated effort 5–7 dev-days. OpenAI spend $0. Risk medium (Ollama publish
one-way; MLX→PEFT→GGUF LoRA conversion is the riskiest engineering item with
documented contingency to defer to MODEL-DIST-002 if blocked).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(model-dist-001): Epic 1 — built Q4_K_M + Q8_0 fused GGUFs
Pipeline (CLAUDE.md Phase 13.5 documented path):
1. mlx_lm.fuse --dequantize: mlx-community/Qwen3-14B-4bit + adapters/tier1/
-> 29.6 GB bf16 HF safetensors at .local-models/qwen3-14b-mdemg-v1-bf16/
2. convert_hf_to_gguf.py --outtype f16 -> 30 GB f16 GGUF (required
neural/.venv interpreter with torch + transformers + gguf installed;
/opt/homebrew/bin/convert_hf_to_gguf.py uses system python which lacks
these — installed gguf/sentencepiece/protobuf into neural/.venv)
3. llama-quantize Q4_K_M -> 9.0 GB (4.87 BPW; 40s wall on M5)
4. llama-quantize Q8_0 -> 16 GB (8.50 BPW; 11s wall on M5)
5. Live smoke per new quant via llama-server on port 18102 — both serve
/v1/models cleanly with embedded chat_template
SHAs captured in quant_manifest.json:
Q4_K_M: 401161710c22f0ae...411d42ea
Q5_K_M: 144ad723101d688f...d5f5d54 (matches Epic 0 baseline)
Q8_0: fc14dcb40af1bb58...8db6089
f16: 436cd6f41a684805...3217bd (intermediate, retained for Epic 2)
Resource matrix updated with empirical sizes (Q4_K_M is 9.0 GB vs estimated
6.5 GB; min RAM revised 8 -> 12 GB to cover ~3 GB working memory above
weights). 14B params x 4.87 BPW ≈ 8.5 GB matches the formula.
GGUF binary artifacts stay local — .local-models/ gitignored per
.gitignore:70. Sprint deliverable in git is just the manifest update.
Production llama-server (PID 20527 on port 8102) undisturbed throughout
Epic 1; live smokes used port 18102.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(model-dist-001): Epic 2 — defer adapter to MODEL-DIST-002
Adapter (LoRA-only Modelfile via ADAPTER directive) deferred per the sprint
plan's documented contingency clause. Fused-only path (Epics 1, 3, 4, 5)
continues — that's the primary operator value.
Forensic findings (epic_2_forensic.md):
- MLX adapter is well-formed: 560 tensors, 40 layers x 7 target_modules,
rank 32, alpha 64, scale 20.0.
- convert_lora_to_gguf.py is NOT in brew install llama.cpp; would need
manual fetch from llama.cpp source.
- MLX -> PEFT requires tensor transposition: MLX lora_a is (in, rank);
PEFT expects (rank, in). Same for lora_b.
- Estimated 80-95 min to complete vs ~30 min budget remaining for Epic 2.
- Hit the contingency criterion: "MLX -> PEFT conversion blocked by
tooling gaps."
Decision: defer adapter scope to MODEL-DIST-002 (new follow-up sprint, to
be planned separately). Fused-only ships this sprint.
Knock-on changes (in-flight to subsequent epics):
- Epic 3: drop Modelfile.adapter; publish only 3 fused quants.
- Epic 4 CLI: --adapter flag accepted at parse-time but errors with
"lands in MODEL-DIST-002"; machinery preserved for forward-compat.
- Epic 6 e2e: drop adapter-pull step.
- Epic 7 feature doc: adapter section notes "coming in MODEL-DIST-002".
Artifacts preserved on disk for MODEL-DIST-002 pickup:
- adapters/tier1/adapters.safetensors (MLX, 514 MB)
- .local-models/mdemg-llm-v1-gguf/mdemg-llm-v1.f16.gguf (30 GB,
retained as base for llama-server --lora verification later)
quant_manifest.json adapter block updated with status=deferred + reason.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(model-dist-001): Epic 3 — 3 Modelfiles + local ollama create (push pending)
Authored 3 Ollama Modelfiles in packaging/ollama/:
Modelfile.Q4_K_M — 9.0 GB, 12 GB min RAM, 16 GB recommended
Modelfile.Q5_K_M — 11 GB, 14 GB min RAM, 24 GB recommended (production canonical)
Modelfile.Q8_0 — 16 GB, 20 GB min RAM, 32 GB recommended
Common shape: FROM ./../../.local-models/mdemg-llm-v1-gguf/...gguf relative
path (operator-machine local); num_ctx 32768, num_predict 4096, stop tokens
<|im_end|>/<|im_start|>; Apache-2.0 LICENSE; SYSTEM positioning block.
No TEMPLATE directive — chat template baked into GGUF metadata (Qwen3
chat_template.jinja preserved through mlx_lm.fuse --dequantize → convert_hf
→ llama-quantize pipeline).
packaging/ollama/README.md documents the publish workflow including the
fork-customization path (operators publishing under a different namespace
follow MDEMG_MODEL_NAMESPACE per the Configurability Contract).
Local ollama create completed for all 3:
reh3376/mdemg-llm-v1:Q4_K_M ID 5c3a7252c295
reh3376/mdemg-llm-v1:Q5_K_M ID 08c13b480864
reh3376/mdemg-llm-v1:Q8_0 ID 6b1006facd36
Layers de-duplicated: config + params + system layers (3 layers) are
identical across all 3 quants; only the model blob (GGUF) differs.
** ollama push deferred ** — one-way action gated on operator confirmation
per Sprint Plan §10 Risk #8. Operator must claim reh3376 namespace on
ollama.com and generate API token before push proceeds. Local-create proves
the Modelfiles are well-formed; push is a separate decision.
Once pushed, manifest digests captured into quant_manifest.json
(ollama_manifest_digest field per quant) for mdemg model verify.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(model-dist-001): Epic 4 — `mdemg model` CLI + pluggable Fetcher interface
Sprint MODEL-DIST-001 Epic 4 — the bulk of the operator-facing surface.
New CLI subcommand group:
mdemg model pull # fetch + symlink + SHA verify
mdemg model list # show pulled models
mdemg model verify # re-check SHAs vs quant manifest
mdemg model remove # destructive (requires --yes)
mdemg model where # print resolved path for shell scripting
Pluggable backend (internal/cli/model_fetcher.go):
type Fetcher interface { Name, Fetch, Verify, Remove }
NewFetcher dispatches on cfg.ModelBackend (env: MDEMG_MODEL_BACKEND)
v1 ships OllamaFetcher only; future backends (hf, s3, github-release,
file) plug in via factory branch — CLI surface unchanged.
OllamaFetcher (internal/cli/model_fetcher_ollama.go):
Encapsulates ALL Ollama-specific concepts: `ollama pull` invocation,
manifest path under <OLLAMA_MODELS>/manifests/<OLLAMA_HOST>/<ns>/<n>/<tag>,
mediaType=application/vnd.ollama.image.model layer filtering,
blob path under <OLLAMA_MODELS>/blobs/sha256-<digest>, symlink under
<MDEMG_MODEL_DIR>, idempotent.
Configurability Contract (no hardcoding; memory: feedback_no_hardcoded_values.md):
12 env vars + flag overrides, each with v1-production-tuned defaults so
`mdemg model pull` with no flags Just Works. See sprint plan §3.
Live-verified all 3 resolution paths:
`--quant Q5_K_M` → namespace=reh3376
`--namespace acme --name custom-model` → namespace=acme name=custom
`MDEMG_MODEL_NAMESPACE=acme env` → env overrides applied
Added to internal/config/config.go: ModelBackend, ModelNamespace,
ModelName, ModelQuants, ModelRamTiers, ModelQuant, AdapterBase,
ModelDir, OllamaModelsRoot, OllamaRegistryHost, ModelManifestPath.
Embedded quant manifest (internal/cli/quant_manifest.json via embed.FS):
Runtime source-of-truth for SHA verification. Operator override via
MDEMG_MODEL_MANIFEST_PATH for air-gapped deployments. Mirrors
docs/development/model-dist-001/quant_manifest.json.
RAM-tier auto-pick:
Default JSON `{"<16":"Q4_K_M","<24":"Q5_K_M","default":"Q8_0"}` maps
host RAM (sysctl on darwin, /proc/meminfo on linux) to quant. Operator
override via MDEMG_MODEL_RAM_TIERS.
Adapter path (--adapter flag) returns ErrAdapterDeferred per Epic 2's
contingency exit — adapter publication lands in MODEL-DIST-002. Flag
machinery preserved for forward compatibility.
Tests (22, all green) in internal/cli/model_test.go:
- Backend factory dispatch (5 cases incl. case-insensitive, default, error)
- Quant allowlist parsing (5 cases incl. whitespace + empty entries)
- RAM-tier JSON parsing (default + operator override + malformed)
- PickQuantForRAM (7 boundary cases)
- ResolveQuant across paths (auto, explicit, rejection, operator-custom)
- QuantManifest load (embedded + file override + missing-file error)
- Ollama tag composition (fused + adapter forms)
- Manifest path composition under custom OLLAMA_MODELS/OLLAMA_HOST
- Blob path digest prefix handling
- Adapter deferred error
- Manifest JSON parser (mediaType filtering + malformed + no-model-layer)
Grep audit (verification checklist):
grep on internal/cli/model*.go for hardcoded values found only in help
text Long/example strings documenting defaults to operators — not in
logic. Behavior values all flow through cfg.Model* fields.
Build + lint clean. Full cli test suite (61s wall) green.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(model-dist-001): Epic 5 — V0021 model_install_events hypertable + writer
Sprint MODEL-DIST-001 Epic 5 — observability for `mdemg model` operations.
Grafana panels deferred to Sprint B (Grafana audit).
New migration:
internal/tsdb/migrations/021_model_install_events.sql
Hypertable on recorded_at, 7-day chunks, 3 indexes (quant-time,
failed-events partial, backend-event-time). Columns: event_id CUIDv2
PK + recorded_at, event_type (pull/verify/remove), backend_name,
namespace, model_name, quant, adapter bool, success bool, latency_ms,
sha256, size_bytes, err_message (1 KB cap).
New writer:
internal/tsdb/model_install_writer.go
Synchronous single-row INSERT (not buffered + CopyFrom — CLI is
one-shot, writes are infrequent vs the V0017/V0018/V0019/V0020 retrieval-
path writers that fire per-request). Nil-pool no-op for degraded mode.
errMessageMaxLen=1024 truncation at write time. New modelInstallPool
interface (Exec-shaped) avoids touching the existing CopyFrom-shaped
poolIface used by buffered writers.
Wiring:
internal/cli/model.go gets recordModelEvent(parent, cfg, row) helper:
- Returns immediately if !cfg.TSDBEnabled || cfg.TSDBHost==""
- 2s timeout on connect (TSDB unreachable doesn't block CLI exit)
- Logs warning + degrades gracefully on any TSDB error
Called from runModelPull (success + failure paths), runModelVerify
(single sweep row), runModelRemove (success + failure paths).
Schema version bump:
internal/config/config.go: TSDB_REQUIRED_SCHEMA_VERSION default 20→21.
CI validator at .github/workflows/ci.yml:60-65 counts SQL files in
internal/tsdb/migrations/ and asserts equality; now 21 files = 21
in config = passes.
Build + lint clean. Existing tsdb / cli test suites green; no new tests
added for the writer itself (single INSERT mirrors V0017/V0018/V0019
patterns already covered; integration is operational verification at
Epic 6 once tsdb is up in the dev stack).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(model-dist-001): Epic 7 — local-model-distribution feature doc
Sprint MODEL-DIST-001 Epic 7 — operator-facing feature documentation
following the standard Why / Choices / How / How-to-use shape (memory:
feedback_per_feature_docs_required.md).
Contents:
- Why: gap between brew install and a working local LLM after Phase 13.5
- Choices: backend matrix (Ollama vs HF vs GitHub vs S3 vs file://),
artifact form (fused vs adapter), Apple Silicon scope, "Ollama runtime
rejected (broken on M5+macOS 26.3.x), Ollama distribution only"
- How it works: ASCII flow diagram covering CLI dispatch -> Fetcher
interface -> OllamaFetcher (preflight, ollama pull, manifest discovery,
blob resolve, symlink, SHA verify) -> V0021 observability row
- How to use:
* Quick start (3 commands: brew install ollama, mdemg model pull,
curl /v1/models)
* Explicit quant selection
* Managing pulled models (list / verify / where / remove)
* Forks + enterprise (MDEMG_MODEL_NAMESPACE override)
* Air-gapped (MDEMG_MODEL_MANIFEST_PATH override)
* Resource matrix per quant (disk, min RAM, recommended RAM, BPW)
* Full Configurability Contract table (11 env vars + flags + defaults)
* V0021 observability schema
- Troubleshooting: ollama missing, SHA mismatch, quant allowlist
rejection, RAM auto-detection failure, out-of-disk, symlink permission
- Forward-looking: MODEL-DIST-002 adapter, Sprint B Grafana panels,
future backends, cross-platform
- References: all source-of-truth files cross-linked
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(model-dist-001): Epic 8 — Documentation Update (main repo)
Sprint MODEL-DIST-001 Epic 8 — final epic, never cut (memory:
feedback_sequential_epics.md).
This commit lands the main-repo doc updates. The packaging/homebrew-mdemg/
submodule docs (README, CHANGELOG, formula caveats text) update at
v0.10.0 release-tag time per the v0.9.0 release flow precedent — that's
when goreleaser auto-regenerates mdemg.rb from .goreleaser.yaml's caveats
template, and the tap-side README/CHANGELOG get edited in lockstep.
Changes:
- CHANGELOG.md: comprehensive Unreleased entry documenting Epics 0-5 + 7
landed in this sprint. Epic 3 ollama push and Epic 6 Tier 3 e2e marked
as gated on operator confirmation. Adapter path explicitly deferred to
MODEL-DIST-002 with epic_2_forensic.md cross-reference. Captures the
Configurability Contract enumeration, the 3 quant SHAs, the Fetcher
interface design, the V0021 hypertable, and the explicit out-of-scope
list.
- CLAUDE.md: new "Model Distribution (Sprint MODEL-DIST-001)" subsection
in Architecture Notes, slotted ABOVE the existing Compose embed entry
for visibility. Captures the pluggable-backend design, the Ollama-as-
distribution-only constraint, the on-disk symlink + manifest discovery
flow, the 11-knob Configurability Contract surface, the no-hardcoding
enforcement, the TSDB V0021 hookup, and the Apple Silicon v1 scope.
- README.md: new "Step 2b (optional): Pull the local LLM" section
between Step 2 (Initialize/Start) and Open the Dashboard. 3-command
quick start (brew install ollama -> mdemg model pull -> set
MDEMG_MODEL_PATH). Cross-references the feature doc for the full
Configurability Contract.
- .goreleaser.yaml: caveats template updated to include `mdemg model pull`
instructions. Goreleaser regenerates the homebrew formula's caveats
block from this on the next v* tag push, so v0.10.0 will ship the new
text to brew users automatically.
Deferred to v0.10.0 release-tag time (handled per v0.9.0 precedent):
- packaging/homebrew-mdemg/README.md update
- packaging/homebrew-mdemg/CHANGELOG.md update
- packaging/homebrew-mdemg/mdemg.rb regeneration (automatic via
goreleaser from the .goreleaser.yaml change in this commit)
- Submodule pointer bump in main repo
Deferred to Epic 6 close (after operator does ollama push):
- post.md sprint-close document
- Capture of remote Ollama manifest digests into quant_manifest.json
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(model-dist-001): Epic 3 closeout — Ollama Library push complete
All 3 fused quants now live on Ollama Library:
https://ollama.com/reh3376/mdemg-llm-v1:Q4_K_M
https://ollama.com/reh3376/mdemg-llm-v1:Q5_K_M
https://ollama.com/reh3376/mdemg-llm-v1:Q8_0
End-to-end integrity verified: remote model-layer digests captured via
GET https://registry.ollama.ai/v2/reh3376/mdemg-llm-v1/manifests/<quant>
match the local Epic 1 SHAs exactly:
Q4_K_M 401161710c22f0ae...411d42ea (matches Epic 1)
Q5_K_M 144ad723101d688f...d5f5d54 (matches Epic 1)
Q8_0 fc14dcb40af1bb58...8db6089 (matches Epic 1)
Captured into quant_manifest.json (both docs canonical + internal/cli
embed.FS mirror, byte-synced):
- ollama_manifest_digest per quant (computed from the manifest body):
Q4_K_M sha256:a210cccb12311773fd70bfa81f221ca0f7940a315bef87b84608caf894533b1b
Q5_K_M sha256:ae6e54fe1ee0b487ae41260687ed14c46c30d1ffb0fece936282418b5bcb78e1
Q8_0 sha256:93df4d64bfa751506f7afba8bf08b891ea828575b838adec17b9399ad85be718
- Corrected size_bytes (Epic 1 used approximate values; replaced with
registry-reported exact bytes for each tag):
Q4_K_M 9.0 GB -> 8.4 GB (9001753408 B; was 9658404096)
Q5_K_M 11 GB -> 9.8 GB (10514569568 B; was 11811160064)
Q8_0 16 GB -> 14.6 GB (15698534208 B; was 17179869184)
- Status flipped from "local-create done; push pending" to "published".
Embedded runtime manifest (internal/cli/quant_manifest.json) re-built into
the binary via embed.FS. TestLoadQuantManifest_EmbeddedFallback green
with new values.
Epic 3 of Sprint MODEL-DIST-001 now COMPLETE. Epic 6 (Tier 3 live e2e —
`mdemg model pull` against the published tags + llama-server load on
port 18102 + sanity inference) is now unblocked.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(model-dist-001): sprint close — post.md
Sprint MODEL-DIST-001 close-out per memory rule
(feedback_sprint_plan_format.md §11 — sprint plans live in
docs/development/<sprint-line>/ with the standard post.md companion).
Sections (CLAUDE.md sprint-plan section guidance):
- Outcome: 3 quants live on Ollama Library, mdemg model pull is the
canonical install path
- Process: how the plan held under reality (operator-surfaced no-
hardcoding rule revised the plan in-place to add the Configurability
Contract before code was written)
- Findings: 5 smooth parts + 5 friction items, both honest:
* convert_hf_to_gguf.py python deps gap (silent ModuleNotFoundError)
* mlx_lm.fuse adapter-path requirement
* convert_lora_to_gguf.py missing from brew install llama.cpp
(proximate Epic 2 deferral trigger)
* mdemg tsdb migrate CWD-aware .env loader quirk
* Epic 1 size estimates off vs registry-reported exact bytes
- Current state: per-layer state matrix
- Testing & benchmarking: all 3 tiers documented (Tier 3 e2e captured
V0021 rows for both pull + verify event_types — live-verified)
- Risks & opportunities (forward): MODEL-DIST-002 adapter scope, Sprint
B Grafana, cross-platform, HFFetcher slot, CWD-aware .env loader QoL
- Sprint commits: 9 commits on dev01, mapped to their epics
Closes Sprint MODEL-DIST-001 functionally. Operational sprint close
(v0.10.0 release tag + tap-repo doc updates) is a separate motion.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(release): promote Unreleased -> v0.10.0
Promote the Sprint MODEL-DIST-001 entry from Unreleased to v0.10.0
(2026-05-11) ahead of release.yml / goreleaser tag push. Fresh empty
Unreleased section seeded above.
v0.10.0 ships:
- mdemg model pull|list|verify|remove|where — one-command path from
brew install mdemg to a working local LLM
- Pluggable ModelFetcher interface (Ollama in v1, slots for HF/S3/GHR/file)
- 3 fused GGUF quants live on Ollama Library at reh3376/mdemg-llm-v1
(:Q4_K_M 8.4 GB / :Q5_K_M 9.8 GB / :Q8_0 14.6 GB)
- 11-knob Configurability Contract (every operator-visible value dynamic)
- TSDB V0021 model_install_events hypertable + writer
- docs/features/local-model-distribution.md
Adapter (LoRA-only) path deferred to MODEL-DIST-002 per the sprint plan's
documented contingency (epic_2_forensic.md).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* chore(submodule + docs): bump homebrew-mdemg to v0.10.0 + cli-reference Model Distribution section
Stage 4 + Stage 5 of v0.10.0 release.
Submodule pointer bump:
packaging/homebrew-mdemg 6077097 -> c3aa68b
incorporates:
- 42d7390 — goreleaser auto-bumped mdemg.rb to version "0.10.0" + new
caveats text on v0.10.0 tag push
- c3aa68b — manual docs round-trip: CHANGELOG v0.10.0 entry,
README Optional Pull-the-local-LLM section in Quick Start (full
Ollama Library doc with quant matrix, list/verify/where/remove
subcommands, fork variants via MDEMG_MODEL_NAMESPACE, architecture
note "Ollama is distribution-only"), Upgrading to v0.10.0 +
What's New in v0.10.0 blocks, default-LLM rotation history extended,
mdemg_beta_testing.md version pin v0.9.0 -> v0.10.0
docs/user/cli-reference.md (per Stage 5 user request to align refs
with current codebase):
- New ## Model Distribution top-level section before ## Synergy
Optimization (model command group is GroupID="config" in root.go
but a top-level cli-ref section is cleaner for discoverability).
Documents all 5 subcommands (pull, list, verify, remove, where) with
flag tables, usage examples, the full Configurability Contract (11
knobs), the architecture note (Ollama is distribution-only).
- Updated Environment Variable Reference with new "Model Distribution
(Sprint MODEL-DIST-001, v0.10.0)" subsection — 11 env vars +
defaults table.
- Updated Command Tree Summary with the new model subcommand group
slotted between Configuration and Advanced.
docs/user/api-reference.md unchanged: Sprint MODEL-DIST-001 added zero
HTTP endpoints (CLI-only sprint; observability via TSDB V0021 row
writer is server-side internal). Audit also surfaced ~25 routes of
pre-existing drift between code and docs (mostly path-parameter
notation: `/v1/backup/` in code vs `/v1/backup/{id}` in docs — same
routes — plus 3 undocumented /api/graph/* endpoints and 2
undocumented /v1/admin/features/{restart,stop} actions). That drift
is out-of-scope for v0.10.0 and belongs in its own follow-up sprint.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(cli): add mdemg model run wrapper (follow-up #1 to MODEL-DIST-001)
One-shot or interactive REPL chat against the configured LLM endpoint
(default: llama-server at port 8102 per Phase 13.5). Closes the gap
operators noted between `ollama run` and the mdemg framework.
Two modes:
- One-shot: `mdemg model run -p "hello"` or positional arg after `--`
- Interactive REPL: no prompt; reads stdin line-by-line, accumulates
conversation history across turns
Pure stdlib HTTP (no llmclient retries/breakers/recording). CLI
invocations are intentionally NOT recorded to llm_interactions — this
is an ad-hoc exploration tool, not a production code path; keeping the
training-data corpus clean.
Every operator-visible value is dynamic per the no-hardcoding rule:
--endpoint override cfg.EffectiveLLMEndpoint
--model override cfg.LLMModel (final fallback: mdemg-llm-v1)
--prompt/-p one-shot prompt (omit for REPL)
--system/-s system message
--temperature (default 0.7)
--max-tokens (default 1024)
--timeout (default 60s)
Live-verified end-to-end on the operator's running llama-server on
port 8102 with mdemg-llm-v1: one-shot worked; system+prompt with
--model override worked.
13 unit tests in model_run_test.go covering: message composition
(system first, no-system skip, history preservation), config
resolution (flag > cfg > final fallback), OpenAI-compat HTTP shape,
error paths (HTTP error, inline error object, no choices, timeout),
trailing-slash endpoint normalization, body-bounding helper. All green.
Renamed local body-bounding helper to `truncateRunBody` to avoid name
collision with a same-named helper in internal/cli/data.go.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(api): document 19 previously-undocumented endpoints (follow-up #2)
Audit of internal/api/server.go (167 routes) vs docs/user/api-reference.md
surfaced 19 genuinely missing endpoints. v0.10.0 commit noted this as
out-of-scope; this commit resolves the gap.
Audit method: extract mux.HandleFunc registrations from server.go, extract
documented "VERB /path" headings from api-reference.md, normalize both to
strip path parameters and trailing prefix slashes, diff. Of the initial
24-entry code-only set, 5 are false positives (combined headers like
"POST /v1/admin/features/start|stop|restart" cover the individual verbs;
"GET|POST /v1/jiminy/protocol/metrics" covers both methods on one route).
Added sections:
Jiminy / J17 (10 endpoints, all under "## Jiminy Inner-Voice"):
GET|POST /v1/jiminy/protocol/metrics # snapshot + reset
GET /v1/jiminy/protocol/status # per-session J17 state
POST /v1/jiminy/checkpoint # tier-transition checkpoint
POST /v1/jiminy/resume-protocol # restore from checkpoint
POST /v1/jiminy/extension # operator-driven tier hold
POST /v1/jiminy/strict # toggle strict mode per session
POST /v1/jiminy/reformulate # advisory -> imperative rewrite
POST /v1/jiminy/classify # pre-Write/Edit pass/deny gate
GET /v1/jiminy/latest # most recent guidance (warm store)
POST /v1/jiminy/warm # eager cache warmup
Memory / Graph (3 endpoints, under "## Memory Operations"):
GET /v1/memory/graph/topology # node/edge counts per layer
GET /v1/memory/graph/neighborhood # local 1-3 hop walk
GET /v1/memory/spaces # root listing of all spaces
Observability (2 endpoints, under "## Metrics & Monitoring"):
GET /v1/metrics/trends # TSDB time-series query
GET /v1/prometheus # Prometheus scrape endpoint
Dashboard / Viz (4 endpoints, new "## Dashboard / Visualization (internal)"
section before MCP Server Tools — operator-internal endpoints backing the
browser dashboard at /ui/):
GET /api/graph/data # force-directed graph data
GET /api/graph/fields # schema field catalog
GET /api/graph/health # explorer health
GET /viz/topology # standalone HTML topology view
Each entry has handler-signature-derived request/response shape, query
parameter table, sample curl/JSON examples following the existing
api-reference convention. TOC updated with new "Dashboard / Visualization
(internal)" entry and renumbered tail.
Out of scope (deliberate, deferred):
- 28 "docs-only" entries from the audit are confirmed false positives
from prefix-matching path normalization (code registers /v1/memory/nodes/
with trailing slash and routes the suffix; docs spell out the full
/v1/memory/nodes/{node_id}/archive form correctly)
- /v1/symbols root path is partially covered by /v1/symbols/relationships
+ /v1/symbols/{id}/relationships in docs; root listing endpoint
documentation can land later if/when its handler grows specific shape
- /v1/conversation/observations covered indirectly by the flag-for-org
endpoint documentation
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(grafana-audit): Epic 0 — sprint plan + audit harness
Sprint GRAFANA-AUDIT-001 Epic 0. Builds the per-panel audit harness:
walks every panel in deploy/docker/grafana/dashboards/*.json, extracts
rawSql/sql targets, substitutes Grafana macros (\$__timeFilter,
\$__timeFrom/To, \$__interval, \$__unixEpoch*) + template variables
(\$space_id, \$instance + multi-value variants like \${space_id:raw}),
executes via docker exec mdemg-timescaledb-1 psql, classifies each
panel target as PASS / EMPTY / FAIL / SKIP.
Tier 1 unit tests (17 tests, all green):
- Template-variable substitution: time_filter / from-to / unix epoch /
interval / interval_ms / space_id (3 syntaxes) / instance (3
syntaxes) / multi-macro composite query
- Table extraction (FROM/JOIN with alias, case-insensitive, no-table)
- Panel walking (flat, nested rows, targets-with-sql vs no-sql)
Smoke test against mdemg-overview.json IMMEDIATELY validated the
operator's "diminished observability" report — 5 of 13 panels FAIL,
1 EMPTY, 7 PASS on the front-page dashboard:
FAIL Request Rate
FAIL Error Rate
FAIL Circuit Breakers
FAIL Requests by Status
FAIL Rate Limit Rejections
EMPTY Request Latency Distribution (t0; t1/t2 PASS)
The original 11-panel sample missed these because it sampled different
panels. Lesson: trust the rigorous audit, not the sample. Sprint
proceeds to Epic 1 (full audit across all 146 panels) immediately.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(grafana-audit): Epic 1 + 2 — full audit + findings
Sprint GRAFANA-AUDIT-001 Epics 1 + 2. Per-panel rigorous audit of all
165 target executions across 146 panels in 8 dashboards.
Headline:
PASS 125 (76%) — executes, returns rows in 24h window
EMPTY 19 (12%) — executes, 0 rows
FAIL 3 (2%) — SQL error
SKIP 18 (11%) — non-SQL panel types
Harness fix mid-Epic-1: \$__interval substitution was wrapping the
value in quotes, but Grafana convention has panel SQL provide its own
outer quotes — producing doubled quotes and 18 false-positive FAILs.
Fixed: substitute bare value. Verified by re-run: 20→3 FAILs.
Real failures (Epic 2 findings):
(a) 3 SQL bugs on mdemg-llm-routing.json — all three panels hardcoded
`mdemg-dev` (unquoted) in WHERE clauses instead of '\$space_id'
template variable. PG parses `mdemg-dev` as subtraction.
(b) 5 schema-drift EMPTYs — panel filter expects metric_type or labels
shape that doesn't match server emission:
- mdemg_j17_events_total: panel 'counter', server 'gauge'
- mdemg_rsic_action_total: panel status='success', server status='completed'
- 2 more suspected pending full-SQL inspection.
(c) 2 missing-server-side metrics — mdemg_rate_limit_rejected_total
and mdemg_http_request_duration_seconds_p50 not emitted. Will be
documented; server emission is follow-up.
(d) ~11 sparse-data EMPTYs — panel SQL correct, no rows in 24h window.
Widening time-range in Epic 4.
Projected post-Epic-3/4: 133 PASS, ≤11 EMPTY, 0 FAIL.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(grafana): Epic 3 — 5 panels recovered (3 FAIL + 2 schema-drift)
Sprint GRAFANA-AUDIT-001 Epic 3. Minimum-change JSON edits to fix
category (a) SQL bugs and category (b) schema-drift EMPTYs identified
in Epic 1/2.
mdemg-llm-routing.json (3 panels, all category-a SQL bugs):
- LLM call distribution by model_name (24h)
- LLM latency p50 / p95 / p99 by task × model
- LLM error rate % by task_name (selected range)
Bug: WHERE clause was `(\$space_id = '' OR space_id = '\$space_id')` —
the first \$space_id was unquoted, so PG parsed `mdemg-dev = ''` as
`column "mdemg-dev"` which doesn't exist. Also breached the
no-hardcoding rule (memory: feedback_no_hardcoded_values.md).
Fix: wrap the first variable reference in quotes → `('\$space_id' =
'' OR space_id = '\$space_id')` — a proper string-literal comparison
that also serves as the All-spaces guard the panel author intended.
Verdict: 3 FAIL -> 3 PASS. mdemg-llm-routing is now 4/4 PASS.
mdemg-j17.json :: Total Events (1 panel, category-b drift):
Panel filtered `metric_type = 'counter'` (Prometheus naming
convention because metric is `mdemg_j17_events_total`). Server
actually emits `metric_type = 'gauge'`. 6,393 rows in 7d; 0 panel
matches. Fix: align panel filter to `'gauge'`.
Verdict: EMPTY -> PASS.
mdemg-rsic.json :: Action Success Rate t0 (1 panel target, category-b
drift):
Panel filtered `labels->>'status' = 'success'`. Server actually
emits `'completed'` (181 rows in 24h; 0 panel matches). Fix: align
panel filter to `'completed'`. The t1 'failed' target retained
unchanged — its EMPTY result is now accurate observation (server
emits no `'failed'` actions; 0 = legitimate zero).
Verdict: 1/2 EMPTY -> PASS, 1/2 EMPTY accurate-zero.
Audit verdict counts:
Before: 125 PASS, 19 EMPTY, 3 FAIL, 18 SKIP
After: 130 PASS, 17 EMPTY, 0 FAIL, 18 SKIP
Remaining 17 EMPTYs (Epic 4 disposition):
- 5 category-c emission regression — 4 rsic metrics stopped at
2026-05-07/08 (server-side investigation queued as follow-up)
- 2 category-c never-emitted — Rate Limit Rejections, p50 latency
- 8 category-d sparse-data on ft-training — widen time-range
- 1 mdemg-jiminy :: Effectiveness Trends — CTE pending inspection
- 1 mdemg-rsic :: Action Success Rate t1 (accurate-zero)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(grafana-audit): Epic 4 + 7 — feature doc + sprint close
Sprint GRAFANA-AUDIT-001 closeout (Epics 4 + 5 + 6 + 7 combined as a
single doc-only commit; Epic 5 deferred and Epic 6 deferred-to-operator
as documented in post.md).
New: docs/features/observability-dashboards.md (286 lines) — full
operator-facing inventory of the 8 dashboards with:
- Per-dashboard purpose + panel count + primary tables
- Audit verdict table (130/17/0/18 post-Epic-3)
- Epic 3 fix log: 3 SQL bugs + 2 schema-drift filters
- Known gaps in 3 buckets: (c) emission regression (4 May-7-8 metrics,
current codebase has zero refs — server removed emission), (c)
never-emitted (mdemg_rate_limit_rejected_total +
mdemg_http_request_duration_seconds_p50), (d) sparse/zero data on
this dev TSDB (ft-training tables)
- Refresh expectations per table
- Operator playbook for re-running scripts/grafana_panel_audit.py
- Forward-looking: CI integration, coverage expansion, server-side
emission restore
New: docs/development/grafana-audit-001/post.md — sprint close per
memory rule, covers process / smooth-parts / friction / sprint-plan
vs reality / current state / risks-opportunities / commits.
Epic deferrals (documented in post.md):
- Epic 5 (coverage expansion for 11 unused TSDB tables): deferred
because most target tables are zero on this dev TSDB. Adding panels
would create more EMPTYs, defeating the goal.
- Epic 6 (Tier 3 browser e2e): deferred to operator; not blocking.
CHANGELOG Unreleased entry covers the sprint at high level + cross-
references the feature doc.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(model-dist-002): Epic 0 — sprint plan + workspace prep
Sprint MODEL-DIST-002 picks up the adapter-only path deferred from
MODEL-DIST-001 Epic 2. Resolves the tooling gap documented in
epic_2_forensic.md.
Workspace prep:
- Vendored convert_lora_to_gguf.py from llama.cpp source (master, pinned
2026-05-21) into scripts/vendor/llama_cpp/ with MIT LICENSE attribution
and a README documenting refresh policy. brew install llama.cpp ships
convert_hf_to_gguf.py but NOT convert_lora_to_gguf.py; vendoring is the
cleanest path (vs requiring operators to clone llama.cpp source).
- pip install peft==0.19.1 + accelerate==1.13.0 + psutil==7.2.2 into
neural/.venv (the same venv that has torch + transformers + gguf from
MODEL-DIST-001 Epic 1). PEFT is needed for PEFT-format schema validation
+ as a dependency of convert_lora_to_gguf.py.
- Inspected convert_lora_to_gguf.py — expects directory with
adapter_config.json + adapter_model.safetensors in PEFT layout. Confirms
the MLX → PEFT direction is `lora_A: (rank, input)` and
`lora_B: (output, rank)` (script line 41-42 docstring).
Sprint plan in 12-section v1.0 format. 7 epics, 1-2 dev-day estimate.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(model-dist-002): Epic 1-3 — MLX adapter → PEFT → GGUF LoRA + live verify
Sprint MODEL-DIST-002 Epics 1, 2, 3 (combined commit).
Epic 1 — MLX → PEFT converter (scripts/mlx_adapter_to_peft.py + 14 unit tests):
Reads adapters/tier1/adapters.safetensors (514 MB MLX format, 560 tensors,
Phase 5 SFT Iter 2400 best). Per the analysis in MODEL-DIST-001
epic_2_forensic.md:
Key rename: model.layers.<N>.<module>.lora_a
-> base_model.model.model.layers.<N>.<module>.lora_A.weight
Tensor transpose: lora_a (input,rank) -> (rank,input)
lora_b (rank,output) -> (output,rank)
Emits PEFT-format adapter_config.json + adapter_model.safetensors.
Single-adapter PEFT layout (.lora_A.weight, not .lora_A.default.weight)
required by convert_lora_to_gguf.py.
Epic 2 — PEFT → GGUF LoRA (scripts/vendor/llama_cpp/convert_lora_to_gguf.py):
Pinned to llama.cpp release b9000 (self-contained version; upstream master
refactored to a conversion/ Python package with 30+ model files, excessive
vendoring scope). README documents refresh policy.
Output: .local-models/mdemg-llm-v1-adapter.gguf
SHA256: 0cfaf4bae3215a4aea664a8d28ae9a41d73ee740cbcce5c2eef950232cfe1de5
Size: 257 MB (vs ~9 GB fused Q5_K_M; ~35x smaller download)
Tensor count: 560 (matches expected 40 layers x 7 target_modules x 2)
Epic 3 — Live verification (docs/development/model-dist-002/verification.md):
Side-port llama-server on 127.0.0.1:18103 with f16 base + adapter; sanity
prompt vs production 8102 fused model returns semantically-aligned outputs
on the same prompt — both describe MDEMG as a knowledge-graph memory
system. Confirms the MLX-PEFT-GGUF chain is structurally correct.
Iteration during Epic 2 (worth noting):
- Initial vendored convert_lora_to_gguf.py from upstream master failed
with ImportError (refactored to use conversion/ package). Pinned to
b9000 release which is self-contained.
- Initial PEFT keys used .default.weight suffix (multi-adapter layout);
convert_lora_to_gguf.py rejected with \"Not a lora_A or lora_B tensor.\"
Switched to single-adapter layout (.weight) which the script accepts.
Test results: 14/14 Tier 1 tests green; PEFT output loads via
peft.PeftConfig.from_pretrained; GGUF emission completes with all 560
tensors; runtime adapter application produces coherent outputs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(model-dist-002): Epic 4 local — Modelfile.adapter + ollama create
Authored packaging/ollama/Modelfile.adapter:
FROM qwen3:14b
ADAPTER ../../.local-models/mdemg-llm-v1-adapter.gguf
PARAMETER num_ctx 32768 num_predict 4096 stop "<|im_end|>" stop "<|im_start|>"
SYSTEM (Qwen3-14B mdemg fine-tune positioning)
LICENSE Apache 2.0 (inherits from base)
Local ollama create succeeded:
reh3376/mdemg-llm-v1-adapter:latest
Local ID dda290492091
Layers: qwen3:14b base (a8cc1361...) + adapter blob (0cfaf4ba...)
+ template + license + params + system
quant_manifest.json adapter block updated:
status: "deferred to MODEL-DIST-002" -> "local-create done; push pending"
sha256, size_bytes, ollama_local_id captured
pipeline field added (MLX -> PEFT -> GGUF LoRA chain)
Push is operator-gated per MODEL-DIST-001 pattern (one-way action). After
push, ollama_manifest_digest will be captured and embedded quant_manifest.json
will be updated alongside.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(cli): enable mdemg model pull --adapter (MODEL-DIST-002 Epic 5+6)
Lifts the ErrAdapterDeferred guard from MODEL-DIST-001's deferred adapter
path now that reh3376/mdemg-llm-v1-adapter:latest is published.
CLI changes:
- model_fetcher_ollama.go: removed deferral guard from Fetch; switched
readModelBlobDigest to target application/vnd.ollama.image.adapter
mediaType for adapter pulls; added destFilename() helper so adapter
symlinks land at <name>-adapter.gguf (no quant suffix).
- model.go: SHA verify in runModelPull now branches on req.Adapter to
look up mf.Adapter when pulling the adapter form; tag printout shows
<ns>/<name>-adapter:latest for adapter pulls instead of the resolved
fused quant.
- model_fetcher.go: ErrAdapterDeferred sentinel retained for future
non-Ollama backends that ship fused-only first; not currently returned.
QuantManifest gained Adapter *QuantRecord field.
Manifest updates (both embedded + canonical):
- adapter SHA256 0cfaf4bae3215a4aea664a8d28ae9a41d73ee740cbcce5c2eef950232cfe1de5
- Ollama manifest digest sha256:57b98b97ede0e340e8c530aabf579136616ba670281fe04b14777164e655c278
- ollama_media_type application/vnd.ollama.image.adapter
Tests:
- Removed TestOllamaFetcher_AdapterDeferred.
- Added TestDestFilename_FusedQuantAndAdapter (6 cases).
- Added TestOllamaFetcher_ReadAdapterBlobDigest_FiltersOnAdapterMediaType.
Tier 3 live e2e: mdemg model pull --adapter completed in 987 ms, SHA
verify ok, symlink at ~/.mdemg/models/mdemg-llm-v1-adapter.gguf, and
llama-server --lora produced coherent inference against the symlinked
adapter ("MDEMG is a knowledge graph memory system...").
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(model-dist-002): flip adapter section to shipped + sprint close
Epic 7 (Documentation Update — never cut).
- docs/features/local-model-distribution.md: adapter section flipped from
"deferred to MODEL-DIST-002" to "shipped 2026-05-25"; status header
updated; Configurability Contract table adds --adapter flag row.
- CHANGELOG.md: Unreleased gains "Sprint MODEL-DIST-002 — Adapter-only
distribution path shipped" entry with full pipeline + verification +
SHA + Ollama manifest digest.
- CLAUDE.md Model Distribution architecture note: replaces "adapter-only
deferred to MODEL-DIST-002+" with the operator-facing recipe and the
pinned-toolchain pointer.
- docs/development/model-dist-002/post.md: sprint close with epic-by-epic
outcomes, acceptance criteria check-off, surprise log, and forward-
looking notes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(eventgraph-001): sprint plan (Pattern Y1 TSDB-federation)
Sprint EVENTGRAPH-001 — Reinforcement-Event TSDB Hypertable + Graph
Federation. First implementation of Pattern Y1 from the TypeDB-inspired
topology discussion: federate events into TSDB rather than reify them in
the Neo4j graph, preserve graph traversal via a Go orchestration layer.
12-section v1.0 format; 8 sequential epics; ~1.5-2 dev-days; $0 LLM;
low-medium risk (touches the Hebbian hot write path so the new writer
must be fully non-blocking + the Cypher RETURN-shape change must be
backwards-compatible at the Go call site).
Targets ApplyCoactivation only for v1. Other Hebbian entry points
(ApplySymbolCoactivation, CoactivateSession, ApplyNegativeFeedback)
deferred to EVENTGRAPH-003 once the pattern proves out under
production traffic. Pattern Y2 (link-node promotion in Neo4j)
explicitly deferred until a query proves federation-in-Go insufficient.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(tsdb): V0022 reinforcement_events hypertable (EVENTGRAPH-001 Epic 1)
One row per Hebbian co-activation pair update. Captures prev/new weight
(plus signed delta), evidence_count_after, eta_effective, surprise_factor,
activation_product, path_sim, role/obs_type of both endpoints, session_id,
direction (forward/reverse/bidirectional), and a created_new_edge flag
that distinguishes "new connection formed" from "existing connection
strengthened" at analysis time. trigger_path column will distinguish
ApplyCoactivation from EVENTGRAPH-003's other Hebbian entry points.
7-day chunks (same as V0017-V0021). 4 indexes: per-space time-series,
src+time, dst+time, partial index on (space_id, session_id, time) where
session_id is set. Federation API (Epic 5) needs src + dst lookups for
the graph-neighborhood join.
Buffered + flushed via CopyFrom on TSDB_FLUSH_INTERVAL_SEC cadence
(default 30s). Pattern matches V0019 (sparse_gate_metrics) buffered
writer, NOT V0021 (model_install_events) sync writer — Hebbian writes
are per-retrieve, far higher volume than CLI-driven model install
events.
Config: TSDB_REQUIRED_SCHEMA_VERSION default bumped 21 -> 22.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(tsdb): buffered reinforcement_events writer (EVENTGRAPH-001 Epic 2)
internal/tsdb/reinforcement_writer.go — buffered CopyFrom writer mirroring
the V0019 SparseGateMetricsWriter pattern. 30s auto-flush ticker, Close()
drains buffer + flushes final batch, idempotent across multiple Close
calls. FIFO eviction on buffer-full matches the LLMInteractionWriter
precedent; eviction counted in droppedRows for Epic 6 Prometheus
surfacing.
ReinforcementEventRow serializes optional float / string fields via
nullableFloat / nullableString helpers — zero-valued inputs land as DB
NULL rather than 0 / '', so analytic queries can distinguish "no data"
from "actually zero." Required fields (prev/new/delta weight,
evidence_count_after, created_new_edge, trigger_path) are never
nullable.
Tier 1 unit tests (9 green):
- Record + Flush writes all rows with correct table + column shape.
- Empty buffer Flush is a no-op (no CopyFrom call).
- Buffer-full evicts oldest, increments droppedRows counter.
- Unlimited buffer (maxBufferSize=0) never drops.
- Nullable serialization: zero-valued optionals → DB NULL.
- Flush error increments FailureCount; SuccessCount/TotalRows unchanged.
- Close drains buffer (final flush triggered).
- Close is idempotent (Close × 2 does not double-flush).
- Auto-flush ticker fires within deadline.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* refactor(learning): expose per-pair telemetry from Hebbian Cypher (EVENTGRAPH-001 Epic 3)
ApplyCoactivation Cypher RETURN clause extended from "count(*) AS updated"
to 17 per-pair columns: src/dst node IDs, prev/new/delta weight,
evidence_count_after, eta_effective (cfg.LearningEta × etaMult),
surprise_factor, activation_product, path_sim, role_a/b, obs_type_a/b,
session_id, direction (forward/reverse/bidirectional), created_new_edge.
created_new_edge derived from (r.evidence_count = 1) — the ON CREATE
branch sets evidence_count to 1; ON MATCH increments. Reliable proxy
for "new connection formed" vs "existing connection strengthened" at
analysis time.
Plan-deviation disclosure (per feedback_plan_options_pattern.md): the
plan called for 2 rows per pair in asymmetric mode (forward + reverse).
The Cypher mirrors rr.weight = r.weight at all times — forward and
reverse edges carry identical weights. Emitting 2 rows would double-
count without adding signal. Final choice: 1 row per logical pair
regardless of mode, with the direction column carrying the
forward/reverse/bidirectional distinction. Revisit if EVENTGRAPH-003
introduces a Hebbian path where forward/reverse weights diverge.
New helper internal/learning/reinforcement_parser.go translates a
neo4j.Record (or any (key) → (any, bool) getter) into a
tsdb.ReinforcementEventRow. Lives in its own file so service.go
doesn't grow. Defensive against missing keys (zero values), nil values
(zero/empty), wrong types (fallback to zero) — no panics.
Tier 1 unit tests (6 green) cover:
- Symmetric bidirectional + ON CREATE branch
- Asymmetric forward + ON MATCH branch (evidence > 1)
- Missing optional fields → zero values (nullable* writer helpers
serialize as DB NULL)
- Neo4j int64 → Go int coercion
- nil values → zero/empty
- Wrong-typed values → graceful fallback
Reinforcement rows are captured locally in ApplyCoactivation but not
yet forwarded to TSDB — Epic 4 wires the writer.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(learning): record reinforcement events to TSDB (EVENTGRAPH-001 Epic 4)
learning.Service grows a reinforcementWriter field + SetReinforcementWriter
setter (mirrors the SetStabilityReinforcer back-compat pattern). After
ExecuteWrite returns from ApplyCoactivation, each captured per-pair row
gets the spaceID stamped on it and is enqueued via writer.Record. The
writer is non-blocking; the Hebbian hot path never waits on TSDB.
Configurability Contract — 7 new env vars (no-hardcoding rule):
- EVENTGRAPH_ENABLED (bool, default true)
- EVENTGRAPH_WRITER_FLUSH_INTERVAL_SEC (int, default 30, floor 5)
- EVENTGRAPH_WRITER_BUFFER_SIZE (int, default 1000, 0 = unlimited)
- EVENTGRAPH_MAX_PAIRS_PER_EVENT_BATCH (int, default 200)
- EVENTGRAPH_MAX_EVENTS_PER_QUERY (int, default 500, Epic 5 ceiling)
- EVENTGRAPH_FEDERATION_DEFAULT_HOPS (int, default 2)
- EVENTGRAPH_FEDERATION_DEFAULT_LOOKBACK_HOURS (int, default 24)
api/server.go wires the writer's lifecycle:
- Constructed after TSDB client is ready, gated by cfg.EventGraphEnabled
so EVENTGRAPH_ENABLED=false cleanly skips construction; learner's
reinforcementWriter stays nil and the Hebbian path short-circuits.
- Closed alongside the other TSDB writers in graceful-shutdown — buffer
drains before the process exits.
Tier 2 integration tests (against real TSDB, build tag integration):
- TestEventGraph_Writer_RoundTrip: 3 rows recorded → flush-window
elapses → SELECT count(*) returns 3.
- TestEventGraph_Writer_DrainOnClose: 5 rows recorded with 1-hour flush
interval → Close() drains → SELECT returns 5 (verifies the server
shutdown invariant).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eventgraph): federation query helper + API endpoint (EVENTGRAPH-001 Epic 5)
internal/eventgraph/query.go — Pattern Y1 federation helper.
EventsInGraphNeighborhood orchestrates a two-step query:
1. Cypher graph walk from a seed node — variable-length path over
CO_ACTIVATED_WITH | GENERALIZES at depth 0..Hops. Returns the
N-hop neighborhood (DISTINCT node_ids, includes the seed).
2. TSDB query against reinforcement_events for events where src OR
dst is in the neighborhood, within the lookback window, ordered
newest-first, capped at the configured limit.
3. Go-side join — annotates events with SrcInNeighborhood /
DstInNeighborhood so the consumer can distinguish "both endpoints
in the subgraph" from "one endpoint outside the seed's N-hop
reach but the event still touches our subgraph."
Empty neighborhood (no seed match, hops=0) short-circuits before the
TSDB call. Sub-1-second Since values clamp to 1s. Hops < 0 is rejected
upfront. The handler enforces an additional ceiling of 2 ×
EVENTGRAPH_FEDERATION_DEFAULT_HOPS for runaway-walk protection.
internal/api/eventgraph_handler.go — POST /v1/eventgraph/reinforcement-
neighborhood. Same auth convention as /v1/admin/breakers. 503 when
EVENTGRAPH_ENABLED=false or when eventgraphService is nil (TSDB-down at
boot). 400 on missing space_id / seed_node_id / negative hops / hops >
ceiling. Defaults applied from config when fields omitted from request.
Plan-decision disclosure (per feedback_plan_options_pattern.md): plan
proposed Option A (single endpoint with event_type query param) vs
Option B (endpoint per event class). Final choice: A. v1 has one event
class (reinforcement); the endpoint URL is explicit about that.
EVENTGRAPH-002 can either add a query param or split the URL when a
second event class arrives — no breaking change either way.
Tests:
- Tier 1 (internal/eventgraph/query_test.go, 7 green): request
validation rejects empty space_id, empty seed, negative hops; interval
formatting roundtrips; join annotation handles both-inside,
one-outside, and empty-neighborhood cases.
- Tier 1 (internal/api/eventgraph_handler_test.go, 4 green + 2 skipped):
method-not-allowed, feature-disabled 503, nil-service 503, invalid-
JSON short-circuit. Two validation paths skipped — they require a
non-nil eventgraphService which can't be constructed without a real
driver; Tier 2 exercises them.
- Tier 2 (tests/integration/eventgraph_federation_test.go, 1 green):
builds seed--mid--leaf graph + off-node, emits 3 reinforcement
events touching all four nodes, calls federation at hops=0 and
hops=1, asserts neighborhood + in-neighborhood flags. The hops=0
test confirms that mid↔leaf (touching neither seed nor any 0-hop
neighbor) is correctly excluded.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(observability): Grafana panel + Prometheus counters for reinforcement events (EVENTGRAPH-001 Epic 6)
Three new Prometheus counters mirror the V0022 writer's internal atomic
counters:
- mdemg_eventgraph_writer_rows_enqueued_total — rows successfully CopyFrom'd
- mdemg_eventgraph_writer_rows_dropped_total — rows FIFO-evicted (buffer full)
- mdemg_eventgraph_writer_flush_failure_total — flush errors
Wiring: the writer accepts a narrow PrometheusCounter interface
(Add(int64)) so internal/tsdb does not import internal/metrics (which
would cycle). api/server.go calls SetPrometheusCounters after the
writer is constructed, passing the three counters from the global
StandardMetrics struct. Nil-safe.
Dashboard: mdemg-graph-topology.json gains a new collapsed row
"Reinforcement Events (EVENTGRAPH-001)" with a single time-series
panel "Reinforcement Event Rate (events/min)" showing all three rates
(enqueued / dropped / flush failures) over the last 24h. Dropped is
colored orange, flush failures red, enqueued the default palette. Tied
to the prometheus datasource.
The existing GRAFANA-AUDIT-001 harness (scripts/grafana_panel_audit.py)
only evaluates SQL-target panels — the new panel uses Prometheus
queries, so it lands on the SKIP pile, same as the other 8 Cypher /
Prometheus panels on this dashboard. Audit JSON refreshed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(eventgraph-001): restore full GRAFANA-AUDIT-001 audit_results.json
Epic 6's targeted audit run (scripts/grafana_panel_audit.py --dashboard
mdemg-graph-topology.json) overwrote the full multi-dashboard audit
results from GRAFANA-AUDIT-001 with the single-dashboard subset (9
SKIPs only). Restoring the full snapshot from commit 0a1e8e1 — that
audit covered all 8 dashboards and is the canonical baseline the
GRAFANA-AUDIT-001 post.md references. EVENTGRAPH-001 did not need to
regenerate it; the new panel uses Prometheus queries, which the audit
harness SKIPs regardless of dashboard.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(retrieval): set Activation on RRF RetrieveResult (EVENTGRAPH-001 fix-commit)
ScoreAndRankRRF's ConsensusResult → RetrieveResult conversion was
silently dropping the Activation field. The legacy ScoreAndRank path at
scoring.go:883 sets Activation: a (where a := act[c.NodeID] is the
spreading-activation map value). The RRF path constructed
models.RetrieveResult{...} with no Activation key, leaving the field
zero-valued.
Net effect: since Phase 13.1 default-on (2026-05-03),
learning.Service.ApplyCoactivation has filtered out every L0 candidate
on the retrieve hot path. The filter is r.Activation >=
LearningMinActivation (default 0.20). With Activation=0, no pair makes
it to the Hebbian Cypher; the function returns nil without writing.
Hebbian learning has been silently no-op on the production retrieve
goroutine for ~24 days. CO_ACTIVATED_WITH edges still exist in the
graph — sidecar paths (CoactivateSession, ApplySymbolCoactivation,
consolidation walks) and pre-Phase-13.1 retrieves wrote them — but the
retrieve-time goroutine has been a silent no-op.
Discovered during EVENTGRAPH-001 Epic 7 live e2e. Three retrieves
produced 0 rows in reinforcement_events. Investigation traced the gap
to the missing Activation field.
Fix: one-line addition in scoring_rrf.go — Activation: act[c.NodeID].
Brings the RRF path to parity with the legacy scorer.
Post-fix verification: rebuilt, restarted server, re-issued 3 retrieves
→ 10 reinforcement events landed in TSDB. Federation API at hops=1
correctly returned all 10 with src_in_neighborhood=true,
dst_in_neighborhood=true. Documented in
docs/development/eventgraph-001/verification.md.
Per CLAUDE.md "Testing — Live System Testing Is Required":
"surprise bugs caught during live smoke get their own follow-up
fix-commit — do not silently roll them into the sprint commit." This
is the precedent-aligned separate commit.
Forward-only: existing graph state is preserved; new retrieves now
correctly emit Hebbian updates. EVENTGRAPH-002 may revisit whether to
backfill the missing 24-day window.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(eventgraph-001): Tier 3 live e2e verification transcript (Epic 7)
Real /v1/memory/retrieve × 3 against mdemg-dev → 10 reinforcement events
landed in TSDB within the flush window. Federation API at hops=1 from a
seed node returned 5-node neighborhood + 10 in-neighborhood events.
Documents the surprise-bug discovery + fix that preceded this transcript
(see fix-commit for scoring_rrf.go::ScoreAndRankRRF Activation
propagation).
Acceptance criteria from sprint plan §"Acceptance Criteria" all PASS.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(eventgraph-001): feature doc + CHANGELOG + CLAUDE.md + sprint close (Epic 8)
Final epic — Documentation Update (never cut, per feedback_per_feature_docs_required.md
and the standardized v1.0 sprint plan format).
New: docs/features/event-graph-federation.md (~240 lines, Why / Choices /
How it works / How to use / Forward-looking). Documents:
- Pattern Y1 vs Y2 trade-off (why federation-in-Go now, link-node
reification deferred until a query forces it)
- Why V0019 buffered-CopyFrom over V0021 sync-INSERT (per-retrieve volume)
- Why ApplyCoactivation first (other 3 Hebbian entry points deferred to
EVENTGRAPH-003)
- Why forward-only (no source to backfill from)
- Federation pipeline (Cypher walk → TSDB query → Go-side join with
src/dst_in_neighborhood annotation)
- TSDB schema, API request/response shape, 7 env vars + defaults
- Observability (3 Prometheus counters + Grafana panel)
- Forward-looking sprints
New: docs/development/eventgraph-001/post.md — epic-by-epic outcomes,
acceptance criteria check-off, surprise log (RRF Activation drop +
audit-JSON overwrite + orphan-process port collision), plan deviations
disclosed (1-row-per-pair regardless of asymmetric mode; single-
endpoint over endpoint-per-class), forward-looking.
CHANGELOG.md Unreleased gains the EVENTGRAPH-001 entry — 11 bullet
points covering V0022 migration, buffered writer, Cypher RETURN-shape
change, Configurability Contract, federation helper + API, Prometheus
+ Grafana, Tier 2 + Tier 3 verification, the surprise-bug RRF
Activation fix-commit, and the audit-JSON restore.
CLAUDE.md Architecture Notes gains a new "Event Graph Federation" entry
above the Model Distribution section. Documents the pattern, surface,
deferrals, and the load-bearing fix-commit f307f55 that surfaced 24
days of silent Hebbian no-op on the retrieve hot path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Roger Henley <rogerhenley345@gmail.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Roger Edward Henley II <137457424+reh3376@users.noreply.github.com>
* feat(model-dist-002): Epic 0 — sprint plan + workspace prep
Sprint MODEL-DIST-002 picks up the adapter-only path deferred from
MODEL-DIST-001 Epic 2. Resolves the tooling gap documented in
epic_2_forensic.md.
Workspace prep:
- Vendored convert_lora_to_gguf.py from llama.cpp source (master, pinned
2026-05-21) into scripts/vendor/llama_cpp/ with MIT LICENSE attribution
and a README documenting refresh policy. brew install llama.cpp ships
convert_hf_to_gguf.py but NOT convert_lora_to_gguf.py; vendoring is the
cleanest path (vs requiring operators to clone llama.cpp source).
- pip install peft==0.19.1 + accelerate==1.13.0 + psutil==7.2.2 into
neural/.venv (the same venv that has torch + transformers + gguf from
MODEL-DIST-001 Epic 1). PEFT is needed for PEFT-format schema validation
+ as a dependency of convert_lora_to_gguf.py.
- Inspected convert_lora_to_gguf.py — expects directory with
adapter_config.json + adapter_model.safetensors in PEFT layout. Confirms
the MLX → PEFT direction is `lora_A: (rank, input)` and
`lora_B: (output, rank)` (script line 41-42 docstring).
Sprint plan in 12-section v1.0 format. 7 epics, 1-2 dev-day estimate.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(model-dist-002): Epic 1-3 — MLX adapter → PEFT → GGUF LoRA + live verify
Sprint MODEL-DIST-002 Epics 1, 2, 3 (combined commit).
Epic 1 — MLX → PEFT converter (scripts/mlx_adapter_to_peft.py + 14 unit tests):
Reads adapters/tier1/adapters.safetensors (514 MB MLX format, 560 tensors,
Phase 5 SFT Iter 2400 best). Per the analysis in MODEL-DIST-001
epic_2_forensic.md:
Key rename: model.layers.<N>.<module>.lora_a
-> base_model.model.model.layers.<N>.<module>.lora_A.weight
Tensor transpose: lora_a (input,rank) -> (rank,input)
lora_b (rank,output) -> (output,rank)
Emits PEFT-format adapter_config.json + adapter_model.safetensors.
Single-adapter PEFT layout (.lora_A.weight, not .lora_A.default.weight)
required by convert_lora_to_gguf.py.
Epic 2 — PEFT → GGUF LoRA (scripts/vendor/llama_cpp/convert_lora_to_gguf.py):
Pinned to llama.cpp release b9000 (self-contained version; upstream master
refactored to a conversion/ Python package with 30+ model files, excessive
vendoring scope). README documents refresh policy.
Output: .local-models/mdemg-llm-v1-adapter.gguf
SHA256: 0cfaf4bae3215a4aea664a8d28ae9a41d73ee740cbcce5c2eef950232cfe1de5
Size: 257 MB (vs ~9 GB fused Q5_K_M; ~35x smaller download)
Tensor count: 560 (matches expected 40 layers x 7 target_modules x 2)
Epic 3 — Live verification (docs/development/model-dist-002/verification.md):
Side-port llama-server on 127.0.0.1:18103 with f16 base + adapter; sanity
prompt vs production 8102 fused model returns semantically-aligned outputs
on the same prompt — both describe MDEMG as a knowledge-graph memory
system. Confirms the MLX-PEFT-GGUF chain is structurally correct.
Iteration during Epic 2 (worth noting):
- Initial vendored convert_lora_to_gguf.py from upstream master failed
with ImportError (refactored to use conversion/ package). Pinned to
b9000 release which is self-contained.
- Initial PEFT keys used .default.weight suffix (multi-adapter layout);
convert_lora_to_gguf.py rejected with \"Not a lora_A or lora_B tensor.\"
Switched to single-adapter layout (.weight) which the script accepts.
Test results: 14/14 Tier 1 tests green; PEFT output loads via
peft.PeftConfig.from_pretrained; GGUF emission completes with all 560
tensors; runtime adapter application produces coherent outputs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(model-dist-002): Epic 4 local — Modelfile.adapter + ollama create
Authored packaging/ollama/Modelfile.adapter:
FROM qwen3:14b
ADAPTER ../../.local-models/mdemg-llm-v1-adapter.gguf
PARAMETER num_ctx 32768 num_predict 4096 stop "<|im_end|>" stop "<|im_start|>"
SYSTEM (Qwen3-14B mdemg fine-tune positioning)
LICENSE Apache 2.0 (inherits from base)
Local ollama create succeeded:
reh3376/mdemg-llm-v1-adapter:latest
Local ID dda290492091
Layers: qwen3:14b base (a8cc1361...) + adapter blob (0cfaf4ba...)
+ template + license + params + system
quant_manifest.json adapter block updated:
status: "deferred to MODEL-DIST-002" -> "local-create done; push pending"
sha256, size_bytes, ollama_local_id captured
pipeline field added (MLX -> PEFT -> GGUF LoRA chain)
Push is operator-gated per MODEL-DIST-001 pattern (one-way action). After
push, ollama_manifest_digest will be captured and embedded quant_manifest.json
will be updated alongside.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(cli): enable mdemg model pull --adapter (MODEL-DIST-002 Epic 5+6)
Lifts the ErrAdapterDeferred guard from MODEL-DIST-001's deferred adapter
path now that reh3376/mdemg-llm-v1-adapter:latest is published.
CLI changes:
- model_fetcher_ollama.go: removed deferral guard from Fetch; switched
readModelBlobDigest to target application/vnd.ollama.image.adapter
mediaType for adapter pulls; added destFilename() helper so adapter
symlinks land at <name>-adapter.gguf (no quant suffix).
- model.go: SHA verify in runModelPull now branches on req.Adapter to
look up mf.Adapter when pulling the adapter form; tag printout shows
<ns>/<name>-adapter:latest for adapter pulls instead of the resolved
fused quant.
- model_fetcher.go: ErrAdapterDeferred sentinel retained for future
non-Ollama backends that ship fused-only first; not currently returned.
QuantManifest gained Adapter *QuantRecord field.
Manifest updates (both embedded + canonical):
- adapter SHA256 0cfaf4bae3215a4aea664a8d28ae9a41d73ee740cbcce5c2eef950232cfe1de5
- Ollama manifest digest sha256:57b98b97ede0e340e8c530aabf579136616ba670281fe04b14777164e655c278
- ollama_media_type application/vnd.ollama.image.adapter
Tests:
- Removed TestOllamaFetcher_AdapterDeferred.
- Added TestDestFilename_FusedQuantAndAdapter (6 cases).
- Added TestOllamaFetcher_ReadAdapterBlobDigest_FiltersOnAdapterMediaType.
Tier 3 live e2e: mdemg model pull --adapter completed in 987 ms, SHA
verify ok, symlink at ~/.mdemg/models/mdemg-llm-v1-adapter.gguf, and
llama-server --lora produced coherent inference against the symlinked
adapter ("MDEMG is a knowledge graph memory system...").
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(model-dist-002): flip adapter section to shipped + sprint close
Epic 7 (Documentation Update — never cut).
- docs/features/local-model-distribution.md: adapter section flipped from
"deferred to MODEL-DIST-002" to "shipped 2026-05-25"; status header
updated; Configurability Contract table adds --adapter flag row.
- CHANGELOG.md: Unreleased gains "Sprint MODEL-DIST-002 — Adapter-only
distribution path shipped" entry with full pipeline + verification +
SHA + Ollama manifest digest.
- CLAUDE.md Model Distribution architecture note: replaces "adapter-only
deferred to MODEL-DIST-002+" with the operator-facing recipe and the
pinned-toolchain pointer.
- docs/development/model-dist-002/post.md: sprint close with epic-by-epic
outcomes, acceptance criteria check-off, surprise log, and forward-
looking notes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(eventgraph-001): sprint plan (Pattern Y1 TSDB-federation)
Sprint EVENTGRAPH-001 — Reinforcement-Event TSDB Hypertable + Graph
Federation. First implementation of Pattern Y1 from the TypeDB-inspired
topology discussion: federate events into TSDB rather than reify them in
the Neo4j graph, preserve graph traversal via a Go orchestration layer.
12-section v1.0 format; 8 sequential epics; ~1.5-2 dev-days; $0 LLM;
low-medium risk (touches the Hebbian hot write path so the new writer
must be fully non-blocking + the Cypher RETURN-shape change must be
backwards-compatible at the Go call site).
Targets ApplyCoactivation only for v1. Other Hebbian entry points
(ApplySymbolCoactivation, CoactivateSession, ApplyNegativeFeedback)
deferred to EVENTGRAPH-003 once the pattern proves out under
production traffic. Pattern Y2 (link-node promotion in Neo4j)
explicitly deferred until a query proves federation-in-Go insufficient.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(tsdb): V0022 reinforcement_events hypertable (EVENTGRAPH-001 Epic 1)
One row per Hebbian co-activation pair update. Captures prev/new weight
(plus signed delta), evidence_count_after, eta_effective, surprise_factor,
activation_product, path_sim, role/obs_type of both endpoints, session_id,
direction (forward/reverse/bidirectional), and a created_new_edge flag
that distinguishes "new connection formed" from "existing connection
strengthened" at analysis time. trigger_path column will distinguish
ApplyCoactivation from EVENTGRAPH-003's other Hebbian entry points.
7-day chunks (same as V0017-V0021). 4 indexes: per-space time-series,
src+time, dst+time, partial index on (space_id, session_id, time) where
session_id is set. Federation API (Epic 5) needs src + dst lookups for
the graph-neighborhood join.
Buffered + flushed via CopyFrom on TSDB_FLUSH_INTERVAL_SEC cadence
(default 30s). Pattern matches V0019 (sparse_gate_metrics) buffered
writer, NOT V0021 (model_install_events) sync writer — Hebbian writes
are per-retrieve, far higher volume than CLI-driven model install
events.
Config: TSDB_REQUIRED_SCHEMA_VERSION default bumped 21 -> 22.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(tsdb): buffered reinforcement_events writer (EVENTGRAPH-001 Epic 2)
internal/tsdb/reinforcement_writer.go — buffered CopyFrom writer mirroring
the V0019 SparseGateMetricsWriter pattern. 30s auto-flush ticker, Close()
drains buffer + flushes final batch, idempotent across multiple Close
calls. FIFO eviction on buffer-full matches the LLMInteractionWriter
precedent; eviction counted in droppedRows for Epic 6 Prometheus
surfacing.
ReinforcementEventRow serializes optional float / string fields via
nullableFloat / nullableString helpers — zero-valued inputs land as DB
NULL rather than 0 / '', so analytic queries can distinguish "no data"
from "actually zero." Required fields (prev/new/delta weight,
evidence_count_after, created_new_edge, trigger_path) are never
nullable.
Tier 1 unit tests (9 green):
- Record + Flush writes all rows with correct table + column shape.
- Empty buffer Flush is a no-op (no CopyFrom call).
- Buffer-full evicts oldest, increments droppedRows counter.
- Unlimited buffer (maxBufferSize=0) never drops.
- Nullable serialization: zero-valued optionals → DB NULL.
- Flush error increments FailureCount; SuccessCount/TotalRows unchanged.
- Close drains buffer (final flush triggered).
- Close is idempotent (Close × 2 does not double-flush).
- Auto-flush ticker fires within deadline.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* refactor(learning): expose per-pair telemetry from Hebbian Cypher (EVENTGRAPH-001 Epic 3)
ApplyCoactivation Cypher RETURN clause extended from "count(*) AS updated"
to 17 per-pair columns: src/dst node IDs, prev/new/delta weight,
evidence_count_after, eta_effective (cfg.LearningEta × etaMult),
surprise_factor, activation_product, path_sim, role_a/b, obs_type_a/b,
session_id, direction (forward/reverse/bidirectional), created_new_edge.
created_new_edge derived from (r.evidence_count = 1) — the ON CREATE
branch sets evidence_count to 1; ON MATCH increments. Reliable proxy
for "new connection formed" vs "existing connection strengthened" at
analysis time.
Plan-deviation disclosure (per feedback_plan_options_pattern.md): the
plan called for 2 rows per pair in asymmetric mode (forward + reverse).
The Cypher mirrors rr.weight = r.weight at all times — forward and
reverse edges carry identical weights. Emitting 2 rows would double-
count without adding signal. Final choice: 1 row per logical pair
regardless of mode, with the direction column carrying the
forward/reverse/bidirectional distinction. Revisit if EVENTGRAPH-003
introduces a Hebbian path where forward/reverse weights diverge.
New helper internal/learning/reinforcement_parser.go translates a
neo4j.Record (or any (key) → (any, bool) getter) into a
tsdb.ReinforcementEventRow. Lives in its own file so service.go
doesn't grow. Defensive against missing keys (zero values), nil values
(zero/empty), wrong types (fallback to zero) — no panics.
Tier 1 unit tests (6 green) cover:
- Symmetric bidirectional + ON CREATE branch
- Asymmetric forward + ON MATCH branch (evidence > 1)
- Missing optional fields → zero values (nullable* writer helpers
serialize as DB NULL)
- Neo4j int64 → Go int coercion
- nil values → zero/empty
- Wrong-typed values → graceful fallback
Reinforcement rows are captured locally in ApplyCoactivation but not
yet forwarded to TSDB — Epic 4 wires the writer.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(learning): record reinforcement events to TSDB (EVENTGRAPH-001 Epic 4)
learning.Service grows a reinforcementWriter field + SetReinforcementWriter
setter (mirrors the SetStabilityReinforcer back-compat pattern). After
ExecuteWrite returns from ApplyCoactivation, each captured per-pair row
gets the spaceID stamped on it and is enqueued via writer.Record. The
writer is non-blocking; the Hebbian hot path never waits on TSDB.
Configurability Contract — 7 new env vars (no-hardcoding rule):
- EVENTGRAPH_ENABLED (bool, default true)
- EVENTGRAPH_WRITER_FLUSH_INTERVAL_SEC (int, default 30, floor 5)
- EVENTGRAPH_WRITER_BUFFER_SIZE (int, default 1000, 0 = unlimited)
- EVENTGRAPH_MAX_PAIRS_PER_EVENT_BATCH (int, default 200)
- EVENTGRAPH_MAX_EVENTS_PER_QUERY (int, default 500, Epic 5 ceiling)
- EVENTGRAPH_FEDERATION_DEFAULT_HOPS (int, default 2)
- EVENTGRAPH_FEDERATION_DEFAULT_LOOKBACK_HOURS (int, default 24)
api/server.go wires the writer's lifecycle:
- Constructed after TSDB client is ready, gated by cfg.EventGraphEnabled
so EVENTGRAPH_ENABLED=false cleanly skips construction; learner's
reinforcementWriter stays nil and the Hebbian path short-circuits.
- Closed alongside the other TSDB writers in graceful-shutdown — buffer
drains before the process exits.
Tier 2 integration tests (against real TSDB, build tag integration):
- TestEventGraph_Writer_RoundTrip: 3 rows recorded → flush-window
elapses → SELECT count(*) returns 3.
- TestEventGraph_Writer_DrainOnClose: 5 rows recorded with 1-hour flush
interval → Close() drains → SELECT returns 5 (verifies the server
shutdown invariant).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eventgraph): federation query helper + API endpoint (EVENTGRAPH-001 Epic 5)
internal/eventgraph/query.go — Pattern Y1 federation helper.
EventsInGraphNeighborhood orchestrates a two-step query:
1. Cypher graph walk from a seed node — variable-length path over
CO_ACTIVATED_WITH | GENERALIZES at depth 0..Hops. Returns the
N-hop neighborhood (DISTINCT node_ids, includes the seed).
2. TSDB query against reinforcement_events for events where src OR
dst is in the neighborhood, within the lookback window, ordered
newest-first, capped at the configured limit.
3. Go-side join — annotates events with SrcInNeighborhood /
DstInNeighborhood so the consumer can distinguish "both endpoints
in the subgraph" from "one endpoint outside the seed's N-hop
reach but the event still touches our subgraph."
Empty neighborhood (no seed match, hops=0) short-circuits before the
TSDB call. Sub-1-second Since values clamp to 1s. Hops < 0 is rejected
upfront. The handler enforces an additional ceiling of 2 ×
EVENTGRAPH_FEDERATION_DEFAULT_HOPS for runaway-walk protection.
internal/api/eventgraph_handler.go — POST /v1/eventgraph/reinforcement-
neighborhood. Same auth convention as /v1/admin/breakers. 503 when
EVENTGRAPH_ENABLED=false or when eventgraphService is nil (TSDB-down at
boot). 400 on missing space_id / seed_node_id / negative hops / hops >
ceiling. Defaults applied from config when fields omitted from request.
Plan-decision disclosure (per feedback_plan_options_pattern.md): plan
proposed Option A (single endpoint with event_type query param) vs
Option B (endpoint per event class). Final choice: A. v1 has one event
class (reinforcement); the endpoint URL is explicit about that.
EVENTGRAPH-002 can either add a query param or split the URL when a
second event class arrives — no breaking change either way.
Tests:
- Tier 1 (internal/eventgraph/query_test.go, 7 green): request
validation rejects empty space_id, empty seed, negative hops; interval
formatting roundtrips; join annotation handles both-inside,
one-outside, and empty-neighborhood cases.
- Tier 1 (internal/api/eventgraph_handler_test.go, 4 green + 2 skipped):
method-not-allowed, feature-disabled 503, nil-service 503, invalid-
JSON short-circuit. Two validation paths skipped — they require a
non-nil eventgraphService which can't be constructed without a real
driver; Tier 2 exercises them.
- Tier 2 (tests/integration/eventgraph_federation_test.go, 1 green):
builds seed--mid--leaf graph + off-node, emits 3 reinforcement
events touching all four nodes, calls federation at hops=0 and
hops=1, asserts neighborhood + in-neighborhood flags. The hops=0
test confirms that mid↔leaf (touching neither seed nor any 0-hop
neighbor) is correctly excluded.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(observability): Grafana panel + Prometheus counters for reinforcement events (EVENTGRAPH-001 Epic 6)
Three new Prometheus counters mirror the V0022 writer's internal atomic
counters:
- mdemg_eventgraph_writer_rows_enqueued_total — rows successfully CopyFrom'd
- mdemg_eventgraph_writer_rows_dropped_total — rows FIFO-evicted (buffer full)
- mdemg_eventgraph_writer_flush_failure_total — flush errors
Wiring: the writer accepts a narrow PrometheusCounter interface
(Add(int64)) so internal/tsdb does not import internal/metrics (which
would cycle). api/server.go calls SetPrometheusCounters after the
writer is constructed, passing the three counters from the global
StandardMetrics struct. Nil-safe.
Dashboard: mdemg-graph-topology.json gains a new collapsed row
"Reinforcement Events (EVENTGRAPH-001)" with a single time-series
panel "Reinforcement Event Rate (events/min)" showing all three rates
(enqueued / dropped / flush failures) over the last 24h. Dropped is
colored orange, flush failures red, enqueued the default palette. Tied
to the prometheus datasource.
The existing GRAFANA-AUDIT-001 harness (scripts/grafana_panel_audit.py)
only evaluates SQL-target panels — the new panel uses Prometheus
queries, so it lands on the SKIP pile, same as the other 8 Cypher /
Prometheus panels on this dashboard. Audit JSON refreshed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(eventgraph-001): restore full GRAFANA-AUDIT-001 audit_results.json
Epic 6's targeted audit run (scripts/grafana_panel_audit.py --dashboard
mdemg-graph-topology.json) overwrote the full multi-dashboard audit
results from GRAFANA-AUDIT-001 with the single-dashboard subset (9
SKIPs only). Restoring the full snapshot from commit 0a1e8e1 — that
audit covered all 8 dashboards and is the canonical baseline the
GRAFANA-AUDIT-001 post.md references. EVENTGRAPH-001 did not need to
regenerate it; the new panel uses Prometheus queries, which the audit
harness SKIPs regardless of dashboard.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(retrieval): set Activation on RRF RetrieveResult (EVENTGRAPH-001 fix-commit)
ScoreAndRankRRF's ConsensusResult → RetrieveResult conversion was
silently dropping the Activation field. The legacy ScoreAndRank path at
scoring.go:883 sets Activation: a (where a := act[c.NodeID] is the
spreading-activation map value). The RRF path constructed
models.RetrieveResult{...} with no Activation key, leaving the field
zero-valued.
Net effect: since Phase 13.1 default-on (2026-05-03),
learning.Service.ApplyCoactivation has filtered out every L0 candidate
on the retrieve hot path. The filter is r.Activation >=
LearningMinActivation (default 0.20). With Activation=0, no pair makes
it to the Hebbian Cypher; the function returns nil without writing.
Hebbian learning has been silently no-op on the production retrieve
goroutine for ~24 days. CO_ACTIVATED_WITH edges still exist in the
graph — sidecar paths (CoactivateSession, ApplySymbolCoactivation,
consolidation walks) and pre-Phase-13.1 retrieves wrote them — but the
retrieve-time goroutine has been a silent no-op.
Discovered during EVENTGRAPH-001 Epic 7 live e2e. Three retrieves
produced 0 rows in reinforcement_events. Investigation traced the gap
to the missing Activation field.
Fix: one-line addition in scoring_rrf.go — Activation: act[c.NodeID].
Brings the RRF path to parity with the legacy scorer.
Post-fix verification: rebuilt, restarted server, re-issued 3 retrieves
→ 10 reinforcement events landed in TSDB. Federation API at hops=1
correctly returned all 10 with src_in_neighborhood=true,
dst_in_neighborhood=true. Documented in
docs/development/eventgraph-001/verification.md.
Per CLAUDE.md "Testing — Live System Testing Is Required":
"surprise bugs caught during live smoke get their own follow-up
fix-commit — do not silently roll them into the sprint commit." This
is the precedent-aligned separate commit.
Forward-only: existing graph state is preserved; new retrieves now
correctly emit Hebbian updates. EVENTGRAPH-002 may revisit whether to
backfill the missing 24-day window.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(eventgraph-001): Tier 3 live e2e verification transcript (Epic 7)
Real /v1/memory/retrieve × 3 against mdemg-dev → 10 reinforcement events
landed in TSDB within the flush window. Federation API at hops=1 from a
seed node returned 5-node neighborhood + 10 in-neighborhood events.
Documents the surprise-bug discovery + fix that preceded this transcript
(see fix-commit for scoring_rrf.go::ScoreAndRankRRF Activation
propagation).
Acceptance criteria from sprint plan §"Acceptance Criteria" all PASS.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(eventgraph-001): feature doc + CHANGELOG + CLAUDE.md + sprint close (Epic 8)
Final epic — Documentation Update (never cut, per feedback_per_feature_docs_required.md
and the standardized v1.0 sprint plan format).
New: docs/features/event-graph-federation.md (~240 lines, Why / Choices /
How it works / How to use / Forward-looking). Documents:
- Pattern Y1 vs Y2 trade-off (why federation-in-Go now, link-node
reification deferred until a query forces it)
- Why V0019 buffered-CopyFrom over V0021 sync-INSERT (per-retrieve volume)
- Why ApplyCoactivation first (other 3 Hebbian entry points deferred to
EVENTGRAPH-003)
- Why forward-only (no source to backfill from)
- Federation pipeline (Cypher walk → TSDB query → Go-side join with
src/dst_in_neighborhood annotation)
- TSDB schema, API request/response shape, 7 env vars + defaults
- Observability (3 Prometheus counters + Grafana panel)
- Forward-looking sprints
New: docs/development/eventgraph-001/post.md — epic-by-epic outcomes,
acceptance criteria check-off, surprise log (RRF Activation drop +
audit-JSON overwrite + orphan-process port collision), plan deviations
disclosed (1-row-per-pair regardless of asymmetric mode; single-
endpoint over endpoint-per-class), forward-looking.
CHANGELOG.md Unreleased gains the EVENTGRAPH-001 entry — 11 bullet
points covering V0022 migration, buffered writer, Cypher RETURN-shape
change, Configurability Contract, federation helper + API, Prometheus
+ Grafana, Tier 2 + Tier 3 verification, the surprise-bug RRF
Activation fix-commit, and the audit-JSON restore.
CLAUDE.md Architecture Notes gains a new "Event Graph Federation" entry
above the Model Distribution section. Documents the pattern, surface,
deferrals, and the load-bearing fix-commit f307f55 that surfaced 24
days of silent Hebbian no-op on the retrieve hot path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(eventgraph-001): Grafana panel uses TSDB instead of unconfigured Prometheus datasource
The Epic 6 panel used datasource {type: prometheus, uid: prometheus} but
this Grafana instance has no Prometheus datasource configured — mdemg
exposes counters as JSON via /v1/metrics/snapshot, not a /metrics scrape
endpoint. Configured datasources: mdemg-nodegraph, neo4j, timescaledb
only. The panel rendered "No data" in the live Grafana.
Rewritten panel queries the reinforcement_events hypertable directly via
the timescaledb postgres datasource. Two targets:
1. count(*) over 1-minute time_buckets → overall events/min
2. count(*) FILTER (WHERE created_new_edge) vs WHERE NOT created_new_edge
→ split between new connections formed and existing connections
strengthened (the operational dimension the analytic queries
actually need)
Both targets templated on $space_id (existing dashboard variable). The
Prometheus counters (mdemg_eventgraph_writer_rows_{enqueued,dropped,
flush_failure}_total) remain wired and incrementing — they surface via
/v1/metrics/snapshot for ops scripts. The Grafana panel now actually
displays data instead of relying on a scrape path that doesn't exist
in this deployment.
Discovered during post-merge live verification (2026-05-29). Verified
fix: reloaded dashboard via Grafana API → /api/ds/query against same
SQL returns 1-minute buckets matching TSDB direct count. Audit harness
now reports 2 PASS for the new panel (previously SKIP — no SQL target).
verification.md updated with the post-merge transcript.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(rrf-scale-001): sprint plan — RRF score-scale consumer remediation
P0 fix. The Jiminy guidance->feedback->outcome loop has been dormant
~9 weeks: consulting/service.go gates constraint/suggestion extraction
on hardcoded legacy-scale score thresholds (r.Score < 0.55 et al.).
Phase 13.1 RRF (default-on May 3) dropped the score scale so strong
matches top out ~0.53 -> 0/10 results clear the gates -> empty guidance
-> dead loop. Third instance of the RRF-score-contract bug class (after
the EVENTGRAPH-001 Activation drop).
12-section format; 6 epics; config-driven percentile-gate fix +
sigmoid recalibration; live-verify the revived loop end-to-end.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(rrf-scale-001): Epic 1 audit findings — 12 sites cataloged
Full-repo sweep of post-RRF score/activation/confidence consumers +
live score-distribution sampling. Findings:
- HIGH (4): consulting constraint gates (1005/1081/1087) + confidence
sigmoid midpoint 1.5 (35-36) — the loop-killer cluster.
- MED (5): consulting conflict gates (931/944/957/981) + minConfidence
pre-filter (619, already config-driven).
- LOW (3): retrieval/jiminy.go Activation display gates (45/155/192) —
explanation text only, no guidance gating.
- NONE (2): jiminy trial score (0-10 scale), trust-score clamp.
Live distribution: RRF strong-match top scores cluster 0.49-0.58; the
0.55 gate sits mid-band, rejecting the most-relevant constraint half
the time. NormalizedConfidence is positional rank (spreads 100->0 even
on uniform-score sets) -> rules out plan Option A (percentile) as sole
gate. Remediation: config-driven RRF-calibrated absolute thresholds
(Option B), constraint floor default 0.45, sigmoid midpoint ->0.45.
Disclosed deviation per feedback_plan_options_pattern.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(consulting): RRF-calibrate score gates + confidence sigmoid (RRF-SCALE-001 Epic 2)
Revives the dormant Jiminy guidance loop. Replaces 7 hardcoded legacy-
scale score gates in consulting/service.go + the score->confidence
sigmoid (both copies) with config-driven, RRF-calibrated values.
Gates (all default 0.45, RRF strong-match band is 0.49-0.58):
- constraint extraction (was <0.55) -> CONSULTING_CONSTRAINT_SCORE_FLOOR
- keyword/name authority inner gate (0.55/0.6) -> CONSULTING_AUTHORITY_SCORE_FLOOR
- conflict/contradiction detection (0.6-0.7) -> CONSULTING_CONFLICT_SCORE_FLOOR
Key Epic-2 finding: keywordClassifyConstraint has an INNER authority
gate that binds tighter than the outer constraint gate. If authority
floor > constraint floor, the binding gate re-rejects the strong-match
band and the loop stays dormant -> all three default to 0.45. The RRF
band is too compressed to subdivide into tiers; knobs stay separate so
operators can raise any one independently.
Sigmoid (score->confidence), both consulting/service.go and
jiminy/retrieval_source.go (they MUST stay in sync per their own
comments): midpoint 1.5 -> 0.45, steepness 1.5 -> 8.0, config-driven via
RETRIEVAL_CONFIDENCE_SIGMOID_{MIDPOINT,STEEPNESS}. Legacy crushed a
strong 0.5 match to 0.18 confidence; recalibrated maps it to 0.60
(0.1->0.06, 0.58->0.74). normalizeRetrievalConfidence is now a Service
method reading cfg with zero-value fallback; mapRetrievalToGuidance
takes the sigmoid params from its caller's cfg.
5 new config knobs, all with RRF-calibrated defaults + zero-value
guards (no-hardcoding rule; the bug WAS a hardcoded value).
Tier 1 tests: updated 2 legacy-scale boundary tests to the new
thresholds + added RRFStrongMatchBand regression (0.50 must surface),
ConstraintFloor_ConfigDriven (override honored), and
NormalizeRetrievalConfidence_RRFCalibration (band mapping). Full
consulting + jiminy + config suites green; lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(rrf-scale-001): Epic 3 — remaining LOW findings reviewed + decided
retrieval/jiminy.go Activation display gates (45/155/192 + LearningEdge
siblings) traced live: they're in the explainability renderer, not the
guidance-surfacing path; always-additive at RRF scale (live activation
~0.723 >> thresholds), no misbehavior. Intentionally left unchanged with
rationale — config-ifying display verbosity is out of proportion to zero
functional impact. Every High/Med remediated (Epic 2), every Low decided.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(rrf-scale-001): Tier 2 integration + Tier 3 live verification (Epic 4)
Tier 3 live e2e (verification.md): the score-gate fix revives the
dormant guidance loop on the live stack —
- /v1/jiminy/guide guidance items 0 -> 10, source_counts.constraints
0 -> 2, patterns 0 -> 3 (acceptance #1 MET).
- Full loop warm->latest->feedback->outcome: TSDB constraint_outcomes
sink REVIVED — fresh rows dated 2026-06-03 (table was dead since
May 1). Constraint-effectiveness Grafana sink is live again.
Three adjacent issues surfaced during live smoke, documented as distinct
follow-ups (NOT score-scale, not bolted on):
- A: Neo4j GUIDANCE_OUTCOME edges still dormant — guidance SourceNodes
point at emergent_concept nodes; PersistGuidanceOutcome only writes
edges for constraint/correction/pattern/learning or role_type=
constraint targets. Node-type-targeting bug, independent of RRF.
Candidate sprint JIMINY-OUTCOME-001.
- B: LLM guidance synthesis timeout (now that synthesis runs).
- C: /v1/jiminy/latest unescaped control chars break jq/json parsers —
the hook uses jq, so may compound dormancy. Low-effort follow-up.
Tier 2 (rrf_scale_guidance_test.go, integration tag, 2 green):
- SuggestSurfacesGuidance: constraint-matching context surfaces 7
suggestions (was 0 before fix) against live mdemg-dev.
- SuggestRejectsNoise: gibberish does not flood constraints (no
over-correction).
Cold-start note: first guide call post-restart returned constraints:0
(LLM classifier cold-model timeout -> keyword fallback); after one
warm-up call, constraints surface. Model-warmth artifact, not a fix
defect.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(rrf-scale-001): CHANGELOG + CLAUDE.md score-scale contract + post.md (Epic 5)
Final epic. CHANGELOG Unreleased gains the RRF-SCALE-001 Fixed entry.
CLAUDE.md gains a 'score-scale contract' architecture note — the
structural defense against a 4th instance: downstream consumers MUST
NOT hardcode absolute thresholds against RetrieveResult.Score (the
scorer scale is not a stable contract); gate via config or a
scale-invariant signal, and re-audit on any scorer change. Notes that
NormalizedConfidence is positional (not a safe sole gate) and records
the three open follow-ups.
post.md: epic-by-epic, acceptance check-off (honest: #2 partial — TSDB
sink revived, Neo4j edge is distinct Follow-up A), scope note
separating the score-scale fix (done) from the 3 adjacent surfaced
issues (documented follow-ups), discipline notes (cold-start mask,
inner authority gate).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(rrf-scale-001): skip guidance integration tests on empty environment (CI fix)
CI failure on PR 404: TestRRFScale_SuggestSurfacesGuidance failed in
0.02s. Root cause: the test assumed the populated local mdemg-dev space
(111 constraint nodes), but CI boots a FRESH EMPTY Neo4j with stub
embeddings (and RETRIEVAL_COLUMN_VOTING_ENABLED=false / legacy scorer).
With no data, /v1/memory/suggest returns 0 candidates, so the
'total == 0' assertion fired.
Other integration tests self-seed data or skip when prerequisites are
absent; mine relied on ambient data — wrong for a reproducible CI run.
Fix: skip when debug.retrieved_count == 0 (no retrievable data → the
score-gate fix isn't exercisable; there's nothing for the gate to admit
or reject). The test stays meaningful against a populated stack (local:
9 suggestions from 15 retrieved → PASS) and skips cleanly in CI's
empty-DB environment. Verified both paths live: populated → PASS,
empty space → retrieved_count 0 → SKIP.
The gate fix itself is validated by Tier 1 unit tests + the live Tier 3
e2e (docs/development/rrf-scale-001/verification.md); this integration
test is a bonus live-stack assertion, not the primary proof.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-outcome-001): sprint plan — revive Neo4j GUIDANCE_OUTCOME sink
Follow-up A from RRF-SCALE-001: the Neo4j GUIDANCE_OUTCOME edge sink has
been dormant since Apr 12. Root cause: matchConstraintCode links guidance
items to constraint codes by keyword overlap (>=3 shared words), but
retrieval surfaces emergent_concept abstractions whose content does not
share 3+ literal words with raw constraint text -> no constraint_code ->
PersistGuidanceOutcome falls back to the concept SourceNode -> the
role_type=constraint filter rejects it -> no edge. Live-proven: all 17
recent outcome rows had constraint_code=(none).
Fix (Option 1): switch the matcher to embedding cosine similarity
(content already normalized to natural language ~0.70 cosine; Service
has an embedder; cosineSimilarity + embed->cosine pattern already exist
in-package via OutcomeClassifier). Existing PersistGuidanceOutcome +
findConstraintNodeID then create edges on the correct constraint nodes.
Keyword matching stays as fallback -- never regresses.
4 epics; ~1-1.5 dev-days; config-driven threshold; acceptance bar = a
fresh Neo4j GUIDANCE_OUTCOME edge on a real role_type=constraint node
dated today, reflected in GetConstraintEffectiveness.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(jiminy): embedding-similarity constraint-code matching (JIMINY-OUTCOME-001 Epic 1)
Revives the Neo4j GUIDANCE_OUTCOME edge sink (dormant since Apr 12).
Root cause (RRF-SCALE-001 Follow-up A): matchConstraintCode links
guidance items to constraint codes by keyword overlap (>=3 shared
words), but retrieval surfaces emergent_concept abstractions whose
content rarely shares 3+ literal words with raw constraint text -> no
code -> PersistGuidanceOutcome falls back to the concept SourceNode ->
the role_type=constraint filter rejects it -> no edge.
Fix: new matchConstraintCodeByEmbedding queries the constraint vector
index (db.index.vector.queryNodes, role_type=constraint, sim >=
threshold) and returns the closest constraint's code. Guide() tries
this first, falling back to the keyword matcher when the embedder is
unavailable, content is empty, or nothing clears the threshold — never
regresses. The existing PersistGuidanceOutcome + findConstraintNodeID
then create the edge on the correct constraint node.
Implementation refinement vs plan: uses Neo4j's vector index server-side
(mirrors the proven Evaluator.findMatchingConstraints pattern) rather
than loading all constraint embeddings into Go and computing cosine in a
loop — cleaner, no constraintCodeEntry.Embedding needed. Same Option-1
outcome.
Config: JIMINY_CONSTRAINT_CODE_SIM_THRESHOLD (default 0.55, zero-value
fallback) — provisional; tuned against the live similarity distribution
in Epic 2.
Tier 1 (4 tests): nil-driver/empty-embedding guards, threshold default
resolution, keyword-fallback non-regression. Full jiminy + config
suites green; lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(jiminy-outcome-001): Tier 2 integration + Tier 3 live verification (Epic 2)
Tier 3 live e2e (verification.md) — acceptance bar MET:
- /v1/jiminy/guide now yields guidance items carrying constraint_codes
(10 items, 6 coded; was 0). Matched code 'no-direct-main-commits' is
semantically exact for the 'commit to main' context.
- Full warm->latest->feedback loop: Neo4j GUIDANCE_OUTCOME 893 -> 899
(+6), latest today. All 6 new edges land on REAL role_type=constraint
nodes ('CONSTRAINT: NEVER commit directly to main') — not
emergent_concept. The sink dormant since Apr 12 is revived on the
correct nodes.
- /v1/constraints/effectiveness reflects it: 'NEVER commit directly to
main | surfaced: 30 followed: 28 rate: 0.93'.
- Both sinks now revived: TSDB (RRF-SCALE-001) + Neo4j (here). The
constraint-effectiveness loop is fully restored.
Threshold 0.55 validated live: correct matches, no false positives.
Tier 2 (jiminy_outcome_test.go, integration tag, skip-on-empty): PASSES
on a populated stack with an idle LLM (7/10 items coded). The guide path
is LLM-latency-dependent (per-node classifier ~31s/call, serialized; a
call fired while the LLM is busy fast-fails empty), so the test
warm-retries and SKIPS (never false-fails) when the LLM path can't
produce items. Bonus check; Tier 3 is the definitive proof. The LLM
serialization/synthesis-timeout is RRF-SCALE-001 Follow-up B, tracked
separately.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-outcome-001): CHANGELOG + CLAUDE.md + post.md (Epic 3)
Final epic. CHANGELOG Unreleased gains the JIMINY-OUTCOME-001 Fixed
entry. CLAUDE.md gains a guidance-outcome constraint-code-matching note
(embedding-first via vector index, keyword fallback; both outcome sinks
now live). post.md: epic-by-epic, acceptance check-off, the loop-revival
completion (TSDB from RRF-SCALE-001 + Neo4j here), discipline notes (LLM
serialization is the test-flakiness source), forward-looking (Follow-up
B now the most operationally-visible remaining issue).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(guidance-synth-001): sprint plan — fix guidance synthesis timeout (Follow-up B)
Synthesis fails on every production warm call (6/6 jiminy.synthesize
errored). Root cause: the hook's /warm path runs background Guide() with
a hardcoded 30s timeout (handlers_jiminy.go:302), inside which the
per-node constraint classifier runs SERIALLY (~1.5s x ~10 nodes ~= 15s),
leaving only ~15s for synthesis which needs 8-27s -> deadline exceeded.
JIMINY_TIMEOUT_MS=240s is configured but the 30s hardcode caps it.
Fix (both): (1) parallelize the per-node classifier with bounded
concurrency (CONSULTING_CLASSIFY_CONCURRENCY, default 4 matching
llama-server --parallel 4); (2) config-drive the warm timeout
(JIMINY_WARM_COMPUTE_TIMEOUT_MS, default 90s). Acceptance: synthesis
succeeds live (no synthesis_error), measured latency drop, no
constraint-surfacing regression.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(consulting): parallelize per-node constraint classifier (GUIDANCE-SYNTH-001 Epic 1)
The per-node LLM constraint classifier in findApplicableConstraints ran
serially (~1.5s/node x ~10 nodes ~= 15s), starving guidance synthesis of
its time budget (synthesis 6/6 errored on the warm path). Now classifies
with bounded concurrency.
- Gate-first (RRF-SCALE-001 score floor) to fix a stable candidate order,
then classify each candidate into a position-indexed slot via a
semaphore-bounded worker pool, then collect-in-order + dedup-by-name —
output is identical to the serial path (determinism). Keyword-only
(no LLM) or cap=1 runs serially (no LLM latency to hide).
- Config: CONSULTING_CLASSIFY_CONCURRENCY (default 4, matching
llama-server --parallel 4; floor 1 = serial rollback). Zero-value
fallback to 4.
- Extracted constraintClassifierIface (minimal Classify surface) so the
concurrent path is unit-testable with a fake; *ConstraintClassifier
satisfies it; SetConstraintClassifier guards against a typed-nil
interface.
Tier 1 (5 new, -race clean): ParallelEqualsSerial (determinism + order),
ParallelIsFaster (concurrency overlaps latency), ErrorFallsBackToKeyword
(fallback intact), ScoreGateStillApplies (RRF-SCALE-001 gate preserved),
ConcurrencyDefaultFallback. Existing findApplicableConstraints tests
unchanged — no regression.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(jiminy): config-drive warm-compute timeout (GUIDANCE-SYNTH-001 Epic 2)
The warm-path background Guide() ran with a hardcoded 30s timeout
(handlers_jiminy.go:302) even though JIMINY_TIMEOUT_MS=240s is
configured. 30s was too tight for the per-node classifier (~15s) +
synthesis (8-27s) -> synthesis deadline-exceeded every warm call.
Replaced with JIMINY_WARM_COMPUTE_TIMEOUT_MS (default 90000, zero-value
fallback 90000) — headroom for the now-parallel classifier (~7.5s) + a
slow 27s synthesis. No-hardcoding rule. Rollback: set to 30000.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test+docs(guidance-synth-001): Tier 2/3 verification + docs + close (Epic 3)
Tier 3 live e2e (verification.md): the warm production path now produces
a synthesized narrative — synthesis_used=true, no synthesis_error,
1892-char augmentation. Fresh jiminy.synthesize succeeded at 50.7s
latency (fit the new 90s budget; would die at the old 30s — validates
the default). Both fixes needed.
Tier 2 (guidance_synth_test.go, integration, skip-on-empty + LLM-
tolerant): PASS — warm path produces guidance without synthesis_error.
Docs: CHANGELOG Fixed entry; CLAUDE.md guidance-synthesis-budget note
('when adding LLM calls to the guidance hot path: respect the
warm-compute budget and prefer bounded concurrency over serial loops');
post.md with the data-driven diagnosis + forward-looking (Follow-up C
now the last open item).
Closes Follow-up B. The guidance pipeline (surfacing + codes +
synthesis) is fully functional end-to-end.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(followup-c): close JSON control-char escaping as NON-ISSUE (no fix)
Follow-up C (the last open item from RRF-SCALE-001 triage) investigated
and closed with evidence — NO code change, because there is no bug to fix.
The earlier /v1/jiminy/latest parse failures were client-side shell
artifacts (co-occurring with the session's 'failed to change group ID'
errors + ad-hoc variable-capture piping), not server bytes:
- writeJSON uses json.NewEncoder().Encode (encoding/json always escapes
control chars U+0000-U+001F); no raw-write bypass; no custom MarshalJSON.
- The synthesized narrative is double-StripControlChars'd (synthesizer.go
:127 + service.go:1116).
- prompt-context.sh already strips control chars via perl before jq, with
2>/dev/null + // empty fallbacks.
Live-verified: the hook's exact jq returns guidance_id correctly; 5 rapid
/latest fetches all parse as strict-valid JSON; 0 raw control chars.
Per 'don't fix a non-problem', shipping a fix would invent a bug that
doesn't exist. Closure documented in docs/development/followup-c-closure.md.
This closes the entire RRF-SCALE-001 follow-up triage: A (JIMINY-OUTCOME
-001), B (GUIDANCE-SYNTH-001), C (non-issue). The guidance->feedback->
outcome loop is fully functional end-to-end.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(jiminy): /guide 30s timeout sibling + single-source config defaults (GUIDANCE-SYNTH-001 fix-commit)
Two things, both from a live e2e of the full loop through the real
production hook path (run per user directive: standard tests don't find
live problems).
1. Sibling bug: the /v1/jiminy/guide handler had the SAME hardcoded 30s
cap as the warm path (handlers_jiminy.go). GUIDANCE-SYNTH-001 fixed
warm; /guide still deadline-exceeded synthesis at exactly 30.003s
(this is what made prior sprints' /guide integration tests flaky).
Now uses the config-driven budget. Live-verified: a 50.05s synthesis
completed (synthesis_used=true) — would die at 30s.
2. Single source of truth for config defaults (user directive: 'single
place to change all instances'). The 90s budget was duplicated as a
literal in 3 sites; prior sprints similarly duplicated each default
(the sigmoid 0.45/8.0 was in 3 places). Now each default is one
exported config.Default* const, referenced by FromEnv and aliased by
consuming-package fallbacks + a Config.JiminyWarmComputeTimeout()
method. Consolidated: warm-compute timeout, the 3 consulting score
floors, sigmoid midpoint/steepness, constraint-code sim threshold,
classify concurrency. Zero behavior change (compile-time aliases);
-race + full suites green.
Live e2e also re-confirmed: real hook captures guidance_id -> feedback
-> +7 Neo4j GUIDANCE_OUTCOME edges on real constraint nodes + 10 TSDB
rows (whole loop closes through the actual hook; re-confirms Follow-up C
non-issue).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-cli-001): sprint plan — federation consumer CLI + UATS backfill
Builds the first consumer for EVENTGRAPH-001's reinforcement-neighborhood
federation API (which has no consumer): a 'mdemg eventgraph' CLI command.
Validates the Pattern Y1 bet + becomes the live-testing harness for
EVENTGRAPH-002/003 (user directive: build the consumer first).
Per the UxTS directive: maps the work to the frameworks. UATS applies to
the federation HTTP API -> add eventgraph_reinforcement_neighborhood.uats
.json (backfilling the -001 gap; the endpoint shipped with no UATS),
which replaces an ad-hoc Go integration test as the Tier 2 contract test.
UVTS/UBENCH N/A. UOTS panel-spec gap noted as a follow-up (out of scope).
CLI rendering -> Tier 1 Go units.
4 epics; CLI (--seed/--query/--hops/--since/--limit/--json) renders
summary + events table or JSON; server-driven defaults (no re-hardcoding);
read-only. ~1-1.5 dev-days.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-cli-001): mdemg eventgraph reinforcement-neighborhood (Epic 1)
First consumer of the EVENTGRAPH-001 federation API — POSTs to
/v1/eventgraph/reinforcement-neighborhood, renders a summary + events table
(or --json). Supports --seed, --query (resolves seed via /v1/memory/retrieve
top-1), --hops, --since, --limit. Unset flags are omitted from the request
so the server applies its config defaults (no re-hardcoding of hops/since/limit
in the CLI). Registered under the "advanced" command group.
Tier 1 (httptest, -race clean): request-mapping omit-when-unset + conversion,
--query seed resolution, no-results + invalid --since + surfaced-503 errors,
render (empty + table), helpers.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(eventgraph): neighbor_node_ids serializes as [] not null for empty neighborhood
Caught in EVENTGRAPH-CLI-001 live contract testing (standard code tests
missed it; the live UATS happy-path against the running server did not):
walkNeighborhood returns a nil slice when the seed has no neighborhood
(e.g. an unknown seed), which JSON-marshals to `null`, while Events is
defensively initialized to []. Both are array fields and must serialize
consistently — null breaks any consumer asserting an array type (incl. the
new UATS contract's `type_is array` on $.neighbor_node_ids).
EventsInGraphNeighborhood now coalesces the nil slice to []string{}.
Tier 1 TestFederationResult_EmptyArraysNotNull pins the JSON contract.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(eventgraph-cli-001): UATS contract spec for federation API (Epic 2)
Backfills the UATS gap EVENTGRAPH-001 left (no contract test for
/v1/eventgraph/reinforcement-neighborhood). 6 cases, validated 6/6 live
against the running server:
- happy 200: asserts the response contract shape (events/neighbor_node_ids
arrays, graph_hops/tsdb_rows_scanned numbers, truncated boolean) — robust
to data, works even with an unknown seed (empty neighborhood is valid 200)
- missing_space_id / missing_seed_node_id → 400 (empty-string override, since
the runner deep-merges variant body over base — key omission can't unset)
- negative_hops → 400, hops_over_ceiling (999 > 2×default) → 400
- method_not_allowed (GET) → 405
sha256 integrity hash added + verified.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-cli-001): live verification + feature doc + CHANGELOG + close (Epic 3)
Tier 3 live e2e verified the real binary against the real stack: --query
surfaced 20 reinforcement events in a 5-node neighborhood (demonstrating the
Hebbian-write → federation-read loop closing in one command); --seed/--json/
--limit/unknown-seed/no-arg paths all verified live. Feature doc gains the CLI
consumer section; CHANGELOG Added + Fixed entries; CLAUDE.md architecture note;
verification.md + post.md (UxTS mapping: UATS done, UOTS follow-up carried over).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(eventgraph-cli-001): tag UATS spec 'tsdb' so CI skips it without TSDB
CI Test failed: the UATS contract step boots a minimal server without TSDB,
so the eventgraph service is nil and every POST returns 503 "service not
initialized" instead of the expected 200/400 (only GET→405 passed, since the
method check precedes the service check). Same class as PR #404. The
federation endpoint genuinely requires TSDB (it queries reinforcement_events;
the service is nil without TSDB at boot), and CI's UATS step already excludes
`tsdb`-tagged specs (ci.yml --exclude-tag ...,tsdb). Added "tsdb" to api.tags
(matching metrics_snapshot/readyz_tsdb); re-hashed. Verified locally: the spec
now reports Status: skip under the exact CI exclude filter, and still 6/6 live
against the full stack via explicit --spec.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): sprint plan — guidance-outcome federation (Epic 0)
Federate the guidance-outcome event stream (Pattern Y1, second event class):
walk a constraint's Neo4j neighborhood, surface time-windowed constraint_outcomes
(followed/ignored/contradicted) for the constraint + its graph-related constraints.
Data-decided architecture: reuse the existing constraint_outcomes table (no new
hypertable/writer/enqueue site — RRF-SCALE-001 already populates it, 1176 live
rows); join graph↔events on constraint_code (TSDB constraint_id UUID ≠ Neo4j
node_id CUID — code is the only viable key). One additive migration (V0023:
constraint_code index, schema 22→23). 8 epics, 3 testing tiers, live Tier 3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): V0023 constraint_code index on constraint_outcomes (Epic 1)
Adds idx_constraint_outcomes_code (space_id, constraint_code, time DESC) — the
guidance-outcome federation joins graph↔events on constraint_code (TSDB
constraint_id is a UUID that doesn't match the Neo4j node_id CUID; code is the
only viable key), and migration 011 indexed only space/constraint_id/outcome.
Partial index (constraint_code NOT NULL AND <> '') skips uncoded outcomes.
Bumps TSDB_REQUIRED_SCHEMA_VERSION default 22→23 (config.go) to match the
migration count — CI schema-version validator gates on this. Additive, no data
change, idempotent.
Live-verified: migration applies (schema 22→23), idx present, re-apply is a
no-op, config/tsdb tests green, CI schema check 23=23.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): GuidanceOutcomesInNeighborhood federation method (Epic 2)
Second Pattern Y1 federation: walk a constraint's Neo4j neighborhood, collect
each neighbor's constraint_code, and join constraint_outcomes on those codes
(backed by the V0023 index). walkNeighborhoodWithCodes returns the neighborhood
node IDs + a code→node map; queryGuidanceOutcomes pulls coded outcomes in the
window; Go-side join resolves each outcome's code → its neighborhood constraint
node. Non-nil slices from the start (EVENTGRAPH-CLI-001 lesson). Reuses the
existing constraint_outcomes sink — no new table/writer.
Tier 1 (-race): validation guards, empty-arrays-not-null, sortedKeys
determinism, join resolution. Tier 2 integration (live Neo4j+TSDB): full
round-trip — hops=1 (seed+related codes, off-neighborhood excluded), hops=0
(seed code only), unknown-seed (empty non-nil). PASS.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): guidance-outcome federation handler + route (Epic 3)
POST /v1/eventgraph/guidance-outcome-neighborhood — walk a constraint's
neighborhood, surface constraint_outcomes whose code is in the neighborhood.
Same gating/auth/default convention as the reinforcement endpoint.
Single-source refactor (per the dynamic-variables directive): extracted the
shared gate (method/enabled/service → eventgraphGate) and default-resolution
(hops/since/limit + ceiling → resolveFederationDefaults) into helpers used by
BOTH handlers, so the federation rules live in exactly one place. The
reinforcement handler now calls them too — verified no regression (reinforcement
UATS still 6/6 live, unit tests green).
Live-verified: seeding from the real 'no-direct-main-commits' constraint node
surfaced real 'followed' outcomes with constraint_node_id resolved to the seed
and in_neighborhood=true.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): mdemg eventgraph guidance-outcome-neighborhood CLI (Epic 4)
Sibling subcommand consuming POST /v1/eventgraph/guidance-outcome-neighborhood.
Walks a constraint's neighborhood and renders guidance outcomes (followed/
ignored split + table: code · outcome · sim · g_type · guidance_id · recorded)
or --json. Seed via --seed/--query (--constraint-code seeding deferred — needs
server-side code→node resolution; --query covers discovery). Unset hops/since/
limit omitted so the server applies config defaults (single source of truth).
Tier 1 (-race): request-mapping omit-when-unset + conversion, --query seed
resolution, surfaced-503 error, render (empty + followed/ignored table),
truncStr. Help renders.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(eventgraph-002): UATS contract spec for guidance-outcome federation (Epic 5)
6 cases, validated 6/6 live: happy-200 response shape (outcomes/
neighbor_node_ids/neighbor_constraint_codes arrays, graph_hops/tsdb_rows_scanned
numbers, truncated boolean), missing space_id/seed → 400 (empty-string override
under deep-merge), negative_hops → 400, hops_over_ceiling → 400, GET → 405.
Tagged 'tsdb' so CI skips it without TSDB (the EVENTGRAPH-CLI-001 lesson).
sha256 hashed + verified.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): Tier 3 live verification (Epic 6)
Real binary against the real stack. Key assertion: CLI --json output matches
direct constraint_outcomes SQL exactly (11 outcomes = 11, all followed) for the
no-direct-main-commits constraint. --seed/--query/--limit/--json/unknown-seed/
no-arg all verified live. The --query "0 outcomes" result was traced to SQL
ground truth — the 5 neighborhood codes genuinely have no feedback, so it's
correct (federation distinguishes "code in neighborhood" from "code has
outcomes"), not a join bug. Reinforcement endpoint un-regressed by the shared-
helper refactor (UATS 6/6).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): feature doc + CHANGELOG + CLAUDE.md + close (Epic 7)
Feature doc gains a Guidance-Outcome Federation section (why reuse
constraint_outcomes, why join on constraint_code, CLI usage) + forward-look
update. CHANGELOG Added (endpoint + CLI) + Changed (TSDB schema 22→23).
CLAUDE.md architecture note extended. post.md closes the sprint with UxTS
mapping + follow-ups (--constraint-code seeding, EVENTGRAPH-003, UOTS).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(metrics,backup): resolve docker binary robustly under minimal launchd PATH
The native server (launchd) inherits PATH=/usr/bin:/bin:/usr/sbin:/sbin, which
excludes the Docker Desktop symlink (/usr/local/bin/docker). So every
server-runtime `docker` shellout failed with "executable file not found in
$PATH": (1) Neo4j container CPU/mem stats (server.go) logged an ERROR every 60s
and left the neo4j_container_* gauges empty — so the neo4j_high_cpu/_memory
alert rules had no data; (2) the TSDB backup scheduler's `docker compose
pg_dump` (backup.go) failed with only a slog.Warn. The DATA PLANE was never
affected — Neo4j (Bolt) + TSDB (pgx) connect over mapped TCP ports, not the
docker CLI.
Fix (durable, configurable, single-source): new internal/dockerbin resolver —
MDEMG_DOCKER_BIN env override → exec.LookPath → well-known install locations
(/usr/local/bin, /opt/homebrew/bin, /usr/bin) → graceful unavailable. Wired
into server.go (stats) + backup.go (both compose calls). The perpetual 60s
ERROR is downgraded to a one-shot WARN when docker is genuinely absent (it's
optional telemetry). Added a sane PATH to the launchd server plist template as
defense-in-depth.
Live-verified: after restart, mdemg_neo4j_container_cpu_percent=0.59 /
mem_percent=29.13 now land in metric_samples (were absent); no more docker-stats
ERROR; `docker stats` + backup resolve docker under a simulated minimal PATH.
Note: `mdemg data export-auto` (training-export job) was NOT a victim — it
exports via network SQL, not docker (corrected from an earlier assumption).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(nosilent-001): sprint plan — fail-loud scheduled jobs (Epic 0)
Triggered by a live-discovered silent failure: the TSDB backup scheduler was
failing every 24h run (docker-under-launchd-PATH) with only a buried slog.Warn.
Docker cause fixed (4cc7608); this sprint fixes the class — scheduled jobs that
fail with no record + no alert. V0024 scheduled_job_events hypertable + writer,
jobhealth.Report (record + alert on failure), wire the 3 jobs (backup,
maintenance, export-auto), 2 evaluator rules (backup staleness + recent
failure) so the server catches "job failed OR never ran". Config-driven, 3
testing tiers, live Tier-3 induced-failure.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): V0024 scheduled_job_events + writer (Epic 1)
Hypertable (job_name, success, latency_ms, error_message, metadata jsonb,
recorded_at) + RecordJobEvent synchronous single-row writer (mirrors V0021
model_install pattern). Indexes: per-job freshness, partial failed, per-space.
One row per scheduled-job run so the alert evaluator can detect "job failed"
AND "job never ran" (staleness). Schema 23->24; TSDB_REQUIRED_SCHEMA_VERSION
bumped to match migration count (CI check 24=24).
Tier 1 (-race): field mapping, optional-nulls, error truncation, nil-pool
no-op, insert-error propagation. Tier 2 (live TSDB): round-trip + the staleness
(recent successes) + failure (recent failures) query shapes the rules will use.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): record + alert on scheduled-job outcomes (Epic 2)
New internal/jobhealth.Report — the single policy point: record a
scheduled_job_events row and fire a high-severity "scheduled-job" alert on
failure (both pool + dispatcher nil-safe). Wired into all three jobs:
- TSDB backup scheduler (internal/tsdb/backup.go): decoupled JobResultFunc hook
(mutex-guarded, -race clean) so internal/tsdb stays free of internal/alert;
server.go::SetTSDBClient sets it with the pool + s.alertDispatcher. A failed
or never-run backup now records + alerts instead of a silent slog.Warn.
- export-auto + maintenance (CLI): deferred reportScheduledJob on the named
return error — opens a short-lived pool + a file-backed dispatcher (same
~/.mdemg/alerts/current.json the hooks surface) so a separate-process CLI job
still alerts the operator.
Tier 1 (-race): jobhealth fires alert only on failure (real file-backend
dispatcher), nil-safe. Live smoke: export-auto recorded success=t latency=3050ms.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): scheduled-job staleness + failure alert rules (Epic 3)
Two server-native evaluator rules over V0024 scheduled_job_events:
- scheduled_job_recent_failure (always on): any job failure in the last
JOB_FAILURE_LOOKBACK_MIN (default 60) → high alert.
- backup_no_recent_success (gated on TSDB_BACKUP_ENABLED): zero successful
tsdb-backups within the staleness window → high alert. THIS is the "job
never ran" guarantee — it fires from the server observing ABSENT success, so
a backup that silently died or never started is caught, not just one that
errored.
Window derives from the real backup interval × 2 (JOB_BACKUP_STALENESS_HOURS
override; no hardcoded literal). JOB_HEALTH_ALERT_ENABLED master gate (default
true). Appended after DefaultRules() in serve.go.
Tier 1: failure rule always present (gt 0), staleness gated on backups-enabled
(lt 0.5), windows reflect config, non-positive fallback. Build/lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(nosilent-001): distinct services for job rules so neither masks the other
Caught in live Tier-3 testing: both evaluator rules used Service="scheduled-jobs",
and the dispatcher cooldown key is (Service, Severity) — so the failure alert's
cooldown SUPPRESSED the staleness alert (only the failure fired). One alarm
masking another is the exact silent-failure class this sprint kills. Fixed:
scheduled-job-failure / scheduled-job-staleness distinct services. Re-verified
live — both fire independently and land as distinct alert-file entries. Tier-1
assertion pins that the two services differ. Includes Tier-3 verification.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(nosilent-001): feature doc + CHANGELOG + CLAUDE.md + close (Epic 4)
Feature doc docs/features/scheduled-job-health.md (why / two mechanisms /
operator view / config). CHANGELOG Added (NOSILENT-001) + Changed (schema 23→24)
+ Fixed (docker-under-launchd-PATH). CLAUDE.md Service Alert System extended
with the scheduled-job-health note + the distinct-Service-per-rule cooldown
caveat. post.md closes the sprint (UxTS mapping + follow-ups).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(nosilent-001): sync embedded launchd server plist with source (CI)
CI "Verify embedded launchd templates match source" diffs packaging/launchd/*
against internal/cli/launchd_templates/* (the embed.FS copy mdemg service
install uses). The PATH addition landed only in the source copy; sync the
embedded copy so they match byte-for-byte.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(roadmap): add jiminy-governance skill build-out (Workstream C, Action 7)
Records the jiminy-governance Claude Code skill on the active forward roadmap
(SPRINT_ROADMAP_POST_FT_LORA.md, cross-cutting governance) + brings the source
spec into the repo (docs/development/jiminy-governance-skill/SKILL.md, out of
~/Downloads). The skill makes Jiminy the deterministic source of context +
governance over J17, enforced by the PreToolUse hook — a routing/handshake shim,
not a rulebook. Build-out scope notes the wire-up placeholders that must be
resolved against the real instance (Jiminy MCP/endpoint, PreToolUse hook, J17
ack/RetireCode/GUIDANCE_OUTCOME calls). Aligns with the now-live guidance loop
(RRF-SCALE-001 / JIMINY-OUTCOME-001 / GUIDANCE-SYNTH-001).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-governance): resolve skill wire-up against the real instance
Step 1 of the jiminy-governance build-out (roadmap Workstream C Action 7).
Resolved all five placeholders from the running MDEMG instance, verified live:
- Jiminy query: MCP `mdemg mcp` (stdio) → jiminy_guide/validate_changes; HTTP
/v1/jiminy/guide (returns guidance_id) + /bootstrap glossary (j17v1) + /latest.
- PreToolUse: pre-bash-check.py (Bash, fail-closed) + pre-write-check.py
(Write/Edit → /v1/jiminy/classify, /strict-only, fail-open).
- SessionID: claude-core convention.
- Comprehension ack: /v1/jiminy/protocol/feedback (verified ingested 1).
- GUIDANCE_OUTCOME: /v1/jiminy/feedback {guidance_id,…}.
- RetireCode: internal-only by d…
* docs(model-dist-002): flip adapter section to shipped + sprint close
Epic 7 (Documentation Update — never cut).
- docs/features/local-model-distribution.md: adapter section flipped from
"deferred to MODEL-DIST-002" to "shipped 2026-05-25"; status header
updated; Configurability Contract table adds --adapter flag row.
- CHANGELOG.md: Unreleased gains "Sprint MODEL-DIST-002 — Adapter-only
distribution path shipped" entry with full pipeline + verification +
SHA + Ollama manifest digest.
- CLAUDE.md Model Distribution architecture note: replaces "adapter-only
deferred to MODEL-DIST-002+" with the operator-facing recipe and the
pinned-toolchain pointer.
- docs/development/model-dist-002/post.md: sprint close with epic-by-epic
outcomes, acceptance criteria check-off, surprise log, and forward-
looking notes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(eventgraph-001): sprint plan (Pattern Y1 TSDB-federation)
Sprint EVENTGRAPH-001 — Reinforcement-Event TSDB Hypertable + Graph
Federation. First implementation of Pattern Y1 from the TypeDB-inspired
topology discussion: federate events into TSDB rather than reify them in
the Neo4j graph, preserve graph traversal via a Go orchestration layer.
12-section v1.0 format; 8 sequential epics; ~1.5-2 dev-days; $0 LLM;
low-medium risk (touches the Hebbian hot write path so the new writer
must be fully non-blocking + the Cypher RETURN-shape change must be
backwards-compatible at the Go call site).
Targets ApplyCoactivation only for v1. Other Hebbian entry points
(ApplySymbolCoactivation, CoactivateSession, ApplyNegativeFeedback)
deferred to EVENTGRAPH-003 once the pattern proves out under
production traffic. Pattern Y2 (link-node promotion in Neo4j)
explicitly deferred until a query proves federation-in-Go insufficient.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(tsdb): V0022 reinforcement_events hypertable (EVENTGRAPH-001 Epic 1)
One row per Hebbian co-activation pair update. Captures prev/new weight
(plus signed delta), evidence_count_after, eta_effective, surprise_factor,
activation_product, path_sim, role/obs_type of both endpoints, session_id,
direction (forward/reverse/bidirectional), and a created_new_edge flag
that distinguishes "new connection formed" from "existing connection
strengthened" at analysis time. trigger_path column will distinguish
ApplyCoactivation from EVENTGRAPH-003's other Hebbian entry points.
7-day chunks (same as V0017-V0021). 4 indexes: per-space time-series,
src+time, dst+time, partial index on (space_id, session_id, time) where
session_id is set. Federation API (Epic 5) needs src + dst lookups for
the graph-neighborhood join.
Buffered + flushed via CopyFrom on TSDB_FLUSH_INTERVAL_SEC cadence
(default 30s). Pattern matches V0019 (sparse_gate_metrics) buffered
writer, NOT V0021 (model_install_events) sync writer — Hebbian writes
are per-retrieve, far higher volume than CLI-driven model install
events.
Config: TSDB_REQUIRED_SCHEMA_VERSION default bumped 21 -> 22.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(tsdb): buffered reinforcement_events writer (EVENTGRAPH-001 Epic 2)
internal/tsdb/reinforcement_writer.go — buffered CopyFrom writer mirroring
the V0019 SparseGateMetricsWriter pattern. 30s auto-flush ticker, Close()
drains buffer + flushes final batch, idempotent across multiple Close
calls. FIFO eviction on buffer-full matches the LLMInteractionWriter
precedent; eviction counted in droppedRows for Epic 6 Prometheus
surfacing.
ReinforcementEventRow serializes optional float / string fields via
nullableFloat / nullableString helpers — zero-valued inputs land as DB
NULL rather than 0 / '', so analytic queries can distinguish "no data"
from "actually zero." Required fields (prev/new/delta weight,
evidence_count_after, created_new_edge, trigger_path) are never
nullable.
Tier 1 unit tests (9 green):
- Record + Flush writes all rows with correct table + column shape.
- Empty buffer Flush is a no-op (no CopyFrom call).
- Buffer-full evicts oldest, increments droppedRows counter.
- Unlimited buffer (maxBufferSize=0) never drops.
- Nullable serialization: zero-valued optionals → DB NULL.
- Flush error increments FailureCount; SuccessCount/TotalRows unchanged.
- Close drains buffer (final flush triggered).
- Close is idempotent (Close × 2 does not double-flush).
- Auto-flush ticker fires within deadline.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* refactor(learning): expose per-pair telemetry from Hebbian Cypher (EVENTGRAPH-001 Epic 3)
ApplyCoactivation Cypher RETURN clause extended from "count(*) AS updated"
to 17 per-pair columns: src/dst node IDs, prev/new/delta weight,
evidence_count_after, eta_effective (cfg.LearningEta × etaMult),
surprise_factor, activation_product, path_sim, role_a/b, obs_type_a/b,
session_id, direction (forward/reverse/bidirectional), created_new_edge.
created_new_edge derived from (r.evidence_count = 1) — the ON CREATE
branch sets evidence_count to 1; ON MATCH increments. Reliable proxy
for "new connection formed" vs "existing connection strengthened" at
analysis time.
Plan-deviation disclosure (per feedback_plan_options_pattern.md): the
plan called for 2 rows per pair in asymmetric mode (forward + reverse).
The Cypher mirrors rr.weight = r.weight at all times — forward and
reverse edges carry identical weights. Emitting 2 rows would double-
count without adding signal. Final choice: 1 row per logical pair
regardless of mode, with the direction column carrying the
forward/reverse/bidirectional distinction. Revisit if EVENTGRAPH-003
introduces a Hebbian path where forward/reverse weights diverge.
New helper internal/learning/reinforcement_parser.go translates a
neo4j.Record (or any (key) → (any, bool) getter) into a
tsdb.ReinforcementEventRow. Lives in its own file so service.go
doesn't grow. Defensive against missing keys (zero values), nil values
(zero/empty), wrong types (fallback to zero) — no panics.
Tier 1 unit tests (6 green) cover:
- Symmetric bidirectional + ON CREATE branch
- Asymmetric forward + ON MATCH branch (evidence > 1)
- Missing optional fields → zero values (nullable* writer helpers
serialize as DB NULL)
- Neo4j int64 → Go int coercion
- nil values → zero/empty
- Wrong-typed values → graceful fallback
Reinforcement rows are captured locally in ApplyCoactivation but not
yet forwarded to TSDB — Epic 4 wires the writer.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(learning): record reinforcement events to TSDB (EVENTGRAPH-001 Epic 4)
learning.Service grows a reinforcementWriter field + SetReinforcementWriter
setter (mirrors the SetStabilityReinforcer back-compat pattern). After
ExecuteWrite returns from ApplyCoactivation, each captured per-pair row
gets the spaceID stamped on it and is enqueued via writer.Record. The
writer is non-blocking; the Hebbian hot path never waits on TSDB.
Configurability Contract — 7 new env vars (no-hardcoding rule):
- EVENTGRAPH_ENABLED (bool, default true)
- EVENTGRAPH_WRITER_FLUSH_INTERVAL_SEC (int, default 30, floor 5)
- EVENTGRAPH_WRITER_BUFFER_SIZE (int, default 1000, 0 = unlimited)
- EVENTGRAPH_MAX_PAIRS_PER_EVENT_BATCH (int, default 200)
- EVENTGRAPH_MAX_EVENTS_PER_QUERY (int, default 500, Epic 5 ceiling)
- EVENTGRAPH_FEDERATION_DEFAULT_HOPS (int, default 2)
- EVENTGRAPH_FEDERATION_DEFAULT_LOOKBACK_HOURS (int, default 24)
api/server.go wires the writer's lifecycle:
- Constructed after TSDB client is ready, gated by cfg.EventGraphEnabled
so EVENTGRAPH_ENABLED=false cleanly skips construction; learner's
reinforcementWriter stays nil and the Hebbian path short-circuits.
- Closed alongside the other TSDB writers in graceful-shutdown — buffer
drains before the process exits.
Tier 2 integration tests (against real TSDB, build tag integration):
- TestEventGraph_Writer_RoundTrip: 3 rows recorded → flush-window
elapses → SELECT count(*) returns 3.
- TestEventGraph_Writer_DrainOnClose: 5 rows recorded with 1-hour flush
interval → Close() drains → SELECT returns 5 (verifies the server
shutdown invariant).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eventgraph): federation query helper + API endpoint (EVENTGRAPH-001 Epic 5)
internal/eventgraph/query.go — Pattern Y1 federation helper.
EventsInGraphNeighborhood orchestrates a two-step query:
1. Cypher graph walk from a seed node — variable-length path over
CO_ACTIVATED_WITH | GENERALIZES at depth 0..Hops. Returns the
N-hop neighborhood (DISTINCT node_ids, includes the seed).
2. TSDB query against reinforcement_events for events where src OR
dst is in the neighborhood, within the lookback window, ordered
newest-first, capped at the configured limit.
3. Go-side join — annotates events with SrcInNeighborhood /
DstInNeighborhood so the consumer can distinguish "both endpoints
in the subgraph" from "one endpoint outside the seed's N-hop
reach but the event still touches our subgraph."
Empty neighborhood (no seed match, hops=0) short-circuits before the
TSDB call. Sub-1-second Since values clamp to 1s. Hops < 0 is rejected
upfront. The handler enforces an additional ceiling of 2 ×
EVENTGRAPH_FEDERATION_DEFAULT_HOPS for runaway-walk protection.
internal/api/eventgraph_handler.go — POST /v1/eventgraph/reinforcement-
neighborhood. Same auth convention as /v1/admin/breakers. 503 when
EVENTGRAPH_ENABLED=false or when eventgraphService is nil (TSDB-down at
boot). 400 on missing space_id / seed_node_id / negative hops / hops >
ceiling. Defaults applied from config when fields omitted from request.
Plan-decision disclosure (per feedback_plan_options_pattern.md): plan
proposed Option A (single endpoint with event_type query param) vs
Option B (endpoint per event class). Final choice: A. v1 has one event
class (reinforcement); the endpoint URL is explicit about that.
EVENTGRAPH-002 can either add a query param or split the URL when a
second event class arrives — no breaking change either way.
Tests:
- Tier 1 (internal/eventgraph/query_test.go, 7 green): request
validation rejects empty space_id, empty seed, negative hops; interval
formatting roundtrips; join annotation handles both-inside,
one-outside, and empty-neighborhood cases.
- Tier 1 (internal/api/eventgraph_handler_test.go, 4 green + 2 skipped):
method-not-allowed, feature-disabled 503, nil-service 503, invalid-
JSON short-circuit. Two validation paths skipped — they require a
non-nil eventgraphService which can't be constructed without a real
driver; Tier 2 exercises them.
- Tier 2 (tests/integration/eventgraph_federation_test.go, 1 green):
builds seed--mid--leaf graph + off-node, emits 3 reinforcement
events touching all four nodes, calls federation at hops=0 and
hops=1, asserts neighborhood + in-neighborhood flags. The hops=0
test confirms that mid↔leaf (touching neither seed nor any 0-hop
neighbor) is correctly excluded.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(observability): Grafana panel + Prometheus counters for reinforcement events (EVENTGRAPH-001 Epic 6)
Three new Prometheus counters mirror the V0022 writer's internal atomic
counters:
- mdemg_eventgraph_writer_rows_enqueued_total — rows successfully CopyFrom'd
- mdemg_eventgraph_writer_rows_dropped_total — rows FIFO-evicted (buffer full)
- mdemg_eventgraph_writer_flush_failure_total — flush errors
Wiring: the writer accepts a narrow PrometheusCounter interface
(Add(int64)) so internal/tsdb does not import internal/metrics (which
would cycle). api/server.go calls SetPrometheusCounters after the
writer is constructed, passing the three counters from the global
StandardMetrics struct. Nil-safe.
Dashboard: mdemg-graph-topology.json gains a new collapsed row
"Reinforcement Events (EVENTGRAPH-001)" with a single time-series
panel "Reinforcement Event Rate (events/min)" showing all three rates
(enqueued / dropped / flush failures) over the last 24h. Dropped is
colored orange, flush failures red, enqueued the default palette. Tied
to the prometheus datasource.
The existing GRAFANA-AUDIT-001 harness (scripts/grafana_panel_audit.py)
only evaluates SQL-target panels — the new panel uses Prometheus
queries, so it lands on the SKIP pile, same as the other 8 Cypher /
Prometheus panels on this dashboard. Audit JSON refreshed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(eventgraph-001): restore full GRAFANA-AUDIT-001 audit_results.json
Epic 6's targeted audit run (scripts/grafana_panel_audit.py --dashboard
mdemg-graph-topology.json) overwrote the full multi-dashboard audit
results from GRAFANA-AUDIT-001 with the single-dashboard subset (9
SKIPs only). Restoring the full snapshot from commit 0a1e8e1 — that
audit covered all 8 dashboards and is the canonical baseline the
GRAFANA-AUDIT-001 post.md references. EVENTGRAPH-001 did not need to
regenerate it; the new panel uses Prometheus queries, which the audit
harness SKIPs regardless of dashboard.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(retrieval): set Activation on RRF RetrieveResult (EVENTGRAPH-001 fix-commit)
ScoreAndRankRRF's ConsensusResult → RetrieveResult conversion was
silently dropping the Activation field. The legacy ScoreAndRank path at
scoring.go:883 sets Activation: a (where a := act[c.NodeID] is the
spreading-activation map value). The RRF path constructed
models.RetrieveResult{...} with no Activation key, leaving the field
zero-valued.
Net effect: since Phase 13.1 default-on (2026-05-03),
learning.Service.ApplyCoactivation has filtered out every L0 candidate
on the retrieve hot path. The filter is r.Activation >=
LearningMinActivation (default 0.20). With Activation=0, no pair makes
it to the Hebbian Cypher; the function returns nil without writing.
Hebbian learning has been silently no-op on the production retrieve
goroutine for ~24 days. CO_ACTIVATED_WITH edges still exist in the
graph — sidecar paths (CoactivateSession, ApplySymbolCoactivation,
consolidation walks) and pre-Phase-13.1 retrieves wrote them — but the
retrieve-time goroutine has been a silent no-op.
Discovered during EVENTGRAPH-001 Epic 7 live e2e. Three retrieves
produced 0 rows in reinforcement_events. Investigation traced the gap
to the missing Activation field.
Fix: one-line addition in scoring_rrf.go — Activation: act[c.NodeID].
Brings the RRF path to parity with the legacy scorer.
Post-fix verification: rebuilt, restarted server, re-issued 3 retrieves
→ 10 reinforcement events landed in TSDB. Federation API at hops=1
correctly returned all 10 with src_in_neighborhood=true,
dst_in_neighborhood=true. Documented in
docs/development/eventgraph-001/verification.md.
Per CLAUDE.md "Testing — Live System Testing Is Required":
"surprise bugs caught during live smoke get their own follow-up
fix-commit — do not silently roll them into the sprint commit." This
is the precedent-aligned separate commit.
Forward-only: existing graph state is preserved; new retrieves now
correctly emit Hebbian updates. EVENTGRAPH-002 may revisit whether to
backfill the missing 24-day window.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(eventgraph-001): Tier 3 live e2e verification transcript (Epic 7)
Real /v1/memory/retrieve × 3 against mdemg-dev → 10 reinforcement events
landed in TSDB within the flush window. Federation API at hops=1 from a
seed node returned 5-node neighborhood + 10 in-neighborhood events.
Documents the surprise-bug discovery + fix that preceded this transcript
(see fix-commit for scoring_rrf.go::ScoreAndRankRRF Activation
propagation).
Acceptance criteria from sprint plan §"Acceptance Criteria" all PASS.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(eventgraph-001): feature doc + CHANGELOG + CLAUDE.md + sprint close (Epic 8)
Final epic — Documentation Update (never cut, per feedback_per_feature_docs_required.md
and the standardized v1.0 sprint plan format).
New: docs/features/event-graph-federation.md (~240 lines, Why / Choices /
How it works / How to use / Forward-looking). Documents:
- Pattern Y1 vs Y2 trade-off (why federation-in-Go now, link-node
reification deferred until a query forces it)
- Why V0019 buffered-CopyFrom over V0021 sync-INSERT (per-retrieve volume)
- Why ApplyCoactivation first (other 3 Hebbian entry points deferred to
EVENTGRAPH-003)
- Why forward-only (no source to backfill from)
- Federation pipeline (Cypher walk → TSDB query → Go-side join with
src/dst_in_neighborhood annotation)
- TSDB schema, API request/response shape, 7 env vars + defaults
- Observability (3 Prometheus counters + Grafana panel)
- Forward-looking sprints
New: docs/development/eventgraph-001/post.md — epic-by-epic outcomes,
acceptance criteria check-off, surprise log (RRF Activation drop +
audit-JSON overwrite + orphan-process port collision), plan deviations
disclosed (1-row-per-pair regardless of asymmetric mode; single-
endpoint over endpoint-per-class), forward-looking.
CHANGELOG.md Unreleased gains the EVENTGRAPH-001 entry — 11 bullet
points covering V0022 migration, buffered writer, Cypher RETURN-shape
change, Configurability Contract, federation helper + API, Prometheus
+ Grafana, Tier 2 + Tier 3 verification, the surprise-bug RRF
Activation fix-commit, and the audit-JSON restore.
CLAUDE.md Architecture Notes gains a new "Event Graph Federation" entry
above the Model Distribution section. Documents the pattern, surface,
deferrals, and the load-bearing fix-commit f307f55 that surfaced 24
days of silent Hebbian no-op on the retrieve hot path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(eventgraph-001): Grafana panel uses TSDB instead of unconfigured Prometheus datasource
The Epic 6 panel used datasource {type: prometheus, uid: prometheus} but
this Grafana instance has no Prometheus datasource configured — mdemg
exposes counters as JSON via /v1/metrics/snapshot, not a /metrics scrape
endpoint. Configured datasources: mdemg-nodegraph, neo4j, timescaledb
only. The panel rendered "No data" in the live Grafana.
Rewritten panel queries the reinforcement_events hypertable directly via
the timescaledb postgres datasource. Two targets:
1. count(*) over 1-minute time_buckets → overall events/min
2. count(*) FILTER (WHERE created_new_edge) vs WHERE NOT created_new_edge
→ split between new connections formed and existing connections
strengthened (the operational dimension the analytic queries
actually need)
Both targets templated on $space_id (existing dashboard variable). The
Prometheus counters (mdemg_eventgraph_writer_rows_{enqueued,dropped,
flush_failure}_total) remain wired and incrementing — they surface via
/v1/metrics/snapshot for ops scripts. The Grafana panel now actually
displays data instead of relying on a scrape path that doesn't exist
in this deployment.
Discovered during post-merge live verification (2026-05-29). Verified
fix: reloaded dashboard via Grafana API → /api/ds/query against same
SQL returns 1-minute buckets matching TSDB direct count. Audit harness
now reports 2 PASS for the new panel (previously SKIP — no SQL target).
verification.md updated with the post-merge transcript.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(rrf-scale-001): sprint plan — RRF score-scale consumer remediation
P0 fix. The Jiminy guidance->feedback->outcome loop has been dormant
~9 weeks: consulting/service.go gates constraint/suggestion extraction
on hardcoded legacy-scale score thresholds (r.Score < 0.55 et al.).
Phase 13.1 RRF (default-on May 3) dropped the score scale so strong
matches top out ~0.53 -> 0/10 results clear the gates -> empty guidance
-> dead loop. Third instance of the RRF-score-contract bug class (after
the EVENTGRAPH-001 Activation drop).
12-section format; 6 epics; config-driven percentile-gate fix +
sigmoid recalibration; live-verify the revived loop end-to-end.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(rrf-scale-001): Epic 1 audit findings — 12 sites cataloged
Full-repo sweep of post-RRF score/activation/confidence consumers +
live score-distribution sampling. Findings:
- HIGH (4): consulting constraint gates (1005/1081/1087) + confidence
sigmoid midpoint 1.5 (35-36) — the loop-killer cluster.
- MED (5): consulting conflict gates (931/944/957/981) + minConfidence
pre-filter (619, already config-driven).
- LOW (3): retrieval/jiminy.go Activation display gates (45/155/192) —
explanation text only, no guidance gating.
- NONE (2): jiminy trial score (0-10 scale), trust-score clamp.
Live distribution: RRF strong-match top scores cluster 0.49-0.58; the
0.55 gate sits mid-band, rejecting the most-relevant constraint half
the time. NormalizedConfidence is positional rank (spreads 100->0 even
on uniform-score sets) -> rules out plan Option A (percentile) as sole
gate. Remediation: config-driven RRF-calibrated absolute thresholds
(Option B), constraint floor default 0.45, sigmoid midpoint ->0.45.
Disclosed deviation per feedback_plan_options_pattern.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(consulting): RRF-calibrate score gates + confidence sigmoid (RRF-SCALE-001 Epic 2)
Revives the dormant Jiminy guidance loop. Replaces 7 hardcoded legacy-
scale score gates in consulting/service.go + the score->confidence
sigmoid (both copies) with config-driven, RRF-calibrated values.
Gates (all default 0.45, RRF strong-match band is 0.49-0.58):
- constraint extraction (was <0.55) -> CONSULTING_CONSTRAINT_SCORE_FLOOR
- keyword/name authority inner gate (0.55/0.6) -> CONSULTING_AUTHORITY_SCORE_FLOOR
- conflict/contradiction detection (0.6-0.7) -> CONSULTING_CONFLICT_SCORE_FLOOR
Key Epic-2 finding: keywordClassifyConstraint has an INNER authority
gate that binds tighter than the outer constraint gate. If authority
floor > constraint floor, the binding gate re-rejects the strong-match
band and the loop stays dormant -> all three default to 0.45. The RRF
band is too compressed to subdivide into tiers; knobs stay separate so
operators can raise any one independently.
Sigmoid (score->confidence), both consulting/service.go and
jiminy/retrieval_source.go (they MUST stay in sync per their own
comments): midpoint 1.5 -> 0.45, steepness 1.5 -> 8.0, config-driven via
RETRIEVAL_CONFIDENCE_SIGMOID_{MIDPOINT,STEEPNESS}. Legacy crushed a
strong 0.5 match to 0.18 confidence; recalibrated maps it to 0.60
(0.1->0.06, 0.58->0.74). normalizeRetrievalConfidence is now a Service
method reading cfg with zero-value fallback; mapRetrievalToGuidance
takes the sigmoid params from its caller's cfg.
5 new config knobs, all with RRF-calibrated defaults + zero-value
guards (no-hardcoding rule; the bug WAS a hardcoded value).
Tier 1 tests: updated 2 legacy-scale boundary tests to the new
thresholds + added RRFStrongMatchBand regression (0.50 must surface),
ConstraintFloor_ConfigDriven (override honored), and
NormalizeRetrievalConfidence_RRFCalibration (band mapping). Full
consulting + jiminy + config suites green; lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(rrf-scale-001): Epic 3 — remaining LOW findings reviewed + decided
retrieval/jiminy.go Activation display gates (45/155/192 + LearningEdge
siblings) traced live: they're in the explainability renderer, not the
guidance-surfacing path; always-additive at RRF scale (live activation
~0.723 >> thresholds), no misbehavior. Intentionally left unchanged with
rationale — config-ifying display verbosity is out of proportion to zero
functional impact. Every High/Med remediated (Epic 2), every Low decided.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(rrf-scale-001): Tier 2 integration + Tier 3 live verification (Epic 4)
Tier 3 live e2e (verification.md): the score-gate fix revives the
dormant guidance loop on the live stack —
- /v1/jiminy/guide guidance items 0 -> 10, source_counts.constraints
0 -> 2, patterns 0 -> 3 (acceptance #1 MET).
- Full loop warm->latest->feedback->outcome: TSDB constraint_outcomes
sink REVIVED — fresh rows dated 2026-06-03 (table was dead since
May 1). Constraint-effectiveness Grafana sink is live again.
Three adjacent issues surfaced during live smoke, documented as distinct
follow-ups (NOT score-scale, not bolted on):
- A: Neo4j GUIDANCE_OUTCOME edges still dormant — guidance SourceNodes
point at emergent_concept nodes; PersistGuidanceOutcome only writes
edges for constraint/correction/pattern/learning or role_type=
constraint targets. Node-type-targeting bug, independent of RRF.
Candidate sprint JIMINY-OUTCOME-001.
- B: LLM guidance synthesis timeout (now that synthesis runs).
- C: /v1/jiminy/latest unescaped control chars break jq/json parsers —
the hook uses jq, so may compound dormancy. Low-effort follow-up.
Tier 2 (rrf_scale_guidance_test.go, integration tag, 2 green):
- SuggestSurfacesGuidance: constraint-matching context surfaces 7
suggestions (was 0 before fix) against live mdemg-dev.
- SuggestRejectsNoise: gibberish does not flood constraints (no
over-correction).
Cold-start note: first guide call post-restart returned constraints:0
(LLM classifier cold-model timeout -> keyword fallback); after one
warm-up call, constraints surface. Model-warmth artifact, not a fix
defect.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(rrf-scale-001): CHANGELOG + CLAUDE.md score-scale contract + post.md (Epic 5)
Final epic. CHANGELOG Unreleased gains the RRF-SCALE-001 Fixed entry.
CLAUDE.md gains a 'score-scale contract' architecture note — the
structural defense against a 4th instance: downstream consumers MUST
NOT hardcode absolute thresholds against RetrieveResult.Score (the
scorer scale is not a stable contract); gate via config or a
scale-invariant signal, and re-audit on any scorer change. Notes that
NormalizedConfidence is positional (not a safe sole gate) and records
the three open follow-ups.
post.md: epic-by-epic, acceptance check-off (honest: #2 partial — TSDB
sink revived, Neo4j edge is distinct Follow-up A), scope note
separating the score-scale fix (done) from the 3 adjacent surfaced
issues (documented follow-ups), discipline notes (cold-start mask,
inner authority gate).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(rrf-scale-001): skip guidance integration tests on empty environment (CI fix)
CI failure on PR 404: TestRRFScale_SuggestSurfacesGuidance failed in
0.02s. Root cause: the test assumed the populated local mdemg-dev space
(111 constraint nodes), but CI boots a FRESH EMPTY Neo4j with stub
embeddings (and RETRIEVAL_COLUMN_VOTING_ENABLED=false / legacy scorer).
With no data, /v1/memory/suggest returns 0 candidates, so the
'total == 0' assertion fired.
Other integration tests self-seed data or skip when prerequisites are
absent; mine relied on ambient data — wrong for a reproducible CI run.
Fix: skip when debug.retrieved_count == 0 (no retrievable data → the
score-gate fix isn't exercisable; there's nothing for the gate to admit
or reject). The test stays meaningful against a populated stack (local:
9 suggestions from 15 retrieved → PASS) and skips cleanly in CI's
empty-DB environment. Verified both paths live: populated → PASS,
empty space → retrieved_count 0 → SKIP.
The gate fix itself is validated by Tier 1 unit tests + the live Tier 3
e2e (docs/development/rrf-scale-001/verification.md); this integration
test is a bonus live-stack assertion, not the primary proof.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-outcome-001): sprint plan — revive Neo4j GUIDANCE_OUTCOME sink
Follow-up A from RRF-SCALE-001: the Neo4j GUIDANCE_OUTCOME edge sink has
been dormant since Apr 12. Root cause: matchConstraintCode links guidance
items to constraint codes by keyword overlap (>=3 shared words), but
retrieval surfaces emergent_concept abstractions whose content does not
share 3+ literal words with raw constraint text -> no constraint_code ->
PersistGuidanceOutcome falls back to the concept SourceNode -> the
role_type=constraint filter rejects it -> no edge. Live-proven: all 17
recent outcome rows had constraint_code=(none).
Fix (Option 1): switch the matcher to embedding cosine similarity
(content already normalized to natural language ~0.70 cosine; Service
has an embedder; cosineSimilarity + embed->cosine pattern already exist
in-package via OutcomeClassifier). Existing PersistGuidanceOutcome +
findConstraintNodeID then create edges on the correct constraint nodes.
Keyword matching stays as fallback -- never regresses.
4 epics; ~1-1.5 dev-days; config-driven threshold; acceptance bar = a
fresh Neo4j GUIDANCE_OUTCOME edge on a real role_type=constraint node
dated today, reflected in GetConstraintEffectiveness.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(jiminy): embedding-similarity constraint-code matching (JIMINY-OUTCOME-001 Epic 1)
Revives the Neo4j GUIDANCE_OUTCOME edge sink (dormant since Apr 12).
Root cause (RRF-SCALE-001 Follow-up A): matchConstraintCode links
guidance items to constraint codes by keyword overlap (>=3 shared
words), but retrieval surfaces emergent_concept abstractions whose
content rarely shares 3+ literal words with raw constraint text -> no
code -> PersistGuidanceOutcome falls back to the concept SourceNode ->
the role_type=constraint filter rejects it -> no edge.
Fix: new matchConstraintCodeByEmbedding queries the constraint vector
index (db.index.vector.queryNodes, role_type=constraint, sim >=
threshold) and returns the closest constraint's code. Guide() tries
this first, falling back to the keyword matcher when the embedder is
unavailable, content is empty, or nothing clears the threshold — never
regresses. The existing PersistGuidanceOutcome + findConstraintNodeID
then create the edge on the correct constraint node.
Implementation refinement vs plan: uses Neo4j's vector index server-side
(mirrors the proven Evaluator.findMatchingConstraints pattern) rather
than loading all constraint embeddings into Go and computing cosine in a
loop — cleaner, no constraintCodeEntry.Embedding needed. Same Option-1
outcome.
Config: JIMINY_CONSTRAINT_CODE_SIM_THRESHOLD (default 0.55, zero-value
fallback) — provisional; tuned against the live similarity distribution
in Epic 2.
Tier 1 (4 tests): nil-driver/empty-embedding guards, threshold default
resolution, keyword-fallback non-regression. Full jiminy + config
suites green; lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(jiminy-outcome-001): Tier 2 integration + Tier 3 live verification (Epic 2)
Tier 3 live e2e (verification.md) — acceptance bar MET:
- /v1/jiminy/guide now yields guidance items carrying constraint_codes
(10 items, 6 coded; was 0). Matched code 'no-direct-main-commits' is
semantically exact for the 'commit to main' context.
- Full warm->latest->feedback loop: Neo4j GUIDANCE_OUTCOME 893 -> 899
(+6), latest today. All 6 new edges land on REAL role_type=constraint
nodes ('CONSTRAINT: NEVER commit directly to main') — not
emergent_concept. The sink dormant since Apr 12 is revived on the
correct nodes.
- /v1/constraints/effectiveness reflects it: 'NEVER commit directly to
main | surfaced: 30 followed: 28 rate: 0.93'.
- Both sinks now revived: TSDB (RRF-SCALE-001) + Neo4j (here). The
constraint-effectiveness loop is fully restored.
Threshold 0.55 validated live: correct matches, no false positives.
Tier 2 (jiminy_outcome_test.go, integration tag, skip-on-empty): PASSES
on a populated stack with an idle LLM (7/10 items coded). The guide path
is LLM-latency-dependent (per-node classifier ~31s/call, serialized; a
call fired while the LLM is busy fast-fails empty), so the test
warm-retries and SKIPS (never false-fails) when the LLM path can't
produce items. Bonus check; Tier 3 is the definitive proof. The LLM
serialization/synthesis-timeout is RRF-SCALE-001 Follow-up B, tracked
separately.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-outcome-001): CHANGELOG + CLAUDE.md + post.md (Epic 3)
Final epic. CHANGELOG Unreleased gains the JIMINY-OUTCOME-001 Fixed
entry. CLAUDE.md gains a guidance-outcome constraint-code-matching note
(embedding-first via vector index, keyword fallback; both outcome sinks
now live). post.md: epic-by-epic, acceptance check-off, the loop-revival
completion (TSDB from RRF-SCALE-001 + Neo4j here), discipline notes (LLM
serialization is the test-flakiness source), forward-looking (Follow-up
B now the most operationally-visible remaining issue).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(guidance-synth-001): sprint plan — fix guidance synthesis timeout (Follow-up B)
Synthesis fails on every production warm call (6/6 jiminy.synthesize
errored). Root cause: the hook's /warm path runs background Guide() with
a hardcoded 30s timeout (handlers_jiminy.go:302), inside which the
per-node constraint classifier runs SERIALLY (~1.5s x ~10 nodes ~= 15s),
leaving only ~15s for synthesis which needs 8-27s -> deadline exceeded.
JIMINY_TIMEOUT_MS=240s is configured but the 30s hardcode caps it.
Fix (both): (1) parallelize the per-node classifier with bounded
concurrency (CONSULTING_CLASSIFY_CONCURRENCY, default 4 matching
llama-server --parallel 4); (2) config-drive the warm timeout
(JIMINY_WARM_COMPUTE_TIMEOUT_MS, default 90s). Acceptance: synthesis
succeeds live (no synthesis_error), measured latency drop, no
constraint-surfacing regression.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(consulting): parallelize per-node constraint classifier (GUIDANCE-SYNTH-001 Epic 1)
The per-node LLM constraint classifier in findApplicableConstraints ran
serially (~1.5s/node x ~10 nodes ~= 15s), starving guidance synthesis of
its time budget (synthesis 6/6 errored on the warm path). Now classifies
with bounded concurrency.
- Gate-first (RRF-SCALE-001 score floor) to fix a stable candidate order,
then classify each candidate into a position-indexed slot via a
semaphore-bounded worker pool, then collect-in-order + dedup-by-name —
output is identical to the serial path (determinism). Keyword-only
(no LLM) or cap=1 runs serially (no LLM latency to hide).
- Config: CONSULTING_CLASSIFY_CONCURRENCY (default 4, matching
llama-server --parallel 4; floor 1 = serial rollback). Zero-value
fallback to 4.
- Extracted constraintClassifierIface (minimal Classify surface) so the
concurrent path is unit-testable with a fake; *ConstraintClassifier
satisfies it; SetConstraintClassifier guards against a typed-nil
interface.
Tier 1 (5 new, -race clean): ParallelEqualsSerial (determinism + order),
ParallelIsFaster (concurrency overlaps latency), ErrorFallsBackToKeyword
(fallback intact), ScoreGateStillApplies (RRF-SCALE-001 gate preserved),
ConcurrencyDefaultFallback. Existing findApplicableConstraints tests
unchanged — no regression.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(jiminy): config-drive warm-compute timeout (GUIDANCE-SYNTH-001 Epic 2)
The warm-path background Guide() ran with a hardcoded 30s timeout
(handlers_jiminy.go:302) even though JIMINY_TIMEOUT_MS=240s is
configured. 30s was too tight for the per-node classifier (~15s) +
synthesis (8-27s) -> synthesis deadline-exceeded every warm call.
Replaced with JIMINY_WARM_COMPUTE_TIMEOUT_MS (default 90000, zero-value
fallback 90000) — headroom for the now-parallel classifier (~7.5s) + a
slow 27s synthesis. No-hardcoding rule. Rollback: set to 30000.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test+docs(guidance-synth-001): Tier 2/3 verification + docs + close (Epic 3)
Tier 3 live e2e (verification.md): the warm production path now produces
a synthesized narrative — synthesis_used=true, no synthesis_error,
1892-char augmentation. Fresh jiminy.synthesize succeeded at 50.7s
latency (fit the new 90s budget; would die at the old 30s — validates
the default). Both fixes needed.
Tier 2 (guidance_synth_test.go, integration, skip-on-empty + LLM-
tolerant): PASS — warm path produces guidance without synthesis_error.
Docs: CHANGELOG Fixed entry; CLAUDE.md guidance-synthesis-budget note
('when adding LLM calls to the guidance hot path: respect the
warm-compute budget and prefer bounded concurrency over serial loops');
post.md with the data-driven diagnosis + forward-looking (Follow-up C
now the last open item).
Closes Follow-up B. The guidance pipeline (surfacing + codes +
synthesis) is fully functional end-to-end.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(followup-c): close JSON control-char escaping as NON-ISSUE (no fix)
Follow-up C (the last open item from RRF-SCALE-001 triage) investigated
and closed with evidence — NO code change, because there is no bug to fix.
The earlier /v1/jiminy/latest parse failures were client-side shell
artifacts (co-occurring with the session's 'failed to change group ID'
errors + ad-hoc variable-capture piping), not server bytes:
- writeJSON uses json.NewEncoder().Encode (encoding/json always escapes
control chars U+0000-U+001F); no raw-write bypass; no custom MarshalJSON.
- The synthesized narrative is double-StripControlChars'd (synthesizer.go
:127 + service.go:1116).
- prompt-context.sh already strips control chars via perl before jq, with
2>/dev/null + // empty fallbacks.
Live-verified: the hook's exact jq returns guidance_id correctly; 5 rapid
/latest fetches all parse as strict-valid JSON; 0 raw control chars.
Per 'don't fix a non-problem', shipping a fix would invent a bug that
doesn't exist. Closure documented in docs/development/followup-c-closure.md.
This closes the entire RRF-SCALE-001 follow-up triage: A (JIMINY-OUTCOME
-001), B (GUIDANCE-SYNTH-001), C (non-issue). The guidance->feedback->
outcome loop is fully functional end-to-end.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(jiminy): /guide 30s timeout sibling + single-source config defaults (GUIDANCE-SYNTH-001 fix-commit)
Two things, both from a live e2e of the full loop through the real
production hook path (run per user directive: standard tests don't find
live problems).
1. Sibling bug: the /v1/jiminy/guide handler had the SAME hardcoded 30s
cap as the warm path (handlers_jiminy.go). GUIDANCE-SYNTH-001 fixed
warm; /guide still deadline-exceeded synthesis at exactly 30.003s
(this is what made prior sprints' /guide integration tests flaky).
Now uses the config-driven budget. Live-verified: a 50.05s synthesis
completed (synthesis_used=true) — would die at 30s.
2. Single source of truth for config defaults (user directive: 'single
place to change all instances'). The 90s budget was duplicated as a
literal in 3 sites; prior sprints similarly duplicated each default
(the sigmoid 0.45/8.0 was in 3 places). Now each default is one
exported config.Default* const, referenced by FromEnv and aliased by
consuming-package fallbacks + a Config.JiminyWarmComputeTimeout()
method. Consolidated: warm-compute timeout, the 3 consulting score
floors, sigmoid midpoint/steepness, constraint-code sim threshold,
classify concurrency. Zero behavior change (compile-time aliases);
-race + full suites green.
Live e2e also re-confirmed: real hook captures guidance_id -> feedback
-> +7 Neo4j GUIDANCE_OUTCOME edges on real constraint nodes + 10 TSDB
rows (whole loop closes through the actual hook; re-confirms Follow-up C
non-issue).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-cli-001): sprint plan — federation consumer CLI + UATS backfill
Builds the first consumer for EVENTGRAPH-001's reinforcement-neighborhood
federation API (which has no consumer): a 'mdemg eventgraph' CLI command.
Validates the Pattern Y1 bet + becomes the live-testing harness for
EVENTGRAPH-002/003 (user directive: build the consumer first).
Per the UxTS directive: maps the work to the frameworks. UATS applies to
the federation HTTP API -> add eventgraph_reinforcement_neighborhood.uats
.json (backfilling the -001 gap; the endpoint shipped with no UATS),
which replaces an ad-hoc Go integration test as the Tier 2 contract test.
UVTS/UBENCH N/A. UOTS panel-spec gap noted as a follow-up (out of scope).
CLI rendering -> Tier 1 Go units.
4 epics; CLI (--seed/--query/--hops/--since/--limit/--json) renders
summary + events table or JSON; server-driven defaults (no re-hardcoding);
read-only. ~1-1.5 dev-days.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-cli-001): mdemg eventgraph reinforcement-neighborhood (Epic 1)
First consumer of the EVENTGRAPH-001 federation API — POSTs to
/v1/eventgraph/reinforcement-neighborhood, renders a summary + events table
(or --json). Supports --seed, --query (resolves seed via /v1/memory/retrieve
top-1), --hops, --since, --limit. Unset flags are omitted from the request
so the server applies its config defaults (no re-hardcoding of hops/since/limit
in the CLI). Registered under the "advanced" command group.
Tier 1 (httptest, -race clean): request-mapping omit-when-unset + conversion,
--query seed resolution, no-results + invalid --since + surfaced-503 errors,
render (empty + table), helpers.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(eventgraph): neighbor_node_ids serializes as [] not null for empty neighborhood
Caught in EVENTGRAPH-CLI-001 live contract testing (standard code tests
missed it; the live UATS happy-path against the running server did not):
walkNeighborhood returns a nil slice when the seed has no neighborhood
(e.g. an unknown seed), which JSON-marshals to `null`, while Events is
defensively initialized to []. Both are array fields and must serialize
consistently — null breaks any consumer asserting an array type (incl. the
new UATS contract's `type_is array` on $.neighbor_node_ids).
EventsInGraphNeighborhood now coalesces the nil slice to []string{}.
Tier 1 TestFederationResult_EmptyArraysNotNull pins the JSON contract.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(eventgraph-cli-001): UATS contract spec for federation API (Epic 2)
Backfills the UATS gap EVENTGRAPH-001 left (no contract test for
/v1/eventgraph/reinforcement-neighborhood). 6 cases, validated 6/6 live
against the running server:
- happy 200: asserts the response contract shape (events/neighbor_node_ids
arrays, graph_hops/tsdb_rows_scanned numbers, truncated boolean) — robust
to data, works even with an unknown seed (empty neighborhood is valid 200)
- missing_space_id / missing_seed_node_id → 400 (empty-string override, since
the runner deep-merges variant body over base — key omission can't unset)
- negative_hops → 400, hops_over_ceiling (999 > 2×default) → 400
- method_not_allowed (GET) → 405
sha256 integrity hash added + verified.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-cli-001): live verification + feature doc + CHANGELOG + close (Epic 3)
Tier 3 live e2e verified the real binary against the real stack: --query
surfaced 20 reinforcement events in a 5-node neighborhood (demonstrating the
Hebbian-write → federation-read loop closing in one command); --seed/--json/
--limit/unknown-seed/no-arg paths all verified live. Feature doc gains the CLI
consumer section; CHANGELOG Added + Fixed entries; CLAUDE.md architecture note;
verification.md + post.md (UxTS mapping: UATS done, UOTS follow-up carried over).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(eventgraph-cli-001): tag UATS spec 'tsdb' so CI skips it without TSDB
CI Test failed: the UATS contract step boots a minimal server without TSDB,
so the eventgraph service is nil and every POST returns 503 "service not
initialized" instead of the expected 200/400 (only GET→405 passed, since the
method check precedes the service check). Same class as PR #404. The
federation endpoint genuinely requires TSDB (it queries reinforcement_events;
the service is nil without TSDB at boot), and CI's UATS step already excludes
`tsdb`-tagged specs (ci.yml --exclude-tag ...,tsdb). Added "tsdb" to api.tags
(matching metrics_snapshot/readyz_tsdb); re-hashed. Verified locally: the spec
now reports Status: skip under the exact CI exclude filter, and still 6/6 live
against the full stack via explicit --spec.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): sprint plan — guidance-outcome federation (Epic 0)
Federate the guidance-outcome event stream (Pattern Y1, second event class):
walk a constraint's Neo4j neighborhood, surface time-windowed constraint_outcomes
(followed/ignored/contradicted) for the constraint + its graph-related constraints.
Data-decided architecture: reuse the existing constraint_outcomes table (no new
hypertable/writer/enqueue site — RRF-SCALE-001 already populates it, 1176 live
rows); join graph↔events on constraint_code (TSDB constraint_id UUID ≠ Neo4j
node_id CUID — code is the only viable key). One additive migration (V0023:
constraint_code index, schema 22→23). 8 epics, 3 testing tiers, live Tier 3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): V0023 constraint_code index on constraint_outcomes (Epic 1)
Adds idx_constraint_outcomes_code (space_id, constraint_code, time DESC) — the
guidance-outcome federation joins graph↔events on constraint_code (TSDB
constraint_id is a UUID that doesn't match the Neo4j node_id CUID; code is the
only viable key), and migration 011 indexed only space/constraint_id/outcome.
Partial index (constraint_code NOT NULL AND <> '') skips uncoded outcomes.
Bumps TSDB_REQUIRED_SCHEMA_VERSION default 22→23 (config.go) to match the
migration count — CI schema-version validator gates on this. Additive, no data
change, idempotent.
Live-verified: migration applies (schema 22→23), idx present, re-apply is a
no-op, config/tsdb tests green, CI schema check 23=23.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): GuidanceOutcomesInNeighborhood federation method (Epic 2)
Second Pattern Y1 federation: walk a constraint's Neo4j neighborhood, collect
each neighbor's constraint_code, and join constraint_outcomes on those codes
(backed by the V0023 index). walkNeighborhoodWithCodes returns the neighborhood
node IDs + a code→node map; queryGuidanceOutcomes pulls coded outcomes in the
window; Go-side join resolves each outcome's code → its neighborhood constraint
node. Non-nil slices from the start (EVENTGRAPH-CLI-001 lesson). Reuses the
existing constraint_outcomes sink — no new table/writer.
Tier 1 (-race): validation guards, empty-arrays-not-null, sortedKeys
determinism, join resolution. Tier 2 integration (live Neo4j+TSDB): full
round-trip — hops=1 (seed+related codes, off-neighborhood excluded), hops=0
(seed code only), unknown-seed (empty non-nil). PASS.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): guidance-outcome federation handler + route (Epic 3)
POST /v1/eventgraph/guidance-outcome-neighborhood — walk a constraint's
neighborhood, surface constraint_outcomes whose code is in the neighborhood.
Same gating/auth/default convention as the reinforcement endpoint.
Single-source refactor (per the dynamic-variables directive): extracted the
shared gate (method/enabled/service → eventgraphGate) and default-resolution
(hops/since/limit + ceiling → resolveFederationDefaults) into helpers used by
BOTH handlers, so the federation rules live in exactly one place. The
reinforcement handler now calls them too — verified no regression (reinforcement
UATS still 6/6 live, unit tests green).
Live-verified: seeding from the real 'no-direct-main-commits' constraint node
surfaced real 'followed' outcomes with constraint_node_id resolved to the seed
and in_neighborhood=true.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): mdemg eventgraph guidance-outcome-neighborhood CLI (Epic 4)
Sibling subcommand consuming POST /v1/eventgraph/guidance-outcome-neighborhood.
Walks a constraint's neighborhood and renders guidance outcomes (followed/
ignored split + table: code · outcome · sim · g_type · guidance_id · recorded)
or --json. Seed via --seed/--query (--constraint-code seeding deferred — needs
server-side code→node resolution; --query covers discovery). Unset hops/since/
limit omitted so the server applies config defaults (single source of truth).
Tier 1 (-race): request-mapping omit-when-unset + conversion, --query seed
resolution, surfaced-503 error, render (empty + followed/ignored table),
truncStr. Help renders.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(eventgraph-002): UATS contract spec for guidance-outcome federation (Epic 5)
6 cases, validated 6/6 live: happy-200 response shape (outcomes/
neighbor_node_ids/neighbor_constraint_codes arrays, graph_hops/tsdb_rows_scanned
numbers, truncated boolean), missing space_id/seed → 400 (empty-string override
under deep-merge), negative_hops → 400, hops_over_ceiling → 400, GET → 405.
Tagged 'tsdb' so CI skips it without TSDB (the EVENTGRAPH-CLI-001 lesson).
sha256 hashed + verified.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): Tier 3 live verification (Epic 6)
Real binary against the real stack. Key assertion: CLI --json output matches
direct constraint_outcomes SQL exactly (11 outcomes = 11, all followed) for the
no-direct-main-commits constraint. --seed/--query/--limit/--json/unknown-seed/
no-arg all verified live. The --query "0 outcomes" result was traced to SQL
ground truth — the 5 neighborhood codes genuinely have no feedback, so it's
correct (federation distinguishes "code in neighborhood" from "code has
outcomes"), not a join bug. Reinforcement endpoint un-regressed by the shared-
helper refactor (UATS 6/6).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): feature doc + CHANGELOG + CLAUDE.md + close (Epic 7)
Feature doc gains a Guidance-Outcome Federation section (why reuse
constraint_outcomes, why join on constraint_code, CLI usage) + forward-look
update. CHANGELOG Added (endpoint + CLI) + Changed (TSDB schema 22→23).
CLAUDE.md architecture note extended. post.md closes the sprint with UxTS
mapping + follow-ups (--constraint-code seeding, EVENTGRAPH-003, UOTS).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(metrics,backup): resolve docker binary robustly under minimal launchd PATH
The native server (launchd) inherits PATH=/usr/bin:/bin:/usr/sbin:/sbin, which
excludes the Docker Desktop symlink (/usr/local/bin/docker). So every
server-runtime `docker` shellout failed with "executable file not found in
$PATH": (1) Neo4j container CPU/mem stats (server.go) logged an ERROR every 60s
and left the neo4j_container_* gauges empty — so the neo4j_high_cpu/_memory
alert rules had no data; (2) the TSDB backup scheduler's `docker compose
pg_dump` (backup.go) failed with only a slog.Warn. The DATA PLANE was never
affected — Neo4j (Bolt) + TSDB (pgx) connect over mapped TCP ports, not the
docker CLI.
Fix (durable, configurable, single-source): new internal/dockerbin resolver —
MDEMG_DOCKER_BIN env override → exec.LookPath → well-known install locations
(/usr/local/bin, /opt/homebrew/bin, /usr/bin) → graceful unavailable. Wired
into server.go (stats) + backup.go (both compose calls). The perpetual 60s
ERROR is downgraded to a one-shot WARN when docker is genuinely absent (it's
optional telemetry). Added a sane PATH to the launchd server plist template as
defense-in-depth.
Live-verified: after restart, mdemg_neo4j_container_cpu_percent=0.59 /
mem_percent=29.13 now land in metric_samples (were absent); no more docker-stats
ERROR; `docker stats` + backup resolve docker under a simulated minimal PATH.
Note: `mdemg data export-auto` (training-export job) was NOT a victim — it
exports via network SQL, not docker (corrected from an earlier assumption).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(nosilent-001): sprint plan — fail-loud scheduled jobs (Epic 0)
Triggered by a live-discovered silent failure: the TSDB backup scheduler was
failing every 24h run (docker-under-launchd-PATH) with only a buried slog.Warn.
Docker cause fixed (4cc7608); this sprint fixes the class — scheduled jobs that
fail with no record + no alert. V0024 scheduled_job_events hypertable + writer,
jobhealth.Report (record + alert on failure), wire the 3 jobs (backup,
maintenance, export-auto), 2 evaluator rules (backup staleness + recent
failure) so the server catches "job failed OR never ran". Config-driven, 3
testing tiers, live Tier-3 induced-failure.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): V0024 scheduled_job_events + writer (Epic 1)
Hypertable (job_name, success, latency_ms, error_message, metadata jsonb,
recorded_at) + RecordJobEvent synchronous single-row writer (mirrors V0021
model_install pattern). Indexes: per-job freshness, partial failed, per-space.
One row per scheduled-job run so the alert evaluator can detect "job failed"
AND "job never ran" (staleness). Schema 23->24; TSDB_REQUIRED_SCHEMA_VERSION
bumped to match migration count (CI check 24=24).
Tier 1 (-race): field mapping, optional-nulls, error truncation, nil-pool
no-op, insert-error propagation. Tier 2 (live TSDB): round-trip + the staleness
(recent successes) + failure (recent failures) query shapes the rules will use.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): record + alert on scheduled-job outcomes (Epic 2)
New internal/jobhealth.Report — the single policy point: record a
scheduled_job_events row and fire a high-severity "scheduled-job" alert on
failure (both pool + dispatcher nil-safe). Wired into all three jobs:
- TSDB backup scheduler (internal/tsdb/backup.go): decoupled JobResultFunc hook
(mutex-guarded, -race clean) so internal/tsdb stays free of internal/alert;
server.go::SetTSDBClient sets it with the pool + s.alertDispatcher. A failed
or never-run backup now records + alerts instead of a silent slog.Warn.
- export-auto + maintenance (CLI): deferred reportScheduledJob on the named
return error — opens a short-lived pool + a file-backed dispatcher (same
~/.mdemg/alerts/current.json the hooks surface) so a separate-process CLI job
still alerts the operator.
Tier 1 (-race): jobhealth fires alert only on failure (real file-backend
dispatcher), nil-safe. Live smoke: export-auto recorded success=t latency=3050ms.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): scheduled-job staleness + failure alert rules (Epic 3)
Two server-native evaluator rules over V0024 scheduled_job_events:
- scheduled_job_recent_failure (always on): any job failure in the last
JOB_FAILURE_LOOKBACK_MIN (default 60) → high alert.
- backup_no_recent_success (gated on TSDB_BACKUP_ENABLED): zero successful
tsdb-backups within the staleness window → high alert. THIS is the "job
never ran" guarantee — it fires from the server observing ABSENT success, so
a backup that silently died or never started is caught, not just one that
errored.
Window derives from the real backup interval × 2 (JOB_BACKUP_STALENESS_HOURS
override; no hardcoded literal). JOB_HEALTH_ALERT_ENABLED master gate (default
true). Appended after DefaultRules() in serve.go.
Tier 1: failure rule always present (gt 0), staleness gated on backups-enabled
(lt 0.5), windows reflect config, non-positive fallback. Build/lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(nosilent-001): distinct services for job rules so neither masks the other
Caught in live Tier-3 testing: both evaluator rules used Service="scheduled-jobs",
and the dispatcher cooldown key is (Service, Severity) — so the failure alert's
cooldown SUPPRESSED the staleness alert (only the failure fired). One alarm
masking another is the exact silent-failure class this sprint kills. Fixed:
scheduled-job-failure / scheduled-job-staleness distinct services. Re-verified
live — both fire independently and land as distinct alert-file entries. Tier-1
assertion pins that the two services differ. Includes Tier-3 verification.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(nosilent-001): feature doc + CHANGELOG + CLAUDE.md + close (Epic 4)
Feature doc docs/features/scheduled-job-health.md (why / two mechanisms /
operator view / config). CHANGELOG Added (NOSILENT-001) + Changed (schema 23→24)
+ Fixed (docker-under-launchd-PATH). CLAUDE.md Service Alert System extended
with the scheduled-job-health note + the distinct-Service-per-rule cooldown
caveat. post.md closes the sprint (UxTS mapping + follow-ups).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(nosilent-001): sync embedded launchd server plist with source (CI)
CI "Verify embedded launchd templates match source" diffs packaging/launchd/*
against internal/cli/launchd_templates/* (the embed.FS copy mdemg service
install uses). The PATH addition landed only in the source copy; sync the
embedded copy so they match byte-for-byte.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(roadmap): add jiminy-governance skill build-out (Workstream C, Action 7)
Records the jiminy-governance Claude Code skill on the active forward roadmap
(SPRINT_ROADMAP_POST_FT_LORA.md, cross-cutting governance) + brings the source
spec into the repo (docs/development/jiminy-governance-skill/SKILL.md, out of
~/Downloads). The skill makes Jiminy the deterministic source of context +
governance over J17, enforced by the PreToolUse hook — a routing/handshake shim,
not a rulebook. Build-out scope notes the wire-up placeholders that must be
resolved against the real instance (Jiminy MCP/endpoint, PreToolUse hook, J17
ack/RetireCode/GUIDANCE_OUTCOME calls). Aligns with the now-live guidance loop
(RRF-SCALE-001 / JIMINY-OUTCOME-001 / GUIDANCE-SYNTH-001).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-governance): resolve skill wire-up against the real instance
Step 1 of the jiminy-governance build-out (roadmap Workstream C Action 7).
Resolved all five placeholders from the running MDEMG instance, verified live:
- Jiminy query: MCP `mdemg mcp` (stdio) → jiminy_guide/validate_changes; HTTP
/v1/jiminy/guide (returns guidance_id) + /bootstrap glossary (j17v1) + /latest.
- PreToolUse: pre-bash-check.py (Bash, fail-closed) + pre-write-check.py
(Write/Edit → /v1/jiminy/classify, /strict-only, fail-open).
- SessionID: claude-core convention.
- Comprehension ack: /v1/jiminy/protocol/feedback (verified ingested 1).
- GUIDANCE_OUTCOME: /v1/jiminy/feedback {guidance_id,…}.
- RetireCode: internal-only by design (RSIC/APE protocol-evolution) — no
agent-facing call; agent must never self-retire a constraint.
Also surfaced the two real integration gaps the build-out must close (the work,
not the prose): (a) the MDEMG MCP server is NOT registered (.mcp.json absent —
context is pushed by prompt-context.sh, not pulled by the agent); (b) PreToolUse
enforcement is /strict-gated + fail-open, so not deterministic-by-default.
Roadmap Action 7 updated to reflect step 1 done + the two gaps as next steps.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(jiminy-governance): ship the J17 governance skill + register MDEMG MCP
Build-out of roadmap Workstream C Action 7 (steps 2-4), per the resolved
wire-up. Closes the two integration gaps found while resolving:
- gap 1 (MCP not registered): .mcp.json registers `mdemg mcp` (stdio).
Live-probed: 20 tools incl. jiminy_guide + validate_changes — the agent can
now PULL guidance, not only receive the hook push.
- gap 2 (enforcement /strict-gated + fail-open): policy set — Write/Edit J17
gate kept fail-open (hard server dependency on every edit is too brittle);
the skill's handshake auto-enables /strict so the gate is active per session.
Bash gate already fail-closed (demonstrated live when a test payload with a
destructive force-push string was blocked by pre-bash-check.py).
Skill authored at the canonical .claude/skills/jiminy-governance/SKILL.md
(frontmatter valid; concrete wire-up inline — MCP tools + HTTP endpoints +
SessionID claude-core + the 5-step handshake). Kept a routing/handshake shim,
not a rulebook (rules stay in the graph).
Live Tier-3 PASSED: full handshake identify->request->comprehend->act->report
ran against the real instance; GUIDANCE_OUTCOME edges 906->909 (one per coded
constraint). Verification: docs/development/jiminy-governance-skill/.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-governance): commit install-ready skill + install README
.claude/ is gitignored (per-developer local config), so the installed skill at
.claude/skills/jiminy-governance/SKILL.md is local-only by convention. Commit
the reproducible, install-ready copy (jiminy-governance.skill.md) + a README
with the one-line install (cp into .claude/skills/) so the skill propagates via
the repo. The MCP server it uses is registered in the tracked repo-root
.mcp.json.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): per-conversation SessionID across hooks + skill
Hooks and the jiminy-governance skill hardcoded session_id="claude-core", so
trust/escalation/observations from ALL Claude Code conversations collapsed into
one shared MDEMG session. Claude Code already passes a per-conversation
session_id on stdin to every hook — the implementation just never used it.
Resolver precedence (single rule everywhere): MDEMG_SESSION_ID env (stable-
identity escape hatch) > Claude Code stdin session_id (per-conversation default,
race-free per hook) > ~/.mdemg/.claude-session (published by SessionStart +
UserPromptSubmit for the agent / stdin-less contexts) > claude-core (fallback).
Realizes J17's intended per-(session,constraint) isolation.
Tracked templates updated (internal/cli/hook_templates/): session-start.sh,
prompt-context.sh, post-tool-observe.py, pre-compact.sh + Windows .ps1 variants
— every hardcoded claude-core in MDEMG calls replaced with the resolved id;
session-start/prompt-context publish the session file. Skill SessionID
instruction + handshake steps now resolve <SessionID> instead of claude-core.
Live-verified: a hook resolved a stdin session_id and published the session
file; post-tool-observe wrote an observation keyed to the per-conversation id in
Neo4j (not claude-core). bash -n / py_compile clean; go build + hooks test pass.
Note: pre-write-check.py is local-only (no tracked installer template); its fix
is applied on this machine but won't propagate via `mdemg hooks install` until
it's added to the tracked hooks (follow-up). .claude/* is gitignored, so the
live hook copies aren't committed — the tracked templates propagate via install.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): per-conversation SessionID — installed hook copies
The .claude/hooks/ installed copies are tracked (committed before the .claude/*
ignore), so apply the same SessionID resolver here as in the embedded templates
(prior commit). session-start.sh, prompt-context.sh, post-tool-observe.py,
pre-compact.sh now resolve MDEMG_SESSION_ID env > stdin session_id >
~/.mdemg/.claude-session > claude-core. (pre-write-check.py is untracked /
local-only — its fix lives on this machine only; tracking it is the follow-up.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): add pre-write-check.py to the tracked installer
The /strict J17 Write/Edit classify gate (pre-write-check.py) was local-only —
no tracked template, so `mdemg hooks install` never installed it and its
SessionID fix wouldn't propagate. Add it as a tracked template
(internal/cli/hook_templates/pre-write-check.py, space_id → {{SPACE_ID}},
runtime URL discovery — no {{MDEMG_URL}} placeholder per the template
convention) and register it in claudeHookFiles() as
{PreToolUse, 8s, "Write|Edit"}. hooks_test expectations updated 5→6.
go build + hooks test + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* release: cut v0.10.1
Promote CHANGELOG [Unreleased] → [0.10.1] - 2026-06-08. Adds the
jiminy-governance skill + MDEMG MCP registration and the per-conversation
SessionID work to the release notes (alongside the already-logged EVENTGRAPH-002,
EVENTGRAPH-CLI-001, NOSILENT-001, the docker-PATH fix, and the TSDB schema
22→23→24 bumps). Fresh empty [Unreleased]; comparison link refs updated +
backfilled through v0.10.1.
The v0.10.1 git tag (triggers release.yml artifact build) + homebrew formula
bump are the operator release-cut step on main, post-merge.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs: governance system doc + bring cli/api references current
(1) New docs/features/jiminy-governance.md — detailed how-it-works + full file
inventory for the J17 agent-governance system (skill, hooks, SessionID, MCP,
enforcement, runtime state files, install/verify steps).
(2) docs/user/api-reference.md — add the Event Graph Federation section
(POST /v1/eventgraph/{reinforcemen…
* refactor(learning): expose per-pair telemetry from Hebbian Cypher (EVENTGRAPH-001 Epic 3)
ApplyCoactivation Cypher RETURN clause extended from "count(*) AS updated"
to 17 per-pair columns: src/dst node IDs, prev/new/delta weight,
evidence_count_after, eta_effective (cfg.LearningEta × etaMult),
surprise_factor, activation_product, path_sim, role_a/b, obs_type_a/b,
session_id, direction (forward/reverse/bidirectional), created_new_edge.
created_new_edge derived from (r.evidence_count = 1) — the ON CREATE
branch sets evidence_count to 1; ON MATCH increments. Reliable proxy
for "new connection formed" vs "existing connection strengthened" at
analysis time.
Plan-deviation disclosure (per feedback_plan_options_pattern.md): the
plan called for 2 rows per pair in asymmetric mode (forward + reverse).
The Cypher mirrors rr.weight = r.weight at all times — forward and
reverse edges carry identical weights. Emitting 2 rows would double-
count without adding signal. Final choice: 1 row per logical pair
regardless of mode, with the direction column carrying the
forward/reverse/bidirectional distinction. Revisit if EVENTGRAPH-003
introduces a Hebbian path where forward/reverse weights diverge.
New helper internal/learning/reinforcement_parser.go translates a
neo4j.Record (or any (key) → (any, bool) getter) into a
tsdb.ReinforcementEventRow. Lives in its own file so service.go
doesn't grow. Defensive against missing keys (zero values), nil values
(zero/empty), wrong types (fallback to zero) — no panics.
Tier 1 unit tests (6 green) cover:
- Symmetric bidirectional + ON CREATE branch
- Asymmetric forward + ON MATCH branch (evidence > 1)
- Missing optional fields → zero values (nullable* writer helpers
serialize as DB NULL)
- Neo4j int64 → Go int coercion
- nil values → zero/empty
- Wrong-typed values → graceful fallback
Reinforcement rows are captured locally in ApplyCoactivation but not
yet forwarded to TSDB — Epic 4 wires the writer.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(learning): record reinforcement events to TSDB (EVENTGRAPH-001 Epic 4)
learning.Service grows a reinforcementWriter field + SetReinforcementWriter
setter (mirrors the SetStabilityReinforcer back-compat pattern). After
ExecuteWrite returns from ApplyCoactivation, each captured per-pair row
gets the spaceID stamped on it and is enqueued via writer.Record. The
writer is non-blocking; the Hebbian hot path never waits on TSDB.
Configurability Contract — 7 new env vars (no-hardcoding rule):
- EVENTGRAPH_ENABLED (bool, default true)
- EVENTGRAPH_WRITER_FLUSH_INTERVAL_SEC (int, default 30, floor 5)
- EVENTGRAPH_WRITER_BUFFER_SIZE (int, default 1000, 0 = unlimited)
- EVENTGRAPH_MAX_PAIRS_PER_EVENT_BATCH (int, default 200)
- EVENTGRAPH_MAX_EVENTS_PER_QUERY (int, default 500, Epic 5 ceiling)
- EVENTGRAPH_FEDERATION_DEFAULT_HOPS (int, default 2)
- EVENTGRAPH_FEDERATION_DEFAULT_LOOKBACK_HOURS (int, default 24)
api/server.go wires the writer's lifecycle:
- Constructed after TSDB client is ready, gated by cfg.EventGraphEnabled
so EVENTGRAPH_ENABLED=false cleanly skips construction; learner's
reinforcementWriter stays nil and the Hebbian path short-circuits.
- Closed alongside the other TSDB writers in graceful-shutdown — buffer
drains before the process exits.
Tier 2 integration tests (against real TSDB, build tag integration):
- TestEventGraph_Writer_RoundTrip: 3 rows recorded → flush-window
elapses → SELECT count(*) returns 3.
- TestEventGraph_Writer_DrainOnClose: 5 rows recorded with 1-hour flush
interval → Close() drains → SELECT returns 5 (verifies the server
shutdown invariant).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(eventgraph): federation query helper + API endpoint (EVENTGRAPH-001 Epic 5)
internal/eventgraph/query.go — Pattern Y1 federation helper.
EventsInGraphNeighborhood orchestrates a two-step query:
1. Cypher graph walk from a seed node — variable-length path over
CO_ACTIVATED_WITH | GENERALIZES at depth 0..Hops. Returns the
N-hop neighborhood (DISTINCT node_ids, includes the seed).
2. TSDB query against reinforcement_events for events where src OR
dst is in the neighborhood, within the lookback window, ordered
newest-first, capped at the configured limit.
3. Go-side join — annotates events with SrcInNeighborhood /
DstInNeighborhood so the consumer can distinguish "both endpoints
in the subgraph" from "one endpoint outside the seed's N-hop
reach but the event still touches our subgraph."
Empty neighborhood (no seed match, hops=0) short-circuits before the
TSDB call. Sub-1-second Since values clamp to 1s. Hops < 0 is rejected
upfront. The handler enforces an additional ceiling of 2 ×
EVENTGRAPH_FEDERATION_DEFAULT_HOPS for runaway-walk protection.
internal/api/eventgraph_handler.go — POST /v1/eventgraph/reinforcement-
neighborhood. Same auth convention as /v1/admin/breakers. 503 when
EVENTGRAPH_ENABLED=false or when eventgraphService is nil (TSDB-down at
boot). 400 on missing space_id / seed_node_id / negative hops / hops >
ceiling. Defaults applied from config when fields omitted from request.
Plan-decision disclosure (per feedback_plan_options_pattern.md): plan
proposed Option A (single endpoint with event_type query param) vs
Option B (endpoint per event class). Final choice: A. v1 has one event
class (reinforcement); the endpoint URL is explicit about that.
EVENTGRAPH-002 can either add a query param or split the URL when a
second event class arrives — no breaking change either way.
Tests:
- Tier 1 (internal/eventgraph/query_test.go, 7 green): request
validation rejects empty space_id, empty seed, negative hops; interval
formatting roundtrips; join annotation handles both-inside,
one-outside, and empty-neighborhood cases.
- Tier 1 (internal/api/eventgraph_handler_test.go, 4 green + 2 skipped):
method-not-allowed, feature-disabled 503, nil-service 503, invalid-
JSON short-circuit. Two validation paths skipped — they require a
non-nil eventgraphService which can't be constructed without a real
driver; Tier 2 exercises them.
- Tier 2 (tests/integration/eventgraph_federation_test.go, 1 green):
builds seed--mid--leaf graph + off-node, emits 3 reinforcement
events touching all four nodes, calls federation at hops=0 and
hops=1, asserts neighborhood + in-neighborhood flags. The hops=0
test confirms that mid↔leaf (touching neither seed nor any 0-hop
neighbor) is correctly excluded.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(observability): Grafana panel + Prometheus counters for reinforcement events (EVENTGRAPH-001 Epic 6)
Three new Prometheus counters mirror the V0022 writer's internal atomic
counters:
- mdemg_eventgraph_writer_rows_enqueued_total — rows successfully CopyFrom'd
- mdemg_eventgraph_writer_rows_dropped_total — rows FIFO-evicted (buffer full)
- mdemg_eventgraph_writer_flush_failure_total — flush errors
Wiring: the writer accepts a narrow PrometheusCounter interface
(Add(int64)) so internal/tsdb does not import internal/metrics (which
would cycle). api/server.go calls SetPrometheusCounters after the
writer is constructed, passing the three counters from the global
StandardMetrics struct. Nil-safe.
Dashboard: mdemg-graph-topology.json gains a new collapsed row
"Reinforcement Events (EVENTGRAPH-001)" with a single time-series
panel "Reinforcement Event Rate (events/min)" showing all three rates
(enqueued / dropped / flush failures) over the last 24h. Dropped is
colored orange, flush failures red, enqueued the default palette. Tied
to the prometheus datasource.
The existing GRAFANA-AUDIT-001 harness (scripts/grafana_panel_audit.py)
only evaluates SQL-target panels — the new panel uses Prometheus
queries, so it lands on the SKIP pile, same as the other 8 Cypher /
Prometheus panels on this dashboard. Audit JSON refreshed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(eventgraph-001): restore full GRAFANA-AUDIT-001 audit_results.json
Epic 6's targeted audit run (scripts/grafana_panel_audit.py --dashboard
mdemg-graph-topology.json) overwrote the full multi-dashboard audit
results from GRAFANA-AUDIT-001 with the single-dashboard subset (9
SKIPs only). Restoring the full snapshot from commit 0a1e8e1 — that
audit covered all 8 dashboards and is the canonical baseline the
GRAFANA-AUDIT-001 post.md references. EVENTGRAPH-001 did not need to
regenerate it; the new panel uses Prometheus queries, which the audit
harness SKIPs regardless of dashboard.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(retrieval): set Activation on RRF RetrieveResult (EVENTGRAPH-001 fix-commit)
ScoreAndRankRRF's ConsensusResult → RetrieveResult conversion was
silently dropping the Activation field. The legacy ScoreAndRank path at
scoring.go:883 sets Activation: a (where a := act[c.NodeID] is the
spreading-activation map value). The RRF path constructed
models.RetrieveResult{...} with no Activation key, leaving the field
zero-valued.
Net effect: since Phase 13.1 default-on (2026-05-03),
learning.Service.ApplyCoactivation has filtered out every L0 candidate
on the retrieve hot path. The filter is r.Activation >=
LearningMinActivation (default 0.20). With Activation=0, no pair makes
it to the Hebbian Cypher; the function returns nil without writing.
Hebbian learning has been silently no-op on the production retrieve
goroutine for ~24 days. CO_ACTIVATED_WITH edges still exist in the
graph — sidecar paths (CoactivateSession, ApplySymbolCoactivation,
consolidation walks) and pre-Phase-13.1 retrieves wrote them — but the
retrieve-time goroutine has been a silent no-op.
Discovered during EVENTGRAPH-001 Epic 7 live e2e. Three retrieves
produced 0 rows in reinforcement_events. Investigation traced the gap
to the missing Activation field.
Fix: one-line addition in scoring_rrf.go — Activation: act[c.NodeID].
Brings the RRF path to parity with the legacy scorer.
Post-fix verification: rebuilt, restarted server, re-issued 3 retrieves
→ 10 reinforcement events landed in TSDB. Federation API at hops=1
correctly returned all 10 with src_in_neighborhood=true,
dst_in_neighborhood=true. Documented in
docs/development/eventgraph-001/verification.md.
Per CLAUDE.md "Testing — Live System Testing Is Required":
"surprise bugs caught during live smoke get their own follow-up
fix-commit — do not silently roll them into the sprint commit." This
is the precedent-aligned separate commit.
Forward-only: existing graph state is preserved; new retrieves now
correctly emit Hebbian updates. EVENTGRAPH-002 may revisit whether to
backfill the missing 24-day window.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(eventgraph-001): Tier 3 live e2e verification transcript (Epic 7)
Real /v1/memory/retrieve × 3 against mdemg-dev → 10 reinforcement events
landed in TSDB within the flush window. Federation API at hops=1 from a
seed node returned 5-node neighborhood + 10 in-neighborhood events.
Documents the surprise-bug discovery + fix that preceded this transcript
(see fix-commit for scoring_rrf.go::ScoreAndRankRRF Activation
propagation).
Acceptance criteria from sprint plan §"Acceptance Criteria" all PASS.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(eventgraph-001): feature doc + CHANGELOG + CLAUDE.md + sprint close (Epic 8)
Final epic — Documentation Update (never cut, per feedback_per_feature_docs_required.md
and the standardized v1.0 sprint plan format).
New: docs/features/event-graph-federation.md (~240 lines, Why / Choices /
How it works / How to use / Forward-looking). Documents:
- Pattern Y1 vs Y2 trade-off (why federation-in-Go now, link-node
reification deferred until a query forces it)
- Why V0019 buffered-CopyFrom over V0021 sync-INSERT (per-retrieve volume)
- Why ApplyCoactivation first (other 3 Hebbian entry points deferred to
EVENTGRAPH-003)
- Why forward-only (no source to backfill from)
- Federation pipeline (Cypher walk → TSDB query → Go-side join with
src/dst_in_neighborhood annotation)
- TSDB schema, API request/response shape, 7 env vars + defaults
- Observability (3 Prometheus counters + Grafana panel)
- Forward-looking sprints
New: docs/development/eventgraph-001/post.md — epic-by-epic outcomes,
acceptance criteria check-off, surprise log (RRF Activation drop +
audit-JSON overwrite + orphan-process port collision), plan deviations
disclosed (1-row-per-pair regardless of asymmetric mode; single-
endpoint over endpoint-per-class), forward-looking.
CHANGELOG.md Unreleased gains the EVENTGRAPH-001 entry — 11 bullet
points covering V0022 migration, buffered writer, Cypher RETURN-shape
change, Configurability Contract, federation helper + API, Prometheus
+ Grafana, Tier 2 + Tier 3 verification, the surprise-bug RRF
Activation fix-commit, and the audit-JSON restore.
CLAUDE.md Architecture Notes gains a new "Event Graph Federation" entry
above the Model Distribution section. Documents the pattern, surface,
deferrals, and the load-bearing fix-commit f307f55 that surfaced 24
days of silent Hebbian no-op on the retrieve hot path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(eventgraph-001): Grafana panel uses TSDB instead of unconfigured Prometheus datasource
The Epic 6 panel used datasource {type: prometheus, uid: prometheus} but
this Grafana instance has no Prometheus datasource configured — mdemg
exposes counters as JSON via /v1/metrics/snapshot, not a /metrics scrape
endpoint. Configured datasources: mdemg-nodegraph, neo4j, timescaledb
only. The panel rendered "No data" in the live Grafana.
Rewritten panel queries the reinforcement_events hypertable directly via
the timescaledb postgres datasource. Two targets:
1. count(*) over 1-minute time_buckets → overall events/min
2. count(*) FILTER (WHERE created_new_edge) vs WHERE NOT created_new_edge
→ split between new connections formed and existing connections
strengthened (the operational dimension the analytic queries
actually need)
Both targets templated on $space_id (existing dashboard variable). The
Prometheus counters (mdemg_eventgraph_writer_rows_{enqueued,dropped,
flush_failure}_total) remain wired and incrementing — they surface via
/v1/metrics/snapshot for ops scripts. The Grafana panel now actually
displays data instead of relying on a scrape path that doesn't exist
in this deployment.
Discovered during post-merge live verification (2026-05-29). Verified
fix: reloaded dashboard via Grafana API → /api/ds/query against same
SQL returns 1-minute buckets matching TSDB direct count. Audit harness
now reports 2 PASS for the new panel (previously SKIP — no SQL target).
verification.md updated with the post-merge transcript.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(rrf-scale-001): sprint plan — RRF score-scale consumer remediation
P0 fix. The Jiminy guidance->feedback->outcome loop has been dormant
~9 weeks: consulting/service.go gates constraint/suggestion extraction
on hardcoded legacy-scale score thresholds (r.Score < 0.55 et al.).
Phase 13.1 RRF (default-on May 3) dropped the score scale so strong
matches top out ~0.53 -> 0/10 results clear the gates -> empty guidance
-> dead loop. Third instance of the RRF-score-contract bug class (after
the EVENTGRAPH-001 Activation drop).
12-section format; 6 epics; config-driven percentile-gate fix +
sigmoid recalibration; live-verify the revived loop end-to-end.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(rrf-scale-001): Epic 1 audit findings — 12 sites cataloged
Full-repo sweep of post-RRF score/activation/confidence consumers +
live score-distribution sampling. Findings:
- HIGH (4): consulting constraint gates (1005/1081/1087) + confidence
sigmoid midpoint 1.5 (35-36) — the loop-killer cluster.
- MED (5): consulting conflict gates (931/944/957/981) + minConfidence
pre-filter (619, already config-driven).
- LOW (3): retrieval/jiminy.go Activation display gates (45/155/192) —
explanation text only, no guidance gating.
- NONE (2): jiminy trial score (0-10 scale), trust-score clamp.
Live distribution: RRF strong-match top scores cluster 0.49-0.58; the
0.55 gate sits mid-band, rejecting the most-relevant constraint half
the time. NormalizedConfidence is positional rank (spreads 100->0 even
on uniform-score sets) -> rules out plan Option A (percentile) as sole
gate. Remediation: config-driven RRF-calibrated absolute thresholds
(Option B), constraint floor default 0.45, sigmoid midpoint ->0.45.
Disclosed deviation per feedback_plan_options_pattern.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(consulting): RRF-calibrate score gates + confidence sigmoid (RRF-SCALE-001 Epic 2)
Revives the dormant Jiminy guidance loop. Replaces 7 hardcoded legacy-
scale score gates in consulting/service.go + the score->confidence
sigmoid (both copies) with config-driven, RRF-calibrated values.
Gates (all default 0.45, RRF strong-match band is 0.49-0.58):
- constraint extraction (was <0.55) -> CONSULTING_CONSTRAINT_SCORE_FLOOR
- keyword/name authority inner gate (0.55/0.6) -> CONSULTING_AUTHORITY_SCORE_FLOOR
- conflict/contradiction detection (0.6-0.7) -> CONSULTING_CONFLICT_SCORE_FLOOR
Key Epic-2 finding: keywordClassifyConstraint has an INNER authority
gate that binds tighter than the outer constraint gate. If authority
floor > constraint floor, the binding gate re-rejects the strong-match
band and the loop stays dormant -> all three default to 0.45. The RRF
band is too compressed to subdivide into tiers; knobs stay separate so
operators can raise any one independently.
Sigmoid (score->confidence), both consulting/service.go and
jiminy/retrieval_source.go (they MUST stay in sync per their own
comments): midpoint 1.5 -> 0.45, steepness 1.5 -> 8.0, config-driven via
RETRIEVAL_CONFIDENCE_SIGMOID_{MIDPOINT,STEEPNESS}. Legacy crushed a
strong 0.5 match to 0.18 confidence; recalibrated maps it to 0.60
(0.1->0.06, 0.58->0.74). normalizeRetrievalConfidence is now a Service
method reading cfg with zero-value fallback; mapRetrievalToGuidance
takes the sigmoid params from its caller's cfg.
5 new config knobs, all with RRF-calibrated defaults + zero-value
guards (no-hardcoding rule; the bug WAS a hardcoded value).
Tier 1 tests: updated 2 legacy-scale boundary tests to the new
thresholds + added RRFStrongMatchBand regression (0.50 must surface),
ConstraintFloor_ConfigDriven (override honored), and
NormalizeRetrievalConfidence_RRFCalibration (band mapping). Full
consulting + jiminy + config suites green; lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(rrf-scale-001): Epic 3 — remaining LOW findings reviewed + decided
retrieval/jiminy.go Activation display gates (45/155/192 + LearningEdge
siblings) traced live: they're in the explainability renderer, not the
guidance-surfacing path; always-additive at RRF scale (live activation
~0.723 >> thresholds), no misbehavior. Intentionally left unchanged with
rationale — config-ifying display verbosity is out of proportion to zero
functional impact. Every High/Med remediated (Epic 2), every Low decided.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(rrf-scale-001): Tier 2 integration + Tier 3 live verification (Epic 4)
Tier 3 live e2e (verification.md): the score-gate fix revives the
dormant guidance loop on the live stack —
- /v1/jiminy/guide guidance items 0 -> 10, source_counts.constraints
0 -> 2, patterns 0 -> 3 (acceptance #1 MET).
- Full loop warm->latest->feedback->outcome: TSDB constraint_outcomes
sink REVIVED — fresh rows dated 2026-06-03 (table was dead since
May 1). Constraint-effectiveness Grafana sink is live again.
Three adjacent issues surfaced during live smoke, documented as distinct
follow-ups (NOT score-scale, not bolted on):
- A: Neo4j GUIDANCE_OUTCOME edges still dormant — guidance SourceNodes
point at emergent_concept nodes; PersistGuidanceOutcome only writes
edges for constraint/correction/pattern/learning or role_type=
constraint targets. Node-type-targeting bug, independent of RRF.
Candidate sprint JIMINY-OUTCOME-001.
- B: LLM guidance synthesis timeout (now that synthesis runs).
- C: /v1/jiminy/latest unescaped control chars break jq/json parsers —
the hook uses jq, so may compound dormancy. Low-effort follow-up.
Tier 2 (rrf_scale_guidance_test.go, integration tag, 2 green):
- SuggestSurfacesGuidance: constraint-matching context surfaces 7
suggestions (was 0 before fix) against live mdemg-dev.
- SuggestRejectsNoise: gibberish does not flood constraints (no
over-correction).
Cold-start note: first guide call post-restart returned constraints:0
(LLM classifier cold-model timeout -> keyword fallback); after one
warm-up call, constraints surface. Model-warmth artifact, not a fix
defect.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(rrf-scale-001): CHANGELOG + CLAUDE.md score-scale contract + post.md (Epic 5)
Final epic. CHANGELOG Unreleased gains the RRF-SCALE-001 Fixed entry.
CLAUDE.md gains a 'score-scale contract' architecture note — the
structural defense against a 4th instance: downstream consumers MUST
NOT hardcode absolute thresholds against RetrieveResult.Score (the
scorer scale is not a stable contract); gate via config or a
scale-invariant signal, and re-audit on any scorer change. Notes that
NormalizedConfidence is positional (not a safe sole gate) and records
the three open follow-ups.
post.md: epic-by-epic, acceptance check-off (honest: #2 partial — TSDB
sink revived, Neo4j edge is distinct Follow-up A), scope note
separating the score-scale fix (done) from the 3 adjacent surfaced
issues (documented follow-ups), discipline notes (cold-start mask,
inner authority gate).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(rrf-scale-001): skip guidance integration tests on empty environment (CI fix)
CI failure on PR 404: TestRRFScale_SuggestSurfacesGuidance failed in
0.02s. Root cause: the test assumed the populated local mdemg-dev space
(111 constraint nodes), but CI boots a FRESH EMPTY Neo4j with stub
embeddings (and RETRIEVAL_COLUMN_VOTING_ENABLED=false / legacy scorer).
With no data, /v1/memory/suggest returns 0 candidates, so the
'total == 0' assertion fired.
Other integration tests self-seed data or skip when prerequisites are
absent; mine relied on ambient data — wrong for a reproducible CI run.
Fix: skip when debug.retrieved_count == 0 (no retrievable data → the
score-gate fix isn't exercisable; there's nothing for the gate to admit
or reject). The test stays meaningful against a populated stack (local:
9 suggestions from 15 retrieved → PASS) and skips cleanly in CI's
empty-DB environment. Verified both paths live: populated → PASS,
empty space → retrieved_count 0 → SKIP.
The gate fix itself is validated by Tier 1 unit tests + the live Tier 3
e2e (docs/development/rrf-scale-001/verification.md); this integration
test is a bonus live-stack assertion, not the primary proof.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-outcome-001): sprint plan — revive Neo4j GUIDANCE_OUTCOME sink
Follow-up A from RRF-SCALE-001: the Neo4j GUIDANCE_OUTCOME edge sink has
been dormant since Apr 12. Root cause: matchConstraintCode links guidance
items to constraint codes by keyword overlap (>=3 shared words), but
retrieval surfaces emergent_concept abstractions whose content does not
share 3+ literal words with raw constraint text -> no constraint_code ->
PersistGuidanceOutcome falls back to the concept SourceNode -> the
role_type=constraint filter rejects it -> no edge. Live-proven: all 17
recent outcome rows had constraint_code=(none).
Fix (Option 1): switch the matcher to embedding cosine similarity
(content already normalized to natural language ~0.70 cosine; Service
has an embedder; cosineSimilarity + embed->cosine pattern already exist
in-package via OutcomeClassifier). Existing PersistGuidanceOutcome +
findConstraintNodeID then create edges on the correct constraint nodes.
Keyword matching stays as fallback -- never regresses.
4 epics; ~1-1.5 dev-days; config-driven threshold; acceptance bar = a
fresh Neo4j GUIDANCE_OUTCOME edge on a real role_type=constraint node
dated today, reflected in GetConstraintEffectiveness.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(jiminy): embedding-similarity constraint-code matching (JIMINY-OUTCOME-001 Epic 1)
Revives the Neo4j GUIDANCE_OUTCOME edge sink (dormant since Apr 12).
Root cause (RRF-SCALE-001 Follow-up A): matchConstraintCode links
guidance items to constraint codes by keyword overlap (>=3 shared
words), but retrieval surfaces emergent_concept abstractions whose
content rarely shares 3+ literal words with raw constraint text -> no
code -> PersistGuidanceOutcome falls back to the concept SourceNode ->
the role_type=constraint filter rejects it -> no edge.
Fix: new matchConstraintCodeByEmbedding queries the constraint vector
index (db.index.vector.queryNodes, role_type=constraint, sim >=
threshold) and returns the closest constraint's code. Guide() tries
this first, falling back to the keyword matcher when the embedder is
unavailable, content is empty, or nothing clears the threshold — never
regresses. The existing PersistGuidanceOutcome + findConstraintNodeID
then create the edge on the correct constraint node.
Implementation refinement vs plan: uses Neo4j's vector index server-side
(mirrors the proven Evaluator.findMatchingConstraints pattern) rather
than loading all constraint embeddings into Go and computing cosine in a
loop — cleaner, no constraintCodeEntry.Embedding needed. Same Option-1
outcome.
Config: JIMINY_CONSTRAINT_CODE_SIM_THRESHOLD (default 0.55, zero-value
fallback) — provisional; tuned against the live similarity distribution
in Epic 2.
Tier 1 (4 tests): nil-driver/empty-embedding guards, threshold default
resolution, keyword-fallback non-regression. Full jiminy + config
suites green; lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(jiminy-outcome-001): Tier 2 integration + Tier 3 live verification (Epic 2)
Tier 3 live e2e (verification.md) — acceptance bar MET:
- /v1/jiminy/guide now yields guidance items carrying constraint_codes
(10 items, 6 coded; was 0). Matched code 'no-direct-main-commits' is
semantically exact for the 'commit to main' context.
- Full warm->latest->feedback loop: Neo4j GUIDANCE_OUTCOME 893 -> 899
(+6), latest today. All 6 new edges land on REAL role_type=constraint
nodes ('CONSTRAINT: NEVER commit directly to main') — not
emergent_concept. The sink dormant since Apr 12 is revived on the
correct nodes.
- /v1/constraints/effectiveness reflects it: 'NEVER commit directly to
main | surfaced: 30 followed: 28 rate: 0.93'.
- Both sinks now revived: TSDB (RRF-SCALE-001) + Neo4j (here). The
constraint-effectiveness loop is fully restored.
Threshold 0.55 validated live: correct matches, no false positives.
Tier 2 (jiminy_outcome_test.go, integration tag, skip-on-empty): PASSES
on a populated stack with an idle LLM (7/10 items coded). The guide path
is LLM-latency-dependent (per-node classifier ~31s/call, serialized; a
call fired while the LLM is busy fast-fails empty), so the test
warm-retries and SKIPS (never false-fails) when the LLM path can't
produce items. Bonus check; Tier 3 is the definitive proof. The LLM
serialization/synthesis-timeout is RRF-SCALE-001 Follow-up B, tracked
separately.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-outcome-001): CHANGELOG + CLAUDE.md + post.md (Epic 3)
Final epic. CHANGELOG Unreleased gains the JIMINY-OUTCOME-001 Fixed
entry. CLAUDE.md gains a guidance-outcome constraint-code-matching note
(embedding-first via vector index, keyword fallback; both outcome sinks
now live). post.md: epic-by-epic, acceptance check-off, the loop-revival
completion (TSDB from RRF-SCALE-001 + Neo4j here), discipline notes (LLM
serialization is the test-flakiness source), forward-looking (Follow-up
B now the most operationally-visible remaining issue).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(guidance-synth-001): sprint plan — fix guidance synthesis timeout (Follow-up B)
Synthesis fails on every production warm call (6/6 jiminy.synthesize
errored). Root cause: the hook's /warm path runs background Guide() with
a hardcoded 30s timeout (handlers_jiminy.go:302), inside which the
per-node constraint classifier runs SERIALLY (~1.5s x ~10 nodes ~= 15s),
leaving only ~15s for synthesis which needs 8-27s -> deadline exceeded.
JIMINY_TIMEOUT_MS=240s is configured but the 30s hardcode caps it.
Fix (both): (1) parallelize the per-node classifier with bounded
concurrency (CONSULTING_CLASSIFY_CONCURRENCY, default 4 matching
llama-server --parallel 4); (2) config-drive the warm timeout
(JIMINY_WARM_COMPUTE_TIMEOUT_MS, default 90s). Acceptance: synthesis
succeeds live (no synthesis_error), measured latency drop, no
constraint-surfacing regression.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(consulting): parallelize per-node constraint classifier (GUIDANCE-SYNTH-001 Epic 1)
The per-node LLM constraint classifier in findApplicableConstraints ran
serially (~1.5s/node x ~10 nodes ~= 15s), starving guidance synthesis of
its time budget (synthesis 6/6 errored on the warm path). Now classifies
with bounded concurrency.
- Gate-first (RRF-SCALE-001 score floor) to fix a stable candidate order,
then classify each candidate into a position-indexed slot via a
semaphore-bounded worker pool, then collect-in-order + dedup-by-name —
output is identical to the serial path (determinism). Keyword-only
(no LLM) or cap=1 runs serially (no LLM latency to hide).
- Config: CONSULTING_CLASSIFY_CONCURRENCY (default 4, matching
llama-server --parallel 4; floor 1 = serial rollback). Zero-value
fallback to 4.
- Extracted constraintClassifierIface (minimal Classify surface) so the
concurrent path is unit-testable with a fake; *ConstraintClassifier
satisfies it; SetConstraintClassifier guards against a typed-nil
interface.
Tier 1 (5 new, -race clean): ParallelEqualsSerial (determinism + order),
ParallelIsFaster (concurrency overlaps latency), ErrorFallsBackToKeyword
(fallback intact), ScoreGateStillApplies (RRF-SCALE-001 gate preserved),
ConcurrencyDefaultFallback. Existing findApplicableConstraints tests
unchanged — no regression.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(jiminy): config-drive warm-compute timeout (GUIDANCE-SYNTH-001 Epic 2)
The warm-path background Guide() ran with a hardcoded 30s timeout
(handlers_jiminy.go:302) even though JIMINY_TIMEOUT_MS=240s is
configured. 30s was too tight for the per-node classifier (~15s) +
synthesis (8-27s) -> synthesis deadline-exceeded every warm call.
Replaced with JIMINY_WARM_COMPUTE_TIMEOUT_MS (default 90000, zero-value
fallback 90000) — headroom for the now-parallel classifier (~7.5s) + a
slow 27s synthesis. No-hardcoding rule. Rollback: set to 30000.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test+docs(guidance-synth-001): Tier 2/3 verification + docs + close (Epic 3)
Tier 3 live e2e (verification.md): the warm production path now produces
a synthesized narrative — synthesis_used=true, no synthesis_error,
1892-char augmentation. Fresh jiminy.synthesize succeeded at 50.7s
latency (fit the new 90s budget; would die at the old 30s — validates
the default). Both fixes needed.
Tier 2 (guidance_synth_test.go, integration, skip-on-empty + LLM-
tolerant): PASS — warm path produces guidance without synthesis_error.
Docs: CHANGELOG Fixed entry; CLAUDE.md guidance-synthesis-budget note
('when adding LLM calls to the guidance hot path: respect the
warm-compute budget and prefer bounded concurrency over serial loops');
post.md with the data-driven diagnosis + forward-looking (Follow-up C
now the last open item).
Closes Follow-up B. The guidance pipeline (surfacing + codes +
synthesis) is fully functional end-to-end.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(followup-c): close JSON control-char escaping as NON-ISSUE (no fix)
Follow-up C (the last open item from RRF-SCALE-001 triage) investigated
and closed with evidence — NO code change, because there is no bug to fix.
The earlier /v1/jiminy/latest parse failures were client-side shell
artifacts (co-occurring with the session's 'failed to change group ID'
errors + ad-hoc variable-capture piping), not server bytes:
- writeJSON uses json.NewEncoder().Encode (encoding/json always escapes
control chars U+0000-U+001F); no raw-write bypass; no custom MarshalJSON.
- The synthesized narrative is double-StripControlChars'd (synthesizer.go
:127 + service.go:1116).
- prompt-context.sh already strips control chars via perl before jq, with
2>/dev/null + // empty fallbacks.
Live-verified: the hook's exact jq returns guidance_id correctly; 5 rapid
/latest fetches all parse as strict-valid JSON; 0 raw control chars.
Per 'don't fix a non-problem', shipping a fix would invent a bug that
doesn't exist. Closure documented in docs/development/followup-c-closure.md.
This closes the entire RRF-SCALE-001 follow-up triage: A (JIMINY-OUTCOME
-001), B (GUIDANCE-SYNTH-001), C (non-issue). The guidance->feedback->
outcome loop is fully functional end-to-end.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(jiminy): /guide 30s timeout sibling + single-source config defaults (GUIDANCE-SYNTH-001 fix-commit)
Two things, both from a live e2e of the full loop through the real
production hook path (run per user directive: standard tests don't find
live problems).
1. Sibling bug: the /v1/jiminy/guide handler had the SAME hardcoded 30s
cap as the warm path (handlers_jiminy.go). GUIDANCE-SYNTH-001 fixed
warm; /guide still deadline-exceeded synthesis at exactly 30.003s
(this is what made prior sprints' /guide integration tests flaky).
Now uses the config-driven budget. Live-verified: a 50.05s synthesis
completed (synthesis_used=true) — would die at 30s.
2. Single source of truth for config defaults (user directive: 'single
place to change all instances'). The 90s budget was duplicated as a
literal in 3 sites; prior sprints similarly duplicated each default
(the sigmoid 0.45/8.0 was in 3 places). Now each default is one
exported config.Default* const, referenced by FromEnv and aliased by
consuming-package fallbacks + a Config.JiminyWarmComputeTimeout()
method. Consolidated: warm-compute timeout, the 3 consulting score
floors, sigmoid midpoint/steepness, constraint-code sim threshold,
classify concurrency. Zero behavior change (compile-time aliases);
-race + full suites green.
Live e2e also re-confirmed: real hook captures guidance_id -> feedback
-> +7 Neo4j GUIDANCE_OUTCOME edges on real constraint nodes + 10 TSDB
rows (whole loop closes through the actual hook; re-confirms Follow-up C
non-issue).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-cli-001): sprint plan — federation consumer CLI + UATS backfill
Builds the first consumer for EVENTGRAPH-001's reinforcement-neighborhood
federation API (which has no consumer): a 'mdemg eventgraph' CLI command.
Validates the Pattern Y1 bet + becomes the live-testing harness for
EVENTGRAPH-002/003 (user directive: build the consumer first).
Per the UxTS directive: maps the work to the frameworks. UATS applies to
the federation HTTP API -> add eventgraph_reinforcement_neighborhood.uats
.json (backfilling the -001 gap; the endpoint shipped with no UATS),
which replaces an ad-hoc Go integration test as the Tier 2 contract test.
UVTS/UBENCH N/A. UOTS panel-spec gap noted as a follow-up (out of scope).
CLI rendering -> Tier 1 Go units.
4 epics; CLI (--seed/--query/--hops/--since/--limit/--json) renders
summary + events table or JSON; server-driven defaults (no re-hardcoding);
read-only. ~1-1.5 dev-days.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-cli-001): mdemg eventgraph reinforcement-neighborhood (Epic 1)
First consumer of the EVENTGRAPH-001 federation API — POSTs to
/v1/eventgraph/reinforcement-neighborhood, renders a summary + events table
(or --json). Supports --seed, --query (resolves seed via /v1/memory/retrieve
top-1), --hops, --since, --limit. Unset flags are omitted from the request
so the server applies its config defaults (no re-hardcoding of hops/since/limit
in the CLI). Registered under the "advanced" command group.
Tier 1 (httptest, -race clean): request-mapping omit-when-unset + conversion,
--query seed resolution, no-results + invalid --since + surfaced-503 errors,
render (empty + table), helpers.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(eventgraph): neighbor_node_ids serializes as [] not null for empty neighborhood
Caught in EVENTGRAPH-CLI-001 live contract testing (standard code tests
missed it; the live UATS happy-path against the running server did not):
walkNeighborhood returns a nil slice when the seed has no neighborhood
(e.g. an unknown seed), which JSON-marshals to `null`, while Events is
defensively initialized to []. Both are array fields and must serialize
consistently — null breaks any consumer asserting an array type (incl. the
new UATS contract's `type_is array` on $.neighbor_node_ids).
EventsInGraphNeighborhood now coalesces the nil slice to []string{}.
Tier 1 TestFederationResult_EmptyArraysNotNull pins the JSON contract.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(eventgraph-cli-001): UATS contract spec for federation API (Epic 2)
Backfills the UATS gap EVENTGRAPH-001 left (no contract test for
/v1/eventgraph/reinforcement-neighborhood). 6 cases, validated 6/6 live
against the running server:
- happy 200: asserts the response contract shape (events/neighbor_node_ids
arrays, graph_hops/tsdb_rows_scanned numbers, truncated boolean) — robust
to data, works even with an unknown seed (empty neighborhood is valid 200)
- missing_space_id / missing_seed_node_id → 400 (empty-string override, since
the runner deep-merges variant body over base — key omission can't unset)
- negative_hops → 400, hops_over_ceiling (999 > 2×default) → 400
- method_not_allowed (GET) → 405
sha256 integrity hash added + verified.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-cli-001): live verification + feature doc + CHANGELOG + close (Epic 3)
Tier 3 live e2e verified the real binary against the real stack: --query
surfaced 20 reinforcement events in a 5-node neighborhood (demonstrating the
Hebbian-write → federation-read loop closing in one command); --seed/--json/
--limit/unknown-seed/no-arg paths all verified live. Feature doc gains the CLI
consumer section; CHANGELOG Added + Fixed entries; CLAUDE.md architecture note;
verification.md + post.md (UxTS mapping: UATS done, UOTS follow-up carried over).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(eventgraph-cli-001): tag UATS spec 'tsdb' so CI skips it without TSDB
CI Test failed: the UATS contract step boots a minimal server without TSDB,
so the eventgraph service is nil and every POST returns 503 "service not
initialized" instead of the expected 200/400 (only GET→405 passed, since the
method check precedes the service check). Same class as PR #404. The
federation endpoint genuinely requires TSDB (it queries reinforcement_events;
the service is nil without TSDB at boot), and CI's UATS step already excludes
`tsdb`-tagged specs (ci.yml --exclude-tag ...,tsdb). Added "tsdb" to api.tags
(matching metrics_snapshot/readyz_tsdb); re-hashed. Verified locally: the spec
now reports Status: skip under the exact CI exclude filter, and still 6/6 live
against the full stack via explicit --spec.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): sprint plan — guidance-outcome federation (Epic 0)
Federate the guidance-outcome event stream (Pattern Y1, second event class):
walk a constraint's Neo4j neighborhood, surface time-windowed constraint_outcomes
(followed/ignored/contradicted) for the constraint + its graph-related constraints.
Data-decided architecture: reuse the existing constraint_outcomes table (no new
hypertable/writer/enqueue site — RRF-SCALE-001 already populates it, 1176 live
rows); join graph↔events on constraint_code (TSDB constraint_id UUID ≠ Neo4j
node_id CUID — code is the only viable key). One additive migration (V0023:
constraint_code index, schema 22→23). 8 epics, 3 testing tiers, live Tier 3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): V0023 constraint_code index on constraint_outcomes (Epic 1)
Adds idx_constraint_outcomes_code (space_id, constraint_code, time DESC) — the
guidance-outcome federation joins graph↔events on constraint_code (TSDB
constraint_id is a UUID that doesn't match the Neo4j node_id CUID; code is the
only viable key), and migration 011 indexed only space/constraint_id/outcome.
Partial index (constraint_code NOT NULL AND <> '') skips uncoded outcomes.
Bumps TSDB_REQUIRED_SCHEMA_VERSION default 22→23 (config.go) to match the
migration count — CI schema-version validator gates on this. Additive, no data
change, idempotent.
Live-verified: migration applies (schema 22→23), idx present, re-apply is a
no-op, config/tsdb tests green, CI schema check 23=23.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): GuidanceOutcomesInNeighborhood federation method (Epic 2)
Second Pattern Y1 federation: walk a constraint's Neo4j neighborhood, collect
each neighbor's constraint_code, and join constraint_outcomes on those codes
(backed by the V0023 index). walkNeighborhoodWithCodes returns the neighborhood
node IDs + a code→node map; queryGuidanceOutcomes pulls coded outcomes in the
window; Go-side join resolves each outcome's code → its neighborhood constraint
node. Non-nil slices from the start (EVENTGRAPH-CLI-001 lesson). Reuses the
existing constraint_outcomes sink — no new table/writer.
Tier 1 (-race): validation guards, empty-arrays-not-null, sortedKeys
determinism, join resolution. Tier 2 integration (live Neo4j+TSDB): full
round-trip — hops=1 (seed+related codes, off-neighborhood excluded), hops=0
(seed code only), unknown-seed (empty non-nil). PASS.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): guidance-outcome federation handler + route (Epic 3)
POST /v1/eventgraph/guidance-outcome-neighborhood — walk a constraint's
neighborhood, surface constraint_outcomes whose code is in the neighborhood.
Same gating/auth/default convention as the reinforcement endpoint.
Single-source refactor (per the dynamic-variables directive): extracted the
shared gate (method/enabled/service → eventgraphGate) and default-resolution
(hops/since/limit + ceiling → resolveFederationDefaults) into helpers used by
BOTH handlers, so the federation rules live in exactly one place. The
reinforcement handler now calls them too — verified no regression (reinforcement
UATS still 6/6 live, unit tests green).
Live-verified: seeding from the real 'no-direct-main-commits' constraint node
surfaced real 'followed' outcomes with constraint_node_id resolved to the seed
and in_neighborhood=true.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): mdemg eventgraph guidance-outcome-neighborhood CLI (Epic 4)
Sibling subcommand consuming POST /v1/eventgraph/guidance-outcome-neighborhood.
Walks a constraint's neighborhood and renders guidance outcomes (followed/
ignored split + table: code · outcome · sim · g_type · guidance_id · recorded)
or --json. Seed via --seed/--query (--constraint-code seeding deferred — needs
server-side code→node resolution; --query covers discovery). Unset hops/since/
limit omitted so the server applies config defaults (single source of truth).
Tier 1 (-race): request-mapping omit-when-unset + conversion, --query seed
resolution, surfaced-503 error, render (empty + followed/ignored table),
truncStr. Help renders.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(eventgraph-002): UATS contract spec for guidance-outcome federation (Epic 5)
6 cases, validated 6/6 live: happy-200 response shape (outcomes/
neighbor_node_ids/neighbor_constraint_codes arrays, graph_hops/tsdb_rows_scanned
numbers, truncated boolean), missing space_id/seed → 400 (empty-string override
under deep-merge), negative_hops → 400, hops_over_ceiling → 400, GET → 405.
Tagged 'tsdb' so CI skips it without TSDB (the EVENTGRAPH-CLI-001 lesson).
sha256 hashed + verified.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): Tier 3 live verification (Epic 6)
Real binary against the real stack. Key assertion: CLI --json output matches
direct constraint_outcomes SQL exactly (11 outcomes = 11, all followed) for the
no-direct-main-commits constraint. --seed/--query/--limit/--json/unknown-seed/
no-arg all verified live. The --query "0 outcomes" result was traced to SQL
ground truth — the 5 neighborhood codes genuinely have no feedback, so it's
correct (federation distinguishes "code in neighborhood" from "code has
outcomes"), not a join bug. Reinforcement endpoint un-regressed by the shared-
helper refactor (UATS 6/6).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): feature doc + CHANGELOG + CLAUDE.md + close (Epic 7)
Feature doc gains a Guidance-Outcome Federation section (why reuse
constraint_outcomes, why join on constraint_code, CLI usage) + forward-look
update. CHANGELOG Added (endpoint + CLI) + Changed (TSDB schema 22→23).
CLAUDE.md architecture note extended. post.md closes the sprint with UxTS
mapping + follow-ups (--constraint-code seeding, EVENTGRAPH-003, UOTS).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(metrics,backup): resolve docker binary robustly under minimal launchd PATH
The native server (launchd) inherits PATH=/usr/bin:/bin:/usr/sbin:/sbin, which
excludes the Docker Desktop symlink (/usr/local/bin/docker). So every
server-runtime `docker` shellout failed with "executable file not found in
$PATH": (1) Neo4j container CPU/mem stats (server.go) logged an ERROR every 60s
and left the neo4j_container_* gauges empty — so the neo4j_high_cpu/_memory
alert rules had no data; (2) the TSDB backup scheduler's `docker compose
pg_dump` (backup.go) failed with only a slog.Warn. The DATA PLANE was never
affected — Neo4j (Bolt) + TSDB (pgx) connect over mapped TCP ports, not the
docker CLI.
Fix (durable, configurable, single-source): new internal/dockerbin resolver —
MDEMG_DOCKER_BIN env override → exec.LookPath → well-known install locations
(/usr/local/bin, /opt/homebrew/bin, /usr/bin) → graceful unavailable. Wired
into server.go (stats) + backup.go (both compose calls). The perpetual 60s
ERROR is downgraded to a one-shot WARN when docker is genuinely absent (it's
optional telemetry). Added a sane PATH to the launchd server plist template as
defense-in-depth.
Live-verified: after restart, mdemg_neo4j_container_cpu_percent=0.59 /
mem_percent=29.13 now land in metric_samples (were absent); no more docker-stats
ERROR; `docker stats` + backup resolve docker under a simulated minimal PATH.
Note: `mdemg data export-auto` (training-export job) was NOT a victim — it
exports via network SQL, not docker (corrected from an earlier assumption).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(nosilent-001): sprint plan — fail-loud scheduled jobs (Epic 0)
Triggered by a live-discovered silent failure: the TSDB backup scheduler was
failing every 24h run (docker-under-launchd-PATH) with only a buried slog.Warn.
Docker cause fixed (4cc7608); this sprint fixes the class — scheduled jobs that
fail with no record + no alert. V0024 scheduled_job_events hypertable + writer,
jobhealth.Report (record + alert on failure), wire the 3 jobs (backup,
maintenance, export-auto), 2 evaluator rules (backup staleness + recent
failure) so the server catches "job failed OR never ran". Config-driven, 3
testing tiers, live Tier-3 induced-failure.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): V0024 scheduled_job_events + writer (Epic 1)
Hypertable (job_name, success, latency_ms, error_message, metadata jsonb,
recorded_at) + RecordJobEvent synchronous single-row writer (mirrors V0021
model_install pattern). Indexes: per-job freshness, partial failed, per-space.
One row per scheduled-job run so the alert evaluator can detect "job failed"
AND "job never ran" (staleness). Schema 23->24; TSDB_REQUIRED_SCHEMA_VERSION
bumped to match migration count (CI check 24=24).
Tier 1 (-race): field mapping, optional-nulls, error truncation, nil-pool
no-op, insert-error propagation. Tier 2 (live TSDB): round-trip + the staleness
(recent successes) + failure (recent failures) query shapes the rules will use.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): record + alert on scheduled-job outcomes (Epic 2)
New internal/jobhealth.Report — the single policy point: record a
scheduled_job_events row and fire a high-severity "scheduled-job" alert on
failure (both pool + dispatcher nil-safe). Wired into all three jobs:
- TSDB backup scheduler (internal/tsdb/backup.go): decoupled JobResultFunc hook
(mutex-guarded, -race clean) so internal/tsdb stays free of internal/alert;
server.go::SetTSDBClient sets it with the pool + s.alertDispatcher. A failed
or never-run backup now records + alerts instead of a silent slog.Warn.
- export-auto + maintenance (CLI): deferred reportScheduledJob on the named
return error — opens a short-lived pool + a file-backed dispatcher (same
~/.mdemg/alerts/current.json the hooks surface) so a separate-process CLI job
still alerts the operator.
Tier 1 (-race): jobhealth fires alert only on failure (real file-backend
dispatcher), nil-safe. Live smoke: export-auto recorded success=t latency=3050ms.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): scheduled-job staleness + failure alert rules (Epic 3)
Two server-native evaluator rules over V0024 scheduled_job_events:
- scheduled_job_recent_failure (always on): any job failure in the last
JOB_FAILURE_LOOKBACK_MIN (default 60) → high alert.
- backup_no_recent_success (gated on TSDB_BACKUP_ENABLED): zero successful
tsdb-backups within the staleness window → high alert. THIS is the "job
never ran" guarantee — it fires from the server observing ABSENT success, so
a backup that silently died or never started is caught, not just one that
errored.
Window derives from the real backup interval × 2 (JOB_BACKUP_STALENESS_HOURS
override; no hardcoded literal). JOB_HEALTH_ALERT_ENABLED master gate (default
true). Appended after DefaultRules() in serve.go.
Tier 1: failure rule always present (gt 0), staleness gated on backups-enabled
(lt 0.5), windows reflect config, non-positive fallback. Build/lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(nosilent-001): distinct services for job rules so neither masks the other
Caught in live Tier-3 testing: both evaluator rules used Service="scheduled-jobs",
and the dispatcher cooldown key is (Service, Severity) — so the failure alert's
cooldown SUPPRESSED the staleness alert (only the failure fired). One alarm
masking another is the exact silent-failure class this sprint kills. Fixed:
scheduled-job-failure / scheduled-job-staleness distinct services. Re-verified
live — both fire independently and land as distinct alert-file entries. Tier-1
assertion pins that the two services differ. Includes Tier-3 verification.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(nosilent-001): feature doc + CHANGELOG + CLAUDE.md + close (Epic 4)
Feature doc docs/features/scheduled-job-health.md (why / two mechanisms /
operator view / config). CHANGELOG Added (NOSILENT-001) + Changed (schema 23→24)
+ Fixed (docker-under-launchd-PATH). CLAUDE.md Service Alert System extended
with the scheduled-job-health note + the distinct-Service-per-rule cooldown
caveat. post.md closes the sprint (UxTS mapping + follow-ups).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(nosilent-001): sync embedded launchd server plist with source (CI)
CI "Verify embedded launchd templates match source" diffs packaging/launchd/*
against internal/cli/launchd_templates/* (the embed.FS copy mdemg service
install uses). The PATH addition landed only in the source copy; sync the
embedded copy so they match byte-for-byte.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(roadmap): add jiminy-governance skill build-out (Workstream C, Action 7)
Records the jiminy-governance Claude Code skill on the active forward roadmap
(SPRINT_ROADMAP_POST_FT_LORA.md, cross-cutting governance) + brings the source
spec into the repo (docs/development/jiminy-governance-skill/SKILL.md, out of
~/Downloads). The skill makes Jiminy the deterministic source of context +
governance over J17, enforced by the PreToolUse hook — a routing/handshake shim,
not a rulebook. Build-out scope notes the wire-up placeholders that must be
resolved against the real instance (Jiminy MCP/endpoint, PreToolUse hook, J17
ack/RetireCode/GUIDANCE_OUTCOME calls). Aligns with the now-live guidance loop
(RRF-SCALE-001 / JIMINY-OUTCOME-001 / GUIDANCE-SYNTH-001).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-governance): resolve skill wire-up against the real instance
Step 1 of the jiminy-governance build-out (roadmap Workstream C Action 7).
Resolved all five placeholders from the running MDEMG instance, verified live:
- Jiminy query: MCP `mdemg mcp` (stdio) → jiminy_guide/validate_changes; HTTP
/v1/jiminy/guide (returns guidance_id) + /bootstrap glossary (j17v1) + /latest.
- PreToolUse: pre-bash-check.py (Bash, fail-closed) + pre-write-check.py
(Write/Edit → /v1/jiminy/classify, /strict-only, fail-open).
- SessionID: claude-core convention.
- Comprehension ack: /v1/jiminy/protocol/feedback (verified ingested 1).
- GUIDANCE_OUTCOME: /v1/jiminy/feedback {guidance_id,…}.
- RetireCode: internal-only by design (RSIC/APE protocol-evolution) — no
agent-facing call; agent must never self-retire a constraint.
Also surfaced the two real integration gaps the build-out must close (the work,
not the prose): (a) the MDEMG MCP server is NOT registered (.mcp.json absent —
context is pushed by prompt-context.sh, not pulled by the agent); (b) PreToolUse
enforcement is /strict-gated + fail-open, so not deterministic-by-default.
Roadmap Action 7 updated to reflect step 1 done + the two gaps as next steps.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(jiminy-governance): ship the J17 governance skill + register MDEMG MCP
Build-out of roadmap Workstream C Action 7 (steps 2-4), per the resolved
wire-up. Closes the two integration gaps found while resolving:
- gap 1 (MCP not registered): .mcp.json registers `mdemg mcp` (stdio).
Live-probed: 20 tools incl. jiminy_guide + validate_changes — the agent can
now PULL guidance, not only receive the hook push.
- gap 2 (enforcement /strict-gated + fail-open): policy set — Write/Edit J17
gate kept fail-open (hard server dependency on every edit is too brittle);
the skill's handshake auto-enables /strict so the gate is active per session.
Bash gate already fail-closed (demonstrated live when a test payload with a
destructive force-push string was blocked by pre-bash-check.py).
Skill authored at the canonical .claude/skills/jiminy-governance/SKILL.md
(frontmatter valid; concrete wire-up inline — MCP tools + HTTP endpoints +
SessionID claude-core + the 5-step handshake). Kept a routing/handshake shim,
not a rulebook (rules stay in the graph).
Live Tier-3 PASSED: full handshake identify->request->comprehend->act->report
ran against the real instance; GUIDANCE_OUTCOME edges 906->909 (one per coded
constraint). Verification: docs/development/jiminy-governance-skill/.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-governance): commit install-ready skill + install README
.claude/ is gitignored (per-developer local config), so the installed skill at
.claude/skills/jiminy-governance/SKILL.md is local-only by convention. Commit
the reproducible, install-ready copy (jiminy-governance.skill.md) + a README
with the one-line install (cp into .claude/skills/) so the skill propagates via
the repo. The MCP server it uses is registered in the tracked repo-root
.mcp.json.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): per-conversation SessionID across hooks + skill
Hooks and the jiminy-governance skill hardcoded session_id="claude-core", so
trust/escalation/observations from ALL Claude Code conversations collapsed into
one shared MDEMG session. Claude Code already passes a per-conversation
session_id on stdin to every hook — the implementation just never used it.
Resolver precedence (single rule everywhere): MDEMG_SESSION_ID env (stable-
identity escape hatch) > Claude Code stdin session_id (per-conversation default,
race-free per hook) > ~/.mdemg/.claude-session (published by SessionStart +
UserPromptSubmit for the agent / stdin-less contexts) > claude-core (fallback).
Realizes J17's intended per-(session,constraint) isolation.
Tracked templates updated (internal/cli/hook_templates/): session-start.sh,
prompt-context.sh, post-tool-observe.py, pre-compact.sh + Windows .ps1 variants
— every hardcoded claude-core in MDEMG calls replaced with the resolved id;
session-start/prompt-context publish the session file. Skill SessionID
instruction + handshake steps now resolve <SessionID> instead of claude-core.
Live-verified: a hook resolved a stdin session_id and published the session
file; post-tool-observe wrote an observation keyed to the per-conversation id in
Neo4j (not claude-core). bash -n / py_compile clean; go build + hooks test pass.
Note: pre-write-check.py is local-only (no tracked installer template); its fix
is applied on this machine but won't propagate via `mdemg hooks install` until
it's added to the tracked hooks (follow-up). .claude/* is gitignored, so the
live hook copies aren't committed — the tracked templates propagate via install.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): per-conversation SessionID — installed hook copies
The .claude/hooks/ installed copies are tracked (committed before the .claude/*
ignore), so apply the same SessionID resolver here as in the embedded templates
(prior commit). session-start.sh, prompt-context.sh, post-tool-observe.py,
pre-compact.sh now resolve MDEMG_SESSION_ID env > stdin session_id >
~/.mdemg/.claude-session > claude-core. (pre-write-check.py is untracked /
local-only — its fix lives on this machine only; tracking it is the follow-up.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): add pre-write-check.py to the tracked installer
The /strict J17 Write/Edit classify gate (pre-write-check.py) was local-only —
no tracked template, so `mdemg hooks install` never installed it and its
SessionID fix wouldn't propagate. Add it as a tracked template
(internal/cli/hook_templates/pre-write-check.py, space_id → {{SPACE_ID}},
runtime URL discovery — no {{MDEMG_URL}} placeholder per the template
convention) and register it in claudeHookFiles() as
{PreToolUse, 8s, "Write|Edit"}. hooks_test expectations updated 5→6.
go build + hooks test + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* release: cut v0.10.1
Promote CHANGELOG [Unreleased] → [0.10.1] - 2026-06-08. Adds the
jiminy-governance skill + MDEMG MCP registration and the per-conversation
SessionID work to the release notes (alongside the already-logged EVENTGRAPH-002,
EVENTGRAPH-CLI-001, NOSILENT-001, the docker-PATH fix, and the TSDB schema
22→23→24 bumps). Fresh empty [Unreleased]; comparison link refs updated +
backfilled through v0.10.1.
The v0.10.1 git tag (triggers release.yml artifact build) + homebrew formula
bump are the operator release-cut step on main, post-merge.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs: governance system doc + bring cli/api references current
(1) New docs/features/jiminy-governance.md — detailed how-it-works + full file
inventory for the J17 agent-governance system (skill, hooks, SessionID, MCP,
enforcement, runtime state files, install/verify steps).
(2) docs/user/api-reference.md — add the Event Graph Federation section
(POST /v1/eventgraph/{reinforcement,guidance-outcome}-neighborhood) + TOC entry;
these were the only two missing endpoints (audited the full route table).
(3) docs/user/cli-reference.md — add mdemg eventgraph {reinforcement,guidance-
outcome}-neighborhood, model run, watchdog status, migrate context-fingerprint,
data curate/validate/clean; fix the stale `model pull --adapter` description
(MODEL-DIST-002 shipped — no longer "deferred/errors"); update the Command Tree
Summary to match.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* chore(submodule): bump homebrew-mdemg to v0.10.1 formula
Point the parent at the manually-published v0.10.1 homebrew formula
(reh3376/homebrew-mdemg@10c1843). The release artifacts published cleanly;
the formula update was manual because the CI HOMEBREW_TAP_TOKEN expired
(follow-up: rotate the secret so future releases auto-publish).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): sprint plan — reinforcement coverage for other Hebbian paths
Wire the 3 remaining Hebbian write paths (CoactivateSession,
ApplySymbolCoactivation, ApplyNegativeFeedback weaken-only) into the existing
reinforcement_events writer via distinct trigger_path values. No schema/writer/
wiring change (V0022 already has trigger_path + signed delta_weight +
created_new_edge; writer already injected). Contradict path deferred (CONTRADICTS
edges aren't traversed by the federation walk). RETURN-only Cypher edits; Tier-2
asserts unchanged weights. 5 epics, 3 tiers, live Tier-3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire CoactivateSession into reinforcement_events (Epic 1)
CoactivateSession (session-internal conversation-observation co-activation, full
Hebbian formula) now emits per-pair reinforcement events with
trigger_path=coactivate_session. RETURN-only Cypher change: replaced the
discarded `count(*)` with the standard 17-field per-pair RETURN (one row per
forward edge; reverse is a mirror). Weight SET untouched → update behavior
provably unchanged. Mirrors the proven ApplyCoactivation record loop; writer
already injected. EXPLAIN-validated (compiles, all RETURN vars in scope, no
writes); build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplySymbolCoactivation into reinforcement_events (Epic 2)
SymbolNode-pair co-activation now emits trigger_path=apply_symbol_coactivation
rows. Split the weight update out of the ON MATCH clause into a separate SET so
the pre-update weight (w) can be captured for prev/new/delta — createdNew
(evidence_count=1) keeps a fresh edge at 0.1 and increments matches by +0.05,
preserving the original ON-clause weight behavior exactly. eta/surprise/
activation/path_sim are NULL (N/A for symbols); roles default 'symbol_node'.
EXPLAIN-validated; build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplyNegativeFeedback weaken path → reinforcement_events (Epic 3)
The weaken path (existing CO_ACTIVATED_WITH edge weakened by negWeight) now emits
trigger_path=apply_negative_feedback rows with a NEGATIVE delta_weight and
created_new_edge=false. The FOREACH writes (weaken SET + contradict MERGE) are
untouched; only the RETURN changed from aggregated `action,count(*)` to per-pair
rows — the Go side counts rows (sum = grouped count, NegativeFeedbackResult
preserved) and emits reinforcement events for weaken rows only. prevWeight is
captured before the FOREACH SET. Contradict path deliberately not emitted
(CONTRADICTS isn't traversed by the federation walk; deferred). EXPLAIN-validated;
build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(conversation): inject learning service so CoactivateSession actually runs
Discovered via EVENTGRAPH-003 live smoke: session co-activation
(CO_ACTIVATED_WITH edges between same-session conversation observations) had
NEVER fired — 0 such edges ever in mdemg-dev across 5495 conversation
observations. Root cause: conversation.NewServiceWithConfig sets
learningService=nil ("set via SetLearningService to avoid circular dependency"),
but SetLearningService had NO caller, so the `if s.learningService != nil` guard
in Observe() always skipped CoactivateSession. The function + its Cypher were
correct (verified by running it directly: 3 pairs, proper Hebbian weights) —
it was just never invoked.
Fix: convSvc.SetLearningService(lea) at construction. Live-verified: 3 distin…
* fix(eventgraph-001): Grafana panel uses TSDB instead of unconfigured Prometheus datasource
The Epic 6 panel used datasource {type: prometheus, uid: prometheus} but
this Grafana instance has no Prometheus datasource configured — mdemg
exposes counters as JSON via /v1/metrics/snapshot, not a /metrics scrape
endpoint. Configured datasources: mdemg-nodegraph, neo4j, timescaledb
only. The panel rendered "No data" in the live Grafana.
Rewritten panel queries the reinforcement_events hypertable directly via
the timescaledb postgres datasource. Two targets:
1. count(*) over 1-minute time_buckets → overall events/min
2. count(*) FILTER (WHERE created_new_edge) vs WHERE NOT created_new_edge
→ split between new connections formed and existing connections
strengthened (the operational dimension the analytic queries
actually need)
Both targets templated on $space_id (existing dashboard variable). The
Prometheus counters (mdemg_eventgraph_writer_rows_{enqueued,dropped,
flush_failure}_total) remain wired and incrementing — they surface via
/v1/metrics/snapshot for ops scripts. The Grafana panel now actually
displays data instead of relying on a scrape path that doesn't exist
in this deployment.
Discovered during post-merge live verification (2026-05-29). Verified
fix: reloaded dashboard via Grafana API → /api/ds/query against same
SQL returns 1-minute buckets matching TSDB direct count. Audit harness
now reports 2 PASS for the new panel (previously SKIP — no SQL target).
verification.md updated with the post-merge transcript.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(rrf-scale-001): sprint plan — RRF score-scale consumer remediation
P0 fix. The Jiminy guidance->feedback->outcome loop has been dormant
~9 weeks: consulting/service.go gates constraint/suggestion extraction
on hardcoded legacy-scale score thresholds (r.Score < 0.55 et al.).
Phase 13.1 RRF (default-on May 3) dropped the score scale so strong
matches top out ~0.53 -> 0/10 results clear the gates -> empty guidance
-> dead loop. Third instance of the RRF-score-contract bug class (after
the EVENTGRAPH-001 Activation drop).
12-section format; 6 epics; config-driven percentile-gate fix +
sigmoid recalibration; live-verify the revived loop end-to-end.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(rrf-scale-001): Epic 1 audit findings — 12 sites cataloged
Full-repo sweep of post-RRF score/activation/confidence consumers +
live score-distribution sampling. Findings:
- HIGH (4): consulting constraint gates (1005/1081/1087) + confidence
sigmoid midpoint 1.5 (35-36) — the loop-killer cluster.
- MED (5): consulting conflict gates (931/944/957/981) + minConfidence
pre-filter (619, already config-driven).
- LOW (3): retrieval/jiminy.go Activation display gates (45/155/192) —
explanation text only, no guidance gating.
- NONE (2): jiminy trial score (0-10 scale), trust-score clamp.
Live distribution: RRF strong-match top scores cluster 0.49-0.58; the
0.55 gate sits mid-band, rejecting the most-relevant constraint half
the time. NormalizedConfidence is positional rank (spreads 100->0 even
on uniform-score sets) -> rules out plan Option A (percentile) as sole
gate. Remediation: config-driven RRF-calibrated absolute thresholds
(Option B), constraint floor default 0.45, sigmoid midpoint ->0.45.
Disclosed deviation per feedback_plan_options_pattern.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(consulting): RRF-calibrate score gates + confidence sigmoid (RRF-SCALE-001 Epic 2)
Revives the dormant Jiminy guidance loop. Replaces 7 hardcoded legacy-
scale score gates in consulting/service.go + the score->confidence
sigmoid (both copies) with config-driven, RRF-calibrated values.
Gates (all default 0.45, RRF strong-match band is 0.49-0.58):
- constraint extraction (was <0.55) -> CONSULTING_CONSTRAINT_SCORE_FLOOR
- keyword/name authority inner gate (0.55/0.6) -> CONSULTING_AUTHORITY_SCORE_FLOOR
- conflict/contradiction detection (0.6-0.7) -> CONSULTING_CONFLICT_SCORE_FLOOR
Key Epic-2 finding: keywordClassifyConstraint has an INNER authority
gate that binds tighter than the outer constraint gate. If authority
floor > constraint floor, the binding gate re-rejects the strong-match
band and the loop stays dormant -> all three default to 0.45. The RRF
band is too compressed to subdivide into tiers; knobs stay separate so
operators can raise any one independently.
Sigmoid (score->confidence), both consulting/service.go and
jiminy/retrieval_source.go (they MUST stay in sync per their own
comments): midpoint 1.5 -> 0.45, steepness 1.5 -> 8.0, config-driven via
RETRIEVAL_CONFIDENCE_SIGMOID_{MIDPOINT,STEEPNESS}. Legacy crushed a
strong 0.5 match to 0.18 confidence; recalibrated maps it to 0.60
(0.1->0.06, 0.58->0.74). normalizeRetrievalConfidence is now a Service
method reading cfg with zero-value fallback; mapRetrievalToGuidance
takes the sigmoid params from its caller's cfg.
5 new config knobs, all with RRF-calibrated defaults + zero-value
guards (no-hardcoding rule; the bug WAS a hardcoded value).
Tier 1 tests: updated 2 legacy-scale boundary tests to the new
thresholds + added RRFStrongMatchBand regression (0.50 must surface),
ConstraintFloor_ConfigDriven (override honored), and
NormalizeRetrievalConfidence_RRFCalibration (band mapping). Full
consulting + jiminy + config suites green; lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(rrf-scale-001): Epic 3 — remaining LOW findings reviewed + decided
retrieval/jiminy.go Activation display gates (45/155/192 + LearningEdge
siblings) traced live: they're in the explainability renderer, not the
guidance-surfacing path; always-additive at RRF scale (live activation
~0.723 >> thresholds), no misbehavior. Intentionally left unchanged with
rationale — config-ifying display verbosity is out of proportion to zero
functional impact. Every High/Med remediated (Epic 2), every Low decided.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(rrf-scale-001): Tier 2 integration + Tier 3 live verification (Epic 4)
Tier 3 live e2e (verification.md): the score-gate fix revives the
dormant guidance loop on the live stack —
- /v1/jiminy/guide guidance items 0 -> 10, source_counts.constraints
0 -> 2, patterns 0 -> 3 (acceptance #1 MET).
- Full loop warm->latest->feedback->outcome: TSDB constraint_outcomes
sink REVIVED — fresh rows dated 2026-06-03 (table was dead since
May 1). Constraint-effectiveness Grafana sink is live again.
Three adjacent issues surfaced during live smoke, documented as distinct
follow-ups (NOT score-scale, not bolted on):
- A: Neo4j GUIDANCE_OUTCOME edges still dormant — guidance SourceNodes
point at emergent_concept nodes; PersistGuidanceOutcome only writes
edges for constraint/correction/pattern/learning or role_type=
constraint targets. Node-type-targeting bug, independent of RRF.
Candidate sprint JIMINY-OUTCOME-001.
- B: LLM guidance synthesis timeout (now that synthesis runs).
- C: /v1/jiminy/latest unescaped control chars break jq/json parsers —
the hook uses jq, so may compound dormancy. Low-effort follow-up.
Tier 2 (rrf_scale_guidance_test.go, integration tag, 2 green):
- SuggestSurfacesGuidance: constraint-matching context surfaces 7
suggestions (was 0 before fix) against live mdemg-dev.
- SuggestRejectsNoise: gibberish does not flood constraints (no
over-correction).
Cold-start note: first guide call post-restart returned constraints:0
(LLM classifier cold-model timeout -> keyword fallback); after one
warm-up call, constraints surface. Model-warmth artifact, not a fix
defect.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(rrf-scale-001): CHANGELOG + CLAUDE.md score-scale contract + post.md (Epic 5)
Final epic. CHANGELOG Unreleased gains the RRF-SCALE-001 Fixed entry.
CLAUDE.md gains a 'score-scale contract' architecture note — the
structural defense against a 4th instance: downstream consumers MUST
NOT hardcode absolute thresholds against RetrieveResult.Score (the
scorer scale is not a stable contract); gate via config or a
scale-invariant signal, and re-audit on any scorer change. Notes that
NormalizedConfidence is positional (not a safe sole gate) and records
the three open follow-ups.
post.md: epic-by-epic, acceptance check-off (honest: #2 partial — TSDB
sink revived, Neo4j edge is distinct Follow-up A), scope note
separating the score-scale fix (done) from the 3 adjacent surfaced
issues (documented follow-ups), discipline notes (cold-start mask,
inner authority gate).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(rrf-scale-001): skip guidance integration tests on empty environment (CI fix)
CI failure on PR 404: TestRRFScale_SuggestSurfacesGuidance failed in
0.02s. Root cause: the test assumed the populated local mdemg-dev space
(111 constraint nodes), but CI boots a FRESH EMPTY Neo4j with stub
embeddings (and RETRIEVAL_COLUMN_VOTING_ENABLED=false / legacy scorer).
With no data, /v1/memory/suggest returns 0 candidates, so the
'total == 0' assertion fired.
Other integration tests self-seed data or skip when prerequisites are
absent; mine relied on ambient data — wrong for a reproducible CI run.
Fix: skip when debug.retrieved_count == 0 (no retrievable data → the
score-gate fix isn't exercisable; there's nothing for the gate to admit
or reject). The test stays meaningful against a populated stack (local:
9 suggestions from 15 retrieved → PASS) and skips cleanly in CI's
empty-DB environment. Verified both paths live: populated → PASS,
empty space → retrieved_count 0 → SKIP.
The gate fix itself is validated by Tier 1 unit tests + the live Tier 3
e2e (docs/development/rrf-scale-001/verification.md); this integration
test is a bonus live-stack assertion, not the primary proof.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-outcome-001): sprint plan — revive Neo4j GUIDANCE_OUTCOME sink
Follow-up A from RRF-SCALE-001: the Neo4j GUIDANCE_OUTCOME edge sink has
been dormant since Apr 12. Root cause: matchConstraintCode links guidance
items to constraint codes by keyword overlap (>=3 shared words), but
retrieval surfaces emergent_concept abstractions whose content does not
share 3+ literal words with raw constraint text -> no constraint_code ->
PersistGuidanceOutcome falls back to the concept SourceNode -> the
role_type=constraint filter rejects it -> no edge. Live-proven: all 17
recent outcome rows had constraint_code=(none).
Fix (Option 1): switch the matcher to embedding cosine similarity
(content already normalized to natural language ~0.70 cosine; Service
has an embedder; cosineSimilarity + embed->cosine pattern already exist
in-package via OutcomeClassifier). Existing PersistGuidanceOutcome +
findConstraintNodeID then create edges on the correct constraint nodes.
Keyword matching stays as fallback -- never regresses.
4 epics; ~1-1.5 dev-days; config-driven threshold; acceptance bar = a
fresh Neo4j GUIDANCE_OUTCOME edge on a real role_type=constraint node
dated today, reflected in GetConstraintEffectiveness.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(jiminy): embedding-similarity constraint-code matching (JIMINY-OUTCOME-001 Epic 1)
Revives the Neo4j GUIDANCE_OUTCOME edge sink (dormant since Apr 12).
Root cause (RRF-SCALE-001 Follow-up A): matchConstraintCode links
guidance items to constraint codes by keyword overlap (>=3 shared
words), but retrieval surfaces emergent_concept abstractions whose
content rarely shares 3+ literal words with raw constraint text -> no
code -> PersistGuidanceOutcome falls back to the concept SourceNode ->
the role_type=constraint filter rejects it -> no edge.
Fix: new matchConstraintCodeByEmbedding queries the constraint vector
index (db.index.vector.queryNodes, role_type=constraint, sim >=
threshold) and returns the closest constraint's code. Guide() tries
this first, falling back to the keyword matcher when the embedder is
unavailable, content is empty, or nothing clears the threshold — never
regresses. The existing PersistGuidanceOutcome + findConstraintNodeID
then create the edge on the correct constraint node.
Implementation refinement vs plan: uses Neo4j's vector index server-side
(mirrors the proven Evaluator.findMatchingConstraints pattern) rather
than loading all constraint embeddings into Go and computing cosine in a
loop — cleaner, no constraintCodeEntry.Embedding needed. Same Option-1
outcome.
Config: JIMINY_CONSTRAINT_CODE_SIM_THRESHOLD (default 0.55, zero-value
fallback) — provisional; tuned against the live similarity distribution
in Epic 2.
Tier 1 (4 tests): nil-driver/empty-embedding guards, threshold default
resolution, keyword-fallback non-regression. Full jiminy + config
suites green; lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(jiminy-outcome-001): Tier 2 integration + Tier 3 live verification (Epic 2)
Tier 3 live e2e (verification.md) — acceptance bar MET:
- /v1/jiminy/guide now yields guidance items carrying constraint_codes
(10 items, 6 coded; was 0). Matched code 'no-direct-main-commits' is
semantically exact for the 'commit to main' context.
- Full warm->latest->feedback loop: Neo4j GUIDANCE_OUTCOME 893 -> 899
(+6), latest today. All 6 new edges land on REAL role_type=constraint
nodes ('CONSTRAINT: NEVER commit directly to main') — not
emergent_concept. The sink dormant since Apr 12 is revived on the
correct nodes.
- /v1/constraints/effectiveness reflects it: 'NEVER commit directly to
main | surfaced: 30 followed: 28 rate: 0.93'.
- Both sinks now revived: TSDB (RRF-SCALE-001) + Neo4j (here). The
constraint-effectiveness loop is fully restored.
Threshold 0.55 validated live: correct matches, no false positives.
Tier 2 (jiminy_outcome_test.go, integration tag, skip-on-empty): PASSES
on a populated stack with an idle LLM (7/10 items coded). The guide path
is LLM-latency-dependent (per-node classifier ~31s/call, serialized; a
call fired while the LLM is busy fast-fails empty), so the test
warm-retries and SKIPS (never false-fails) when the LLM path can't
produce items. Bonus check; Tier 3 is the definitive proof. The LLM
serialization/synthesis-timeout is RRF-SCALE-001 Follow-up B, tracked
separately.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-outcome-001): CHANGELOG + CLAUDE.md + post.md (Epic 3)
Final epic. CHANGELOG Unreleased gains the JIMINY-OUTCOME-001 Fixed
entry. CLAUDE.md gains a guidance-outcome constraint-code-matching note
(embedding-first via vector index, keyword fallback; both outcome sinks
now live). post.md: epic-by-epic, acceptance check-off, the loop-revival
completion (TSDB from RRF-SCALE-001 + Neo4j here), discipline notes (LLM
serialization is the test-flakiness source), forward-looking (Follow-up
B now the most operationally-visible remaining issue).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(guidance-synth-001): sprint plan — fix guidance synthesis timeout (Follow-up B)
Synthesis fails on every production warm call (6/6 jiminy.synthesize
errored). Root cause: the hook's /warm path runs background Guide() with
a hardcoded 30s timeout (handlers_jiminy.go:302), inside which the
per-node constraint classifier runs SERIALLY (~1.5s x ~10 nodes ~= 15s),
leaving only ~15s for synthesis which needs 8-27s -> deadline exceeded.
JIMINY_TIMEOUT_MS=240s is configured but the 30s hardcode caps it.
Fix (both): (1) parallelize the per-node classifier with bounded
concurrency (CONSULTING_CLASSIFY_CONCURRENCY, default 4 matching
llama-server --parallel 4); (2) config-drive the warm timeout
(JIMINY_WARM_COMPUTE_TIMEOUT_MS, default 90s). Acceptance: synthesis
succeeds live (no synthesis_error), measured latency drop, no
constraint-surfacing regression.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(consulting): parallelize per-node constraint classifier (GUIDANCE-SYNTH-001 Epic 1)
The per-node LLM constraint classifier in findApplicableConstraints ran
serially (~1.5s/node x ~10 nodes ~= 15s), starving guidance synthesis of
its time budget (synthesis 6/6 errored on the warm path). Now classifies
with bounded concurrency.
- Gate-first (RRF-SCALE-001 score floor) to fix a stable candidate order,
then classify each candidate into a position-indexed slot via a
semaphore-bounded worker pool, then collect-in-order + dedup-by-name —
output is identical to the serial path (determinism). Keyword-only
(no LLM) or cap=1 runs serially (no LLM latency to hide).
- Config: CONSULTING_CLASSIFY_CONCURRENCY (default 4, matching
llama-server --parallel 4; floor 1 = serial rollback). Zero-value
fallback to 4.
- Extracted constraintClassifierIface (minimal Classify surface) so the
concurrent path is unit-testable with a fake; *ConstraintClassifier
satisfies it; SetConstraintClassifier guards against a typed-nil
interface.
Tier 1 (5 new, -race clean): ParallelEqualsSerial (determinism + order),
ParallelIsFaster (concurrency overlaps latency), ErrorFallsBackToKeyword
(fallback intact), ScoreGateStillApplies (RRF-SCALE-001 gate preserved),
ConcurrencyDefaultFallback. Existing findApplicableConstraints tests
unchanged — no regression.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(jiminy): config-drive warm-compute timeout (GUIDANCE-SYNTH-001 Epic 2)
The warm-path background Guide() ran with a hardcoded 30s timeout
(handlers_jiminy.go:302) even though JIMINY_TIMEOUT_MS=240s is
configured. 30s was too tight for the per-node classifier (~15s) +
synthesis (8-27s) -> synthesis deadline-exceeded every warm call.
Replaced with JIMINY_WARM_COMPUTE_TIMEOUT_MS (default 90000, zero-value
fallback 90000) — headroom for the now-parallel classifier (~7.5s) + a
slow 27s synthesis. No-hardcoding rule. Rollback: set to 30000.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test+docs(guidance-synth-001): Tier 2/3 verification + docs + close (Epic 3)
Tier 3 live e2e (verification.md): the warm production path now produces
a synthesized narrative — synthesis_used=true, no synthesis_error,
1892-char augmentation. Fresh jiminy.synthesize succeeded at 50.7s
latency (fit the new 90s budget; would die at the old 30s — validates
the default). Both fixes needed.
Tier 2 (guidance_synth_test.go, integration, skip-on-empty + LLM-
tolerant): PASS — warm path produces guidance without synthesis_error.
Docs: CHANGELOG Fixed entry; CLAUDE.md guidance-synthesis-budget note
('when adding LLM calls to the guidance hot path: respect the
warm-compute budget and prefer bounded concurrency over serial loops');
post.md with the data-driven diagnosis + forward-looking (Follow-up C
now the last open item).
Closes Follow-up B. The guidance pipeline (surfacing + codes +
synthesis) is fully functional end-to-end.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(followup-c): close JSON control-char escaping as NON-ISSUE (no fix)
Follow-up C (the last open item from RRF-SCALE-001 triage) investigated
and closed with evidence — NO code change, because there is no bug to fix.
The earlier /v1/jiminy/latest parse failures were client-side shell
artifacts (co-occurring with the session's 'failed to change group ID'
errors + ad-hoc variable-capture piping), not server bytes:
- writeJSON uses json.NewEncoder().Encode (encoding/json always escapes
control chars U+0000-U+001F); no raw-write bypass; no custom MarshalJSON.
- The synthesized narrative is double-StripControlChars'd (synthesizer.go
:127 + service.go:1116).
- prompt-context.sh already strips control chars via perl before jq, with
2>/dev/null + // empty fallbacks.
Live-verified: the hook's exact jq returns guidance_id correctly; 5 rapid
/latest fetches all parse as strict-valid JSON; 0 raw control chars.
Per 'don't fix a non-problem', shipping a fix would invent a bug that
doesn't exist. Closure documented in docs/development/followup-c-closure.md.
This closes the entire RRF-SCALE-001 follow-up triage: A (JIMINY-OUTCOME
-001), B (GUIDANCE-SYNTH-001), C (non-issue). The guidance->feedback->
outcome loop is fully functional end-to-end.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(jiminy): /guide 30s timeout sibling + single-source config defaults (GUIDANCE-SYNTH-001 fix-commit)
Two things, both from a live e2e of the full loop through the real
production hook path (run per user directive: standard tests don't find
live problems).
1. Sibling bug: the /v1/jiminy/guide handler had the SAME hardcoded 30s
cap as the warm path (handlers_jiminy.go). GUIDANCE-SYNTH-001 fixed
warm; /guide still deadline-exceeded synthesis at exactly 30.003s
(this is what made prior sprints' /guide integration tests flaky).
Now uses the config-driven budget. Live-verified: a 50.05s synthesis
completed (synthesis_used=true) — would die at 30s.
2. Single source of truth for config defaults (user directive: 'single
place to change all instances'). The 90s budget was duplicated as a
literal in 3 sites; prior sprints similarly duplicated each default
(the sigmoid 0.45/8.0 was in 3 places). Now each default is one
exported config.Default* const, referenced by FromEnv and aliased by
consuming-package fallbacks + a Config.JiminyWarmComputeTimeout()
method. Consolidated: warm-compute timeout, the 3 consulting score
floors, sigmoid midpoint/steepness, constraint-code sim threshold,
classify concurrency. Zero behavior change (compile-time aliases);
-race + full suites green.
Live e2e also re-confirmed: real hook captures guidance_id -> feedback
-> +7 Neo4j GUIDANCE_OUTCOME edges on real constraint nodes + 10 TSDB
rows (whole loop closes through the actual hook; re-confirms Follow-up C
non-issue).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-cli-001): sprint plan — federation consumer CLI + UATS backfill
Builds the first consumer for EVENTGRAPH-001's reinforcement-neighborhood
federation API (which has no consumer): a 'mdemg eventgraph' CLI command.
Validates the Pattern Y1 bet + becomes the live-testing harness for
EVENTGRAPH-002/003 (user directive: build the consumer first).
Per the UxTS directive: maps the work to the frameworks. UATS applies to
the federation HTTP API -> add eventgraph_reinforcement_neighborhood.uats
.json (backfilling the -001 gap; the endpoint shipped with no UATS),
which replaces an ad-hoc Go integration test as the Tier 2 contract test.
UVTS/UBENCH N/A. UOTS panel-spec gap noted as a follow-up (out of scope).
CLI rendering -> Tier 1 Go units.
4 epics; CLI (--seed/--query/--hops/--since/--limit/--json) renders
summary + events table or JSON; server-driven defaults (no re-hardcoding);
read-only. ~1-1.5 dev-days.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-cli-001): mdemg eventgraph reinforcement-neighborhood (Epic 1)
First consumer of the EVENTGRAPH-001 federation API — POSTs to
/v1/eventgraph/reinforcement-neighborhood, renders a summary + events table
(or --json). Supports --seed, --query (resolves seed via /v1/memory/retrieve
top-1), --hops, --since, --limit. Unset flags are omitted from the request
so the server applies its config defaults (no re-hardcoding of hops/since/limit
in the CLI). Registered under the "advanced" command group.
Tier 1 (httptest, -race clean): request-mapping omit-when-unset + conversion,
--query seed resolution, no-results + invalid --since + surfaced-503 errors,
render (empty + table), helpers.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(eventgraph): neighbor_node_ids serializes as [] not null for empty neighborhood
Caught in EVENTGRAPH-CLI-001 live contract testing (standard code tests
missed it; the live UATS happy-path against the running server did not):
walkNeighborhood returns a nil slice when the seed has no neighborhood
(e.g. an unknown seed), which JSON-marshals to `null`, while Events is
defensively initialized to []. Both are array fields and must serialize
consistently — null breaks any consumer asserting an array type (incl. the
new UATS contract's `type_is array` on $.neighbor_node_ids).
EventsInGraphNeighborhood now coalesces the nil slice to []string{}.
Tier 1 TestFederationResult_EmptyArraysNotNull pins the JSON contract.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(eventgraph-cli-001): UATS contract spec for federation API (Epic 2)
Backfills the UATS gap EVENTGRAPH-001 left (no contract test for
/v1/eventgraph/reinforcement-neighborhood). 6 cases, validated 6/6 live
against the running server:
- happy 200: asserts the response contract shape (events/neighbor_node_ids
arrays, graph_hops/tsdb_rows_scanned numbers, truncated boolean) — robust
to data, works even with an unknown seed (empty neighborhood is valid 200)
- missing_space_id / missing_seed_node_id → 400 (empty-string override, since
the runner deep-merges variant body over base — key omission can't unset)
- negative_hops → 400, hops_over_ceiling (999 > 2×default) → 400
- method_not_allowed (GET) → 405
sha256 integrity hash added + verified.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-cli-001): live verification + feature doc + CHANGELOG + close (Epic 3)
Tier 3 live e2e verified the real binary against the real stack: --query
surfaced 20 reinforcement events in a 5-node neighborhood (demonstrating the
Hebbian-write → federation-read loop closing in one command); --seed/--json/
--limit/unknown-seed/no-arg paths all verified live. Feature doc gains the CLI
consumer section; CHANGELOG Added + Fixed entries; CLAUDE.md architecture note;
verification.md + post.md (UxTS mapping: UATS done, UOTS follow-up carried over).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(eventgraph-cli-001): tag UATS spec 'tsdb' so CI skips it without TSDB
CI Test failed: the UATS contract step boots a minimal server without TSDB,
so the eventgraph service is nil and every POST returns 503 "service not
initialized" instead of the expected 200/400 (only GET→405 passed, since the
method check precedes the service check). Same class as PR #404. The
federation endpoint genuinely requires TSDB (it queries reinforcement_events;
the service is nil without TSDB at boot), and CI's UATS step already excludes
`tsdb`-tagged specs (ci.yml --exclude-tag ...,tsdb). Added "tsdb" to api.tags
(matching metrics_snapshot/readyz_tsdb); re-hashed. Verified locally: the spec
now reports Status: skip under the exact CI exclude filter, and still 6/6 live
against the full stack via explicit --spec.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): sprint plan — guidance-outcome federation (Epic 0)
Federate the guidance-outcome event stream (Pattern Y1, second event class):
walk a constraint's Neo4j neighborhood, surface time-windowed constraint_outcomes
(followed/ignored/contradicted) for the constraint + its graph-related constraints.
Data-decided architecture: reuse the existing constraint_outcomes table (no new
hypertable/writer/enqueue site — RRF-SCALE-001 already populates it, 1176 live
rows); join graph↔events on constraint_code (TSDB constraint_id UUID ≠ Neo4j
node_id CUID — code is the only viable key). One additive migration (V0023:
constraint_code index, schema 22→23). 8 epics, 3 testing tiers, live Tier 3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): V0023 constraint_code index on constraint_outcomes (Epic 1)
Adds idx_constraint_outcomes_code (space_id, constraint_code, time DESC) — the
guidance-outcome federation joins graph↔events on constraint_code (TSDB
constraint_id is a UUID that doesn't match the Neo4j node_id CUID; code is the
only viable key), and migration 011 indexed only space/constraint_id/outcome.
Partial index (constraint_code NOT NULL AND <> '') skips uncoded outcomes.
Bumps TSDB_REQUIRED_SCHEMA_VERSION default 22→23 (config.go) to match the
migration count — CI schema-version validator gates on this. Additive, no data
change, idempotent.
Live-verified: migration applies (schema 22→23), idx present, re-apply is a
no-op, config/tsdb tests green, CI schema check 23=23.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): GuidanceOutcomesInNeighborhood federation method (Epic 2)
Second Pattern Y1 federation: walk a constraint's Neo4j neighborhood, collect
each neighbor's constraint_code, and join constraint_outcomes on those codes
(backed by the V0023 index). walkNeighborhoodWithCodes returns the neighborhood
node IDs + a code→node map; queryGuidanceOutcomes pulls coded outcomes in the
window; Go-side join resolves each outcome's code → its neighborhood constraint
node. Non-nil slices from the start (EVENTGRAPH-CLI-001 lesson). Reuses the
existing constraint_outcomes sink — no new table/writer.
Tier 1 (-race): validation guards, empty-arrays-not-null, sortedKeys
determinism, join resolution. Tier 2 integration (live Neo4j+TSDB): full
round-trip — hops=1 (seed+related codes, off-neighborhood excluded), hops=0
(seed code only), unknown-seed (empty non-nil). PASS.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): guidance-outcome federation handler + route (Epic 3)
POST /v1/eventgraph/guidance-outcome-neighborhood — walk a constraint's
neighborhood, surface constraint_outcomes whose code is in the neighborhood.
Same gating/auth/default convention as the reinforcement endpoint.
Single-source refactor (per the dynamic-variables directive): extracted the
shared gate (method/enabled/service → eventgraphGate) and default-resolution
(hops/since/limit + ceiling → resolveFederationDefaults) into helpers used by
BOTH handlers, so the federation rules live in exactly one place. The
reinforcement handler now calls them too — verified no regression (reinforcement
UATS still 6/6 live, unit tests green).
Live-verified: seeding from the real 'no-direct-main-commits' constraint node
surfaced real 'followed' outcomes with constraint_node_id resolved to the seed
and in_neighborhood=true.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): mdemg eventgraph guidance-outcome-neighborhood CLI (Epic 4)
Sibling subcommand consuming POST /v1/eventgraph/guidance-outcome-neighborhood.
Walks a constraint's neighborhood and renders guidance outcomes (followed/
ignored split + table: code · outcome · sim · g_type · guidance_id · recorded)
or --json. Seed via --seed/--query (--constraint-code seeding deferred — needs
server-side code→node resolution; --query covers discovery). Unset hops/since/
limit omitted so the server applies config defaults (single source of truth).
Tier 1 (-race): request-mapping omit-when-unset + conversion, --query seed
resolution, surfaced-503 error, render (empty + followed/ignored table),
truncStr. Help renders.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(eventgraph-002): UATS contract spec for guidance-outcome federation (Epic 5)
6 cases, validated 6/6 live: happy-200 response shape (outcomes/
neighbor_node_ids/neighbor_constraint_codes arrays, graph_hops/tsdb_rows_scanned
numbers, truncated boolean), missing space_id/seed → 400 (empty-string override
under deep-merge), negative_hops → 400, hops_over_ceiling → 400, GET → 405.
Tagged 'tsdb' so CI skips it without TSDB (the EVENTGRAPH-CLI-001 lesson).
sha256 hashed + verified.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): Tier 3 live verification (Epic 6)
Real binary against the real stack. Key assertion: CLI --json output matches
direct constraint_outcomes SQL exactly (11 outcomes = 11, all followed) for the
no-direct-main-commits constraint. --seed/--query/--limit/--json/unknown-seed/
no-arg all verified live. The --query "0 outcomes" result was traced to SQL
ground truth — the 5 neighborhood codes genuinely have no feedback, so it's
correct (federation distinguishes "code in neighborhood" from "code has
outcomes"), not a join bug. Reinforcement endpoint un-regressed by the shared-
helper refactor (UATS 6/6).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): feature doc + CHANGELOG + CLAUDE.md + close (Epic 7)
Feature doc gains a Guidance-Outcome Federation section (why reuse
constraint_outcomes, why join on constraint_code, CLI usage) + forward-look
update. CHANGELOG Added (endpoint + CLI) + Changed (TSDB schema 22→23).
CLAUDE.md architecture note extended. post.md closes the sprint with UxTS
mapping + follow-ups (--constraint-code seeding, EVENTGRAPH-003, UOTS).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(metrics,backup): resolve docker binary robustly under minimal launchd PATH
The native server (launchd) inherits PATH=/usr/bin:/bin:/usr/sbin:/sbin, which
excludes the Docker Desktop symlink (/usr/local/bin/docker). So every
server-runtime `docker` shellout failed with "executable file not found in
$PATH": (1) Neo4j container CPU/mem stats (server.go) logged an ERROR every 60s
and left the neo4j_container_* gauges empty — so the neo4j_high_cpu/_memory
alert rules had no data; (2) the TSDB backup scheduler's `docker compose
pg_dump` (backup.go) failed with only a slog.Warn. The DATA PLANE was never
affected — Neo4j (Bolt) + TSDB (pgx) connect over mapped TCP ports, not the
docker CLI.
Fix (durable, configurable, single-source): new internal/dockerbin resolver —
MDEMG_DOCKER_BIN env override → exec.LookPath → well-known install locations
(/usr/local/bin, /opt/homebrew/bin, /usr/bin) → graceful unavailable. Wired
into server.go (stats) + backup.go (both compose calls). The perpetual 60s
ERROR is downgraded to a one-shot WARN when docker is genuinely absent (it's
optional telemetry). Added a sane PATH to the launchd server plist template as
defense-in-depth.
Live-verified: after restart, mdemg_neo4j_container_cpu_percent=0.59 /
mem_percent=29.13 now land in metric_samples (were absent); no more docker-stats
ERROR; `docker stats` + backup resolve docker under a simulated minimal PATH.
Note: `mdemg data export-auto` (training-export job) was NOT a victim — it
exports via network SQL, not docker (corrected from an earlier assumption).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(nosilent-001): sprint plan — fail-loud scheduled jobs (Epic 0)
Triggered by a live-discovered silent failure: the TSDB backup scheduler was
failing every 24h run (docker-under-launchd-PATH) with only a buried slog.Warn.
Docker cause fixed (4cc7608); this sprint fixes the class — scheduled jobs that
fail with no record + no alert. V0024 scheduled_job_events hypertable + writer,
jobhealth.Report (record + alert on failure), wire the 3 jobs (backup,
maintenance, export-auto), 2 evaluator rules (backup staleness + recent
failure) so the server catches "job failed OR never ran". Config-driven, 3
testing tiers, live Tier-3 induced-failure.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): V0024 scheduled_job_events + writer (Epic 1)
Hypertable (job_name, success, latency_ms, error_message, metadata jsonb,
recorded_at) + RecordJobEvent synchronous single-row writer (mirrors V0021
model_install pattern). Indexes: per-job freshness, partial failed, per-space.
One row per scheduled-job run so the alert evaluator can detect "job failed"
AND "job never ran" (staleness). Schema 23->24; TSDB_REQUIRED_SCHEMA_VERSION
bumped to match migration count (CI check 24=24).
Tier 1 (-race): field mapping, optional-nulls, error truncation, nil-pool
no-op, insert-error propagation. Tier 2 (live TSDB): round-trip + the staleness
(recent successes) + failure (recent failures) query shapes the rules will use.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): record + alert on scheduled-job outcomes (Epic 2)
New internal/jobhealth.Report — the single policy point: record a
scheduled_job_events row and fire a high-severity "scheduled-job" alert on
failure (both pool + dispatcher nil-safe). Wired into all three jobs:
- TSDB backup scheduler (internal/tsdb/backup.go): decoupled JobResultFunc hook
(mutex-guarded, -race clean) so internal/tsdb stays free of internal/alert;
server.go::SetTSDBClient sets it with the pool + s.alertDispatcher. A failed
or never-run backup now records + alerts instead of a silent slog.Warn.
- export-auto + maintenance (CLI): deferred reportScheduledJob on the named
return error — opens a short-lived pool + a file-backed dispatcher (same
~/.mdemg/alerts/current.json the hooks surface) so a separate-process CLI job
still alerts the operator.
Tier 1 (-race): jobhealth fires alert only on failure (real file-backend
dispatcher), nil-safe. Live smoke: export-auto recorded success=t latency=3050ms.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): scheduled-job staleness + failure alert rules (Epic 3)
Two server-native evaluator rules over V0024 scheduled_job_events:
- scheduled_job_recent_failure (always on): any job failure in the last
JOB_FAILURE_LOOKBACK_MIN (default 60) → high alert.
- backup_no_recent_success (gated on TSDB_BACKUP_ENABLED): zero successful
tsdb-backups within the staleness window → high alert. THIS is the "job
never ran" guarantee — it fires from the server observing ABSENT success, so
a backup that silently died or never started is caught, not just one that
errored.
Window derives from the real backup interval × 2 (JOB_BACKUP_STALENESS_HOURS
override; no hardcoded literal). JOB_HEALTH_ALERT_ENABLED master gate (default
true). Appended after DefaultRules() in serve.go.
Tier 1: failure rule always present (gt 0), staleness gated on backups-enabled
(lt 0.5), windows reflect config, non-positive fallback. Build/lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(nosilent-001): distinct services for job rules so neither masks the other
Caught in live Tier-3 testing: both evaluator rules used Service="scheduled-jobs",
and the dispatcher cooldown key is (Service, Severity) — so the failure alert's
cooldown SUPPRESSED the staleness alert (only the failure fired). One alarm
masking another is the exact silent-failure class this sprint kills. Fixed:
scheduled-job-failure / scheduled-job-staleness distinct services. Re-verified
live — both fire independently and land as distinct alert-file entries. Tier-1
assertion pins that the two services differ. Includes Tier-3 verification.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(nosilent-001): feature doc + CHANGELOG + CLAUDE.md + close (Epic 4)
Feature doc docs/features/scheduled-job-health.md (why / two mechanisms /
operator view / config). CHANGELOG Added (NOSILENT-001) + Changed (schema 23→24)
+ Fixed (docker-under-launchd-PATH). CLAUDE.md Service Alert System extended
with the scheduled-job-health note + the distinct-Service-per-rule cooldown
caveat. post.md closes the sprint (UxTS mapping + follow-ups).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(nosilent-001): sync embedded launchd server plist with source (CI)
CI "Verify embedded launchd templates match source" diffs packaging/launchd/*
against internal/cli/launchd_templates/* (the embed.FS copy mdemg service
install uses). The PATH addition landed only in the source copy; sync the
embedded copy so they match byte-for-byte.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(roadmap): add jiminy-governance skill build-out (Workstream C, Action 7)
Records the jiminy-governance Claude Code skill on the active forward roadmap
(SPRINT_ROADMAP_POST_FT_LORA.md, cross-cutting governance) + brings the source
spec into the repo (docs/development/jiminy-governance-skill/SKILL.md, out of
~/Downloads). The skill makes Jiminy the deterministic source of context +
governance over J17, enforced by the PreToolUse hook — a routing/handshake shim,
not a rulebook. Build-out scope notes the wire-up placeholders that must be
resolved against the real instance (Jiminy MCP/endpoint, PreToolUse hook, J17
ack/RetireCode/GUIDANCE_OUTCOME calls). Aligns with the now-live guidance loop
(RRF-SCALE-001 / JIMINY-OUTCOME-001 / GUIDANCE-SYNTH-001).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-governance): resolve skill wire-up against the real instance
Step 1 of the jiminy-governance build-out (roadmap Workstream C Action 7).
Resolved all five placeholders from the running MDEMG instance, verified live:
- Jiminy query: MCP `mdemg mcp` (stdio) → jiminy_guide/validate_changes; HTTP
/v1/jiminy/guide (returns guidance_id) + /bootstrap glossary (j17v1) + /latest.
- PreToolUse: pre-bash-check.py (Bash, fail-closed) + pre-write-check.py
(Write/Edit → /v1/jiminy/classify, /strict-only, fail-open).
- SessionID: claude-core convention.
- Comprehension ack: /v1/jiminy/protocol/feedback (verified ingested 1).
- GUIDANCE_OUTCOME: /v1/jiminy/feedback {guidance_id,…}.
- RetireCode: internal-only by design (RSIC/APE protocol-evolution) — no
agent-facing call; agent must never self-retire a constraint.
Also surfaced the two real integration gaps the build-out must close (the work,
not the prose): (a) the MDEMG MCP server is NOT registered (.mcp.json absent —
context is pushed by prompt-context.sh, not pulled by the agent); (b) PreToolUse
enforcement is /strict-gated + fail-open, so not deterministic-by-default.
Roadmap Action 7 updated to reflect step 1 done + the two gaps as next steps.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(jiminy-governance): ship the J17 governance skill + register MDEMG MCP
Build-out of roadmap Workstream C Action 7 (steps 2-4), per the resolved
wire-up. Closes the two integration gaps found while resolving:
- gap 1 (MCP not registered): .mcp.json registers `mdemg mcp` (stdio).
Live-probed: 20 tools incl. jiminy_guide + validate_changes — the agent can
now PULL guidance, not only receive the hook push.
- gap 2 (enforcement /strict-gated + fail-open): policy set — Write/Edit J17
gate kept fail-open (hard server dependency on every edit is too brittle);
the skill's handshake auto-enables /strict so the gate is active per session.
Bash gate already fail-closed (demonstrated live when a test payload with a
destructive force-push string was blocked by pre-bash-check.py).
Skill authored at the canonical .claude/skills/jiminy-governance/SKILL.md
(frontmatter valid; concrete wire-up inline — MCP tools + HTTP endpoints +
SessionID claude-core + the 5-step handshake). Kept a routing/handshake shim,
not a rulebook (rules stay in the graph).
Live Tier-3 PASSED: full handshake identify->request->comprehend->act->report
ran against the real instance; GUIDANCE_OUTCOME edges 906->909 (one per coded
constraint). Verification: docs/development/jiminy-governance-skill/.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-governance): commit install-ready skill + install README
.claude/ is gitignored (per-developer local config), so the installed skill at
.claude/skills/jiminy-governance/SKILL.md is local-only by convention. Commit
the reproducible, install-ready copy (jiminy-governance.skill.md) + a README
with the one-line install (cp into .claude/skills/) so the skill propagates via
the repo. The MCP server it uses is registered in the tracked repo-root
.mcp.json.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): per-conversation SessionID across hooks + skill
Hooks and the jiminy-governance skill hardcoded session_id="claude-core", so
trust/escalation/observations from ALL Claude Code conversations collapsed into
one shared MDEMG session. Claude Code already passes a per-conversation
session_id on stdin to every hook — the implementation just never used it.
Resolver precedence (single rule everywhere): MDEMG_SESSION_ID env (stable-
identity escape hatch) > Claude Code stdin session_id (per-conversation default,
race-free per hook) > ~/.mdemg/.claude-session (published by SessionStart +
UserPromptSubmit for the agent / stdin-less contexts) > claude-core (fallback).
Realizes J17's intended per-(session,constraint) isolation.
Tracked templates updated (internal/cli/hook_templates/): session-start.sh,
prompt-context.sh, post-tool-observe.py, pre-compact.sh + Windows .ps1 variants
— every hardcoded claude-core in MDEMG calls replaced with the resolved id;
session-start/prompt-context publish the session file. Skill SessionID
instruction + handshake steps now resolve <SessionID> instead of claude-core.
Live-verified: a hook resolved a stdin session_id and published the session
file; post-tool-observe wrote an observation keyed to the per-conversation id in
Neo4j (not claude-core). bash -n / py_compile clean; go build + hooks test pass.
Note: pre-write-check.py is local-only (no tracked installer template); its fix
is applied on this machine but won't propagate via `mdemg hooks install` until
it's added to the tracked hooks (follow-up). .claude/* is gitignored, so the
live hook copies aren't committed — the tracked templates propagate via install.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): per-conversation SessionID — installed hook copies
The .claude/hooks/ installed copies are tracked (committed before the .claude/*
ignore), so apply the same SessionID resolver here as in the embedded templates
(prior commit). session-start.sh, prompt-context.sh, post-tool-observe.py,
pre-compact.sh now resolve MDEMG_SESSION_ID env > stdin session_id >
~/.mdemg/.claude-session > claude-core. (pre-write-check.py is untracked /
local-only — its fix lives on this machine only; tracking it is the follow-up.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): add pre-write-check.py to the tracked installer
The /strict J17 Write/Edit classify gate (pre-write-check.py) was local-only —
no tracked template, so `mdemg hooks install` never installed it and its
SessionID fix wouldn't propagate. Add it as a tracked template
(internal/cli/hook_templates/pre-write-check.py, space_id → {{SPACE_ID}},
runtime URL discovery — no {{MDEMG_URL}} placeholder per the template
convention) and register it in claudeHookFiles() as
{PreToolUse, 8s, "Write|Edit"}. hooks_test expectations updated 5→6.
go build + hooks test + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* release: cut v0.10.1
Promote CHANGELOG [Unreleased] → [0.10.1] - 2026-06-08. Adds the
jiminy-governance skill + MDEMG MCP registration and the per-conversation
SessionID work to the release notes (alongside the already-logged EVENTGRAPH-002,
EVENTGRAPH-CLI-001, NOSILENT-001, the docker-PATH fix, and the TSDB schema
22→23→24 bumps). Fresh empty [Unreleased]; comparison link refs updated +
backfilled through v0.10.1.
The v0.10.1 git tag (triggers release.yml artifact build) + homebrew formula
bump are the operator release-cut step on main, post-merge.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs: governance system doc + bring cli/api references current
(1) New docs/features/jiminy-governance.md — detailed how-it-works + full file
inventory for the J17 agent-governance system (skill, hooks, SessionID, MCP,
enforcement, runtime state files, install/verify steps).
(2) docs/user/api-reference.md — add the Event Graph Federation section
(POST /v1/eventgraph/{reinforcement,guidance-outcome}-neighborhood) + TOC entry;
these were the only two missing endpoints (audited the full route table).
(3) docs/user/cli-reference.md — add mdemg eventgraph {reinforcement,guidance-
outcome}-neighborhood, model run, watchdog status, migrate context-fingerprint,
data curate/validate/clean; fix the stale `model pull --adapter` description
(MODEL-DIST-002 shipped — no longer "deferred/errors"); update the Command Tree
Summary to match.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* chore(submodule): bump homebrew-mdemg to v0.10.1 formula
Point the parent at the manually-published v0.10.1 homebrew formula
(reh3376/homebrew-mdemg@10c1843). The release artifacts published cleanly;
the formula update was manual because the CI HOMEBREW_TAP_TOKEN expired
(follow-up: rotate the secret so future releases auto-publish).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): sprint plan — reinforcement coverage for other Hebbian paths
Wire the 3 remaining Hebbian write paths (CoactivateSession,
ApplySymbolCoactivation, ApplyNegativeFeedback weaken-only) into the existing
reinforcement_events writer via distinct trigger_path values. No schema/writer/
wiring change (V0022 already has trigger_path + signed delta_weight +
created_new_edge; writer already injected). Contradict path deferred (CONTRADICTS
edges aren't traversed by the federation walk). RETURN-only Cypher edits; Tier-2
asserts unchanged weights. 5 epics, 3 tiers, live Tier-3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire CoactivateSession into reinforcement_events (Epic 1)
CoactivateSession (session-internal conversation-observation co-activation, full
Hebbian formula) now emits per-pair reinforcement events with
trigger_path=coactivate_session. RETURN-only Cypher change: replaced the
discarded `count(*)` with the standard 17-field per-pair RETURN (one row per
forward edge; reverse is a mirror). Weight SET untouched → update behavior
provably unchanged. Mirrors the proven ApplyCoactivation record loop; writer
already injected. EXPLAIN-validated (compiles, all RETURN vars in scope, no
writes); build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplySymbolCoactivation into reinforcement_events (Epic 2)
SymbolNode-pair co-activation now emits trigger_path=apply_symbol_coactivation
rows. Split the weight update out of the ON MATCH clause into a separate SET so
the pre-update weight (w) can be captured for prev/new/delta — createdNew
(evidence_count=1) keeps a fresh edge at 0.1 and increments matches by +0.05,
preserving the original ON-clause weight behavior exactly. eta/surprise/
activation/path_sim are NULL (N/A for symbols); roles default 'symbol_node'.
EXPLAIN-validated; build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplyNegativeFeedback weaken path → reinforcement_events (Epic 3)
The weaken path (existing CO_ACTIVATED_WITH edge weakened by negWeight) now emits
trigger_path=apply_negative_feedback rows with a NEGATIVE delta_weight and
created_new_edge=false. The FOREACH writes (weaken SET + contradict MERGE) are
untouched; only the RETURN changed from aggregated `action,count(*)` to per-pair
rows — the Go side counts rows (sum = grouped count, NegativeFeedbackResult
preserved) and emits reinforcement events for weaken rows only. prevWeight is
captured before the FOREACH SET. Contradict path deliberately not emitted
(CONTRADICTS isn't traversed by the federation walk; deferred). EXPLAIN-validated;
build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(conversation): inject learning service so CoactivateSession actually runs
Discovered via EVENTGRAPH-003 live smoke: session co-activation
(CO_ACTIVATED_WITH edges between same-session conversation observations) had
NEVER fired — 0 such edges ever in mdemg-dev across 5495 conversation
observations. Root cause: conversation.NewServiceWithConfig sets
learningService=nil ("set via SetLearningService to avoid circular dependency"),
but SetLearningService had NO caller, so the `if s.learningService != nil` guard
in Observe() always skipped CoactivateSession. The function + its Cypher were
correct (verified by running it directly: 3 pairs, proper Hebbian weights) —
it was just never invoked.
Fix: convSvc.SetLearningService(lea) at construction. Live-verified: 3 distinct
observations in a session now create 6 CO_ACTIVATED_WITH edges + emit
coactivate_session reinforcement events. Standalone fix-commit per the
live-smoke precedent (surprise bugs don't get rolled into the sprint commit).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): Tier 3 verification + feature doc + CHANGELOG + close (Epic 4)
All four trigger_paths live-verified (apply_coactivation 50, apply_symbol_
coactivation 1000, apply_negative_feedback 1 negative-delta, coactivate_session
4 after the dormancy fix); federation CLI surfaces them. Feature doc updated to
all-four-paths + the trigger_path table; CHANGELOG Added (EVENTGRAPH-003) + Fixed
(CoactivateSession never-invoked); CLAUDE.md note + correction (CoactivateSession
was dead, not "writing via sidecar paths"); verification.md + post.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-004): sprint plan + CoactivateSession post-revival health review (Epic 0)
EVENTGRAPH-004 federates the last unfederated Hebbian write — the
ApplyNegativeFeedback contradict action — into reinforcement_events
(trigger_path=apply_negative_feedback_contradict). Data-decided scope:
reuse the existing V0022 sink (zero CONTRADICTS edges exist anywhere;
no producer calls /v1/learning/negative-feedback — instrument before
the producer arrives, the inverse of the dormancy pattern).
Also closes the EVENTGRAPH-003 follow-up: 30h post-fix health review of
the revived CoactivateSession path — no tuning needed, textbook session
cliques, pre-fix orphans stay as historical record (operator decision).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(eventgraph-004): wire ApplyNegativeFeedback contradict path → reinforcement_events (Epic 1)
The contradict action (no co-activation edge → MERGE CONTRADICTS) was the
last unfederated Hebbian write. The CONTRADICTS MERGE lived inside a
FOREACH, where the edge variable is invisible to RETURN — so the original
single statement is split into two statements in the SAME ExecuteWrite
transaction: (a) weaken (EVENTGRAPH-003 telemetry, RETURN unchanged) and
(b) contradict with a per-pair RETURN. Classification is identical: weaken
never deletes edges, so contradict's NOT EXISTS sees the same edge set the
original OPTIONAL MATCH did.
Contradict rows land with trigger_path=apply_negative_feedback_contradict.
created_new_edge detected via `c.updated_at IS NULL` (ON MATCH always sets
it; ON CREATE never does — invariant pinned by comment). delta_weight is
the CONTRADICTS edge's OWN weight delta (+negWeight on create, 0 on
re-match); negative-feedback semantics are carried by trigger_path, not
the sign.
Both statements EXPLAIN-validated against live Neo4j. Tier 1: 2 new parser
tests (create/re-match branches); learning suite green; lint clean.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): Tier 3 live verification — contradict create/re-match + weaken unchanged (Epic 2)
Live against the restarted Epic-1 binary: contradict create row
(+0.15, created_new_edge=true), re-match row (delta=0, evidence=2),
weaken row byte-equivalent to pre-split behavior (negative delta,
floor at 0). Federation CLI surfaces the new trigger_path with no
read-side change. UATS learning_negative_feedback 5/5 PASS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): feature doc + CHANGELOG + UATS pin + close (Epic 3)
Feature doc: 5-path trigger_path table + delta-semantics consumer
warning (contradict delta is the CONTRADICTS edge's own weight delta —
semantics live in trigger_path, not the sign). UATS spec extended:
zero-count equals assertions on nonexistent nodes (hash refreshed,
5/5 live). CLAUDE.md architecture note + producer-gap disclosure.
Sprint close in post.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci: auto-sync dev branch with main after each squash-merged PR
Squash merges never advance the dev branch's merge-base, so every
sprint touching CHANGELOG.md/CLAUDE.md hit CONFLICTING on its next PR
(first bitten: PR #419). New sync-dev-after-merge.yml merges main back
into the source *_dev* branch after each merged PR; the GITHUB_TOKEN
push triggers no other workflows, so it can never spawn an empty
auto-PR (the PR #420 failure mode). Conflicts fail loudly for manual
resolution; workflow_dispatch enables manual runs/live testing.
auto-pr.yml additionally skips PR creation when branch content is
identical to main — guards MANUAL sync pushes, verified against the
live repo state (current dev01 ≡ main → empty=true → skip).
actionlint clean (untrusted refs passed via env, not inline).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(roadmap): Q3 2026 vision-derived roadmap from 26-agent codebase deep-dive
Full-codebase review vs MDEMG's purpose (cognitive substrate / connection
layer): 19 map agents (3 vision + 16 subsystem), 3 cross-cutting assessors,
synthesizer + adversarial completeness critic (19 revisions applied).
Verdict: server-side substrate is mature, but the system is not currently
functioning as the assistant's internal dialogue — the per-prompt delivery
channel silently no-ops (hook reads .user_prompt, Claude Code sends
.prompt), 100% of GENERALIZES edges have NULL weight (22,170/22,170,
live-verified), scheduled decay/prune has been a permanent dry-run, RSIC
validates 16/17 actions vacuously, and supervision covers 3 of ~14
background loops. Every defect is the same disease: wired-looking seams
with no caller, wrong contract, or no reader.
4 phases ≈ 75 days committed: (1) reconnect the loop ends, (2) close the
learning loops, (3) survivability + class-ending forcing functions,
(4) FT frontier + release hygiene. Top-10 ranked; deferrals explicit.
Orchestrator spot-verification annex included (5 claims re-verified live).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): sprint plan — fix hook stdin contract, reconnect per-prompt channel (Epic 0)
Roadmap Q3 Phase 1 rank #1. Audit of all 6 hooks vs the actual Claude
Code stdin schemas: prompt-context.sh reads .user_prompt (CC sends
.prompt) → channel exits silently on every prompt; post-tool-observe.py
reads tool_output (CC sends tool_response) → false "Build/test
succeeded" observations with empty output; guidance wrongly coupled to
RESULT_COUNT>0; minor pre-compact transcript jq. session-start /
pre-bash-check / pre-write-check verified correct.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): prompt-context.sh reads .prompt — revive the per-prompt channel (Epic 1)
Claude Code's UserPromptSubmit stdin field is `prompt`; the hook read
`.user_prompt`, which is always empty → exit 0 → per-prompt CMS recall,
Jiminy guidance, /strict reformulation, the warm trigger, and the
retrieve-time Hebbian reinforcement have NEVER fired in any session.
Now reads `.prompt // .user_prompt` (legacy fallback kept).
Also decouples guidance from recall: the RESULT_COUNT=0 branch no longer
exits — it printed its notice then skipped guidance + warm + retrieval
reinforcement, coupling independent deliveries.
Both copies (live + installer template). Tier 1 simulated stdin: real
.prompt payload → first-ever guidance delivery (J17 T1 bootstrap + DICT,
5363 guidance bytes vs 0 forever); legacy fallback works; short/empty/
malformed payloads exit silently (fail-open preserved).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): post-tool-observe reads tool_response — end blind "succeeded" observations (Epic 2)
Claude Code's PostToolUse stdin field is `tool_response` (string or
object); the hook read `tool_output`, which is always absent → output_str
empty → error indicators never matched → every go build/go test/pytest
Bash call was recorded as "Build/test succeeded" sight-unseen, and real
errors were never observed.
Now reads tool_response (fallback tool_output) via _response_text(),
normalizing string|dict|list (stdout/stderr join). Success classification
requires NON-EMPTY clean output — a silent success records nothing rather
than fabricating; failures land as error observations with real stderr.
Both copies (template regenerated from fixed live, {{SPACE_ID}}
placeholder preserved, verified identical modulo placeholder). Tier 1
against real CMS: failing build → error obs with stderr; passing →
progress; empty → no record.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): pre-compact transcript extraction reads the real line shape (Epic 3)
Transcript lines are {type, message:{content:[{type, text|name, ...}]}};
the old top-level `.content` read always yielded empty, so pre-compaction
snapshots never carried recent-activity context. New jq walks
.message.content[] extracting .text/.name. Verified against this
session's real transcript (old: nothing; new: real activity). Both
copies, placeholders preserved.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): Tier 3 verification + CHANGELOG + CLAUDE.md contract pin + close (Epics 4-5)
Live in the real session: first-ever guidance delivery (J17 T1 bootstrap
+ DICT, 5363 bytes vs 0 forever); real failing build → error observation
with actual compiler output in CMS. PostToolUse success-only firing
documented as a limitation. Hook stdin contract pinned in CLAUDE.md.
Drift + clique-semantics findings logged for HOOKSYNC-001 / Phase 2.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): sprint plan — drift-proof + self-monitoring hook channel (Epic 0)
Roadmap Q3 Phase 1 rank #2. Investigation grounded all five findings:
template→live drift severed alert delivery (50-entry file actively
rotating today, never shown); no Cleared lifecycle (nothing sets the
field; no /v1/alert* endpoints); no absence detection for the channel
that just had a months-long silent outage; compose publishes 9999 on
0.0.0.0; neural sidecar binds 0.0.0.0:8101 via a 39-day-old process
serving pre-J17-fix code. 8 epics: reconcile, CI parity gate, clear
lifecycle, hook_events absence rule (reuses V0024 via jobhealth),
hooks doctor, PORT-TRUTH rider, Tier 3, docs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): reconcile bidirectional hook drift — alert delivery restored to live (Epic 1)
Live hooks adopted from templates (SPACE_ID substituted): restores the
alert-display blocks (all-pending per prompt; critical/high + degraded
healthz at session start) that the live copies lacked — the NOSILENT
last mile. Reverse drift caught during reconcile: the live hook's T1/T2
bootstrap-detection block (MAX_TIER → /v1/jiminy/bootstrap → ACTIVE
CONSTRAINTS header) never existed in the template and was nearly lost —
now single-sourced in the template and regenerated into live.
Live-verified: one prompt now renders alerts (50 pending incl. live
CRITICALs) + recall + J17:INIT bootstrap + guidance + synergy footer,
coexisting. All 6 hooks byte-identical to templates modulo {{SPACE_ID}}.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci(hooksync-001): hook-template parity gate — live hooks must match templates (Epic 2)
Mirrors the compose/launchd parity pattern: every *.sh/*.py template
must byte-match .claude/hooks/ modulo the {{SPACE_ID}} placeholder.
Proven locally: passes clean, fails (with a bounded diff dump) on
deliberate drift. Ends the bidirectional-drift class that severed
alert delivery and nearly lost the T1 bootstrap block.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): alert Cleared lifecycle — display once, then delivered (Epic 3)
Alert.Cleared existed but nothing ever set it: once hooks rendered the
file, the same entries would re-render every prompt forever. New:
FileBackend.Clear (ids and/or all_before cutoff, idempotent, under the
existing lock) → Dispatcher.ClearAlerts → POST /v1/alerts/clear. Hooks
now clear exactly what they displayed (fire-and-forget, fail-open);
cleared = delivered-to-operator, not resolved — persisting conditions
re-fire via the evaluator. Alert IDs now CUIDv2 per the identifier
standard (was UnixNano; old ids remain valid opaque strings).
Live-verified lifecycle: prompt 1 → "50 pending, showing 10" + 10
cleared; prompt 2 → "40 pending, showing 10" (next batch, no re-render)
→ 20 cleared. Tier 1: Clear by-id/by-time/idempotent/no-backend. UATS
alerts_clear 3/3 live (runner falsy-body inheritance discovered:
variant bodies must be non-empty objects).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): hook-channel absence detection — the channel now self-reports outages (Epic 4)
POST /v1/hooks/event records heartbeats into V0024 scheduled_job_events
via the jobhealth policy point (job_name hook:<name>; no new sink).
Two independent heartbeats: prompt-context fires per delivery (the
monitored channel); post-tool-observe fires throttled (HOOK_HEARTBEAT_
COOLDOWN_SEC, default 300 — proves sessions ACTIVE). Evaluator rule
hook_channel_silent (distinct service per the NOSILENT cooldown rule):
sessions active + zero prompt-context fires in HOOK_SILENT_LOOKBACK_
HOURS (24) → high alert.…
* docs(rrf-scale-001): CHANGELOG + CLAUDE.md score-scale contract + post.md (Epic 5)
Final epic. CHANGELOG Unreleased gains the RRF-SCALE-001 Fixed entry.
CLAUDE.md gains a 'score-scale contract' architecture note — the
structural defense against a 4th instance: downstream consumers MUST
NOT hardcode absolute thresholds against RetrieveResult.Score (the
scorer scale is not a stable contract); gate via config or a
scale-invariant signal, and re-audit on any scorer change. Notes that
NormalizedConfidence is positional (not a safe sole gate) and records
the three open follow-ups.
post.md: epic-by-epic, acceptance check-off (honest: #2 partial — TSDB
sink revived, Neo4j edge is distinct Follow-up A), scope note
separating the score-scale fix (done) from the 3 adjacent surfaced
issues (documented follow-ups), discipline notes (cold-start mask,
inner authority gate).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(rrf-scale-001): skip guidance integration tests on empty environment (CI fix)
CI failure on PR 404: TestRRFScale_SuggestSurfacesGuidance failed in
0.02s. Root cause: the test assumed the populated local mdemg-dev space
(111 constraint nodes), but CI boots a FRESH EMPTY Neo4j with stub
embeddings (and RETRIEVAL_COLUMN_VOTING_ENABLED=false / legacy scorer).
With no data, /v1/memory/suggest returns 0 candidates, so the
'total == 0' assertion fired.
Other integration tests self-seed data or skip when prerequisites are
absent; mine relied on ambient data — wrong for a reproducible CI run.
Fix: skip when debug.retrieved_count == 0 (no retrievable data → the
score-gate fix isn't exercisable; there's nothing for the gate to admit
or reject). The test stays meaningful against a populated stack (local:
9 suggestions from 15 retrieved → PASS) and skips cleanly in CI's
empty-DB environment. Verified both paths live: populated → PASS,
empty space → retrieved_count 0 → SKIP.
The gate fix itself is validated by Tier 1 unit tests + the live Tier 3
e2e (docs/development/rrf-scale-001/verification.md); this integration
test is a bonus live-stack assertion, not the primary proof.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-outcome-001): sprint plan — revive Neo4j GUIDANCE_OUTCOME sink
Follow-up A from RRF-SCALE-001: the Neo4j GUIDANCE_OUTCOME edge sink has
been dormant since Apr 12. Root cause: matchConstraintCode links guidance
items to constraint codes by keyword overlap (>=3 shared words), but
retrieval surfaces emergent_concept abstractions whose content does not
share 3+ literal words with raw constraint text -> no constraint_code ->
PersistGuidanceOutcome falls back to the concept SourceNode -> the
role_type=constraint filter rejects it -> no edge. Live-proven: all 17
recent outcome rows had constraint_code=(none).
Fix (Option 1): switch the matcher to embedding cosine similarity
(content already normalized to natural language ~0.70 cosine; Service
has an embedder; cosineSimilarity + embed->cosine pattern already exist
in-package via OutcomeClassifier). Existing PersistGuidanceOutcome +
findConstraintNodeID then create edges on the correct constraint nodes.
Keyword matching stays as fallback -- never regresses.
4 epics; ~1-1.5 dev-days; config-driven threshold; acceptance bar = a
fresh Neo4j GUIDANCE_OUTCOME edge on a real role_type=constraint node
dated today, reflected in GetConstraintEffectiveness.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(jiminy): embedding-similarity constraint-code matching (JIMINY-OUTCOME-001 Epic 1)
Revives the Neo4j GUIDANCE_OUTCOME edge sink (dormant since Apr 12).
Root cause (RRF-SCALE-001 Follow-up A): matchConstraintCode links
guidance items to constraint codes by keyword overlap (>=3 shared
words), but retrieval surfaces emergent_concept abstractions whose
content rarely shares 3+ literal words with raw constraint text -> no
code -> PersistGuidanceOutcome falls back to the concept SourceNode ->
the role_type=constraint filter rejects it -> no edge.
Fix: new matchConstraintCodeByEmbedding queries the constraint vector
index (db.index.vector.queryNodes, role_type=constraint, sim >=
threshold) and returns the closest constraint's code. Guide() tries
this first, falling back to the keyword matcher when the embedder is
unavailable, content is empty, or nothing clears the threshold — never
regresses. The existing PersistGuidanceOutcome + findConstraintNodeID
then create the edge on the correct constraint node.
Implementation refinement vs plan: uses Neo4j's vector index server-side
(mirrors the proven Evaluator.findMatchingConstraints pattern) rather
than loading all constraint embeddings into Go and computing cosine in a
loop — cleaner, no constraintCodeEntry.Embedding needed. Same Option-1
outcome.
Config: JIMINY_CONSTRAINT_CODE_SIM_THRESHOLD (default 0.55, zero-value
fallback) — provisional; tuned against the live similarity distribution
in Epic 2.
Tier 1 (4 tests): nil-driver/empty-embedding guards, threshold default
resolution, keyword-fallback non-regression. Full jiminy + config
suites green; lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(jiminy-outcome-001): Tier 2 integration + Tier 3 live verification (Epic 2)
Tier 3 live e2e (verification.md) — acceptance bar MET:
- /v1/jiminy/guide now yields guidance items carrying constraint_codes
(10 items, 6 coded; was 0). Matched code 'no-direct-main-commits' is
semantically exact for the 'commit to main' context.
- Full warm->latest->feedback loop: Neo4j GUIDANCE_OUTCOME 893 -> 899
(+6), latest today. All 6 new edges land on REAL role_type=constraint
nodes ('CONSTRAINT: NEVER commit directly to main') — not
emergent_concept. The sink dormant since Apr 12 is revived on the
correct nodes.
- /v1/constraints/effectiveness reflects it: 'NEVER commit directly to
main | surfaced: 30 followed: 28 rate: 0.93'.
- Both sinks now revived: TSDB (RRF-SCALE-001) + Neo4j (here). The
constraint-effectiveness loop is fully restored.
Threshold 0.55 validated live: correct matches, no false positives.
Tier 2 (jiminy_outcome_test.go, integration tag, skip-on-empty): PASSES
on a populated stack with an idle LLM (7/10 items coded). The guide path
is LLM-latency-dependent (per-node classifier ~31s/call, serialized; a
call fired while the LLM is busy fast-fails empty), so the test
warm-retries and SKIPS (never false-fails) when the LLM path can't
produce items. Bonus check; Tier 3 is the definitive proof. The LLM
serialization/synthesis-timeout is RRF-SCALE-001 Follow-up B, tracked
separately.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-outcome-001): CHANGELOG + CLAUDE.md + post.md (Epic 3)
Final epic. CHANGELOG Unreleased gains the JIMINY-OUTCOME-001 Fixed
entry. CLAUDE.md gains a guidance-outcome constraint-code-matching note
(embedding-first via vector index, keyword fallback; both outcome sinks
now live). post.md: epic-by-epic, acceptance check-off, the loop-revival
completion (TSDB from RRF-SCALE-001 + Neo4j here), discipline notes (LLM
serialization is the test-flakiness source), forward-looking (Follow-up
B now the most operationally-visible remaining issue).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(guidance-synth-001): sprint plan — fix guidance synthesis timeout (Follow-up B)
Synthesis fails on every production warm call (6/6 jiminy.synthesize
errored). Root cause: the hook's /warm path runs background Guide() with
a hardcoded 30s timeout (handlers_jiminy.go:302), inside which the
per-node constraint classifier runs SERIALLY (~1.5s x ~10 nodes ~= 15s),
leaving only ~15s for synthesis which needs 8-27s -> deadline exceeded.
JIMINY_TIMEOUT_MS=240s is configured but the 30s hardcode caps it.
Fix (both): (1) parallelize the per-node classifier with bounded
concurrency (CONSULTING_CLASSIFY_CONCURRENCY, default 4 matching
llama-server --parallel 4); (2) config-drive the warm timeout
(JIMINY_WARM_COMPUTE_TIMEOUT_MS, default 90s). Acceptance: synthesis
succeeds live (no synthesis_error), measured latency drop, no
constraint-surfacing regression.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(consulting): parallelize per-node constraint classifier (GUIDANCE-SYNTH-001 Epic 1)
The per-node LLM constraint classifier in findApplicableConstraints ran
serially (~1.5s/node x ~10 nodes ~= 15s), starving guidance synthesis of
its time budget (synthesis 6/6 errored on the warm path). Now classifies
with bounded concurrency.
- Gate-first (RRF-SCALE-001 score floor) to fix a stable candidate order,
then classify each candidate into a position-indexed slot via a
semaphore-bounded worker pool, then collect-in-order + dedup-by-name —
output is identical to the serial path (determinism). Keyword-only
(no LLM) or cap=1 runs serially (no LLM latency to hide).
- Config: CONSULTING_CLASSIFY_CONCURRENCY (default 4, matching
llama-server --parallel 4; floor 1 = serial rollback). Zero-value
fallback to 4.
- Extracted constraintClassifierIface (minimal Classify surface) so the
concurrent path is unit-testable with a fake; *ConstraintClassifier
satisfies it; SetConstraintClassifier guards against a typed-nil
interface.
Tier 1 (5 new, -race clean): ParallelEqualsSerial (determinism + order),
ParallelIsFaster (concurrency overlaps latency), ErrorFallsBackToKeyword
(fallback intact), ScoreGateStillApplies (RRF-SCALE-001 gate preserved),
ConcurrencyDefaultFallback. Existing findApplicableConstraints tests
unchanged — no regression.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(jiminy): config-drive warm-compute timeout (GUIDANCE-SYNTH-001 Epic 2)
The warm-path background Guide() ran with a hardcoded 30s timeout
(handlers_jiminy.go:302) even though JIMINY_TIMEOUT_MS=240s is
configured. 30s was too tight for the per-node classifier (~15s) +
synthesis (8-27s) -> synthesis deadline-exceeded every warm call.
Replaced with JIMINY_WARM_COMPUTE_TIMEOUT_MS (default 90000, zero-value
fallback 90000) — headroom for the now-parallel classifier (~7.5s) + a
slow 27s synthesis. No-hardcoding rule. Rollback: set to 30000.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test+docs(guidance-synth-001): Tier 2/3 verification + docs + close (Epic 3)
Tier 3 live e2e (verification.md): the warm production path now produces
a synthesized narrative — synthesis_used=true, no synthesis_error,
1892-char augmentation. Fresh jiminy.synthesize succeeded at 50.7s
latency (fit the new 90s budget; would die at the old 30s — validates
the default). Both fixes needed.
Tier 2 (guidance_synth_test.go, integration, skip-on-empty + LLM-
tolerant): PASS — warm path produces guidance without synthesis_error.
Docs: CHANGELOG Fixed entry; CLAUDE.md guidance-synthesis-budget note
('when adding LLM calls to the guidance hot path: respect the
warm-compute budget and prefer bounded concurrency over serial loops');
post.md with the data-driven diagnosis + forward-looking (Follow-up C
now the last open item).
Closes Follow-up B. The guidance pipeline (surfacing + codes +
synthesis) is fully functional end-to-end.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(followup-c): close JSON control-char escaping as NON-ISSUE (no fix)
Follow-up C (the last open item from RRF-SCALE-001 triage) investigated
and closed with evidence — NO code change, because there is no bug to fix.
The earlier /v1/jiminy/latest parse failures were client-side shell
artifacts (co-occurring with the session's 'failed to change group ID'
errors + ad-hoc variable-capture piping), not server bytes:
- writeJSON uses json.NewEncoder().Encode (encoding/json always escapes
control chars U+0000-U+001F); no raw-write bypass; no custom MarshalJSON.
- The synthesized narrative is double-StripControlChars'd (synthesizer.go
:127 + service.go:1116).
- prompt-context.sh already strips control chars via perl before jq, with
2>/dev/null + // empty fallbacks.
Live-verified: the hook's exact jq returns guidance_id correctly; 5 rapid
/latest fetches all parse as strict-valid JSON; 0 raw control chars.
Per 'don't fix a non-problem', shipping a fix would invent a bug that
doesn't exist. Closure documented in docs/development/followup-c-closure.md.
This closes the entire RRF-SCALE-001 follow-up triage: A (JIMINY-OUTCOME
-001), B (GUIDANCE-SYNTH-001), C (non-issue). The guidance->feedback->
outcome loop is fully functional end-to-end.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(jiminy): /guide 30s timeout sibling + single-source config defaults (GUIDANCE-SYNTH-001 fix-commit)
Two things, both from a live e2e of the full loop through the real
production hook path (run per user directive: standard tests don't find
live problems).
1. Sibling bug: the /v1/jiminy/guide handler had the SAME hardcoded 30s
cap as the warm path (handlers_jiminy.go). GUIDANCE-SYNTH-001 fixed
warm; /guide still deadline-exceeded synthesis at exactly 30.003s
(this is what made prior sprints' /guide integration tests flaky).
Now uses the config-driven budget. Live-verified: a 50.05s synthesis
completed (synthesis_used=true) — would die at 30s.
2. Single source of truth for config defaults (user directive: 'single
place to change all instances'). The 90s budget was duplicated as a
literal in 3 sites; prior sprints similarly duplicated each default
(the sigmoid 0.45/8.0 was in 3 places). Now each default is one
exported config.Default* const, referenced by FromEnv and aliased by
consuming-package fallbacks + a Config.JiminyWarmComputeTimeout()
method. Consolidated: warm-compute timeout, the 3 consulting score
floors, sigmoid midpoint/steepness, constraint-code sim threshold,
classify concurrency. Zero behavior change (compile-time aliases);
-race + full suites green.
Live e2e also re-confirmed: real hook captures guidance_id -> feedback
-> +7 Neo4j GUIDANCE_OUTCOME edges on real constraint nodes + 10 TSDB
rows (whole loop closes through the actual hook; re-confirms Follow-up C
non-issue).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-cli-001): sprint plan — federation consumer CLI + UATS backfill
Builds the first consumer for EVENTGRAPH-001's reinforcement-neighborhood
federation API (which has no consumer): a 'mdemg eventgraph' CLI command.
Validates the Pattern Y1 bet + becomes the live-testing harness for
EVENTGRAPH-002/003 (user directive: build the consumer first).
Per the UxTS directive: maps the work to the frameworks. UATS applies to
the federation HTTP API -> add eventgraph_reinforcement_neighborhood.uats
.json (backfilling the -001 gap; the endpoint shipped with no UATS),
which replaces an ad-hoc Go integration test as the Tier 2 contract test.
UVTS/UBENCH N/A. UOTS panel-spec gap noted as a follow-up (out of scope).
CLI rendering -> Tier 1 Go units.
4 epics; CLI (--seed/--query/--hops/--since/--limit/--json) renders
summary + events table or JSON; server-driven defaults (no re-hardcoding);
read-only. ~1-1.5 dev-days.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-cli-001): mdemg eventgraph reinforcement-neighborhood (Epic 1)
First consumer of the EVENTGRAPH-001 federation API — POSTs to
/v1/eventgraph/reinforcement-neighborhood, renders a summary + events table
(or --json). Supports --seed, --query (resolves seed via /v1/memory/retrieve
top-1), --hops, --since, --limit. Unset flags are omitted from the request
so the server applies its config defaults (no re-hardcoding of hops/since/limit
in the CLI). Registered under the "advanced" command group.
Tier 1 (httptest, -race clean): request-mapping omit-when-unset + conversion,
--query seed resolution, no-results + invalid --since + surfaced-503 errors,
render (empty + table), helpers.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(eventgraph): neighbor_node_ids serializes as [] not null for empty neighborhood
Caught in EVENTGRAPH-CLI-001 live contract testing (standard code tests
missed it; the live UATS happy-path against the running server did not):
walkNeighborhood returns a nil slice when the seed has no neighborhood
(e.g. an unknown seed), which JSON-marshals to `null`, while Events is
defensively initialized to []. Both are array fields and must serialize
consistently — null breaks any consumer asserting an array type (incl. the
new UATS contract's `type_is array` on $.neighbor_node_ids).
EventsInGraphNeighborhood now coalesces the nil slice to []string{}.
Tier 1 TestFederationResult_EmptyArraysNotNull pins the JSON contract.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(eventgraph-cli-001): UATS contract spec for federation API (Epic 2)
Backfills the UATS gap EVENTGRAPH-001 left (no contract test for
/v1/eventgraph/reinforcement-neighborhood). 6 cases, validated 6/6 live
against the running server:
- happy 200: asserts the response contract shape (events/neighbor_node_ids
arrays, graph_hops/tsdb_rows_scanned numbers, truncated boolean) — robust
to data, works even with an unknown seed (empty neighborhood is valid 200)
- missing_space_id / missing_seed_node_id → 400 (empty-string override, since
the runner deep-merges variant body over base — key omission can't unset)
- negative_hops → 400, hops_over_ceiling (999 > 2×default) → 400
- method_not_allowed (GET) → 405
sha256 integrity hash added + verified.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-cli-001): live verification + feature doc + CHANGELOG + close (Epic 3)
Tier 3 live e2e verified the real binary against the real stack: --query
surfaced 20 reinforcement events in a 5-node neighborhood (demonstrating the
Hebbian-write → federation-read loop closing in one command); --seed/--json/
--limit/unknown-seed/no-arg paths all verified live. Feature doc gains the CLI
consumer section; CHANGELOG Added + Fixed entries; CLAUDE.md architecture note;
verification.md + post.md (UxTS mapping: UATS done, UOTS follow-up carried over).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(eventgraph-cli-001): tag UATS spec 'tsdb' so CI skips it without TSDB
CI Test failed: the UATS contract step boots a minimal server without TSDB,
so the eventgraph service is nil and every POST returns 503 "service not
initialized" instead of the expected 200/400 (only GET→405 passed, since the
method check precedes the service check). Same class as PR #404. The
federation endpoint genuinely requires TSDB (it queries reinforcement_events;
the service is nil without TSDB at boot), and CI's UATS step already excludes
`tsdb`-tagged specs (ci.yml --exclude-tag ...,tsdb). Added "tsdb" to api.tags
(matching metrics_snapshot/readyz_tsdb); re-hashed. Verified locally: the spec
now reports Status: skip under the exact CI exclude filter, and still 6/6 live
against the full stack via explicit --spec.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): sprint plan — guidance-outcome federation (Epic 0)
Federate the guidance-outcome event stream (Pattern Y1, second event class):
walk a constraint's Neo4j neighborhood, surface time-windowed constraint_outcomes
(followed/ignored/contradicted) for the constraint + its graph-related constraints.
Data-decided architecture: reuse the existing constraint_outcomes table (no new
hypertable/writer/enqueue site — RRF-SCALE-001 already populates it, 1176 live
rows); join graph↔events on constraint_code (TSDB constraint_id UUID ≠ Neo4j
node_id CUID — code is the only viable key). One additive migration (V0023:
constraint_code index, schema 22→23). 8 epics, 3 testing tiers, live Tier 3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): V0023 constraint_code index on constraint_outcomes (Epic 1)
Adds idx_constraint_outcomes_code (space_id, constraint_code, time DESC) — the
guidance-outcome federation joins graph↔events on constraint_code (TSDB
constraint_id is a UUID that doesn't match the Neo4j node_id CUID; code is the
only viable key), and migration 011 indexed only space/constraint_id/outcome.
Partial index (constraint_code NOT NULL AND <> '') skips uncoded outcomes.
Bumps TSDB_REQUIRED_SCHEMA_VERSION default 22→23 (config.go) to match the
migration count — CI schema-version validator gates on this. Additive, no data
change, idempotent.
Live-verified: migration applies (schema 22→23), idx present, re-apply is a
no-op, config/tsdb tests green, CI schema check 23=23.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): GuidanceOutcomesInNeighborhood federation method (Epic 2)
Second Pattern Y1 federation: walk a constraint's Neo4j neighborhood, collect
each neighbor's constraint_code, and join constraint_outcomes on those codes
(backed by the V0023 index). walkNeighborhoodWithCodes returns the neighborhood
node IDs + a code→node map; queryGuidanceOutcomes pulls coded outcomes in the
window; Go-side join resolves each outcome's code → its neighborhood constraint
node. Non-nil slices from the start (EVENTGRAPH-CLI-001 lesson). Reuses the
existing constraint_outcomes sink — no new table/writer.
Tier 1 (-race): validation guards, empty-arrays-not-null, sortedKeys
determinism, join resolution. Tier 2 integration (live Neo4j+TSDB): full
round-trip — hops=1 (seed+related codes, off-neighborhood excluded), hops=0
(seed code only), unknown-seed (empty non-nil). PASS.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): guidance-outcome federation handler + route (Epic 3)
POST /v1/eventgraph/guidance-outcome-neighborhood — walk a constraint's
neighborhood, surface constraint_outcomes whose code is in the neighborhood.
Same gating/auth/default convention as the reinforcement endpoint.
Single-source refactor (per the dynamic-variables directive): extracted the
shared gate (method/enabled/service → eventgraphGate) and default-resolution
(hops/since/limit + ceiling → resolveFederationDefaults) into helpers used by
BOTH handlers, so the federation rules live in exactly one place. The
reinforcement handler now calls them too — verified no regression (reinforcement
UATS still 6/6 live, unit tests green).
Live-verified: seeding from the real 'no-direct-main-commits' constraint node
surfaced real 'followed' outcomes with constraint_node_id resolved to the seed
and in_neighborhood=true.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): mdemg eventgraph guidance-outcome-neighborhood CLI (Epic 4)
Sibling subcommand consuming POST /v1/eventgraph/guidance-outcome-neighborhood.
Walks a constraint's neighborhood and renders guidance outcomes (followed/
ignored split + table: code · outcome · sim · g_type · guidance_id · recorded)
or --json. Seed via --seed/--query (--constraint-code seeding deferred — needs
server-side code→node resolution; --query covers discovery). Unset hops/since/
limit omitted so the server applies config defaults (single source of truth).
Tier 1 (-race): request-mapping omit-when-unset + conversion, --query seed
resolution, surfaced-503 error, render (empty + followed/ignored table),
truncStr. Help renders.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(eventgraph-002): UATS contract spec for guidance-outcome federation (Epic 5)
6 cases, validated 6/6 live: happy-200 response shape (outcomes/
neighbor_node_ids/neighbor_constraint_codes arrays, graph_hops/tsdb_rows_scanned
numbers, truncated boolean), missing space_id/seed → 400 (empty-string override
under deep-merge), negative_hops → 400, hops_over_ceiling → 400, GET → 405.
Tagged 'tsdb' so CI skips it without TSDB (the EVENTGRAPH-CLI-001 lesson).
sha256 hashed + verified.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): Tier 3 live verification (Epic 6)
Real binary against the real stack. Key assertion: CLI --json output matches
direct constraint_outcomes SQL exactly (11 outcomes = 11, all followed) for the
no-direct-main-commits constraint. --seed/--query/--limit/--json/unknown-seed/
no-arg all verified live. The --query "0 outcomes" result was traced to SQL
ground truth — the 5 neighborhood codes genuinely have no feedback, so it's
correct (federation distinguishes "code in neighborhood" from "code has
outcomes"), not a join bug. Reinforcement endpoint un-regressed by the shared-
helper refactor (UATS 6/6).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): feature doc + CHANGELOG + CLAUDE.md + close (Epic 7)
Feature doc gains a Guidance-Outcome Federation section (why reuse
constraint_outcomes, why join on constraint_code, CLI usage) + forward-look
update. CHANGELOG Added (endpoint + CLI) + Changed (TSDB schema 22→23).
CLAUDE.md architecture note extended. post.md closes the sprint with UxTS
mapping + follow-ups (--constraint-code seeding, EVENTGRAPH-003, UOTS).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(metrics,backup): resolve docker binary robustly under minimal launchd PATH
The native server (launchd) inherits PATH=/usr/bin:/bin:/usr/sbin:/sbin, which
excludes the Docker Desktop symlink (/usr/local/bin/docker). So every
server-runtime `docker` shellout failed with "executable file not found in
$PATH": (1) Neo4j container CPU/mem stats (server.go) logged an ERROR every 60s
and left the neo4j_container_* gauges empty — so the neo4j_high_cpu/_memory
alert rules had no data; (2) the TSDB backup scheduler's `docker compose
pg_dump` (backup.go) failed with only a slog.Warn. The DATA PLANE was never
affected — Neo4j (Bolt) + TSDB (pgx) connect over mapped TCP ports, not the
docker CLI.
Fix (durable, configurable, single-source): new internal/dockerbin resolver —
MDEMG_DOCKER_BIN env override → exec.LookPath → well-known install locations
(/usr/local/bin, /opt/homebrew/bin, /usr/bin) → graceful unavailable. Wired
into server.go (stats) + backup.go (both compose calls). The perpetual 60s
ERROR is downgraded to a one-shot WARN when docker is genuinely absent (it's
optional telemetry). Added a sane PATH to the launchd server plist template as
defense-in-depth.
Live-verified: after restart, mdemg_neo4j_container_cpu_percent=0.59 /
mem_percent=29.13 now land in metric_samples (were absent); no more docker-stats
ERROR; `docker stats` + backup resolve docker under a simulated minimal PATH.
Note: `mdemg data export-auto` (training-export job) was NOT a victim — it
exports via network SQL, not docker (corrected from an earlier assumption).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(nosilent-001): sprint plan — fail-loud scheduled jobs (Epic 0)
Triggered by a live-discovered silent failure: the TSDB backup scheduler was
failing every 24h run (docker-under-launchd-PATH) with only a buried slog.Warn.
Docker cause fixed (4cc7608); this sprint fixes the class — scheduled jobs that
fail with no record + no alert. V0024 scheduled_job_events hypertable + writer,
jobhealth.Report (record + alert on failure), wire the 3 jobs (backup,
maintenance, export-auto), 2 evaluator rules (backup staleness + recent
failure) so the server catches "job failed OR never ran". Config-driven, 3
testing tiers, live Tier-3 induced-failure.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): V0024 scheduled_job_events + writer (Epic 1)
Hypertable (job_name, success, latency_ms, error_message, metadata jsonb,
recorded_at) + RecordJobEvent synchronous single-row writer (mirrors V0021
model_install pattern). Indexes: per-job freshness, partial failed, per-space.
One row per scheduled-job run so the alert evaluator can detect "job failed"
AND "job never ran" (staleness). Schema 23->24; TSDB_REQUIRED_SCHEMA_VERSION
bumped to match migration count (CI check 24=24).
Tier 1 (-race): field mapping, optional-nulls, error truncation, nil-pool
no-op, insert-error propagation. Tier 2 (live TSDB): round-trip + the staleness
(recent successes) + failure (recent failures) query shapes the rules will use.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): record + alert on scheduled-job outcomes (Epic 2)
New internal/jobhealth.Report — the single policy point: record a
scheduled_job_events row and fire a high-severity "scheduled-job" alert on
failure (both pool + dispatcher nil-safe). Wired into all three jobs:
- TSDB backup scheduler (internal/tsdb/backup.go): decoupled JobResultFunc hook
(mutex-guarded, -race clean) so internal/tsdb stays free of internal/alert;
server.go::SetTSDBClient sets it with the pool + s.alertDispatcher. A failed
or never-run backup now records + alerts instead of a silent slog.Warn.
- export-auto + maintenance (CLI): deferred reportScheduledJob on the named
return error — opens a short-lived pool + a file-backed dispatcher (same
~/.mdemg/alerts/current.json the hooks surface) so a separate-process CLI job
still alerts the operator.
Tier 1 (-race): jobhealth fires alert only on failure (real file-backend
dispatcher), nil-safe. Live smoke: export-auto recorded success=t latency=3050ms.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): scheduled-job staleness + failure alert rules (Epic 3)
Two server-native evaluator rules over V0024 scheduled_job_events:
- scheduled_job_recent_failure (always on): any job failure in the last
JOB_FAILURE_LOOKBACK_MIN (default 60) → high alert.
- backup_no_recent_success (gated on TSDB_BACKUP_ENABLED): zero successful
tsdb-backups within the staleness window → high alert. THIS is the "job
never ran" guarantee — it fires from the server observing ABSENT success, so
a backup that silently died or never started is caught, not just one that
errored.
Window derives from the real backup interval × 2 (JOB_BACKUP_STALENESS_HOURS
override; no hardcoded literal). JOB_HEALTH_ALERT_ENABLED master gate (default
true). Appended after DefaultRules() in serve.go.
Tier 1: failure rule always present (gt 0), staleness gated on backups-enabled
(lt 0.5), windows reflect config, non-positive fallback. Build/lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(nosilent-001): distinct services for job rules so neither masks the other
Caught in live Tier-3 testing: both evaluator rules used Service="scheduled-jobs",
and the dispatcher cooldown key is (Service, Severity) — so the failure alert's
cooldown SUPPRESSED the staleness alert (only the failure fired). One alarm
masking another is the exact silent-failure class this sprint kills. Fixed:
scheduled-job-failure / scheduled-job-staleness distinct services. Re-verified
live — both fire independently and land as distinct alert-file entries. Tier-1
assertion pins that the two services differ. Includes Tier-3 verification.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(nosilent-001): feature doc + CHANGELOG + CLAUDE.md + close (Epic 4)
Feature doc docs/features/scheduled-job-health.md (why / two mechanisms /
operator view / config). CHANGELOG Added (NOSILENT-001) + Changed (schema 23→24)
+ Fixed (docker-under-launchd-PATH). CLAUDE.md Service Alert System extended
with the scheduled-job-health note + the distinct-Service-per-rule cooldown
caveat. post.md closes the sprint (UxTS mapping + follow-ups).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(nosilent-001): sync embedded launchd server plist with source (CI)
CI "Verify embedded launchd templates match source" diffs packaging/launchd/*
against internal/cli/launchd_templates/* (the embed.FS copy mdemg service
install uses). The PATH addition landed only in the source copy; sync the
embedded copy so they match byte-for-byte.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(roadmap): add jiminy-governance skill build-out (Workstream C, Action 7)
Records the jiminy-governance Claude Code skill on the active forward roadmap
(SPRINT_ROADMAP_POST_FT_LORA.md, cross-cutting governance) + brings the source
spec into the repo (docs/development/jiminy-governance-skill/SKILL.md, out of
~/Downloads). The skill makes Jiminy the deterministic source of context +
governance over J17, enforced by the PreToolUse hook — a routing/handshake shim,
not a rulebook. Build-out scope notes the wire-up placeholders that must be
resolved against the real instance (Jiminy MCP/endpoint, PreToolUse hook, J17
ack/RetireCode/GUIDANCE_OUTCOME calls). Aligns with the now-live guidance loop
(RRF-SCALE-001 / JIMINY-OUTCOME-001 / GUIDANCE-SYNTH-001).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-governance): resolve skill wire-up against the real instance
Step 1 of the jiminy-governance build-out (roadmap Workstream C Action 7).
Resolved all five placeholders from the running MDEMG instance, verified live:
- Jiminy query: MCP `mdemg mcp` (stdio) → jiminy_guide/validate_changes; HTTP
/v1/jiminy/guide (returns guidance_id) + /bootstrap glossary (j17v1) + /latest.
- PreToolUse: pre-bash-check.py (Bash, fail-closed) + pre-write-check.py
(Write/Edit → /v1/jiminy/classify, /strict-only, fail-open).
- SessionID: claude-core convention.
- Comprehension ack: /v1/jiminy/protocol/feedback (verified ingested 1).
- GUIDANCE_OUTCOME: /v1/jiminy/feedback {guidance_id,…}.
- RetireCode: internal-only by design (RSIC/APE protocol-evolution) — no
agent-facing call; agent must never self-retire a constraint.
Also surfaced the two real integration gaps the build-out must close (the work,
not the prose): (a) the MDEMG MCP server is NOT registered (.mcp.json absent —
context is pushed by prompt-context.sh, not pulled by the agent); (b) PreToolUse
enforcement is /strict-gated + fail-open, so not deterministic-by-default.
Roadmap Action 7 updated to reflect step 1 done + the two gaps as next steps.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(jiminy-governance): ship the J17 governance skill + register MDEMG MCP
Build-out of roadmap Workstream C Action 7 (steps 2-4), per the resolved
wire-up. Closes the two integration gaps found while resolving:
- gap 1 (MCP not registered): .mcp.json registers `mdemg mcp` (stdio).
Live-probed: 20 tools incl. jiminy_guide + validate_changes — the agent can
now PULL guidance, not only receive the hook push.
- gap 2 (enforcement /strict-gated + fail-open): policy set — Write/Edit J17
gate kept fail-open (hard server dependency on every edit is too brittle);
the skill's handshake auto-enables /strict so the gate is active per session.
Bash gate already fail-closed (demonstrated live when a test payload with a
destructive force-push string was blocked by pre-bash-check.py).
Skill authored at the canonical .claude/skills/jiminy-governance/SKILL.md
(frontmatter valid; concrete wire-up inline — MCP tools + HTTP endpoints +
SessionID claude-core + the 5-step handshake). Kept a routing/handshake shim,
not a rulebook (rules stay in the graph).
Live Tier-3 PASSED: full handshake identify->request->comprehend->act->report
ran against the real instance; GUIDANCE_OUTCOME edges 906->909 (one per coded
constraint). Verification: docs/development/jiminy-governance-skill/.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-governance): commit install-ready skill + install README
.claude/ is gitignored (per-developer local config), so the installed skill at
.claude/skills/jiminy-governance/SKILL.md is local-only by convention. Commit
the reproducible, install-ready copy (jiminy-governance.skill.md) + a README
with the one-line install (cp into .claude/skills/) so the skill propagates via
the repo. The MCP server it uses is registered in the tracked repo-root
.mcp.json.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): per-conversation SessionID across hooks + skill
Hooks and the jiminy-governance skill hardcoded session_id="claude-core", so
trust/escalation/observations from ALL Claude Code conversations collapsed into
one shared MDEMG session. Claude Code already passes a per-conversation
session_id on stdin to every hook — the implementation just never used it.
Resolver precedence (single rule everywhere): MDEMG_SESSION_ID env (stable-
identity escape hatch) > Claude Code stdin session_id (per-conversation default,
race-free per hook) > ~/.mdemg/.claude-session (published by SessionStart +
UserPromptSubmit for the agent / stdin-less contexts) > claude-core (fallback).
Realizes J17's intended per-(session,constraint) isolation.
Tracked templates updated (internal/cli/hook_templates/): session-start.sh,
prompt-context.sh, post-tool-observe.py, pre-compact.sh + Windows .ps1 variants
— every hardcoded claude-core in MDEMG calls replaced with the resolved id;
session-start/prompt-context publish the session file. Skill SessionID
instruction + handshake steps now resolve <SessionID> instead of claude-core.
Live-verified: a hook resolved a stdin session_id and published the session
file; post-tool-observe wrote an observation keyed to the per-conversation id in
Neo4j (not claude-core). bash -n / py_compile clean; go build + hooks test pass.
Note: pre-write-check.py is local-only (no tracked installer template); its fix
is applied on this machine but won't propagate via `mdemg hooks install` until
it's added to the tracked hooks (follow-up). .claude/* is gitignored, so the
live hook copies aren't committed — the tracked templates propagate via install.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): per-conversation SessionID — installed hook copies
The .claude/hooks/ installed copies are tracked (committed before the .claude/*
ignore), so apply the same SessionID resolver here as in the embedded templates
(prior commit). session-start.sh, prompt-context.sh, post-tool-observe.py,
pre-compact.sh now resolve MDEMG_SESSION_ID env > stdin session_id >
~/.mdemg/.claude-session > claude-core. (pre-write-check.py is untracked /
local-only — its fix lives on this machine only; tracking it is the follow-up.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): add pre-write-check.py to the tracked installer
The /strict J17 Write/Edit classify gate (pre-write-check.py) was local-only —
no tracked template, so `mdemg hooks install` never installed it and its
SessionID fix wouldn't propagate. Add it as a tracked template
(internal/cli/hook_templates/pre-write-check.py, space_id → {{SPACE_ID}},
runtime URL discovery — no {{MDEMG_URL}} placeholder per the template
convention) and register it in claudeHookFiles() as
{PreToolUse, 8s, "Write|Edit"}. hooks_test expectations updated 5→6.
go build + hooks test + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* release: cut v0.10.1
Promote CHANGELOG [Unreleased] → [0.10.1] - 2026-06-08. Adds the
jiminy-governance skill + MDEMG MCP registration and the per-conversation
SessionID work to the release notes (alongside the already-logged EVENTGRAPH-002,
EVENTGRAPH-CLI-001, NOSILENT-001, the docker-PATH fix, and the TSDB schema
22→23→24 bumps). Fresh empty [Unreleased]; comparison link refs updated +
backfilled through v0.10.1.
The v0.10.1 git tag (triggers release.yml artifact build) + homebrew formula
bump are the operator release-cut step on main, post-merge.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs: governance system doc + bring cli/api references current
(1) New docs/features/jiminy-governance.md — detailed how-it-works + full file
inventory for the J17 agent-governance system (skill, hooks, SessionID, MCP,
enforcement, runtime state files, install/verify steps).
(2) docs/user/api-reference.md — add the Event Graph Federation section
(POST /v1/eventgraph/{reinforcement,guidance-outcome}-neighborhood) + TOC entry;
these were the only two missing endpoints (audited the full route table).
(3) docs/user/cli-reference.md — add mdemg eventgraph {reinforcement,guidance-
outcome}-neighborhood, model run, watchdog status, migrate context-fingerprint,
data curate/validate/clean; fix the stale `model pull --adapter` description
(MODEL-DIST-002 shipped — no longer "deferred/errors"); update the Command Tree
Summary to match.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* chore(submodule): bump homebrew-mdemg to v0.10.1 formula
Point the parent at the manually-published v0.10.1 homebrew formula
(reh3376/homebrew-mdemg@10c1843). The release artifacts published cleanly;
the formula update was manual because the CI HOMEBREW_TAP_TOKEN expired
(follow-up: rotate the secret so future releases auto-publish).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): sprint plan — reinforcement coverage for other Hebbian paths
Wire the 3 remaining Hebbian write paths (CoactivateSession,
ApplySymbolCoactivation, ApplyNegativeFeedback weaken-only) into the existing
reinforcement_events writer via distinct trigger_path values. No schema/writer/
wiring change (V0022 already has trigger_path + signed delta_weight +
created_new_edge; writer already injected). Contradict path deferred (CONTRADICTS
edges aren't traversed by the federation walk). RETURN-only Cypher edits; Tier-2
asserts unchanged weights. 5 epics, 3 tiers, live Tier-3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire CoactivateSession into reinforcement_events (Epic 1)
CoactivateSession (session-internal conversation-observation co-activation, full
Hebbian formula) now emits per-pair reinforcement events with
trigger_path=coactivate_session. RETURN-only Cypher change: replaced the
discarded `count(*)` with the standard 17-field per-pair RETURN (one row per
forward edge; reverse is a mirror). Weight SET untouched → update behavior
provably unchanged. Mirrors the proven ApplyCoactivation record loop; writer
already injected. EXPLAIN-validated (compiles, all RETURN vars in scope, no
writes); build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplySymbolCoactivation into reinforcement_events (Epic 2)
SymbolNode-pair co-activation now emits trigger_path=apply_symbol_coactivation
rows. Split the weight update out of the ON MATCH clause into a separate SET so
the pre-update weight (w) can be captured for prev/new/delta — createdNew
(evidence_count=1) keeps a fresh edge at 0.1 and increments matches by +0.05,
preserving the original ON-clause weight behavior exactly. eta/surprise/
activation/path_sim are NULL (N/A for symbols); roles default 'symbol_node'.
EXPLAIN-validated; build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplyNegativeFeedback weaken path → reinforcement_events (Epic 3)
The weaken path (existing CO_ACTIVATED_WITH edge weakened by negWeight) now emits
trigger_path=apply_negative_feedback rows with a NEGATIVE delta_weight and
created_new_edge=false. The FOREACH writes (weaken SET + contradict MERGE) are
untouched; only the RETURN changed from aggregated `action,count(*)` to per-pair
rows — the Go side counts rows (sum = grouped count, NegativeFeedbackResult
preserved) and emits reinforcement events for weaken rows only. prevWeight is
captured before the FOREACH SET. Contradict path deliberately not emitted
(CONTRADICTS isn't traversed by the federation walk; deferred). EXPLAIN-validated;
build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(conversation): inject learning service so CoactivateSession actually runs
Discovered via EVENTGRAPH-003 live smoke: session co-activation
(CO_ACTIVATED_WITH edges between same-session conversation observations) had
NEVER fired — 0 such edges ever in mdemg-dev across 5495 conversation
observations. Root cause: conversation.NewServiceWithConfig sets
learningService=nil ("set via SetLearningService to avoid circular dependency"),
but SetLearningService had NO caller, so the `if s.learningService != nil` guard
in Observe() always skipped CoactivateSession. The function + its Cypher were
correct (verified by running it directly: 3 pairs, proper Hebbian weights) —
it was just never invoked.
Fix: convSvc.SetLearningService(lea) at construction. Live-verified: 3 distinct
observations in a session now create 6 CO_ACTIVATED_WITH edges + emit
coactivate_session reinforcement events. Standalone fix-commit per the
live-smoke precedent (surprise bugs don't get rolled into the sprint commit).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): Tier 3 verification + feature doc + CHANGELOG + close (Epic 4)
All four trigger_paths live-verified (apply_coactivation 50, apply_symbol_
coactivation 1000, apply_negative_feedback 1 negative-delta, coactivate_session
4 after the dormancy fix); federation CLI surfaces them. Feature doc updated to
all-four-paths + the trigger_path table; CHANGELOG Added (EVENTGRAPH-003) + Fixed
(CoactivateSession never-invoked); CLAUDE.md note + correction (CoactivateSession
was dead, not "writing via sidecar paths"); verification.md + post.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-004): sprint plan + CoactivateSession post-revival health review (Epic 0)
EVENTGRAPH-004 federates the last unfederated Hebbian write — the
ApplyNegativeFeedback contradict action — into reinforcement_events
(trigger_path=apply_negative_feedback_contradict). Data-decided scope:
reuse the existing V0022 sink (zero CONTRADICTS edges exist anywhere;
no producer calls /v1/learning/negative-feedback — instrument before
the producer arrives, the inverse of the dormancy pattern).
Also closes the EVENTGRAPH-003 follow-up: 30h post-fix health review of
the revived CoactivateSession path — no tuning needed, textbook session
cliques, pre-fix orphans stay as historical record (operator decision).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(eventgraph-004): wire ApplyNegativeFeedback contradict path → reinforcement_events (Epic 1)
The contradict action (no co-activation edge → MERGE CONTRADICTS) was the
last unfederated Hebbian write. The CONTRADICTS MERGE lived inside a
FOREACH, where the edge variable is invisible to RETURN — so the original
single statement is split into two statements in the SAME ExecuteWrite
transaction: (a) weaken (EVENTGRAPH-003 telemetry, RETURN unchanged) and
(b) contradict with a per-pair RETURN. Classification is identical: weaken
never deletes edges, so contradict's NOT EXISTS sees the same edge set the
original OPTIONAL MATCH did.
Contradict rows land with trigger_path=apply_negative_feedback_contradict.
created_new_edge detected via `c.updated_at IS NULL` (ON MATCH always sets
it; ON CREATE never does — invariant pinned by comment). delta_weight is
the CONTRADICTS edge's OWN weight delta (+negWeight on create, 0 on
re-match); negative-feedback semantics are carried by trigger_path, not
the sign.
Both statements EXPLAIN-validated against live Neo4j. Tier 1: 2 new parser
tests (create/re-match branches); learning suite green; lint clean.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): Tier 3 live verification — contradict create/re-match + weaken unchanged (Epic 2)
Live against the restarted Epic-1 binary: contradict create row
(+0.15, created_new_edge=true), re-match row (delta=0, evidence=2),
weaken row byte-equivalent to pre-split behavior (negative delta,
floor at 0). Federation CLI surfaces the new trigger_path with no
read-side change. UATS learning_negative_feedback 5/5 PASS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): feature doc + CHANGELOG + UATS pin + close (Epic 3)
Feature doc: 5-path trigger_path table + delta-semantics consumer
warning (contradict delta is the CONTRADICTS edge's own weight delta —
semantics live in trigger_path, not the sign). UATS spec extended:
zero-count equals assertions on nonexistent nodes (hash refreshed,
5/5 live). CLAUDE.md architecture note + producer-gap disclosure.
Sprint close in post.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci: auto-sync dev branch with main after each squash-merged PR
Squash merges never advance the dev branch's merge-base, so every
sprint touching CHANGELOG.md/CLAUDE.md hit CONFLICTING on its next PR
(first bitten: PR #419). New sync-dev-after-merge.yml merges main back
into the source *_dev* branch after each merged PR; the GITHUB_TOKEN
push triggers no other workflows, so it can never spawn an empty
auto-PR (the PR #420 failure mode). Conflicts fail loudly for manual
resolution; workflow_dispatch enables manual runs/live testing.
auto-pr.yml additionally skips PR creation when branch content is
identical to main — guards MANUAL sync pushes, verified against the
live repo state (current dev01 ≡ main → empty=true → skip).
actionlint clean (untrusted refs passed via env, not inline).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(roadmap): Q3 2026 vision-derived roadmap from 26-agent codebase deep-dive
Full-codebase review vs MDEMG's purpose (cognitive substrate / connection
layer): 19 map agents (3 vision + 16 subsystem), 3 cross-cutting assessors,
synthesizer + adversarial completeness critic (19 revisions applied).
Verdict: server-side substrate is mature, but the system is not currently
functioning as the assistant's internal dialogue — the per-prompt delivery
channel silently no-ops (hook reads .user_prompt, Claude Code sends
.prompt), 100% of GENERALIZES edges have NULL weight (22,170/22,170,
live-verified), scheduled decay/prune has been a permanent dry-run, RSIC
validates 16/17 actions vacuously, and supervision covers 3 of ~14
background loops. Every defect is the same disease: wired-looking seams
with no caller, wrong contract, or no reader.
4 phases ≈ 75 days committed: (1) reconnect the loop ends, (2) close the
learning loops, (3) survivability + class-ending forcing functions,
(4) FT frontier + release hygiene. Top-10 ranked; deferrals explicit.
Orchestrator spot-verification annex included (5 claims re-verified live).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): sprint plan — fix hook stdin contract, reconnect per-prompt channel (Epic 0)
Roadmap Q3 Phase 1 rank #1. Audit of all 6 hooks vs the actual Claude
Code stdin schemas: prompt-context.sh reads .user_prompt (CC sends
.prompt) → channel exits silently on every prompt; post-tool-observe.py
reads tool_output (CC sends tool_response) → false "Build/test
succeeded" observations with empty output; guidance wrongly coupled to
RESULT_COUNT>0; minor pre-compact transcript jq. session-start /
pre-bash-check / pre-write-check verified correct.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): prompt-context.sh reads .prompt — revive the per-prompt channel (Epic 1)
Claude Code's UserPromptSubmit stdin field is `prompt`; the hook read
`.user_prompt`, which is always empty → exit 0 → per-prompt CMS recall,
Jiminy guidance, /strict reformulation, the warm trigger, and the
retrieve-time Hebbian reinforcement have NEVER fired in any session.
Now reads `.prompt // .user_prompt` (legacy fallback kept).
Also decouples guidance from recall: the RESULT_COUNT=0 branch no longer
exits — it printed its notice then skipped guidance + warm + retrieval
reinforcement, coupling independent deliveries.
Both copies (live + installer template). Tier 1 simulated stdin: real
.prompt payload → first-ever guidance delivery (J17 T1 bootstrap + DICT,
5363 guidance bytes vs 0 forever); legacy fallback works; short/empty/
malformed payloads exit silently (fail-open preserved).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): post-tool-observe reads tool_response — end blind "succeeded" observations (Epic 2)
Claude Code's PostToolUse stdin field is `tool_response` (string or
object); the hook read `tool_output`, which is always absent → output_str
empty → error indicators never matched → every go build/go test/pytest
Bash call was recorded as "Build/test succeeded" sight-unseen, and real
errors were never observed.
Now reads tool_response (fallback tool_output) via _response_text(),
normalizing string|dict|list (stdout/stderr join). Success classification
requires NON-EMPTY clean output — a silent success records nothing rather
than fabricating; failures land as error observations with real stderr.
Both copies (template regenerated from fixed live, {{SPACE_ID}}
placeholder preserved, verified identical modulo placeholder). Tier 1
against real CMS: failing build → error obs with stderr; passing →
progress; empty → no record.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): pre-compact transcript extraction reads the real line shape (Epic 3)
Transcript lines are {type, message:{content:[{type, text|name, ...}]}};
the old top-level `.content` read always yielded empty, so pre-compaction
snapshots never carried recent-activity context. New jq walks
.message.content[] extracting .text/.name. Verified against this
session's real transcript (old: nothing; new: real activity). Both
copies, placeholders preserved.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): Tier 3 verification + CHANGELOG + CLAUDE.md contract pin + close (Epics 4-5)
Live in the real session: first-ever guidance delivery (J17 T1 bootstrap
+ DICT, 5363 bytes vs 0 forever); real failing build → error observation
with actual compiler output in CMS. PostToolUse success-only firing
documented as a limitation. Hook stdin contract pinned in CLAUDE.md.
Drift + clique-semantics findings logged for HOOKSYNC-001 / Phase 2.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): sprint plan — drift-proof + self-monitoring hook channel (Epic 0)
Roadmap Q3 Phase 1 rank #2. Investigation grounded all five findings:
template→live drift severed alert delivery (50-entry file actively
rotating today, never shown); no Cleared lifecycle (nothing sets the
field; no /v1/alert* endpoints); no absence detection for the channel
that just had a months-long silent outage; compose publishes 9999 on
0.0.0.0; neural sidecar binds 0.0.0.0:8101 via a 39-day-old process
serving pre-J17-fix code. 8 epics: reconcile, CI parity gate, clear
lifecycle, hook_events absence rule (reuses V0024 via jobhealth),
hooks doctor, PORT-TRUTH rider, Tier 3, docs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): reconcile bidirectional hook drift — alert delivery restored to live (Epic 1)
Live hooks adopted from templates (SPACE_ID substituted): restores the
alert-display blocks (all-pending per prompt; critical/high + degraded
healthz at session start) that the live copies lacked — the NOSILENT
last mile. Reverse drift caught during reconcile: the live hook's T1/T2
bootstrap-detection block (MAX_TIER → /v1/jiminy/bootstrap → ACTIVE
CONSTRAINTS header) never existed in the template and was nearly lost —
now single-sourced in the template and regenerated into live.
Live-verified: one prompt now renders alerts (50 pending incl. live
CRITICALs) + recall + J17:INIT bootstrap + guidance + synergy footer,
coexisting. All 6 hooks byte-identical to templates modulo {{SPACE_ID}}.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci(hooksync-001): hook-template parity gate — live hooks must match templates (Epic 2)
Mirrors the compose/launchd parity pattern: every *.sh/*.py template
must byte-match .claude/hooks/ modulo the {{SPACE_ID}} placeholder.
Proven locally: passes clean, fails (with a bounded diff dump) on
deliberate drift. Ends the bidirectional-drift class that severed
alert delivery and nearly lost the T1 bootstrap block.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): alert Cleared lifecycle — display once, then delivered (Epic 3)
Alert.Cleared existed but nothing ever set it: once hooks rendered the
file, the same entries would re-render every prompt forever. New:
FileBackend.Clear (ids and/or all_before cutoff, idempotent, under the
existing lock) → Dispatcher.ClearAlerts → POST /v1/alerts/clear. Hooks
now clear exactly what they displayed (fire-and-forget, fail-open);
cleared = delivered-to-operator, not resolved — persisting conditions
re-fire via the evaluator. Alert IDs now CUIDv2 per the identifier
standard (was UnixNano; old ids remain valid opaque strings).
Live-verified lifecycle: prompt 1 → "50 pending, showing 10" + 10
cleared; prompt 2 → "40 pending, showing 10" (next batch, no re-render)
→ 20 cleared. Tier 1: Clear by-id/by-time/idempotent/no-backend. UATS
alerts_clear 3/3 live (runner falsy-body inheritance discovered:
variant bodies must be non-empty objects).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): hook-channel absence detection — the channel now self-reports outages (Epic 4)
POST /v1/hooks/event records heartbeats into V0024 scheduled_job_events
via the jobhealth policy point (job_name hook:<name>; no new sink).
Two independent heartbeats: prompt-context fires per delivery (the
monitored channel); post-tool-observe fires throttled (HOOK_HEARTBEAT_
COOLDOWN_SEC, default 300 — proves sessions ACTIVE). Evaluator rule
hook_channel_silent (distinct service per the NOSILENT cooldown rule):
sessions active + zero prompt-context fires in HOOK_SILENT_LOOKBACK_
HOURS (24) → high alert. This is the "job never ran" guarantee applied
to the channel whose months-long outage HOOKWIRE-001 found only by
manual audit — the next contract drift self-reports.
Config: HOOK_HEALTH_ALERT_ENABLED (true), HOOK_SILENT_LOOKBACK_HOURS
(24), HOOK_ACTIVITY_MIN_EVENTS (5). Live-verified: real hook fires land
rows (session metadata, latency); throttle holds; rule SQL positive +
negative branches proven against the real table; UATS hooks_event 3/3.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): mdemg hooks doctor — one-shot hook-channel triage (Epic 5)
11 checks: per-hook template parity (the CI gate's local twin),
settings registration, server healthz, a stdin-contract self-test
piping a real-shape UserPromptSubmit payload through the installed
hook (asserts the always-present synergy footer), alert-file state
(pending/total), and the last hook:prompt-context heartbeat age from
scheduled_job_events (SKIP when TSDB unreachable). Table or --json;
non-zero exit on any FAIL.
Live: 11/11 PASS on this machine ("last fire 5s ago" — fed by the
doctor's own self-test); correctly fails (exit 1) on deliberate drift.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): PORT-TRUTH — loopback bind defaults + sidecar zombie replaced (Epic 6)
Compose published the API on 0.0.0.0 (unauthenticated admin/destructive
routes exposed off-host): now "${MDEMG_BIND_ADDR:-127.0.0.1}:${MDEMG_PORT}
:9999" — wide bind is an explicit opt-in (both compose copies, CI-synced).
Neural sidecar bound 0.0.0.0:8101 via config.py default AND the plist
arg: both now 127.0.0.1 (both plist copies, CI-synced; SIDECAR_HOST env
overrides). Operational: the 39-day-old sidecar process (started
2026-05-02, serving pre-J17-fix code) replaced — fresh process verified
on 127.0.0.1:8101, both models loaded, health 200.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): Tier 3 verification + feature doc + CHANGELOG + close (Epics 7-8)
Live-verified across the sprint: alert backlog drained 50→2 on real
prompts (display-then-clear); evaluator rules 15→16 (hook_channel_silent
loaded); doctor 11/11 + correct failure mode; sidecar fresh on
127.0.0.1:8101 (NLI 234ms). Feature doc docs/features/hook-channel-
health.md (config table incl. MDEMG_BIND_ADDR + SIDECAR_HOST). Findings:
packaging plists are templates (raw copy → launchd exit 78; service
install is canonical); UATS falsy-variant-body inheritance pinned.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(uats): jiminy_guide_sanitized timeout 30s → 90s — stale vs synthesis latency
Caught in the HOOKSYNC-001 full-suite regression: the synchronous
/v1/jiminy/guide includes local-model synthesis (~43s observed quiet,
~50s typical per GUIDANCE-SYNTH-001) — the spec's 30s timeout has been
silently erroring since synthesis latency grew. Aligned with the
JIMINY_WARM_COMPUTE_TIMEOUT_MS budget (90s); hash refreshed; passes
live. Pre-existing — not a HOOKSYNC regression (Guide path untouched).
The other 3 suite errors were load-induced flakes (pass individually):
suite-vs-llama-server slot contention, noted for UXTS-CI-001.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ci): track .claude/hooks/pre-write-check.py so hook-parity check passes
Root cause: the new 'Verify live hooks match hook templates' CI step
(HOOKSYNC-001) diffs every internal/cli/hook_templates/*.{sh,py} against
.claude/hooks/<name>, but the .gitignore allowlist only un-ignored the
5 original hooks. pre-write-check.py gained a template in this sprint
while its live counterpart stayed gitignored, so CI checked out a tree
without it and failed with 'MISSING live hook:
.claude/hooks/pre-write-check.py'.
Fix: add '!.claude/hooks/pre-write-check.py' to the allowlist and commit
the live hook (already byte-identical to its template modulo SPACE_ID),
preserving the full parity guarantee instead of weakening the CI step.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): sprint plan — real weights on the abstraction hierarchy (Epic 0)
Roadmap Q3 Phase 1 rank #3. Live investigation: point.distance() returns
NULL on embedding lists (proven: NULL where vector.similarity.cosine
returns 0.627 on the same pair); 3 creation sites affected incl. an
ABSTRACTS_TO site the audit missed. Scale worse than audited and
growing: 28,332/28,332 GENERALIZES + 36,110/37,996 ABSTRACTS_TO = 64,442
NULL-weight abstraction edges. Neo4j cosine returns [0,1] directly —
drop-in. Plan: fix sites (+ CUIDv2 edge ids), LIMIT-5-then-batched
backfill, null-weight gauge + alert rule via the existing graph-stats →
metric_samples path, UVTS-quick regression guard.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-weight-001): abstraction-edge weights — vector.similarity.cosine replaces point.distance (Epic 1)
point.distance() is a spatial-Point function: on embedding lists it
returns NULL, so every weight at the 3 abstraction-edge creation sites
was never set (100% of GENERALIZES + 95% of ABSTRACTS_TO weightless;
the CASE guards passed on good embeddings, then the THEN expr evaluated
NULL — edges with good embeddings got nothing while embedding-less ones
got the 0.5 fallback). vector.similarity.cosine returns [0,1] directly
(live-verified: identical=1.0, orthogonal=0.5, opposite=0.0). Site 1
(theme GENERALIZES) gains the null-guard it never had.
Also: edge_id randomUUID() → CUIDv2 per the identifier standard, minted
Go-side via memberEdgePairs (Cypher can't generate CUIDv2) and zipped
with member ids for UNWIND. All 3 statements EXPLAIN-validated live.
Tier 1: pair-builder tests (uniqueness, CUID format, empty input).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): mdemg graph backfill-weights — heal 56k NULL abstraction weights (Epic 2)
Standalone subcommand (deliberately NOT folded into `graph repair`,
whose orphan sweep would delete the pre-fix orphan observations the
operator chose to keep). Weight = vector.similarity.cosine(endpoint
embeddings) when both exist, else 0.5 (the creation sites' fallback);
similarity_score set alongside; idempotent (pure function of
embeddings); batched (default 1000/txn) with --limit for trials.
Executed per the small-batch-first rule: dry-run count → LIMIT-5 live
trial → hand-verified (stored ≡ independently recomputed to 6dp) →
distribution preview over 2000 (min 0.704, mean 0.96; the ~50% near-1.0
mass is single-member-cluster degeneracy — centroid ≡ member embedding,
HIDDEN-CHURN-001 territory, faithfully encoded) → full runs. Mid-run
the count GREW: the running server predated Epic 1 and kept minting
NULL edges — restarted on the fixed binary, swept stragglers, then
whk-wms (8,755) + linear (199). Final: 0 NULL / 57,395 edges globally.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): null-weight gauge + regression alert rule (Epic 3)
Query 4 in the graph-stats collector counts NULL-weight GENERALIZES/
ABSTRACTS_TO edges per space → new gauge
mdemg_neo4j_graph_null_weight_edges → metric_samples → evaluator rule
null_weight_abstraction_edges (service graph-weight-integrity, distinct
per the cooldown rule; NULL_WEIGHT_EDGE_ALERT_THRESHOLD default 100,
ForDuration 10m). Steady state post-backfill is 0; sustained
reappearance = the point.distance bug class regressed at a creation
site — it self-reports instead of waiting for the next audit.
Live: evaluator rules 16→17; gauge rows persisting at value 0 across
all spaces.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ingest): config-driven consolidation timeout — was sharing the 300s batch …
* feat(jiminy): embedding-similarity constraint-code matching (JIMINY-OUTCOME-001 Epic 1)
Revives the Neo4j GUIDANCE_OUTCOME edge sink (dormant since Apr 12).
Root cause (RRF-SCALE-001 Follow-up A): matchConstraintCode links
guidance items to constraint codes by keyword overlap (>=3 shared
words), but retrieval surfaces emergent_concept abstractions whose
content rarely shares 3+ literal words with raw constraint text -> no
code -> PersistGuidanceOutcome falls back to the concept SourceNode ->
the role_type=constraint filter rejects it -> no edge.
Fix: new matchConstraintCodeByEmbedding queries the constraint vector
index (db.index.vector.queryNodes, role_type=constraint, sim >=
threshold) and returns the closest constraint's code. Guide() tries
this first, falling back to the keyword matcher when the embedder is
unavailable, content is empty, or nothing clears the threshold — never
regresses. The existing PersistGuidanceOutcome + findConstraintNodeID
then create the edge on the correct constraint node.
Implementation refinement vs plan: uses Neo4j's vector index server-side
(mirrors the proven Evaluator.findMatchingConstraints pattern) rather
than loading all constraint embeddings into Go and computing cosine in a
loop — cleaner, no constraintCodeEntry.Embedding needed. Same Option-1
outcome.
Config: JIMINY_CONSTRAINT_CODE_SIM_THRESHOLD (default 0.55, zero-value
fallback) — provisional; tuned against the live similarity distribution
in Epic 2.
Tier 1 (4 tests): nil-driver/empty-embedding guards, threshold default
resolution, keyword-fallback non-regression. Full jiminy + config
suites green; lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(jiminy-outcome-001): Tier 2 integration + Tier 3 live verification (Epic 2)
Tier 3 live e2e (verification.md) — acceptance bar MET:
- /v1/jiminy/guide now yields guidance items carrying constraint_codes
(10 items, 6 coded; was 0). Matched code 'no-direct-main-commits' is
semantically exact for the 'commit to main' context.
- Full warm->latest->feedback loop: Neo4j GUIDANCE_OUTCOME 893 -> 899
(+6), latest today. All 6 new edges land on REAL role_type=constraint
nodes ('CONSTRAINT: NEVER commit directly to main') — not
emergent_concept. The sink dormant since Apr 12 is revived on the
correct nodes.
- /v1/constraints/effectiveness reflects it: 'NEVER commit directly to
main | surfaced: 30 followed: 28 rate: 0.93'.
- Both sinks now revived: TSDB (RRF-SCALE-001) + Neo4j (here). The
constraint-effectiveness loop is fully restored.
Threshold 0.55 validated live: correct matches, no false positives.
Tier 2 (jiminy_outcome_test.go, integration tag, skip-on-empty): PASSES
on a populated stack with an idle LLM (7/10 items coded). The guide path
is LLM-latency-dependent (per-node classifier ~31s/call, serialized; a
call fired while the LLM is busy fast-fails empty), so the test
warm-retries and SKIPS (never false-fails) when the LLM path can't
produce items. Bonus check; Tier 3 is the definitive proof. The LLM
serialization/synthesis-timeout is RRF-SCALE-001 Follow-up B, tracked
separately.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-outcome-001): CHANGELOG + CLAUDE.md + post.md (Epic 3)
Final epic. CHANGELOG Unreleased gains the JIMINY-OUTCOME-001 Fixed
entry. CLAUDE.md gains a guidance-outcome constraint-code-matching note
(embedding-first via vector index, keyword fallback; both outcome sinks
now live). post.md: epic-by-epic, acceptance check-off, the loop-revival
completion (TSDB from RRF-SCALE-001 + Neo4j here), discipline notes (LLM
serialization is the test-flakiness source), forward-looking (Follow-up
B now the most operationally-visible remaining issue).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(guidance-synth-001): sprint plan — fix guidance synthesis timeout (Follow-up B)
Synthesis fails on every production warm call (6/6 jiminy.synthesize
errored). Root cause: the hook's /warm path runs background Guide() with
a hardcoded 30s timeout (handlers_jiminy.go:302), inside which the
per-node constraint classifier runs SERIALLY (~1.5s x ~10 nodes ~= 15s),
leaving only ~15s for synthesis which needs 8-27s -> deadline exceeded.
JIMINY_TIMEOUT_MS=240s is configured but the 30s hardcode caps it.
Fix (both): (1) parallelize the per-node classifier with bounded
concurrency (CONSULTING_CLASSIFY_CONCURRENCY, default 4 matching
llama-server --parallel 4); (2) config-drive the warm timeout
(JIMINY_WARM_COMPUTE_TIMEOUT_MS, default 90s). Acceptance: synthesis
succeeds live (no synthesis_error), measured latency drop, no
constraint-surfacing regression.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(consulting): parallelize per-node constraint classifier (GUIDANCE-SYNTH-001 Epic 1)
The per-node LLM constraint classifier in findApplicableConstraints ran
serially (~1.5s/node x ~10 nodes ~= 15s), starving guidance synthesis of
its time budget (synthesis 6/6 errored on the warm path). Now classifies
with bounded concurrency.
- Gate-first (RRF-SCALE-001 score floor) to fix a stable candidate order,
then classify each candidate into a position-indexed slot via a
semaphore-bounded worker pool, then collect-in-order + dedup-by-name —
output is identical to the serial path (determinism). Keyword-only
(no LLM) or cap=1 runs serially (no LLM latency to hide).
- Config: CONSULTING_CLASSIFY_CONCURRENCY (default 4, matching
llama-server --parallel 4; floor 1 = serial rollback). Zero-value
fallback to 4.
- Extracted constraintClassifierIface (minimal Classify surface) so the
concurrent path is unit-testable with a fake; *ConstraintClassifier
satisfies it; SetConstraintClassifier guards against a typed-nil
interface.
Tier 1 (5 new, -race clean): ParallelEqualsSerial (determinism + order),
ParallelIsFaster (concurrency overlaps latency), ErrorFallsBackToKeyword
(fallback intact), ScoreGateStillApplies (RRF-SCALE-001 gate preserved),
ConcurrencyDefaultFallback. Existing findApplicableConstraints tests
unchanged — no regression.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(jiminy): config-drive warm-compute timeout (GUIDANCE-SYNTH-001 Epic 2)
The warm-path background Guide() ran with a hardcoded 30s timeout
(handlers_jiminy.go:302) even though JIMINY_TIMEOUT_MS=240s is
configured. 30s was too tight for the per-node classifier (~15s) +
synthesis (8-27s) -> synthesis deadline-exceeded every warm call.
Replaced with JIMINY_WARM_COMPUTE_TIMEOUT_MS (default 90000, zero-value
fallback 90000) — headroom for the now-parallel classifier (~7.5s) + a
slow 27s synthesis. No-hardcoding rule. Rollback: set to 30000.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test+docs(guidance-synth-001): Tier 2/3 verification + docs + close (Epic 3)
Tier 3 live e2e (verification.md): the warm production path now produces
a synthesized narrative — synthesis_used=true, no synthesis_error,
1892-char augmentation. Fresh jiminy.synthesize succeeded at 50.7s
latency (fit the new 90s budget; would die at the old 30s — validates
the default). Both fixes needed.
Tier 2 (guidance_synth_test.go, integration, skip-on-empty + LLM-
tolerant): PASS — warm path produces guidance without synthesis_error.
Docs: CHANGELOG Fixed entry; CLAUDE.md guidance-synthesis-budget note
('when adding LLM calls to the guidance hot path: respect the
warm-compute budget and prefer bounded concurrency over serial loops');
post.md with the data-driven diagnosis + forward-looking (Follow-up C
now the last open item).
Closes Follow-up B. The guidance pipeline (surfacing + codes +
synthesis) is fully functional end-to-end.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(followup-c): close JSON control-char escaping as NON-ISSUE (no fix)
Follow-up C (the last open item from RRF-SCALE-001 triage) investigated
and closed with evidence — NO code change, because there is no bug to fix.
The earlier /v1/jiminy/latest parse failures were client-side shell
artifacts (co-occurring with the session's 'failed to change group ID'
errors + ad-hoc variable-capture piping), not server bytes:
- writeJSON uses json.NewEncoder().Encode (encoding/json always escapes
control chars U+0000-U+001F); no raw-write bypass; no custom MarshalJSON.
- The synthesized narrative is double-StripControlChars'd (synthesizer.go
:127 + service.go:1116).
- prompt-context.sh already strips control chars via perl before jq, with
2>/dev/null + // empty fallbacks.
Live-verified: the hook's exact jq returns guidance_id correctly; 5 rapid
/latest fetches all parse as strict-valid JSON; 0 raw control chars.
Per 'don't fix a non-problem', shipping a fix would invent a bug that
doesn't exist. Closure documented in docs/development/followup-c-closure.md.
This closes the entire RRF-SCALE-001 follow-up triage: A (JIMINY-OUTCOME
-001), B (GUIDANCE-SYNTH-001), C (non-issue). The guidance->feedback->
outcome loop is fully functional end-to-end.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(jiminy): /guide 30s timeout sibling + single-source config defaults (GUIDANCE-SYNTH-001 fix-commit)
Two things, both from a live e2e of the full loop through the real
production hook path (run per user directive: standard tests don't find
live problems).
1. Sibling bug: the /v1/jiminy/guide handler had the SAME hardcoded 30s
cap as the warm path (handlers_jiminy.go). GUIDANCE-SYNTH-001 fixed
warm; /guide still deadline-exceeded synthesis at exactly 30.003s
(this is what made prior sprints' /guide integration tests flaky).
Now uses the config-driven budget. Live-verified: a 50.05s synthesis
completed (synthesis_used=true) — would die at 30s.
2. Single source of truth for config defaults (user directive: 'single
place to change all instances'). The 90s budget was duplicated as a
literal in 3 sites; prior sprints similarly duplicated each default
(the sigmoid 0.45/8.0 was in 3 places). Now each default is one
exported config.Default* const, referenced by FromEnv and aliased by
consuming-package fallbacks + a Config.JiminyWarmComputeTimeout()
method. Consolidated: warm-compute timeout, the 3 consulting score
floors, sigmoid midpoint/steepness, constraint-code sim threshold,
classify concurrency. Zero behavior change (compile-time aliases);
-race + full suites green.
Live e2e also re-confirmed: real hook captures guidance_id -> feedback
-> +7 Neo4j GUIDANCE_OUTCOME edges on real constraint nodes + 10 TSDB
rows (whole loop closes through the actual hook; re-confirms Follow-up C
non-issue).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-cli-001): sprint plan — federation consumer CLI + UATS backfill
Builds the first consumer for EVENTGRAPH-001's reinforcement-neighborhood
federation API (which has no consumer): a 'mdemg eventgraph' CLI command.
Validates the Pattern Y1 bet + becomes the live-testing harness for
EVENTGRAPH-002/003 (user directive: build the consumer first).
Per the UxTS directive: maps the work to the frameworks. UATS applies to
the federation HTTP API -> add eventgraph_reinforcement_neighborhood.uats
.json (backfilling the -001 gap; the endpoint shipped with no UATS),
which replaces an ad-hoc Go integration test as the Tier 2 contract test.
UVTS/UBENCH N/A. UOTS panel-spec gap noted as a follow-up (out of scope).
CLI rendering -> Tier 1 Go units.
4 epics; CLI (--seed/--query/--hops/--since/--limit/--json) renders
summary + events table or JSON; server-driven defaults (no re-hardcoding);
read-only. ~1-1.5 dev-days.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-cli-001): mdemg eventgraph reinforcement-neighborhood (Epic 1)
First consumer of the EVENTGRAPH-001 federation API — POSTs to
/v1/eventgraph/reinforcement-neighborhood, renders a summary + events table
(or --json). Supports --seed, --query (resolves seed via /v1/memory/retrieve
top-1), --hops, --since, --limit. Unset flags are omitted from the request
so the server applies its config defaults (no re-hardcoding of hops/since/limit
in the CLI). Registered under the "advanced" command group.
Tier 1 (httptest, -race clean): request-mapping omit-when-unset + conversion,
--query seed resolution, no-results + invalid --since + surfaced-503 errors,
render (empty + table), helpers.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(eventgraph): neighbor_node_ids serializes as [] not null for empty neighborhood
Caught in EVENTGRAPH-CLI-001 live contract testing (standard code tests
missed it; the live UATS happy-path against the running server did not):
walkNeighborhood returns a nil slice when the seed has no neighborhood
(e.g. an unknown seed), which JSON-marshals to `null`, while Events is
defensively initialized to []. Both are array fields and must serialize
consistently — null breaks any consumer asserting an array type (incl. the
new UATS contract's `type_is array` on $.neighbor_node_ids).
EventsInGraphNeighborhood now coalesces the nil slice to []string{}.
Tier 1 TestFederationResult_EmptyArraysNotNull pins the JSON contract.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(eventgraph-cli-001): UATS contract spec for federation API (Epic 2)
Backfills the UATS gap EVENTGRAPH-001 left (no contract test for
/v1/eventgraph/reinforcement-neighborhood). 6 cases, validated 6/6 live
against the running server:
- happy 200: asserts the response contract shape (events/neighbor_node_ids
arrays, graph_hops/tsdb_rows_scanned numbers, truncated boolean) — robust
to data, works even with an unknown seed (empty neighborhood is valid 200)
- missing_space_id / missing_seed_node_id → 400 (empty-string override, since
the runner deep-merges variant body over base — key omission can't unset)
- negative_hops → 400, hops_over_ceiling (999 > 2×default) → 400
- method_not_allowed (GET) → 405
sha256 integrity hash added + verified.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-cli-001): live verification + feature doc + CHANGELOG + close (Epic 3)
Tier 3 live e2e verified the real binary against the real stack: --query
surfaced 20 reinforcement events in a 5-node neighborhood (demonstrating the
Hebbian-write → federation-read loop closing in one command); --seed/--json/
--limit/unknown-seed/no-arg paths all verified live. Feature doc gains the CLI
consumer section; CHANGELOG Added + Fixed entries; CLAUDE.md architecture note;
verification.md + post.md (UxTS mapping: UATS done, UOTS follow-up carried over).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(eventgraph-cli-001): tag UATS spec 'tsdb' so CI skips it without TSDB
CI Test failed: the UATS contract step boots a minimal server without TSDB,
so the eventgraph service is nil and every POST returns 503 "service not
initialized" instead of the expected 200/400 (only GET→405 passed, since the
method check precedes the service check). Same class as PR #404. The
federation endpoint genuinely requires TSDB (it queries reinforcement_events;
the service is nil without TSDB at boot), and CI's UATS step already excludes
`tsdb`-tagged specs (ci.yml --exclude-tag ...,tsdb). Added "tsdb" to api.tags
(matching metrics_snapshot/readyz_tsdb); re-hashed. Verified locally: the spec
now reports Status: skip under the exact CI exclude filter, and still 6/6 live
against the full stack via explicit --spec.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): sprint plan — guidance-outcome federation (Epic 0)
Federate the guidance-outcome event stream (Pattern Y1, second event class):
walk a constraint's Neo4j neighborhood, surface time-windowed constraint_outcomes
(followed/ignored/contradicted) for the constraint + its graph-related constraints.
Data-decided architecture: reuse the existing constraint_outcomes table (no new
hypertable/writer/enqueue site — RRF-SCALE-001 already populates it, 1176 live
rows); join graph↔events on constraint_code (TSDB constraint_id UUID ≠ Neo4j
node_id CUID — code is the only viable key). One additive migration (V0023:
constraint_code index, schema 22→23). 8 epics, 3 testing tiers, live Tier 3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): V0023 constraint_code index on constraint_outcomes (Epic 1)
Adds idx_constraint_outcomes_code (space_id, constraint_code, time DESC) — the
guidance-outcome federation joins graph↔events on constraint_code (TSDB
constraint_id is a UUID that doesn't match the Neo4j node_id CUID; code is the
only viable key), and migration 011 indexed only space/constraint_id/outcome.
Partial index (constraint_code NOT NULL AND <> '') skips uncoded outcomes.
Bumps TSDB_REQUIRED_SCHEMA_VERSION default 22→23 (config.go) to match the
migration count — CI schema-version validator gates on this. Additive, no data
change, idempotent.
Live-verified: migration applies (schema 22→23), idx present, re-apply is a
no-op, config/tsdb tests green, CI schema check 23=23.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): GuidanceOutcomesInNeighborhood federation method (Epic 2)
Second Pattern Y1 federation: walk a constraint's Neo4j neighborhood, collect
each neighbor's constraint_code, and join constraint_outcomes on those codes
(backed by the V0023 index). walkNeighborhoodWithCodes returns the neighborhood
node IDs + a code→node map; queryGuidanceOutcomes pulls coded outcomes in the
window; Go-side join resolves each outcome's code → its neighborhood constraint
node. Non-nil slices from the start (EVENTGRAPH-CLI-001 lesson). Reuses the
existing constraint_outcomes sink — no new table/writer.
Tier 1 (-race): validation guards, empty-arrays-not-null, sortedKeys
determinism, join resolution. Tier 2 integration (live Neo4j+TSDB): full
round-trip — hops=1 (seed+related codes, off-neighborhood excluded), hops=0
(seed code only), unknown-seed (empty non-nil). PASS.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): guidance-outcome federation handler + route (Epic 3)
POST /v1/eventgraph/guidance-outcome-neighborhood — walk a constraint's
neighborhood, surface constraint_outcomes whose code is in the neighborhood.
Same gating/auth/default convention as the reinforcement endpoint.
Single-source refactor (per the dynamic-variables directive): extracted the
shared gate (method/enabled/service → eventgraphGate) and default-resolution
(hops/since/limit + ceiling → resolveFederationDefaults) into helpers used by
BOTH handlers, so the federation rules live in exactly one place. The
reinforcement handler now calls them too — verified no regression (reinforcement
UATS still 6/6 live, unit tests green).
Live-verified: seeding from the real 'no-direct-main-commits' constraint node
surfaced real 'followed' outcomes with constraint_node_id resolved to the seed
and in_neighborhood=true.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): mdemg eventgraph guidance-outcome-neighborhood CLI (Epic 4)
Sibling subcommand consuming POST /v1/eventgraph/guidance-outcome-neighborhood.
Walks a constraint's neighborhood and renders guidance outcomes (followed/
ignored split + table: code · outcome · sim · g_type · guidance_id · recorded)
or --json. Seed via --seed/--query (--constraint-code seeding deferred — needs
server-side code→node resolution; --query covers discovery). Unset hops/since/
limit omitted so the server applies config defaults (single source of truth).
Tier 1 (-race): request-mapping omit-when-unset + conversion, --query seed
resolution, surfaced-503 error, render (empty + followed/ignored table),
truncStr. Help renders.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(eventgraph-002): UATS contract spec for guidance-outcome federation (Epic 5)
6 cases, validated 6/6 live: happy-200 response shape (outcomes/
neighbor_node_ids/neighbor_constraint_codes arrays, graph_hops/tsdb_rows_scanned
numbers, truncated boolean), missing space_id/seed → 400 (empty-string override
under deep-merge), negative_hops → 400, hops_over_ceiling → 400, GET → 405.
Tagged 'tsdb' so CI skips it without TSDB (the EVENTGRAPH-CLI-001 lesson).
sha256 hashed + verified.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): Tier 3 live verification (Epic 6)
Real binary against the real stack. Key assertion: CLI --json output matches
direct constraint_outcomes SQL exactly (11 outcomes = 11, all followed) for the
no-direct-main-commits constraint. --seed/--query/--limit/--json/unknown-seed/
no-arg all verified live. The --query "0 outcomes" result was traced to SQL
ground truth — the 5 neighborhood codes genuinely have no feedback, so it's
correct (federation distinguishes "code in neighborhood" from "code has
outcomes"), not a join bug. Reinforcement endpoint un-regressed by the shared-
helper refactor (UATS 6/6).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): feature doc + CHANGELOG + CLAUDE.md + close (Epic 7)
Feature doc gains a Guidance-Outcome Federation section (why reuse
constraint_outcomes, why join on constraint_code, CLI usage) + forward-look
update. CHANGELOG Added (endpoint + CLI) + Changed (TSDB schema 22→23).
CLAUDE.md architecture note extended. post.md closes the sprint with UxTS
mapping + follow-ups (--constraint-code seeding, EVENTGRAPH-003, UOTS).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(metrics,backup): resolve docker binary robustly under minimal launchd PATH
The native server (launchd) inherits PATH=/usr/bin:/bin:/usr/sbin:/sbin, which
excludes the Docker Desktop symlink (/usr/local/bin/docker). So every
server-runtime `docker` shellout failed with "executable file not found in
$PATH": (1) Neo4j container CPU/mem stats (server.go) logged an ERROR every 60s
and left the neo4j_container_* gauges empty — so the neo4j_high_cpu/_memory
alert rules had no data; (2) the TSDB backup scheduler's `docker compose
pg_dump` (backup.go) failed with only a slog.Warn. The DATA PLANE was never
affected — Neo4j (Bolt) + TSDB (pgx) connect over mapped TCP ports, not the
docker CLI.
Fix (durable, configurable, single-source): new internal/dockerbin resolver —
MDEMG_DOCKER_BIN env override → exec.LookPath → well-known install locations
(/usr/local/bin, /opt/homebrew/bin, /usr/bin) → graceful unavailable. Wired
into server.go (stats) + backup.go (both compose calls). The perpetual 60s
ERROR is downgraded to a one-shot WARN when docker is genuinely absent (it's
optional telemetry). Added a sane PATH to the launchd server plist template as
defense-in-depth.
Live-verified: after restart, mdemg_neo4j_container_cpu_percent=0.59 /
mem_percent=29.13 now land in metric_samples (were absent); no more docker-stats
ERROR; `docker stats` + backup resolve docker under a simulated minimal PATH.
Note: `mdemg data export-auto` (training-export job) was NOT a victim — it
exports via network SQL, not docker (corrected from an earlier assumption).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(nosilent-001): sprint plan — fail-loud scheduled jobs (Epic 0)
Triggered by a live-discovered silent failure: the TSDB backup scheduler was
failing every 24h run (docker-under-launchd-PATH) with only a buried slog.Warn.
Docker cause fixed (4cc7608); this sprint fixes the class — scheduled jobs that
fail with no record + no alert. V0024 scheduled_job_events hypertable + writer,
jobhealth.Report (record + alert on failure), wire the 3 jobs (backup,
maintenance, export-auto), 2 evaluator rules (backup staleness + recent
failure) so the server catches "job failed OR never ran". Config-driven, 3
testing tiers, live Tier-3 induced-failure.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): V0024 scheduled_job_events + writer (Epic 1)
Hypertable (job_name, success, latency_ms, error_message, metadata jsonb,
recorded_at) + RecordJobEvent synchronous single-row writer (mirrors V0021
model_install pattern). Indexes: per-job freshness, partial failed, per-space.
One row per scheduled-job run so the alert evaluator can detect "job failed"
AND "job never ran" (staleness). Schema 23->24; TSDB_REQUIRED_SCHEMA_VERSION
bumped to match migration count (CI check 24=24).
Tier 1 (-race): field mapping, optional-nulls, error truncation, nil-pool
no-op, insert-error propagation. Tier 2 (live TSDB): round-trip + the staleness
(recent successes) + failure (recent failures) query shapes the rules will use.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): record + alert on scheduled-job outcomes (Epic 2)
New internal/jobhealth.Report — the single policy point: record a
scheduled_job_events row and fire a high-severity "scheduled-job" alert on
failure (both pool + dispatcher nil-safe). Wired into all three jobs:
- TSDB backup scheduler (internal/tsdb/backup.go): decoupled JobResultFunc hook
(mutex-guarded, -race clean) so internal/tsdb stays free of internal/alert;
server.go::SetTSDBClient sets it with the pool + s.alertDispatcher. A failed
or never-run backup now records + alerts instead of a silent slog.Warn.
- export-auto + maintenance (CLI): deferred reportScheduledJob on the named
return error — opens a short-lived pool + a file-backed dispatcher (same
~/.mdemg/alerts/current.json the hooks surface) so a separate-process CLI job
still alerts the operator.
Tier 1 (-race): jobhealth fires alert only on failure (real file-backend
dispatcher), nil-safe. Live smoke: export-auto recorded success=t latency=3050ms.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): scheduled-job staleness + failure alert rules (Epic 3)
Two server-native evaluator rules over V0024 scheduled_job_events:
- scheduled_job_recent_failure (always on): any job failure in the last
JOB_FAILURE_LOOKBACK_MIN (default 60) → high alert.
- backup_no_recent_success (gated on TSDB_BACKUP_ENABLED): zero successful
tsdb-backups within the staleness window → high alert. THIS is the "job
never ran" guarantee — it fires from the server observing ABSENT success, so
a backup that silently died or never started is caught, not just one that
errored.
Window derives from the real backup interval × 2 (JOB_BACKUP_STALENESS_HOURS
override; no hardcoded literal). JOB_HEALTH_ALERT_ENABLED master gate (default
true). Appended after DefaultRules() in serve.go.
Tier 1: failure rule always present (gt 0), staleness gated on backups-enabled
(lt 0.5), windows reflect config, non-positive fallback. Build/lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(nosilent-001): distinct services for job rules so neither masks the other
Caught in live Tier-3 testing: both evaluator rules used Service="scheduled-jobs",
and the dispatcher cooldown key is (Service, Severity) — so the failure alert's
cooldown SUPPRESSED the staleness alert (only the failure fired). One alarm
masking another is the exact silent-failure class this sprint kills. Fixed:
scheduled-job-failure / scheduled-job-staleness distinct services. Re-verified
live — both fire independently and land as distinct alert-file entries. Tier-1
assertion pins that the two services differ. Includes Tier-3 verification.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(nosilent-001): feature doc + CHANGELOG + CLAUDE.md + close (Epic 4)
Feature doc docs/features/scheduled-job-health.md (why / two mechanisms /
operator view / config). CHANGELOG Added (NOSILENT-001) + Changed (schema 23→24)
+ Fixed (docker-under-launchd-PATH). CLAUDE.md Service Alert System extended
with the scheduled-job-health note + the distinct-Service-per-rule cooldown
caveat. post.md closes the sprint (UxTS mapping + follow-ups).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(nosilent-001): sync embedded launchd server plist with source (CI)
CI "Verify embedded launchd templates match source" diffs packaging/launchd/*
against internal/cli/launchd_templates/* (the embed.FS copy mdemg service
install uses). The PATH addition landed only in the source copy; sync the
embedded copy so they match byte-for-byte.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(roadmap): add jiminy-governance skill build-out (Workstream C, Action 7)
Records the jiminy-governance Claude Code skill on the active forward roadmap
(SPRINT_ROADMAP_POST_FT_LORA.md, cross-cutting governance) + brings the source
spec into the repo (docs/development/jiminy-governance-skill/SKILL.md, out of
~/Downloads). The skill makes Jiminy the deterministic source of context +
governance over J17, enforced by the PreToolUse hook — a routing/handshake shim,
not a rulebook. Build-out scope notes the wire-up placeholders that must be
resolved against the real instance (Jiminy MCP/endpoint, PreToolUse hook, J17
ack/RetireCode/GUIDANCE_OUTCOME calls). Aligns with the now-live guidance loop
(RRF-SCALE-001 / JIMINY-OUTCOME-001 / GUIDANCE-SYNTH-001).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-governance): resolve skill wire-up against the real instance
Step 1 of the jiminy-governance build-out (roadmap Workstream C Action 7).
Resolved all five placeholders from the running MDEMG instance, verified live:
- Jiminy query: MCP `mdemg mcp` (stdio) → jiminy_guide/validate_changes; HTTP
/v1/jiminy/guide (returns guidance_id) + /bootstrap glossary (j17v1) + /latest.
- PreToolUse: pre-bash-check.py (Bash, fail-closed) + pre-write-check.py
(Write/Edit → /v1/jiminy/classify, /strict-only, fail-open).
- SessionID: claude-core convention.
- Comprehension ack: /v1/jiminy/protocol/feedback (verified ingested 1).
- GUIDANCE_OUTCOME: /v1/jiminy/feedback {guidance_id,…}.
- RetireCode: internal-only by design (RSIC/APE protocol-evolution) — no
agent-facing call; agent must never self-retire a constraint.
Also surfaced the two real integration gaps the build-out must close (the work,
not the prose): (a) the MDEMG MCP server is NOT registered (.mcp.json absent —
context is pushed by prompt-context.sh, not pulled by the agent); (b) PreToolUse
enforcement is /strict-gated + fail-open, so not deterministic-by-default.
Roadmap Action 7 updated to reflect step 1 done + the two gaps as next steps.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(jiminy-governance): ship the J17 governance skill + register MDEMG MCP
Build-out of roadmap Workstream C Action 7 (steps 2-4), per the resolved
wire-up. Closes the two integration gaps found while resolving:
- gap 1 (MCP not registered): .mcp.json registers `mdemg mcp` (stdio).
Live-probed: 20 tools incl. jiminy_guide + validate_changes — the agent can
now PULL guidance, not only receive the hook push.
- gap 2 (enforcement /strict-gated + fail-open): policy set — Write/Edit J17
gate kept fail-open (hard server dependency on every edit is too brittle);
the skill's handshake auto-enables /strict so the gate is active per session.
Bash gate already fail-closed (demonstrated live when a test payload with a
destructive force-push string was blocked by pre-bash-check.py).
Skill authored at the canonical .claude/skills/jiminy-governance/SKILL.md
(frontmatter valid; concrete wire-up inline — MCP tools + HTTP endpoints +
SessionID claude-core + the 5-step handshake). Kept a routing/handshake shim,
not a rulebook (rules stay in the graph).
Live Tier-3 PASSED: full handshake identify->request->comprehend->act->report
ran against the real instance; GUIDANCE_OUTCOME edges 906->909 (one per coded
constraint). Verification: docs/development/jiminy-governance-skill/.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-governance): commit install-ready skill + install README
.claude/ is gitignored (per-developer local config), so the installed skill at
.claude/skills/jiminy-governance/SKILL.md is local-only by convention. Commit
the reproducible, install-ready copy (jiminy-governance.skill.md) + a README
with the one-line install (cp into .claude/skills/) so the skill propagates via
the repo. The MCP server it uses is registered in the tracked repo-root
.mcp.json.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): per-conversation SessionID across hooks + skill
Hooks and the jiminy-governance skill hardcoded session_id="claude-core", so
trust/escalation/observations from ALL Claude Code conversations collapsed into
one shared MDEMG session. Claude Code already passes a per-conversation
session_id on stdin to every hook — the implementation just never used it.
Resolver precedence (single rule everywhere): MDEMG_SESSION_ID env (stable-
identity escape hatch) > Claude Code stdin session_id (per-conversation default,
race-free per hook) > ~/.mdemg/.claude-session (published by SessionStart +
UserPromptSubmit for the agent / stdin-less contexts) > claude-core (fallback).
Realizes J17's intended per-(session,constraint) isolation.
Tracked templates updated (internal/cli/hook_templates/): session-start.sh,
prompt-context.sh, post-tool-observe.py, pre-compact.sh + Windows .ps1 variants
— every hardcoded claude-core in MDEMG calls replaced with the resolved id;
session-start/prompt-context publish the session file. Skill SessionID
instruction + handshake steps now resolve <SessionID> instead of claude-core.
Live-verified: a hook resolved a stdin session_id and published the session
file; post-tool-observe wrote an observation keyed to the per-conversation id in
Neo4j (not claude-core). bash -n / py_compile clean; go build + hooks test pass.
Note: pre-write-check.py is local-only (no tracked installer template); its fix
is applied on this machine but won't propagate via `mdemg hooks install` until
it's added to the tracked hooks (follow-up). .claude/* is gitignored, so the
live hook copies aren't committed — the tracked templates propagate via install.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): per-conversation SessionID — installed hook copies
The .claude/hooks/ installed copies are tracked (committed before the .claude/*
ignore), so apply the same SessionID resolver here as in the embedded templates
(prior commit). session-start.sh, prompt-context.sh, post-tool-observe.py,
pre-compact.sh now resolve MDEMG_SESSION_ID env > stdin session_id >
~/.mdemg/.claude-session > claude-core. (pre-write-check.py is untracked /
local-only — its fix lives on this machine only; tracking it is the follow-up.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): add pre-write-check.py to the tracked installer
The /strict J17 Write/Edit classify gate (pre-write-check.py) was local-only —
no tracked template, so `mdemg hooks install` never installed it and its
SessionID fix wouldn't propagate. Add it as a tracked template
(internal/cli/hook_templates/pre-write-check.py, space_id → {{SPACE_ID}},
runtime URL discovery — no {{MDEMG_URL}} placeholder per the template
convention) and register it in claudeHookFiles() as
{PreToolUse, 8s, "Write|Edit"}. hooks_test expectations updated 5→6.
go build + hooks test + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* release: cut v0.10.1
Promote CHANGELOG [Unreleased] → [0.10.1] - 2026-06-08. Adds the
jiminy-governance skill + MDEMG MCP registration and the per-conversation
SessionID work to the release notes (alongside the already-logged EVENTGRAPH-002,
EVENTGRAPH-CLI-001, NOSILENT-001, the docker-PATH fix, and the TSDB schema
22→23→24 bumps). Fresh empty [Unreleased]; comparison link refs updated +
backfilled through v0.10.1.
The v0.10.1 git tag (triggers release.yml artifact build) + homebrew formula
bump are the operator release-cut step on main, post-merge.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs: governance system doc + bring cli/api references current
(1) New docs/features/jiminy-governance.md — detailed how-it-works + full file
inventory for the J17 agent-governance system (skill, hooks, SessionID, MCP,
enforcement, runtime state files, install/verify steps).
(2) docs/user/api-reference.md — add the Event Graph Federation section
(POST /v1/eventgraph/{reinforcement,guidance-outcome}-neighborhood) + TOC entry;
these were the only two missing endpoints (audited the full route table).
(3) docs/user/cli-reference.md — add mdemg eventgraph {reinforcement,guidance-
outcome}-neighborhood, model run, watchdog status, migrate context-fingerprint,
data curate/validate/clean; fix the stale `model pull --adapter` description
(MODEL-DIST-002 shipped — no longer "deferred/errors"); update the Command Tree
Summary to match.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* chore(submodule): bump homebrew-mdemg to v0.10.1 formula
Point the parent at the manually-published v0.10.1 homebrew formula
(reh3376/homebrew-mdemg@10c1843). The release artifacts published cleanly;
the formula update was manual because the CI HOMEBREW_TAP_TOKEN expired
(follow-up: rotate the secret so future releases auto-publish).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): sprint plan — reinforcement coverage for other Hebbian paths
Wire the 3 remaining Hebbian write paths (CoactivateSession,
ApplySymbolCoactivation, ApplyNegativeFeedback weaken-only) into the existing
reinforcement_events writer via distinct trigger_path values. No schema/writer/
wiring change (V0022 already has trigger_path + signed delta_weight +
created_new_edge; writer already injected). Contradict path deferred (CONTRADICTS
edges aren't traversed by the federation walk). RETURN-only Cypher edits; Tier-2
asserts unchanged weights. 5 epics, 3 tiers, live Tier-3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire CoactivateSession into reinforcement_events (Epic 1)
CoactivateSession (session-internal conversation-observation co-activation, full
Hebbian formula) now emits per-pair reinforcement events with
trigger_path=coactivate_session. RETURN-only Cypher change: replaced the
discarded `count(*)` with the standard 17-field per-pair RETURN (one row per
forward edge; reverse is a mirror). Weight SET untouched → update behavior
provably unchanged. Mirrors the proven ApplyCoactivation record loop; writer
already injected. EXPLAIN-validated (compiles, all RETURN vars in scope, no
writes); build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplySymbolCoactivation into reinforcement_events (Epic 2)
SymbolNode-pair co-activation now emits trigger_path=apply_symbol_coactivation
rows. Split the weight update out of the ON MATCH clause into a separate SET so
the pre-update weight (w) can be captured for prev/new/delta — createdNew
(evidence_count=1) keeps a fresh edge at 0.1 and increments matches by +0.05,
preserving the original ON-clause weight behavior exactly. eta/surprise/
activation/path_sim are NULL (N/A for symbols); roles default 'symbol_node'.
EXPLAIN-validated; build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplyNegativeFeedback weaken path → reinforcement_events (Epic 3)
The weaken path (existing CO_ACTIVATED_WITH edge weakened by negWeight) now emits
trigger_path=apply_negative_feedback rows with a NEGATIVE delta_weight and
created_new_edge=false. The FOREACH writes (weaken SET + contradict MERGE) are
untouched; only the RETURN changed from aggregated `action,count(*)` to per-pair
rows — the Go side counts rows (sum = grouped count, NegativeFeedbackResult
preserved) and emits reinforcement events for weaken rows only. prevWeight is
captured before the FOREACH SET. Contradict path deliberately not emitted
(CONTRADICTS isn't traversed by the federation walk; deferred). EXPLAIN-validated;
build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(conversation): inject learning service so CoactivateSession actually runs
Discovered via EVENTGRAPH-003 live smoke: session co-activation
(CO_ACTIVATED_WITH edges between same-session conversation observations) had
NEVER fired — 0 such edges ever in mdemg-dev across 5495 conversation
observations. Root cause: conversation.NewServiceWithConfig sets
learningService=nil ("set via SetLearningService to avoid circular dependency"),
but SetLearningService had NO caller, so the `if s.learningService != nil` guard
in Observe() always skipped CoactivateSession. The function + its Cypher were
correct (verified by running it directly: 3 pairs, proper Hebbian weights) —
it was just never invoked.
Fix: convSvc.SetLearningService(lea) at construction. Live-verified: 3 distinct
observations in a session now create 6 CO_ACTIVATED_WITH edges + emit
coactivate_session reinforcement events. Standalone fix-commit per the
live-smoke precedent (surprise bugs don't get rolled into the sprint commit).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): Tier 3 verification + feature doc + CHANGELOG + close (Epic 4)
All four trigger_paths live-verified (apply_coactivation 50, apply_symbol_
coactivation 1000, apply_negative_feedback 1 negative-delta, coactivate_session
4 after the dormancy fix); federation CLI surfaces them. Feature doc updated to
all-four-paths + the trigger_path table; CHANGELOG Added (EVENTGRAPH-003) + Fixed
(CoactivateSession never-invoked); CLAUDE.md note + correction (CoactivateSession
was dead, not "writing via sidecar paths"); verification.md + post.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-004): sprint plan + CoactivateSession post-revival health review (Epic 0)
EVENTGRAPH-004 federates the last unfederated Hebbian write — the
ApplyNegativeFeedback contradict action — into reinforcement_events
(trigger_path=apply_negative_feedback_contradict). Data-decided scope:
reuse the existing V0022 sink (zero CONTRADICTS edges exist anywhere;
no producer calls /v1/learning/negative-feedback — instrument before
the producer arrives, the inverse of the dormancy pattern).
Also closes the EVENTGRAPH-003 follow-up: 30h post-fix health review of
the revived CoactivateSession path — no tuning needed, textbook session
cliques, pre-fix orphans stay as historical record (operator decision).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(eventgraph-004): wire ApplyNegativeFeedback contradict path → reinforcement_events (Epic 1)
The contradict action (no co-activation edge → MERGE CONTRADICTS) was the
last unfederated Hebbian write. The CONTRADICTS MERGE lived inside a
FOREACH, where the edge variable is invisible to RETURN — so the original
single statement is split into two statements in the SAME ExecuteWrite
transaction: (a) weaken (EVENTGRAPH-003 telemetry, RETURN unchanged) and
(b) contradict with a per-pair RETURN. Classification is identical: weaken
never deletes edges, so contradict's NOT EXISTS sees the same edge set the
original OPTIONAL MATCH did.
Contradict rows land with trigger_path=apply_negative_feedback_contradict.
created_new_edge detected via `c.updated_at IS NULL` (ON MATCH always sets
it; ON CREATE never does — invariant pinned by comment). delta_weight is
the CONTRADICTS edge's OWN weight delta (+negWeight on create, 0 on
re-match); negative-feedback semantics are carried by trigger_path, not
the sign.
Both statements EXPLAIN-validated against live Neo4j. Tier 1: 2 new parser
tests (create/re-match branches); learning suite green; lint clean.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): Tier 3 live verification — contradict create/re-match + weaken unchanged (Epic 2)
Live against the restarted Epic-1 binary: contradict create row
(+0.15, created_new_edge=true), re-match row (delta=0, evidence=2),
weaken row byte-equivalent to pre-split behavior (negative delta,
floor at 0). Federation CLI surfaces the new trigger_path with no
read-side change. UATS learning_negative_feedback 5/5 PASS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): feature doc + CHANGELOG + UATS pin + close (Epic 3)
Feature doc: 5-path trigger_path table + delta-semantics consumer
warning (contradict delta is the CONTRADICTS edge's own weight delta —
semantics live in trigger_path, not the sign). UATS spec extended:
zero-count equals assertions on nonexistent nodes (hash refreshed,
5/5 live). CLAUDE.md architecture note + producer-gap disclosure.
Sprint close in post.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci: auto-sync dev branch with main after each squash-merged PR
Squash merges never advance the dev branch's merge-base, so every
sprint touching CHANGELOG.md/CLAUDE.md hit CONFLICTING on its next PR
(first bitten: PR #419). New sync-dev-after-merge.yml merges main back
into the source *_dev* branch after each merged PR; the GITHUB_TOKEN
push triggers no other workflows, so it can never spawn an empty
auto-PR (the PR #420 failure mode). Conflicts fail loudly for manual
resolution; workflow_dispatch enables manual runs/live testing.
auto-pr.yml additionally skips PR creation when branch content is
identical to main — guards MANUAL sync pushes, verified against the
live repo state (current dev01 ≡ main → empty=true → skip).
actionlint clean (untrusted refs passed via env, not inline).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(roadmap): Q3 2026 vision-derived roadmap from 26-agent codebase deep-dive
Full-codebase review vs MDEMG's purpose (cognitive substrate / connection
layer): 19 map agents (3 vision + 16 subsystem), 3 cross-cutting assessors,
synthesizer + adversarial completeness critic (19 revisions applied).
Verdict: server-side substrate is mature, but the system is not currently
functioning as the assistant's internal dialogue — the per-prompt delivery
channel silently no-ops (hook reads .user_prompt, Claude Code sends
.prompt), 100% of GENERALIZES edges have NULL weight (22,170/22,170,
live-verified), scheduled decay/prune has been a permanent dry-run, RSIC
validates 16/17 actions vacuously, and supervision covers 3 of ~14
background loops. Every defect is the same disease: wired-looking seams
with no caller, wrong contract, or no reader.
4 phases ≈ 75 days committed: (1) reconnect the loop ends, (2) close the
learning loops, (3) survivability + class-ending forcing functions,
(4) FT frontier + release hygiene. Top-10 ranked; deferrals explicit.
Orchestrator spot-verification annex included (5 claims re-verified live).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): sprint plan — fix hook stdin contract, reconnect per-prompt channel (Epic 0)
Roadmap Q3 Phase 1 rank #1. Audit of all 6 hooks vs the actual Claude
Code stdin schemas: prompt-context.sh reads .user_prompt (CC sends
.prompt) → channel exits silently on every prompt; post-tool-observe.py
reads tool_output (CC sends tool_response) → false "Build/test
succeeded" observations with empty output; guidance wrongly coupled to
RESULT_COUNT>0; minor pre-compact transcript jq. session-start /
pre-bash-check / pre-write-check verified correct.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): prompt-context.sh reads .prompt — revive the per-prompt channel (Epic 1)
Claude Code's UserPromptSubmit stdin field is `prompt`; the hook read
`.user_prompt`, which is always empty → exit 0 → per-prompt CMS recall,
Jiminy guidance, /strict reformulation, the warm trigger, and the
retrieve-time Hebbian reinforcement have NEVER fired in any session.
Now reads `.prompt // .user_prompt` (legacy fallback kept).
Also decouples guidance from recall: the RESULT_COUNT=0 branch no longer
exits — it printed its notice then skipped guidance + warm + retrieval
reinforcement, coupling independent deliveries.
Both copies (live + installer template). Tier 1 simulated stdin: real
.prompt payload → first-ever guidance delivery (J17 T1 bootstrap + DICT,
5363 guidance bytes vs 0 forever); legacy fallback works; short/empty/
malformed payloads exit silently (fail-open preserved).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): post-tool-observe reads tool_response — end blind "succeeded" observations (Epic 2)
Claude Code's PostToolUse stdin field is `tool_response` (string or
object); the hook read `tool_output`, which is always absent → output_str
empty → error indicators never matched → every go build/go test/pytest
Bash call was recorded as "Build/test succeeded" sight-unseen, and real
errors were never observed.
Now reads tool_response (fallback tool_output) via _response_text(),
normalizing string|dict|list (stdout/stderr join). Success classification
requires NON-EMPTY clean output — a silent success records nothing rather
than fabricating; failures land as error observations with real stderr.
Both copies (template regenerated from fixed live, {{SPACE_ID}}
placeholder preserved, verified identical modulo placeholder). Tier 1
against real CMS: failing build → error obs with stderr; passing →
progress; empty → no record.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): pre-compact transcript extraction reads the real line shape (Epic 3)
Transcript lines are {type, message:{content:[{type, text|name, ...}]}};
the old top-level `.content` read always yielded empty, so pre-compaction
snapshots never carried recent-activity context. New jq walks
.message.content[] extracting .text/.name. Verified against this
session's real transcript (old: nothing; new: real activity). Both
copies, placeholders preserved.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): Tier 3 verification + CHANGELOG + CLAUDE.md contract pin + close (Epics 4-5)
Live in the real session: first-ever guidance delivery (J17 T1 bootstrap
+ DICT, 5363 bytes vs 0 forever); real failing build → error observation
with actual compiler output in CMS. PostToolUse success-only firing
documented as a limitation. Hook stdin contract pinned in CLAUDE.md.
Drift + clique-semantics findings logged for HOOKSYNC-001 / Phase 2.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): sprint plan — drift-proof + self-monitoring hook channel (Epic 0)
Roadmap Q3 Phase 1 rank #2. Investigation grounded all five findings:
template→live drift severed alert delivery (50-entry file actively
rotating today, never shown); no Cleared lifecycle (nothing sets the
field; no /v1/alert* endpoints); no absence detection for the channel
that just had a months-long silent outage; compose publishes 9999 on
0.0.0.0; neural sidecar binds 0.0.0.0:8101 via a 39-day-old process
serving pre-J17-fix code. 8 epics: reconcile, CI parity gate, clear
lifecycle, hook_events absence rule (reuses V0024 via jobhealth),
hooks doctor, PORT-TRUTH rider, Tier 3, docs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): reconcile bidirectional hook drift — alert delivery restored to live (Epic 1)
Live hooks adopted from templates (SPACE_ID substituted): restores the
alert-display blocks (all-pending per prompt; critical/high + degraded
healthz at session start) that the live copies lacked — the NOSILENT
last mile. Reverse drift caught during reconcile: the live hook's T1/T2
bootstrap-detection block (MAX_TIER → /v1/jiminy/bootstrap → ACTIVE
CONSTRAINTS header) never existed in the template and was nearly lost —
now single-sourced in the template and regenerated into live.
Live-verified: one prompt now renders alerts (50 pending incl. live
CRITICALs) + recall + J17:INIT bootstrap + guidance + synergy footer,
coexisting. All 6 hooks byte-identical to templates modulo {{SPACE_ID}}.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci(hooksync-001): hook-template parity gate — live hooks must match templates (Epic 2)
Mirrors the compose/launchd parity pattern: every *.sh/*.py template
must byte-match .claude/hooks/ modulo the {{SPACE_ID}} placeholder.
Proven locally: passes clean, fails (with a bounded diff dump) on
deliberate drift. Ends the bidirectional-drift class that severed
alert delivery and nearly lost the T1 bootstrap block.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): alert Cleared lifecycle — display once, then delivered (Epic 3)
Alert.Cleared existed but nothing ever set it: once hooks rendered the
file, the same entries would re-render every prompt forever. New:
FileBackend.Clear (ids and/or all_before cutoff, idempotent, under the
existing lock) → Dispatcher.ClearAlerts → POST /v1/alerts/clear. Hooks
now clear exactly what they displayed (fire-and-forget, fail-open);
cleared = delivered-to-operator, not resolved — persisting conditions
re-fire via the evaluator. Alert IDs now CUIDv2 per the identifier
standard (was UnixNano; old ids remain valid opaque strings).
Live-verified lifecycle: prompt 1 → "50 pending, showing 10" + 10
cleared; prompt 2 → "40 pending, showing 10" (next batch, no re-render)
→ 20 cleared. Tier 1: Clear by-id/by-time/idempotent/no-backend. UATS
alerts_clear 3/3 live (runner falsy-body inheritance discovered:
variant bodies must be non-empty objects).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): hook-channel absence detection — the channel now self-reports outages (Epic 4)
POST /v1/hooks/event records heartbeats into V0024 scheduled_job_events
via the jobhealth policy point (job_name hook:<name>; no new sink).
Two independent heartbeats: prompt-context fires per delivery (the
monitored channel); post-tool-observe fires throttled (HOOK_HEARTBEAT_
COOLDOWN_SEC, default 300 — proves sessions ACTIVE). Evaluator rule
hook_channel_silent (distinct service per the NOSILENT cooldown rule):
sessions active + zero prompt-context fires in HOOK_SILENT_LOOKBACK_
HOURS (24) → high alert. This is the "job never ran" guarantee applied
to the channel whose months-long outage HOOKWIRE-001 found only by
manual audit — the next contract drift self-reports.
Config: HOOK_HEALTH_ALERT_ENABLED (true), HOOK_SILENT_LOOKBACK_HOURS
(24), HOOK_ACTIVITY_MIN_EVENTS (5). Live-verified: real hook fires land
rows (session metadata, latency); throttle holds; rule SQL positive +
negative branches proven against the real table; UATS hooks_event 3/3.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): mdemg hooks doctor — one-shot hook-channel triage (Epic 5)
11 checks: per-hook template parity (the CI gate's local twin),
settings registration, server healthz, a stdin-contract self-test
piping a real-shape UserPromptSubmit payload through the installed
hook (asserts the always-present synergy footer), alert-file state
(pending/total), and the last hook:prompt-context heartbeat age from
scheduled_job_events (SKIP when TSDB unreachable). Table or --json;
non-zero exit on any FAIL.
Live: 11/11 PASS on this machine ("last fire 5s ago" — fed by the
doctor's own self-test); correctly fails (exit 1) on deliberate drift.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): PORT-TRUTH — loopback bind defaults + sidecar zombie replaced (Epic 6)
Compose published the API on 0.0.0.0 (unauthenticated admin/destructive
routes exposed off-host): now "${MDEMG_BIND_ADDR:-127.0.0.1}:${MDEMG_PORT}
:9999" — wide bind is an explicit opt-in (both compose copies, CI-synced).
Neural sidecar bound 0.0.0.0:8101 via config.py default AND the plist
arg: both now 127.0.0.1 (both plist copies, CI-synced; SIDECAR_HOST env
overrides). Operational: the 39-day-old sidecar process (started
2026-05-02, serving pre-J17-fix code) replaced — fresh process verified
on 127.0.0.1:8101, both models loaded, health 200.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): Tier 3 verification + feature doc + CHANGELOG + close (Epics 7-8)
Live-verified across the sprint: alert backlog drained 50→2 on real
prompts (display-then-clear); evaluator rules 15→16 (hook_channel_silent
loaded); doctor 11/11 + correct failure mode; sidecar fresh on
127.0.0.1:8101 (NLI 234ms). Feature doc docs/features/hook-channel-
health.md (config table incl. MDEMG_BIND_ADDR + SIDECAR_HOST). Findings:
packaging plists are templates (raw copy → launchd exit 78; service
install is canonical); UATS falsy-variant-body inheritance pinned.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(uats): jiminy_guide_sanitized timeout 30s → 90s — stale vs synthesis latency
Caught in the HOOKSYNC-001 full-suite regression: the synchronous
/v1/jiminy/guide includes local-model synthesis (~43s observed quiet,
~50s typical per GUIDANCE-SYNTH-001) — the spec's 30s timeout has been
silently erroring since synthesis latency grew. Aligned with the
JIMINY_WARM_COMPUTE_TIMEOUT_MS budget (90s); hash refreshed; passes
live. Pre-existing — not a HOOKSYNC regression (Guide path untouched).
The other 3 suite errors were load-induced flakes (pass individually):
suite-vs-llama-server slot contention, noted for UXTS-CI-001.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ci): track .claude/hooks/pre-write-check.py so hook-parity check passes
Root cause: the new 'Verify live hooks match hook templates' CI step
(HOOKSYNC-001) diffs every internal/cli/hook_templates/*.{sh,py} against
.claude/hooks/<name>, but the .gitignore allowlist only un-ignored the
5 original hooks. pre-write-check.py gained a template in this sprint
while its live counterpart stayed gitignored, so CI checked out a tree
without it and failed with 'MISSING live hook:
.claude/hooks/pre-write-check.py'.
Fix: add '!.claude/hooks/pre-write-check.py' to the allowlist and commit
the live hook (already byte-identical to its template modulo SPACE_ID),
preserving the full parity guarantee instead of weakening the CI step.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): sprint plan — real weights on the abstraction hierarchy (Epic 0)
Roadmap Q3 Phase 1 rank #3. Live investigation: point.distance() returns
NULL on embedding lists (proven: NULL where vector.similarity.cosine
returns 0.627 on the same pair); 3 creation sites affected incl. an
ABSTRACTS_TO site the audit missed. Scale worse than audited and
growing: 28,332/28,332 GENERALIZES + 36,110/37,996 ABSTRACTS_TO = 64,442
NULL-weight abstraction edges. Neo4j cosine returns [0,1] directly —
drop-in. Plan: fix sites (+ CUIDv2 edge ids), LIMIT-5-then-batched
backfill, null-weight gauge + alert rule via the existing graph-stats →
metric_samples path, UVTS-quick regression guard.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-weight-001): abstraction-edge weights — vector.similarity.cosine replaces point.distance (Epic 1)
point.distance() is a spatial-Point function: on embedding lists it
returns NULL, so every weight at the 3 abstraction-edge creation sites
was never set (100% of GENERALIZES + 95% of ABSTRACTS_TO weightless;
the CASE guards passed on good embeddings, then the THEN expr evaluated
NULL — edges with good embeddings got nothing while embedding-less ones
got the 0.5 fallback). vector.similarity.cosine returns [0,1] directly
(live-verified: identical=1.0, orthogonal=0.5, opposite=0.0). Site 1
(theme GENERALIZES) gains the null-guard it never had.
Also: edge_id randomUUID() → CUIDv2 per the identifier standard, minted
Go-side via memberEdgePairs (Cypher can't generate CUIDv2) and zipped
with member ids for UNWIND. All 3 statements EXPLAIN-validated live.
Tier 1: pair-builder tests (uniqueness, CUID format, empty input).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): mdemg graph backfill-weights — heal 56k NULL abstraction weights (Epic 2)
Standalone subcommand (deliberately NOT folded into `graph repair`,
whose orphan sweep would delete the pre-fix orphan observations the
operator chose to keep). Weight = vector.similarity.cosine(endpoint
embeddings) when both exist, else 0.5 (the creation sites' fallback);
similarity_score set alongside; idempotent (pure function of
embeddings); batched (default 1000/txn) with --limit for trials.
Executed per the small-batch-first rule: dry-run count → LIMIT-5 live
trial → hand-verified (stored ≡ independently recomputed to 6dp) →
distribution preview over 2000 (min 0.704, mean 0.96; the ~50% near-1.0
mass is single-member-cluster degeneracy — centroid ≡ member embedding,
HIDDEN-CHURN-001 territory, faithfully encoded) → full runs. Mid-run
the count GREW: the running server predated Epic 1 and kept minting
NULL edges — restarted on the fixed binary, swept stragglers, then
whk-wms (8,755) + linear (199). Final: 0 NULL / 57,395 edges globally.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): null-weight gauge + regression alert rule (Epic 3)
Query 4 in the graph-stats collector counts NULL-weight GENERALIZES/
ABSTRACTS_TO edges per space → new gauge
mdemg_neo4j_graph_null_weight_edges → metric_samples → evaluator rule
null_weight_abstraction_edges (service graph-weight-integrity, distinct
per the cooldown rule; NULL_WEIGHT_EDGE_ALERT_THRESHOLD default 100,
ForDuration 10m). Steady state post-backfill is 0; sustained
reappearance = the point.distance bug class regressed at a creation
site — it self-reports instead of waiting for the next audit.
Live: evaluator rules 16→17; gauge rows persisting at value 0 across
all spaces.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ingest): config-driven consolidation timeout — was sharing the 300s batch budget
Caught live during the HIDDEN-WEIGHT-001 corpus reingest: the post-ingest
/v1/memory/consolidate call used the shared batch-ingest client
(--timeout, 300s); consolidating a ~10k-node space exceeds that, so the
client reported failure while the server completed the work — the
GUIDANCE-SYNTH-001 bug class (long graph/LLM work needs its own budget).
New --consolidate-timeout flag / INGEST_CONSOLIDATE_TIMEOUT_SEC env
(default 1800s) with a dedicated client. Live-verified: "running
consolidation timeout_sec=1800" → complete.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): Tier 3 verification + corpus restoration + UVTS harness audit + close (Epics 4-5)
Tier 3: real consolidation minted edges with varied cosine weights
(0.83-0.94) + CUIDv2 ids; at-scale via the corpus reingest (9,500 edges,
0 NULL, mean 0.923); gauge holds 0; evaluator rules 16→17.
UVTS harness: corpus space lnl-demo-whk had been deleted with zero trace
(no UVTS run since 2026-05-04 measured anything real); restored by
operator-directed full reingest. A fresh baseline NUMBER remains blocked
by further live-found harness rot — grader/persist breakage, expected-
path format drift, vector post-filter dilution (service.go:1137 global
top-K then space filter) amplified by the duplicate whk-wms space —
complete defect inventory handed to UXTS-CI-001. Retrieval ranking on
the restored corpus verified correct (expected files at ranks 1-4).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(maint-live-001): sprint plan — scheduled maintenance actually runs (Epic 0)
Roadmap Q3 Phase 1 rank #4. Weekly decay+prune has never executed
(--dry-run defaults true; plist passes no override) while reporting
success — NOSILENT's blind spot. Tonight's Memory Bloat alerts (79k+
nodes) are the accumulated backlog. Safety verified in code before
planning: nodes are tombstoned (never deleted) with abstraction-chain/
degree/recency protections; edge deletion is the designed near-zero-
weight lifecycle, meaningful now that HIDDEN-WEIGHT made weights real.
Plan: live-by-default plist (+installed refresh), dry_run in job-event
metadata (no schema change — disclosed), maintenance_no_live_run
evaluator rule, darwin upgrade refreshes plists/hooks, first-ever live
run with preview-first protocol.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(maint-live-001): scheduled maintenance runs live — plist passes --dry-run=false (Epic 1)
The weekly LaunchAgent ran `mdemg maintenance` with no dry-run override;
the CLI defaults --dry-run=true, so every scheduled cycle previewed and
reported success — decay+prune NEVER executed (the 79k-node Memory
Bloat backlog). Both plist copies now pass --dry-run=false (the CLI
default stays true for safe manual previews — the SCHEDULE is what must
not silently no-op); installed plist refreshed + agent reloaded.
reportScheduledJobMeta threads job metadata into V0024; maintenance
records dry_run so the only-ever-dry-runs pattern is queryable
(metadata JSONB — no schema change, disclosed in the plan).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): maintenance_no_live_run evaluator rule (Epic 2)
Fires when maintenance rows exist in MAINT_LIVE_LOOKBACK_DAYS (default
8) but none ran live (success + metadata dry_run=false) — the only-
ever-dry-runs pattern self-reports instead of hiding inside "the job
ran". Distinct service maintenance-…
* docs(followup-c): close JSON control-char escaping as NON-ISSUE (no fix)
Follow-up C (the last open item from RRF-SCALE-001 triage) investigated
and closed with evidence — NO code change, because there is no bug to fix.
The earlier /v1/jiminy/latest parse failures were client-side shell
artifacts (co-occurring with the session's 'failed to change group ID'
errors + ad-hoc variable-capture piping), not server bytes:
- writeJSON uses json.NewEncoder().Encode (encoding/json always escapes
control chars U+0000-U+001F); no raw-write bypass; no custom MarshalJSON.
- The synthesized narrative is double-StripControlChars'd (synthesizer.go
:127 + service.go:1116).
- prompt-context.sh already strips control chars via perl before jq, with
2>/dev/null + // empty fallbacks.
Live-verified: the hook's exact jq returns guidance_id correctly; 5 rapid
/latest fetches all parse as strict-valid JSON; 0 raw control chars.
Per 'don't fix a non-problem', shipping a fix would invent a bug that
doesn't exist. Closure documented in docs/development/followup-c-closure.md.
This closes the entire RRF-SCALE-001 follow-up triage: A (JIMINY-OUTCOME
-001), B (GUIDANCE-SYNTH-001), C (non-issue). The guidance->feedback->
outcome loop is fully functional end-to-end.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(jiminy): /guide 30s timeout sibling + single-source config defaults (GUIDANCE-SYNTH-001 fix-commit)
Two things, both from a live e2e of the full loop through the real
production hook path (run per user directive: standard tests don't find
live problems).
1. Sibling bug: the /v1/jiminy/guide handler had the SAME hardcoded 30s
cap as the warm path (handlers_jiminy.go). GUIDANCE-SYNTH-001 fixed
warm; /guide still deadline-exceeded synthesis at exactly 30.003s
(this is what made prior sprints' /guide integration tests flaky).
Now uses the config-driven budget. Live-verified: a 50.05s synthesis
completed (synthesis_used=true) — would die at 30s.
2. Single source of truth for config defaults (user directive: 'single
place to change all instances'). The 90s budget was duplicated as a
literal in 3 sites; prior sprints similarly duplicated each default
(the sigmoid 0.45/8.0 was in 3 places). Now each default is one
exported config.Default* const, referenced by FromEnv and aliased by
consuming-package fallbacks + a Config.JiminyWarmComputeTimeout()
method. Consolidated: warm-compute timeout, the 3 consulting score
floors, sigmoid midpoint/steepness, constraint-code sim threshold,
classify concurrency. Zero behavior change (compile-time aliases);
-race + full suites green.
Live e2e also re-confirmed: real hook captures guidance_id -> feedback
-> +7 Neo4j GUIDANCE_OUTCOME edges on real constraint nodes + 10 TSDB
rows (whole loop closes through the actual hook; re-confirms Follow-up C
non-issue).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-cli-001): sprint plan — federation consumer CLI + UATS backfill
Builds the first consumer for EVENTGRAPH-001's reinforcement-neighborhood
federation API (which has no consumer): a 'mdemg eventgraph' CLI command.
Validates the Pattern Y1 bet + becomes the live-testing harness for
EVENTGRAPH-002/003 (user directive: build the consumer first).
Per the UxTS directive: maps the work to the frameworks. UATS applies to
the federation HTTP API -> add eventgraph_reinforcement_neighborhood.uats
.json (backfilling the -001 gap; the endpoint shipped with no UATS),
which replaces an ad-hoc Go integration test as the Tier 2 contract test.
UVTS/UBENCH N/A. UOTS panel-spec gap noted as a follow-up (out of scope).
CLI rendering -> Tier 1 Go units.
4 epics; CLI (--seed/--query/--hops/--since/--limit/--json) renders
summary + events table or JSON; server-driven defaults (no re-hardcoding);
read-only. ~1-1.5 dev-days.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-cli-001): mdemg eventgraph reinforcement-neighborhood (Epic 1)
First consumer of the EVENTGRAPH-001 federation API — POSTs to
/v1/eventgraph/reinforcement-neighborhood, renders a summary + events table
(or --json). Supports --seed, --query (resolves seed via /v1/memory/retrieve
top-1), --hops, --since, --limit. Unset flags are omitted from the request
so the server applies its config defaults (no re-hardcoding of hops/since/limit
in the CLI). Registered under the "advanced" command group.
Tier 1 (httptest, -race clean): request-mapping omit-when-unset + conversion,
--query seed resolution, no-results + invalid --since + surfaced-503 errors,
render (empty + table), helpers.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(eventgraph): neighbor_node_ids serializes as [] not null for empty neighborhood
Caught in EVENTGRAPH-CLI-001 live contract testing (standard code tests
missed it; the live UATS happy-path against the running server did not):
walkNeighborhood returns a nil slice when the seed has no neighborhood
(e.g. an unknown seed), which JSON-marshals to `null`, while Events is
defensively initialized to []. Both are array fields and must serialize
consistently — null breaks any consumer asserting an array type (incl. the
new UATS contract's `type_is array` on $.neighbor_node_ids).
EventsInGraphNeighborhood now coalesces the nil slice to []string{}.
Tier 1 TestFederationResult_EmptyArraysNotNull pins the JSON contract.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(eventgraph-cli-001): UATS contract spec for federation API (Epic 2)
Backfills the UATS gap EVENTGRAPH-001 left (no contract test for
/v1/eventgraph/reinforcement-neighborhood). 6 cases, validated 6/6 live
against the running server:
- happy 200: asserts the response contract shape (events/neighbor_node_ids
arrays, graph_hops/tsdb_rows_scanned numbers, truncated boolean) — robust
to data, works even with an unknown seed (empty neighborhood is valid 200)
- missing_space_id / missing_seed_node_id → 400 (empty-string override, since
the runner deep-merges variant body over base — key omission can't unset)
- negative_hops → 400, hops_over_ceiling (999 > 2×default) → 400
- method_not_allowed (GET) → 405
sha256 integrity hash added + verified.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-cli-001): live verification + feature doc + CHANGELOG + close (Epic 3)
Tier 3 live e2e verified the real binary against the real stack: --query
surfaced 20 reinforcement events in a 5-node neighborhood (demonstrating the
Hebbian-write → federation-read loop closing in one command); --seed/--json/
--limit/unknown-seed/no-arg paths all verified live. Feature doc gains the CLI
consumer section; CHANGELOG Added + Fixed entries; CLAUDE.md architecture note;
verification.md + post.md (UxTS mapping: UATS done, UOTS follow-up carried over).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(eventgraph-cli-001): tag UATS spec 'tsdb' so CI skips it without TSDB
CI Test failed: the UATS contract step boots a minimal server without TSDB,
so the eventgraph service is nil and every POST returns 503 "service not
initialized" instead of the expected 200/400 (only GET→405 passed, since the
method check precedes the service check). Same class as PR #404. The
federation endpoint genuinely requires TSDB (it queries reinforcement_events;
the service is nil without TSDB at boot), and CI's UATS step already excludes
`tsdb`-tagged specs (ci.yml --exclude-tag ...,tsdb). Added "tsdb" to api.tags
(matching metrics_snapshot/readyz_tsdb); re-hashed. Verified locally: the spec
now reports Status: skip under the exact CI exclude filter, and still 6/6 live
against the full stack via explicit --spec.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): sprint plan — guidance-outcome federation (Epic 0)
Federate the guidance-outcome event stream (Pattern Y1, second event class):
walk a constraint's Neo4j neighborhood, surface time-windowed constraint_outcomes
(followed/ignored/contradicted) for the constraint + its graph-related constraints.
Data-decided architecture: reuse the existing constraint_outcomes table (no new
hypertable/writer/enqueue site — RRF-SCALE-001 already populates it, 1176 live
rows); join graph↔events on constraint_code (TSDB constraint_id UUID ≠ Neo4j
node_id CUID — code is the only viable key). One additive migration (V0023:
constraint_code index, schema 22→23). 8 epics, 3 testing tiers, live Tier 3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): V0023 constraint_code index on constraint_outcomes (Epic 1)
Adds idx_constraint_outcomes_code (space_id, constraint_code, time DESC) — the
guidance-outcome federation joins graph↔events on constraint_code (TSDB
constraint_id is a UUID that doesn't match the Neo4j node_id CUID; code is the
only viable key), and migration 011 indexed only space/constraint_id/outcome.
Partial index (constraint_code NOT NULL AND <> '') skips uncoded outcomes.
Bumps TSDB_REQUIRED_SCHEMA_VERSION default 22→23 (config.go) to match the
migration count — CI schema-version validator gates on this. Additive, no data
change, idempotent.
Live-verified: migration applies (schema 22→23), idx present, re-apply is a
no-op, config/tsdb tests green, CI schema check 23=23.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): GuidanceOutcomesInNeighborhood federation method (Epic 2)
Second Pattern Y1 federation: walk a constraint's Neo4j neighborhood, collect
each neighbor's constraint_code, and join constraint_outcomes on those codes
(backed by the V0023 index). walkNeighborhoodWithCodes returns the neighborhood
node IDs + a code→node map; queryGuidanceOutcomes pulls coded outcomes in the
window; Go-side join resolves each outcome's code → its neighborhood constraint
node. Non-nil slices from the start (EVENTGRAPH-CLI-001 lesson). Reuses the
existing constraint_outcomes sink — no new table/writer.
Tier 1 (-race): validation guards, empty-arrays-not-null, sortedKeys
determinism, join resolution. Tier 2 integration (live Neo4j+TSDB): full
round-trip — hops=1 (seed+related codes, off-neighborhood excluded), hops=0
(seed code only), unknown-seed (empty non-nil). PASS.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): guidance-outcome federation handler + route (Epic 3)
POST /v1/eventgraph/guidance-outcome-neighborhood — walk a constraint's
neighborhood, surface constraint_outcomes whose code is in the neighborhood.
Same gating/auth/default convention as the reinforcement endpoint.
Single-source refactor (per the dynamic-variables directive): extracted the
shared gate (method/enabled/service → eventgraphGate) and default-resolution
(hops/since/limit + ceiling → resolveFederationDefaults) into helpers used by
BOTH handlers, so the federation rules live in exactly one place. The
reinforcement handler now calls them too — verified no regression (reinforcement
UATS still 6/6 live, unit tests green).
Live-verified: seeding from the real 'no-direct-main-commits' constraint node
surfaced real 'followed' outcomes with constraint_node_id resolved to the seed
and in_neighborhood=true.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): mdemg eventgraph guidance-outcome-neighborhood CLI (Epic 4)
Sibling subcommand consuming POST /v1/eventgraph/guidance-outcome-neighborhood.
Walks a constraint's neighborhood and renders guidance outcomes (followed/
ignored split + table: code · outcome · sim · g_type · guidance_id · recorded)
or --json. Seed via --seed/--query (--constraint-code seeding deferred — needs
server-side code→node resolution; --query covers discovery). Unset hops/since/
limit omitted so the server applies config defaults (single source of truth).
Tier 1 (-race): request-mapping omit-when-unset + conversion, --query seed
resolution, surfaced-503 error, render (empty + followed/ignored table),
truncStr. Help renders.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(eventgraph-002): UATS contract spec for guidance-outcome federation (Epic 5)
6 cases, validated 6/6 live: happy-200 response shape (outcomes/
neighbor_node_ids/neighbor_constraint_codes arrays, graph_hops/tsdb_rows_scanned
numbers, truncated boolean), missing space_id/seed → 400 (empty-string override
under deep-merge), negative_hops → 400, hops_over_ceiling → 400, GET → 405.
Tagged 'tsdb' so CI skips it without TSDB (the EVENTGRAPH-CLI-001 lesson).
sha256 hashed + verified.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): Tier 3 live verification (Epic 6)
Real binary against the real stack. Key assertion: CLI --json output matches
direct constraint_outcomes SQL exactly (11 outcomes = 11, all followed) for the
no-direct-main-commits constraint. --seed/--query/--limit/--json/unknown-seed/
no-arg all verified live. The --query "0 outcomes" result was traced to SQL
ground truth — the 5 neighborhood codes genuinely have no feedback, so it's
correct (federation distinguishes "code in neighborhood" from "code has
outcomes"), not a join bug. Reinforcement endpoint un-regressed by the shared-
helper refactor (UATS 6/6).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): feature doc + CHANGELOG + CLAUDE.md + close (Epic 7)
Feature doc gains a Guidance-Outcome Federation section (why reuse
constraint_outcomes, why join on constraint_code, CLI usage) + forward-look
update. CHANGELOG Added (endpoint + CLI) + Changed (TSDB schema 22→23).
CLAUDE.md architecture note extended. post.md closes the sprint with UxTS
mapping + follow-ups (--constraint-code seeding, EVENTGRAPH-003, UOTS).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(metrics,backup): resolve docker binary robustly under minimal launchd PATH
The native server (launchd) inherits PATH=/usr/bin:/bin:/usr/sbin:/sbin, which
excludes the Docker Desktop symlink (/usr/local/bin/docker). So every
server-runtime `docker` shellout failed with "executable file not found in
$PATH": (1) Neo4j container CPU/mem stats (server.go) logged an ERROR every 60s
and left the neo4j_container_* gauges empty — so the neo4j_high_cpu/_memory
alert rules had no data; (2) the TSDB backup scheduler's `docker compose
pg_dump` (backup.go) failed with only a slog.Warn. The DATA PLANE was never
affected — Neo4j (Bolt) + TSDB (pgx) connect over mapped TCP ports, not the
docker CLI.
Fix (durable, configurable, single-source): new internal/dockerbin resolver —
MDEMG_DOCKER_BIN env override → exec.LookPath → well-known install locations
(/usr/local/bin, /opt/homebrew/bin, /usr/bin) → graceful unavailable. Wired
into server.go (stats) + backup.go (both compose calls). The perpetual 60s
ERROR is downgraded to a one-shot WARN when docker is genuinely absent (it's
optional telemetry). Added a sane PATH to the launchd server plist template as
defense-in-depth.
Live-verified: after restart, mdemg_neo4j_container_cpu_percent=0.59 /
mem_percent=29.13 now land in metric_samples (were absent); no more docker-stats
ERROR; `docker stats` + backup resolve docker under a simulated minimal PATH.
Note: `mdemg data export-auto` (training-export job) was NOT a victim — it
exports via network SQL, not docker (corrected from an earlier assumption).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(nosilent-001): sprint plan — fail-loud scheduled jobs (Epic 0)
Triggered by a live-discovered silent failure: the TSDB backup scheduler was
failing every 24h run (docker-under-launchd-PATH) with only a buried slog.Warn.
Docker cause fixed (4cc7608); this sprint fixes the class — scheduled jobs that
fail with no record + no alert. V0024 scheduled_job_events hypertable + writer,
jobhealth.Report (record + alert on failure), wire the 3 jobs (backup,
maintenance, export-auto), 2 evaluator rules (backup staleness + recent
failure) so the server catches "job failed OR never ran". Config-driven, 3
testing tiers, live Tier-3 induced-failure.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): V0024 scheduled_job_events + writer (Epic 1)
Hypertable (job_name, success, latency_ms, error_message, metadata jsonb,
recorded_at) + RecordJobEvent synchronous single-row writer (mirrors V0021
model_install pattern). Indexes: per-job freshness, partial failed, per-space.
One row per scheduled-job run so the alert evaluator can detect "job failed"
AND "job never ran" (staleness). Schema 23->24; TSDB_REQUIRED_SCHEMA_VERSION
bumped to match migration count (CI check 24=24).
Tier 1 (-race): field mapping, optional-nulls, error truncation, nil-pool
no-op, insert-error propagation. Tier 2 (live TSDB): round-trip + the staleness
(recent successes) + failure (recent failures) query shapes the rules will use.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): record + alert on scheduled-job outcomes (Epic 2)
New internal/jobhealth.Report — the single policy point: record a
scheduled_job_events row and fire a high-severity "scheduled-job" alert on
failure (both pool + dispatcher nil-safe). Wired into all three jobs:
- TSDB backup scheduler (internal/tsdb/backup.go): decoupled JobResultFunc hook
(mutex-guarded, -race clean) so internal/tsdb stays free of internal/alert;
server.go::SetTSDBClient sets it with the pool + s.alertDispatcher. A failed
or never-run backup now records + alerts instead of a silent slog.Warn.
- export-auto + maintenance (CLI): deferred reportScheduledJob on the named
return error — opens a short-lived pool + a file-backed dispatcher (same
~/.mdemg/alerts/current.json the hooks surface) so a separate-process CLI job
still alerts the operator.
Tier 1 (-race): jobhealth fires alert only on failure (real file-backend
dispatcher), nil-safe. Live smoke: export-auto recorded success=t latency=3050ms.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): scheduled-job staleness + failure alert rules (Epic 3)
Two server-native evaluator rules over V0024 scheduled_job_events:
- scheduled_job_recent_failure (always on): any job failure in the last
JOB_FAILURE_LOOKBACK_MIN (default 60) → high alert.
- backup_no_recent_success (gated on TSDB_BACKUP_ENABLED): zero successful
tsdb-backups within the staleness window → high alert. THIS is the "job
never ran" guarantee — it fires from the server observing ABSENT success, so
a backup that silently died or never started is caught, not just one that
errored.
Window derives from the real backup interval × 2 (JOB_BACKUP_STALENESS_HOURS
override; no hardcoded literal). JOB_HEALTH_ALERT_ENABLED master gate (default
true). Appended after DefaultRules() in serve.go.
Tier 1: failure rule always present (gt 0), staleness gated on backups-enabled
(lt 0.5), windows reflect config, non-positive fallback. Build/lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(nosilent-001): distinct services for job rules so neither masks the other
Caught in live Tier-3 testing: both evaluator rules used Service="scheduled-jobs",
and the dispatcher cooldown key is (Service, Severity) — so the failure alert's
cooldown SUPPRESSED the staleness alert (only the failure fired). One alarm
masking another is the exact silent-failure class this sprint kills. Fixed:
scheduled-job-failure / scheduled-job-staleness distinct services. Re-verified
live — both fire independently and land as distinct alert-file entries. Tier-1
assertion pins that the two services differ. Includes Tier-3 verification.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(nosilent-001): feature doc + CHANGELOG + CLAUDE.md + close (Epic 4)
Feature doc docs/features/scheduled-job-health.md (why / two mechanisms /
operator view / config). CHANGELOG Added (NOSILENT-001) + Changed (schema 23→24)
+ Fixed (docker-under-launchd-PATH). CLAUDE.md Service Alert System extended
with the scheduled-job-health note + the distinct-Service-per-rule cooldown
caveat. post.md closes the sprint (UxTS mapping + follow-ups).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(nosilent-001): sync embedded launchd server plist with source (CI)
CI "Verify embedded launchd templates match source" diffs packaging/launchd/*
against internal/cli/launchd_templates/* (the embed.FS copy mdemg service
install uses). The PATH addition landed only in the source copy; sync the
embedded copy so they match byte-for-byte.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(roadmap): add jiminy-governance skill build-out (Workstream C, Action 7)
Records the jiminy-governance Claude Code skill on the active forward roadmap
(SPRINT_ROADMAP_POST_FT_LORA.md, cross-cutting governance) + brings the source
spec into the repo (docs/development/jiminy-governance-skill/SKILL.md, out of
~/Downloads). The skill makes Jiminy the deterministic source of context +
governance over J17, enforced by the PreToolUse hook — a routing/handshake shim,
not a rulebook. Build-out scope notes the wire-up placeholders that must be
resolved against the real instance (Jiminy MCP/endpoint, PreToolUse hook, J17
ack/RetireCode/GUIDANCE_OUTCOME calls). Aligns with the now-live guidance loop
(RRF-SCALE-001 / JIMINY-OUTCOME-001 / GUIDANCE-SYNTH-001).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-governance): resolve skill wire-up against the real instance
Step 1 of the jiminy-governance build-out (roadmap Workstream C Action 7).
Resolved all five placeholders from the running MDEMG instance, verified live:
- Jiminy query: MCP `mdemg mcp` (stdio) → jiminy_guide/validate_changes; HTTP
/v1/jiminy/guide (returns guidance_id) + /bootstrap glossary (j17v1) + /latest.
- PreToolUse: pre-bash-check.py (Bash, fail-closed) + pre-write-check.py
(Write/Edit → /v1/jiminy/classify, /strict-only, fail-open).
- SessionID: claude-core convention.
- Comprehension ack: /v1/jiminy/protocol/feedback (verified ingested 1).
- GUIDANCE_OUTCOME: /v1/jiminy/feedback {guidance_id,…}.
- RetireCode: internal-only by design (RSIC/APE protocol-evolution) — no
agent-facing call; agent must never self-retire a constraint.
Also surfaced the two real integration gaps the build-out must close (the work,
not the prose): (a) the MDEMG MCP server is NOT registered (.mcp.json absent —
context is pushed by prompt-context.sh, not pulled by the agent); (b) PreToolUse
enforcement is /strict-gated + fail-open, so not deterministic-by-default.
Roadmap Action 7 updated to reflect step 1 done + the two gaps as next steps.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(jiminy-governance): ship the J17 governance skill + register MDEMG MCP
Build-out of roadmap Workstream C Action 7 (steps 2-4), per the resolved
wire-up. Closes the two integration gaps found while resolving:
- gap 1 (MCP not registered): .mcp.json registers `mdemg mcp` (stdio).
Live-probed: 20 tools incl. jiminy_guide + validate_changes — the agent can
now PULL guidance, not only receive the hook push.
- gap 2 (enforcement /strict-gated + fail-open): policy set — Write/Edit J17
gate kept fail-open (hard server dependency on every edit is too brittle);
the skill's handshake auto-enables /strict so the gate is active per session.
Bash gate already fail-closed (demonstrated live when a test payload with a
destructive force-push string was blocked by pre-bash-check.py).
Skill authored at the canonical .claude/skills/jiminy-governance/SKILL.md
(frontmatter valid; concrete wire-up inline — MCP tools + HTTP endpoints +
SessionID claude-core + the 5-step handshake). Kept a routing/handshake shim,
not a rulebook (rules stay in the graph).
Live Tier-3 PASSED: full handshake identify->request->comprehend->act->report
ran against the real instance; GUIDANCE_OUTCOME edges 906->909 (one per coded
constraint). Verification: docs/development/jiminy-governance-skill/.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-governance): commit install-ready skill + install README
.claude/ is gitignored (per-developer local config), so the installed skill at
.claude/skills/jiminy-governance/SKILL.md is local-only by convention. Commit
the reproducible, install-ready copy (jiminy-governance.skill.md) + a README
with the one-line install (cp into .claude/skills/) so the skill propagates via
the repo. The MCP server it uses is registered in the tracked repo-root
.mcp.json.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): per-conversation SessionID across hooks + skill
Hooks and the jiminy-governance skill hardcoded session_id="claude-core", so
trust/escalation/observations from ALL Claude Code conversations collapsed into
one shared MDEMG session. Claude Code already passes a per-conversation
session_id on stdin to every hook — the implementation just never used it.
Resolver precedence (single rule everywhere): MDEMG_SESSION_ID env (stable-
identity escape hatch) > Claude Code stdin session_id (per-conversation default,
race-free per hook) > ~/.mdemg/.claude-session (published by SessionStart +
UserPromptSubmit for the agent / stdin-less contexts) > claude-core (fallback).
Realizes J17's intended per-(session,constraint) isolation.
Tracked templates updated (internal/cli/hook_templates/): session-start.sh,
prompt-context.sh, post-tool-observe.py, pre-compact.sh + Windows .ps1 variants
— every hardcoded claude-core in MDEMG calls replaced with the resolved id;
session-start/prompt-context publish the session file. Skill SessionID
instruction + handshake steps now resolve <SessionID> instead of claude-core.
Live-verified: a hook resolved a stdin session_id and published the session
file; post-tool-observe wrote an observation keyed to the per-conversation id in
Neo4j (not claude-core). bash -n / py_compile clean; go build + hooks test pass.
Note: pre-write-check.py is local-only (no tracked installer template); its fix
is applied on this machine but won't propagate via `mdemg hooks install` until
it's added to the tracked hooks (follow-up). .claude/* is gitignored, so the
live hook copies aren't committed — the tracked templates propagate via install.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): per-conversation SessionID — installed hook copies
The .claude/hooks/ installed copies are tracked (committed before the .claude/*
ignore), so apply the same SessionID resolver here as in the embedded templates
(prior commit). session-start.sh, prompt-context.sh, post-tool-observe.py,
pre-compact.sh now resolve MDEMG_SESSION_ID env > stdin session_id >
~/.mdemg/.claude-session > claude-core. (pre-write-check.py is untracked /
local-only — its fix lives on this machine only; tracking it is the follow-up.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): add pre-write-check.py to the tracked installer
The /strict J17 Write/Edit classify gate (pre-write-check.py) was local-only —
no tracked template, so `mdemg hooks install` never installed it and its
SessionID fix wouldn't propagate. Add it as a tracked template
(internal/cli/hook_templates/pre-write-check.py, space_id → {{SPACE_ID}},
runtime URL discovery — no {{MDEMG_URL}} placeholder per the template
convention) and register it in claudeHookFiles() as
{PreToolUse, 8s, "Write|Edit"}. hooks_test expectations updated 5→6.
go build + hooks test + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* release: cut v0.10.1
Promote CHANGELOG [Unreleased] → [0.10.1] - 2026-06-08. Adds the
jiminy-governance skill + MDEMG MCP registration and the per-conversation
SessionID work to the release notes (alongside the already-logged EVENTGRAPH-002,
EVENTGRAPH-CLI-001, NOSILENT-001, the docker-PATH fix, and the TSDB schema
22→23→24 bumps). Fresh empty [Unreleased]; comparison link refs updated +
backfilled through v0.10.1.
The v0.10.1 git tag (triggers release.yml artifact build) + homebrew formula
bump are the operator release-cut step on main, post-merge.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs: governance system doc + bring cli/api references current
(1) New docs/features/jiminy-governance.md — detailed how-it-works + full file
inventory for the J17 agent-governance system (skill, hooks, SessionID, MCP,
enforcement, runtime state files, install/verify steps).
(2) docs/user/api-reference.md — add the Event Graph Federation section
(POST /v1/eventgraph/{reinforcement,guidance-outcome}-neighborhood) + TOC entry;
these were the only two missing endpoints (audited the full route table).
(3) docs/user/cli-reference.md — add mdemg eventgraph {reinforcement,guidance-
outcome}-neighborhood, model run, watchdog status, migrate context-fingerprint,
data curate/validate/clean; fix the stale `model pull --adapter` description
(MODEL-DIST-002 shipped — no longer "deferred/errors"); update the Command Tree
Summary to match.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* chore(submodule): bump homebrew-mdemg to v0.10.1 formula
Point the parent at the manually-published v0.10.1 homebrew formula
(reh3376/homebrew-mdemg@10c1843). The release artifacts published cleanly;
the formula update was manual because the CI HOMEBREW_TAP_TOKEN expired
(follow-up: rotate the secret so future releases auto-publish).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): sprint plan — reinforcement coverage for other Hebbian paths
Wire the 3 remaining Hebbian write paths (CoactivateSession,
ApplySymbolCoactivation, ApplyNegativeFeedback weaken-only) into the existing
reinforcement_events writer via distinct trigger_path values. No schema/writer/
wiring change (V0022 already has trigger_path + signed delta_weight +
created_new_edge; writer already injected). Contradict path deferred (CONTRADICTS
edges aren't traversed by the federation walk). RETURN-only Cypher edits; Tier-2
asserts unchanged weights. 5 epics, 3 tiers, live Tier-3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire CoactivateSession into reinforcement_events (Epic 1)
CoactivateSession (session-internal conversation-observation co-activation, full
Hebbian formula) now emits per-pair reinforcement events with
trigger_path=coactivate_session. RETURN-only Cypher change: replaced the
discarded `count(*)` with the standard 17-field per-pair RETURN (one row per
forward edge; reverse is a mirror). Weight SET untouched → update behavior
provably unchanged. Mirrors the proven ApplyCoactivation record loop; writer
already injected. EXPLAIN-validated (compiles, all RETURN vars in scope, no
writes); build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplySymbolCoactivation into reinforcement_events (Epic 2)
SymbolNode-pair co-activation now emits trigger_path=apply_symbol_coactivation
rows. Split the weight update out of the ON MATCH clause into a separate SET so
the pre-update weight (w) can be captured for prev/new/delta — createdNew
(evidence_count=1) keeps a fresh edge at 0.1 and increments matches by +0.05,
preserving the original ON-clause weight behavior exactly. eta/surprise/
activation/path_sim are NULL (N/A for symbols); roles default 'symbol_node'.
EXPLAIN-validated; build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplyNegativeFeedback weaken path → reinforcement_events (Epic 3)
The weaken path (existing CO_ACTIVATED_WITH edge weakened by negWeight) now emits
trigger_path=apply_negative_feedback rows with a NEGATIVE delta_weight and
created_new_edge=false. The FOREACH writes (weaken SET + contradict MERGE) are
untouched; only the RETURN changed from aggregated `action,count(*)` to per-pair
rows — the Go side counts rows (sum = grouped count, NegativeFeedbackResult
preserved) and emits reinforcement events for weaken rows only. prevWeight is
captured before the FOREACH SET. Contradict path deliberately not emitted
(CONTRADICTS isn't traversed by the federation walk; deferred). EXPLAIN-validated;
build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(conversation): inject learning service so CoactivateSession actually runs
Discovered via EVENTGRAPH-003 live smoke: session co-activation
(CO_ACTIVATED_WITH edges between same-session conversation observations) had
NEVER fired — 0 such edges ever in mdemg-dev across 5495 conversation
observations. Root cause: conversation.NewServiceWithConfig sets
learningService=nil ("set via SetLearningService to avoid circular dependency"),
but SetLearningService had NO caller, so the `if s.learningService != nil` guard
in Observe() always skipped CoactivateSession. The function + its Cypher were
correct (verified by running it directly: 3 pairs, proper Hebbian weights) —
it was just never invoked.
Fix: convSvc.SetLearningService(lea) at construction. Live-verified: 3 distinct
observations in a session now create 6 CO_ACTIVATED_WITH edges + emit
coactivate_session reinforcement events. Standalone fix-commit per the
live-smoke precedent (surprise bugs don't get rolled into the sprint commit).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): Tier 3 verification + feature doc + CHANGELOG + close (Epic 4)
All four trigger_paths live-verified (apply_coactivation 50, apply_symbol_
coactivation 1000, apply_negative_feedback 1 negative-delta, coactivate_session
4 after the dormancy fix); federation CLI surfaces them. Feature doc updated to
all-four-paths + the trigger_path table; CHANGELOG Added (EVENTGRAPH-003) + Fixed
(CoactivateSession never-invoked); CLAUDE.md note + correction (CoactivateSession
was dead, not "writing via sidecar paths"); verification.md + post.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-004): sprint plan + CoactivateSession post-revival health review (Epic 0)
EVENTGRAPH-004 federates the last unfederated Hebbian write — the
ApplyNegativeFeedback contradict action — into reinforcement_events
(trigger_path=apply_negative_feedback_contradict). Data-decided scope:
reuse the existing V0022 sink (zero CONTRADICTS edges exist anywhere;
no producer calls /v1/learning/negative-feedback — instrument before
the producer arrives, the inverse of the dormancy pattern).
Also closes the EVENTGRAPH-003 follow-up: 30h post-fix health review of
the revived CoactivateSession path — no tuning needed, textbook session
cliques, pre-fix orphans stay as historical record (operator decision).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(eventgraph-004): wire ApplyNegativeFeedback contradict path → reinforcement_events (Epic 1)
The contradict action (no co-activation edge → MERGE CONTRADICTS) was the
last unfederated Hebbian write. The CONTRADICTS MERGE lived inside a
FOREACH, where the edge variable is invisible to RETURN — so the original
single statement is split into two statements in the SAME ExecuteWrite
transaction: (a) weaken (EVENTGRAPH-003 telemetry, RETURN unchanged) and
(b) contradict with a per-pair RETURN. Classification is identical: weaken
never deletes edges, so contradict's NOT EXISTS sees the same edge set the
original OPTIONAL MATCH did.
Contradict rows land with trigger_path=apply_negative_feedback_contradict.
created_new_edge detected via `c.updated_at IS NULL` (ON MATCH always sets
it; ON CREATE never does — invariant pinned by comment). delta_weight is
the CONTRADICTS edge's OWN weight delta (+negWeight on create, 0 on
re-match); negative-feedback semantics are carried by trigger_path, not
the sign.
Both statements EXPLAIN-validated against live Neo4j. Tier 1: 2 new parser
tests (create/re-match branches); learning suite green; lint clean.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): Tier 3 live verification — contradict create/re-match + weaken unchanged (Epic 2)
Live against the restarted Epic-1 binary: contradict create row
(+0.15, created_new_edge=true), re-match row (delta=0, evidence=2),
weaken row byte-equivalent to pre-split behavior (negative delta,
floor at 0). Federation CLI surfaces the new trigger_path with no
read-side change. UATS learning_negative_feedback 5/5 PASS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): feature doc + CHANGELOG + UATS pin + close (Epic 3)
Feature doc: 5-path trigger_path table + delta-semantics consumer
warning (contradict delta is the CONTRADICTS edge's own weight delta —
semantics live in trigger_path, not the sign). UATS spec extended:
zero-count equals assertions on nonexistent nodes (hash refreshed,
5/5 live). CLAUDE.md architecture note + producer-gap disclosure.
Sprint close in post.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci: auto-sync dev branch with main after each squash-merged PR
Squash merges never advance the dev branch's merge-base, so every
sprint touching CHANGELOG.md/CLAUDE.md hit CONFLICTING on its next PR
(first bitten: PR #419). New sync-dev-after-merge.yml merges main back
into the source *_dev* branch after each merged PR; the GITHUB_TOKEN
push triggers no other workflows, so it can never spawn an empty
auto-PR (the PR #420 failure mode). Conflicts fail loudly for manual
resolution; workflow_dispatch enables manual runs/live testing.
auto-pr.yml additionally skips PR creation when branch content is
identical to main — guards MANUAL sync pushes, verified against the
live repo state (current dev01 ≡ main → empty=true → skip).
actionlint clean (untrusted refs passed via env, not inline).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(roadmap): Q3 2026 vision-derived roadmap from 26-agent codebase deep-dive
Full-codebase review vs MDEMG's purpose (cognitive substrate / connection
layer): 19 map agents (3 vision + 16 subsystem), 3 cross-cutting assessors,
synthesizer + adversarial completeness critic (19 revisions applied).
Verdict: server-side substrate is mature, but the system is not currently
functioning as the assistant's internal dialogue — the per-prompt delivery
channel silently no-ops (hook reads .user_prompt, Claude Code sends
.prompt), 100% of GENERALIZES edges have NULL weight (22,170/22,170,
live-verified), scheduled decay/prune has been a permanent dry-run, RSIC
validates 16/17 actions vacuously, and supervision covers 3 of ~14
background loops. Every defect is the same disease: wired-looking seams
with no caller, wrong contract, or no reader.
4 phases ≈ 75 days committed: (1) reconnect the loop ends, (2) close the
learning loops, (3) survivability + class-ending forcing functions,
(4) FT frontier + release hygiene. Top-10 ranked; deferrals explicit.
Orchestrator spot-verification annex included (5 claims re-verified live).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): sprint plan — fix hook stdin contract, reconnect per-prompt channel (Epic 0)
Roadmap Q3 Phase 1 rank #1. Audit of all 6 hooks vs the actual Claude
Code stdin schemas: prompt-context.sh reads .user_prompt (CC sends
.prompt) → channel exits silently on every prompt; post-tool-observe.py
reads tool_output (CC sends tool_response) → false "Build/test
succeeded" observations with empty output; guidance wrongly coupled to
RESULT_COUNT>0; minor pre-compact transcript jq. session-start /
pre-bash-check / pre-write-check verified correct.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): prompt-context.sh reads .prompt — revive the per-prompt channel (Epic 1)
Claude Code's UserPromptSubmit stdin field is `prompt`; the hook read
`.user_prompt`, which is always empty → exit 0 → per-prompt CMS recall,
Jiminy guidance, /strict reformulation, the warm trigger, and the
retrieve-time Hebbian reinforcement have NEVER fired in any session.
Now reads `.prompt // .user_prompt` (legacy fallback kept).
Also decouples guidance from recall: the RESULT_COUNT=0 branch no longer
exits — it printed its notice then skipped guidance + warm + retrieval
reinforcement, coupling independent deliveries.
Both copies (live + installer template). Tier 1 simulated stdin: real
.prompt payload → first-ever guidance delivery (J17 T1 bootstrap + DICT,
5363 guidance bytes vs 0 forever); legacy fallback works; short/empty/
malformed payloads exit silently (fail-open preserved).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): post-tool-observe reads tool_response — end blind "succeeded" observations (Epic 2)
Claude Code's PostToolUse stdin field is `tool_response` (string or
object); the hook read `tool_output`, which is always absent → output_str
empty → error indicators never matched → every go build/go test/pytest
Bash call was recorded as "Build/test succeeded" sight-unseen, and real
errors were never observed.
Now reads tool_response (fallback tool_output) via _response_text(),
normalizing string|dict|list (stdout/stderr join). Success classification
requires NON-EMPTY clean output — a silent success records nothing rather
than fabricating; failures land as error observations with real stderr.
Both copies (template regenerated from fixed live, {{SPACE_ID}}
placeholder preserved, verified identical modulo placeholder). Tier 1
against real CMS: failing build → error obs with stderr; passing →
progress; empty → no record.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): pre-compact transcript extraction reads the real line shape (Epic 3)
Transcript lines are {type, message:{content:[{type, text|name, ...}]}};
the old top-level `.content` read always yielded empty, so pre-compaction
snapshots never carried recent-activity context. New jq walks
.message.content[] extracting .text/.name. Verified against this
session's real transcript (old: nothing; new: real activity). Both
copies, placeholders preserved.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): Tier 3 verification + CHANGELOG + CLAUDE.md contract pin + close (Epics 4-5)
Live in the real session: first-ever guidance delivery (J17 T1 bootstrap
+ DICT, 5363 bytes vs 0 forever); real failing build → error observation
with actual compiler output in CMS. PostToolUse success-only firing
documented as a limitation. Hook stdin contract pinned in CLAUDE.md.
Drift + clique-semantics findings logged for HOOKSYNC-001 / Phase 2.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): sprint plan — drift-proof + self-monitoring hook channel (Epic 0)
Roadmap Q3 Phase 1 rank #2. Investigation grounded all five findings:
template→live drift severed alert delivery (50-entry file actively
rotating today, never shown); no Cleared lifecycle (nothing sets the
field; no /v1/alert* endpoints); no absence detection for the channel
that just had a months-long silent outage; compose publishes 9999 on
0.0.0.0; neural sidecar binds 0.0.0.0:8101 via a 39-day-old process
serving pre-J17-fix code. 8 epics: reconcile, CI parity gate, clear
lifecycle, hook_events absence rule (reuses V0024 via jobhealth),
hooks doctor, PORT-TRUTH rider, Tier 3, docs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): reconcile bidirectional hook drift — alert delivery restored to live (Epic 1)
Live hooks adopted from templates (SPACE_ID substituted): restores the
alert-display blocks (all-pending per prompt; critical/high + degraded
healthz at session start) that the live copies lacked — the NOSILENT
last mile. Reverse drift caught during reconcile: the live hook's T1/T2
bootstrap-detection block (MAX_TIER → /v1/jiminy/bootstrap → ACTIVE
CONSTRAINTS header) never existed in the template and was nearly lost —
now single-sourced in the template and regenerated into live.
Live-verified: one prompt now renders alerts (50 pending incl. live
CRITICALs) + recall + J17:INIT bootstrap + guidance + synergy footer,
coexisting. All 6 hooks byte-identical to templates modulo {{SPACE_ID}}.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci(hooksync-001): hook-template parity gate — live hooks must match templates (Epic 2)
Mirrors the compose/launchd parity pattern: every *.sh/*.py template
must byte-match .claude/hooks/ modulo the {{SPACE_ID}} placeholder.
Proven locally: passes clean, fails (with a bounded diff dump) on
deliberate drift. Ends the bidirectional-drift class that severed
alert delivery and nearly lost the T1 bootstrap block.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): alert Cleared lifecycle — display once, then delivered (Epic 3)
Alert.Cleared existed but nothing ever set it: once hooks rendered the
file, the same entries would re-render every prompt forever. New:
FileBackend.Clear (ids and/or all_before cutoff, idempotent, under the
existing lock) → Dispatcher.ClearAlerts → POST /v1/alerts/clear. Hooks
now clear exactly what they displayed (fire-and-forget, fail-open);
cleared = delivered-to-operator, not resolved — persisting conditions
re-fire via the evaluator. Alert IDs now CUIDv2 per the identifier
standard (was UnixNano; old ids remain valid opaque strings).
Live-verified lifecycle: prompt 1 → "50 pending, showing 10" + 10
cleared; prompt 2 → "40 pending, showing 10" (next batch, no re-render)
→ 20 cleared. Tier 1: Clear by-id/by-time/idempotent/no-backend. UATS
alerts_clear 3/3 live (runner falsy-body inheritance discovered:
variant bodies must be non-empty objects).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): hook-channel absence detection — the channel now self-reports outages (Epic 4)
POST /v1/hooks/event records heartbeats into V0024 scheduled_job_events
via the jobhealth policy point (job_name hook:<name>; no new sink).
Two independent heartbeats: prompt-context fires per delivery (the
monitored channel); post-tool-observe fires throttled (HOOK_HEARTBEAT_
COOLDOWN_SEC, default 300 — proves sessions ACTIVE). Evaluator rule
hook_channel_silent (distinct service per the NOSILENT cooldown rule):
sessions active + zero prompt-context fires in HOOK_SILENT_LOOKBACK_
HOURS (24) → high alert. This is the "job never ran" guarantee applied
to the channel whose months-long outage HOOKWIRE-001 found only by
manual audit — the next contract drift self-reports.
Config: HOOK_HEALTH_ALERT_ENABLED (true), HOOK_SILENT_LOOKBACK_HOURS
(24), HOOK_ACTIVITY_MIN_EVENTS (5). Live-verified: real hook fires land
rows (session metadata, latency); throttle holds; rule SQL positive +
negative branches proven against the real table; UATS hooks_event 3/3.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): mdemg hooks doctor — one-shot hook-channel triage (Epic 5)
11 checks: per-hook template parity (the CI gate's local twin),
settings registration, server healthz, a stdin-contract self-test
piping a real-shape UserPromptSubmit payload through the installed
hook (asserts the always-present synergy footer), alert-file state
(pending/total), and the last hook:prompt-context heartbeat age from
scheduled_job_events (SKIP when TSDB unreachable). Table or --json;
non-zero exit on any FAIL.
Live: 11/11 PASS on this machine ("last fire 5s ago" — fed by the
doctor's own self-test); correctly fails (exit 1) on deliberate drift.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): PORT-TRUTH — loopback bind defaults + sidecar zombie replaced (Epic 6)
Compose published the API on 0.0.0.0 (unauthenticated admin/destructive
routes exposed off-host): now "${MDEMG_BIND_ADDR:-127.0.0.1}:${MDEMG_PORT}
:9999" — wide bind is an explicit opt-in (both compose copies, CI-synced).
Neural sidecar bound 0.0.0.0:8101 via config.py default AND the plist
arg: both now 127.0.0.1 (both plist copies, CI-synced; SIDECAR_HOST env
overrides). Operational: the 39-day-old sidecar process (started
2026-05-02, serving pre-J17-fix code) replaced — fresh process verified
on 127.0.0.1:8101, both models loaded, health 200.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): Tier 3 verification + feature doc + CHANGELOG + close (Epics 7-8)
Live-verified across the sprint: alert backlog drained 50→2 on real
prompts (display-then-clear); evaluator rules 15→16 (hook_channel_silent
loaded); doctor 11/11 + correct failure mode; sidecar fresh on
127.0.0.1:8101 (NLI 234ms). Feature doc docs/features/hook-channel-
health.md (config table incl. MDEMG_BIND_ADDR + SIDECAR_HOST). Findings:
packaging plists are templates (raw copy → launchd exit 78; service
install is canonical); UATS falsy-variant-body inheritance pinned.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(uats): jiminy_guide_sanitized timeout 30s → 90s — stale vs synthesis latency
Caught in the HOOKSYNC-001 full-suite regression: the synchronous
/v1/jiminy/guide includes local-model synthesis (~43s observed quiet,
~50s typical per GUIDANCE-SYNTH-001) — the spec's 30s timeout has been
silently erroring since synthesis latency grew. Aligned with the
JIMINY_WARM_COMPUTE_TIMEOUT_MS budget (90s); hash refreshed; passes
live. Pre-existing — not a HOOKSYNC regression (Guide path untouched).
The other 3 suite errors were load-induced flakes (pass individually):
suite-vs-llama-server slot contention, noted for UXTS-CI-001.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ci): track .claude/hooks/pre-write-check.py so hook-parity check passes
Root cause: the new 'Verify live hooks match hook templates' CI step
(HOOKSYNC-001) diffs every internal/cli/hook_templates/*.{sh,py} against
.claude/hooks/<name>, but the .gitignore allowlist only un-ignored the
5 original hooks. pre-write-check.py gained a template in this sprint
while its live counterpart stayed gitignored, so CI checked out a tree
without it and failed with 'MISSING live hook:
.claude/hooks/pre-write-check.py'.
Fix: add '!.claude/hooks/pre-write-check.py' to the allowlist and commit
the live hook (already byte-identical to its template modulo SPACE_ID),
preserving the full parity guarantee instead of weakening the CI step.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): sprint plan — real weights on the abstraction hierarchy (Epic 0)
Roadmap Q3 Phase 1 rank #3. Live investigation: point.distance() returns
NULL on embedding lists (proven: NULL where vector.similarity.cosine
returns 0.627 on the same pair); 3 creation sites affected incl. an
ABSTRACTS_TO site the audit missed. Scale worse than audited and
growing: 28,332/28,332 GENERALIZES + 36,110/37,996 ABSTRACTS_TO = 64,442
NULL-weight abstraction edges. Neo4j cosine returns [0,1] directly —
drop-in. Plan: fix sites (+ CUIDv2 edge ids), LIMIT-5-then-batched
backfill, null-weight gauge + alert rule via the existing graph-stats →
metric_samples path, UVTS-quick regression guard.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-weight-001): abstraction-edge weights — vector.similarity.cosine replaces point.distance (Epic 1)
point.distance() is a spatial-Point function: on embedding lists it
returns NULL, so every weight at the 3 abstraction-edge creation sites
was never set (100% of GENERALIZES + 95% of ABSTRACTS_TO weightless;
the CASE guards passed on good embeddings, then the THEN expr evaluated
NULL — edges with good embeddings got nothing while embedding-less ones
got the 0.5 fallback). vector.similarity.cosine returns [0,1] directly
(live-verified: identical=1.0, orthogonal=0.5, opposite=0.0). Site 1
(theme GENERALIZES) gains the null-guard it never had.
Also: edge_id randomUUID() → CUIDv2 per the identifier standard, minted
Go-side via memberEdgePairs (Cypher can't generate CUIDv2) and zipped
with member ids for UNWIND. All 3 statements EXPLAIN-validated live.
Tier 1: pair-builder tests (uniqueness, CUID format, empty input).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): mdemg graph backfill-weights — heal 56k NULL abstraction weights (Epic 2)
Standalone subcommand (deliberately NOT folded into `graph repair`,
whose orphan sweep would delete the pre-fix orphan observations the
operator chose to keep). Weight = vector.similarity.cosine(endpoint
embeddings) when both exist, else 0.5 (the creation sites' fallback);
similarity_score set alongside; idempotent (pure function of
embeddings); batched (default 1000/txn) with --limit for trials.
Executed per the small-batch-first rule: dry-run count → LIMIT-5 live
trial → hand-verified (stored ≡ independently recomputed to 6dp) →
distribution preview over 2000 (min 0.704, mean 0.96; the ~50% near-1.0
mass is single-member-cluster degeneracy — centroid ≡ member embedding,
HIDDEN-CHURN-001 territory, faithfully encoded) → full runs. Mid-run
the count GREW: the running server predated Epic 1 and kept minting
NULL edges — restarted on the fixed binary, swept stragglers, then
whk-wms (8,755) + linear (199). Final: 0 NULL / 57,395 edges globally.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): null-weight gauge + regression alert rule (Epic 3)
Query 4 in the graph-stats collector counts NULL-weight GENERALIZES/
ABSTRACTS_TO edges per space → new gauge
mdemg_neo4j_graph_null_weight_edges → metric_samples → evaluator rule
null_weight_abstraction_edges (service graph-weight-integrity, distinct
per the cooldown rule; NULL_WEIGHT_EDGE_ALERT_THRESHOLD default 100,
ForDuration 10m). Steady state post-backfill is 0; sustained
reappearance = the point.distance bug class regressed at a creation
site — it self-reports instead of waiting for the next audit.
Live: evaluator rules 16→17; gauge rows persisting at value 0 across
all spaces.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ingest): config-driven consolidation timeout — was sharing the 300s batch budget
Caught live during the HIDDEN-WEIGHT-001 corpus reingest: the post-ingest
/v1/memory/consolidate call used the shared batch-ingest client
(--timeout, 300s); consolidating a ~10k-node space exceeds that, so the
client reported failure while the server completed the work — the
GUIDANCE-SYNTH-001 bug class (long graph/LLM work needs its own budget).
New --consolidate-timeout flag / INGEST_CONSOLIDATE_TIMEOUT_SEC env
(default 1800s) with a dedicated client. Live-verified: "running
consolidation timeout_sec=1800" → complete.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): Tier 3 verification + corpus restoration + UVTS harness audit + close (Epics 4-5)
Tier 3: real consolidation minted edges with varied cosine weights
(0.83-0.94) + CUIDv2 ids; at-scale via the corpus reingest (9,500 edges,
0 NULL, mean 0.923); gauge holds 0; evaluator rules 16→17.
UVTS harness: corpus space lnl-demo-whk had been deleted with zero trace
(no UVTS run since 2026-05-04 measured anything real); restored by
operator-directed full reingest. A fresh baseline NUMBER remains blocked
by further live-found harness rot — grader/persist breakage, expected-
path format drift, vector post-filter dilution (service.go:1137 global
top-K then space filter) amplified by the duplicate whk-wms space —
complete defect inventory handed to UXTS-CI-001. Retrieval ranking on
the restored corpus verified correct (expected files at ranks 1-4).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(maint-live-001): sprint plan — scheduled maintenance actually runs (Epic 0)
Roadmap Q3 Phase 1 rank #4. Weekly decay+prune has never executed
(--dry-run defaults true; plist passes no override) while reporting
success — NOSILENT's blind spot. Tonight's Memory Bloat alerts (79k+
nodes) are the accumulated backlog. Safety verified in code before
planning: nodes are tombstoned (never deleted) with abstraction-chain/
degree/recency protections; edge deletion is the designed near-zero-
weight lifecycle, meaningful now that HIDDEN-WEIGHT made weights real.
Plan: live-by-default plist (+installed refresh), dry_run in job-event
metadata (no schema change — disclosed), maintenance_no_live_run
evaluator rule, darwin upgrade refreshes plists/hooks, first-ever live
run with preview-first protocol.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(maint-live-001): scheduled maintenance runs live — plist passes --dry-run=false (Epic 1)
The weekly LaunchAgent ran `mdemg maintenance` with no dry-run override;
the CLI defaults --dry-run=true, so every scheduled cycle previewed and
reported success — decay+prune NEVER executed (the 79k-node Memory
Bloat backlog). Both plist copies now pass --dry-run=false (the CLI
default stays true for safe manual previews — the SCHEDULE is what must
not silently no-op); installed plist refreshed + agent reloaded.
reportScheduledJobMeta threads job metadata into V0024; maintenance
records dry_run so the only-ever-dry-runs pattern is queryable
(metadata JSONB — no schema change, disclosed in the plan).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): maintenance_no_live_run evaluator rule (Epic 2)
Fires when maintenance rows exist in MAINT_LIVE_LOOKBACK_DAYS (default
8) but none ran live (success + metadata dry_run=false) — the only-
ever-dry-runs pattern self-reports instead of hiding inside "the job
ran". Distinct service maintenance-liveness per the cooldown rule.
Config: MAINT_LIVE_ALERT_ENABLED (true), MAINT_LIVE_LOOKBACK_DAYS (8).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): mdemg upgrade refreshes installed LaunchAgents + hooks (darwin) (Epic 3)
Plist/hook fixes shipped in releases but never reached installed
machines — the maintenance dry-run override would have sat unreachable
next to upgraded binaries forever. Upgrade now re-renders ALREADY-
INSTALLED mdemg LaunchAgents from the new binary's embedded templates
(refresh-only — never installs new services) + re-syncs mdemg-managed
Claude hooks in the current project (marker-checked). Substitution
logic single-sourced into renderLaunchdTemplate (Install + Refresh —
the drift class that exit-78'd the sidecar during HOOKSYNC live smoke).
Mirrors the existing Linux systemd-unit refresh.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): context-dependent orphan policy — --exclude-role-types (Epic 4a)
Orphan disposition is context-dependent (operator, 2026-06-11): a
uniform degree/age rule conflates governance constraints, conversation
history, test junk, and hierarchy debris. New --exclude-role-types on
prune + maintenance (env PRUNE_EXCLUDE_ROLE_TYPES) makes the policy
expressible; the scheduled plist ships
constraint,conversation_observation excluded per the operator's call
(constraints are load-bearing governance rules at any degree;
conversation observations differ by SESSION which the knob can't
express yet). Aged hierarchy debris stays eligible — that's the
lifecycle working. Candidate census that drove the decision: 5,388
conv-obs (9 eligible tonight under the 90d shield), 11 constraints,
238 hierarchy nodes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(prune): orphan sweeps use implicit transactions for batched deletes
Caught by the FIRST-EVER live maintenance run (MAINT-LIVE-001 Tier 3):
Neo4j raises TransactionStartFailed when a batched CALL-IN-TRANSACTIONS
statement executes inside an explicit transaction. Both orphan sweeps
(SymbolNode + Observation) ran their batched delete via ExecuteWrite;
the dry-run path never executes the deleting statement, so no preview
or unit test could surface it — only live execution. Switched to
session.Run (implicit tx). The failure ALSO proved the NOSILENT chain
live: the run fired "Scheduled job failed: maintenance" before exiting.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(maint-live-001): first live run verification + feature doc + CHANGELOG + close (Epics 4b-5)
First live maintenance in MDEMG history: 20,236 orphan SymbolNodes
deleted; all 5,010 tombstone candidates protected (recency + operator
exclusions); liveness rule born-firing → silenced by the real run; the
3-row job-event story (preview/true → failure/false alerted →
success/false) proves the dry_run plumbing through every path.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs: CLAUDE.md architecture note for MAINT-LIVE-001
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(embed-wire-001): sprint plan — embedder wiring + ingest exec resolution (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(embed-wire-001): breaker + recorder reach the real embedder through the wrapper chain (Epic 1)
The embedding circuit breaker was NEVER wired in any default deployment:
embeddings.New returns *CachedEmbedder when EMBEDDING_CACHE_ENABLED=true
(the default), so the server's emb.(*embeddings.OpenAI)/(*Ollama)
assertions on the OUTERMOST value failed silently (no else branch). The
recorder assertion had the inverse fragility (cache off → training-data
recording silently dies).
New: Unwrap() chain (CachedEmbedder joins RateLimitedEmbedder's existing
one) + embeddings.Base() / FindCached() interface-driven walkers — any
future wrapper joins by adding Unwrap(), no type lists. Wiring now walks
to the base for the breaker and to the cache layer for the recorder,
with LOUD warns when nothing matches. Tier 1 pins the production shape
(ratelimit(cache(provider))) plus cache-off and bare chains.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ingest-exec-001): server-triggered ingest resolves the mdemg binary — was hardcoded ./bin/mdemg (Epic 2)
Both ingest-job exec sites ran a relative "./bin/mdemg": broken in
Docker (the documented-primary deployment — binary at /usr/local/bin,
no repo checkout) and any CWD other than the repo root. New
resolveMdemgBin(): MDEMG_BIN env → os.Executable() (the server IS the
binary) → PATH → ./bin/mdemg legacy fallback; cached; Tier 1 pins the
order. Scheduled-sync jobs now report outcomes to scheduled_job_events
via jobhealth (job_name codebase-sync) — an unattended sync that keeps
failing is never silent; manual API jobs stay queue-visible only.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(embed-wire-001): live verification + CHANGELOG + CLAUDE.md + close (Epics 3-4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): sprint plan — documentation matches reality (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): CLAUDE.md FT section rewritten to post-pivot reality (Epic 1)
The section presented the abandoned Qwen3.6-35B-A3B MoE target, two-tier
MoE-Sieve strategy, and Sprint A→E critical path as CURRENT — all
superseded by the 2026-04-22 MoE→dense pivot; this stale text seeded the
Q3 roadmap audit with a dead architecture. Rewritten: shipped state
(dense Qwen3-14B mdemg-llm-v1, 0.8389, llama-server runtime), superseded
plan documented with the pivot rationale (never deleted — supersede-with-
pointer), guardrail llmclient exception marked CLOSED (re-verified in
code), memo-07 provenance break disclosed (the file never existed;
00_README_v2.md is canonical), open FT work named (FT-CLASSIFY-002 +
recursive-retraining trigger). Adapter env-name drift fixed
(MDEMG_ADAPTER_BASE, not MDEMG_MODEL_ADAPTER_BASE).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(doc-truth-001): operator-facing text matches the Phase 13.5 reality (Epic 2)
preflight errors directed operators to start the DECOMMISSIONED
mlx_lm.server on :8101 — following them reintroduces the crash-looping
stack Phase 13.5 replaced. Now: llama-server :8102 guidance (managed
service install + manual command), backend-agnostic wording. model.go
help text dropped three stale "deferred to MODEL-DIST-002" mentions
(shipped 2026-05-25). Operationally (untracked .env): removed the
J17_SIDECAR_TIMEOUT_MS=200 override that re-pinned the exact value
DH-004 remediated — the 1000ms default now applies; server restarted.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): 00_README STATUS block + AGENT_HANDOFF retired (Epic 3)
00_README_v2.md gains a top-of-file STATUS block: shipped-through-
cutover state, superseded MoE plan (FT-2 skip + FT-3 supersession +
R-LT-4 prototype-discipline adjudication recorded), the NOT-STARTED
recursive-retraining loop with its FT-CLASSIFY-002 trigger, and
provenance notes (memo-07 never existed; the spec is untracked pending
FG-2). AGENT_HANDOFF.md (stale since 2026-05-06) retired to a pointer
stub — handoff state lives in CLAUDE.md/roadmap/CHANGELOG/CMS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): grep-sweep proof + CHANGELOG + close (Epic 4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): last stale --adapter help string (sweep straggler)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-validate-001): sprint plan — fail-closed self-improvement (Epic 0)
Co-Authored-…
* docs(eventgraph-cli-001): live verification + feature doc + CHANGELOG + close (Epic 3)
Tier 3 live e2e verified the real binary against the real stack: --query
surfaced 20 reinforcement events in a 5-node neighborhood (demonstrating the
Hebbian-write → federation-read loop closing in one command); --seed/--json/
--limit/unknown-seed/no-arg paths all verified live. Feature doc gains the CLI
consumer section; CHANGELOG Added + Fixed entries; CLAUDE.md architecture note;
verification.md + post.md (UxTS mapping: UATS done, UOTS follow-up carried over).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(eventgraph-cli-001): tag UATS spec 'tsdb' so CI skips it without TSDB
CI Test failed: the UATS contract step boots a minimal server without TSDB,
so the eventgraph service is nil and every POST returns 503 "service not
initialized" instead of the expected 200/400 (only GET→405 passed, since the
method check precedes the service check). Same class as PR #404. The
federation endpoint genuinely requires TSDB (it queries reinforcement_events;
the service is nil without TSDB at boot), and CI's UATS step already excludes
`tsdb`-tagged specs (ci.yml --exclude-tag ...,tsdb). Added "tsdb" to api.tags
(matching metrics_snapshot/readyz_tsdb); re-hashed. Verified locally: the spec
now reports Status: skip under the exact CI exclude filter, and still 6/6 live
against the full stack via explicit --spec.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): sprint plan — guidance-outcome federation (Epic 0)
Federate the guidance-outcome event stream (Pattern Y1, second event class):
walk a constraint's Neo4j neighborhood, surface time-windowed constraint_outcomes
(followed/ignored/contradicted) for the constraint + its graph-related constraints.
Data-decided architecture: reuse the existing constraint_outcomes table (no new
hypertable/writer/enqueue site — RRF-SCALE-001 already populates it, 1176 live
rows); join graph↔events on constraint_code (TSDB constraint_id UUID ≠ Neo4j
node_id CUID — code is the only viable key). One additive migration (V0023:
constraint_code index, schema 22→23). 8 epics, 3 testing tiers, live Tier 3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): V0023 constraint_code index on constraint_outcomes (Epic 1)
Adds idx_constraint_outcomes_code (space_id, constraint_code, time DESC) — the
guidance-outcome federation joins graph↔events on constraint_code (TSDB
constraint_id is a UUID that doesn't match the Neo4j node_id CUID; code is the
only viable key), and migration 011 indexed only space/constraint_id/outcome.
Partial index (constraint_code NOT NULL AND <> '') skips uncoded outcomes.
Bumps TSDB_REQUIRED_SCHEMA_VERSION default 22→23 (config.go) to match the
migration count — CI schema-version validator gates on this. Additive, no data
change, idempotent.
Live-verified: migration applies (schema 22→23), idx present, re-apply is a
no-op, config/tsdb tests green, CI schema check 23=23.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): GuidanceOutcomesInNeighborhood federation method (Epic 2)
Second Pattern Y1 federation: walk a constraint's Neo4j neighborhood, collect
each neighbor's constraint_code, and join constraint_outcomes on those codes
(backed by the V0023 index). walkNeighborhoodWithCodes returns the neighborhood
node IDs + a code→node map; queryGuidanceOutcomes pulls coded outcomes in the
window; Go-side join resolves each outcome's code → its neighborhood constraint
node. Non-nil slices from the start (EVENTGRAPH-CLI-001 lesson). Reuses the
existing constraint_outcomes sink — no new table/writer.
Tier 1 (-race): validation guards, empty-arrays-not-null, sortedKeys
determinism, join resolution. Tier 2 integration (live Neo4j+TSDB): full
round-trip — hops=1 (seed+related codes, off-neighborhood excluded), hops=0
(seed code only), unknown-seed (empty non-nil). PASS.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): guidance-outcome federation handler + route (Epic 3)
POST /v1/eventgraph/guidance-outcome-neighborhood — walk a constraint's
neighborhood, surface constraint_outcomes whose code is in the neighborhood.
Same gating/auth/default convention as the reinforcement endpoint.
Single-source refactor (per the dynamic-variables directive): extracted the
shared gate (method/enabled/service → eventgraphGate) and default-resolution
(hops/since/limit + ceiling → resolveFederationDefaults) into helpers used by
BOTH handlers, so the federation rules live in exactly one place. The
reinforcement handler now calls them too — verified no regression (reinforcement
UATS still 6/6 live, unit tests green).
Live-verified: seeding from the real 'no-direct-main-commits' constraint node
surfaced real 'followed' outcomes with constraint_node_id resolved to the seed
and in_neighborhood=true.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-002): mdemg eventgraph guidance-outcome-neighborhood CLI (Epic 4)
Sibling subcommand consuming POST /v1/eventgraph/guidance-outcome-neighborhood.
Walks a constraint's neighborhood and renders guidance outcomes (followed/
ignored split + table: code · outcome · sim · g_type · guidance_id · recorded)
or --json. Seed via --seed/--query (--constraint-code seeding deferred — needs
server-side code→node resolution; --query covers discovery). Unset hops/since/
limit omitted so the server applies config defaults (single source of truth).
Tier 1 (-race): request-mapping omit-when-unset + conversion, --query seed
resolution, surfaced-503 error, render (empty + followed/ignored table),
truncStr. Help renders.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(eventgraph-002): UATS contract spec for guidance-outcome federation (Epic 5)
6 cases, validated 6/6 live: happy-200 response shape (outcomes/
neighbor_node_ids/neighbor_constraint_codes arrays, graph_hops/tsdb_rows_scanned
numbers, truncated boolean), missing space_id/seed → 400 (empty-string override
under deep-merge), negative_hops → 400, hops_over_ceiling → 400, GET → 405.
Tagged 'tsdb' so CI skips it without TSDB (the EVENTGRAPH-CLI-001 lesson).
sha256 hashed + verified.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): Tier 3 live verification (Epic 6)
Real binary against the real stack. Key assertion: CLI --json output matches
direct constraint_outcomes SQL exactly (11 outcomes = 11, all followed) for the
no-direct-main-commits constraint. --seed/--query/--limit/--json/unknown-seed/
no-arg all verified live. The --query "0 outcomes" result was traced to SQL
ground truth — the 5 neighborhood codes genuinely have no feedback, so it's
correct (federation distinguishes "code in neighborhood" from "code has
outcomes"), not a join bug. Reinforcement endpoint un-regressed by the shared-
helper refactor (UATS 6/6).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-002): feature doc + CHANGELOG + CLAUDE.md + close (Epic 7)
Feature doc gains a Guidance-Outcome Federation section (why reuse
constraint_outcomes, why join on constraint_code, CLI usage) + forward-look
update. CHANGELOG Added (endpoint + CLI) + Changed (TSDB schema 22→23).
CLAUDE.md architecture note extended. post.md closes the sprint with UxTS
mapping + follow-ups (--constraint-code seeding, EVENTGRAPH-003, UOTS).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(metrics,backup): resolve docker binary robustly under minimal launchd PATH
The native server (launchd) inherits PATH=/usr/bin:/bin:/usr/sbin:/sbin, which
excludes the Docker Desktop symlink (/usr/local/bin/docker). So every
server-runtime `docker` shellout failed with "executable file not found in
$PATH": (1) Neo4j container CPU/mem stats (server.go) logged an ERROR every 60s
and left the neo4j_container_* gauges empty — so the neo4j_high_cpu/_memory
alert rules had no data; (2) the TSDB backup scheduler's `docker compose
pg_dump` (backup.go) failed with only a slog.Warn. The DATA PLANE was never
affected — Neo4j (Bolt) + TSDB (pgx) connect over mapped TCP ports, not the
docker CLI.
Fix (durable, configurable, single-source): new internal/dockerbin resolver —
MDEMG_DOCKER_BIN env override → exec.LookPath → well-known install locations
(/usr/local/bin, /opt/homebrew/bin, /usr/bin) → graceful unavailable. Wired
into server.go (stats) + backup.go (both compose calls). The perpetual 60s
ERROR is downgraded to a one-shot WARN when docker is genuinely absent (it's
optional telemetry). Added a sane PATH to the launchd server plist template as
defense-in-depth.
Live-verified: after restart, mdemg_neo4j_container_cpu_percent=0.59 /
mem_percent=29.13 now land in metric_samples (were absent); no more docker-stats
ERROR; `docker stats` + backup resolve docker under a simulated minimal PATH.
Note: `mdemg data export-auto` (training-export job) was NOT a victim — it
exports via network SQL, not docker (corrected from an earlier assumption).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(nosilent-001): sprint plan — fail-loud scheduled jobs (Epic 0)
Triggered by a live-discovered silent failure: the TSDB backup scheduler was
failing every 24h run (docker-under-launchd-PATH) with only a buried slog.Warn.
Docker cause fixed (4cc7608); this sprint fixes the class — scheduled jobs that
fail with no record + no alert. V0024 scheduled_job_events hypertable + writer,
jobhealth.Report (record + alert on failure), wire the 3 jobs (backup,
maintenance, export-auto), 2 evaluator rules (backup staleness + recent
failure) so the server catches "job failed OR never ran". Config-driven, 3
testing tiers, live Tier-3 induced-failure.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): V0024 scheduled_job_events + writer (Epic 1)
Hypertable (job_name, success, latency_ms, error_message, metadata jsonb,
recorded_at) + RecordJobEvent synchronous single-row writer (mirrors V0021
model_install pattern). Indexes: per-job freshness, partial failed, per-space.
One row per scheduled-job run so the alert evaluator can detect "job failed"
AND "job never ran" (staleness). Schema 23->24; TSDB_REQUIRED_SCHEMA_VERSION
bumped to match migration count (CI check 24=24).
Tier 1 (-race): field mapping, optional-nulls, error truncation, nil-pool
no-op, insert-error propagation. Tier 2 (live TSDB): round-trip + the staleness
(recent successes) + failure (recent failures) query shapes the rules will use.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): record + alert on scheduled-job outcomes (Epic 2)
New internal/jobhealth.Report — the single policy point: record a
scheduled_job_events row and fire a high-severity "scheduled-job" alert on
failure (both pool + dispatcher nil-safe). Wired into all three jobs:
- TSDB backup scheduler (internal/tsdb/backup.go): decoupled JobResultFunc hook
(mutex-guarded, -race clean) so internal/tsdb stays free of internal/alert;
server.go::SetTSDBClient sets it with the pool + s.alertDispatcher. A failed
or never-run backup now records + alerts instead of a silent slog.Warn.
- export-auto + maintenance (CLI): deferred reportScheduledJob on the named
return error — opens a short-lived pool + a file-backed dispatcher (same
~/.mdemg/alerts/current.json the hooks surface) so a separate-process CLI job
still alerts the operator.
Tier 1 (-race): jobhealth fires alert only on failure (real file-backend
dispatcher), nil-safe. Live smoke: export-auto recorded success=t latency=3050ms.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): scheduled-job staleness + failure alert rules (Epic 3)
Two server-native evaluator rules over V0024 scheduled_job_events:
- scheduled_job_recent_failure (always on): any job failure in the last
JOB_FAILURE_LOOKBACK_MIN (default 60) → high alert.
- backup_no_recent_success (gated on TSDB_BACKUP_ENABLED): zero successful
tsdb-backups within the staleness window → high alert. THIS is the "job
never ran" guarantee — it fires from the server observing ABSENT success, so
a backup that silently died or never started is caught, not just one that
errored.
Window derives from the real backup interval × 2 (JOB_BACKUP_STALENESS_HOURS
override; no hardcoded literal). JOB_HEALTH_ALERT_ENABLED master gate (default
true). Appended after DefaultRules() in serve.go.
Tier 1: failure rule always present (gt 0), staleness gated on backups-enabled
(lt 0.5), windows reflect config, non-positive fallback. Build/lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(nosilent-001): distinct services for job rules so neither masks the other
Caught in live Tier-3 testing: both evaluator rules used Service="scheduled-jobs",
and the dispatcher cooldown key is (Service, Severity) — so the failure alert's
cooldown SUPPRESSED the staleness alert (only the failure fired). One alarm
masking another is the exact silent-failure class this sprint kills. Fixed:
scheduled-job-failure / scheduled-job-staleness distinct services. Re-verified
live — both fire independently and land as distinct alert-file entries. Tier-1
assertion pins that the two services differ. Includes Tier-3 verification.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(nosilent-001): feature doc + CHANGELOG + CLAUDE.md + close (Epic 4)
Feature doc docs/features/scheduled-job-health.md (why / two mechanisms /
operator view / config). CHANGELOG Added (NOSILENT-001) + Changed (schema 23→24)
+ Fixed (docker-under-launchd-PATH). CLAUDE.md Service Alert System extended
with the scheduled-job-health note + the distinct-Service-per-rule cooldown
caveat. post.md closes the sprint (UxTS mapping + follow-ups).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(nosilent-001): sync embedded launchd server plist with source (CI)
CI "Verify embedded launchd templates match source" diffs packaging/launchd/*
against internal/cli/launchd_templates/* (the embed.FS copy mdemg service
install uses). The PATH addition landed only in the source copy; sync the
embedded copy so they match byte-for-byte.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(roadmap): add jiminy-governance skill build-out (Workstream C, Action 7)
Records the jiminy-governance Claude Code skill on the active forward roadmap
(SPRINT_ROADMAP_POST_FT_LORA.md, cross-cutting governance) + brings the source
spec into the repo (docs/development/jiminy-governance-skill/SKILL.md, out of
~/Downloads). The skill makes Jiminy the deterministic source of context +
governance over J17, enforced by the PreToolUse hook — a routing/handshake shim,
not a rulebook. Build-out scope notes the wire-up placeholders that must be
resolved against the real instance (Jiminy MCP/endpoint, PreToolUse hook, J17
ack/RetireCode/GUIDANCE_OUTCOME calls). Aligns with the now-live guidance loop
(RRF-SCALE-001 / JIMINY-OUTCOME-001 / GUIDANCE-SYNTH-001).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-governance): resolve skill wire-up against the real instance
Step 1 of the jiminy-governance build-out (roadmap Workstream C Action 7).
Resolved all five placeholders from the running MDEMG instance, verified live:
- Jiminy query: MCP `mdemg mcp` (stdio) → jiminy_guide/validate_changes; HTTP
/v1/jiminy/guide (returns guidance_id) + /bootstrap glossary (j17v1) + /latest.
- PreToolUse: pre-bash-check.py (Bash, fail-closed) + pre-write-check.py
(Write/Edit → /v1/jiminy/classify, /strict-only, fail-open).
- SessionID: claude-core convention.
- Comprehension ack: /v1/jiminy/protocol/feedback (verified ingested 1).
- GUIDANCE_OUTCOME: /v1/jiminy/feedback {guidance_id,…}.
- RetireCode: internal-only by design (RSIC/APE protocol-evolution) — no
agent-facing call; agent must never self-retire a constraint.
Also surfaced the two real integration gaps the build-out must close (the work,
not the prose): (a) the MDEMG MCP server is NOT registered (.mcp.json absent —
context is pushed by prompt-context.sh, not pulled by the agent); (b) PreToolUse
enforcement is /strict-gated + fail-open, so not deterministic-by-default.
Roadmap Action 7 updated to reflect step 1 done + the two gaps as next steps.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(jiminy-governance): ship the J17 governance skill + register MDEMG MCP
Build-out of roadmap Workstream C Action 7 (steps 2-4), per the resolved
wire-up. Closes the two integration gaps found while resolving:
- gap 1 (MCP not registered): .mcp.json registers `mdemg mcp` (stdio).
Live-probed: 20 tools incl. jiminy_guide + validate_changes — the agent can
now PULL guidance, not only receive the hook push.
- gap 2 (enforcement /strict-gated + fail-open): policy set — Write/Edit J17
gate kept fail-open (hard server dependency on every edit is too brittle);
the skill's handshake auto-enables /strict so the gate is active per session.
Bash gate already fail-closed (demonstrated live when a test payload with a
destructive force-push string was blocked by pre-bash-check.py).
Skill authored at the canonical .claude/skills/jiminy-governance/SKILL.md
(frontmatter valid; concrete wire-up inline — MCP tools + HTTP endpoints +
SessionID claude-core + the 5-step handshake). Kept a routing/handshake shim,
not a rulebook (rules stay in the graph).
Live Tier-3 PASSED: full handshake identify->request->comprehend->act->report
ran against the real instance; GUIDANCE_OUTCOME edges 906->909 (one per coded
constraint). Verification: docs/development/jiminy-governance-skill/.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-governance): commit install-ready skill + install README
.claude/ is gitignored (per-developer local config), so the installed skill at
.claude/skills/jiminy-governance/SKILL.md is local-only by convention. Commit
the reproducible, install-ready copy (jiminy-governance.skill.md) + a README
with the one-line install (cp into .claude/skills/) so the skill propagates via
the repo. The MCP server it uses is registered in the tracked repo-root
.mcp.json.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): per-conversation SessionID across hooks + skill
Hooks and the jiminy-governance skill hardcoded session_id="claude-core", so
trust/escalation/observations from ALL Claude Code conversations collapsed into
one shared MDEMG session. Claude Code already passes a per-conversation
session_id on stdin to every hook — the implementation just never used it.
Resolver precedence (single rule everywhere): MDEMG_SESSION_ID env (stable-
identity escape hatch) > Claude Code stdin session_id (per-conversation default,
race-free per hook) > ~/.mdemg/.claude-session (published by SessionStart +
UserPromptSubmit for the agent / stdin-less contexts) > claude-core (fallback).
Realizes J17's intended per-(session,constraint) isolation.
Tracked templates updated (internal/cli/hook_templates/): session-start.sh,
prompt-context.sh, post-tool-observe.py, pre-compact.sh + Windows .ps1 variants
— every hardcoded claude-core in MDEMG calls replaced with the resolved id;
session-start/prompt-context publish the session file. Skill SessionID
instruction + handshake steps now resolve <SessionID> instead of claude-core.
Live-verified: a hook resolved a stdin session_id and published the session
file; post-tool-observe wrote an observation keyed to the per-conversation id in
Neo4j (not claude-core). bash -n / py_compile clean; go build + hooks test pass.
Note: pre-write-check.py is local-only (no tracked installer template); its fix
is applied on this machine but won't propagate via `mdemg hooks install` until
it's added to the tracked hooks (follow-up). .claude/* is gitignored, so the
live hook copies aren't committed — the tracked templates propagate via install.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): per-conversation SessionID — installed hook copies
The .claude/hooks/ installed copies are tracked (committed before the .claude/*
ignore), so apply the same SessionID resolver here as in the embedded templates
(prior commit). session-start.sh, prompt-context.sh, post-tool-observe.py,
pre-compact.sh now resolve MDEMG_SESSION_ID env > stdin session_id >
~/.mdemg/.claude-session > claude-core. (pre-write-check.py is untracked /
local-only — its fix lives on this machine only; tracking it is the follow-up.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): add pre-write-check.py to the tracked installer
The /strict J17 Write/Edit classify gate (pre-write-check.py) was local-only —
no tracked template, so `mdemg hooks install` never installed it and its
SessionID fix wouldn't propagate. Add it as a tracked template
(internal/cli/hook_templates/pre-write-check.py, space_id → {{SPACE_ID}},
runtime URL discovery — no {{MDEMG_URL}} placeholder per the template
convention) and register it in claudeHookFiles() as
{PreToolUse, 8s, "Write|Edit"}. hooks_test expectations updated 5→6.
go build + hooks test + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* release: cut v0.10.1
Promote CHANGELOG [Unreleased] → [0.10.1] - 2026-06-08. Adds the
jiminy-governance skill + MDEMG MCP registration and the per-conversation
SessionID work to the release notes (alongside the already-logged EVENTGRAPH-002,
EVENTGRAPH-CLI-001, NOSILENT-001, the docker-PATH fix, and the TSDB schema
22→23→24 bumps). Fresh empty [Unreleased]; comparison link refs updated +
backfilled through v0.10.1.
The v0.10.1 git tag (triggers release.yml artifact build) + homebrew formula
bump are the operator release-cut step on main, post-merge.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs: governance system doc + bring cli/api references current
(1) New docs/features/jiminy-governance.md — detailed how-it-works + full file
inventory for the J17 agent-governance system (skill, hooks, SessionID, MCP,
enforcement, runtime state files, install/verify steps).
(2) docs/user/api-reference.md — add the Event Graph Federation section
(POST /v1/eventgraph/{reinforcement,guidance-outcome}-neighborhood) + TOC entry;
these were the only two missing endpoints (audited the full route table).
(3) docs/user/cli-reference.md — add mdemg eventgraph {reinforcement,guidance-
outcome}-neighborhood, model run, watchdog status, migrate context-fingerprint,
data curate/validate/clean; fix the stale `model pull --adapter` description
(MODEL-DIST-002 shipped — no longer "deferred/errors"); update the Command Tree
Summary to match.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* chore(submodule): bump homebrew-mdemg to v0.10.1 formula
Point the parent at the manually-published v0.10.1 homebrew formula
(reh3376/homebrew-mdemg@10c1843). The release artifacts published cleanly;
the formula update was manual because the CI HOMEBREW_TAP_TOKEN expired
(follow-up: rotate the secret so future releases auto-publish).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): sprint plan — reinforcement coverage for other Hebbian paths
Wire the 3 remaining Hebbian write paths (CoactivateSession,
ApplySymbolCoactivation, ApplyNegativeFeedback weaken-only) into the existing
reinforcement_events writer via distinct trigger_path values. No schema/writer/
wiring change (V0022 already has trigger_path + signed delta_weight +
created_new_edge; writer already injected). Contradict path deferred (CONTRADICTS
edges aren't traversed by the federation walk). RETURN-only Cypher edits; Tier-2
asserts unchanged weights. 5 epics, 3 tiers, live Tier-3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire CoactivateSession into reinforcement_events (Epic 1)
CoactivateSession (session-internal conversation-observation co-activation, full
Hebbian formula) now emits per-pair reinforcement events with
trigger_path=coactivate_session. RETURN-only Cypher change: replaced the
discarded `count(*)` with the standard 17-field per-pair RETURN (one row per
forward edge; reverse is a mirror). Weight SET untouched → update behavior
provably unchanged. Mirrors the proven ApplyCoactivation record loop; writer
already injected. EXPLAIN-validated (compiles, all RETURN vars in scope, no
writes); build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplySymbolCoactivation into reinforcement_events (Epic 2)
SymbolNode-pair co-activation now emits trigger_path=apply_symbol_coactivation
rows. Split the weight update out of the ON MATCH clause into a separate SET so
the pre-update weight (w) can be captured for prev/new/delta — createdNew
(evidence_count=1) keeps a fresh edge at 0.1 and increments matches by +0.05,
preserving the original ON-clause weight behavior exactly. eta/surprise/
activation/path_sim are NULL (N/A for symbols); roles default 'symbol_node'.
EXPLAIN-validated; build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplyNegativeFeedback weaken path → reinforcement_events (Epic 3)
The weaken path (existing CO_ACTIVATED_WITH edge weakened by negWeight) now emits
trigger_path=apply_negative_feedback rows with a NEGATIVE delta_weight and
created_new_edge=false. The FOREACH writes (weaken SET + contradict MERGE) are
untouched; only the RETURN changed from aggregated `action,count(*)` to per-pair
rows — the Go side counts rows (sum = grouped count, NegativeFeedbackResult
preserved) and emits reinforcement events for weaken rows only. prevWeight is
captured before the FOREACH SET. Contradict path deliberately not emitted
(CONTRADICTS isn't traversed by the federation walk; deferred). EXPLAIN-validated;
build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(conversation): inject learning service so CoactivateSession actually runs
Discovered via EVENTGRAPH-003 live smoke: session co-activation
(CO_ACTIVATED_WITH edges between same-session conversation observations) had
NEVER fired — 0 such edges ever in mdemg-dev across 5495 conversation
observations. Root cause: conversation.NewServiceWithConfig sets
learningService=nil ("set via SetLearningService to avoid circular dependency"),
but SetLearningService had NO caller, so the `if s.learningService != nil` guard
in Observe() always skipped CoactivateSession. The function + its Cypher were
correct (verified by running it directly: 3 pairs, proper Hebbian weights) —
it was just never invoked.
Fix: convSvc.SetLearningService(lea) at construction. Live-verified: 3 distinct
observations in a session now create 6 CO_ACTIVATED_WITH edges + emit
coactivate_session reinforcement events. Standalone fix-commit per the
live-smoke precedent (surprise bugs don't get rolled into the sprint commit).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): Tier 3 verification + feature doc + CHANGELOG + close (Epic 4)
All four trigger_paths live-verified (apply_coactivation 50, apply_symbol_
coactivation 1000, apply_negative_feedback 1 negative-delta, coactivate_session
4 after the dormancy fix); federation CLI surfaces them. Feature doc updated to
all-four-paths + the trigger_path table; CHANGELOG Added (EVENTGRAPH-003) + Fixed
(CoactivateSession never-invoked); CLAUDE.md note + correction (CoactivateSession
was dead, not "writing via sidecar paths"); verification.md + post.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-004): sprint plan + CoactivateSession post-revival health review (Epic 0)
EVENTGRAPH-004 federates the last unfederated Hebbian write — the
ApplyNegativeFeedback contradict action — into reinforcement_events
(trigger_path=apply_negative_feedback_contradict). Data-decided scope:
reuse the existing V0022 sink (zero CONTRADICTS edges exist anywhere;
no producer calls /v1/learning/negative-feedback — instrument before
the producer arrives, the inverse of the dormancy pattern).
Also closes the EVENTGRAPH-003 follow-up: 30h post-fix health review of
the revived CoactivateSession path — no tuning needed, textbook session
cliques, pre-fix orphans stay as historical record (operator decision).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(eventgraph-004): wire ApplyNegativeFeedback contradict path → reinforcement_events (Epic 1)
The contradict action (no co-activation edge → MERGE CONTRADICTS) was the
last unfederated Hebbian write. The CONTRADICTS MERGE lived inside a
FOREACH, where the edge variable is invisible to RETURN — so the original
single statement is split into two statements in the SAME ExecuteWrite
transaction: (a) weaken (EVENTGRAPH-003 telemetry, RETURN unchanged) and
(b) contradict with a per-pair RETURN. Classification is identical: weaken
never deletes edges, so contradict's NOT EXISTS sees the same edge set the
original OPTIONAL MATCH did.
Contradict rows land with trigger_path=apply_negative_feedback_contradict.
created_new_edge detected via `c.updated_at IS NULL` (ON MATCH always sets
it; ON CREATE never does — invariant pinned by comment). delta_weight is
the CONTRADICTS edge's OWN weight delta (+negWeight on create, 0 on
re-match); negative-feedback semantics are carried by trigger_path, not
the sign.
Both statements EXPLAIN-validated against live Neo4j. Tier 1: 2 new parser
tests (create/re-match branches); learning suite green; lint clean.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): Tier 3 live verification — contradict create/re-match + weaken unchanged (Epic 2)
Live against the restarted Epic-1 binary: contradict create row
(+0.15, created_new_edge=true), re-match row (delta=0, evidence=2),
weaken row byte-equivalent to pre-split behavior (negative delta,
floor at 0). Federation CLI surfaces the new trigger_path with no
read-side change. UATS learning_negative_feedback 5/5 PASS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): feature doc + CHANGELOG + UATS pin + close (Epic 3)
Feature doc: 5-path trigger_path table + delta-semantics consumer
warning (contradict delta is the CONTRADICTS edge's own weight delta —
semantics live in trigger_path, not the sign). UATS spec extended:
zero-count equals assertions on nonexistent nodes (hash refreshed,
5/5 live). CLAUDE.md architecture note + producer-gap disclosure.
Sprint close in post.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci: auto-sync dev branch with main after each squash-merged PR
Squash merges never advance the dev branch's merge-base, so every
sprint touching CHANGELOG.md/CLAUDE.md hit CONFLICTING on its next PR
(first bitten: PR #419). New sync-dev-after-merge.yml merges main back
into the source *_dev* branch after each merged PR; the GITHUB_TOKEN
push triggers no other workflows, so it can never spawn an empty
auto-PR (the PR #420 failure mode). Conflicts fail loudly for manual
resolution; workflow_dispatch enables manual runs/live testing.
auto-pr.yml additionally skips PR creation when branch content is
identical to main — guards MANUAL sync pushes, verified against the
live repo state (current dev01 ≡ main → empty=true → skip).
actionlint clean (untrusted refs passed via env, not inline).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(roadmap): Q3 2026 vision-derived roadmap from 26-agent codebase deep-dive
Full-codebase review vs MDEMG's purpose (cognitive substrate / connection
layer): 19 map agents (3 vision + 16 subsystem), 3 cross-cutting assessors,
synthesizer + adversarial completeness critic (19 revisions applied).
Verdict: server-side substrate is mature, but the system is not currently
functioning as the assistant's internal dialogue — the per-prompt delivery
channel silently no-ops (hook reads .user_prompt, Claude Code sends
.prompt), 100% of GENERALIZES edges have NULL weight (22,170/22,170,
live-verified), scheduled decay/prune has been a permanent dry-run, RSIC
validates 16/17 actions vacuously, and supervision covers 3 of ~14
background loops. Every defect is the same disease: wired-looking seams
with no caller, wrong contract, or no reader.
4 phases ≈ 75 days committed: (1) reconnect the loop ends, (2) close the
learning loops, (3) survivability + class-ending forcing functions,
(4) FT frontier + release hygiene. Top-10 ranked; deferrals explicit.
Orchestrator spot-verification annex included (5 claims re-verified live).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): sprint plan — fix hook stdin contract, reconnect per-prompt channel (Epic 0)
Roadmap Q3 Phase 1 rank #1. Audit of all 6 hooks vs the actual Claude
Code stdin schemas: prompt-context.sh reads .user_prompt (CC sends
.prompt) → channel exits silently on every prompt; post-tool-observe.py
reads tool_output (CC sends tool_response) → false "Build/test
succeeded" observations with empty output; guidance wrongly coupled to
RESULT_COUNT>0; minor pre-compact transcript jq. session-start /
pre-bash-check / pre-write-check verified correct.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): prompt-context.sh reads .prompt — revive the per-prompt channel (Epic 1)
Claude Code's UserPromptSubmit stdin field is `prompt`; the hook read
`.user_prompt`, which is always empty → exit 0 → per-prompt CMS recall,
Jiminy guidance, /strict reformulation, the warm trigger, and the
retrieve-time Hebbian reinforcement have NEVER fired in any session.
Now reads `.prompt // .user_prompt` (legacy fallback kept).
Also decouples guidance from recall: the RESULT_COUNT=0 branch no longer
exits — it printed its notice then skipped guidance + warm + retrieval
reinforcement, coupling independent deliveries.
Both copies (live + installer template). Tier 1 simulated stdin: real
.prompt payload → first-ever guidance delivery (J17 T1 bootstrap + DICT,
5363 guidance bytes vs 0 forever); legacy fallback works; short/empty/
malformed payloads exit silently (fail-open preserved).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): post-tool-observe reads tool_response — end blind "succeeded" observations (Epic 2)
Claude Code's PostToolUse stdin field is `tool_response` (string or
object); the hook read `tool_output`, which is always absent → output_str
empty → error indicators never matched → every go build/go test/pytest
Bash call was recorded as "Build/test succeeded" sight-unseen, and real
errors were never observed.
Now reads tool_response (fallback tool_output) via _response_text(),
normalizing string|dict|list (stdout/stderr join). Success classification
requires NON-EMPTY clean output — a silent success records nothing rather
than fabricating; failures land as error observations with real stderr.
Both copies (template regenerated from fixed live, {{SPACE_ID}}
placeholder preserved, verified identical modulo placeholder). Tier 1
against real CMS: failing build → error obs with stderr; passing →
progress; empty → no record.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): pre-compact transcript extraction reads the real line shape (Epic 3)
Transcript lines are {type, message:{content:[{type, text|name, ...}]}};
the old top-level `.content` read always yielded empty, so pre-compaction
snapshots never carried recent-activity context. New jq walks
.message.content[] extracting .text/.name. Verified against this
session's real transcript (old: nothing; new: real activity). Both
copies, placeholders preserved.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): Tier 3 verification + CHANGELOG + CLAUDE.md contract pin + close (Epics 4-5)
Live in the real session: first-ever guidance delivery (J17 T1 bootstrap
+ DICT, 5363 bytes vs 0 forever); real failing build → error observation
with actual compiler output in CMS. PostToolUse success-only firing
documented as a limitation. Hook stdin contract pinned in CLAUDE.md.
Drift + clique-semantics findings logged for HOOKSYNC-001 / Phase 2.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): sprint plan — drift-proof + self-monitoring hook channel (Epic 0)
Roadmap Q3 Phase 1 rank #2. Investigation grounded all five findings:
template→live drift severed alert delivery (50-entry file actively
rotating today, never shown); no Cleared lifecycle (nothing sets the
field; no /v1/alert* endpoints); no absence detection for the channel
that just had a months-long silent outage; compose publishes 9999 on
0.0.0.0; neural sidecar binds 0.0.0.0:8101 via a 39-day-old process
serving pre-J17-fix code. 8 epics: reconcile, CI parity gate, clear
lifecycle, hook_events absence rule (reuses V0024 via jobhealth),
hooks doctor, PORT-TRUTH rider, Tier 3, docs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): reconcile bidirectional hook drift — alert delivery restored to live (Epic 1)
Live hooks adopted from templates (SPACE_ID substituted): restores the
alert-display blocks (all-pending per prompt; critical/high + degraded
healthz at session start) that the live copies lacked — the NOSILENT
last mile. Reverse drift caught during reconcile: the live hook's T1/T2
bootstrap-detection block (MAX_TIER → /v1/jiminy/bootstrap → ACTIVE
CONSTRAINTS header) never existed in the template and was nearly lost —
now single-sourced in the template and regenerated into live.
Live-verified: one prompt now renders alerts (50 pending incl. live
CRITICALs) + recall + J17:INIT bootstrap + guidance + synergy footer,
coexisting. All 6 hooks byte-identical to templates modulo {{SPACE_ID}}.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci(hooksync-001): hook-template parity gate — live hooks must match templates (Epic 2)
Mirrors the compose/launchd parity pattern: every *.sh/*.py template
must byte-match .claude/hooks/ modulo the {{SPACE_ID}} placeholder.
Proven locally: passes clean, fails (with a bounded diff dump) on
deliberate drift. Ends the bidirectional-drift class that severed
alert delivery and nearly lost the T1 bootstrap block.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): alert Cleared lifecycle — display once, then delivered (Epic 3)
Alert.Cleared existed but nothing ever set it: once hooks rendered the
file, the same entries would re-render every prompt forever. New:
FileBackend.Clear (ids and/or all_before cutoff, idempotent, under the
existing lock) → Dispatcher.ClearAlerts → POST /v1/alerts/clear. Hooks
now clear exactly what they displayed (fire-and-forget, fail-open);
cleared = delivered-to-operator, not resolved — persisting conditions
re-fire via the evaluator. Alert IDs now CUIDv2 per the identifier
standard (was UnixNano; old ids remain valid opaque strings).
Live-verified lifecycle: prompt 1 → "50 pending, showing 10" + 10
cleared; prompt 2 → "40 pending, showing 10" (next batch, no re-render)
→ 20 cleared. Tier 1: Clear by-id/by-time/idempotent/no-backend. UATS
alerts_clear 3/3 live (runner falsy-body inheritance discovered:
variant bodies must be non-empty objects).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): hook-channel absence detection — the channel now self-reports outages (Epic 4)
POST /v1/hooks/event records heartbeats into V0024 scheduled_job_events
via the jobhealth policy point (job_name hook:<name>; no new sink).
Two independent heartbeats: prompt-context fires per delivery (the
monitored channel); post-tool-observe fires throttled (HOOK_HEARTBEAT_
COOLDOWN_SEC, default 300 — proves sessions ACTIVE). Evaluator rule
hook_channel_silent (distinct service per the NOSILENT cooldown rule):
sessions active + zero prompt-context fires in HOOK_SILENT_LOOKBACK_
HOURS (24) → high alert. This is the "job never ran" guarantee applied
to the channel whose months-long outage HOOKWIRE-001 found only by
manual audit — the next contract drift self-reports.
Config: HOOK_HEALTH_ALERT_ENABLED (true), HOOK_SILENT_LOOKBACK_HOURS
(24), HOOK_ACTIVITY_MIN_EVENTS (5). Live-verified: real hook fires land
rows (session metadata, latency); throttle holds; rule SQL positive +
negative branches proven against the real table; UATS hooks_event 3/3.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): mdemg hooks doctor — one-shot hook-channel triage (Epic 5)
11 checks: per-hook template parity (the CI gate's local twin),
settings registration, server healthz, a stdin-contract self-test
piping a real-shape UserPromptSubmit payload through the installed
hook (asserts the always-present synergy footer), alert-file state
(pending/total), and the last hook:prompt-context heartbeat age from
scheduled_job_events (SKIP when TSDB unreachable). Table or --json;
non-zero exit on any FAIL.
Live: 11/11 PASS on this machine ("last fire 5s ago" — fed by the
doctor's own self-test); correctly fails (exit 1) on deliberate drift.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): PORT-TRUTH — loopback bind defaults + sidecar zombie replaced (Epic 6)
Compose published the API on 0.0.0.0 (unauthenticated admin/destructive
routes exposed off-host): now "${MDEMG_BIND_ADDR:-127.0.0.1}:${MDEMG_PORT}
:9999" — wide bind is an explicit opt-in (both compose copies, CI-synced).
Neural sidecar bound 0.0.0.0:8101 via config.py default AND the plist
arg: both now 127.0.0.1 (both plist copies, CI-synced; SIDECAR_HOST env
overrides). Operational: the 39-day-old sidecar process (started
2026-05-02, serving pre-J17-fix code) replaced — fresh process verified
on 127.0.0.1:8101, both models loaded, health 200.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): Tier 3 verification + feature doc + CHANGELOG + close (Epics 7-8)
Live-verified across the sprint: alert backlog drained 50→2 on real
prompts (display-then-clear); evaluator rules 15→16 (hook_channel_silent
loaded); doctor 11/11 + correct failure mode; sidecar fresh on
127.0.0.1:8101 (NLI 234ms). Feature doc docs/features/hook-channel-
health.md (config table incl. MDEMG_BIND_ADDR + SIDECAR_HOST). Findings:
packaging plists are templates (raw copy → launchd exit 78; service
install is canonical); UATS falsy-variant-body inheritance pinned.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(uats): jiminy_guide_sanitized timeout 30s → 90s — stale vs synthesis latency
Caught in the HOOKSYNC-001 full-suite regression: the synchronous
/v1/jiminy/guide includes local-model synthesis (~43s observed quiet,
~50s typical per GUIDANCE-SYNTH-001) — the spec's 30s timeout has been
silently erroring since synthesis latency grew. Aligned with the
JIMINY_WARM_COMPUTE_TIMEOUT_MS budget (90s); hash refreshed; passes
live. Pre-existing — not a HOOKSYNC regression (Guide path untouched).
The other 3 suite errors were load-induced flakes (pass individually):
suite-vs-llama-server slot contention, noted for UXTS-CI-001.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ci): track .claude/hooks/pre-write-check.py so hook-parity check passes
Root cause: the new 'Verify live hooks match hook templates' CI step
(HOOKSYNC-001) diffs every internal/cli/hook_templates/*.{sh,py} against
.claude/hooks/<name>, but the .gitignore allowlist only un-ignored the
5 original hooks. pre-write-check.py gained a template in this sprint
while its live counterpart stayed gitignored, so CI checked out a tree
without it and failed with 'MISSING live hook:
.claude/hooks/pre-write-check.py'.
Fix: add '!.claude/hooks/pre-write-check.py' to the allowlist and commit
the live hook (already byte-identical to its template modulo SPACE_ID),
preserving the full parity guarantee instead of weakening the CI step.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): sprint plan — real weights on the abstraction hierarchy (Epic 0)
Roadmap Q3 Phase 1 rank #3. Live investigation: point.distance() returns
NULL on embedding lists (proven: NULL where vector.similarity.cosine
returns 0.627 on the same pair); 3 creation sites affected incl. an
ABSTRACTS_TO site the audit missed. Scale worse than audited and
growing: 28,332/28,332 GENERALIZES + 36,110/37,996 ABSTRACTS_TO = 64,442
NULL-weight abstraction edges. Neo4j cosine returns [0,1] directly —
drop-in. Plan: fix sites (+ CUIDv2 edge ids), LIMIT-5-then-batched
backfill, null-weight gauge + alert rule via the existing graph-stats →
metric_samples path, UVTS-quick regression guard.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-weight-001): abstraction-edge weights — vector.similarity.cosine replaces point.distance (Epic 1)
point.distance() is a spatial-Point function: on embedding lists it
returns NULL, so every weight at the 3 abstraction-edge creation sites
was never set (100% of GENERALIZES + 95% of ABSTRACTS_TO weightless;
the CASE guards passed on good embeddings, then the THEN expr evaluated
NULL — edges with good embeddings got nothing while embedding-less ones
got the 0.5 fallback). vector.similarity.cosine returns [0,1] directly
(live-verified: identical=1.0, orthogonal=0.5, opposite=0.0). Site 1
(theme GENERALIZES) gains the null-guard it never had.
Also: edge_id randomUUID() → CUIDv2 per the identifier standard, minted
Go-side via memberEdgePairs (Cypher can't generate CUIDv2) and zipped
with member ids for UNWIND. All 3 statements EXPLAIN-validated live.
Tier 1: pair-builder tests (uniqueness, CUID format, empty input).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): mdemg graph backfill-weights — heal 56k NULL abstraction weights (Epic 2)
Standalone subcommand (deliberately NOT folded into `graph repair`,
whose orphan sweep would delete the pre-fix orphan observations the
operator chose to keep). Weight = vector.similarity.cosine(endpoint
embeddings) when both exist, else 0.5 (the creation sites' fallback);
similarity_score set alongside; idempotent (pure function of
embeddings); batched (default 1000/txn) with --limit for trials.
Executed per the small-batch-first rule: dry-run count → LIMIT-5 live
trial → hand-verified (stored ≡ independently recomputed to 6dp) →
distribution preview over 2000 (min 0.704, mean 0.96; the ~50% near-1.0
mass is single-member-cluster degeneracy — centroid ≡ member embedding,
HIDDEN-CHURN-001 territory, faithfully encoded) → full runs. Mid-run
the count GREW: the running server predated Epic 1 and kept minting
NULL edges — restarted on the fixed binary, swept stragglers, then
whk-wms (8,755) + linear (199). Final: 0 NULL / 57,395 edges globally.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): null-weight gauge + regression alert rule (Epic 3)
Query 4 in the graph-stats collector counts NULL-weight GENERALIZES/
ABSTRACTS_TO edges per space → new gauge
mdemg_neo4j_graph_null_weight_edges → metric_samples → evaluator rule
null_weight_abstraction_edges (service graph-weight-integrity, distinct
per the cooldown rule; NULL_WEIGHT_EDGE_ALERT_THRESHOLD default 100,
ForDuration 10m). Steady state post-backfill is 0; sustained
reappearance = the point.distance bug class regressed at a creation
site — it self-reports instead of waiting for the next audit.
Live: evaluator rules 16→17; gauge rows persisting at value 0 across
all spaces.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ingest): config-driven consolidation timeout — was sharing the 300s batch budget
Caught live during the HIDDEN-WEIGHT-001 corpus reingest: the post-ingest
/v1/memory/consolidate call used the shared batch-ingest client
(--timeout, 300s); consolidating a ~10k-node space exceeds that, so the
client reported failure while the server completed the work — the
GUIDANCE-SYNTH-001 bug class (long graph/LLM work needs its own budget).
New --consolidate-timeout flag / INGEST_CONSOLIDATE_TIMEOUT_SEC env
(default 1800s) with a dedicated client. Live-verified: "running
consolidation timeout_sec=1800" → complete.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): Tier 3 verification + corpus restoration + UVTS harness audit + close (Epics 4-5)
Tier 3: real consolidation minted edges with varied cosine weights
(0.83-0.94) + CUIDv2 ids; at-scale via the corpus reingest (9,500 edges,
0 NULL, mean 0.923); gauge holds 0; evaluator rules 16→17.
UVTS harness: corpus space lnl-demo-whk had been deleted with zero trace
(no UVTS run since 2026-05-04 measured anything real); restored by
operator-directed full reingest. A fresh baseline NUMBER remains blocked
by further live-found harness rot — grader/persist breakage, expected-
path format drift, vector post-filter dilution (service.go:1137 global
top-K then space filter) amplified by the duplicate whk-wms space —
complete defect inventory handed to UXTS-CI-001. Retrieval ranking on
the restored corpus verified correct (expected files at ranks 1-4).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(maint-live-001): sprint plan — scheduled maintenance actually runs (Epic 0)
Roadmap Q3 Phase 1 rank #4. Weekly decay+prune has never executed
(--dry-run defaults true; plist passes no override) while reporting
success — NOSILENT's blind spot. Tonight's Memory Bloat alerts (79k+
nodes) are the accumulated backlog. Safety verified in code before
planning: nodes are tombstoned (never deleted) with abstraction-chain/
degree/recency protections; edge deletion is the designed near-zero-
weight lifecycle, meaningful now that HIDDEN-WEIGHT made weights real.
Plan: live-by-default plist (+installed refresh), dry_run in job-event
metadata (no schema change — disclosed), maintenance_no_live_run
evaluator rule, darwin upgrade refreshes plists/hooks, first-ever live
run with preview-first protocol.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(maint-live-001): scheduled maintenance runs live — plist passes --dry-run=false (Epic 1)
The weekly LaunchAgent ran `mdemg maintenance` with no dry-run override;
the CLI defaults --dry-run=true, so every scheduled cycle previewed and
reported success — decay+prune NEVER executed (the 79k-node Memory
Bloat backlog). Both plist copies now pass --dry-run=false (the CLI
default stays true for safe manual previews — the SCHEDULE is what must
not silently no-op); installed plist refreshed + agent reloaded.
reportScheduledJobMeta threads job metadata into V0024; maintenance
records dry_run so the only-ever-dry-runs pattern is queryable
(metadata JSONB — no schema change, disclosed in the plan).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): maintenance_no_live_run evaluator rule (Epic 2)
Fires when maintenance rows exist in MAINT_LIVE_LOOKBACK_DAYS (default
8) but none ran live (success + metadata dry_run=false) — the only-
ever-dry-runs pattern self-reports instead of hiding inside "the job
ran". Distinct service maintenance-liveness per the cooldown rule.
Config: MAINT_LIVE_ALERT_ENABLED (true), MAINT_LIVE_LOOKBACK_DAYS (8).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): mdemg upgrade refreshes installed LaunchAgents + hooks (darwin) (Epic 3)
Plist/hook fixes shipped in releases but never reached installed
machines — the maintenance dry-run override would have sat unreachable
next to upgraded binaries forever. Upgrade now re-renders ALREADY-
INSTALLED mdemg LaunchAgents from the new binary's embedded templates
(refresh-only — never installs new services) + re-syncs mdemg-managed
Claude hooks in the current project (marker-checked). Substitution
logic single-sourced into renderLaunchdTemplate (Install + Refresh —
the drift class that exit-78'd the sidecar during HOOKSYNC live smoke).
Mirrors the existing Linux systemd-unit refresh.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): context-dependent orphan policy — --exclude-role-types (Epic 4a)
Orphan disposition is context-dependent (operator, 2026-06-11): a
uniform degree/age rule conflates governance constraints, conversation
history, test junk, and hierarchy debris. New --exclude-role-types on
prune + maintenance (env PRUNE_EXCLUDE_ROLE_TYPES) makes the policy
expressible; the scheduled plist ships
constraint,conversation_observation excluded per the operator's call
(constraints are load-bearing governance rules at any degree;
conversation observations differ by SESSION which the knob can't
express yet). Aged hierarchy debris stays eligible — that's the
lifecycle working. Candidate census that drove the decision: 5,388
conv-obs (9 eligible tonight under the 90d shield), 11 constraints,
238 hierarchy nodes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(prune): orphan sweeps use implicit transactions for batched deletes
Caught by the FIRST-EVER live maintenance run (MAINT-LIVE-001 Tier 3):
Neo4j raises TransactionStartFailed when a batched CALL-IN-TRANSACTIONS
statement executes inside an explicit transaction. Both orphan sweeps
(SymbolNode + Observation) ran their batched delete via ExecuteWrite;
the dry-run path never executes the deleting statement, so no preview
or unit test could surface it — only live execution. Switched to
session.Run (implicit tx). The failure ALSO proved the NOSILENT chain
live: the run fired "Scheduled job failed: maintenance" before exiting.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(maint-live-001): first live run verification + feature doc + CHANGELOG + close (Epics 4b-5)
First live maintenance in MDEMG history: 20,236 orphan SymbolNodes
deleted; all 5,010 tombstone candidates protected (recency + operator
exclusions); liveness rule born-firing → silenced by the real run; the
3-row job-event story (preview/true → failure/false alerted →
success/false) proves the dry_run plumbing through every path.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs: CLAUDE.md architecture note for MAINT-LIVE-001
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(embed-wire-001): sprint plan — embedder wiring + ingest exec resolution (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(embed-wire-001): breaker + recorder reach the real embedder through the wrapper chain (Epic 1)
The embedding circuit breaker was NEVER wired in any default deployment:
embeddings.New returns *CachedEmbedder when EMBEDDING_CACHE_ENABLED=true
(the default), so the server's emb.(*embeddings.OpenAI)/(*Ollama)
assertions on the OUTERMOST value failed silently (no else branch). The
recorder assertion had the inverse fragility (cache off → training-data
recording silently dies).
New: Unwrap() chain (CachedEmbedder joins RateLimitedEmbedder's existing
one) + embeddings.Base() / FindCached() interface-driven walkers — any
future wrapper joins by adding Unwrap(), no type lists. Wiring now walks
to the base for the breaker and to the cache layer for the recorder,
with LOUD warns when nothing matches. Tier 1 pins the production shape
(ratelimit(cache(provider))) plus cache-off and bare chains.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ingest-exec-001): server-triggered ingest resolves the mdemg binary — was hardcoded ./bin/mdemg (Epic 2)
Both ingest-job exec sites ran a relative "./bin/mdemg": broken in
Docker (the documented-primary deployment — binary at /usr/local/bin,
no repo checkout) and any CWD other than the repo root. New
resolveMdemgBin(): MDEMG_BIN env → os.Executable() (the server IS the
binary) → PATH → ./bin/mdemg legacy fallback; cached; Tier 1 pins the
order. Scheduled-sync jobs now report outcomes to scheduled_job_events
via jobhealth (job_name codebase-sync) — an unattended sync that keeps
failing is never silent; manual API jobs stay queue-visible only.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(embed-wire-001): live verification + CHANGELOG + CLAUDE.md + close (Epics 3-4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): sprint plan — documentation matches reality (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): CLAUDE.md FT section rewritten to post-pivot reality (Epic 1)
The section presented the abandoned Qwen3.6-35B-A3B MoE target, two-tier
MoE-Sieve strategy, and Sprint A→E critical path as CURRENT — all
superseded by the 2026-04-22 MoE→dense pivot; this stale text seeded the
Q3 roadmap audit with a dead architecture. Rewritten: shipped state
(dense Qwen3-14B mdemg-llm-v1, 0.8389, llama-server runtime), superseded
plan documented with the pivot rationale (never deleted — supersede-with-
pointer), guardrail llmclient exception marked CLOSED (re-verified in
code), memo-07 provenance break disclosed (the file never existed;
00_README_v2.md is canonical), open FT work named (FT-CLASSIFY-002 +
recursive-retraining trigger). Adapter env-name drift fixed
(MDEMG_ADAPTER_BASE, not MDEMG_MODEL_ADAPTER_BASE).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(doc-truth-001): operator-facing text matches the Phase 13.5 reality (Epic 2)
preflight errors directed operators to start the DECOMMISSIONED
mlx_lm.server on :8101 — following them reintroduces the crash-looping
stack Phase 13.5 replaced. Now: llama-server :8102 guidance (managed
service install + manual command), backend-agnostic wording. model.go
help text dropped three stale "deferred to MODEL-DIST-002" mentions
(shipped 2026-05-25). Operationally (untracked .env): removed the
J17_SIDECAR_TIMEOUT_MS=200 override that re-pinned the exact value
DH-004 remediated — the 1000ms default now applies; server restarted.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): 00_README STATUS block + AGENT_HANDOFF retired (Epic 3)
00_README_v2.md gains a top-of-file STATUS block: shipped-through-
cutover state, superseded MoE plan (FT-2 skip + FT-3 supersession +
R-LT-4 prototype-discipline adjudication recorded), the NOT-STARTED
recursive-retraining loop with its FT-CLASSIFY-002 trigger, and
provenance notes (memo-07 never existed; the spec is untracked pending
FG-2). AGENT_HANDOFF.md (stale since 2026-05-06) retired to a pointer
stub — handoff state lives in CLAUDE.md/roadmap/CHANGELOG/CMS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): grep-sweep proof + CHANGELOG + close (Epic 4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): last stale --adapter help string (sweep straggler)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-validate-001): sprint plan — fail-closed self-improvement (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): honest criteria evaluation — populated keys + fail-closed mutations (Epics 1-2)
The cycle baseline populated 10 metric keys while task criteria
referenced ~15 others (only volatile_count + correction_rate
intersected) → missing_data → skip → ~16/17 actions validated
vacuously; criteria-driven rollback was unreachable. The
SelfAssessmentReport already carried nearly every needed key — they
were never copied into the maps.
New single source reportMetricsMap() feeds BOTH MetricsBefore and
MetricsAfter (the mismatch class cannot recur), resolving
edges_below_threshold, total_edges, consolidation_age_sec,
avg_edge_weight, guidance_health, protocol_health + 13 more. Fail-
closed rule: for MUTATING actions (15-entry registry) a criterion with
missing evidence counts as NOT met ("missing_data_failclosed") — an
unverifiable mutation must never be recorded as success; observational
actions keep advisory semantics. The prior test pinned the vacuous
pass as the contract — updated to the honest one + advisory companion.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): tombstone_stale scoped to correction-linked nodes; refresh_stale_edges decays for real (Epic 3)
tombstone_stale archived 50 ARBITRARY older observations whenever ANY
correction existed in the 7-day window — no relationship between
correction and target. Now requires linkage: same session as the
correction OR its 1-hop CO_ACTIVATED_WITH neighborhood. Live check:
0 corrections in the current 7-day window, so both old and new scopes
are 0 RIGHT NOW — the hazard was conditional (any future correction
re-armed the old query against thousands of unrelated observations;
the new query bounds it to genuinely related nodes).
refresh_stale_edges bumped last_activated BEFORE the weight expression
read it → staleness=0 → the decay term vanished → every refresh was a
pure +0.1·log(count+1) boost. Staleness now captured via WITH before
SET; weights can genuinely decay. Both statements EXPLAIN-validated.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): counter-free confidence calibration — RSIC stops polluting its own signal (Epic 4)
RSIC-SK1 injected synthetic "followed"/"ignored" outcomes through
UpdateConfidence, incrementing total_surfaced/total_followed/
total_ignored — the exact counters GetConstraintEffectiveness reads
next cycle: measured effectiveness drove synthetic outcomes which drove
measured effectiveness (circular self-reinforcement). New
AdjustConfidenceDirect applies the clamp+archive confidence delta with
ZERO counter writes; the outcome counters now belong exclusively to
real guidance feedback. Provider interface + adapter + dispatcher use
the direct path with the configured boost/decay magnitudes; test mock
maps deltas back to outcome labels so existing assertions keep meaning.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-validate-001): Tier 3 verification + CHANGELOG + close (Epics 5-6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test(rsic-validate-001): integration seeds carry session linkage for the scoped tombstone contract
CI's TombstoneStaleEndToEnd + MultiActionDispatchAndMetrics failed
because the seeded observations had NO relationship to the seeded
corrections — under the old behavior they were archived anyway (the
memory-eroding bug the sprint removed); under the new correction-
linkage contract they are correctly spared. SeedObservationNodes now
stamps a per-space test session shared by corrections and their stale
peers, so the tests exercise the new contract. Query-level proof
against the exact seeded shape: 10/10 linked observations match the
scoped Cypher. (Local integration runs hit the 30s client timeout —
the loaded local stack's cycles take ~6 min; CI's arbitrates.)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rrf-scale-002): sprint plan — finish the score-scale contract (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): persistent rerank clients — failure alerting re-armed on the hottest LLM path (Epic 1)
doRerankWithOpenAI/doRerankWithOllama constructed a fresh llmclient per
call: the consecutive-failure counter reset every time, so
LLM_CONSECUTIVE_FAILURE_THRESHOLD could NEVER fire for
retrieval.rerank_cross / rerank_nli (a north-star distill task), and the
HTTP transport was discarded per call. Per-provider base clients now
init once (sync.Once); WithContext() shallow-copies and SHARES the
*atomic counter + breaker, so per-call contexts keep failure accounting.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): config-driven score thresholds — suggest revival, MCP tiers, guardrail floor (Epic 2)
Three score-literal leftovers from the RRF-SCALE-001 audit instruction:
(1) /v1/memory/suggest's hardcoded 0.5 min-confidence default filtered
nearly everything on a scale topping out ~0.58 → CONSULTING_SUGGEST_
MIN_CONFIDENCE (default 0.45, RRF-calibrated); (2) MCP memory_reflect
tiers 0.7/0.4 (high tier unreachable) → MCP_REFLECT_SCORE_HIGH/_MEDIUM
(0.45/0.25); (3) guardrail constraint-retrieval Cypher's hardcoded
sim > 0.3 → GUARDRAIL_CONSTRAINT_SIM_FLOOR via GuardrailConfig
(cosine-stable today but inside the class).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): CacheKey covers ALL result-affecting fields + two forcing functions (Epics 3-4)
CACHE-KEY-002: the key omitted result-affecting RetrieveRequest fields —
the audit named 5 (include/exclude_extensions, temporal_after/before,
policy_context); the new reflection forcing-function caught 8 MORE on
its first run: sparse-gate per-call overrides (SparseEnabled/
SparsePercentile/SparseOverridePresent/Category — the ?sparse= URL
params), pagination (Cursor/Limit), and the context-fingerprint params
(QueryContextFingerprint/StrictContextMode). All now keyed, plus a
caller-supplied query-embedding hash. Two requests differing in any of
these no longer collide on one cache entry.
Forcing functions: (1) reflection test — every RetrieveRequest field
must be in CacheKey or explicitly classified result-neutral with
justification (new f…
* fix(metrics,backup): resolve docker binary robustly under minimal launchd PATH
The native server (launchd) inherits PATH=/usr/bin:/bin:/usr/sbin:/sbin, which
excludes the Docker Desktop symlink (/usr/local/bin/docker). So every
server-runtime `docker` shellout failed with "executable file not found in
$PATH": (1) Neo4j container CPU/mem stats (server.go) logged an ERROR every 60s
and left the neo4j_container_* gauges empty — so the neo4j_high_cpu/_memory
alert rules had no data; (2) the TSDB backup scheduler's `docker compose
pg_dump` (backup.go) failed with only a slog.Warn. The DATA PLANE was never
affected — Neo4j (Bolt) + TSDB (pgx) connect over mapped TCP ports, not the
docker CLI.
Fix (durable, configurable, single-source): new internal/dockerbin resolver —
MDEMG_DOCKER_BIN env override → exec.LookPath → well-known install locations
(/usr/local/bin, /opt/homebrew/bin, /usr/bin) → graceful unavailable. Wired
into server.go (stats) + backup.go (both compose calls). The perpetual 60s
ERROR is downgraded to a one-shot WARN when docker is genuinely absent (it's
optional telemetry). Added a sane PATH to the launchd server plist template as
defense-in-depth.
Live-verified: after restart, mdemg_neo4j_container_cpu_percent=0.59 /
mem_percent=29.13 now land in metric_samples (were absent); no more docker-stats
ERROR; `docker stats` + backup resolve docker under a simulated minimal PATH.
Note: `mdemg data export-auto` (training-export job) was NOT a victim — it
exports via network SQL, not docker (corrected from an earlier assumption).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(nosilent-001): sprint plan — fail-loud scheduled jobs (Epic 0)
Triggered by a live-discovered silent failure: the TSDB backup scheduler was
failing every 24h run (docker-under-launchd-PATH) with only a buried slog.Warn.
Docker cause fixed (4cc7608); this sprint fixes the class — scheduled jobs that
fail with no record + no alert. V0024 scheduled_job_events hypertable + writer,
jobhealth.Report (record + alert on failure), wire the 3 jobs (backup,
maintenance, export-auto), 2 evaluator rules (backup staleness + recent
failure) so the server catches "job failed OR never ran". Config-driven, 3
testing tiers, live Tier-3 induced-failure.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): V0024 scheduled_job_events + writer (Epic 1)
Hypertable (job_name, success, latency_ms, error_message, metadata jsonb,
recorded_at) + RecordJobEvent synchronous single-row writer (mirrors V0021
model_install pattern). Indexes: per-job freshness, partial failed, per-space.
One row per scheduled-job run so the alert evaluator can detect "job failed"
AND "job never ran" (staleness). Schema 23->24; TSDB_REQUIRED_SCHEMA_VERSION
bumped to match migration count (CI check 24=24).
Tier 1 (-race): field mapping, optional-nulls, error truncation, nil-pool
no-op, insert-error propagation. Tier 2 (live TSDB): round-trip + the staleness
(recent successes) + failure (recent failures) query shapes the rules will use.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): record + alert on scheduled-job outcomes (Epic 2)
New internal/jobhealth.Report — the single policy point: record a
scheduled_job_events row and fire a high-severity "scheduled-job" alert on
failure (both pool + dispatcher nil-safe). Wired into all three jobs:
- TSDB backup scheduler (internal/tsdb/backup.go): decoupled JobResultFunc hook
(mutex-guarded, -race clean) so internal/tsdb stays free of internal/alert;
server.go::SetTSDBClient sets it with the pool + s.alertDispatcher. A failed
or never-run backup now records + alerts instead of a silent slog.Warn.
- export-auto + maintenance (CLI): deferred reportScheduledJob on the named
return error — opens a short-lived pool + a file-backed dispatcher (same
~/.mdemg/alerts/current.json the hooks surface) so a separate-process CLI job
still alerts the operator.
Tier 1 (-race): jobhealth fires alert only on failure (real file-backend
dispatcher), nil-safe. Live smoke: export-auto recorded success=t latency=3050ms.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(nosilent-001): scheduled-job staleness + failure alert rules (Epic 3)
Two server-native evaluator rules over V0024 scheduled_job_events:
- scheduled_job_recent_failure (always on): any job failure in the last
JOB_FAILURE_LOOKBACK_MIN (default 60) → high alert.
- backup_no_recent_success (gated on TSDB_BACKUP_ENABLED): zero successful
tsdb-backups within the staleness window → high alert. THIS is the "job
never ran" guarantee — it fires from the server observing ABSENT success, so
a backup that silently died or never started is caught, not just one that
errored.
Window derives from the real backup interval × 2 (JOB_BACKUP_STALENESS_HOURS
override; no hardcoded literal). JOB_HEALTH_ALERT_ENABLED master gate (default
true). Appended after DefaultRules() in serve.go.
Tier 1: failure rule always present (gt 0), staleness gated on backups-enabled
(lt 0.5), windows reflect config, non-positive fallback. Build/lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(nosilent-001): distinct services for job rules so neither masks the other
Caught in live Tier-3 testing: both evaluator rules used Service="scheduled-jobs",
and the dispatcher cooldown key is (Service, Severity) — so the failure alert's
cooldown SUPPRESSED the staleness alert (only the failure fired). One alarm
masking another is the exact silent-failure class this sprint kills. Fixed:
scheduled-job-failure / scheduled-job-staleness distinct services. Re-verified
live — both fire independently and land as distinct alert-file entries. Tier-1
assertion pins that the two services differ. Includes Tier-3 verification.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(nosilent-001): feature doc + CHANGELOG + CLAUDE.md + close (Epic 4)
Feature doc docs/features/scheduled-job-health.md (why / two mechanisms /
operator view / config). CHANGELOG Added (NOSILENT-001) + Changed (schema 23→24)
+ Fixed (docker-under-launchd-PATH). CLAUDE.md Service Alert System extended
with the scheduled-job-health note + the distinct-Service-per-rule cooldown
caveat. post.md closes the sprint (UxTS mapping + follow-ups).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(nosilent-001): sync embedded launchd server plist with source (CI)
CI "Verify embedded launchd templates match source" diffs packaging/launchd/*
against internal/cli/launchd_templates/* (the embed.FS copy mdemg service
install uses). The PATH addition landed only in the source copy; sync the
embedded copy so they match byte-for-byte.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(roadmap): add jiminy-governance skill build-out (Workstream C, Action 7)
Records the jiminy-governance Claude Code skill on the active forward roadmap
(SPRINT_ROADMAP_POST_FT_LORA.md, cross-cutting governance) + brings the source
spec into the repo (docs/development/jiminy-governance-skill/SKILL.md, out of
~/Downloads). The skill makes Jiminy the deterministic source of context +
governance over J17, enforced by the PreToolUse hook — a routing/handshake shim,
not a rulebook. Build-out scope notes the wire-up placeholders that must be
resolved against the real instance (Jiminy MCP/endpoint, PreToolUse hook, J17
ack/RetireCode/GUIDANCE_OUTCOME calls). Aligns with the now-live guidance loop
(RRF-SCALE-001 / JIMINY-OUTCOME-001 / GUIDANCE-SYNTH-001).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-governance): resolve skill wire-up against the real instance
Step 1 of the jiminy-governance build-out (roadmap Workstream C Action 7).
Resolved all five placeholders from the running MDEMG instance, verified live:
- Jiminy query: MCP `mdemg mcp` (stdio) → jiminy_guide/validate_changes; HTTP
/v1/jiminy/guide (returns guidance_id) + /bootstrap glossary (j17v1) + /latest.
- PreToolUse: pre-bash-check.py (Bash, fail-closed) + pre-write-check.py
(Write/Edit → /v1/jiminy/classify, /strict-only, fail-open).
- SessionID: claude-core convention.
- Comprehension ack: /v1/jiminy/protocol/feedback (verified ingested 1).
- GUIDANCE_OUTCOME: /v1/jiminy/feedback {guidance_id,…}.
- RetireCode: internal-only by design (RSIC/APE protocol-evolution) — no
agent-facing call; agent must never self-retire a constraint.
Also surfaced the two real integration gaps the build-out must close (the work,
not the prose): (a) the MDEMG MCP server is NOT registered (.mcp.json absent —
context is pushed by prompt-context.sh, not pulled by the agent); (b) PreToolUse
enforcement is /strict-gated + fail-open, so not deterministic-by-default.
Roadmap Action 7 updated to reflect step 1 done + the two gaps as next steps.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(jiminy-governance): ship the J17 governance skill + register MDEMG MCP
Build-out of roadmap Workstream C Action 7 (steps 2-4), per the resolved
wire-up. Closes the two integration gaps found while resolving:
- gap 1 (MCP not registered): .mcp.json registers `mdemg mcp` (stdio).
Live-probed: 20 tools incl. jiminy_guide + validate_changes — the agent can
now PULL guidance, not only receive the hook push.
- gap 2 (enforcement /strict-gated + fail-open): policy set — Write/Edit J17
gate kept fail-open (hard server dependency on every edit is too brittle);
the skill's handshake auto-enables /strict so the gate is active per session.
Bash gate already fail-closed (demonstrated live when a test payload with a
destructive force-push string was blocked by pre-bash-check.py).
Skill authored at the canonical .claude/skills/jiminy-governance/SKILL.md
(frontmatter valid; concrete wire-up inline — MCP tools + HTTP endpoints +
SessionID claude-core + the 5-step handshake). Kept a routing/handshake shim,
not a rulebook (rules stay in the graph).
Live Tier-3 PASSED: full handshake identify->request->comprehend->act->report
ran against the real instance; GUIDANCE_OUTCOME edges 906->909 (one per coded
constraint). Verification: docs/development/jiminy-governance-skill/.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-governance): commit install-ready skill + install README
.claude/ is gitignored (per-developer local config), so the installed skill at
.claude/skills/jiminy-governance/SKILL.md is local-only by convention. Commit
the reproducible, install-ready copy (jiminy-governance.skill.md) + a README
with the one-line install (cp into .claude/skills/) so the skill propagates via
the repo. The MCP server it uses is registered in the tracked repo-root
.mcp.json.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): per-conversation SessionID across hooks + skill
Hooks and the jiminy-governance skill hardcoded session_id="claude-core", so
trust/escalation/observations from ALL Claude Code conversations collapsed into
one shared MDEMG session. Claude Code already passes a per-conversation
session_id on stdin to every hook — the implementation just never used it.
Resolver precedence (single rule everywhere): MDEMG_SESSION_ID env (stable-
identity escape hatch) > Claude Code stdin session_id (per-conversation default,
race-free per hook) > ~/.mdemg/.claude-session (published by SessionStart +
UserPromptSubmit for the agent / stdin-less contexts) > claude-core (fallback).
Realizes J17's intended per-(session,constraint) isolation.
Tracked templates updated (internal/cli/hook_templates/): session-start.sh,
prompt-context.sh, post-tool-observe.py, pre-compact.sh + Windows .ps1 variants
— every hardcoded claude-core in MDEMG calls replaced with the resolved id;
session-start/prompt-context publish the session file. Skill SessionID
instruction + handshake steps now resolve <SessionID> instead of claude-core.
Live-verified: a hook resolved a stdin session_id and published the session
file; post-tool-observe wrote an observation keyed to the per-conversation id in
Neo4j (not claude-core). bash -n / py_compile clean; go build + hooks test pass.
Note: pre-write-check.py is local-only (no tracked installer template); its fix
is applied on this machine but won't propagate via `mdemg hooks install` until
it's added to the tracked hooks (follow-up). .claude/* is gitignored, so the
live hook copies aren't committed — the tracked templates propagate via install.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): per-conversation SessionID — installed hook copies
The .claude/hooks/ installed copies are tracked (committed before the .claude/*
ignore), so apply the same SessionID resolver here as in the embedded templates
(prior commit). session-start.sh, prompt-context.sh, post-tool-observe.py,
pre-compact.sh now resolve MDEMG_SESSION_ID env > stdin session_id >
~/.mdemg/.claude-session > claude-core. (pre-write-check.py is untracked /
local-only — its fix lives on this machine only; tracking it is the follow-up.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): add pre-write-check.py to the tracked installer
The /strict J17 Write/Edit classify gate (pre-write-check.py) was local-only —
no tracked template, so `mdemg hooks install` never installed it and its
SessionID fix wouldn't propagate. Add it as a tracked template
(internal/cli/hook_templates/pre-write-check.py, space_id → {{SPACE_ID}},
runtime URL discovery — no {{MDEMG_URL}} placeholder per the template
convention) and register it in claudeHookFiles() as
{PreToolUse, 8s, "Write|Edit"}. hooks_test expectations updated 5→6.
go build + hooks test + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* release: cut v0.10.1
Promote CHANGELOG [Unreleased] → [0.10.1] - 2026-06-08. Adds the
jiminy-governance skill + MDEMG MCP registration and the per-conversation
SessionID work to the release notes (alongside the already-logged EVENTGRAPH-002,
EVENTGRAPH-CLI-001, NOSILENT-001, the docker-PATH fix, and the TSDB schema
22→23→24 bumps). Fresh empty [Unreleased]; comparison link refs updated +
backfilled through v0.10.1.
The v0.10.1 git tag (triggers release.yml artifact build) + homebrew formula
bump are the operator release-cut step on main, post-merge.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs: governance system doc + bring cli/api references current
(1) New docs/features/jiminy-governance.md — detailed how-it-works + full file
inventory for the J17 agent-governance system (skill, hooks, SessionID, MCP,
enforcement, runtime state files, install/verify steps).
(2) docs/user/api-reference.md — add the Event Graph Federation section
(POST /v1/eventgraph/{reinforcement,guidance-outcome}-neighborhood) + TOC entry;
these were the only two missing endpoints (audited the full route table).
(3) docs/user/cli-reference.md — add mdemg eventgraph {reinforcement,guidance-
outcome}-neighborhood, model run, watchdog status, migrate context-fingerprint,
data curate/validate/clean; fix the stale `model pull --adapter` description
(MODEL-DIST-002 shipped — no longer "deferred/errors"); update the Command Tree
Summary to match.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* chore(submodule): bump homebrew-mdemg to v0.10.1 formula
Point the parent at the manually-published v0.10.1 homebrew formula
(reh3376/homebrew-mdemg@10c1843). The release artifacts published cleanly;
the formula update was manual because the CI HOMEBREW_TAP_TOKEN expired
(follow-up: rotate the secret so future releases auto-publish).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): sprint plan — reinforcement coverage for other Hebbian paths
Wire the 3 remaining Hebbian write paths (CoactivateSession,
ApplySymbolCoactivation, ApplyNegativeFeedback weaken-only) into the existing
reinforcement_events writer via distinct trigger_path values. No schema/writer/
wiring change (V0022 already has trigger_path + signed delta_weight +
created_new_edge; writer already injected). Contradict path deferred (CONTRADICTS
edges aren't traversed by the federation walk). RETURN-only Cypher edits; Tier-2
asserts unchanged weights. 5 epics, 3 tiers, live Tier-3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire CoactivateSession into reinforcement_events (Epic 1)
CoactivateSession (session-internal conversation-observation co-activation, full
Hebbian formula) now emits per-pair reinforcement events with
trigger_path=coactivate_session. RETURN-only Cypher change: replaced the
discarded `count(*)` with the standard 17-field per-pair RETURN (one row per
forward edge; reverse is a mirror). Weight SET untouched → update behavior
provably unchanged. Mirrors the proven ApplyCoactivation record loop; writer
already injected. EXPLAIN-validated (compiles, all RETURN vars in scope, no
writes); build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplySymbolCoactivation into reinforcement_events (Epic 2)
SymbolNode-pair co-activation now emits trigger_path=apply_symbol_coactivation
rows. Split the weight update out of the ON MATCH clause into a separate SET so
the pre-update weight (w) can be captured for prev/new/delta — createdNew
(evidence_count=1) keeps a fresh edge at 0.1 and increments matches by +0.05,
preserving the original ON-clause weight behavior exactly. eta/surprise/
activation/path_sim are NULL (N/A for symbols); roles default 'symbol_node'.
EXPLAIN-validated; build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplyNegativeFeedback weaken path → reinforcement_events (Epic 3)
The weaken path (existing CO_ACTIVATED_WITH edge weakened by negWeight) now emits
trigger_path=apply_negative_feedback rows with a NEGATIVE delta_weight and
created_new_edge=false. The FOREACH writes (weaken SET + contradict MERGE) are
untouched; only the RETURN changed from aggregated `action,count(*)` to per-pair
rows — the Go side counts rows (sum = grouped count, NegativeFeedbackResult
preserved) and emits reinforcement events for weaken rows only. prevWeight is
captured before the FOREACH SET. Contradict path deliberately not emitted
(CONTRADICTS isn't traversed by the federation walk; deferred). EXPLAIN-validated;
build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(conversation): inject learning service so CoactivateSession actually runs
Discovered via EVENTGRAPH-003 live smoke: session co-activation
(CO_ACTIVATED_WITH edges between same-session conversation observations) had
NEVER fired — 0 such edges ever in mdemg-dev across 5495 conversation
observations. Root cause: conversation.NewServiceWithConfig sets
learningService=nil ("set via SetLearningService to avoid circular dependency"),
but SetLearningService had NO caller, so the `if s.learningService != nil` guard
in Observe() always skipped CoactivateSession. The function + its Cypher were
correct (verified by running it directly: 3 pairs, proper Hebbian weights) —
it was just never invoked.
Fix: convSvc.SetLearningService(lea) at construction. Live-verified: 3 distinct
observations in a session now create 6 CO_ACTIVATED_WITH edges + emit
coactivate_session reinforcement events. Standalone fix-commit per the
live-smoke precedent (surprise bugs don't get rolled into the sprint commit).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): Tier 3 verification + feature doc + CHANGELOG + close (Epic 4)
All four trigger_paths live-verified (apply_coactivation 50, apply_symbol_
coactivation 1000, apply_negative_feedback 1 negative-delta, coactivate_session
4 after the dormancy fix); federation CLI surfaces them. Feature doc updated to
all-four-paths + the trigger_path table; CHANGELOG Added (EVENTGRAPH-003) + Fixed
(CoactivateSession never-invoked); CLAUDE.md note + correction (CoactivateSession
was dead, not "writing via sidecar paths"); verification.md + post.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-004): sprint plan + CoactivateSession post-revival health review (Epic 0)
EVENTGRAPH-004 federates the last unfederated Hebbian write — the
ApplyNegativeFeedback contradict action — into reinforcement_events
(trigger_path=apply_negative_feedback_contradict). Data-decided scope:
reuse the existing V0022 sink (zero CONTRADICTS edges exist anywhere;
no producer calls /v1/learning/negative-feedback — instrument before
the producer arrives, the inverse of the dormancy pattern).
Also closes the EVENTGRAPH-003 follow-up: 30h post-fix health review of
the revived CoactivateSession path — no tuning needed, textbook session
cliques, pre-fix orphans stay as historical record (operator decision).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(eventgraph-004): wire ApplyNegativeFeedback contradict path → reinforcement_events (Epic 1)
The contradict action (no co-activation edge → MERGE CONTRADICTS) was the
last unfederated Hebbian write. The CONTRADICTS MERGE lived inside a
FOREACH, where the edge variable is invisible to RETURN — so the original
single statement is split into two statements in the SAME ExecuteWrite
transaction: (a) weaken (EVENTGRAPH-003 telemetry, RETURN unchanged) and
(b) contradict with a per-pair RETURN. Classification is identical: weaken
never deletes edges, so contradict's NOT EXISTS sees the same edge set the
original OPTIONAL MATCH did.
Contradict rows land with trigger_path=apply_negative_feedback_contradict.
created_new_edge detected via `c.updated_at IS NULL` (ON MATCH always sets
it; ON CREATE never does — invariant pinned by comment). delta_weight is
the CONTRADICTS edge's OWN weight delta (+negWeight on create, 0 on
re-match); negative-feedback semantics are carried by trigger_path, not
the sign.
Both statements EXPLAIN-validated against live Neo4j. Tier 1: 2 new parser
tests (create/re-match branches); learning suite green; lint clean.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): Tier 3 live verification — contradict create/re-match + weaken unchanged (Epic 2)
Live against the restarted Epic-1 binary: contradict create row
(+0.15, created_new_edge=true), re-match row (delta=0, evidence=2),
weaken row byte-equivalent to pre-split behavior (negative delta,
floor at 0). Federation CLI surfaces the new trigger_path with no
read-side change. UATS learning_negative_feedback 5/5 PASS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): feature doc + CHANGELOG + UATS pin + close (Epic 3)
Feature doc: 5-path trigger_path table + delta-semantics consumer
warning (contradict delta is the CONTRADICTS edge's own weight delta —
semantics live in trigger_path, not the sign). UATS spec extended:
zero-count equals assertions on nonexistent nodes (hash refreshed,
5/5 live). CLAUDE.md architecture note + producer-gap disclosure.
Sprint close in post.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci: auto-sync dev branch with main after each squash-merged PR
Squash merges never advance the dev branch's merge-base, so every
sprint touching CHANGELOG.md/CLAUDE.md hit CONFLICTING on its next PR
(first bitten: PR #419). New sync-dev-after-merge.yml merges main back
into the source *_dev* branch after each merged PR; the GITHUB_TOKEN
push triggers no other workflows, so it can never spawn an empty
auto-PR (the PR #420 failure mode). Conflicts fail loudly for manual
resolution; workflow_dispatch enables manual runs/live testing.
auto-pr.yml additionally skips PR creation when branch content is
identical to main — guards MANUAL sync pushes, verified against the
live repo state (current dev01 ≡ main → empty=true → skip).
actionlint clean (untrusted refs passed via env, not inline).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(roadmap): Q3 2026 vision-derived roadmap from 26-agent codebase deep-dive
Full-codebase review vs MDEMG's purpose (cognitive substrate / connection
layer): 19 map agents (3 vision + 16 subsystem), 3 cross-cutting assessors,
synthesizer + adversarial completeness critic (19 revisions applied).
Verdict: server-side substrate is mature, but the system is not currently
functioning as the assistant's internal dialogue — the per-prompt delivery
channel silently no-ops (hook reads .user_prompt, Claude Code sends
.prompt), 100% of GENERALIZES edges have NULL weight (22,170/22,170,
live-verified), scheduled decay/prune has been a permanent dry-run, RSIC
validates 16/17 actions vacuously, and supervision covers 3 of ~14
background loops. Every defect is the same disease: wired-looking seams
with no caller, wrong contract, or no reader.
4 phases ≈ 75 days committed: (1) reconnect the loop ends, (2) close the
learning loops, (3) survivability + class-ending forcing functions,
(4) FT frontier + release hygiene. Top-10 ranked; deferrals explicit.
Orchestrator spot-verification annex included (5 claims re-verified live).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): sprint plan — fix hook stdin contract, reconnect per-prompt channel (Epic 0)
Roadmap Q3 Phase 1 rank #1. Audit of all 6 hooks vs the actual Claude
Code stdin schemas: prompt-context.sh reads .user_prompt (CC sends
.prompt) → channel exits silently on every prompt; post-tool-observe.py
reads tool_output (CC sends tool_response) → false "Build/test
succeeded" observations with empty output; guidance wrongly coupled to
RESULT_COUNT>0; minor pre-compact transcript jq. session-start /
pre-bash-check / pre-write-check verified correct.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): prompt-context.sh reads .prompt — revive the per-prompt channel (Epic 1)
Claude Code's UserPromptSubmit stdin field is `prompt`; the hook read
`.user_prompt`, which is always empty → exit 0 → per-prompt CMS recall,
Jiminy guidance, /strict reformulation, the warm trigger, and the
retrieve-time Hebbian reinforcement have NEVER fired in any session.
Now reads `.prompt // .user_prompt` (legacy fallback kept).
Also decouples guidance from recall: the RESULT_COUNT=0 branch no longer
exits — it printed its notice then skipped guidance + warm + retrieval
reinforcement, coupling independent deliveries.
Both copies (live + installer template). Tier 1 simulated stdin: real
.prompt payload → first-ever guidance delivery (J17 T1 bootstrap + DICT,
5363 guidance bytes vs 0 forever); legacy fallback works; short/empty/
malformed payloads exit silently (fail-open preserved).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): post-tool-observe reads tool_response — end blind "succeeded" observations (Epic 2)
Claude Code's PostToolUse stdin field is `tool_response` (string or
object); the hook read `tool_output`, which is always absent → output_str
empty → error indicators never matched → every go build/go test/pytest
Bash call was recorded as "Build/test succeeded" sight-unseen, and real
errors were never observed.
Now reads tool_response (fallback tool_output) via _response_text(),
normalizing string|dict|list (stdout/stderr join). Success classification
requires NON-EMPTY clean output — a silent success records nothing rather
than fabricating; failures land as error observations with real stderr.
Both copies (template regenerated from fixed live, {{SPACE_ID}}
placeholder preserved, verified identical modulo placeholder). Tier 1
against real CMS: failing build → error obs with stderr; passing →
progress; empty → no record.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): pre-compact transcript extraction reads the real line shape (Epic 3)
Transcript lines are {type, message:{content:[{type, text|name, ...}]}};
the old top-level `.content` read always yielded empty, so pre-compaction
snapshots never carried recent-activity context. New jq walks
.message.content[] extracting .text/.name. Verified against this
session's real transcript (old: nothing; new: real activity). Both
copies, placeholders preserved.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): Tier 3 verification + CHANGELOG + CLAUDE.md contract pin + close (Epics 4-5)
Live in the real session: first-ever guidance delivery (J17 T1 bootstrap
+ DICT, 5363 bytes vs 0 forever); real failing build → error observation
with actual compiler output in CMS. PostToolUse success-only firing
documented as a limitation. Hook stdin contract pinned in CLAUDE.md.
Drift + clique-semantics findings logged for HOOKSYNC-001 / Phase 2.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): sprint plan — drift-proof + self-monitoring hook channel (Epic 0)
Roadmap Q3 Phase 1 rank #2. Investigation grounded all five findings:
template→live drift severed alert delivery (50-entry file actively
rotating today, never shown); no Cleared lifecycle (nothing sets the
field; no /v1/alert* endpoints); no absence detection for the channel
that just had a months-long silent outage; compose publishes 9999 on
0.0.0.0; neural sidecar binds 0.0.0.0:8101 via a 39-day-old process
serving pre-J17-fix code. 8 epics: reconcile, CI parity gate, clear
lifecycle, hook_events absence rule (reuses V0024 via jobhealth),
hooks doctor, PORT-TRUTH rider, Tier 3, docs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): reconcile bidirectional hook drift — alert delivery restored to live (Epic 1)
Live hooks adopted from templates (SPACE_ID substituted): restores the
alert-display blocks (all-pending per prompt; critical/high + degraded
healthz at session start) that the live copies lacked — the NOSILENT
last mile. Reverse drift caught during reconcile: the live hook's T1/T2
bootstrap-detection block (MAX_TIER → /v1/jiminy/bootstrap → ACTIVE
CONSTRAINTS header) never existed in the template and was nearly lost —
now single-sourced in the template and regenerated into live.
Live-verified: one prompt now renders alerts (50 pending incl. live
CRITICALs) + recall + J17:INIT bootstrap + guidance + synergy footer,
coexisting. All 6 hooks byte-identical to templates modulo {{SPACE_ID}}.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci(hooksync-001): hook-template parity gate — live hooks must match templates (Epic 2)
Mirrors the compose/launchd parity pattern: every *.sh/*.py template
must byte-match .claude/hooks/ modulo the {{SPACE_ID}} placeholder.
Proven locally: passes clean, fails (with a bounded diff dump) on
deliberate drift. Ends the bidirectional-drift class that severed
alert delivery and nearly lost the T1 bootstrap block.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): alert Cleared lifecycle — display once, then delivered (Epic 3)
Alert.Cleared existed but nothing ever set it: once hooks rendered the
file, the same entries would re-render every prompt forever. New:
FileBackend.Clear (ids and/or all_before cutoff, idempotent, under the
existing lock) → Dispatcher.ClearAlerts → POST /v1/alerts/clear. Hooks
now clear exactly what they displayed (fire-and-forget, fail-open);
cleared = delivered-to-operator, not resolved — persisting conditions
re-fire via the evaluator. Alert IDs now CUIDv2 per the identifier
standard (was UnixNano; old ids remain valid opaque strings).
Live-verified lifecycle: prompt 1 → "50 pending, showing 10" + 10
cleared; prompt 2 → "40 pending, showing 10" (next batch, no re-render)
→ 20 cleared. Tier 1: Clear by-id/by-time/idempotent/no-backend. UATS
alerts_clear 3/3 live (runner falsy-body inheritance discovered:
variant bodies must be non-empty objects).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): hook-channel absence detection — the channel now self-reports outages (Epic 4)
POST /v1/hooks/event records heartbeats into V0024 scheduled_job_events
via the jobhealth policy point (job_name hook:<name>; no new sink).
Two independent heartbeats: prompt-context fires per delivery (the
monitored channel); post-tool-observe fires throttled (HOOK_HEARTBEAT_
COOLDOWN_SEC, default 300 — proves sessions ACTIVE). Evaluator rule
hook_channel_silent (distinct service per the NOSILENT cooldown rule):
sessions active + zero prompt-context fires in HOOK_SILENT_LOOKBACK_
HOURS (24) → high alert. This is the "job never ran" guarantee applied
to the channel whose months-long outage HOOKWIRE-001 found only by
manual audit — the next contract drift self-reports.
Config: HOOK_HEALTH_ALERT_ENABLED (true), HOOK_SILENT_LOOKBACK_HOURS
(24), HOOK_ACTIVITY_MIN_EVENTS (5). Live-verified: real hook fires land
rows (session metadata, latency); throttle holds; rule SQL positive +
negative branches proven against the real table; UATS hooks_event 3/3.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): mdemg hooks doctor — one-shot hook-channel triage (Epic 5)
11 checks: per-hook template parity (the CI gate's local twin),
settings registration, server healthz, a stdin-contract self-test
piping a real-shape UserPromptSubmit payload through the installed
hook (asserts the always-present synergy footer), alert-file state
(pending/total), and the last hook:prompt-context heartbeat age from
scheduled_job_events (SKIP when TSDB unreachable). Table or --json;
non-zero exit on any FAIL.
Live: 11/11 PASS on this machine ("last fire 5s ago" — fed by the
doctor's own self-test); correctly fails (exit 1) on deliberate drift.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): PORT-TRUTH — loopback bind defaults + sidecar zombie replaced (Epic 6)
Compose published the API on 0.0.0.0 (unauthenticated admin/destructive
routes exposed off-host): now "${MDEMG_BIND_ADDR:-127.0.0.1}:${MDEMG_PORT}
:9999" — wide bind is an explicit opt-in (both compose copies, CI-synced).
Neural sidecar bound 0.0.0.0:8101 via config.py default AND the plist
arg: both now 127.0.0.1 (both plist copies, CI-synced; SIDECAR_HOST env
overrides). Operational: the 39-day-old sidecar process (started
2026-05-02, serving pre-J17-fix code) replaced — fresh process verified
on 127.0.0.1:8101, both models loaded, health 200.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): Tier 3 verification + feature doc + CHANGELOG + close (Epics 7-8)
Live-verified across the sprint: alert backlog drained 50→2 on real
prompts (display-then-clear); evaluator rules 15→16 (hook_channel_silent
loaded); doctor 11/11 + correct failure mode; sidecar fresh on
127.0.0.1:8101 (NLI 234ms). Feature doc docs/features/hook-channel-
health.md (config table incl. MDEMG_BIND_ADDR + SIDECAR_HOST). Findings:
packaging plists are templates (raw copy → launchd exit 78; service
install is canonical); UATS falsy-variant-body inheritance pinned.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(uats): jiminy_guide_sanitized timeout 30s → 90s — stale vs synthesis latency
Caught in the HOOKSYNC-001 full-suite regression: the synchronous
/v1/jiminy/guide includes local-model synthesis (~43s observed quiet,
~50s typical per GUIDANCE-SYNTH-001) — the spec's 30s timeout has been
silently erroring since synthesis latency grew. Aligned with the
JIMINY_WARM_COMPUTE_TIMEOUT_MS budget (90s); hash refreshed; passes
live. Pre-existing — not a HOOKSYNC regression (Guide path untouched).
The other 3 suite errors were load-induced flakes (pass individually):
suite-vs-llama-server slot contention, noted for UXTS-CI-001.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ci): track .claude/hooks/pre-write-check.py so hook-parity check passes
Root cause: the new 'Verify live hooks match hook templates' CI step
(HOOKSYNC-001) diffs every internal/cli/hook_templates/*.{sh,py} against
.claude/hooks/<name>, but the .gitignore allowlist only un-ignored the
5 original hooks. pre-write-check.py gained a template in this sprint
while its live counterpart stayed gitignored, so CI checked out a tree
without it and failed with 'MISSING live hook:
.claude/hooks/pre-write-check.py'.
Fix: add '!.claude/hooks/pre-write-check.py' to the allowlist and commit
the live hook (already byte-identical to its template modulo SPACE_ID),
preserving the full parity guarantee instead of weakening the CI step.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): sprint plan — real weights on the abstraction hierarchy (Epic 0)
Roadmap Q3 Phase 1 rank #3. Live investigation: point.distance() returns
NULL on embedding lists (proven: NULL where vector.similarity.cosine
returns 0.627 on the same pair); 3 creation sites affected incl. an
ABSTRACTS_TO site the audit missed. Scale worse than audited and
growing: 28,332/28,332 GENERALIZES + 36,110/37,996 ABSTRACTS_TO = 64,442
NULL-weight abstraction edges. Neo4j cosine returns [0,1] directly —
drop-in. Plan: fix sites (+ CUIDv2 edge ids), LIMIT-5-then-batched
backfill, null-weight gauge + alert rule via the existing graph-stats →
metric_samples path, UVTS-quick regression guard.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-weight-001): abstraction-edge weights — vector.similarity.cosine replaces point.distance (Epic 1)
point.distance() is a spatial-Point function: on embedding lists it
returns NULL, so every weight at the 3 abstraction-edge creation sites
was never set (100% of GENERALIZES + 95% of ABSTRACTS_TO weightless;
the CASE guards passed on good embeddings, then the THEN expr evaluated
NULL — edges with good embeddings got nothing while embedding-less ones
got the 0.5 fallback). vector.similarity.cosine returns [0,1] directly
(live-verified: identical=1.0, orthogonal=0.5, opposite=0.0). Site 1
(theme GENERALIZES) gains the null-guard it never had.
Also: edge_id randomUUID() → CUIDv2 per the identifier standard, minted
Go-side via memberEdgePairs (Cypher can't generate CUIDv2) and zipped
with member ids for UNWIND. All 3 statements EXPLAIN-validated live.
Tier 1: pair-builder tests (uniqueness, CUID format, empty input).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): mdemg graph backfill-weights — heal 56k NULL abstraction weights (Epic 2)
Standalone subcommand (deliberately NOT folded into `graph repair`,
whose orphan sweep would delete the pre-fix orphan observations the
operator chose to keep). Weight = vector.similarity.cosine(endpoint
embeddings) when both exist, else 0.5 (the creation sites' fallback);
similarity_score set alongside; idempotent (pure function of
embeddings); batched (default 1000/txn) with --limit for trials.
Executed per the small-batch-first rule: dry-run count → LIMIT-5 live
trial → hand-verified (stored ≡ independently recomputed to 6dp) →
distribution preview over 2000 (min 0.704, mean 0.96; the ~50% near-1.0
mass is single-member-cluster degeneracy — centroid ≡ member embedding,
HIDDEN-CHURN-001 territory, faithfully encoded) → full runs. Mid-run
the count GREW: the running server predated Epic 1 and kept minting
NULL edges — restarted on the fixed binary, swept stragglers, then
whk-wms (8,755) + linear (199). Final: 0 NULL / 57,395 edges globally.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): null-weight gauge + regression alert rule (Epic 3)
Query 4 in the graph-stats collector counts NULL-weight GENERALIZES/
ABSTRACTS_TO edges per space → new gauge
mdemg_neo4j_graph_null_weight_edges → metric_samples → evaluator rule
null_weight_abstraction_edges (service graph-weight-integrity, distinct
per the cooldown rule; NULL_WEIGHT_EDGE_ALERT_THRESHOLD default 100,
ForDuration 10m). Steady state post-backfill is 0; sustained
reappearance = the point.distance bug class regressed at a creation
site — it self-reports instead of waiting for the next audit.
Live: evaluator rules 16→17; gauge rows persisting at value 0 across
all spaces.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ingest): config-driven consolidation timeout — was sharing the 300s batch budget
Caught live during the HIDDEN-WEIGHT-001 corpus reingest: the post-ingest
/v1/memory/consolidate call used the shared batch-ingest client
(--timeout, 300s); consolidating a ~10k-node space exceeds that, so the
client reported failure while the server completed the work — the
GUIDANCE-SYNTH-001 bug class (long graph/LLM work needs its own budget).
New --consolidate-timeout flag / INGEST_CONSOLIDATE_TIMEOUT_SEC env
(default 1800s) with a dedicated client. Live-verified: "running
consolidation timeout_sec=1800" → complete.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): Tier 3 verification + corpus restoration + UVTS harness audit + close (Epics 4-5)
Tier 3: real consolidation minted edges with varied cosine weights
(0.83-0.94) + CUIDv2 ids; at-scale via the corpus reingest (9,500 edges,
0 NULL, mean 0.923); gauge holds 0; evaluator rules 16→17.
UVTS harness: corpus space lnl-demo-whk had been deleted with zero trace
(no UVTS run since 2026-05-04 measured anything real); restored by
operator-directed full reingest. A fresh baseline NUMBER remains blocked
by further live-found harness rot — grader/persist breakage, expected-
path format drift, vector post-filter dilution (service.go:1137 global
top-K then space filter) amplified by the duplicate whk-wms space —
complete defect inventory handed to UXTS-CI-001. Retrieval ranking on
the restored corpus verified correct (expected files at ranks 1-4).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(maint-live-001): sprint plan — scheduled maintenance actually runs (Epic 0)
Roadmap Q3 Phase 1 rank #4. Weekly decay+prune has never executed
(--dry-run defaults true; plist passes no override) while reporting
success — NOSILENT's blind spot. Tonight's Memory Bloat alerts (79k+
nodes) are the accumulated backlog. Safety verified in code before
planning: nodes are tombstoned (never deleted) with abstraction-chain/
degree/recency protections; edge deletion is the designed near-zero-
weight lifecycle, meaningful now that HIDDEN-WEIGHT made weights real.
Plan: live-by-default plist (+installed refresh), dry_run in job-event
metadata (no schema change — disclosed), maintenance_no_live_run
evaluator rule, darwin upgrade refreshes plists/hooks, first-ever live
run with preview-first protocol.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(maint-live-001): scheduled maintenance runs live — plist passes --dry-run=false (Epic 1)
The weekly LaunchAgent ran `mdemg maintenance` with no dry-run override;
the CLI defaults --dry-run=true, so every scheduled cycle previewed and
reported success — decay+prune NEVER executed (the 79k-node Memory
Bloat backlog). Both plist copies now pass --dry-run=false (the CLI
default stays true for safe manual previews — the SCHEDULE is what must
not silently no-op); installed plist refreshed + agent reloaded.
reportScheduledJobMeta threads job metadata into V0024; maintenance
records dry_run so the only-ever-dry-runs pattern is queryable
(metadata JSONB — no schema change, disclosed in the plan).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): maintenance_no_live_run evaluator rule (Epic 2)
Fires when maintenance rows exist in MAINT_LIVE_LOOKBACK_DAYS (default
8) but none ran live (success + metadata dry_run=false) — the only-
ever-dry-runs pattern self-reports instead of hiding inside "the job
ran". Distinct service maintenance-liveness per the cooldown rule.
Config: MAINT_LIVE_ALERT_ENABLED (true), MAINT_LIVE_LOOKBACK_DAYS (8).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): mdemg upgrade refreshes installed LaunchAgents + hooks (darwin) (Epic 3)
Plist/hook fixes shipped in releases but never reached installed
machines — the maintenance dry-run override would have sat unreachable
next to upgraded binaries forever. Upgrade now re-renders ALREADY-
INSTALLED mdemg LaunchAgents from the new binary's embedded templates
(refresh-only — never installs new services) + re-syncs mdemg-managed
Claude hooks in the current project (marker-checked). Substitution
logic single-sourced into renderLaunchdTemplate (Install + Refresh —
the drift class that exit-78'd the sidecar during HOOKSYNC live smoke).
Mirrors the existing Linux systemd-unit refresh.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): context-dependent orphan policy — --exclude-role-types (Epic 4a)
Orphan disposition is context-dependent (operator, 2026-06-11): a
uniform degree/age rule conflates governance constraints, conversation
history, test junk, and hierarchy debris. New --exclude-role-types on
prune + maintenance (env PRUNE_EXCLUDE_ROLE_TYPES) makes the policy
expressible; the scheduled plist ships
constraint,conversation_observation excluded per the operator's call
(constraints are load-bearing governance rules at any degree;
conversation observations differ by SESSION which the knob can't
express yet). Aged hierarchy debris stays eligible — that's the
lifecycle working. Candidate census that drove the decision: 5,388
conv-obs (9 eligible tonight under the 90d shield), 11 constraints,
238 hierarchy nodes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(prune): orphan sweeps use implicit transactions for batched deletes
Caught by the FIRST-EVER live maintenance run (MAINT-LIVE-001 Tier 3):
Neo4j raises TransactionStartFailed when a batched CALL-IN-TRANSACTIONS
statement executes inside an explicit transaction. Both orphan sweeps
(SymbolNode + Observation) ran their batched delete via ExecuteWrite;
the dry-run path never executes the deleting statement, so no preview
or unit test could surface it — only live execution. Switched to
session.Run (implicit tx). The failure ALSO proved the NOSILENT chain
live: the run fired "Scheduled job failed: maintenance" before exiting.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(maint-live-001): first live run verification + feature doc + CHANGELOG + close (Epics 4b-5)
First live maintenance in MDEMG history: 20,236 orphan SymbolNodes
deleted; all 5,010 tombstone candidates protected (recency + operator
exclusions); liveness rule born-firing → silenced by the real run; the
3-row job-event story (preview/true → failure/false alerted →
success/false) proves the dry_run plumbing through every path.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs: CLAUDE.md architecture note for MAINT-LIVE-001
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(embed-wire-001): sprint plan — embedder wiring + ingest exec resolution (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(embed-wire-001): breaker + recorder reach the real embedder through the wrapper chain (Epic 1)
The embedding circuit breaker was NEVER wired in any default deployment:
embeddings.New returns *CachedEmbedder when EMBEDDING_CACHE_ENABLED=true
(the default), so the server's emb.(*embeddings.OpenAI)/(*Ollama)
assertions on the OUTERMOST value failed silently (no else branch). The
recorder assertion had the inverse fragility (cache off → training-data
recording silently dies).
New: Unwrap() chain (CachedEmbedder joins RateLimitedEmbedder's existing
one) + embeddings.Base() / FindCached() interface-driven walkers — any
future wrapper joins by adding Unwrap(), no type lists. Wiring now walks
to the base for the breaker and to the cache layer for the recorder,
with LOUD warns when nothing matches. Tier 1 pins the production shape
(ratelimit(cache(provider))) plus cache-off and bare chains.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ingest-exec-001): server-triggered ingest resolves the mdemg binary — was hardcoded ./bin/mdemg (Epic 2)
Both ingest-job exec sites ran a relative "./bin/mdemg": broken in
Docker (the documented-primary deployment — binary at /usr/local/bin,
no repo checkout) and any CWD other than the repo root. New
resolveMdemgBin(): MDEMG_BIN env → os.Executable() (the server IS the
binary) → PATH → ./bin/mdemg legacy fallback; cached; Tier 1 pins the
order. Scheduled-sync jobs now report outcomes to scheduled_job_events
via jobhealth (job_name codebase-sync) — an unattended sync that keeps
failing is never silent; manual API jobs stay queue-visible only.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(embed-wire-001): live verification + CHANGELOG + CLAUDE.md + close (Epics 3-4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): sprint plan — documentation matches reality (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): CLAUDE.md FT section rewritten to post-pivot reality (Epic 1)
The section presented the abandoned Qwen3.6-35B-A3B MoE target, two-tier
MoE-Sieve strategy, and Sprint A→E critical path as CURRENT — all
superseded by the 2026-04-22 MoE→dense pivot; this stale text seeded the
Q3 roadmap audit with a dead architecture. Rewritten: shipped state
(dense Qwen3-14B mdemg-llm-v1, 0.8389, llama-server runtime), superseded
plan documented with the pivot rationale (never deleted — supersede-with-
pointer), guardrail llmclient exception marked CLOSED (re-verified in
code), memo-07 provenance break disclosed (the file never existed;
00_README_v2.md is canonical), open FT work named (FT-CLASSIFY-002 +
recursive-retraining trigger). Adapter env-name drift fixed
(MDEMG_ADAPTER_BASE, not MDEMG_MODEL_ADAPTER_BASE).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(doc-truth-001): operator-facing text matches the Phase 13.5 reality (Epic 2)
preflight errors directed operators to start the DECOMMISSIONED
mlx_lm.server on :8101 — following them reintroduces the crash-looping
stack Phase 13.5 replaced. Now: llama-server :8102 guidance (managed
service install + manual command), backend-agnostic wording. model.go
help text dropped three stale "deferred to MODEL-DIST-002" mentions
(shipped 2026-05-25). Operationally (untracked .env): removed the
J17_SIDECAR_TIMEOUT_MS=200 override that re-pinned the exact value
DH-004 remediated — the 1000ms default now applies; server restarted.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): 00_README STATUS block + AGENT_HANDOFF retired (Epic 3)
00_README_v2.md gains a top-of-file STATUS block: shipped-through-
cutover state, superseded MoE plan (FT-2 skip + FT-3 supersession +
R-LT-4 prototype-discipline adjudication recorded), the NOT-STARTED
recursive-retraining loop with its FT-CLASSIFY-002 trigger, and
provenance notes (memo-07 never existed; the spec is untracked pending
FG-2). AGENT_HANDOFF.md (stale since 2026-05-06) retired to a pointer
stub — handoff state lives in CLAUDE.md/roadmap/CHANGELOG/CMS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): grep-sweep proof + CHANGELOG + close (Epic 4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): last stale --adapter help string (sweep straggler)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-validate-001): sprint plan — fail-closed self-improvement (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): honest criteria evaluation — populated keys + fail-closed mutations (Epics 1-2)
The cycle baseline populated 10 metric keys while task criteria
referenced ~15 others (only volatile_count + correction_rate
intersected) → missing_data → skip → ~16/17 actions validated
vacuously; criteria-driven rollback was unreachable. The
SelfAssessmentReport already carried nearly every needed key — they
were never copied into the maps.
New single source reportMetricsMap() feeds BOTH MetricsBefore and
MetricsAfter (the mismatch class cannot recur), resolving
edges_below_threshold, total_edges, consolidation_age_sec,
avg_edge_weight, guidance_health, protocol_health + 13 more. Fail-
closed rule: for MUTATING actions (15-entry registry) a criterion with
missing evidence counts as NOT met ("missing_data_failclosed") — an
unverifiable mutation must never be recorded as success; observational
actions keep advisory semantics. The prior test pinned the vacuous
pass as the contract — updated to the honest one + advisory companion.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): tombstone_stale scoped to correction-linked nodes; refresh_stale_edges decays for real (Epic 3)
tombstone_stale archived 50 ARBITRARY older observations whenever ANY
correction existed in the 7-day window — no relationship between
correction and target. Now requires linkage: same session as the
correction OR its 1-hop CO_ACTIVATED_WITH neighborhood. Live check:
0 corrections in the current 7-day window, so both old and new scopes
are 0 RIGHT NOW — the hazard was conditional (any future correction
re-armed the old query against thousands of unrelated observations;
the new query bounds it to genuinely related nodes).
refresh_stale_edges bumped last_activated BEFORE the weight expression
read it → staleness=0 → the decay term vanished → every refresh was a
pure +0.1·log(count+1) boost. Staleness now captured via WITH before
SET; weights can genuinely decay. Both statements EXPLAIN-validated.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): counter-free confidence calibration — RSIC stops polluting its own signal (Epic 4)
RSIC-SK1 injected synthetic "followed"/"ignored" outcomes through
UpdateConfidence, incrementing total_surfaced/total_followed/
total_ignored — the exact counters GetConstraintEffectiveness reads
next cycle: measured effectiveness drove synthetic outcomes which drove
measured effectiveness (circular self-reinforcement). New
AdjustConfidenceDirect applies the clamp+archive confidence delta with
ZERO counter writes; the outcome counters now belong exclusively to
real guidance feedback. Provider interface + adapter + dispatcher use
the direct path with the configured boost/decay magnitudes; test mock
maps deltas back to outcome labels so existing assertions keep meaning.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-validate-001): Tier 3 verification + CHANGELOG + close (Epics 5-6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test(rsic-validate-001): integration seeds carry session linkage for the scoped tombstone contract
CI's TombstoneStaleEndToEnd + MultiActionDispatchAndMetrics failed
because the seeded observations had NO relationship to the seeded
corrections — under the old behavior they were archived anyway (the
memory-eroding bug the sprint removed); under the new correction-
linkage contract they are correctly spared. SeedObservationNodes now
stamps a per-space test session shared by corrections and their stale
peers, so the tests exercise the new contract. Query-level proof
against the exact seeded shape: 10/10 linked observations match the
scoped Cypher. (Local integration runs hit the 30s client timeout —
the loaded local stack's cycles take ~6 min; CI's arbitrates.)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rrf-scale-002): sprint plan — finish the score-scale contract (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): persistent rerank clients — failure alerting re-armed on the hottest LLM path (Epic 1)
doRerankWithOpenAI/doRerankWithOllama constructed a fresh llmclient per
call: the consecutive-failure counter reset every time, so
LLM_CONSECUTIVE_FAILURE_THRESHOLD could NEVER fire for
retrieval.rerank_cross / rerank_nli (a north-star distill task), and the
HTTP transport was discarded per call. Per-provider base clients now
init once (sync.Once); WithContext() shallow-copies and SHARES the
*atomic counter + breaker, so per-call contexts keep failure accounting.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): config-driven score thresholds — suggest revival, MCP tiers, guardrail floor (Epic 2)
Three score-literal leftovers from the RRF-SCALE-001 audit instruction:
(1) /v1/memory/suggest's hardcoded 0.5 min-confidence default filtered
nearly everything on a scale topping out ~0.58 → CONSULTING_SUGGEST_
MIN_CONFIDENCE (default 0.45, RRF-calibrated); (2) MCP memory_reflect
tiers 0.7/0.4 (high tier unreachable) → MCP_REFLECT_SCORE_HIGH/_MEDIUM
(0.45/0.25); (3) guardrail constraint-retrieval Cypher's hardcoded
sim > 0.3 → GUARDRAIL_CONSTRAINT_SIM_FLOOR via GuardrailConfig
(cosine-stable today but inside the class).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): CacheKey covers ALL result-affecting fields + two forcing functions (Epics 3-4)
CACHE-KEY-002: the key omitted result-affecting RetrieveRequest fields —
the audit named 5 (include/exclude_extensions, temporal_after/before,
policy_context); the new reflection forcing-function caught 8 MORE on
its first run: sparse-gate per-call overrides (SparseEnabled/
SparsePercentile/SparseOverridePresent/Category — the ?sparse= URL
params), pagination (Cursor/Limit), and the context-fingerprint params
(QueryContextFingerprint/StrictContextMode). All now keyed, plus a
caller-supplied query-embedding hash. Two requests differing in any of
these no longer collide on one cache entry.
Forcing functions: (1) reflection test — every RetrieveRequest field
must be in CacheKey or explicitly classified result-neutral with
justification (new fields fail until classified); (2) score-literal
scan — flags `.Score/score <op> 0.x` comparisons repo-wide outside a
justified allowlist (first run triaged 3 scale-local sites; clamp
guards excluded by pattern). The RRF-SCALE bug class is now CI-caught.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rrf-scale-002): live-calibrated suggest floor + CHANGELOG + close (Epics 5-6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): sprint plan — stable concept identity, two-PR delivery (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-churn-001): automated consolidation no longer skips LLM emergence (Epic A1)
dynamicEmergenceStep registers at phase 22, but RunConsolidation ran
hardcoded ranges (10,20) + (25,30) — phase 22 fell in the gap, so with
EMERGENCE_ENABLED=true the AUTOMATED path silently skipped LLM concept
emergence while the manual path (RunNodeCreationPipeline, 10–22 with an
emergence gate) ran it. RunConsolidation now delegates to
RunNodeCreationPipeline(cfg.EmergenceEnabled) — single range source; a
pin test fails if the step's phase ever leaves the range.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): stable theme identity — centroid match-or-create replaces the 5-minute churn (Epic A2)
ClusterConversations detached EVERY observation→theme edge, deleted
childless themes, and recreated all themes from scratch each ~5-min
cycle: new node_ids every run, evidence chains destroyed continuously,
recall flooded with stacks of near-identical concepts (observed live in
this session's own prompt headers).
New flow: cluster first → match each cluster to an EXISTING theme by
centroid cosine (HIDDEN_THEME_IDENTITY_SIM_THRESHOLD, default 0.90,
greedy with per-run claiming) → matched themes UPDATE in place
(props + theme-scoped member-edge rewire; node_id and all inbound
references survive) → unmatched clusters create as before → only themes
claimed by NO cluster are deleted. The global detach is gone.
ThemesUpdated added to the result. Tier 1: match/threshold/claimed/
best-of selection.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* chore(hidden-churn-001): remove the dead global-detach helper — the churn mechanism itself
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): PR-A verification + CHANGELOG (Epics A3-A4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): PR-B coverage retune — config ratio, density assignment, gauge + rule (Epic B1)
maxThemes was an inline ceil(n/10) equation → HIDDEN_THEME_TARGET_RATIO
(default preserves it). NOISE observations (previously dropped from the
hierarchy forever — the 94% coverage gap's mechanism) now density-assign
to their nearest theme when cosine ≥ HIDDEN_THEME_ASSIGN_SIM_THRESHOLD
(default 0.70; edges only, no new themes; below-floor stays unthemed
honestly). New per-space coverage gauge
mdemg_neo4j_conversation_coverage_ratio (collector Query 5) + evaluator
rule low_conversation_coverage (CONVERSATION_COVERAGE_ALERT_FLOOR 0.2,
6h ForDuration for convergence).
Audit bonus: caught WeightIntegrityRules querying metric_samples with
recorded_at — the column is `time`; the null-weight rule had been
silently erroring every evaluation since it shipped (Debug-only logging
— the SUPERVISOR-002 finding in action). Both rules fixed + a pin test
bans recorded_at against metric_samples.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): mdemg concepts repair + trace — grounding audit CLI (Epic B2)
repair: tombstones childless layer>=2 abstraction nodes (no inbound
ABSTRACTS_TO|GENERALIZES|GENERALIZES_TO — 10,395 live in mdemg-dev,
churn-era debris). Recoverable (is_archived=true + archived_reason),
batched, dry-run default, --limit for small-batch-first verification.
trace: per-node grounding audit — direct children, transitive per-layer
census, grounded/ungrounded verdict, sample path to L0.
Live data note: GENERALIZES alone over-counts (19,147) — ABSTRACTS_TO
is the hidden layer's actual child edge; pin test guards the predicate.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): surface themes_updated + noise_assigned in consolidate API + periodic log (Epic B3)
/v1/conversation/consolidate now reports themes_updated and
noise_assigned alongside themes_created. The periodic-consolidation
log condition also gains both — with stable theme identity (PR-A),
created is usually 0 on healthy cycles, which would have silenced the
success log entirely (the silent-success bug class).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-churn-001): live-smoke fixes — noise pool was structurally empty, clustering included archived debris, coverage gauge gated on min-obs
Three defects only the live run surfaced (Tier 3 forcing function):
1. KMeans never emits label -1, so the density-assignment hook received
an always-empty noise list; the min-samples/max-themes/nil-centroid
drops now feed their members into the noise pool instead of silently
excluding them from the hierarchy.
2. fetchClusterableConversationObservations had no is_archived filter —
it clustered 4,838 observations of which only 183 were live (MAINT-LIVE
tombstones), building themes on archived debris. Both fetch variants
now exclude archived. Live effect: 24 debris themes swept to 5 clean
ones; second cycle themes_updated=5/created=0 (stable identity on
real data).
3. Coverage gauge gated on CONVERSATION_COVERAGE_MIN_OBS (default 50,
DH-005 confidence-threshold pattern) — tiny scratch/test spaces
(2-13 observations) emitted 0.000 and would have alarmed forever
(born-firing alert hazard). Sentinel -1 skips emission.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): PR-B verification + CHANGELOG + CLAUDE.md — sprint complete (Epic B5)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(supervisor-002): sprint plan + background loop inventory (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): sliding-window restart budget + late registration (Epic 1)
The restart counter only ever incremented — a once-a-week transient
permanently killed a worker after 3 weeks. Budget is now a sliding
window (restarts older than the window are forgotten); permanent
failure requires >SUPERVISOR_MAX_RESTARTS within
SUPERVISOR_RESTART_WINDOW_MIN. New Go() registers+launches workers
after Start (the API server starts its loops late); nil return without
ctx cancellation now means intentional completion, not a restart.
Start() outlives dead workers so late workers stay supervised.
Config: SUPERVISOR_MAX_RESTARTS (3), SUPERVISOR_RESTART_WINDOW_MIN
(60), SUPERVISOR_BACKOFF_BASE_SEC (5).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): register the 12 unsupervised background loops (Epic 2)
Every scheduler/loop goroutine now runs under the goroutine supervisor
(panic recovery + sliding-window restart budget) instead of as a bare
go func() whose panic silently killed the subsystem forever:
- api.Server (6): periodic-consolidation, context-cooler,
space-prune-scheduler, weekly-gap-interviews, scheduled-sync,
rsic-macro-cron — via injected SetSupervisor(sup.Go) + goSupervised
helper (bgWg brackets each run; stop channels remain the graceful
path and return nil = no restart)
- …
* fix(nosilent-001): sync embedded launchd server plist with source (CI)
CI "Verify embedded launchd templates match source" diffs packaging/launchd/*
against internal/cli/launchd_templates/* (the embed.FS copy mdemg service
install uses). The PATH addition landed only in the source copy; sync the
embedded copy so they match byte-for-byte.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(roadmap): add jiminy-governance skill build-out (Workstream C, Action 7)
Records the jiminy-governance Claude Code skill on the active forward roadmap
(SPRINT_ROADMAP_POST_FT_LORA.md, cross-cutting governance) + brings the source
spec into the repo (docs/development/jiminy-governance-skill/SKILL.md, out of
~/Downloads). The skill makes Jiminy the deterministic source of context +
governance over J17, enforced by the PreToolUse hook — a routing/handshake shim,
not a rulebook. Build-out scope notes the wire-up placeholders that must be
resolved against the real instance (Jiminy MCP/endpoint, PreToolUse hook, J17
ack/RetireCode/GUIDANCE_OUTCOME calls). Aligns with the now-live guidance loop
(RRF-SCALE-001 / JIMINY-OUTCOME-001 / GUIDANCE-SYNTH-001).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-governance): resolve skill wire-up against the real instance
Step 1 of the jiminy-governance build-out (roadmap Workstream C Action 7).
Resolved all five placeholders from the running MDEMG instance, verified live:
- Jiminy query: MCP `mdemg mcp` (stdio) → jiminy_guide/validate_changes; HTTP
/v1/jiminy/guide (returns guidance_id) + /bootstrap glossary (j17v1) + /latest.
- PreToolUse: pre-bash-check.py (Bash, fail-closed) + pre-write-check.py
(Write/Edit → /v1/jiminy/classify, /strict-only, fail-open).
- SessionID: claude-core convention.
- Comprehension ack: /v1/jiminy/protocol/feedback (verified ingested 1).
- GUIDANCE_OUTCOME: /v1/jiminy/feedback {guidance_id,…}.
- RetireCode: internal-only by design (RSIC/APE protocol-evolution) — no
agent-facing call; agent must never self-retire a constraint.
Also surfaced the two real integration gaps the build-out must close (the work,
not the prose): (a) the MDEMG MCP server is NOT registered (.mcp.json absent —
context is pushed by prompt-context.sh, not pulled by the agent); (b) PreToolUse
enforcement is /strict-gated + fail-open, so not deterministic-by-default.
Roadmap Action 7 updated to reflect step 1 done + the two gaps as next steps.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(jiminy-governance): ship the J17 governance skill + register MDEMG MCP
Build-out of roadmap Workstream C Action 7 (steps 2-4), per the resolved
wire-up. Closes the two integration gaps found while resolving:
- gap 1 (MCP not registered): .mcp.json registers `mdemg mcp` (stdio).
Live-probed: 20 tools incl. jiminy_guide + validate_changes — the agent can
now PULL guidance, not only receive the hook push.
- gap 2 (enforcement /strict-gated + fail-open): policy set — Write/Edit J17
gate kept fail-open (hard server dependency on every edit is too brittle);
the skill's handshake auto-enables /strict so the gate is active per session.
Bash gate already fail-closed (demonstrated live when a test payload with a
destructive force-push string was blocked by pre-bash-check.py).
Skill authored at the canonical .claude/skills/jiminy-governance/SKILL.md
(frontmatter valid; concrete wire-up inline — MCP tools + HTTP endpoints +
SessionID claude-core + the 5-step handshake). Kept a routing/handshake shim,
not a rulebook (rules stay in the graph).
Live Tier-3 PASSED: full handshake identify->request->comprehend->act->report
ran against the real instance; GUIDANCE_OUTCOME edges 906->909 (one per coded
constraint). Verification: docs/development/jiminy-governance-skill/.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(jiminy-governance): commit install-ready skill + install README
.claude/ is gitignored (per-developer local config), so the installed skill at
.claude/skills/jiminy-governance/SKILL.md is local-only by convention. Commit
the reproducible, install-ready copy (jiminy-governance.skill.md) + a README
with the one-line install (cp into .claude/skills/) so the skill propagates via
the repo. The MCP server it uses is registered in the tracked repo-root
.mcp.json.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): per-conversation SessionID across hooks + skill
Hooks and the jiminy-governance skill hardcoded session_id="claude-core", so
trust/escalation/observations from ALL Claude Code conversations collapsed into
one shared MDEMG session. Claude Code already passes a per-conversation
session_id on stdin to every hook — the implementation just never used it.
Resolver precedence (single rule everywhere): MDEMG_SESSION_ID env (stable-
identity escape hatch) > Claude Code stdin session_id (per-conversation default,
race-free per hook) > ~/.mdemg/.claude-session (published by SessionStart +
UserPromptSubmit for the agent / stdin-less contexts) > claude-core (fallback).
Realizes J17's intended per-(session,constraint) isolation.
Tracked templates updated (internal/cli/hook_templates/): session-start.sh,
prompt-context.sh, post-tool-observe.py, pre-compact.sh + Windows .ps1 variants
— every hardcoded claude-core in MDEMG calls replaced with the resolved id;
session-start/prompt-context publish the session file. Skill SessionID
instruction + handshake steps now resolve <SessionID> instead of claude-core.
Live-verified: a hook resolved a stdin session_id and published the session
file; post-tool-observe wrote an observation keyed to the per-conversation id in
Neo4j (not claude-core). bash -n / py_compile clean; go build + hooks test pass.
Note: pre-write-check.py is local-only (no tracked installer template); its fix
is applied on this machine but won't propagate via `mdemg hooks install` until
it's added to the tracked hooks (follow-up). .claude/* is gitignored, so the
live hook copies aren't committed — the tracked templates propagate via install.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): per-conversation SessionID — installed hook copies
The .claude/hooks/ installed copies are tracked (committed before the .claude/*
ignore), so apply the same SessionID resolver here as in the embedded templates
(prior commit). session-start.sh, prompt-context.sh, post-tool-observe.py,
pre-compact.sh now resolve MDEMG_SESSION_ID env > stdin session_id >
~/.mdemg/.claude-session > claude-core. (pre-write-check.py is untracked /
local-only — its fix lives on this machine only; tracking it is the follow-up.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): add pre-write-check.py to the tracked installer
The /strict J17 Write/Edit classify gate (pre-write-check.py) was local-only —
no tracked template, so `mdemg hooks install` never installed it and its
SessionID fix wouldn't propagate. Add it as a tracked template
(internal/cli/hook_templates/pre-write-check.py, space_id → {{SPACE_ID}},
runtime URL discovery — no {{MDEMG_URL}} placeholder per the template
convention) and register it in claudeHookFiles() as
{PreToolUse, 8s, "Write|Edit"}. hooks_test expectations updated 5→6.
go build + hooks test + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* release: cut v0.10.1
Promote CHANGELOG [Unreleased] → [0.10.1] - 2026-06-08. Adds the
jiminy-governance skill + MDEMG MCP registration and the per-conversation
SessionID work to the release notes (alongside the already-logged EVENTGRAPH-002,
EVENTGRAPH-CLI-001, NOSILENT-001, the docker-PATH fix, and the TSDB schema
22→23→24 bumps). Fresh empty [Unreleased]; comparison link refs updated +
backfilled through v0.10.1.
The v0.10.1 git tag (triggers release.yml artifact build) + homebrew formula
bump are the operator release-cut step on main, post-merge.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs: governance system doc + bring cli/api references current
(1) New docs/features/jiminy-governance.md — detailed how-it-works + full file
inventory for the J17 agent-governance system (skill, hooks, SessionID, MCP,
enforcement, runtime state files, install/verify steps).
(2) docs/user/api-reference.md — add the Event Graph Federation section
(POST /v1/eventgraph/{reinforcement,guidance-outcome}-neighborhood) + TOC entry;
these were the only two missing endpoints (audited the full route table).
(3) docs/user/cli-reference.md — add mdemg eventgraph {reinforcement,guidance-
outcome}-neighborhood, model run, watchdog status, migrate context-fingerprint,
data curate/validate/clean; fix the stale `model pull --adapter` description
(MODEL-DIST-002 shipped — no longer "deferred/errors"); update the Command Tree
Summary to match.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* chore(submodule): bump homebrew-mdemg to v0.10.1 formula
Point the parent at the manually-published v0.10.1 homebrew formula
(reh3376/homebrew-mdemg@10c1843). The release artifacts published cleanly;
the formula update was manual because the CI HOMEBREW_TAP_TOKEN expired
(follow-up: rotate the secret so future releases auto-publish).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): sprint plan — reinforcement coverage for other Hebbian paths
Wire the 3 remaining Hebbian write paths (CoactivateSession,
ApplySymbolCoactivation, ApplyNegativeFeedback weaken-only) into the existing
reinforcement_events writer via distinct trigger_path values. No schema/writer/
wiring change (V0022 already has trigger_path + signed delta_weight +
created_new_edge; writer already injected). Contradict path deferred (CONTRADICTS
edges aren't traversed by the federation walk). RETURN-only Cypher edits; Tier-2
asserts unchanged weights. 5 epics, 3 tiers, live Tier-3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire CoactivateSession into reinforcement_events (Epic 1)
CoactivateSession (session-internal conversation-observation co-activation, full
Hebbian formula) now emits per-pair reinforcement events with
trigger_path=coactivate_session. RETURN-only Cypher change: replaced the
discarded `count(*)` with the standard 17-field per-pair RETURN (one row per
forward edge; reverse is a mirror). Weight SET untouched → update behavior
provably unchanged. Mirrors the proven ApplyCoactivation record loop; writer
already injected. EXPLAIN-validated (compiles, all RETURN vars in scope, no
writes); build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplySymbolCoactivation into reinforcement_events (Epic 2)
SymbolNode-pair co-activation now emits trigger_path=apply_symbol_coactivation
rows. Split the weight update out of the ON MATCH clause into a separate SET so
the pre-update weight (w) can be captured for prev/new/delta — createdNew
(evidence_count=1) keeps a fresh edge at 0.1 and increments matches by +0.05,
preserving the original ON-clause weight behavior exactly. eta/surprise/
activation/path_sim are NULL (N/A for symbols); roles default 'symbol_node'.
EXPLAIN-validated; build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplyNegativeFeedback weaken path → reinforcement_events (Epic 3)
The weaken path (existing CO_ACTIVATED_WITH edge weakened by negWeight) now emits
trigger_path=apply_negative_feedback rows with a NEGATIVE delta_weight and
created_new_edge=false. The FOREACH writes (weaken SET + contradict MERGE) are
untouched; only the RETURN changed from aggregated `action,count(*)` to per-pair
rows — the Go side counts rows (sum = grouped count, NegativeFeedbackResult
preserved) and emits reinforcement events for weaken rows only. prevWeight is
captured before the FOREACH SET. Contradict path deliberately not emitted
(CONTRADICTS isn't traversed by the federation walk; deferred). EXPLAIN-validated;
build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(conversation): inject learning service so CoactivateSession actually runs
Discovered via EVENTGRAPH-003 live smoke: session co-activation
(CO_ACTIVATED_WITH edges between same-session conversation observations) had
NEVER fired — 0 such edges ever in mdemg-dev across 5495 conversation
observations. Root cause: conversation.NewServiceWithConfig sets
learningService=nil ("set via SetLearningService to avoid circular dependency"),
but SetLearningService had NO caller, so the `if s.learningService != nil` guard
in Observe() always skipped CoactivateSession. The function + its Cypher were
correct (verified by running it directly: 3 pairs, proper Hebbian weights) —
it was just never invoked.
Fix: convSvc.SetLearningService(lea) at construction. Live-verified: 3 distinct
observations in a session now create 6 CO_ACTIVATED_WITH edges + emit
coactivate_session reinforcement events. Standalone fix-commit per the
live-smoke precedent (surprise bugs don't get rolled into the sprint commit).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): Tier 3 verification + feature doc + CHANGELOG + close (Epic 4)
All four trigger_paths live-verified (apply_coactivation 50, apply_symbol_
coactivation 1000, apply_negative_feedback 1 negative-delta, coactivate_session
4 after the dormancy fix); federation CLI surfaces them. Feature doc updated to
all-four-paths + the trigger_path table; CHANGELOG Added (EVENTGRAPH-003) + Fixed
(CoactivateSession never-invoked); CLAUDE.md note + correction (CoactivateSession
was dead, not "writing via sidecar paths"); verification.md + post.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-004): sprint plan + CoactivateSession post-revival health review (Epic 0)
EVENTGRAPH-004 federates the last unfederated Hebbian write — the
ApplyNegativeFeedback contradict action — into reinforcement_events
(trigger_path=apply_negative_feedback_contradict). Data-decided scope:
reuse the existing V0022 sink (zero CONTRADICTS edges exist anywhere;
no producer calls /v1/learning/negative-feedback — instrument before
the producer arrives, the inverse of the dormancy pattern).
Also closes the EVENTGRAPH-003 follow-up: 30h post-fix health review of
the revived CoactivateSession path — no tuning needed, textbook session
cliques, pre-fix orphans stay as historical record (operator decision).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(eventgraph-004): wire ApplyNegativeFeedback contradict path → reinforcement_events (Epic 1)
The contradict action (no co-activation edge → MERGE CONTRADICTS) was the
last unfederated Hebbian write. The CONTRADICTS MERGE lived inside a
FOREACH, where the edge variable is invisible to RETURN — so the original
single statement is split into two statements in the SAME ExecuteWrite
transaction: (a) weaken (EVENTGRAPH-003 telemetry, RETURN unchanged) and
(b) contradict with a per-pair RETURN. Classification is identical: weaken
never deletes edges, so contradict's NOT EXISTS sees the same edge set the
original OPTIONAL MATCH did.
Contradict rows land with trigger_path=apply_negative_feedback_contradict.
created_new_edge detected via `c.updated_at IS NULL` (ON MATCH always sets
it; ON CREATE never does — invariant pinned by comment). delta_weight is
the CONTRADICTS edge's OWN weight delta (+negWeight on create, 0 on
re-match); negative-feedback semantics are carried by trigger_path, not
the sign.
Both statements EXPLAIN-validated against live Neo4j. Tier 1: 2 new parser
tests (create/re-match branches); learning suite green; lint clean.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): Tier 3 live verification — contradict create/re-match + weaken unchanged (Epic 2)
Live against the restarted Epic-1 binary: contradict create row
(+0.15, created_new_edge=true), re-match row (delta=0, evidence=2),
weaken row byte-equivalent to pre-split behavior (negative delta,
floor at 0). Federation CLI surfaces the new trigger_path with no
read-side change. UATS learning_negative_feedback 5/5 PASS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): feature doc + CHANGELOG + UATS pin + close (Epic 3)
Feature doc: 5-path trigger_path table + delta-semantics consumer
warning (contradict delta is the CONTRADICTS edge's own weight delta —
semantics live in trigger_path, not the sign). UATS spec extended:
zero-count equals assertions on nonexistent nodes (hash refreshed,
5/5 live). CLAUDE.md architecture note + producer-gap disclosure.
Sprint close in post.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci: auto-sync dev branch with main after each squash-merged PR
Squash merges never advance the dev branch's merge-base, so every
sprint touching CHANGELOG.md/CLAUDE.md hit CONFLICTING on its next PR
(first bitten: PR #419). New sync-dev-after-merge.yml merges main back
into the source *_dev* branch after each merged PR; the GITHUB_TOKEN
push triggers no other workflows, so it can never spawn an empty
auto-PR (the PR #420 failure mode). Conflicts fail loudly for manual
resolution; workflow_dispatch enables manual runs/live testing.
auto-pr.yml additionally skips PR creation when branch content is
identical to main — guards MANUAL sync pushes, verified against the
live repo state (current dev01 ≡ main → empty=true → skip).
actionlint clean (untrusted refs passed via env, not inline).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(roadmap): Q3 2026 vision-derived roadmap from 26-agent codebase deep-dive
Full-codebase review vs MDEMG's purpose (cognitive substrate / connection
layer): 19 map agents (3 vision + 16 subsystem), 3 cross-cutting assessors,
synthesizer + adversarial completeness critic (19 revisions applied).
Verdict: server-side substrate is mature, but the system is not currently
functioning as the assistant's internal dialogue — the per-prompt delivery
channel silently no-ops (hook reads .user_prompt, Claude Code sends
.prompt), 100% of GENERALIZES edges have NULL weight (22,170/22,170,
live-verified), scheduled decay/prune has been a permanent dry-run, RSIC
validates 16/17 actions vacuously, and supervision covers 3 of ~14
background loops. Every defect is the same disease: wired-looking seams
with no caller, wrong contract, or no reader.
4 phases ≈ 75 days committed: (1) reconnect the loop ends, (2) close the
learning loops, (3) survivability + class-ending forcing functions,
(4) FT frontier + release hygiene. Top-10 ranked; deferrals explicit.
Orchestrator spot-verification annex included (5 claims re-verified live).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): sprint plan — fix hook stdin contract, reconnect per-prompt channel (Epic 0)
Roadmap Q3 Phase 1 rank #1. Audit of all 6 hooks vs the actual Claude
Code stdin schemas: prompt-context.sh reads .user_prompt (CC sends
.prompt) → channel exits silently on every prompt; post-tool-observe.py
reads tool_output (CC sends tool_response) → false "Build/test
succeeded" observations with empty output; guidance wrongly coupled to
RESULT_COUNT>0; minor pre-compact transcript jq. session-start /
pre-bash-check / pre-write-check verified correct.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): prompt-context.sh reads .prompt — revive the per-prompt channel (Epic 1)
Claude Code's UserPromptSubmit stdin field is `prompt`; the hook read
`.user_prompt`, which is always empty → exit 0 → per-prompt CMS recall,
Jiminy guidance, /strict reformulation, the warm trigger, and the
retrieve-time Hebbian reinforcement have NEVER fired in any session.
Now reads `.prompt // .user_prompt` (legacy fallback kept).
Also decouples guidance from recall: the RESULT_COUNT=0 branch no longer
exits — it printed its notice then skipped guidance + warm + retrieval
reinforcement, coupling independent deliveries.
Both copies (live + installer template). Tier 1 simulated stdin: real
.prompt payload → first-ever guidance delivery (J17 T1 bootstrap + DICT,
5363 guidance bytes vs 0 forever); legacy fallback works; short/empty/
malformed payloads exit silently (fail-open preserved).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): post-tool-observe reads tool_response — end blind "succeeded" observations (Epic 2)
Claude Code's PostToolUse stdin field is `tool_response` (string or
object); the hook read `tool_output`, which is always absent → output_str
empty → error indicators never matched → every go build/go test/pytest
Bash call was recorded as "Build/test succeeded" sight-unseen, and real
errors were never observed.
Now reads tool_response (fallback tool_output) via _response_text(),
normalizing string|dict|list (stdout/stderr join). Success classification
requires NON-EMPTY clean output — a silent success records nothing rather
than fabricating; failures land as error observations with real stderr.
Both copies (template regenerated from fixed live, {{SPACE_ID}}
placeholder preserved, verified identical modulo placeholder). Tier 1
against real CMS: failing build → error obs with stderr; passing →
progress; empty → no record.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): pre-compact transcript extraction reads the real line shape (Epic 3)
Transcript lines are {type, message:{content:[{type, text|name, ...}]}};
the old top-level `.content` read always yielded empty, so pre-compaction
snapshots never carried recent-activity context. New jq walks
.message.content[] extracting .text/.name. Verified against this
session's real transcript (old: nothing; new: real activity). Both
copies, placeholders preserved.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): Tier 3 verification + CHANGELOG + CLAUDE.md contract pin + close (Epics 4-5)
Live in the real session: first-ever guidance delivery (J17 T1 bootstrap
+ DICT, 5363 bytes vs 0 forever); real failing build → error observation
with actual compiler output in CMS. PostToolUse success-only firing
documented as a limitation. Hook stdin contract pinned in CLAUDE.md.
Drift + clique-semantics findings logged for HOOKSYNC-001 / Phase 2.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): sprint plan — drift-proof + self-monitoring hook channel (Epic 0)
Roadmap Q3 Phase 1 rank #2. Investigation grounded all five findings:
template→live drift severed alert delivery (50-entry file actively
rotating today, never shown); no Cleared lifecycle (nothing sets the
field; no /v1/alert* endpoints); no absence detection for the channel
that just had a months-long silent outage; compose publishes 9999 on
0.0.0.0; neural sidecar binds 0.0.0.0:8101 via a 39-day-old process
serving pre-J17-fix code. 8 epics: reconcile, CI parity gate, clear
lifecycle, hook_events absence rule (reuses V0024 via jobhealth),
hooks doctor, PORT-TRUTH rider, Tier 3, docs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): reconcile bidirectional hook drift — alert delivery restored to live (Epic 1)
Live hooks adopted from templates (SPACE_ID substituted): restores the
alert-display blocks (all-pending per prompt; critical/high + degraded
healthz at session start) that the live copies lacked — the NOSILENT
last mile. Reverse drift caught during reconcile: the live hook's T1/T2
bootstrap-detection block (MAX_TIER → /v1/jiminy/bootstrap → ACTIVE
CONSTRAINTS header) never existed in the template and was nearly lost —
now single-sourced in the template and regenerated into live.
Live-verified: one prompt now renders alerts (50 pending incl. live
CRITICALs) + recall + J17:INIT bootstrap + guidance + synergy footer,
coexisting. All 6 hooks byte-identical to templates modulo {{SPACE_ID}}.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci(hooksync-001): hook-template parity gate — live hooks must match templates (Epic 2)
Mirrors the compose/launchd parity pattern: every *.sh/*.py template
must byte-match .claude/hooks/ modulo the {{SPACE_ID}} placeholder.
Proven locally: passes clean, fails (with a bounded diff dump) on
deliberate drift. Ends the bidirectional-drift class that severed
alert delivery and nearly lost the T1 bootstrap block.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): alert Cleared lifecycle — display once, then delivered (Epic 3)
Alert.Cleared existed but nothing ever set it: once hooks rendered the
file, the same entries would re-render every prompt forever. New:
FileBackend.Clear (ids and/or all_before cutoff, idempotent, under the
existing lock) → Dispatcher.ClearAlerts → POST /v1/alerts/clear. Hooks
now clear exactly what they displayed (fire-and-forget, fail-open);
cleared = delivered-to-operator, not resolved — persisting conditions
re-fire via the evaluator. Alert IDs now CUIDv2 per the identifier
standard (was UnixNano; old ids remain valid opaque strings).
Live-verified lifecycle: prompt 1 → "50 pending, showing 10" + 10
cleared; prompt 2 → "40 pending, showing 10" (next batch, no re-render)
→ 20 cleared. Tier 1: Clear by-id/by-time/idempotent/no-backend. UATS
alerts_clear 3/3 live (runner falsy-body inheritance discovered:
variant bodies must be non-empty objects).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): hook-channel absence detection — the channel now self-reports outages (Epic 4)
POST /v1/hooks/event records heartbeats into V0024 scheduled_job_events
via the jobhealth policy point (job_name hook:<name>; no new sink).
Two independent heartbeats: prompt-context fires per delivery (the
monitored channel); post-tool-observe fires throttled (HOOK_HEARTBEAT_
COOLDOWN_SEC, default 300 — proves sessions ACTIVE). Evaluator rule
hook_channel_silent (distinct service per the NOSILENT cooldown rule):
sessions active + zero prompt-context fires in HOOK_SILENT_LOOKBACK_
HOURS (24) → high alert. This is the "job never ran" guarantee applied
to the channel whose months-long outage HOOKWIRE-001 found only by
manual audit — the next contract drift self-reports.
Config: HOOK_HEALTH_ALERT_ENABLED (true), HOOK_SILENT_LOOKBACK_HOURS
(24), HOOK_ACTIVITY_MIN_EVENTS (5). Live-verified: real hook fires land
rows (session metadata, latency); throttle holds; rule SQL positive +
negative branches proven against the real table; UATS hooks_event 3/3.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): mdemg hooks doctor — one-shot hook-channel triage (Epic 5)
11 checks: per-hook template parity (the CI gate's local twin),
settings registration, server healthz, a stdin-contract self-test
piping a real-shape UserPromptSubmit payload through the installed
hook (asserts the always-present synergy footer), alert-file state
(pending/total), and the last hook:prompt-context heartbeat age from
scheduled_job_events (SKIP when TSDB unreachable). Table or --json;
non-zero exit on any FAIL.
Live: 11/11 PASS on this machine ("last fire 5s ago" — fed by the
doctor's own self-test); correctly fails (exit 1) on deliberate drift.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): PORT-TRUTH — loopback bind defaults + sidecar zombie replaced (Epic 6)
Compose published the API on 0.0.0.0 (unauthenticated admin/destructive
routes exposed off-host): now "${MDEMG_BIND_ADDR:-127.0.0.1}:${MDEMG_PORT}
:9999" — wide bind is an explicit opt-in (both compose copies, CI-synced).
Neural sidecar bound 0.0.0.0:8101 via config.py default AND the plist
arg: both now 127.0.0.1 (both plist copies, CI-synced; SIDECAR_HOST env
overrides). Operational: the 39-day-old sidecar process (started
2026-05-02, serving pre-J17-fix code) replaced — fresh process verified
on 127.0.0.1:8101, both models loaded, health 200.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): Tier 3 verification + feature doc + CHANGELOG + close (Epics 7-8)
Live-verified across the sprint: alert backlog drained 50→2 on real
prompts (display-then-clear); evaluator rules 15→16 (hook_channel_silent
loaded); doctor 11/11 + correct failure mode; sidecar fresh on
127.0.0.1:8101 (NLI 234ms). Feature doc docs/features/hook-channel-
health.md (config table incl. MDEMG_BIND_ADDR + SIDECAR_HOST). Findings:
packaging plists are templates (raw copy → launchd exit 78; service
install is canonical); UATS falsy-variant-body inheritance pinned.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(uats): jiminy_guide_sanitized timeout 30s → 90s — stale vs synthesis latency
Caught in the HOOKSYNC-001 full-suite regression: the synchronous
/v1/jiminy/guide includes local-model synthesis (~43s observed quiet,
~50s typical per GUIDANCE-SYNTH-001) — the spec's 30s timeout has been
silently erroring since synthesis latency grew. Aligned with the
JIMINY_WARM_COMPUTE_TIMEOUT_MS budget (90s); hash refreshed; passes
live. Pre-existing — not a HOOKSYNC regression (Guide path untouched).
The other 3 suite errors were load-induced flakes (pass individually):
suite-vs-llama-server slot contention, noted for UXTS-CI-001.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ci): track .claude/hooks/pre-write-check.py so hook-parity check passes
Root cause: the new 'Verify live hooks match hook templates' CI step
(HOOKSYNC-001) diffs every internal/cli/hook_templates/*.{sh,py} against
.claude/hooks/<name>, but the .gitignore allowlist only un-ignored the
5 original hooks. pre-write-check.py gained a template in this sprint
while its live counterpart stayed gitignored, so CI checked out a tree
without it and failed with 'MISSING live hook:
.claude/hooks/pre-write-check.py'.
Fix: add '!.claude/hooks/pre-write-check.py' to the allowlist and commit
the live hook (already byte-identical to its template modulo SPACE_ID),
preserving the full parity guarantee instead of weakening the CI step.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): sprint plan — real weights on the abstraction hierarchy (Epic 0)
Roadmap Q3 Phase 1 rank #3. Live investigation: point.distance() returns
NULL on embedding lists (proven: NULL where vector.similarity.cosine
returns 0.627 on the same pair); 3 creation sites affected incl. an
ABSTRACTS_TO site the audit missed. Scale worse than audited and
growing: 28,332/28,332 GENERALIZES + 36,110/37,996 ABSTRACTS_TO = 64,442
NULL-weight abstraction edges. Neo4j cosine returns [0,1] directly —
drop-in. Plan: fix sites (+ CUIDv2 edge ids), LIMIT-5-then-batched
backfill, null-weight gauge + alert rule via the existing graph-stats →
metric_samples path, UVTS-quick regression guard.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-weight-001): abstraction-edge weights — vector.similarity.cosine replaces point.distance (Epic 1)
point.distance() is a spatial-Point function: on embedding lists it
returns NULL, so every weight at the 3 abstraction-edge creation sites
was never set (100% of GENERALIZES + 95% of ABSTRACTS_TO weightless;
the CASE guards passed on good embeddings, then the THEN expr evaluated
NULL — edges with good embeddings got nothing while embedding-less ones
got the 0.5 fallback). vector.similarity.cosine returns [0,1] directly
(live-verified: identical=1.0, orthogonal=0.5, opposite=0.0). Site 1
(theme GENERALIZES) gains the null-guard it never had.
Also: edge_id randomUUID() → CUIDv2 per the identifier standard, minted
Go-side via memberEdgePairs (Cypher can't generate CUIDv2) and zipped
with member ids for UNWIND. All 3 statements EXPLAIN-validated live.
Tier 1: pair-builder tests (uniqueness, CUID format, empty input).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): mdemg graph backfill-weights — heal 56k NULL abstraction weights (Epic 2)
Standalone subcommand (deliberately NOT folded into `graph repair`,
whose orphan sweep would delete the pre-fix orphan observations the
operator chose to keep). Weight = vector.similarity.cosine(endpoint
embeddings) when both exist, else 0.5 (the creation sites' fallback);
similarity_score set alongside; idempotent (pure function of
embeddings); batched (default 1000/txn) with --limit for trials.
Executed per the small-batch-first rule: dry-run count → LIMIT-5 live
trial → hand-verified (stored ≡ independently recomputed to 6dp) →
distribution preview over 2000 (min 0.704, mean 0.96; the ~50% near-1.0
mass is single-member-cluster degeneracy — centroid ≡ member embedding,
HIDDEN-CHURN-001 territory, faithfully encoded) → full runs. Mid-run
the count GREW: the running server predated Epic 1 and kept minting
NULL edges — restarted on the fixed binary, swept stragglers, then
whk-wms (8,755) + linear (199). Final: 0 NULL / 57,395 edges globally.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): null-weight gauge + regression alert rule (Epic 3)
Query 4 in the graph-stats collector counts NULL-weight GENERALIZES/
ABSTRACTS_TO edges per space → new gauge
mdemg_neo4j_graph_null_weight_edges → metric_samples → evaluator rule
null_weight_abstraction_edges (service graph-weight-integrity, distinct
per the cooldown rule; NULL_WEIGHT_EDGE_ALERT_THRESHOLD default 100,
ForDuration 10m). Steady state post-backfill is 0; sustained
reappearance = the point.distance bug class regressed at a creation
site — it self-reports instead of waiting for the next audit.
Live: evaluator rules 16→17; gauge rows persisting at value 0 across
all spaces.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ingest): config-driven consolidation timeout — was sharing the 300s batch budget
Caught live during the HIDDEN-WEIGHT-001 corpus reingest: the post-ingest
/v1/memory/consolidate call used the shared batch-ingest client
(--timeout, 300s); consolidating a ~10k-node space exceeds that, so the
client reported failure while the server completed the work — the
GUIDANCE-SYNTH-001 bug class (long graph/LLM work needs its own budget).
New --consolidate-timeout flag / INGEST_CONSOLIDATE_TIMEOUT_SEC env
(default 1800s) with a dedicated client. Live-verified: "running
consolidation timeout_sec=1800" → complete.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): Tier 3 verification + corpus restoration + UVTS harness audit + close (Epics 4-5)
Tier 3: real consolidation minted edges with varied cosine weights
(0.83-0.94) + CUIDv2 ids; at-scale via the corpus reingest (9,500 edges,
0 NULL, mean 0.923); gauge holds 0; evaluator rules 16→17.
UVTS harness: corpus space lnl-demo-whk had been deleted with zero trace
(no UVTS run since 2026-05-04 measured anything real); restored by
operator-directed full reingest. A fresh baseline NUMBER remains blocked
by further live-found harness rot — grader/persist breakage, expected-
path format drift, vector post-filter dilution (service.go:1137 global
top-K then space filter) amplified by the duplicate whk-wms space —
complete defect inventory handed to UXTS-CI-001. Retrieval ranking on
the restored corpus verified correct (expected files at ranks 1-4).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(maint-live-001): sprint plan — scheduled maintenance actually runs (Epic 0)
Roadmap Q3 Phase 1 rank #4. Weekly decay+prune has never executed
(--dry-run defaults true; plist passes no override) while reporting
success — NOSILENT's blind spot. Tonight's Memory Bloat alerts (79k+
nodes) are the accumulated backlog. Safety verified in code before
planning: nodes are tombstoned (never deleted) with abstraction-chain/
degree/recency protections; edge deletion is the designed near-zero-
weight lifecycle, meaningful now that HIDDEN-WEIGHT made weights real.
Plan: live-by-default plist (+installed refresh), dry_run in job-event
metadata (no schema change — disclosed), maintenance_no_live_run
evaluator rule, darwin upgrade refreshes plists/hooks, first-ever live
run with preview-first protocol.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(maint-live-001): scheduled maintenance runs live — plist passes --dry-run=false (Epic 1)
The weekly LaunchAgent ran `mdemg maintenance` with no dry-run override;
the CLI defaults --dry-run=true, so every scheduled cycle previewed and
reported success — decay+prune NEVER executed (the 79k-node Memory
Bloat backlog). Both plist copies now pass --dry-run=false (the CLI
default stays true for safe manual previews — the SCHEDULE is what must
not silently no-op); installed plist refreshed + agent reloaded.
reportScheduledJobMeta threads job metadata into V0024; maintenance
records dry_run so the only-ever-dry-runs pattern is queryable
(metadata JSONB — no schema change, disclosed in the plan).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): maintenance_no_live_run evaluator rule (Epic 2)
Fires when maintenance rows exist in MAINT_LIVE_LOOKBACK_DAYS (default
8) but none ran live (success + metadata dry_run=false) — the only-
ever-dry-runs pattern self-reports instead of hiding inside "the job
ran". Distinct service maintenance-liveness per the cooldown rule.
Config: MAINT_LIVE_ALERT_ENABLED (true), MAINT_LIVE_LOOKBACK_DAYS (8).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): mdemg upgrade refreshes installed LaunchAgents + hooks (darwin) (Epic 3)
Plist/hook fixes shipped in releases but never reached installed
machines — the maintenance dry-run override would have sat unreachable
next to upgraded binaries forever. Upgrade now re-renders ALREADY-
INSTALLED mdemg LaunchAgents from the new binary's embedded templates
(refresh-only — never installs new services) + re-syncs mdemg-managed
Claude hooks in the current project (marker-checked). Substitution
logic single-sourced into renderLaunchdTemplate (Install + Refresh —
the drift class that exit-78'd the sidecar during HOOKSYNC live smoke).
Mirrors the existing Linux systemd-unit refresh.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): context-dependent orphan policy — --exclude-role-types (Epic 4a)
Orphan disposition is context-dependent (operator, 2026-06-11): a
uniform degree/age rule conflates governance constraints, conversation
history, test junk, and hierarchy debris. New --exclude-role-types on
prune + maintenance (env PRUNE_EXCLUDE_ROLE_TYPES) makes the policy
expressible; the scheduled plist ships
constraint,conversation_observation excluded per the operator's call
(constraints are load-bearing governance rules at any degree;
conversation observations differ by SESSION which the knob can't
express yet). Aged hierarchy debris stays eligible — that's the
lifecycle working. Candidate census that drove the decision: 5,388
conv-obs (9 eligible tonight under the 90d shield), 11 constraints,
238 hierarchy nodes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(prune): orphan sweeps use implicit transactions for batched deletes
Caught by the FIRST-EVER live maintenance run (MAINT-LIVE-001 Tier 3):
Neo4j raises TransactionStartFailed when a batched CALL-IN-TRANSACTIONS
statement executes inside an explicit transaction. Both orphan sweeps
(SymbolNode + Observation) ran their batched delete via ExecuteWrite;
the dry-run path never executes the deleting statement, so no preview
or unit test could surface it — only live execution. Switched to
session.Run (implicit tx). The failure ALSO proved the NOSILENT chain
live: the run fired "Scheduled job failed: maintenance" before exiting.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(maint-live-001): first live run verification + feature doc + CHANGELOG + close (Epics 4b-5)
First live maintenance in MDEMG history: 20,236 orphan SymbolNodes
deleted; all 5,010 tombstone candidates protected (recency + operator
exclusions); liveness rule born-firing → silenced by the real run; the
3-row job-event story (preview/true → failure/false alerted →
success/false) proves the dry_run plumbing through every path.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs: CLAUDE.md architecture note for MAINT-LIVE-001
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(embed-wire-001): sprint plan — embedder wiring + ingest exec resolution (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(embed-wire-001): breaker + recorder reach the real embedder through the wrapper chain (Epic 1)
The embedding circuit breaker was NEVER wired in any default deployment:
embeddings.New returns *CachedEmbedder when EMBEDDING_CACHE_ENABLED=true
(the default), so the server's emb.(*embeddings.OpenAI)/(*Ollama)
assertions on the OUTERMOST value failed silently (no else branch). The
recorder assertion had the inverse fragility (cache off → training-data
recording silently dies).
New: Unwrap() chain (CachedEmbedder joins RateLimitedEmbedder's existing
one) + embeddings.Base() / FindCached() interface-driven walkers — any
future wrapper joins by adding Unwrap(), no type lists. Wiring now walks
to the base for the breaker and to the cache layer for the recorder,
with LOUD warns when nothing matches. Tier 1 pins the production shape
(ratelimit(cache(provider))) plus cache-off and bare chains.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ingest-exec-001): server-triggered ingest resolves the mdemg binary — was hardcoded ./bin/mdemg (Epic 2)
Both ingest-job exec sites ran a relative "./bin/mdemg": broken in
Docker (the documented-primary deployment — binary at /usr/local/bin,
no repo checkout) and any CWD other than the repo root. New
resolveMdemgBin(): MDEMG_BIN env → os.Executable() (the server IS the
binary) → PATH → ./bin/mdemg legacy fallback; cached; Tier 1 pins the
order. Scheduled-sync jobs now report outcomes to scheduled_job_events
via jobhealth (job_name codebase-sync) — an unattended sync that keeps
failing is never silent; manual API jobs stay queue-visible only.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(embed-wire-001): live verification + CHANGELOG + CLAUDE.md + close (Epics 3-4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): sprint plan — documentation matches reality (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): CLAUDE.md FT section rewritten to post-pivot reality (Epic 1)
The section presented the abandoned Qwen3.6-35B-A3B MoE target, two-tier
MoE-Sieve strategy, and Sprint A→E critical path as CURRENT — all
superseded by the 2026-04-22 MoE→dense pivot; this stale text seeded the
Q3 roadmap audit with a dead architecture. Rewritten: shipped state
(dense Qwen3-14B mdemg-llm-v1, 0.8389, llama-server runtime), superseded
plan documented with the pivot rationale (never deleted — supersede-with-
pointer), guardrail llmclient exception marked CLOSED (re-verified in
code), memo-07 provenance break disclosed (the file never existed;
00_README_v2.md is canonical), open FT work named (FT-CLASSIFY-002 +
recursive-retraining trigger). Adapter env-name drift fixed
(MDEMG_ADAPTER_BASE, not MDEMG_MODEL_ADAPTER_BASE).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(doc-truth-001): operator-facing text matches the Phase 13.5 reality (Epic 2)
preflight errors directed operators to start the DECOMMISSIONED
mlx_lm.server on :8101 — following them reintroduces the crash-looping
stack Phase 13.5 replaced. Now: llama-server :8102 guidance (managed
service install + manual command), backend-agnostic wording. model.go
help text dropped three stale "deferred to MODEL-DIST-002" mentions
(shipped 2026-05-25). Operationally (untracked .env): removed the
J17_SIDECAR_TIMEOUT_MS=200 override that re-pinned the exact value
DH-004 remediated — the 1000ms default now applies; server restarted.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): 00_README STATUS block + AGENT_HANDOFF retired (Epic 3)
00_README_v2.md gains a top-of-file STATUS block: shipped-through-
cutover state, superseded MoE plan (FT-2 skip + FT-3 supersession +
R-LT-4 prototype-discipline adjudication recorded), the NOT-STARTED
recursive-retraining loop with its FT-CLASSIFY-002 trigger, and
provenance notes (memo-07 never existed; the spec is untracked pending
FG-2). AGENT_HANDOFF.md (stale since 2026-05-06) retired to a pointer
stub — handoff state lives in CLAUDE.md/roadmap/CHANGELOG/CMS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): grep-sweep proof + CHANGELOG + close (Epic 4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): last stale --adapter help string (sweep straggler)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-validate-001): sprint plan — fail-closed self-improvement (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): honest criteria evaluation — populated keys + fail-closed mutations (Epics 1-2)
The cycle baseline populated 10 metric keys while task criteria
referenced ~15 others (only volatile_count + correction_rate
intersected) → missing_data → skip → ~16/17 actions validated
vacuously; criteria-driven rollback was unreachable. The
SelfAssessmentReport already carried nearly every needed key — they
were never copied into the maps.
New single source reportMetricsMap() feeds BOTH MetricsBefore and
MetricsAfter (the mismatch class cannot recur), resolving
edges_below_threshold, total_edges, consolidation_age_sec,
avg_edge_weight, guidance_health, protocol_health + 13 more. Fail-
closed rule: for MUTATING actions (15-entry registry) a criterion with
missing evidence counts as NOT met ("missing_data_failclosed") — an
unverifiable mutation must never be recorded as success; observational
actions keep advisory semantics. The prior test pinned the vacuous
pass as the contract — updated to the honest one + advisory companion.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): tombstone_stale scoped to correction-linked nodes; refresh_stale_edges decays for real (Epic 3)
tombstone_stale archived 50 ARBITRARY older observations whenever ANY
correction existed in the 7-day window — no relationship between
correction and target. Now requires linkage: same session as the
correction OR its 1-hop CO_ACTIVATED_WITH neighborhood. Live check:
0 corrections in the current 7-day window, so both old and new scopes
are 0 RIGHT NOW — the hazard was conditional (any future correction
re-armed the old query against thousands of unrelated observations;
the new query bounds it to genuinely related nodes).
refresh_stale_edges bumped last_activated BEFORE the weight expression
read it → staleness=0 → the decay term vanished → every refresh was a
pure +0.1·log(count+1) boost. Staleness now captured via WITH before
SET; weights can genuinely decay. Both statements EXPLAIN-validated.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): counter-free confidence calibration — RSIC stops polluting its own signal (Epic 4)
RSIC-SK1 injected synthetic "followed"/"ignored" outcomes through
UpdateConfidence, incrementing total_surfaced/total_followed/
total_ignored — the exact counters GetConstraintEffectiveness reads
next cycle: measured effectiveness drove synthetic outcomes which drove
measured effectiveness (circular self-reinforcement). New
AdjustConfidenceDirect applies the clamp+archive confidence delta with
ZERO counter writes; the outcome counters now belong exclusively to
real guidance feedback. Provider interface + adapter + dispatcher use
the direct path with the configured boost/decay magnitudes; test mock
maps deltas back to outcome labels so existing assertions keep meaning.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-validate-001): Tier 3 verification + CHANGELOG + close (Epics 5-6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test(rsic-validate-001): integration seeds carry session linkage for the scoped tombstone contract
CI's TombstoneStaleEndToEnd + MultiActionDispatchAndMetrics failed
because the seeded observations had NO relationship to the seeded
corrections — under the old behavior they were archived anyway (the
memory-eroding bug the sprint removed); under the new correction-
linkage contract they are correctly spared. SeedObservationNodes now
stamps a per-space test session shared by corrections and their stale
peers, so the tests exercise the new contract. Query-level proof
against the exact seeded shape: 10/10 linked observations match the
scoped Cypher. (Local integration runs hit the 30s client timeout —
the loaded local stack's cycles take ~6 min; CI's arbitrates.)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rrf-scale-002): sprint plan — finish the score-scale contract (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): persistent rerank clients — failure alerting re-armed on the hottest LLM path (Epic 1)
doRerankWithOpenAI/doRerankWithOllama constructed a fresh llmclient per
call: the consecutive-failure counter reset every time, so
LLM_CONSECUTIVE_FAILURE_THRESHOLD could NEVER fire for
retrieval.rerank_cross / rerank_nli (a north-star distill task), and the
HTTP transport was discarded per call. Per-provider base clients now
init once (sync.Once); WithContext() shallow-copies and SHARES the
*atomic counter + breaker, so per-call contexts keep failure accounting.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): config-driven score thresholds — suggest revival, MCP tiers, guardrail floor (Epic 2)
Three score-literal leftovers from the RRF-SCALE-001 audit instruction:
(1) /v1/memory/suggest's hardcoded 0.5 min-confidence default filtered
nearly everything on a scale topping out ~0.58 → CONSULTING_SUGGEST_
MIN_CONFIDENCE (default 0.45, RRF-calibrated); (2) MCP memory_reflect
tiers 0.7/0.4 (high tier unreachable) → MCP_REFLECT_SCORE_HIGH/_MEDIUM
(0.45/0.25); (3) guardrail constraint-retrieval Cypher's hardcoded
sim > 0.3 → GUARDRAIL_CONSTRAINT_SIM_FLOOR via GuardrailConfig
(cosine-stable today but inside the class).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): CacheKey covers ALL result-affecting fields + two forcing functions (Epics 3-4)
CACHE-KEY-002: the key omitted result-affecting RetrieveRequest fields —
the audit named 5 (include/exclude_extensions, temporal_after/before,
policy_context); the new reflection forcing-function caught 8 MORE on
its first run: sparse-gate per-call overrides (SparseEnabled/
SparsePercentile/SparseOverridePresent/Category — the ?sparse= URL
params), pagination (Cursor/Limit), and the context-fingerprint params
(QueryContextFingerprint/StrictContextMode). All now keyed, plus a
caller-supplied query-embedding hash. Two requests differing in any of
these no longer collide on one cache entry.
Forcing functions: (1) reflection test — every RetrieveRequest field
must be in CacheKey or explicitly classified result-neutral with
justification (new fields fail until classified); (2) score-literal
scan — flags `.Score/score <op> 0.x` comparisons repo-wide outside a
justified allowlist (first run triaged 3 scale-local sites; clamp
guards excluded by pattern). The RRF-SCALE bug class is now CI-caught.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rrf-scale-002): live-calibrated suggest floor + CHANGELOG + close (Epics 5-6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): sprint plan — stable concept identity, two-PR delivery (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-churn-001): automated consolidation no longer skips LLM emergence (Epic A1)
dynamicEmergenceStep registers at phase 22, but RunConsolidation ran
hardcoded ranges (10,20) + (25,30) — phase 22 fell in the gap, so with
EMERGENCE_ENABLED=true the AUTOMATED path silently skipped LLM concept
emergence while the manual path (RunNodeCreationPipeline, 10–22 with an
emergence gate) ran it. RunConsolidation now delegates to
RunNodeCreationPipeline(cfg.EmergenceEnabled) — single range source; a
pin test fails if the step's phase ever leaves the range.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): stable theme identity — centroid match-or-create replaces the 5-minute churn (Epic A2)
ClusterConversations detached EVERY observation→theme edge, deleted
childless themes, and recreated all themes from scratch each ~5-min
cycle: new node_ids every run, evidence chains destroyed continuously,
recall flooded with stacks of near-identical concepts (observed live in
this session's own prompt headers).
New flow: cluster first → match each cluster to an EXISTING theme by
centroid cosine (HIDDEN_THEME_IDENTITY_SIM_THRESHOLD, default 0.90,
greedy with per-run claiming) → matched themes UPDATE in place
(props + theme-scoped member-edge rewire; node_id and all inbound
references survive) → unmatched clusters create as before → only themes
claimed by NO cluster are deleted. The global detach is gone.
ThemesUpdated added to the result. Tier 1: match/threshold/claimed/
best-of selection.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* chore(hidden-churn-001): remove the dead global-detach helper — the churn mechanism itself
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): PR-A verification + CHANGELOG (Epics A3-A4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): PR-B coverage retune — config ratio, density assignment, gauge + rule (Epic B1)
maxThemes was an inline ceil(n/10) equation → HIDDEN_THEME_TARGET_RATIO
(default preserves it). NOISE observations (previously dropped from the
hierarchy forever — the 94% coverage gap's mechanism) now density-assign
to their nearest theme when cosine ≥ HIDDEN_THEME_ASSIGN_SIM_THRESHOLD
(default 0.70; edges only, no new themes; below-floor stays unthemed
honestly). New per-space coverage gauge
mdemg_neo4j_conversation_coverage_ratio (collector Query 5) + evaluator
rule low_conversation_coverage (CONVERSATION_COVERAGE_ALERT_FLOOR 0.2,
6h ForDuration for convergence).
Audit bonus: caught WeightIntegrityRules querying metric_samples with
recorded_at — the column is `time`; the null-weight rule had been
silently erroring every evaluation since it shipped (Debug-only logging
— the SUPERVISOR-002 finding in action). Both rules fixed + a pin test
bans recorded_at against metric_samples.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): mdemg concepts repair + trace — grounding audit CLI (Epic B2)
repair: tombstones childless layer>=2 abstraction nodes (no inbound
ABSTRACTS_TO|GENERALIZES|GENERALIZES_TO — 10,395 live in mdemg-dev,
churn-era debris). Recoverable (is_archived=true + archived_reason),
batched, dry-run default, --limit for small-batch-first verification.
trace: per-node grounding audit — direct children, transitive per-layer
census, grounded/ungrounded verdict, sample path to L0.
Live data note: GENERALIZES alone over-counts (19,147) — ABSTRACTS_TO
is the hidden layer's actual child edge; pin test guards the predicate.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): surface themes_updated + noise_assigned in consolidate API + periodic log (Epic B3)
/v1/conversation/consolidate now reports themes_updated and
noise_assigned alongside themes_created. The periodic-consolidation
log condition also gains both — with stable theme identity (PR-A),
created is usually 0 on healthy cycles, which would have silenced the
success log entirely (the silent-success bug class).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-churn-001): live-smoke fixes — noise pool was structurally empty, clustering included archived debris, coverage gauge gated on min-obs
Three defects only the live run surfaced (Tier 3 forcing function):
1. KMeans never emits label -1, so the density-assignment hook received
an always-empty noise list; the min-samples/max-themes/nil-centroid
drops now feed their members into the noise pool instead of silently
excluding them from the hierarchy.
2. fetchClusterableConversationObservations had no is_archived filter —
it clustered 4,838 observations of which only 183 were live (MAINT-LIVE
tombstones), building themes on archived debris. Both fetch variants
now exclude archived. Live effect: 24 debris themes swept to 5 clean
ones; second cycle themes_updated=5/created=0 (stable identity on
real data).
3. Coverage gauge gated on CONVERSATION_COVERAGE_MIN_OBS (default 50,
DH-005 confidence-threshold pattern) — tiny scratch/test spaces
(2-13 observations) emitted 0.000 and would have alarmed forever
(born-firing alert hazard). Sentinel -1 skips emission.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): PR-B verification + CHANGELOG + CLAUDE.md — sprint complete (Epic B5)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(supervisor-002): sprint plan + background loop inventory (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): sliding-window restart budget + late registration (Epic 1)
The restart counter only ever incremented — a once-a-week transient
permanently killed a worker after 3 weeks. Budget is now a sliding
window (restarts older than the window are forgotten); permanent
failure requires >SUPERVISOR_MAX_RESTARTS within
SUPERVISOR_RESTART_WINDOW_MIN. New Go() registers+launches workers
after Start (the API server starts its loops late); nil return without
ctx cancellation now means intentional completion, not a restart.
Start() outlives dead workers so late workers stay supervised.
Config: SUPERVISOR_MAX_RESTARTS (3), SUPERVISOR_RESTART_WINDOW_MIN
(60), SUPERVISOR_BACKOFF_BASE_SEC (5).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): register the 12 unsupervised background loops (Epic 2)
Every scheduler/loop goroutine now runs under the goroutine supervisor
(panic recovery + sliding-window restart budget) instead of as a bare
go func() whose panic silently killed the subsystem forever:
- api.Server (6): periodic-consolidation, context-cooler,
space-prune-scheduler, weekly-gap-interviews, scheduled-sync,
rsic-macro-cron — via injected SetSupervisor(sup.Go) + goSupervised
helper (bgWg brackets each run; stop channels remain the graceful
path and return nil = no restart)
- ape (3): rsic-watchdog, rsic-store-flush, signal-learner-flush
- backup schedulers (2): neo4j-backup-scheduler, tsdb-backup-scheduler
— their NewServer construction-time Start() moved to
StartSupervisedBackground() (serve.go) so the hook exists first
- serve.go (1): llm-fastfail-burst-flush via sup.Go
All owners keep a nil-hook fallback (legacy bare goroutine), so tests
and non-server callers are unchanged. Buffered TSDB writers are
explicitly out of scope (TSDB-CONSUME-001 owns flush observability).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): rule-health meta-alert on evaluator query failures (Epic 3)
A rule whose SQL errors was a silently-disabled alert: failures were
logged at Debug and nothing watched the watcher — bitten twice in one
week (HIDDEN-WEIGHT-001 null-weight rule + the recorded_at column bug,
both found by accident in later sprints). Query failures now log at
Warn, and after ALERT_RULE_FAILURE_THRESHOLD (default 3) consecutive
failures a high-severity meta-alert fires directly via the dispatcher
(not via a rule — the meta-channel must not depend on the failing
mechanism). Service label is rule-health-<rule-id> so concurrent
failing rules don't cooldown-suppress each other; success re-arms.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(supervisor-002): recency-gate the RSIC llm_error_rate_spike insight (Epic 4)
Insight 26 computes the error rate over a 24h window with no recency
requirement, so a 35-min jiminy.synthesize timeout burst at 02:00 UTC
kept re-firing HIGH 'LLM error rate spike' (and escalating to CRITICAL
'Jiminy Pipeline Critical') every RSIC micro-cycle for 12+ hours after
the incident self-resolved (live, 2026-06-11). LLMPerformanceSummary
now carries LastErrorAt (MAX(time) over errored rows); the spike
insight fires only when the most recent error is within
RSIC_LLM_ERROR_RECENCY_MIN (default 60; 0 disables the gate). A zero
LastErrorAt (older data source) keeps legacy behavior — the gate never
widens silently.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(supervisor-002): nolint G118 on legacy-fallback loop launches
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): aggregate evaluator-degraded alert for global outages
A TSDB-level outage fails every rule at once; per-rule meta-alerts
would storm ~19 alerts duplicating the health prober's signal. At the
failure threshold the evaluator now distinguishes: other rules
succeeding recently → per-rule rule-health alert (broken SQL class);
nothing succeeding within threshold×interval → ONE
alert-evaluator-degraded alert per outage. Success re-arms both.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(jiminy): detach feedback outcome processing from the hook's connection lifetime
Live-smoke surprise during SUPERVISOR-002 Epic 5 (own fix-commit per
policy): jiminy.evaluate_llm was failing at 94.9% (657 'context
canceled' rows/24h). The post-tool-observe hook POSTs
/v1/jiminy/feedback with curl --max-time 5, but per-item Tier-2
outcome classification routinely outlives the connection — the request
ctx then cancelled every in-flight LLM call and outcomes silently
degraded to the keyword heuristic. Same defect class as
GUIDANCE-SYNTH-001's warm-path budget.
handleJiminyFeedback now uses context.WithoutCancel(r.Context()) with
its own server-side budget JIMINY_FEEDBACK_TIMEOUT_MS (default 60000,
0 = unbounded). The hook keeps its fire-and-forget 5s curl.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(supervisor-002): streak-relative global-outage discriminator (drill-caught)
The Epic 5 TSDB-stop drill caught the freshness-window heuristic
misclassifying outage ONSET: rules were succeeding seconds before the
stop, so lastAnySuccess was fresh when the first rules hit threshold —
2 per-rule alerts leaked before the aggregate fired. The discriminator
is now streak-relative: per-rule only when some other rule succeeded
AFTER this rule's failure streak began; otherwise global, once per
outage. Unit-pinned with the drill scenario
(TestRuleFailureStreak_OutageOnsetIsGlobal).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(supervisor-002): feature doc + verification + CHANGELOG + CLAUDE.md — sprint complete (Epic 6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(backup-restore-verify-001): sprint plan (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(backup-restore-verify-001): harden the restore path (Epics 1-4)
Epics 1-4 land together — they share the restore-path signatures.
1. Checksum gate (Epic 1): the manifest SHA-256 was written at backup
time and never read; a corrupted .mdemg restored silently. The gate
now fails closed before import; legacy manifests without a checksum
warn and proceed.
2. Snapshot completion polling (Epic 2): the pre-restore safety
snapshot was raced with time.Sleep(2s) against an async backup
goroutine. waitForBackupJob now polls the jobs queue until
completed, failing closed on failure/cancel/vanish/timeout
(BACKUP_SNAPSHOT_WAIT_TIMEOUT_SEC, default 300).
3. Count validation (Epic 3): manifest NodeCount/EdgeCount are
whole-database counts and cannot validate file contents (they
diverge on partial backups) — new additive file_node_count/
file_edge_count/file_observation_count manifest fields are counted
from the exported chunks; restore re-counts the file and hard-fails
on mismatch (truncation class). Importer accounting divergence
under CONFLICT_SKIP is warn-only, surfaced in a job-result
validation block.
4. dockerbin routing (Epic 4): the legacy .dump restore shelled out to
bare "docker" (the launchd-minimal-PATH class NOSILENT-001 fixed
for TSDB); now routes via dockerbin unless the operator set a
non-default FullCmd.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(backup-restore-verify-001): neo4j-backup jobhealth + generalized staleness rules (Epic 5)
The default-ON Neo4j backup scheduler had zero jobhealth coverage —
the inverse of NOSILENT-001 (which wired only tsdb-backup). The
scheduler now waits on each triggered job (its Trigger is queue-async;
a fire-and-forget report would always claim success) and reports
outcome via SetResultHook → jobhealth.Report with
job_name='neo4j-backup' (wired in SetTSDBClient next to th…
* docs(jiminy-governance): commit install-ready skill + install README
.claude/ is gitignored (per-developer local config), so the installed skill at
.claude/skills/jiminy-governance/SKILL.md is local-only by convention. Commit
the reproducible, install-ready copy (jiminy-governance.skill.md) + a README
with the one-line install (cp into .claude/skills/) so the skill propagates via
the repo. The MCP server it uses is registered in the tracked repo-root
.mcp.json.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): per-conversation SessionID across hooks + skill
Hooks and the jiminy-governance skill hardcoded session_id="claude-core", so
trust/escalation/observations from ALL Claude Code conversations collapsed into
one shared MDEMG session. Claude Code already passes a per-conversation
session_id on stdin to every hook — the implementation just never used it.
Resolver precedence (single rule everywhere): MDEMG_SESSION_ID env (stable-
identity escape hatch) > Claude Code stdin session_id (per-conversation default,
race-free per hook) > ~/.mdemg/.claude-session (published by SessionStart +
UserPromptSubmit for the agent / stdin-less contexts) > claude-core (fallback).
Realizes J17's intended per-(session,constraint) isolation.
Tracked templates updated (internal/cli/hook_templates/): session-start.sh,
prompt-context.sh, post-tool-observe.py, pre-compact.sh + Windows .ps1 variants
— every hardcoded claude-core in MDEMG calls replaced with the resolved id;
session-start/prompt-context publish the session file. Skill SessionID
instruction + handshake steps now resolve <SessionID> instead of claude-core.
Live-verified: a hook resolved a stdin session_id and published the session
file; post-tool-observe wrote an observation keyed to the per-conversation id in
Neo4j (not claude-core). bash -n / py_compile clean; go build + hooks test pass.
Note: pre-write-check.py is local-only (no tracked installer template); its fix
is applied on this machine but won't propagate via `mdemg hooks install` until
it's added to the tracked hooks (follow-up). .claude/* is gitignored, so the
live hook copies aren't committed — the tracked templates propagate via install.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): per-conversation SessionID — installed hook copies
The .claude/hooks/ installed copies are tracked (committed before the .claude/*
ignore), so apply the same SessionID resolver here as in the embedded templates
(prior commit). session-start.sh, prompt-context.sh, post-tool-observe.py,
pre-compact.sh now resolve MDEMG_SESSION_ID env > stdin session_id >
~/.mdemg/.claude-session > claude-core. (pre-write-check.py is untracked /
local-only — its fix lives on this machine only; tracking it is the follow-up.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(hooks): add pre-write-check.py to the tracked installer
The /strict J17 Write/Edit classify gate (pre-write-check.py) was local-only —
no tracked template, so `mdemg hooks install` never installed it and its
SessionID fix wouldn't propagate. Add it as a tracked template
(internal/cli/hook_templates/pre-write-check.py, space_id → {{SPACE_ID}},
runtime URL discovery — no {{MDEMG_URL}} placeholder per the template
convention) and register it in claudeHookFiles() as
{PreToolUse, 8s, "Write|Edit"}. hooks_test expectations updated 5→6.
go build + hooks test + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* release: cut v0.10.1
Promote CHANGELOG [Unreleased] → [0.10.1] - 2026-06-08. Adds the
jiminy-governance skill + MDEMG MCP registration and the per-conversation
SessionID work to the release notes (alongside the already-logged EVENTGRAPH-002,
EVENTGRAPH-CLI-001, NOSILENT-001, the docker-PATH fix, and the TSDB schema
22→23→24 bumps). Fresh empty [Unreleased]; comparison link refs updated +
backfilled through v0.10.1.
The v0.10.1 git tag (triggers release.yml artifact build) + homebrew formula
bump are the operator release-cut step on main, post-merge.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs: governance system doc + bring cli/api references current
(1) New docs/features/jiminy-governance.md — detailed how-it-works + full file
inventory for the J17 agent-governance system (skill, hooks, SessionID, MCP,
enforcement, runtime state files, install/verify steps).
(2) docs/user/api-reference.md — add the Event Graph Federation section
(POST /v1/eventgraph/{reinforcement,guidance-outcome}-neighborhood) + TOC entry;
these were the only two missing endpoints (audited the full route table).
(3) docs/user/cli-reference.md — add mdemg eventgraph {reinforcement,guidance-
outcome}-neighborhood, model run, watchdog status, migrate context-fingerprint,
data curate/validate/clean; fix the stale `model pull --adapter` description
(MODEL-DIST-002 shipped — no longer "deferred/errors"); update the Command Tree
Summary to match.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* chore(submodule): bump homebrew-mdemg to v0.10.1 formula
Point the parent at the manually-published v0.10.1 homebrew formula
(reh3376/homebrew-mdemg@10c1843). The release artifacts published cleanly;
the formula update was manual because the CI HOMEBREW_TAP_TOKEN expired
(follow-up: rotate the secret so future releases auto-publish).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): sprint plan — reinforcement coverage for other Hebbian paths
Wire the 3 remaining Hebbian write paths (CoactivateSession,
ApplySymbolCoactivation, ApplyNegativeFeedback weaken-only) into the existing
reinforcement_events writer via distinct trigger_path values. No schema/writer/
wiring change (V0022 already has trigger_path + signed delta_weight +
created_new_edge; writer already injected). Contradict path deferred (CONTRADICTS
edges aren't traversed by the federation walk). RETURN-only Cypher edits; Tier-2
asserts unchanged weights. 5 epics, 3 tiers, live Tier-3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire CoactivateSession into reinforcement_events (Epic 1)
CoactivateSession (session-internal conversation-observation co-activation, full
Hebbian formula) now emits per-pair reinforcement events with
trigger_path=coactivate_session. RETURN-only Cypher change: replaced the
discarded `count(*)` with the standard 17-field per-pair RETURN (one row per
forward edge; reverse is a mirror). Weight SET untouched → update behavior
provably unchanged. Mirrors the proven ApplyCoactivation record loop; writer
already injected. EXPLAIN-validated (compiles, all RETURN vars in scope, no
writes); build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplySymbolCoactivation into reinforcement_events (Epic 2)
SymbolNode-pair co-activation now emits trigger_path=apply_symbol_coactivation
rows. Split the weight update out of the ON MATCH clause into a separate SET so
the pre-update weight (w) can be captured for prev/new/delta — createdNew
(evidence_count=1) keeps a fresh edge at 0.1 and increments matches by +0.05,
preserving the original ON-clause weight behavior exactly. eta/surprise/
activation/path_sim are NULL (N/A for symbols); roles default 'symbol_node'.
EXPLAIN-validated; build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplyNegativeFeedback weaken path → reinforcement_events (Epic 3)
The weaken path (existing CO_ACTIVATED_WITH edge weakened by negWeight) now emits
trigger_path=apply_negative_feedback rows with a NEGATIVE delta_weight and
created_new_edge=false. The FOREACH writes (weaken SET + contradict MERGE) are
untouched; only the RETURN changed from aggregated `action,count(*)` to per-pair
rows — the Go side counts rows (sum = grouped count, NegativeFeedbackResult
preserved) and emits reinforcement events for weaken rows only. prevWeight is
captured before the FOREACH SET. Contradict path deliberately not emitted
(CONTRADICTS isn't traversed by the federation walk; deferred). EXPLAIN-validated;
build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(conversation): inject learning service so CoactivateSession actually runs
Discovered via EVENTGRAPH-003 live smoke: session co-activation
(CO_ACTIVATED_WITH edges between same-session conversation observations) had
NEVER fired — 0 such edges ever in mdemg-dev across 5495 conversation
observations. Root cause: conversation.NewServiceWithConfig sets
learningService=nil ("set via SetLearningService to avoid circular dependency"),
but SetLearningService had NO caller, so the `if s.learningService != nil` guard
in Observe() always skipped CoactivateSession. The function + its Cypher were
correct (verified by running it directly: 3 pairs, proper Hebbian weights) —
it was just never invoked.
Fix: convSvc.SetLearningService(lea) at construction. Live-verified: 3 distinct
observations in a session now create 6 CO_ACTIVATED_WITH edges + emit
coactivate_session reinforcement events. Standalone fix-commit per the
live-smoke precedent (surprise bugs don't get rolled into the sprint commit).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): Tier 3 verification + feature doc + CHANGELOG + close (Epic 4)
All four trigger_paths live-verified (apply_coactivation 50, apply_symbol_
coactivation 1000, apply_negative_feedback 1 negative-delta, coactivate_session
4 after the dormancy fix); federation CLI surfaces them. Feature doc updated to
all-four-paths + the trigger_path table; CHANGELOG Added (EVENTGRAPH-003) + Fixed
(CoactivateSession never-invoked); CLAUDE.md note + correction (CoactivateSession
was dead, not "writing via sidecar paths"); verification.md + post.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-004): sprint plan + CoactivateSession post-revival health review (Epic 0)
EVENTGRAPH-004 federates the last unfederated Hebbian write — the
ApplyNegativeFeedback contradict action — into reinforcement_events
(trigger_path=apply_negative_feedback_contradict). Data-decided scope:
reuse the existing V0022 sink (zero CONTRADICTS edges exist anywhere;
no producer calls /v1/learning/negative-feedback — instrument before
the producer arrives, the inverse of the dormancy pattern).
Also closes the EVENTGRAPH-003 follow-up: 30h post-fix health review of
the revived CoactivateSession path — no tuning needed, textbook session
cliques, pre-fix orphans stay as historical record (operator decision).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(eventgraph-004): wire ApplyNegativeFeedback contradict path → reinforcement_events (Epic 1)
The contradict action (no co-activation edge → MERGE CONTRADICTS) was the
last unfederated Hebbian write. The CONTRADICTS MERGE lived inside a
FOREACH, where the edge variable is invisible to RETURN — so the original
single statement is split into two statements in the SAME ExecuteWrite
transaction: (a) weaken (EVENTGRAPH-003 telemetry, RETURN unchanged) and
(b) contradict with a per-pair RETURN. Classification is identical: weaken
never deletes edges, so contradict's NOT EXISTS sees the same edge set the
original OPTIONAL MATCH did.
Contradict rows land with trigger_path=apply_negative_feedback_contradict.
created_new_edge detected via `c.updated_at IS NULL` (ON MATCH always sets
it; ON CREATE never does — invariant pinned by comment). delta_weight is
the CONTRADICTS edge's OWN weight delta (+negWeight on create, 0 on
re-match); negative-feedback semantics are carried by trigger_path, not
the sign.
Both statements EXPLAIN-validated against live Neo4j. Tier 1: 2 new parser
tests (create/re-match branches); learning suite green; lint clean.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): Tier 3 live verification — contradict create/re-match + weaken unchanged (Epic 2)
Live against the restarted Epic-1 binary: contradict create row
(+0.15, created_new_edge=true), re-match row (delta=0, evidence=2),
weaken row byte-equivalent to pre-split behavior (negative delta,
floor at 0). Federation CLI surfaces the new trigger_path with no
read-side change. UATS learning_negative_feedback 5/5 PASS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): feature doc + CHANGELOG + UATS pin + close (Epic 3)
Feature doc: 5-path trigger_path table + delta-semantics consumer
warning (contradict delta is the CONTRADICTS edge's own weight delta —
semantics live in trigger_path, not the sign). UATS spec extended:
zero-count equals assertions on nonexistent nodes (hash refreshed,
5/5 live). CLAUDE.md architecture note + producer-gap disclosure.
Sprint close in post.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci: auto-sync dev branch with main after each squash-merged PR
Squash merges never advance the dev branch's merge-base, so every
sprint touching CHANGELOG.md/CLAUDE.md hit CONFLICTING on its next PR
(first bitten: PR #419). New sync-dev-after-merge.yml merges main back
into the source *_dev* branch after each merged PR; the GITHUB_TOKEN
push triggers no other workflows, so it can never spawn an empty
auto-PR (the PR #420 failure mode). Conflicts fail loudly for manual
resolution; workflow_dispatch enables manual runs/live testing.
auto-pr.yml additionally skips PR creation when branch content is
identical to main — guards MANUAL sync pushes, verified against the
live repo state (current dev01 ≡ main → empty=true → skip).
actionlint clean (untrusted refs passed via env, not inline).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(roadmap): Q3 2026 vision-derived roadmap from 26-agent codebase deep-dive
Full-codebase review vs MDEMG's purpose (cognitive substrate / connection
layer): 19 map agents (3 vision + 16 subsystem), 3 cross-cutting assessors,
synthesizer + adversarial completeness critic (19 revisions applied).
Verdict: server-side substrate is mature, but the system is not currently
functioning as the assistant's internal dialogue — the per-prompt delivery
channel silently no-ops (hook reads .user_prompt, Claude Code sends
.prompt), 100% of GENERALIZES edges have NULL weight (22,170/22,170,
live-verified), scheduled decay/prune has been a permanent dry-run, RSIC
validates 16/17 actions vacuously, and supervision covers 3 of ~14
background loops. Every defect is the same disease: wired-looking seams
with no caller, wrong contract, or no reader.
4 phases ≈ 75 days committed: (1) reconnect the loop ends, (2) close the
learning loops, (3) survivability + class-ending forcing functions,
(4) FT frontier + release hygiene. Top-10 ranked; deferrals explicit.
Orchestrator spot-verification annex included (5 claims re-verified live).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): sprint plan — fix hook stdin contract, reconnect per-prompt channel (Epic 0)
Roadmap Q3 Phase 1 rank #1. Audit of all 6 hooks vs the actual Claude
Code stdin schemas: prompt-context.sh reads .user_prompt (CC sends
.prompt) → channel exits silently on every prompt; post-tool-observe.py
reads tool_output (CC sends tool_response) → false "Build/test
succeeded" observations with empty output; guidance wrongly coupled to
RESULT_COUNT>0; minor pre-compact transcript jq. session-start /
pre-bash-check / pre-write-check verified correct.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): prompt-context.sh reads .prompt — revive the per-prompt channel (Epic 1)
Claude Code's UserPromptSubmit stdin field is `prompt`; the hook read
`.user_prompt`, which is always empty → exit 0 → per-prompt CMS recall,
Jiminy guidance, /strict reformulation, the warm trigger, and the
retrieve-time Hebbian reinforcement have NEVER fired in any session.
Now reads `.prompt // .user_prompt` (legacy fallback kept).
Also decouples guidance from recall: the RESULT_COUNT=0 branch no longer
exits — it printed its notice then skipped guidance + warm + retrieval
reinforcement, coupling independent deliveries.
Both copies (live + installer template). Tier 1 simulated stdin: real
.prompt payload → first-ever guidance delivery (J17 T1 bootstrap + DICT,
5363 guidance bytes vs 0 forever); legacy fallback works; short/empty/
malformed payloads exit silently (fail-open preserved).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): post-tool-observe reads tool_response — end blind "succeeded" observations (Epic 2)
Claude Code's PostToolUse stdin field is `tool_response` (string or
object); the hook read `tool_output`, which is always absent → output_str
empty → error indicators never matched → every go build/go test/pytest
Bash call was recorded as "Build/test succeeded" sight-unseen, and real
errors were never observed.
Now reads tool_response (fallback tool_output) via _response_text(),
normalizing string|dict|list (stdout/stderr join). Success classification
requires NON-EMPTY clean output — a silent success records nothing rather
than fabricating; failures land as error observations with real stderr.
Both copies (template regenerated from fixed live, {{SPACE_ID}}
placeholder preserved, verified identical modulo placeholder). Tier 1
against real CMS: failing build → error obs with stderr; passing →
progress; empty → no record.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): pre-compact transcript extraction reads the real line shape (Epic 3)
Transcript lines are {type, message:{content:[{type, text|name, ...}]}};
the old top-level `.content` read always yielded empty, so pre-compaction
snapshots never carried recent-activity context. New jq walks
.message.content[] extracting .text/.name. Verified against this
session's real transcript (old: nothing; new: real activity). Both
copies, placeholders preserved.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): Tier 3 verification + CHANGELOG + CLAUDE.md contract pin + close (Epics 4-5)
Live in the real session: first-ever guidance delivery (J17 T1 bootstrap
+ DICT, 5363 bytes vs 0 forever); real failing build → error observation
with actual compiler output in CMS. PostToolUse success-only firing
documented as a limitation. Hook stdin contract pinned in CLAUDE.md.
Drift + clique-semantics findings logged for HOOKSYNC-001 / Phase 2.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): sprint plan — drift-proof + self-monitoring hook channel (Epic 0)
Roadmap Q3 Phase 1 rank #2. Investigation grounded all five findings:
template→live drift severed alert delivery (50-entry file actively
rotating today, never shown); no Cleared lifecycle (nothing sets the
field; no /v1/alert* endpoints); no absence detection for the channel
that just had a months-long silent outage; compose publishes 9999 on
0.0.0.0; neural sidecar binds 0.0.0.0:8101 via a 39-day-old process
serving pre-J17-fix code. 8 epics: reconcile, CI parity gate, clear
lifecycle, hook_events absence rule (reuses V0024 via jobhealth),
hooks doctor, PORT-TRUTH rider, Tier 3, docs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): reconcile bidirectional hook drift — alert delivery restored to live (Epic 1)
Live hooks adopted from templates (SPACE_ID substituted): restores the
alert-display blocks (all-pending per prompt; critical/high + degraded
healthz at session start) that the live copies lacked — the NOSILENT
last mile. Reverse drift caught during reconcile: the live hook's T1/T2
bootstrap-detection block (MAX_TIER → /v1/jiminy/bootstrap → ACTIVE
CONSTRAINTS header) never existed in the template and was nearly lost —
now single-sourced in the template and regenerated into live.
Live-verified: one prompt now renders alerts (50 pending incl. live
CRITICALs) + recall + J17:INIT bootstrap + guidance + synergy footer,
coexisting. All 6 hooks byte-identical to templates modulo {{SPACE_ID}}.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci(hooksync-001): hook-template parity gate — live hooks must match templates (Epic 2)
Mirrors the compose/launchd parity pattern: every *.sh/*.py template
must byte-match .claude/hooks/ modulo the {{SPACE_ID}} placeholder.
Proven locally: passes clean, fails (with a bounded diff dump) on
deliberate drift. Ends the bidirectional-drift class that severed
alert delivery and nearly lost the T1 bootstrap block.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): alert Cleared lifecycle — display once, then delivered (Epic 3)
Alert.Cleared existed but nothing ever set it: once hooks rendered the
file, the same entries would re-render every prompt forever. New:
FileBackend.Clear (ids and/or all_before cutoff, idempotent, under the
existing lock) → Dispatcher.ClearAlerts → POST /v1/alerts/clear. Hooks
now clear exactly what they displayed (fire-and-forget, fail-open);
cleared = delivered-to-operator, not resolved — persisting conditions
re-fire via the evaluator. Alert IDs now CUIDv2 per the identifier
standard (was UnixNano; old ids remain valid opaque strings).
Live-verified lifecycle: prompt 1 → "50 pending, showing 10" + 10
cleared; prompt 2 → "40 pending, showing 10" (next batch, no re-render)
→ 20 cleared. Tier 1: Clear by-id/by-time/idempotent/no-backend. UATS
alerts_clear 3/3 live (runner falsy-body inheritance discovered:
variant bodies must be non-empty objects).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): hook-channel absence detection — the channel now self-reports outages (Epic 4)
POST /v1/hooks/event records heartbeats into V0024 scheduled_job_events
via the jobhealth policy point (job_name hook:<name>; no new sink).
Two independent heartbeats: prompt-context fires per delivery (the
monitored channel); post-tool-observe fires throttled (HOOK_HEARTBEAT_
COOLDOWN_SEC, default 300 — proves sessions ACTIVE). Evaluator rule
hook_channel_silent (distinct service per the NOSILENT cooldown rule):
sessions active + zero prompt-context fires in HOOK_SILENT_LOOKBACK_
HOURS (24) → high alert. This is the "job never ran" guarantee applied
to the channel whose months-long outage HOOKWIRE-001 found only by
manual audit — the next contract drift self-reports.
Config: HOOK_HEALTH_ALERT_ENABLED (true), HOOK_SILENT_LOOKBACK_HOURS
(24), HOOK_ACTIVITY_MIN_EVENTS (5). Live-verified: real hook fires land
rows (session metadata, latency); throttle holds; rule SQL positive +
negative branches proven against the real table; UATS hooks_event 3/3.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): mdemg hooks doctor — one-shot hook-channel triage (Epic 5)
11 checks: per-hook template parity (the CI gate's local twin),
settings registration, server healthz, a stdin-contract self-test
piping a real-shape UserPromptSubmit payload through the installed
hook (asserts the always-present synergy footer), alert-file state
(pending/total), and the last hook:prompt-context heartbeat age from
scheduled_job_events (SKIP when TSDB unreachable). Table or --json;
non-zero exit on any FAIL.
Live: 11/11 PASS on this machine ("last fire 5s ago" — fed by the
doctor's own self-test); correctly fails (exit 1) on deliberate drift.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): PORT-TRUTH — loopback bind defaults + sidecar zombie replaced (Epic 6)
Compose published the API on 0.0.0.0 (unauthenticated admin/destructive
routes exposed off-host): now "${MDEMG_BIND_ADDR:-127.0.0.1}:${MDEMG_PORT}
:9999" — wide bind is an explicit opt-in (both compose copies, CI-synced).
Neural sidecar bound 0.0.0.0:8101 via config.py default AND the plist
arg: both now 127.0.0.1 (both plist copies, CI-synced; SIDECAR_HOST env
overrides). Operational: the 39-day-old sidecar process (started
2026-05-02, serving pre-J17-fix code) replaced — fresh process verified
on 127.0.0.1:8101, both models loaded, health 200.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): Tier 3 verification + feature doc + CHANGELOG + close (Epics 7-8)
Live-verified across the sprint: alert backlog drained 50→2 on real
prompts (display-then-clear); evaluator rules 15→16 (hook_channel_silent
loaded); doctor 11/11 + correct failure mode; sidecar fresh on
127.0.0.1:8101 (NLI 234ms). Feature doc docs/features/hook-channel-
health.md (config table incl. MDEMG_BIND_ADDR + SIDECAR_HOST). Findings:
packaging plists are templates (raw copy → launchd exit 78; service
install is canonical); UATS falsy-variant-body inheritance pinned.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(uats): jiminy_guide_sanitized timeout 30s → 90s — stale vs synthesis latency
Caught in the HOOKSYNC-001 full-suite regression: the synchronous
/v1/jiminy/guide includes local-model synthesis (~43s observed quiet,
~50s typical per GUIDANCE-SYNTH-001) — the spec's 30s timeout has been
silently erroring since synthesis latency grew. Aligned with the
JIMINY_WARM_COMPUTE_TIMEOUT_MS budget (90s); hash refreshed; passes
live. Pre-existing — not a HOOKSYNC regression (Guide path untouched).
The other 3 suite errors were load-induced flakes (pass individually):
suite-vs-llama-server slot contention, noted for UXTS-CI-001.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ci): track .claude/hooks/pre-write-check.py so hook-parity check passes
Root cause: the new 'Verify live hooks match hook templates' CI step
(HOOKSYNC-001) diffs every internal/cli/hook_templates/*.{sh,py} against
.claude/hooks/<name>, but the .gitignore allowlist only un-ignored the
5 original hooks. pre-write-check.py gained a template in this sprint
while its live counterpart stayed gitignored, so CI checked out a tree
without it and failed with 'MISSING live hook:
.claude/hooks/pre-write-check.py'.
Fix: add '!.claude/hooks/pre-write-check.py' to the allowlist and commit
the live hook (already byte-identical to its template modulo SPACE_ID),
preserving the full parity guarantee instead of weakening the CI step.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): sprint plan — real weights on the abstraction hierarchy (Epic 0)
Roadmap Q3 Phase 1 rank #3. Live investigation: point.distance() returns
NULL on embedding lists (proven: NULL where vector.similarity.cosine
returns 0.627 on the same pair); 3 creation sites affected incl. an
ABSTRACTS_TO site the audit missed. Scale worse than audited and
growing: 28,332/28,332 GENERALIZES + 36,110/37,996 ABSTRACTS_TO = 64,442
NULL-weight abstraction edges. Neo4j cosine returns [0,1] directly —
drop-in. Plan: fix sites (+ CUIDv2 edge ids), LIMIT-5-then-batched
backfill, null-weight gauge + alert rule via the existing graph-stats →
metric_samples path, UVTS-quick regression guard.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-weight-001): abstraction-edge weights — vector.similarity.cosine replaces point.distance (Epic 1)
point.distance() is a spatial-Point function: on embedding lists it
returns NULL, so every weight at the 3 abstraction-edge creation sites
was never set (100% of GENERALIZES + 95% of ABSTRACTS_TO weightless;
the CASE guards passed on good embeddings, then the THEN expr evaluated
NULL — edges with good embeddings got nothing while embedding-less ones
got the 0.5 fallback). vector.similarity.cosine returns [0,1] directly
(live-verified: identical=1.0, orthogonal=0.5, opposite=0.0). Site 1
(theme GENERALIZES) gains the null-guard it never had.
Also: edge_id randomUUID() → CUIDv2 per the identifier standard, minted
Go-side via memberEdgePairs (Cypher can't generate CUIDv2) and zipped
with member ids for UNWIND. All 3 statements EXPLAIN-validated live.
Tier 1: pair-builder tests (uniqueness, CUID format, empty input).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): mdemg graph backfill-weights — heal 56k NULL abstraction weights (Epic 2)
Standalone subcommand (deliberately NOT folded into `graph repair`,
whose orphan sweep would delete the pre-fix orphan observations the
operator chose to keep). Weight = vector.similarity.cosine(endpoint
embeddings) when both exist, else 0.5 (the creation sites' fallback);
similarity_score set alongside; idempotent (pure function of
embeddings); batched (default 1000/txn) with --limit for trials.
Executed per the small-batch-first rule: dry-run count → LIMIT-5 live
trial → hand-verified (stored ≡ independently recomputed to 6dp) →
distribution preview over 2000 (min 0.704, mean 0.96; the ~50% near-1.0
mass is single-member-cluster degeneracy — centroid ≡ member embedding,
HIDDEN-CHURN-001 territory, faithfully encoded) → full runs. Mid-run
the count GREW: the running server predated Epic 1 and kept minting
NULL edges — restarted on the fixed binary, swept stragglers, then
whk-wms (8,755) + linear (199). Final: 0 NULL / 57,395 edges globally.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): null-weight gauge + regression alert rule (Epic 3)
Query 4 in the graph-stats collector counts NULL-weight GENERALIZES/
ABSTRACTS_TO edges per space → new gauge
mdemg_neo4j_graph_null_weight_edges → metric_samples → evaluator rule
null_weight_abstraction_edges (service graph-weight-integrity, distinct
per the cooldown rule; NULL_WEIGHT_EDGE_ALERT_THRESHOLD default 100,
ForDuration 10m). Steady state post-backfill is 0; sustained
reappearance = the point.distance bug class regressed at a creation
site — it self-reports instead of waiting for the next audit.
Live: evaluator rules 16→17; gauge rows persisting at value 0 across
all spaces.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ingest): config-driven consolidation timeout — was sharing the 300s batch budget
Caught live during the HIDDEN-WEIGHT-001 corpus reingest: the post-ingest
/v1/memory/consolidate call used the shared batch-ingest client
(--timeout, 300s); consolidating a ~10k-node space exceeds that, so the
client reported failure while the server completed the work — the
GUIDANCE-SYNTH-001 bug class (long graph/LLM work needs its own budget).
New --consolidate-timeout flag / INGEST_CONSOLIDATE_TIMEOUT_SEC env
(default 1800s) with a dedicated client. Live-verified: "running
consolidation timeout_sec=1800" → complete.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): Tier 3 verification + corpus restoration + UVTS harness audit + close (Epics 4-5)
Tier 3: real consolidation minted edges with varied cosine weights
(0.83-0.94) + CUIDv2 ids; at-scale via the corpus reingest (9,500 edges,
0 NULL, mean 0.923); gauge holds 0; evaluator rules 16→17.
UVTS harness: corpus space lnl-demo-whk had been deleted with zero trace
(no UVTS run since 2026-05-04 measured anything real); restored by
operator-directed full reingest. A fresh baseline NUMBER remains blocked
by further live-found harness rot — grader/persist breakage, expected-
path format drift, vector post-filter dilution (service.go:1137 global
top-K then space filter) amplified by the duplicate whk-wms space —
complete defect inventory handed to UXTS-CI-001. Retrieval ranking on
the restored corpus verified correct (expected files at ranks 1-4).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(maint-live-001): sprint plan — scheduled maintenance actually runs (Epic 0)
Roadmap Q3 Phase 1 rank #4. Weekly decay+prune has never executed
(--dry-run defaults true; plist passes no override) while reporting
success — NOSILENT's blind spot. Tonight's Memory Bloat alerts (79k+
nodes) are the accumulated backlog. Safety verified in code before
planning: nodes are tombstoned (never deleted) with abstraction-chain/
degree/recency protections; edge deletion is the designed near-zero-
weight lifecycle, meaningful now that HIDDEN-WEIGHT made weights real.
Plan: live-by-default plist (+installed refresh), dry_run in job-event
metadata (no schema change — disclosed), maintenance_no_live_run
evaluator rule, darwin upgrade refreshes plists/hooks, first-ever live
run with preview-first protocol.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(maint-live-001): scheduled maintenance runs live — plist passes --dry-run=false (Epic 1)
The weekly LaunchAgent ran `mdemg maintenance` with no dry-run override;
the CLI defaults --dry-run=true, so every scheduled cycle previewed and
reported success — decay+prune NEVER executed (the 79k-node Memory
Bloat backlog). Both plist copies now pass --dry-run=false (the CLI
default stays true for safe manual previews — the SCHEDULE is what must
not silently no-op); installed plist refreshed + agent reloaded.
reportScheduledJobMeta threads job metadata into V0024; maintenance
records dry_run so the only-ever-dry-runs pattern is queryable
(metadata JSONB — no schema change, disclosed in the plan).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): maintenance_no_live_run evaluator rule (Epic 2)
Fires when maintenance rows exist in MAINT_LIVE_LOOKBACK_DAYS (default
8) but none ran live (success + metadata dry_run=false) — the only-
ever-dry-runs pattern self-reports instead of hiding inside "the job
ran". Distinct service maintenance-liveness per the cooldown rule.
Config: MAINT_LIVE_ALERT_ENABLED (true), MAINT_LIVE_LOOKBACK_DAYS (8).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): mdemg upgrade refreshes installed LaunchAgents + hooks (darwin) (Epic 3)
Plist/hook fixes shipped in releases but never reached installed
machines — the maintenance dry-run override would have sat unreachable
next to upgraded binaries forever. Upgrade now re-renders ALREADY-
INSTALLED mdemg LaunchAgents from the new binary's embedded templates
(refresh-only — never installs new services) + re-syncs mdemg-managed
Claude hooks in the current project (marker-checked). Substitution
logic single-sourced into renderLaunchdTemplate (Install + Refresh —
the drift class that exit-78'd the sidecar during HOOKSYNC live smoke).
Mirrors the existing Linux systemd-unit refresh.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): context-dependent orphan policy — --exclude-role-types (Epic 4a)
Orphan disposition is context-dependent (operator, 2026-06-11): a
uniform degree/age rule conflates governance constraints, conversation
history, test junk, and hierarchy debris. New --exclude-role-types on
prune + maintenance (env PRUNE_EXCLUDE_ROLE_TYPES) makes the policy
expressible; the scheduled plist ships
constraint,conversation_observation excluded per the operator's call
(constraints are load-bearing governance rules at any degree;
conversation observations differ by SESSION which the knob can't
express yet). Aged hierarchy debris stays eligible — that's the
lifecycle working. Candidate census that drove the decision: 5,388
conv-obs (9 eligible tonight under the 90d shield), 11 constraints,
238 hierarchy nodes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(prune): orphan sweeps use implicit transactions for batched deletes
Caught by the FIRST-EVER live maintenance run (MAINT-LIVE-001 Tier 3):
Neo4j raises TransactionStartFailed when a batched CALL-IN-TRANSACTIONS
statement executes inside an explicit transaction. Both orphan sweeps
(SymbolNode + Observation) ran their batched delete via ExecuteWrite;
the dry-run path never executes the deleting statement, so no preview
or unit test could surface it — only live execution. Switched to
session.Run (implicit tx). The failure ALSO proved the NOSILENT chain
live: the run fired "Scheduled job failed: maintenance" before exiting.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(maint-live-001): first live run verification + feature doc + CHANGELOG + close (Epics 4b-5)
First live maintenance in MDEMG history: 20,236 orphan SymbolNodes
deleted; all 5,010 tombstone candidates protected (recency + operator
exclusions); liveness rule born-firing → silenced by the real run; the
3-row job-event story (preview/true → failure/false alerted →
success/false) proves the dry_run plumbing through every path.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs: CLAUDE.md architecture note for MAINT-LIVE-001
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(embed-wire-001): sprint plan — embedder wiring + ingest exec resolution (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(embed-wire-001): breaker + recorder reach the real embedder through the wrapper chain (Epic 1)
The embedding circuit breaker was NEVER wired in any default deployment:
embeddings.New returns *CachedEmbedder when EMBEDDING_CACHE_ENABLED=true
(the default), so the server's emb.(*embeddings.OpenAI)/(*Ollama)
assertions on the OUTERMOST value failed silently (no else branch). The
recorder assertion had the inverse fragility (cache off → training-data
recording silently dies).
New: Unwrap() chain (CachedEmbedder joins RateLimitedEmbedder's existing
one) + embeddings.Base() / FindCached() interface-driven walkers — any
future wrapper joins by adding Unwrap(), no type lists. Wiring now walks
to the base for the breaker and to the cache layer for the recorder,
with LOUD warns when nothing matches. Tier 1 pins the production shape
(ratelimit(cache(provider))) plus cache-off and bare chains.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ingest-exec-001): server-triggered ingest resolves the mdemg binary — was hardcoded ./bin/mdemg (Epic 2)
Both ingest-job exec sites ran a relative "./bin/mdemg": broken in
Docker (the documented-primary deployment — binary at /usr/local/bin,
no repo checkout) and any CWD other than the repo root. New
resolveMdemgBin(): MDEMG_BIN env → os.Executable() (the server IS the
binary) → PATH → ./bin/mdemg legacy fallback; cached; Tier 1 pins the
order. Scheduled-sync jobs now report outcomes to scheduled_job_events
via jobhealth (job_name codebase-sync) — an unattended sync that keeps
failing is never silent; manual API jobs stay queue-visible only.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(embed-wire-001): live verification + CHANGELOG + CLAUDE.md + close (Epics 3-4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): sprint plan — documentation matches reality (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): CLAUDE.md FT section rewritten to post-pivot reality (Epic 1)
The section presented the abandoned Qwen3.6-35B-A3B MoE target, two-tier
MoE-Sieve strategy, and Sprint A→E critical path as CURRENT — all
superseded by the 2026-04-22 MoE→dense pivot; this stale text seeded the
Q3 roadmap audit with a dead architecture. Rewritten: shipped state
(dense Qwen3-14B mdemg-llm-v1, 0.8389, llama-server runtime), superseded
plan documented with the pivot rationale (never deleted — supersede-with-
pointer), guardrail llmclient exception marked CLOSED (re-verified in
code), memo-07 provenance break disclosed (the file never existed;
00_README_v2.md is canonical), open FT work named (FT-CLASSIFY-002 +
recursive-retraining trigger). Adapter env-name drift fixed
(MDEMG_ADAPTER_BASE, not MDEMG_MODEL_ADAPTER_BASE).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(doc-truth-001): operator-facing text matches the Phase 13.5 reality (Epic 2)
preflight errors directed operators to start the DECOMMISSIONED
mlx_lm.server on :8101 — following them reintroduces the crash-looping
stack Phase 13.5 replaced. Now: llama-server :8102 guidance (managed
service install + manual command), backend-agnostic wording. model.go
help text dropped three stale "deferred to MODEL-DIST-002" mentions
(shipped 2026-05-25). Operationally (untracked .env): removed the
J17_SIDECAR_TIMEOUT_MS=200 override that re-pinned the exact value
DH-004 remediated — the 1000ms default now applies; server restarted.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): 00_README STATUS block + AGENT_HANDOFF retired (Epic 3)
00_README_v2.md gains a top-of-file STATUS block: shipped-through-
cutover state, superseded MoE plan (FT-2 skip + FT-3 supersession +
R-LT-4 prototype-discipline adjudication recorded), the NOT-STARTED
recursive-retraining loop with its FT-CLASSIFY-002 trigger, and
provenance notes (memo-07 never existed; the spec is untracked pending
FG-2). AGENT_HANDOFF.md (stale since 2026-05-06) retired to a pointer
stub — handoff state lives in CLAUDE.md/roadmap/CHANGELOG/CMS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): grep-sweep proof + CHANGELOG + close (Epic 4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): last stale --adapter help string (sweep straggler)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-validate-001): sprint plan — fail-closed self-improvement (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): honest criteria evaluation — populated keys + fail-closed mutations (Epics 1-2)
The cycle baseline populated 10 metric keys while task criteria
referenced ~15 others (only volatile_count + correction_rate
intersected) → missing_data → skip → ~16/17 actions validated
vacuously; criteria-driven rollback was unreachable. The
SelfAssessmentReport already carried nearly every needed key — they
were never copied into the maps.
New single source reportMetricsMap() feeds BOTH MetricsBefore and
MetricsAfter (the mismatch class cannot recur), resolving
edges_below_threshold, total_edges, consolidation_age_sec,
avg_edge_weight, guidance_health, protocol_health + 13 more. Fail-
closed rule: for MUTATING actions (15-entry registry) a criterion with
missing evidence counts as NOT met ("missing_data_failclosed") — an
unverifiable mutation must never be recorded as success; observational
actions keep advisory semantics. The prior test pinned the vacuous
pass as the contract — updated to the honest one + advisory companion.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): tombstone_stale scoped to correction-linked nodes; refresh_stale_edges decays for real (Epic 3)
tombstone_stale archived 50 ARBITRARY older observations whenever ANY
correction existed in the 7-day window — no relationship between
correction and target. Now requires linkage: same session as the
correction OR its 1-hop CO_ACTIVATED_WITH neighborhood. Live check:
0 corrections in the current 7-day window, so both old and new scopes
are 0 RIGHT NOW — the hazard was conditional (any future correction
re-armed the old query against thousands of unrelated observations;
the new query bounds it to genuinely related nodes).
refresh_stale_edges bumped last_activated BEFORE the weight expression
read it → staleness=0 → the decay term vanished → every refresh was a
pure +0.1·log(count+1) boost. Staleness now captured via WITH before
SET; weights can genuinely decay. Both statements EXPLAIN-validated.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): counter-free confidence calibration — RSIC stops polluting its own signal (Epic 4)
RSIC-SK1 injected synthetic "followed"/"ignored" outcomes through
UpdateConfidence, incrementing total_surfaced/total_followed/
total_ignored — the exact counters GetConstraintEffectiveness reads
next cycle: measured effectiveness drove synthetic outcomes which drove
measured effectiveness (circular self-reinforcement). New
AdjustConfidenceDirect applies the clamp+archive confidence delta with
ZERO counter writes; the outcome counters now belong exclusively to
real guidance feedback. Provider interface + adapter + dispatcher use
the direct path with the configured boost/decay magnitudes; test mock
maps deltas back to outcome labels so existing assertions keep meaning.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-validate-001): Tier 3 verification + CHANGELOG + close (Epics 5-6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test(rsic-validate-001): integration seeds carry session linkage for the scoped tombstone contract
CI's TombstoneStaleEndToEnd + MultiActionDispatchAndMetrics failed
because the seeded observations had NO relationship to the seeded
corrections — under the old behavior they were archived anyway (the
memory-eroding bug the sprint removed); under the new correction-
linkage contract they are correctly spared. SeedObservationNodes now
stamps a per-space test session shared by corrections and their stale
peers, so the tests exercise the new contract. Query-level proof
against the exact seeded shape: 10/10 linked observations match the
scoped Cypher. (Local integration runs hit the 30s client timeout —
the loaded local stack's cycles take ~6 min; CI's arbitrates.)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rrf-scale-002): sprint plan — finish the score-scale contract (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): persistent rerank clients — failure alerting re-armed on the hottest LLM path (Epic 1)
doRerankWithOpenAI/doRerankWithOllama constructed a fresh llmclient per
call: the consecutive-failure counter reset every time, so
LLM_CONSECUTIVE_FAILURE_THRESHOLD could NEVER fire for
retrieval.rerank_cross / rerank_nli (a north-star distill task), and the
HTTP transport was discarded per call. Per-provider base clients now
init once (sync.Once); WithContext() shallow-copies and SHARES the
*atomic counter + breaker, so per-call contexts keep failure accounting.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): config-driven score thresholds — suggest revival, MCP tiers, guardrail floor (Epic 2)
Three score-literal leftovers from the RRF-SCALE-001 audit instruction:
(1) /v1/memory/suggest's hardcoded 0.5 min-confidence default filtered
nearly everything on a scale topping out ~0.58 → CONSULTING_SUGGEST_
MIN_CONFIDENCE (default 0.45, RRF-calibrated); (2) MCP memory_reflect
tiers 0.7/0.4 (high tier unreachable) → MCP_REFLECT_SCORE_HIGH/_MEDIUM
(0.45/0.25); (3) guardrail constraint-retrieval Cypher's hardcoded
sim > 0.3 → GUARDRAIL_CONSTRAINT_SIM_FLOOR via GuardrailConfig
(cosine-stable today but inside the class).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): CacheKey covers ALL result-affecting fields + two forcing functions (Epics 3-4)
CACHE-KEY-002: the key omitted result-affecting RetrieveRequest fields —
the audit named 5 (include/exclude_extensions, temporal_after/before,
policy_context); the new reflection forcing-function caught 8 MORE on
its first run: sparse-gate per-call overrides (SparseEnabled/
SparsePercentile/SparseOverridePresent/Category — the ?sparse= URL
params), pagination (Cursor/Limit), and the context-fingerprint params
(QueryContextFingerprint/StrictContextMode). All now keyed, plus a
caller-supplied query-embedding hash. Two requests differing in any of
these no longer collide on one cache entry.
Forcing functions: (1) reflection test — every RetrieveRequest field
must be in CacheKey or explicitly classified result-neutral with
justification (new fields fail until classified); (2) score-literal
scan — flags `.Score/score <op> 0.x` comparisons repo-wide outside a
justified allowlist (first run triaged 3 scale-local sites; clamp
guards excluded by pattern). The RRF-SCALE bug class is now CI-caught.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rrf-scale-002): live-calibrated suggest floor + CHANGELOG + close (Epics 5-6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): sprint plan — stable concept identity, two-PR delivery (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-churn-001): automated consolidation no longer skips LLM emergence (Epic A1)
dynamicEmergenceStep registers at phase 22, but RunConsolidation ran
hardcoded ranges (10,20) + (25,30) — phase 22 fell in the gap, so with
EMERGENCE_ENABLED=true the AUTOMATED path silently skipped LLM concept
emergence while the manual path (RunNodeCreationPipeline, 10–22 with an
emergence gate) ran it. RunConsolidation now delegates to
RunNodeCreationPipeline(cfg.EmergenceEnabled) — single range source; a
pin test fails if the step's phase ever leaves the range.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): stable theme identity — centroid match-or-create replaces the 5-minute churn (Epic A2)
ClusterConversations detached EVERY observation→theme edge, deleted
childless themes, and recreated all themes from scratch each ~5-min
cycle: new node_ids every run, evidence chains destroyed continuously,
recall flooded with stacks of near-identical concepts (observed live in
this session's own prompt headers).
New flow: cluster first → match each cluster to an EXISTING theme by
centroid cosine (HIDDEN_THEME_IDENTITY_SIM_THRESHOLD, default 0.90,
greedy with per-run claiming) → matched themes UPDATE in place
(props + theme-scoped member-edge rewire; node_id and all inbound
references survive) → unmatched clusters create as before → only themes
claimed by NO cluster are deleted. The global detach is gone.
ThemesUpdated added to the result. Tier 1: match/threshold/claimed/
best-of selection.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* chore(hidden-churn-001): remove the dead global-detach helper — the churn mechanism itself
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): PR-A verification + CHANGELOG (Epics A3-A4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): PR-B coverage retune — config ratio, density assignment, gauge + rule (Epic B1)
maxThemes was an inline ceil(n/10) equation → HIDDEN_THEME_TARGET_RATIO
(default preserves it). NOISE observations (previously dropped from the
hierarchy forever — the 94% coverage gap's mechanism) now density-assign
to their nearest theme when cosine ≥ HIDDEN_THEME_ASSIGN_SIM_THRESHOLD
(default 0.70; edges only, no new themes; below-floor stays unthemed
honestly). New per-space coverage gauge
mdemg_neo4j_conversation_coverage_ratio (collector Query 5) + evaluator
rule low_conversation_coverage (CONVERSATION_COVERAGE_ALERT_FLOOR 0.2,
6h ForDuration for convergence).
Audit bonus: caught WeightIntegrityRules querying metric_samples with
recorded_at — the column is `time`; the null-weight rule had been
silently erroring every evaluation since it shipped (Debug-only logging
— the SUPERVISOR-002 finding in action). Both rules fixed + a pin test
bans recorded_at against metric_samples.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): mdemg concepts repair + trace — grounding audit CLI (Epic B2)
repair: tombstones childless layer>=2 abstraction nodes (no inbound
ABSTRACTS_TO|GENERALIZES|GENERALIZES_TO — 10,395 live in mdemg-dev,
churn-era debris). Recoverable (is_archived=true + archived_reason),
batched, dry-run default, --limit for small-batch-first verification.
trace: per-node grounding audit — direct children, transitive per-layer
census, grounded/ungrounded verdict, sample path to L0.
Live data note: GENERALIZES alone over-counts (19,147) — ABSTRACTS_TO
is the hidden layer's actual child edge; pin test guards the predicate.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): surface themes_updated + noise_assigned in consolidate API + periodic log (Epic B3)
/v1/conversation/consolidate now reports themes_updated and
noise_assigned alongside themes_created. The periodic-consolidation
log condition also gains both — with stable theme identity (PR-A),
created is usually 0 on healthy cycles, which would have silenced the
success log entirely (the silent-success bug class).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-churn-001): live-smoke fixes — noise pool was structurally empty, clustering included archived debris, coverage gauge gated on min-obs
Three defects only the live run surfaced (Tier 3 forcing function):
1. KMeans never emits label -1, so the density-assignment hook received
an always-empty noise list; the min-samples/max-themes/nil-centroid
drops now feed their members into the noise pool instead of silently
excluding them from the hierarchy.
2. fetchClusterableConversationObservations had no is_archived filter —
it clustered 4,838 observations of which only 183 were live (MAINT-LIVE
tombstones), building themes on archived debris. Both fetch variants
now exclude archived. Live effect: 24 debris themes swept to 5 clean
ones; second cycle themes_updated=5/created=0 (stable identity on
real data).
3. Coverage gauge gated on CONVERSATION_COVERAGE_MIN_OBS (default 50,
DH-005 confidence-threshold pattern) — tiny scratch/test spaces
(2-13 observations) emitted 0.000 and would have alarmed forever
(born-firing alert hazard). Sentinel -1 skips emission.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): PR-B verification + CHANGELOG + CLAUDE.md — sprint complete (Epic B5)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(supervisor-002): sprint plan + background loop inventory (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): sliding-window restart budget + late registration (Epic 1)
The restart counter only ever incremented — a once-a-week transient
permanently killed a worker after 3 weeks. Budget is now a sliding
window (restarts older than the window are forgotten); permanent
failure requires >SUPERVISOR_MAX_RESTARTS within
SUPERVISOR_RESTART_WINDOW_MIN. New Go() registers+launches workers
after Start (the API server starts its loops late); nil return without
ctx cancellation now means intentional completion, not a restart.
Start() outlives dead workers so late workers stay supervised.
Config: SUPERVISOR_MAX_RESTARTS (3), SUPERVISOR_RESTART_WINDOW_MIN
(60), SUPERVISOR_BACKOFF_BASE_SEC (5).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): register the 12 unsupervised background loops (Epic 2)
Every scheduler/loop goroutine now runs under the goroutine supervisor
(panic recovery + sliding-window restart budget) instead of as a bare
go func() whose panic silently killed the subsystem forever:
- api.Server (6): periodic-consolidation, context-cooler,
space-prune-scheduler, weekly-gap-interviews, scheduled-sync,
rsic-macro-cron — via injected SetSupervisor(sup.Go) + goSupervised
helper (bgWg brackets each run; stop channels remain the graceful
path and return nil = no restart)
- ape (3): rsic-watchdog, rsic-store-flush, signal-learner-flush
- backup schedulers (2): neo4j-backup-scheduler, tsdb-backup-scheduler
— their NewServer construction-time Start() moved to
StartSupervisedBackground() (serve.go) so the hook exists first
- serve.go (1): llm-fastfail-burst-flush via sup.Go
All owners keep a nil-hook fallback (legacy bare goroutine), so tests
and non-server callers are unchanged. Buffered TSDB writers are
explicitly out of scope (TSDB-CONSUME-001 owns flush observability).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): rule-health meta-alert on evaluator query failures (Epic 3)
A rule whose SQL errors was a silently-disabled alert: failures were
logged at Debug and nothing watched the watcher — bitten twice in one
week (HIDDEN-WEIGHT-001 null-weight rule + the recorded_at column bug,
both found by accident in later sprints). Query failures now log at
Warn, and after ALERT_RULE_FAILURE_THRESHOLD (default 3) consecutive
failures a high-severity meta-alert fires directly via the dispatcher
(not via a rule — the meta-channel must not depend on the failing
mechanism). Service label is rule-health-<rule-id> so concurrent
failing rules don't cooldown-suppress each other; success re-arms.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(supervisor-002): recency-gate the RSIC llm_error_rate_spike insight (Epic 4)
Insight 26 computes the error rate over a 24h window with no recency
requirement, so a 35-min jiminy.synthesize timeout burst at 02:00 UTC
kept re-firing HIGH 'LLM error rate spike' (and escalating to CRITICAL
'Jiminy Pipeline Critical') every RSIC micro-cycle for 12+ hours after
the incident self-resolved (live, 2026-06-11). LLMPerformanceSummary
now carries LastErrorAt (MAX(time) over errored rows); the spike
insight fires only when the most recent error is within
RSIC_LLM_ERROR_RECENCY_MIN (default 60; 0 disables the gate). A zero
LastErrorAt (older data source) keeps legacy behavior — the gate never
widens silently.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(supervisor-002): nolint G118 on legacy-fallback loop launches
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): aggregate evaluator-degraded alert for global outages
A TSDB-level outage fails every rule at once; per-rule meta-alerts
would storm ~19 alerts duplicating the health prober's signal. At the
failure threshold the evaluator now distinguishes: other rules
succeeding recently → per-rule rule-health alert (broken SQL class);
nothing succeeding within threshold×interval → ONE
alert-evaluator-degraded alert per outage. Success re-arms both.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(jiminy): detach feedback outcome processing from the hook's connection lifetime
Live-smoke surprise during SUPERVISOR-002 Epic 5 (own fix-commit per
policy): jiminy.evaluate_llm was failing at 94.9% (657 'context
canceled' rows/24h). The post-tool-observe hook POSTs
/v1/jiminy/feedback with curl --max-time 5, but per-item Tier-2
outcome classification routinely outlives the connection — the request
ctx then cancelled every in-flight LLM call and outcomes silently
degraded to the keyword heuristic. Same defect class as
GUIDANCE-SYNTH-001's warm-path budget.
handleJiminyFeedback now uses context.WithoutCancel(r.Context()) with
its own server-side budget JIMINY_FEEDBACK_TIMEOUT_MS (default 60000,
0 = unbounded). The hook keeps its fire-and-forget 5s curl.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(supervisor-002): streak-relative global-outage discriminator (drill-caught)
The Epic 5 TSDB-stop drill caught the freshness-window heuristic
misclassifying outage ONSET: rules were succeeding seconds before the
stop, so lastAnySuccess was fresh when the first rules hit threshold —
2 per-rule alerts leaked before the aggregate fired. The discriminator
is now streak-relative: per-rule only when some other rule succeeded
AFTER this rule's failure streak began; otherwise global, once per
outage. Unit-pinned with the drill scenario
(TestRuleFailureStreak_OutageOnsetIsGlobal).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(supervisor-002): feature doc + verification + CHANGELOG + CLAUDE.md — sprint complete (Epic 6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(backup-restore-verify-001): sprint plan (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(backup-restore-verify-001): harden the restore path (Epics 1-4)
Epics 1-4 land together — they share the restore-path signatures.
1. Checksum gate (Epic 1): the manifest SHA-256 was written at backup
time and never read; a corrupted .mdemg restored silently. The gate
now fails closed before import; legacy manifests without a checksum
warn and proceed.
2. Snapshot completion polling (Epic 2): the pre-restore safety
snapshot was raced with time.Sleep(2s) against an async backup
goroutine. waitForBackupJob now polls the jobs queue until
completed, failing closed on failure/cancel/vanish/timeout
(BACKUP_SNAPSHOT_WAIT_TIMEOUT_SEC, default 300).
3. Count validation (Epic 3): manifest NodeCount/EdgeCount are
whole-database counts and cannot validate file contents (they
diverge on partial backups) — new additive file_node_count/
file_edge_count/file_observation_count manifest fields are counted
from the exported chunks; restore re-counts the file and hard-fails
on mismatch (truncation class). Importer accounting divergence
under CONFLICT_SKIP is warn-only, surfaced in a job-result
validation block.
4. dockerbin routing (Epic 4): the legacy .dump restore shelled out to
bare "docker" (the launchd-minimal-PATH class NOSILENT-001 fixed
for TSDB); now routes via dockerbin unless the operator set a
non-default FullCmd.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(backup-restore-verify-001): neo4j-backup jobhealth + generalized staleness rules (Epic 5)
The default-ON Neo4j backup scheduler had zero jobhealth coverage —
the inverse of NOSILENT-001 (which wired only tsdb-backup). The
scheduler now waits on each triggered job (its Trigger is queue-async;
a fire-and-forget report would always claim success) and reports
outcome via SetResultHook → jobhealth.Report with
job_name='neo4j-backup' (wired in SetTSDBClient next to the tsdb
hook). The staleness rule is generalized into a jobStalenessRule
factory; Neo4jBackupStalenessRule (neo4j_backup_no_recent_success,
Service scheduled-job-staleness-neo4j, window = partial interval × 2
unless BACKUP_JOB_STALENESS_HOURS overrides) registers when
BACKUP_ENABLED. The existing tsdb rule is pinned unchanged through the
refactor.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(backup-restore-verify-001): initial backup on start (rule honesty)
With the neo4j_backup_no_recent_success rule registered, a fresh
install would alarm honestly-but-noisily for up to 24h (the scheduler's
first tick). The scheduler now runs an initial partial backup
BACKUP_INITIAL_DELAY_MIN (default 5) minutes after start, so every
install has a backup — and a quiet staleness rule — within minutes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(backup-restore-verify-001): retention was deleting every backup it just made (drill-caught)
The Tier 3 round-trip exposed that the backup system was a no-op for
this database: BACKUP_RETENTION_MAX_STORAGE_GB had a comment/code
default drift (documented 50, code read 2), and with RunAfter=true the
quota pass deleted each 3-4 GB whole-database backup ~80 ms after it
completed (log: 'backup completed' → 'retention cleaned backups
deleted_count=1 freed_bytes=<exactly the new backup>'). Three fixes:
1. Quota retention NEVER deletes the newest backup of each type — a
quota smaller than one backup degrades to 'over quota, keep it'
with a loud warning, not 'delete the only backup'. Sparse-file unit
tests pin both the two-backup and only-backup-oversize shapes.
2. Default quota raised to the documented 50 GB.
3. BACKUP_SNAPSHOT_WAIT_TIMEOUT_SEC default 300 → 3600: the live
whole-database export runs ~15 min; the 5-min wait made the initial
scheduled run report failure (jobhealth correctly recorded it —
the wiring works) while the backup actually completed later.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(transfer): omit empty path/name on import — restores with observations always failed (drill-caught)
The Tier 3 round-trip's real restore failed with
ConstraintValidationFailed: conversation observations carry path=NULL
in Neo4j (which memorynode_path_unique (space_id, path) ignores), but
the exporter serializes NULL as the proto default "" and the importer
wrote the literal empty string unconditionally — so the second
observation node in any restore collided. Every restore containing 2+
observation nodes had always been broken; this was invisible because
no backup had ever been restore-tested (the sprint's premise,
demonstrated). nodeProps now omits empty path/name (null fidelity);
unit-pinned.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(backup-restore-verify-001): feature doc + verification + CHANGELOG + CLAUDE.md — sprint complete (Epic 7)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-storm-001): sprint plan with corrected burst attribution (Epic 0)
Triage correction baked in: the 5,397-node burst was the Context
Cooler via the session-start hook's /v1/conversation/graduate (uncapped
backlog sweep of pre-DH-004 graduation-bug victims), NOT RSIC —
mis-attributed because tombstone_stale stamps no metadata and two
archive-reason property names coexist. RSIC's own issues stand:
trigger-race cycle storm (~20-30k/day) and snapshot/executor predicate
drift (rollback restores nothing).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-storm-001): atomic trigger admission — reserve-on-allow (Epic A)
EvaluateTrigger checked activeCycles/lastTrigger, but both were written
only by RecordTrigger — which callers invoke AFTER RunCycle completes.
For a cycle's entire multi-second duration every concurrent trigger
passed every gate: ~20-30k …
* feat(hooks): add pre-write-check.py to the tracked installer
The /strict J17 Write/Edit classify gate (pre-write-check.py) was local-only —
no tracked template, so `mdemg hooks install` never installed it and its
SessionID fix wouldn't propagate. Add it as a tracked template
(internal/cli/hook_templates/pre-write-check.py, space_id → {{SPACE_ID}},
runtime URL discovery — no {{MDEMG_URL}} placeholder per the template
convention) and register it in claudeHookFiles() as
{PreToolUse, 8s, "Write|Edit"}. hooks_test expectations updated 5→6.
go build + hooks test + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* release: cut v0.10.1
Promote CHANGELOG [Unreleased] → [0.10.1] - 2026-06-08. Adds the
jiminy-governance skill + MDEMG MCP registration and the per-conversation
SessionID work to the release notes (alongside the already-logged EVENTGRAPH-002,
EVENTGRAPH-CLI-001, NOSILENT-001, the docker-PATH fix, and the TSDB schema
22→23→24 bumps). Fresh empty [Unreleased]; comparison link refs updated +
backfilled through v0.10.1.
The v0.10.1 git tag (triggers release.yml artifact build) + homebrew formula
bump are the operator release-cut step on main, post-merge.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs: governance system doc + bring cli/api references current
(1) New docs/features/jiminy-governance.md — detailed how-it-works + full file
inventory for the J17 agent-governance system (skill, hooks, SessionID, MCP,
enforcement, runtime state files, install/verify steps).
(2) docs/user/api-reference.md — add the Event Graph Federation section
(POST /v1/eventgraph/{reinforcement,guidance-outcome}-neighborhood) + TOC entry;
these were the only two missing endpoints (audited the full route table).
(3) docs/user/cli-reference.md — add mdemg eventgraph {reinforcement,guidance-
outcome}-neighborhood, model run, watchdog status, migrate context-fingerprint,
data curate/validate/clean; fix the stale `model pull --adapter` description
(MODEL-DIST-002 shipped — no longer "deferred/errors"); update the Command Tree
Summary to match.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* chore(submodule): bump homebrew-mdemg to v0.10.1 formula
Point the parent at the manually-published v0.10.1 homebrew formula
(reh3376/homebrew-mdemg@10c1843). The release artifacts published cleanly;
the formula update was manual because the CI HOMEBREW_TAP_TOKEN expired
(follow-up: rotate the secret so future releases auto-publish).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): sprint plan — reinforcement coverage for other Hebbian paths
Wire the 3 remaining Hebbian write paths (CoactivateSession,
ApplySymbolCoactivation, ApplyNegativeFeedback weaken-only) into the existing
reinforcement_events writer via distinct trigger_path values. No schema/writer/
wiring change (V0022 already has trigger_path + signed delta_weight +
created_new_edge; writer already injected). Contradict path deferred (CONTRADICTS
edges aren't traversed by the federation walk). RETURN-only Cypher edits; Tier-2
asserts unchanged weights. 5 epics, 3 tiers, live Tier-3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire CoactivateSession into reinforcement_events (Epic 1)
CoactivateSession (session-internal conversation-observation co-activation, full
Hebbian formula) now emits per-pair reinforcement events with
trigger_path=coactivate_session. RETURN-only Cypher change: replaced the
discarded `count(*)` with the standard 17-field per-pair RETURN (one row per
forward edge; reverse is a mirror). Weight SET untouched → update behavior
provably unchanged. Mirrors the proven ApplyCoactivation record loop; writer
already injected. EXPLAIN-validated (compiles, all RETURN vars in scope, no
writes); build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplySymbolCoactivation into reinforcement_events (Epic 2)
SymbolNode-pair co-activation now emits trigger_path=apply_symbol_coactivation
rows. Split the weight update out of the ON MATCH clause into a separate SET so
the pre-update weight (w) can be captured for prev/new/delta — createdNew
(evidence_count=1) keeps a fresh edge at 0.1 and increments matches by +0.05,
preserving the original ON-clause weight behavior exactly. eta/surprise/
activation/path_sim are NULL (N/A for symbols); roles default 'symbol_node'.
EXPLAIN-validated; build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplyNegativeFeedback weaken path → reinforcement_events (Epic 3)
The weaken path (existing CO_ACTIVATED_WITH edge weakened by negWeight) now emits
trigger_path=apply_negative_feedback rows with a NEGATIVE delta_weight and
created_new_edge=false. The FOREACH writes (weaken SET + contradict MERGE) are
untouched; only the RETURN changed from aggregated `action,count(*)` to per-pair
rows — the Go side counts rows (sum = grouped count, NegativeFeedbackResult
preserved) and emits reinforcement events for weaken rows only. prevWeight is
captured before the FOREACH SET. Contradict path deliberately not emitted
(CONTRADICTS isn't traversed by the federation walk; deferred). EXPLAIN-validated;
build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(conversation): inject learning service so CoactivateSession actually runs
Discovered via EVENTGRAPH-003 live smoke: session co-activation
(CO_ACTIVATED_WITH edges between same-session conversation observations) had
NEVER fired — 0 such edges ever in mdemg-dev across 5495 conversation
observations. Root cause: conversation.NewServiceWithConfig sets
learningService=nil ("set via SetLearningService to avoid circular dependency"),
but SetLearningService had NO caller, so the `if s.learningService != nil` guard
in Observe() always skipped CoactivateSession. The function + its Cypher were
correct (verified by running it directly: 3 pairs, proper Hebbian weights) —
it was just never invoked.
Fix: convSvc.SetLearningService(lea) at construction. Live-verified: 3 distinct
observations in a session now create 6 CO_ACTIVATED_WITH edges + emit
coactivate_session reinforcement events. Standalone fix-commit per the
live-smoke precedent (surprise bugs don't get rolled into the sprint commit).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): Tier 3 verification + feature doc + CHANGELOG + close (Epic 4)
All four trigger_paths live-verified (apply_coactivation 50, apply_symbol_
coactivation 1000, apply_negative_feedback 1 negative-delta, coactivate_session
4 after the dormancy fix); federation CLI surfaces them. Feature doc updated to
all-four-paths + the trigger_path table; CHANGELOG Added (EVENTGRAPH-003) + Fixed
(CoactivateSession never-invoked); CLAUDE.md note + correction (CoactivateSession
was dead, not "writing via sidecar paths"); verification.md + post.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-004): sprint plan + CoactivateSession post-revival health review (Epic 0)
EVENTGRAPH-004 federates the last unfederated Hebbian write — the
ApplyNegativeFeedback contradict action — into reinforcement_events
(trigger_path=apply_negative_feedback_contradict). Data-decided scope:
reuse the existing V0022 sink (zero CONTRADICTS edges exist anywhere;
no producer calls /v1/learning/negative-feedback — instrument before
the producer arrives, the inverse of the dormancy pattern).
Also closes the EVENTGRAPH-003 follow-up: 30h post-fix health review of
the revived CoactivateSession path — no tuning needed, textbook session
cliques, pre-fix orphans stay as historical record (operator decision).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(eventgraph-004): wire ApplyNegativeFeedback contradict path → reinforcement_events (Epic 1)
The contradict action (no co-activation edge → MERGE CONTRADICTS) was the
last unfederated Hebbian write. The CONTRADICTS MERGE lived inside a
FOREACH, where the edge variable is invisible to RETURN — so the original
single statement is split into two statements in the SAME ExecuteWrite
transaction: (a) weaken (EVENTGRAPH-003 telemetry, RETURN unchanged) and
(b) contradict with a per-pair RETURN. Classification is identical: weaken
never deletes edges, so contradict's NOT EXISTS sees the same edge set the
original OPTIONAL MATCH did.
Contradict rows land with trigger_path=apply_negative_feedback_contradict.
created_new_edge detected via `c.updated_at IS NULL` (ON MATCH always sets
it; ON CREATE never does — invariant pinned by comment). delta_weight is
the CONTRADICTS edge's OWN weight delta (+negWeight on create, 0 on
re-match); negative-feedback semantics are carried by trigger_path, not
the sign.
Both statements EXPLAIN-validated against live Neo4j. Tier 1: 2 new parser
tests (create/re-match branches); learning suite green; lint clean.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): Tier 3 live verification — contradict create/re-match + weaken unchanged (Epic 2)
Live against the restarted Epic-1 binary: contradict create row
(+0.15, created_new_edge=true), re-match row (delta=0, evidence=2),
weaken row byte-equivalent to pre-split behavior (negative delta,
floor at 0). Federation CLI surfaces the new trigger_path with no
read-side change. UATS learning_negative_feedback 5/5 PASS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): feature doc + CHANGELOG + UATS pin + close (Epic 3)
Feature doc: 5-path trigger_path table + delta-semantics consumer
warning (contradict delta is the CONTRADICTS edge's own weight delta —
semantics live in trigger_path, not the sign). UATS spec extended:
zero-count equals assertions on nonexistent nodes (hash refreshed,
5/5 live). CLAUDE.md architecture note + producer-gap disclosure.
Sprint close in post.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci: auto-sync dev branch with main after each squash-merged PR
Squash merges never advance the dev branch's merge-base, so every
sprint touching CHANGELOG.md/CLAUDE.md hit CONFLICTING on its next PR
(first bitten: PR #419). New sync-dev-after-merge.yml merges main back
into the source *_dev* branch after each merged PR; the GITHUB_TOKEN
push triggers no other workflows, so it can never spawn an empty
auto-PR (the PR #420 failure mode). Conflicts fail loudly for manual
resolution; workflow_dispatch enables manual runs/live testing.
auto-pr.yml additionally skips PR creation when branch content is
identical to main — guards MANUAL sync pushes, verified against the
live repo state (current dev01 ≡ main → empty=true → skip).
actionlint clean (untrusted refs passed via env, not inline).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(roadmap): Q3 2026 vision-derived roadmap from 26-agent codebase deep-dive
Full-codebase review vs MDEMG's purpose (cognitive substrate / connection
layer): 19 map agents (3 vision + 16 subsystem), 3 cross-cutting assessors,
synthesizer + adversarial completeness critic (19 revisions applied).
Verdict: server-side substrate is mature, but the system is not currently
functioning as the assistant's internal dialogue — the per-prompt delivery
channel silently no-ops (hook reads .user_prompt, Claude Code sends
.prompt), 100% of GENERALIZES edges have NULL weight (22,170/22,170,
live-verified), scheduled decay/prune has been a permanent dry-run, RSIC
validates 16/17 actions vacuously, and supervision covers 3 of ~14
background loops. Every defect is the same disease: wired-looking seams
with no caller, wrong contract, or no reader.
4 phases ≈ 75 days committed: (1) reconnect the loop ends, (2) close the
learning loops, (3) survivability + class-ending forcing functions,
(4) FT frontier + release hygiene. Top-10 ranked; deferrals explicit.
Orchestrator spot-verification annex included (5 claims re-verified live).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): sprint plan — fix hook stdin contract, reconnect per-prompt channel (Epic 0)
Roadmap Q3 Phase 1 rank #1. Audit of all 6 hooks vs the actual Claude
Code stdin schemas: prompt-context.sh reads .user_prompt (CC sends
.prompt) → channel exits silently on every prompt; post-tool-observe.py
reads tool_output (CC sends tool_response) → false "Build/test
succeeded" observations with empty output; guidance wrongly coupled to
RESULT_COUNT>0; minor pre-compact transcript jq. session-start /
pre-bash-check / pre-write-check verified correct.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): prompt-context.sh reads .prompt — revive the per-prompt channel (Epic 1)
Claude Code's UserPromptSubmit stdin field is `prompt`; the hook read
`.user_prompt`, which is always empty → exit 0 → per-prompt CMS recall,
Jiminy guidance, /strict reformulation, the warm trigger, and the
retrieve-time Hebbian reinforcement have NEVER fired in any session.
Now reads `.prompt // .user_prompt` (legacy fallback kept).
Also decouples guidance from recall: the RESULT_COUNT=0 branch no longer
exits — it printed its notice then skipped guidance + warm + retrieval
reinforcement, coupling independent deliveries.
Both copies (live + installer template). Tier 1 simulated stdin: real
.prompt payload → first-ever guidance delivery (J17 T1 bootstrap + DICT,
5363 guidance bytes vs 0 forever); legacy fallback works; short/empty/
malformed payloads exit silently (fail-open preserved).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): post-tool-observe reads tool_response — end blind "succeeded" observations (Epic 2)
Claude Code's PostToolUse stdin field is `tool_response` (string or
object); the hook read `tool_output`, which is always absent → output_str
empty → error indicators never matched → every go build/go test/pytest
Bash call was recorded as "Build/test succeeded" sight-unseen, and real
errors were never observed.
Now reads tool_response (fallback tool_output) via _response_text(),
normalizing string|dict|list (stdout/stderr join). Success classification
requires NON-EMPTY clean output — a silent success records nothing rather
than fabricating; failures land as error observations with real stderr.
Both copies (template regenerated from fixed live, {{SPACE_ID}}
placeholder preserved, verified identical modulo placeholder). Tier 1
against real CMS: failing build → error obs with stderr; passing →
progress; empty → no record.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): pre-compact transcript extraction reads the real line shape (Epic 3)
Transcript lines are {type, message:{content:[{type, text|name, ...}]}};
the old top-level `.content` read always yielded empty, so pre-compaction
snapshots never carried recent-activity context. New jq walks
.message.content[] extracting .text/.name. Verified against this
session's real transcript (old: nothing; new: real activity). Both
copies, placeholders preserved.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): Tier 3 verification + CHANGELOG + CLAUDE.md contract pin + close (Epics 4-5)
Live in the real session: first-ever guidance delivery (J17 T1 bootstrap
+ DICT, 5363 bytes vs 0 forever); real failing build → error observation
with actual compiler output in CMS. PostToolUse success-only firing
documented as a limitation. Hook stdin contract pinned in CLAUDE.md.
Drift + clique-semantics findings logged for HOOKSYNC-001 / Phase 2.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): sprint plan — drift-proof + self-monitoring hook channel (Epic 0)
Roadmap Q3 Phase 1 rank #2. Investigation grounded all five findings:
template→live drift severed alert delivery (50-entry file actively
rotating today, never shown); no Cleared lifecycle (nothing sets the
field; no /v1/alert* endpoints); no absence detection for the channel
that just had a months-long silent outage; compose publishes 9999 on
0.0.0.0; neural sidecar binds 0.0.0.0:8101 via a 39-day-old process
serving pre-J17-fix code. 8 epics: reconcile, CI parity gate, clear
lifecycle, hook_events absence rule (reuses V0024 via jobhealth),
hooks doctor, PORT-TRUTH rider, Tier 3, docs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): reconcile bidirectional hook drift — alert delivery restored to live (Epic 1)
Live hooks adopted from templates (SPACE_ID substituted): restores the
alert-display blocks (all-pending per prompt; critical/high + degraded
healthz at session start) that the live copies lacked — the NOSILENT
last mile. Reverse drift caught during reconcile: the live hook's T1/T2
bootstrap-detection block (MAX_TIER → /v1/jiminy/bootstrap → ACTIVE
CONSTRAINTS header) never existed in the template and was nearly lost —
now single-sourced in the template and regenerated into live.
Live-verified: one prompt now renders alerts (50 pending incl. live
CRITICALs) + recall + J17:INIT bootstrap + guidance + synergy footer,
coexisting. All 6 hooks byte-identical to templates modulo {{SPACE_ID}}.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci(hooksync-001): hook-template parity gate — live hooks must match templates (Epic 2)
Mirrors the compose/launchd parity pattern: every *.sh/*.py template
must byte-match .claude/hooks/ modulo the {{SPACE_ID}} placeholder.
Proven locally: passes clean, fails (with a bounded diff dump) on
deliberate drift. Ends the bidirectional-drift class that severed
alert delivery and nearly lost the T1 bootstrap block.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): alert Cleared lifecycle — display once, then delivered (Epic 3)
Alert.Cleared existed but nothing ever set it: once hooks rendered the
file, the same entries would re-render every prompt forever. New:
FileBackend.Clear (ids and/or all_before cutoff, idempotent, under the
existing lock) → Dispatcher.ClearAlerts → POST /v1/alerts/clear. Hooks
now clear exactly what they displayed (fire-and-forget, fail-open);
cleared = delivered-to-operator, not resolved — persisting conditions
re-fire via the evaluator. Alert IDs now CUIDv2 per the identifier
standard (was UnixNano; old ids remain valid opaque strings).
Live-verified lifecycle: prompt 1 → "50 pending, showing 10" + 10
cleared; prompt 2 → "40 pending, showing 10" (next batch, no re-render)
→ 20 cleared. Tier 1: Clear by-id/by-time/idempotent/no-backend. UATS
alerts_clear 3/3 live (runner falsy-body inheritance discovered:
variant bodies must be non-empty objects).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): hook-channel absence detection — the channel now self-reports outages (Epic 4)
POST /v1/hooks/event records heartbeats into V0024 scheduled_job_events
via the jobhealth policy point (job_name hook:<name>; no new sink).
Two independent heartbeats: prompt-context fires per delivery (the
monitored channel); post-tool-observe fires throttled (HOOK_HEARTBEAT_
COOLDOWN_SEC, default 300 — proves sessions ACTIVE). Evaluator rule
hook_channel_silent (distinct service per the NOSILENT cooldown rule):
sessions active + zero prompt-context fires in HOOK_SILENT_LOOKBACK_
HOURS (24) → high alert. This is the "job never ran" guarantee applied
to the channel whose months-long outage HOOKWIRE-001 found only by
manual audit — the next contract drift self-reports.
Config: HOOK_HEALTH_ALERT_ENABLED (true), HOOK_SILENT_LOOKBACK_HOURS
(24), HOOK_ACTIVITY_MIN_EVENTS (5). Live-verified: real hook fires land
rows (session metadata, latency); throttle holds; rule SQL positive +
negative branches proven against the real table; UATS hooks_event 3/3.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): mdemg hooks doctor — one-shot hook-channel triage (Epic 5)
11 checks: per-hook template parity (the CI gate's local twin),
settings registration, server healthz, a stdin-contract self-test
piping a real-shape UserPromptSubmit payload through the installed
hook (asserts the always-present synergy footer), alert-file state
(pending/total), and the last hook:prompt-context heartbeat age from
scheduled_job_events (SKIP when TSDB unreachable). Table or --json;
non-zero exit on any FAIL.
Live: 11/11 PASS on this machine ("last fire 5s ago" — fed by the
doctor's own self-test); correctly fails (exit 1) on deliberate drift.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): PORT-TRUTH — loopback bind defaults + sidecar zombie replaced (Epic 6)
Compose published the API on 0.0.0.0 (unauthenticated admin/destructive
routes exposed off-host): now "${MDEMG_BIND_ADDR:-127.0.0.1}:${MDEMG_PORT}
:9999" — wide bind is an explicit opt-in (both compose copies, CI-synced).
Neural sidecar bound 0.0.0.0:8101 via config.py default AND the plist
arg: both now 127.0.0.1 (both plist copies, CI-synced; SIDECAR_HOST env
overrides). Operational: the 39-day-old sidecar process (started
2026-05-02, serving pre-J17-fix code) replaced — fresh process verified
on 127.0.0.1:8101, both models loaded, health 200.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): Tier 3 verification + feature doc + CHANGELOG + close (Epics 7-8)
Live-verified across the sprint: alert backlog drained 50→2 on real
prompts (display-then-clear); evaluator rules 15→16 (hook_channel_silent
loaded); doctor 11/11 + correct failure mode; sidecar fresh on
127.0.0.1:8101 (NLI 234ms). Feature doc docs/features/hook-channel-
health.md (config table incl. MDEMG_BIND_ADDR + SIDECAR_HOST). Findings:
packaging plists are templates (raw copy → launchd exit 78; service
install is canonical); UATS falsy-variant-body inheritance pinned.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(uats): jiminy_guide_sanitized timeout 30s → 90s — stale vs synthesis latency
Caught in the HOOKSYNC-001 full-suite regression: the synchronous
/v1/jiminy/guide includes local-model synthesis (~43s observed quiet,
~50s typical per GUIDANCE-SYNTH-001) — the spec's 30s timeout has been
silently erroring since synthesis latency grew. Aligned with the
JIMINY_WARM_COMPUTE_TIMEOUT_MS budget (90s); hash refreshed; passes
live. Pre-existing — not a HOOKSYNC regression (Guide path untouched).
The other 3 suite errors were load-induced flakes (pass individually):
suite-vs-llama-server slot contention, noted for UXTS-CI-001.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ci): track .claude/hooks/pre-write-check.py so hook-parity check passes
Root cause: the new 'Verify live hooks match hook templates' CI step
(HOOKSYNC-001) diffs every internal/cli/hook_templates/*.{sh,py} against
.claude/hooks/<name>, but the .gitignore allowlist only un-ignored the
5 original hooks. pre-write-check.py gained a template in this sprint
while its live counterpart stayed gitignored, so CI checked out a tree
without it and failed with 'MISSING live hook:
.claude/hooks/pre-write-check.py'.
Fix: add '!.claude/hooks/pre-write-check.py' to the allowlist and commit
the live hook (already byte-identical to its template modulo SPACE_ID),
preserving the full parity guarantee instead of weakening the CI step.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): sprint plan — real weights on the abstraction hierarchy (Epic 0)
Roadmap Q3 Phase 1 rank #3. Live investigation: point.distance() returns
NULL on embedding lists (proven: NULL where vector.similarity.cosine
returns 0.627 on the same pair); 3 creation sites affected incl. an
ABSTRACTS_TO site the audit missed. Scale worse than audited and
growing: 28,332/28,332 GENERALIZES + 36,110/37,996 ABSTRACTS_TO = 64,442
NULL-weight abstraction edges. Neo4j cosine returns [0,1] directly —
drop-in. Plan: fix sites (+ CUIDv2 edge ids), LIMIT-5-then-batched
backfill, null-weight gauge + alert rule via the existing graph-stats →
metric_samples path, UVTS-quick regression guard.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-weight-001): abstraction-edge weights — vector.similarity.cosine replaces point.distance (Epic 1)
point.distance() is a spatial-Point function: on embedding lists it
returns NULL, so every weight at the 3 abstraction-edge creation sites
was never set (100% of GENERALIZES + 95% of ABSTRACTS_TO weightless;
the CASE guards passed on good embeddings, then the THEN expr evaluated
NULL — edges with good embeddings got nothing while embedding-less ones
got the 0.5 fallback). vector.similarity.cosine returns [0,1] directly
(live-verified: identical=1.0, orthogonal=0.5, opposite=0.0). Site 1
(theme GENERALIZES) gains the null-guard it never had.
Also: edge_id randomUUID() → CUIDv2 per the identifier standard, minted
Go-side via memberEdgePairs (Cypher can't generate CUIDv2) and zipped
with member ids for UNWIND. All 3 statements EXPLAIN-validated live.
Tier 1: pair-builder tests (uniqueness, CUID format, empty input).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): mdemg graph backfill-weights — heal 56k NULL abstraction weights (Epic 2)
Standalone subcommand (deliberately NOT folded into `graph repair`,
whose orphan sweep would delete the pre-fix orphan observations the
operator chose to keep). Weight = vector.similarity.cosine(endpoint
embeddings) when both exist, else 0.5 (the creation sites' fallback);
similarity_score set alongside; idempotent (pure function of
embeddings); batched (default 1000/txn) with --limit for trials.
Executed per the small-batch-first rule: dry-run count → LIMIT-5 live
trial → hand-verified (stored ≡ independently recomputed to 6dp) →
distribution preview over 2000 (min 0.704, mean 0.96; the ~50% near-1.0
mass is single-member-cluster degeneracy — centroid ≡ member embedding,
HIDDEN-CHURN-001 territory, faithfully encoded) → full runs. Mid-run
the count GREW: the running server predated Epic 1 and kept minting
NULL edges — restarted on the fixed binary, swept stragglers, then
whk-wms (8,755) + linear (199). Final: 0 NULL / 57,395 edges globally.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): null-weight gauge + regression alert rule (Epic 3)
Query 4 in the graph-stats collector counts NULL-weight GENERALIZES/
ABSTRACTS_TO edges per space → new gauge
mdemg_neo4j_graph_null_weight_edges → metric_samples → evaluator rule
null_weight_abstraction_edges (service graph-weight-integrity, distinct
per the cooldown rule; NULL_WEIGHT_EDGE_ALERT_THRESHOLD default 100,
ForDuration 10m). Steady state post-backfill is 0; sustained
reappearance = the point.distance bug class regressed at a creation
site — it self-reports instead of waiting for the next audit.
Live: evaluator rules 16→17; gauge rows persisting at value 0 across
all spaces.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ingest): config-driven consolidation timeout — was sharing the 300s batch budget
Caught live during the HIDDEN-WEIGHT-001 corpus reingest: the post-ingest
/v1/memory/consolidate call used the shared batch-ingest client
(--timeout, 300s); consolidating a ~10k-node space exceeds that, so the
client reported failure while the server completed the work — the
GUIDANCE-SYNTH-001 bug class (long graph/LLM work needs its own budget).
New --consolidate-timeout flag / INGEST_CONSOLIDATE_TIMEOUT_SEC env
(default 1800s) with a dedicated client. Live-verified: "running
consolidation timeout_sec=1800" → complete.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): Tier 3 verification + corpus restoration + UVTS harness audit + close (Epics 4-5)
Tier 3: real consolidation minted edges with varied cosine weights
(0.83-0.94) + CUIDv2 ids; at-scale via the corpus reingest (9,500 edges,
0 NULL, mean 0.923); gauge holds 0; evaluator rules 16→17.
UVTS harness: corpus space lnl-demo-whk had been deleted with zero trace
(no UVTS run since 2026-05-04 measured anything real); restored by
operator-directed full reingest. A fresh baseline NUMBER remains blocked
by further live-found harness rot — grader/persist breakage, expected-
path format drift, vector post-filter dilution (service.go:1137 global
top-K then space filter) amplified by the duplicate whk-wms space —
complete defect inventory handed to UXTS-CI-001. Retrieval ranking on
the restored corpus verified correct (expected files at ranks 1-4).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(maint-live-001): sprint plan — scheduled maintenance actually runs (Epic 0)
Roadmap Q3 Phase 1 rank #4. Weekly decay+prune has never executed
(--dry-run defaults true; plist passes no override) while reporting
success — NOSILENT's blind spot. Tonight's Memory Bloat alerts (79k+
nodes) are the accumulated backlog. Safety verified in code before
planning: nodes are tombstoned (never deleted) with abstraction-chain/
degree/recency protections; edge deletion is the designed near-zero-
weight lifecycle, meaningful now that HIDDEN-WEIGHT made weights real.
Plan: live-by-default plist (+installed refresh), dry_run in job-event
metadata (no schema change — disclosed), maintenance_no_live_run
evaluator rule, darwin upgrade refreshes plists/hooks, first-ever live
run with preview-first protocol.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(maint-live-001): scheduled maintenance runs live — plist passes --dry-run=false (Epic 1)
The weekly LaunchAgent ran `mdemg maintenance` with no dry-run override;
the CLI defaults --dry-run=true, so every scheduled cycle previewed and
reported success — decay+prune NEVER executed (the 79k-node Memory
Bloat backlog). Both plist copies now pass --dry-run=false (the CLI
default stays true for safe manual previews — the SCHEDULE is what must
not silently no-op); installed plist refreshed + agent reloaded.
reportScheduledJobMeta threads job metadata into V0024; maintenance
records dry_run so the only-ever-dry-runs pattern is queryable
(metadata JSONB — no schema change, disclosed in the plan).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): maintenance_no_live_run evaluator rule (Epic 2)
Fires when maintenance rows exist in MAINT_LIVE_LOOKBACK_DAYS (default
8) but none ran live (success + metadata dry_run=false) — the only-
ever-dry-runs pattern self-reports instead of hiding inside "the job
ran". Distinct service maintenance-liveness per the cooldown rule.
Config: MAINT_LIVE_ALERT_ENABLED (true), MAINT_LIVE_LOOKBACK_DAYS (8).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): mdemg upgrade refreshes installed LaunchAgents + hooks (darwin) (Epic 3)
Plist/hook fixes shipped in releases but never reached installed
machines — the maintenance dry-run override would have sat unreachable
next to upgraded binaries forever. Upgrade now re-renders ALREADY-
INSTALLED mdemg LaunchAgents from the new binary's embedded templates
(refresh-only — never installs new services) + re-syncs mdemg-managed
Claude hooks in the current project (marker-checked). Substitution
logic single-sourced into renderLaunchdTemplate (Install + Refresh —
the drift class that exit-78'd the sidecar during HOOKSYNC live smoke).
Mirrors the existing Linux systemd-unit refresh.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): context-dependent orphan policy — --exclude-role-types (Epic 4a)
Orphan disposition is context-dependent (operator, 2026-06-11): a
uniform degree/age rule conflates governance constraints, conversation
history, test junk, and hierarchy debris. New --exclude-role-types on
prune + maintenance (env PRUNE_EXCLUDE_ROLE_TYPES) makes the policy
expressible; the scheduled plist ships
constraint,conversation_observation excluded per the operator's call
(constraints are load-bearing governance rules at any degree;
conversation observations differ by SESSION which the knob can't
express yet). Aged hierarchy debris stays eligible — that's the
lifecycle working. Candidate census that drove the decision: 5,388
conv-obs (9 eligible tonight under the 90d shield), 11 constraints,
238 hierarchy nodes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(prune): orphan sweeps use implicit transactions for batched deletes
Caught by the FIRST-EVER live maintenance run (MAINT-LIVE-001 Tier 3):
Neo4j raises TransactionStartFailed when a batched CALL-IN-TRANSACTIONS
statement executes inside an explicit transaction. Both orphan sweeps
(SymbolNode + Observation) ran their batched delete via ExecuteWrite;
the dry-run path never executes the deleting statement, so no preview
or unit test could surface it — only live execution. Switched to
session.Run (implicit tx). The failure ALSO proved the NOSILENT chain
live: the run fired "Scheduled job failed: maintenance" before exiting.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(maint-live-001): first live run verification + feature doc + CHANGELOG + close (Epics 4b-5)
First live maintenance in MDEMG history: 20,236 orphan SymbolNodes
deleted; all 5,010 tombstone candidates protected (recency + operator
exclusions); liveness rule born-firing → silenced by the real run; the
3-row job-event story (preview/true → failure/false alerted →
success/false) proves the dry_run plumbing through every path.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs: CLAUDE.md architecture note for MAINT-LIVE-001
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(embed-wire-001): sprint plan — embedder wiring + ingest exec resolution (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(embed-wire-001): breaker + recorder reach the real embedder through the wrapper chain (Epic 1)
The embedding circuit breaker was NEVER wired in any default deployment:
embeddings.New returns *CachedEmbedder when EMBEDDING_CACHE_ENABLED=true
(the default), so the server's emb.(*embeddings.OpenAI)/(*Ollama)
assertions on the OUTERMOST value failed silently (no else branch). The
recorder assertion had the inverse fragility (cache off → training-data
recording silently dies).
New: Unwrap() chain (CachedEmbedder joins RateLimitedEmbedder's existing
one) + embeddings.Base() / FindCached() interface-driven walkers — any
future wrapper joins by adding Unwrap(), no type lists. Wiring now walks
to the base for the breaker and to the cache layer for the recorder,
with LOUD warns when nothing matches. Tier 1 pins the production shape
(ratelimit(cache(provider))) plus cache-off and bare chains.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ingest-exec-001): server-triggered ingest resolves the mdemg binary — was hardcoded ./bin/mdemg (Epic 2)
Both ingest-job exec sites ran a relative "./bin/mdemg": broken in
Docker (the documented-primary deployment — binary at /usr/local/bin,
no repo checkout) and any CWD other than the repo root. New
resolveMdemgBin(): MDEMG_BIN env → os.Executable() (the server IS the
binary) → PATH → ./bin/mdemg legacy fallback; cached; Tier 1 pins the
order. Scheduled-sync jobs now report outcomes to scheduled_job_events
via jobhealth (job_name codebase-sync) — an unattended sync that keeps
failing is never silent; manual API jobs stay queue-visible only.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(embed-wire-001): live verification + CHANGELOG + CLAUDE.md + close (Epics 3-4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): sprint plan — documentation matches reality (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): CLAUDE.md FT section rewritten to post-pivot reality (Epic 1)
The section presented the abandoned Qwen3.6-35B-A3B MoE target, two-tier
MoE-Sieve strategy, and Sprint A→E critical path as CURRENT — all
superseded by the 2026-04-22 MoE→dense pivot; this stale text seeded the
Q3 roadmap audit with a dead architecture. Rewritten: shipped state
(dense Qwen3-14B mdemg-llm-v1, 0.8389, llama-server runtime), superseded
plan documented with the pivot rationale (never deleted — supersede-with-
pointer), guardrail llmclient exception marked CLOSED (re-verified in
code), memo-07 provenance break disclosed (the file never existed;
00_README_v2.md is canonical), open FT work named (FT-CLASSIFY-002 +
recursive-retraining trigger). Adapter env-name drift fixed
(MDEMG_ADAPTER_BASE, not MDEMG_MODEL_ADAPTER_BASE).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(doc-truth-001): operator-facing text matches the Phase 13.5 reality (Epic 2)
preflight errors directed operators to start the DECOMMISSIONED
mlx_lm.server on :8101 — following them reintroduces the crash-looping
stack Phase 13.5 replaced. Now: llama-server :8102 guidance (managed
service install + manual command), backend-agnostic wording. model.go
help text dropped three stale "deferred to MODEL-DIST-002" mentions
(shipped 2026-05-25). Operationally (untracked .env): removed the
J17_SIDECAR_TIMEOUT_MS=200 override that re-pinned the exact value
DH-004 remediated — the 1000ms default now applies; server restarted.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): 00_README STATUS block + AGENT_HANDOFF retired (Epic 3)
00_README_v2.md gains a top-of-file STATUS block: shipped-through-
cutover state, superseded MoE plan (FT-2 skip + FT-3 supersession +
R-LT-4 prototype-discipline adjudication recorded), the NOT-STARTED
recursive-retraining loop with its FT-CLASSIFY-002 trigger, and
provenance notes (memo-07 never existed; the spec is untracked pending
FG-2). AGENT_HANDOFF.md (stale since 2026-05-06) retired to a pointer
stub — handoff state lives in CLAUDE.md/roadmap/CHANGELOG/CMS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): grep-sweep proof + CHANGELOG + close (Epic 4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): last stale --adapter help string (sweep straggler)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-validate-001): sprint plan — fail-closed self-improvement (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): honest criteria evaluation — populated keys + fail-closed mutations (Epics 1-2)
The cycle baseline populated 10 metric keys while task criteria
referenced ~15 others (only volatile_count + correction_rate
intersected) → missing_data → skip → ~16/17 actions validated
vacuously; criteria-driven rollback was unreachable. The
SelfAssessmentReport already carried nearly every needed key — they
were never copied into the maps.
New single source reportMetricsMap() feeds BOTH MetricsBefore and
MetricsAfter (the mismatch class cannot recur), resolving
edges_below_threshold, total_edges, consolidation_age_sec,
avg_edge_weight, guidance_health, protocol_health + 13 more. Fail-
closed rule: for MUTATING actions (15-entry registry) a criterion with
missing evidence counts as NOT met ("missing_data_failclosed") — an
unverifiable mutation must never be recorded as success; observational
actions keep advisory semantics. The prior test pinned the vacuous
pass as the contract — updated to the honest one + advisory companion.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): tombstone_stale scoped to correction-linked nodes; refresh_stale_edges decays for real (Epic 3)
tombstone_stale archived 50 ARBITRARY older observations whenever ANY
correction existed in the 7-day window — no relationship between
correction and target. Now requires linkage: same session as the
correction OR its 1-hop CO_ACTIVATED_WITH neighborhood. Live check:
0 corrections in the current 7-day window, so both old and new scopes
are 0 RIGHT NOW — the hazard was conditional (any future correction
re-armed the old query against thousands of unrelated observations;
the new query bounds it to genuinely related nodes).
refresh_stale_edges bumped last_activated BEFORE the weight expression
read it → staleness=0 → the decay term vanished → every refresh was a
pure +0.1·log(count+1) boost. Staleness now captured via WITH before
SET; weights can genuinely decay. Both statements EXPLAIN-validated.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): counter-free confidence calibration — RSIC stops polluting its own signal (Epic 4)
RSIC-SK1 injected synthetic "followed"/"ignored" outcomes through
UpdateConfidence, incrementing total_surfaced/total_followed/
total_ignored — the exact counters GetConstraintEffectiveness reads
next cycle: measured effectiveness drove synthetic outcomes which drove
measured effectiveness (circular self-reinforcement). New
AdjustConfidenceDirect applies the clamp+archive confidence delta with
ZERO counter writes; the outcome counters now belong exclusively to
real guidance feedback. Provider interface + adapter + dispatcher use
the direct path with the configured boost/decay magnitudes; test mock
maps deltas back to outcome labels so existing assertions keep meaning.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-validate-001): Tier 3 verification + CHANGELOG + close (Epics 5-6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test(rsic-validate-001): integration seeds carry session linkage for the scoped tombstone contract
CI's TombstoneStaleEndToEnd + MultiActionDispatchAndMetrics failed
because the seeded observations had NO relationship to the seeded
corrections — under the old behavior they were archived anyway (the
memory-eroding bug the sprint removed); under the new correction-
linkage contract they are correctly spared. SeedObservationNodes now
stamps a per-space test session shared by corrections and their stale
peers, so the tests exercise the new contract. Query-level proof
against the exact seeded shape: 10/10 linked observations match the
scoped Cypher. (Local integration runs hit the 30s client timeout —
the loaded local stack's cycles take ~6 min; CI's arbitrates.)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rrf-scale-002): sprint plan — finish the score-scale contract (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): persistent rerank clients — failure alerting re-armed on the hottest LLM path (Epic 1)
doRerankWithOpenAI/doRerankWithOllama constructed a fresh llmclient per
call: the consecutive-failure counter reset every time, so
LLM_CONSECUTIVE_FAILURE_THRESHOLD could NEVER fire for
retrieval.rerank_cross / rerank_nli (a north-star distill task), and the
HTTP transport was discarded per call. Per-provider base clients now
init once (sync.Once); WithContext() shallow-copies and SHARES the
*atomic counter + breaker, so per-call contexts keep failure accounting.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): config-driven score thresholds — suggest revival, MCP tiers, guardrail floor (Epic 2)
Three score-literal leftovers from the RRF-SCALE-001 audit instruction:
(1) /v1/memory/suggest's hardcoded 0.5 min-confidence default filtered
nearly everything on a scale topping out ~0.58 → CONSULTING_SUGGEST_
MIN_CONFIDENCE (default 0.45, RRF-calibrated); (2) MCP memory_reflect
tiers 0.7/0.4 (high tier unreachable) → MCP_REFLECT_SCORE_HIGH/_MEDIUM
(0.45/0.25); (3) guardrail constraint-retrieval Cypher's hardcoded
sim > 0.3 → GUARDRAIL_CONSTRAINT_SIM_FLOOR via GuardrailConfig
(cosine-stable today but inside the class).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): CacheKey covers ALL result-affecting fields + two forcing functions (Epics 3-4)
CACHE-KEY-002: the key omitted result-affecting RetrieveRequest fields —
the audit named 5 (include/exclude_extensions, temporal_after/before,
policy_context); the new reflection forcing-function caught 8 MORE on
its first run: sparse-gate per-call overrides (SparseEnabled/
SparsePercentile/SparseOverridePresent/Category — the ?sparse= URL
params), pagination (Cursor/Limit), and the context-fingerprint params
(QueryContextFingerprint/StrictContextMode). All now keyed, plus a
caller-supplied query-embedding hash. Two requests differing in any of
these no longer collide on one cache entry.
Forcing functions: (1) reflection test — every RetrieveRequest field
must be in CacheKey or explicitly classified result-neutral with
justification (new fields fail until classified); (2) score-literal
scan — flags `.Score/score <op> 0.x` comparisons repo-wide outside a
justified allowlist (first run triaged 3 scale-local sites; clamp
guards excluded by pattern). The RRF-SCALE bug class is now CI-caught.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rrf-scale-002): live-calibrated suggest floor + CHANGELOG + close (Epics 5-6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): sprint plan — stable concept identity, two-PR delivery (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-churn-001): automated consolidation no longer skips LLM emergence (Epic A1)
dynamicEmergenceStep registers at phase 22, but RunConsolidation ran
hardcoded ranges (10,20) + (25,30) — phase 22 fell in the gap, so with
EMERGENCE_ENABLED=true the AUTOMATED path silently skipped LLM concept
emergence while the manual path (RunNodeCreationPipeline, 10–22 with an
emergence gate) ran it. RunConsolidation now delegates to
RunNodeCreationPipeline(cfg.EmergenceEnabled) — single range source; a
pin test fails if the step's phase ever leaves the range.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): stable theme identity — centroid match-or-create replaces the 5-minute churn (Epic A2)
ClusterConversations detached EVERY observation→theme edge, deleted
childless themes, and recreated all themes from scratch each ~5-min
cycle: new node_ids every run, evidence chains destroyed continuously,
recall flooded with stacks of near-identical concepts (observed live in
this session's own prompt headers).
New flow: cluster first → match each cluster to an EXISTING theme by
centroid cosine (HIDDEN_THEME_IDENTITY_SIM_THRESHOLD, default 0.90,
greedy with per-run claiming) → matched themes UPDATE in place
(props + theme-scoped member-edge rewire; node_id and all inbound
references survive) → unmatched clusters create as before → only themes
claimed by NO cluster are deleted. The global detach is gone.
ThemesUpdated added to the result. Tier 1: match/threshold/claimed/
best-of selection.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* chore(hidden-churn-001): remove the dead global-detach helper — the churn mechanism itself
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): PR-A verification + CHANGELOG (Epics A3-A4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): PR-B coverage retune — config ratio, density assignment, gauge + rule (Epic B1)
maxThemes was an inline ceil(n/10) equation → HIDDEN_THEME_TARGET_RATIO
(default preserves it). NOISE observations (previously dropped from the
hierarchy forever — the 94% coverage gap's mechanism) now density-assign
to their nearest theme when cosine ≥ HIDDEN_THEME_ASSIGN_SIM_THRESHOLD
(default 0.70; edges only, no new themes; below-floor stays unthemed
honestly). New per-space coverage gauge
mdemg_neo4j_conversation_coverage_ratio (collector Query 5) + evaluator
rule low_conversation_coverage (CONVERSATION_COVERAGE_ALERT_FLOOR 0.2,
6h ForDuration for convergence).
Audit bonus: caught WeightIntegrityRules querying metric_samples with
recorded_at — the column is `time`; the null-weight rule had been
silently erroring every evaluation since it shipped (Debug-only logging
— the SUPERVISOR-002 finding in action). Both rules fixed + a pin test
bans recorded_at against metric_samples.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): mdemg concepts repair + trace — grounding audit CLI (Epic B2)
repair: tombstones childless layer>=2 abstraction nodes (no inbound
ABSTRACTS_TO|GENERALIZES|GENERALIZES_TO — 10,395 live in mdemg-dev,
churn-era debris). Recoverable (is_archived=true + archived_reason),
batched, dry-run default, --limit for small-batch-first verification.
trace: per-node grounding audit — direct children, transitive per-layer
census, grounded/ungrounded verdict, sample path to L0.
Live data note: GENERALIZES alone over-counts (19,147) — ABSTRACTS_TO
is the hidden layer's actual child edge; pin test guards the predicate.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): surface themes_updated + noise_assigned in consolidate API + periodic log (Epic B3)
/v1/conversation/consolidate now reports themes_updated and
noise_assigned alongside themes_created. The periodic-consolidation
log condition also gains both — with stable theme identity (PR-A),
created is usually 0 on healthy cycles, which would have silenced the
success log entirely (the silent-success bug class).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-churn-001): live-smoke fixes — noise pool was structurally empty, clustering included archived debris, coverage gauge gated on min-obs
Three defects only the live run surfaced (Tier 3 forcing function):
1. KMeans never emits label -1, so the density-assignment hook received
an always-empty noise list; the min-samples/max-themes/nil-centroid
drops now feed their members into the noise pool instead of silently
excluding them from the hierarchy.
2. fetchClusterableConversationObservations had no is_archived filter —
it clustered 4,838 observations of which only 183 were live (MAINT-LIVE
tombstones), building themes on archived debris. Both fetch variants
now exclude archived. Live effect: 24 debris themes swept to 5 clean
ones; second cycle themes_updated=5/created=0 (stable identity on
real data).
3. Coverage gauge gated on CONVERSATION_COVERAGE_MIN_OBS (default 50,
DH-005 confidence-threshold pattern) — tiny scratch/test spaces
(2-13 observations) emitted 0.000 and would have alarmed forever
(born-firing alert hazard). Sentinel -1 skips emission.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): PR-B verification + CHANGELOG + CLAUDE.md — sprint complete (Epic B5)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(supervisor-002): sprint plan + background loop inventory (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): sliding-window restart budget + late registration (Epic 1)
The restart counter only ever incremented — a once-a-week transient
permanently killed a worker after 3 weeks. Budget is now a sliding
window (restarts older than the window are forgotten); permanent
failure requires >SUPERVISOR_MAX_RESTARTS within
SUPERVISOR_RESTART_WINDOW_MIN. New Go() registers+launches workers
after Start (the API server starts its loops late); nil return without
ctx cancellation now means intentional completion, not a restart.
Start() outlives dead workers so late workers stay supervised.
Config: SUPERVISOR_MAX_RESTARTS (3), SUPERVISOR_RESTART_WINDOW_MIN
(60), SUPERVISOR_BACKOFF_BASE_SEC (5).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): register the 12 unsupervised background loops (Epic 2)
Every scheduler/loop goroutine now runs under the goroutine supervisor
(panic recovery + sliding-window restart budget) instead of as a bare
go func() whose panic silently killed the subsystem forever:
- api.Server (6): periodic-consolidation, context-cooler,
space-prune-scheduler, weekly-gap-interviews, scheduled-sync,
rsic-macro-cron — via injected SetSupervisor(sup.Go) + goSupervised
helper (bgWg brackets each run; stop channels remain the graceful
path and return nil = no restart)
- ape (3): rsic-watchdog, rsic-store-flush, signal-learner-flush
- backup schedulers (2): neo4j-backup-scheduler, tsdb-backup-scheduler
— their NewServer construction-time Start() moved to
StartSupervisedBackground() (serve.go) so the hook exists first
- serve.go (1): llm-fastfail-burst-flush via sup.Go
All owners keep a nil-hook fallback (legacy bare goroutine), so tests
and non-server callers are unchanged. Buffered TSDB writers are
explicitly out of scope (TSDB-CONSUME-001 owns flush observability).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): rule-health meta-alert on evaluator query failures (Epic 3)
A rule whose SQL errors was a silently-disabled alert: failures were
logged at Debug and nothing watched the watcher — bitten twice in one
week (HIDDEN-WEIGHT-001 null-weight rule + the recorded_at column bug,
both found by accident in later sprints). Query failures now log at
Warn, and after ALERT_RULE_FAILURE_THRESHOLD (default 3) consecutive
failures a high-severity meta-alert fires directly via the dispatcher
(not via a rule — the meta-channel must not depend on the failing
mechanism). Service label is rule-health-<rule-id> so concurrent
failing rules don't cooldown-suppress each other; success re-arms.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(supervisor-002): recency-gate the RSIC llm_error_rate_spike insight (Epic 4)
Insight 26 computes the error rate over a 24h window with no recency
requirement, so a 35-min jiminy.synthesize timeout burst at 02:00 UTC
kept re-firing HIGH 'LLM error rate spike' (and escalating to CRITICAL
'Jiminy Pipeline Critical') every RSIC micro-cycle for 12+ hours after
the incident self-resolved (live, 2026-06-11). LLMPerformanceSummary
now carries LastErrorAt (MAX(time) over errored rows); the spike
insight fires only when the most recent error is within
RSIC_LLM_ERROR_RECENCY_MIN (default 60; 0 disables the gate). A zero
LastErrorAt (older data source) keeps legacy behavior — the gate never
widens silently.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(supervisor-002): nolint G118 on legacy-fallback loop launches
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): aggregate evaluator-degraded alert for global outages
A TSDB-level outage fails every rule at once; per-rule meta-alerts
would storm ~19 alerts duplicating the health prober's signal. At the
failure threshold the evaluator now distinguishes: other rules
succeeding recently → per-rule rule-health alert (broken SQL class);
nothing succeeding within threshold×interval → ONE
alert-evaluator-degraded alert per outage. Success re-arms both.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(jiminy): detach feedback outcome processing from the hook's connection lifetime
Live-smoke surprise during SUPERVISOR-002 Epic 5 (own fix-commit per
policy): jiminy.evaluate_llm was failing at 94.9% (657 'context
canceled' rows/24h). The post-tool-observe hook POSTs
/v1/jiminy/feedback with curl --max-time 5, but per-item Tier-2
outcome classification routinely outlives the connection — the request
ctx then cancelled every in-flight LLM call and outcomes silently
degraded to the keyword heuristic. Same defect class as
GUIDANCE-SYNTH-001's warm-path budget.
handleJiminyFeedback now uses context.WithoutCancel(r.Context()) with
its own server-side budget JIMINY_FEEDBACK_TIMEOUT_MS (default 60000,
0 = unbounded). The hook keeps its fire-and-forget 5s curl.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(supervisor-002): streak-relative global-outage discriminator (drill-caught)
The Epic 5 TSDB-stop drill caught the freshness-window heuristic
misclassifying outage ONSET: rules were succeeding seconds before the
stop, so lastAnySuccess was fresh when the first rules hit threshold —
2 per-rule alerts leaked before the aggregate fired. The discriminator
is now streak-relative: per-rule only when some other rule succeeded
AFTER this rule's failure streak began; otherwise global, once per
outage. Unit-pinned with the drill scenario
(TestRuleFailureStreak_OutageOnsetIsGlobal).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(supervisor-002): feature doc + verification + CHANGELOG + CLAUDE.md — sprint complete (Epic 6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(backup-restore-verify-001): sprint plan (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(backup-restore-verify-001): harden the restore path (Epics 1-4)
Epics 1-4 land together — they share the restore-path signatures.
1. Checksum gate (Epic 1): the manifest SHA-256 was written at backup
time and never read; a corrupted .mdemg restored silently. The gate
now fails closed before import; legacy manifests without a checksum
warn and proceed.
2. Snapshot completion polling (Epic 2): the pre-restore safety
snapshot was raced with time.Sleep(2s) against an async backup
goroutine. waitForBackupJob now polls the jobs queue until
completed, failing closed on failure/cancel/vanish/timeout
(BACKUP_SNAPSHOT_WAIT_TIMEOUT_SEC, default 300).
3. Count validation (Epic 3): manifest NodeCount/EdgeCount are
whole-database counts and cannot validate file contents (they
diverge on partial backups) — new additive file_node_count/
file_edge_count/file_observation_count manifest fields are counted
from the exported chunks; restore re-counts the file and hard-fails
on mismatch (truncation class). Importer accounting divergence
under CONFLICT_SKIP is warn-only, surfaced in a job-result
validation block.
4. dockerbin routing (Epic 4): the legacy .dump restore shelled out to
bare "docker" (the launchd-minimal-PATH class NOSILENT-001 fixed
for TSDB); now routes via dockerbin unless the operator set a
non-default FullCmd.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(backup-restore-verify-001): neo4j-backup jobhealth + generalized staleness rules (Epic 5)
The default-ON Neo4j backup scheduler had zero jobhealth coverage —
the inverse of NOSILENT-001 (which wired only tsdb-backup). The
scheduler now waits on each triggered job (its Trigger is queue-async;
a fire-and-forget report would always claim success) and reports
outcome via SetResultHook → jobhealth.Report with
job_name='neo4j-backup' (wired in SetTSDBClient next to the tsdb
hook). The staleness rule is generalized into a jobStalenessRule
factory; Neo4jBackupStalenessRule (neo4j_backup_no_recent_success,
Service scheduled-job-staleness-neo4j, window = partial interval × 2
unless BACKUP_JOB_STALENESS_HOURS overrides) registers when
BACKUP_ENABLED. The existing tsdb rule is pinned unchanged through the
refactor.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(backup-restore-verify-001): initial backup on start (rule honesty)
With the neo4j_backup_no_recent_success rule registered, a fresh
install would alarm honestly-but-noisily for up to 24h (the scheduler's
first tick). The scheduler now runs an initial partial backup
BACKUP_INITIAL_DELAY_MIN (default 5) minutes after start, so every
install has a backup — and a quiet staleness rule — within minutes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(backup-restore-verify-001): retention was deleting every backup it just made (drill-caught)
The Tier 3 round-trip exposed that the backup system was a no-op for
this database: BACKUP_RETENTION_MAX_STORAGE_GB had a comment/code
default drift (documented 50, code read 2), and with RunAfter=true the
quota pass deleted each 3-4 GB whole-database backup ~80 ms after it
completed (log: 'backup completed' → 'retention cleaned backups
deleted_count=1 freed_bytes=<exactly the new backup>'). Three fixes:
1. Quota retention NEVER deletes the newest backup of each type — a
quota smaller than one backup degrades to 'over quota, keep it'
with a loud warning, not 'delete the only backup'. Sparse-file unit
tests pin both the two-backup and only-backup-oversize shapes.
2. Default quota raised to the documented 50 GB.
3. BACKUP_SNAPSHOT_WAIT_TIMEOUT_SEC default 300 → 3600: the live
whole-database export runs ~15 min; the 5-min wait made the initial
scheduled run report failure (jobhealth correctly recorded it —
the wiring works) while the backup actually completed later.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(transfer): omit empty path/name on import — restores with observations always failed (drill-caught)
The Tier 3 round-trip's real restore failed with
ConstraintValidationFailed: conversation observations carry path=NULL
in Neo4j (which memorynode_path_unique (space_id, path) ignores), but
the exporter serializes NULL as the proto default "" and the importer
wrote the literal empty string unconditionally — so the second
observation node in any restore collided. Every restore containing 2+
observation nodes had always been broken; this was invisible because
no backup had ever been restore-tested (the sprint's premise,
demonstrated). nodeProps now omits empty path/name (null fidelity);
unit-pinned.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(backup-restore-verify-001): feature doc + verification + CHANGELOG + CLAUDE.md — sprint complete (Epic 7)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-storm-001): sprint plan with corrected burst attribution (Epic 0)
Triage correction baked in: the 5,397-node burst was the Context
Cooler via the session-start hook's /v1/conversation/graduate (uncapped
backlog sweep of pre-DH-004 graduation-bug victims), NOT RSIC —
mis-attributed because tombstone_stale stamps no metadata and two
archive-reason property names coexist. RSIC's own issues stand:
trigger-race cycle storm (~20-30k/day) and snapshot/executor predicate
drift (rollback restores nothing).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-storm-001): atomic trigger admission — reserve-on-allow (Epic A)
EvaluateTrigger checked activeCycles/lastTrigger, but both were written
only by RecordTrigger — which callers invoke AFTER RunCycle completes.
For a cycle's entire multi-second duration every concurrent trigger
passed every gate: ~20-30k micro cycles/day live (4 spawning within
50ms of each tool-use burst), the 300s cooldown effectively
nonexistent, llama-server saturated (the recurring synthesize/
evaluate_llm/intent_translate timeout cascades), and RSIC actions
dispatched at storm frequency.
Admission now reserves the active + cooldown (+dedupe) records under
the same lock that performs the checks; RecordTrigger updates the
reservation with the real cycle ID; CompleteCycle clears the active
slot; a failed cycle still cools down. Unit-pinned: 50 concurrent
triggers admit exactly one; cooldown holds from admission through
completion and failure.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-storm-001): attributable archival + unified tombstone predicate (Epics B+C)
Epic B — every archival is now attributable:
- tombstone_stale stamps archived_at + archive_reason
('rsic_tombstone_stale') + archived_cycle_id (bare is_archived made
the 2026-06-11 burst forensics mis-attribute the Context Cooler's
5,397-node sweep to RSIC for hours).
- Canonical property name is archive_reason; concepts.go (the one
archived_reason writer) migrates; historical rows keep the old name
(readers coalesce; no data migration).
- Context Cooler tombstone step capped per run
(COOLER_TOMBSTONE_MAX_PER_RUN, default 500; 0=unlimited) with a loud
cap-reached warning — the incident sweep was a single uncapped run
over the pre-DH-004 volatile backlog via the session-start hook's
graduate call.
Epic C — rollback restores the right nodes:
- The executor and the rollback snapshot now share ONE candidate
predicate (tombstoneStaleCandidates const). RSIC-VALIDATE-001 had
updated only the executor; the snapshot captured the old unlinked
set, so rollback restored nodes that were never archived
(restored_count=0 live). Drift class eliminated, pin-tested.
- Rollback also clears the new attribution fields on restore.
Epics share the tombstone Cypher — combined commit (disclosed).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-storm-001): feature doc + verification + rollback drill test + CHANGELOG + CLAUDE.md — sprint complete (Epic F)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-storm-001): commit ExecuteTombstoneStaleForTest wrapper missed from Epic F
The rollback drill test (committed in 2534a28) references this
test-support wrapper, but the Epic F git-add listed the test file and
not internal/ape/task_dispatch.go — CI's integration build failed on
the already-merged PR #435 while local builds passed (the method
existed uncommitted in the working tree). 7-line addition, no behavior
change.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(tsdb): initial backup on start — restart-resetting ticker meant zero backups ever
Alert triage on the (correctly firing) 'No …
* chore(submodule): bump homebrew-mdemg to v0.10.1 formula
Point the parent at the manually-published v0.10.1 homebrew formula
(reh3376/homebrew-mdemg@10c1843). The release artifacts published cleanly;
the formula update was manual because the CI HOMEBREW_TAP_TOKEN expired
(follow-up: rotate the secret so future releases auto-publish).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): sprint plan — reinforcement coverage for other Hebbian paths
Wire the 3 remaining Hebbian write paths (CoactivateSession,
ApplySymbolCoactivation, ApplyNegativeFeedback weaken-only) into the existing
reinforcement_events writer via distinct trigger_path values. No schema/writer/
wiring change (V0022 already has trigger_path + signed delta_weight +
created_new_edge; writer already injected). Contradict path deferred (CONTRADICTS
edges aren't traversed by the federation walk). RETURN-only Cypher edits; Tier-2
asserts unchanged weights. 5 epics, 3 tiers, live Tier-3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire CoactivateSession into reinforcement_events (Epic 1)
CoactivateSession (session-internal conversation-observation co-activation, full
Hebbian formula) now emits per-pair reinforcement events with
trigger_path=coactivate_session. RETURN-only Cypher change: replaced the
discarded `count(*)` with the standard 17-field per-pair RETURN (one row per
forward edge; reverse is a mirror). Weight SET untouched → update behavior
provably unchanged. Mirrors the proven ApplyCoactivation record loop; writer
already injected. EXPLAIN-validated (compiles, all RETURN vars in scope, no
writes); build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplySymbolCoactivation into reinforcement_events (Epic 2)
SymbolNode-pair co-activation now emits trigger_path=apply_symbol_coactivation
rows. Split the weight update out of the ON MATCH clause into a separate SET so
the pre-update weight (w) can be captured for prev/new/delta — createdNew
(evidence_count=1) keeps a fresh edge at 0.1 and increments matches by +0.05,
preserving the original ON-clause weight behavior exactly. eta/surprise/
activation/path_sim are NULL (N/A for symbols); roles default 'symbol_node'.
EXPLAIN-validated; build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(eventgraph-003): wire ApplyNegativeFeedback weaken path → reinforcement_events (Epic 3)
The weaken path (existing CO_ACTIVATED_WITH edge weakened by negWeight) now emits
trigger_path=apply_negative_feedback rows with a NEGATIVE delta_weight and
created_new_edge=false. The FOREACH writes (weaken SET + contradict MERGE) are
untouched; only the RETURN changed from aggregated `action,count(*)` to per-pair
rows — the Go side counts rows (sum = grouped count, NegativeFeedbackResult
preserved) and emits reinforcement events for weaken rows only. prevWeight is
captured before the FOREACH SET. Contradict path deliberately not emitted
(CONTRADICTS isn't traversed by the federation walk; deferred). EXPLAIN-validated;
build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(conversation): inject learning service so CoactivateSession actually runs
Discovered via EVENTGRAPH-003 live smoke: session co-activation
(CO_ACTIVATED_WITH edges between same-session conversation observations) had
NEVER fired — 0 such edges ever in mdemg-dev across 5495 conversation
observations. Root cause: conversation.NewServiceWithConfig sets
learningService=nil ("set via SetLearningService to avoid circular dependency"),
but SetLearningService had NO caller, so the `if s.learningService != nil` guard
in Observe() always skipped CoactivateSession. The function + its Cypher were
correct (verified by running it directly: 3 pairs, proper Hebbian weights) —
it was just never invoked.
Fix: convSvc.SetLearningService(lea) at construction. Live-verified: 3 distinct
observations in a session now create 6 CO_ACTIVATED_WITH edges + emit
coactivate_session reinforcement events. Standalone fix-commit per the
live-smoke precedent (surprise bugs don't get rolled into the sprint commit).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): Tier 3 verification + feature doc + CHANGELOG + close (Epic 4)
All four trigger_paths live-verified (apply_coactivation 50, apply_symbol_
coactivation 1000, apply_negative_feedback 1 negative-delta, coactivate_session
4 after the dormancy fix); federation CLI surfaces them. Feature doc updated to
all-four-paths + the trigger_path table; CHANGELOG Added (EVENTGRAPH-003) + Fixed
(CoactivateSession never-invoked); CLAUDE.md note + correction (CoactivateSession
was dead, not "writing via sidecar paths"); verification.md + post.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-004): sprint plan + CoactivateSession post-revival health review (Epic 0)
EVENTGRAPH-004 federates the last unfederated Hebbian write — the
ApplyNegativeFeedback contradict action — into reinforcement_events
(trigger_path=apply_negative_feedback_contradict). Data-decided scope:
reuse the existing V0022 sink (zero CONTRADICTS edges exist anywhere;
no producer calls /v1/learning/negative-feedback — instrument before
the producer arrives, the inverse of the dormancy pattern).
Also closes the EVENTGRAPH-003 follow-up: 30h post-fix health review of
the revived CoactivateSession path — no tuning needed, textbook session
cliques, pre-fix orphans stay as historical record (operator decision).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(eventgraph-004): wire ApplyNegativeFeedback contradict path → reinforcement_events (Epic 1)
The contradict action (no co-activation edge → MERGE CONTRADICTS) was the
last unfederated Hebbian write. The CONTRADICTS MERGE lived inside a
FOREACH, where the edge variable is invisible to RETURN — so the original
single statement is split into two statements in the SAME ExecuteWrite
transaction: (a) weaken (EVENTGRAPH-003 telemetry, RETURN unchanged) and
(b) contradict with a per-pair RETURN. Classification is identical: weaken
never deletes edges, so contradict's NOT EXISTS sees the same edge set the
original OPTIONAL MATCH did.
Contradict rows land with trigger_path=apply_negative_feedback_contradict.
created_new_edge detected via `c.updated_at IS NULL` (ON MATCH always sets
it; ON CREATE never does — invariant pinned by comment). delta_weight is
the CONTRADICTS edge's OWN weight delta (+negWeight on create, 0 on
re-match); negative-feedback semantics are carried by trigger_path, not
the sign.
Both statements EXPLAIN-validated against live Neo4j. Tier 1: 2 new parser
tests (create/re-match branches); learning suite green; lint clean.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): Tier 3 live verification — contradict create/re-match + weaken unchanged (Epic 2)
Live against the restarted Epic-1 binary: contradict create row
(+0.15, created_new_edge=true), re-match row (delta=0, evidence=2),
weaken row byte-equivalent to pre-split behavior (negative delta,
floor at 0). Federation CLI surfaces the new trigger_path with no
read-side change. UATS learning_negative_feedback 5/5 PASS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): feature doc + CHANGELOG + UATS pin + close (Epic 3)
Feature doc: 5-path trigger_path table + delta-semantics consumer
warning (contradict delta is the CONTRADICTS edge's own weight delta —
semantics live in trigger_path, not the sign). UATS spec extended:
zero-count equals assertions on nonexistent nodes (hash refreshed,
5/5 live). CLAUDE.md architecture note + producer-gap disclosure.
Sprint close in post.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci: auto-sync dev branch with main after each squash-merged PR
Squash merges never advance the dev branch's merge-base, so every
sprint touching CHANGELOG.md/CLAUDE.md hit CONFLICTING on its next PR
(first bitten: PR #419). New sync-dev-after-merge.yml merges main back
into the source *_dev* branch after each merged PR; the GITHUB_TOKEN
push triggers no other workflows, so it can never spawn an empty
auto-PR (the PR #420 failure mode). Conflicts fail loudly for manual
resolution; workflow_dispatch enables manual runs/live testing.
auto-pr.yml additionally skips PR creation when branch content is
identical to main — guards MANUAL sync pushes, verified against the
live repo state (current dev01 ≡ main → empty=true → skip).
actionlint clean (untrusted refs passed via env, not inline).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(roadmap): Q3 2026 vision-derived roadmap from 26-agent codebase deep-dive
Full-codebase review vs MDEMG's purpose (cognitive substrate / connection
layer): 19 map agents (3 vision + 16 subsystem), 3 cross-cutting assessors,
synthesizer + adversarial completeness critic (19 revisions applied).
Verdict: server-side substrate is mature, but the system is not currently
functioning as the assistant's internal dialogue — the per-prompt delivery
channel silently no-ops (hook reads .user_prompt, Claude Code sends
.prompt), 100% of GENERALIZES edges have NULL weight (22,170/22,170,
live-verified), scheduled decay/prune has been a permanent dry-run, RSIC
validates 16/17 actions vacuously, and supervision covers 3 of ~14
background loops. Every defect is the same disease: wired-looking seams
with no caller, wrong contract, or no reader.
4 phases ≈ 75 days committed: (1) reconnect the loop ends, (2) close the
learning loops, (3) survivability + class-ending forcing functions,
(4) FT frontier + release hygiene. Top-10 ranked; deferrals explicit.
Orchestrator spot-verification annex included (5 claims re-verified live).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): sprint plan — fix hook stdin contract, reconnect per-prompt channel (Epic 0)
Roadmap Q3 Phase 1 rank #1. Audit of all 6 hooks vs the actual Claude
Code stdin schemas: prompt-context.sh reads .user_prompt (CC sends
.prompt) → channel exits silently on every prompt; post-tool-observe.py
reads tool_output (CC sends tool_response) → false "Build/test
succeeded" observations with empty output; guidance wrongly coupled to
RESULT_COUNT>0; minor pre-compact transcript jq. session-start /
pre-bash-check / pre-write-check verified correct.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): prompt-context.sh reads .prompt — revive the per-prompt channel (Epic 1)
Claude Code's UserPromptSubmit stdin field is `prompt`; the hook read
`.user_prompt`, which is always empty → exit 0 → per-prompt CMS recall,
Jiminy guidance, /strict reformulation, the warm trigger, and the
retrieve-time Hebbian reinforcement have NEVER fired in any session.
Now reads `.prompt // .user_prompt` (legacy fallback kept).
Also decouples guidance from recall: the RESULT_COUNT=0 branch no longer
exits — it printed its notice then skipped guidance + warm + retrieval
reinforcement, coupling independent deliveries.
Both copies (live + installer template). Tier 1 simulated stdin: real
.prompt payload → first-ever guidance delivery (J17 T1 bootstrap + DICT,
5363 guidance bytes vs 0 forever); legacy fallback works; short/empty/
malformed payloads exit silently (fail-open preserved).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): post-tool-observe reads tool_response — end blind "succeeded" observations (Epic 2)
Claude Code's PostToolUse stdin field is `tool_response` (string or
object); the hook read `tool_output`, which is always absent → output_str
empty → error indicators never matched → every go build/go test/pytest
Bash call was recorded as "Build/test succeeded" sight-unseen, and real
errors were never observed.
Now reads tool_response (fallback tool_output) via _response_text(),
normalizing string|dict|list (stdout/stderr join). Success classification
requires NON-EMPTY clean output — a silent success records nothing rather
than fabricating; failures land as error observations with real stderr.
Both copies (template regenerated from fixed live, {{SPACE_ID}}
placeholder preserved, verified identical modulo placeholder). Tier 1
against real CMS: failing build → error obs with stderr; passing →
progress; empty → no record.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): pre-compact transcript extraction reads the real line shape (Epic 3)
Transcript lines are {type, message:{content:[{type, text|name, ...}]}};
the old top-level `.content` read always yielded empty, so pre-compaction
snapshots never carried recent-activity context. New jq walks
.message.content[] extracting .text/.name. Verified against this
session's real transcript (old: nothing; new: real activity). Both
copies, placeholders preserved.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): Tier 3 verification + CHANGELOG + CLAUDE.md contract pin + close (Epics 4-5)
Live in the real session: first-ever guidance delivery (J17 T1 bootstrap
+ DICT, 5363 bytes vs 0 forever); real failing build → error observation
with actual compiler output in CMS. PostToolUse success-only firing
documented as a limitation. Hook stdin contract pinned in CLAUDE.md.
Drift + clique-semantics findings logged for HOOKSYNC-001 / Phase 2.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): sprint plan — drift-proof + self-monitoring hook channel (Epic 0)
Roadmap Q3 Phase 1 rank #2. Investigation grounded all five findings:
template→live drift severed alert delivery (50-entry file actively
rotating today, never shown); no Cleared lifecycle (nothing sets the
field; no /v1/alert* endpoints); no absence detection for the channel
that just had a months-long silent outage; compose publishes 9999 on
0.0.0.0; neural sidecar binds 0.0.0.0:8101 via a 39-day-old process
serving pre-J17-fix code. 8 epics: reconcile, CI parity gate, clear
lifecycle, hook_events absence rule (reuses V0024 via jobhealth),
hooks doctor, PORT-TRUTH rider, Tier 3, docs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): reconcile bidirectional hook drift — alert delivery restored to live (Epic 1)
Live hooks adopted from templates (SPACE_ID substituted): restores the
alert-display blocks (all-pending per prompt; critical/high + degraded
healthz at session start) that the live copies lacked — the NOSILENT
last mile. Reverse drift caught during reconcile: the live hook's T1/T2
bootstrap-detection block (MAX_TIER → /v1/jiminy/bootstrap → ACTIVE
CONSTRAINTS header) never existed in the template and was nearly lost —
now single-sourced in the template and regenerated into live.
Live-verified: one prompt now renders alerts (50 pending incl. live
CRITICALs) + recall + J17:INIT bootstrap + guidance + synergy footer,
coexisting. All 6 hooks byte-identical to templates modulo {{SPACE_ID}}.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci(hooksync-001): hook-template parity gate — live hooks must match templates (Epic 2)
Mirrors the compose/launchd parity pattern: every *.sh/*.py template
must byte-match .claude/hooks/ modulo the {{SPACE_ID}} placeholder.
Proven locally: passes clean, fails (with a bounded diff dump) on
deliberate drift. Ends the bidirectional-drift class that severed
alert delivery and nearly lost the T1 bootstrap block.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): alert Cleared lifecycle — display once, then delivered (Epic 3)
Alert.Cleared existed but nothing ever set it: once hooks rendered the
file, the same entries would re-render every prompt forever. New:
FileBackend.Clear (ids and/or all_before cutoff, idempotent, under the
existing lock) → Dispatcher.ClearAlerts → POST /v1/alerts/clear. Hooks
now clear exactly what they displayed (fire-and-forget, fail-open);
cleared = delivered-to-operator, not resolved — persisting conditions
re-fire via the evaluator. Alert IDs now CUIDv2 per the identifier
standard (was UnixNano; old ids remain valid opaque strings).
Live-verified lifecycle: prompt 1 → "50 pending, showing 10" + 10
cleared; prompt 2 → "40 pending, showing 10" (next batch, no re-render)
→ 20 cleared. Tier 1: Clear by-id/by-time/idempotent/no-backend. UATS
alerts_clear 3/3 live (runner falsy-body inheritance discovered:
variant bodies must be non-empty objects).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): hook-channel absence detection — the channel now self-reports outages (Epic 4)
POST /v1/hooks/event records heartbeats into V0024 scheduled_job_events
via the jobhealth policy point (job_name hook:<name>; no new sink).
Two independent heartbeats: prompt-context fires per delivery (the
monitored channel); post-tool-observe fires throttled (HOOK_HEARTBEAT_
COOLDOWN_SEC, default 300 — proves sessions ACTIVE). Evaluator rule
hook_channel_silent (distinct service per the NOSILENT cooldown rule):
sessions active + zero prompt-context fires in HOOK_SILENT_LOOKBACK_
HOURS (24) → high alert. This is the "job never ran" guarantee applied
to the channel whose months-long outage HOOKWIRE-001 found only by
manual audit — the next contract drift self-reports.
Config: HOOK_HEALTH_ALERT_ENABLED (true), HOOK_SILENT_LOOKBACK_HOURS
(24), HOOK_ACTIVITY_MIN_EVENTS (5). Live-verified: real hook fires land
rows (session metadata, latency); throttle holds; rule SQL positive +
negative branches proven against the real table; UATS hooks_event 3/3.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): mdemg hooks doctor — one-shot hook-channel triage (Epic 5)
11 checks: per-hook template parity (the CI gate's local twin),
settings registration, server healthz, a stdin-contract self-test
piping a real-shape UserPromptSubmit payload through the installed
hook (asserts the always-present synergy footer), alert-file state
(pending/total), and the last hook:prompt-context heartbeat age from
scheduled_job_events (SKIP when TSDB unreachable). Table or --json;
non-zero exit on any FAIL.
Live: 11/11 PASS on this machine ("last fire 5s ago" — fed by the
doctor's own self-test); correctly fails (exit 1) on deliberate drift.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): PORT-TRUTH — loopback bind defaults + sidecar zombie replaced (Epic 6)
Compose published the API on 0.0.0.0 (unauthenticated admin/destructive
routes exposed off-host): now "${MDEMG_BIND_ADDR:-127.0.0.1}:${MDEMG_PORT}
:9999" — wide bind is an explicit opt-in (both compose copies, CI-synced).
Neural sidecar bound 0.0.0.0:8101 via config.py default AND the plist
arg: both now 127.0.0.1 (both plist copies, CI-synced; SIDECAR_HOST env
overrides). Operational: the 39-day-old sidecar process (started
2026-05-02, serving pre-J17-fix code) replaced — fresh process verified
on 127.0.0.1:8101, both models loaded, health 200.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): Tier 3 verification + feature doc + CHANGELOG + close (Epics 7-8)
Live-verified across the sprint: alert backlog drained 50→2 on real
prompts (display-then-clear); evaluator rules 15→16 (hook_channel_silent
loaded); doctor 11/11 + correct failure mode; sidecar fresh on
127.0.0.1:8101 (NLI 234ms). Feature doc docs/features/hook-channel-
health.md (config table incl. MDEMG_BIND_ADDR + SIDECAR_HOST). Findings:
packaging plists are templates (raw copy → launchd exit 78; service
install is canonical); UATS falsy-variant-body inheritance pinned.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(uats): jiminy_guide_sanitized timeout 30s → 90s — stale vs synthesis latency
Caught in the HOOKSYNC-001 full-suite regression: the synchronous
/v1/jiminy/guide includes local-model synthesis (~43s observed quiet,
~50s typical per GUIDANCE-SYNTH-001) — the spec's 30s timeout has been
silently erroring since synthesis latency grew. Aligned with the
JIMINY_WARM_COMPUTE_TIMEOUT_MS budget (90s); hash refreshed; passes
live. Pre-existing — not a HOOKSYNC regression (Guide path untouched).
The other 3 suite errors were load-induced flakes (pass individually):
suite-vs-llama-server slot contention, noted for UXTS-CI-001.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ci): track .claude/hooks/pre-write-check.py so hook-parity check passes
Root cause: the new 'Verify live hooks match hook templates' CI step
(HOOKSYNC-001) diffs every internal/cli/hook_templates/*.{sh,py} against
.claude/hooks/<name>, but the .gitignore allowlist only un-ignored the
5 original hooks. pre-write-check.py gained a template in this sprint
while its live counterpart stayed gitignored, so CI checked out a tree
without it and failed with 'MISSING live hook:
.claude/hooks/pre-write-check.py'.
Fix: add '!.claude/hooks/pre-write-check.py' to the allowlist and commit
the live hook (already byte-identical to its template modulo SPACE_ID),
preserving the full parity guarantee instead of weakening the CI step.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): sprint plan — real weights on the abstraction hierarchy (Epic 0)
Roadmap Q3 Phase 1 rank #3. Live investigation: point.distance() returns
NULL on embedding lists (proven: NULL where vector.similarity.cosine
returns 0.627 on the same pair); 3 creation sites affected incl. an
ABSTRACTS_TO site the audit missed. Scale worse than audited and
growing: 28,332/28,332 GENERALIZES + 36,110/37,996 ABSTRACTS_TO = 64,442
NULL-weight abstraction edges. Neo4j cosine returns [0,1] directly —
drop-in. Plan: fix sites (+ CUIDv2 edge ids), LIMIT-5-then-batched
backfill, null-weight gauge + alert rule via the existing graph-stats →
metric_samples path, UVTS-quick regression guard.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-weight-001): abstraction-edge weights — vector.similarity.cosine replaces point.distance (Epic 1)
point.distance() is a spatial-Point function: on embedding lists it
returns NULL, so every weight at the 3 abstraction-edge creation sites
was never set (100% of GENERALIZES + 95% of ABSTRACTS_TO weightless;
the CASE guards passed on good embeddings, then the THEN expr evaluated
NULL — edges with good embeddings got nothing while embedding-less ones
got the 0.5 fallback). vector.similarity.cosine returns [0,1] directly
(live-verified: identical=1.0, orthogonal=0.5, opposite=0.0). Site 1
(theme GENERALIZES) gains the null-guard it never had.
Also: edge_id randomUUID() → CUIDv2 per the identifier standard, minted
Go-side via memberEdgePairs (Cypher can't generate CUIDv2) and zipped
with member ids for UNWIND. All 3 statements EXPLAIN-validated live.
Tier 1: pair-builder tests (uniqueness, CUID format, empty input).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): mdemg graph backfill-weights — heal 56k NULL abstraction weights (Epic 2)
Standalone subcommand (deliberately NOT folded into `graph repair`,
whose orphan sweep would delete the pre-fix orphan observations the
operator chose to keep). Weight = vector.similarity.cosine(endpoint
embeddings) when both exist, else 0.5 (the creation sites' fallback);
similarity_score set alongside; idempotent (pure function of
embeddings); batched (default 1000/txn) with --limit for trials.
Executed per the small-batch-first rule: dry-run count → LIMIT-5 live
trial → hand-verified (stored ≡ independently recomputed to 6dp) →
distribution preview over 2000 (min 0.704, mean 0.96; the ~50% near-1.0
mass is single-member-cluster degeneracy — centroid ≡ member embedding,
HIDDEN-CHURN-001 territory, faithfully encoded) → full runs. Mid-run
the count GREW: the running server predated Epic 1 and kept minting
NULL edges — restarted on the fixed binary, swept stragglers, then
whk-wms (8,755) + linear (199). Final: 0 NULL / 57,395 edges globally.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): null-weight gauge + regression alert rule (Epic 3)
Query 4 in the graph-stats collector counts NULL-weight GENERALIZES/
ABSTRACTS_TO edges per space → new gauge
mdemg_neo4j_graph_null_weight_edges → metric_samples → evaluator rule
null_weight_abstraction_edges (service graph-weight-integrity, distinct
per the cooldown rule; NULL_WEIGHT_EDGE_ALERT_THRESHOLD default 100,
ForDuration 10m). Steady state post-backfill is 0; sustained
reappearance = the point.distance bug class regressed at a creation
site — it self-reports instead of waiting for the next audit.
Live: evaluator rules 16→17; gauge rows persisting at value 0 across
all spaces.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ingest): config-driven consolidation timeout — was sharing the 300s batch budget
Caught live during the HIDDEN-WEIGHT-001 corpus reingest: the post-ingest
/v1/memory/consolidate call used the shared batch-ingest client
(--timeout, 300s); consolidating a ~10k-node space exceeds that, so the
client reported failure while the server completed the work — the
GUIDANCE-SYNTH-001 bug class (long graph/LLM work needs its own budget).
New --consolidate-timeout flag / INGEST_CONSOLIDATE_TIMEOUT_SEC env
(default 1800s) with a dedicated client. Live-verified: "running
consolidation timeout_sec=1800" → complete.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): Tier 3 verification + corpus restoration + UVTS harness audit + close (Epics 4-5)
Tier 3: real consolidation minted edges with varied cosine weights
(0.83-0.94) + CUIDv2 ids; at-scale via the corpus reingest (9,500 edges,
0 NULL, mean 0.923); gauge holds 0; evaluator rules 16→17.
UVTS harness: corpus space lnl-demo-whk had been deleted with zero trace
(no UVTS run since 2026-05-04 measured anything real); restored by
operator-directed full reingest. A fresh baseline NUMBER remains blocked
by further live-found harness rot — grader/persist breakage, expected-
path format drift, vector post-filter dilution (service.go:1137 global
top-K then space filter) amplified by the duplicate whk-wms space —
complete defect inventory handed to UXTS-CI-001. Retrieval ranking on
the restored corpus verified correct (expected files at ranks 1-4).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(maint-live-001): sprint plan — scheduled maintenance actually runs (Epic 0)
Roadmap Q3 Phase 1 rank #4. Weekly decay+prune has never executed
(--dry-run defaults true; plist passes no override) while reporting
success — NOSILENT's blind spot. Tonight's Memory Bloat alerts (79k+
nodes) are the accumulated backlog. Safety verified in code before
planning: nodes are tombstoned (never deleted) with abstraction-chain/
degree/recency protections; edge deletion is the designed near-zero-
weight lifecycle, meaningful now that HIDDEN-WEIGHT made weights real.
Plan: live-by-default plist (+installed refresh), dry_run in job-event
metadata (no schema change — disclosed), maintenance_no_live_run
evaluator rule, darwin upgrade refreshes plists/hooks, first-ever live
run with preview-first protocol.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(maint-live-001): scheduled maintenance runs live — plist passes --dry-run=false (Epic 1)
The weekly LaunchAgent ran `mdemg maintenance` with no dry-run override;
the CLI defaults --dry-run=true, so every scheduled cycle previewed and
reported success — decay+prune NEVER executed (the 79k-node Memory
Bloat backlog). Both plist copies now pass --dry-run=false (the CLI
default stays true for safe manual previews — the SCHEDULE is what must
not silently no-op); installed plist refreshed + agent reloaded.
reportScheduledJobMeta threads job metadata into V0024; maintenance
records dry_run so the only-ever-dry-runs pattern is queryable
(metadata JSONB — no schema change, disclosed in the plan).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): maintenance_no_live_run evaluator rule (Epic 2)
Fires when maintenance rows exist in MAINT_LIVE_LOOKBACK_DAYS (default
8) but none ran live (success + metadata dry_run=false) — the only-
ever-dry-runs pattern self-reports instead of hiding inside "the job
ran". Distinct service maintenance-liveness per the cooldown rule.
Config: MAINT_LIVE_ALERT_ENABLED (true), MAINT_LIVE_LOOKBACK_DAYS (8).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): mdemg upgrade refreshes installed LaunchAgents + hooks (darwin) (Epic 3)
Plist/hook fixes shipped in releases but never reached installed
machines — the maintenance dry-run override would have sat unreachable
next to upgraded binaries forever. Upgrade now re-renders ALREADY-
INSTALLED mdemg LaunchAgents from the new binary's embedded templates
(refresh-only — never installs new services) + re-syncs mdemg-managed
Claude hooks in the current project (marker-checked). Substitution
logic single-sourced into renderLaunchdTemplate (Install + Refresh —
the drift class that exit-78'd the sidecar during HOOKSYNC live smoke).
Mirrors the existing Linux systemd-unit refresh.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): context-dependent orphan policy — --exclude-role-types (Epic 4a)
Orphan disposition is context-dependent (operator, 2026-06-11): a
uniform degree/age rule conflates governance constraints, conversation
history, test junk, and hierarchy debris. New --exclude-role-types on
prune + maintenance (env PRUNE_EXCLUDE_ROLE_TYPES) makes the policy
expressible; the scheduled plist ships
constraint,conversation_observation excluded per the operator's call
(constraints are load-bearing governance rules at any degree;
conversation observations differ by SESSION which the knob can't
express yet). Aged hierarchy debris stays eligible — that's the
lifecycle working. Candidate census that drove the decision: 5,388
conv-obs (9 eligible tonight under the 90d shield), 11 constraints,
238 hierarchy nodes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(prune): orphan sweeps use implicit transactions for batched deletes
Caught by the FIRST-EVER live maintenance run (MAINT-LIVE-001 Tier 3):
Neo4j raises TransactionStartFailed when a batched CALL-IN-TRANSACTIONS
statement executes inside an explicit transaction. Both orphan sweeps
(SymbolNode + Observation) ran their batched delete via ExecuteWrite;
the dry-run path never executes the deleting statement, so no preview
or unit test could surface it — only live execution. Switched to
session.Run (implicit tx). The failure ALSO proved the NOSILENT chain
live: the run fired "Scheduled job failed: maintenance" before exiting.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(maint-live-001): first live run verification + feature doc + CHANGELOG + close (Epics 4b-5)
First live maintenance in MDEMG history: 20,236 orphan SymbolNodes
deleted; all 5,010 tombstone candidates protected (recency + operator
exclusions); liveness rule born-firing → silenced by the real run; the
3-row job-event story (preview/true → failure/false alerted →
success/false) proves the dry_run plumbing through every path.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs: CLAUDE.md architecture note for MAINT-LIVE-001
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(embed-wire-001): sprint plan — embedder wiring + ingest exec resolution (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(embed-wire-001): breaker + recorder reach the real embedder through the wrapper chain (Epic 1)
The embedding circuit breaker was NEVER wired in any default deployment:
embeddings.New returns *CachedEmbedder when EMBEDDING_CACHE_ENABLED=true
(the default), so the server's emb.(*embeddings.OpenAI)/(*Ollama)
assertions on the OUTERMOST value failed silently (no else branch). The
recorder assertion had the inverse fragility (cache off → training-data
recording silently dies).
New: Unwrap() chain (CachedEmbedder joins RateLimitedEmbedder's existing
one) + embeddings.Base() / FindCached() interface-driven walkers — any
future wrapper joins by adding Unwrap(), no type lists. Wiring now walks
to the base for the breaker and to the cache layer for the recorder,
with LOUD warns when nothing matches. Tier 1 pins the production shape
(ratelimit(cache(provider))) plus cache-off and bare chains.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ingest-exec-001): server-triggered ingest resolves the mdemg binary — was hardcoded ./bin/mdemg (Epic 2)
Both ingest-job exec sites ran a relative "./bin/mdemg": broken in
Docker (the documented-primary deployment — binary at /usr/local/bin,
no repo checkout) and any CWD other than the repo root. New
resolveMdemgBin(): MDEMG_BIN env → os.Executable() (the server IS the
binary) → PATH → ./bin/mdemg legacy fallback; cached; Tier 1 pins the
order. Scheduled-sync jobs now report outcomes to scheduled_job_events
via jobhealth (job_name codebase-sync) — an unattended sync that keeps
failing is never silent; manual API jobs stay queue-visible only.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(embed-wire-001): live verification + CHANGELOG + CLAUDE.md + close (Epics 3-4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): sprint plan — documentation matches reality (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): CLAUDE.md FT section rewritten to post-pivot reality (Epic 1)
The section presented the abandoned Qwen3.6-35B-A3B MoE target, two-tier
MoE-Sieve strategy, and Sprint A→E critical path as CURRENT — all
superseded by the 2026-04-22 MoE→dense pivot; this stale text seeded the
Q3 roadmap audit with a dead architecture. Rewritten: shipped state
(dense Qwen3-14B mdemg-llm-v1, 0.8389, llama-server runtime), superseded
plan documented with the pivot rationale (never deleted — supersede-with-
pointer), guardrail llmclient exception marked CLOSED (re-verified in
code), memo-07 provenance break disclosed (the file never existed;
00_README_v2.md is canonical), open FT work named (FT-CLASSIFY-002 +
recursive-retraining trigger). Adapter env-name drift fixed
(MDEMG_ADAPTER_BASE, not MDEMG_MODEL_ADAPTER_BASE).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(doc-truth-001): operator-facing text matches the Phase 13.5 reality (Epic 2)
preflight errors directed operators to start the DECOMMISSIONED
mlx_lm.server on :8101 — following them reintroduces the crash-looping
stack Phase 13.5 replaced. Now: llama-server :8102 guidance (managed
service install + manual command), backend-agnostic wording. model.go
help text dropped three stale "deferred to MODEL-DIST-002" mentions
(shipped 2026-05-25). Operationally (untracked .env): removed the
J17_SIDECAR_TIMEOUT_MS=200 override that re-pinned the exact value
DH-004 remediated — the 1000ms default now applies; server restarted.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): 00_README STATUS block + AGENT_HANDOFF retired (Epic 3)
00_README_v2.md gains a top-of-file STATUS block: shipped-through-
cutover state, superseded MoE plan (FT-2 skip + FT-3 supersession +
R-LT-4 prototype-discipline adjudication recorded), the NOT-STARTED
recursive-retraining loop with its FT-CLASSIFY-002 trigger, and
provenance notes (memo-07 never existed; the spec is untracked pending
FG-2). AGENT_HANDOFF.md (stale since 2026-05-06) retired to a pointer
stub — handoff state lives in CLAUDE.md/roadmap/CHANGELOG/CMS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): grep-sweep proof + CHANGELOG + close (Epic 4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): last stale --adapter help string (sweep straggler)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-validate-001): sprint plan — fail-closed self-improvement (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): honest criteria evaluation — populated keys + fail-closed mutations (Epics 1-2)
The cycle baseline populated 10 metric keys while task criteria
referenced ~15 others (only volatile_count + correction_rate
intersected) → missing_data → skip → ~16/17 actions validated
vacuously; criteria-driven rollback was unreachable. The
SelfAssessmentReport already carried nearly every needed key — they
were never copied into the maps.
New single source reportMetricsMap() feeds BOTH MetricsBefore and
MetricsAfter (the mismatch class cannot recur), resolving
edges_below_threshold, total_edges, consolidation_age_sec,
avg_edge_weight, guidance_health, protocol_health + 13 more. Fail-
closed rule: for MUTATING actions (15-entry registry) a criterion with
missing evidence counts as NOT met ("missing_data_failclosed") — an
unverifiable mutation must never be recorded as success; observational
actions keep advisory semantics. The prior test pinned the vacuous
pass as the contract — updated to the honest one + advisory companion.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): tombstone_stale scoped to correction-linked nodes; refresh_stale_edges decays for real (Epic 3)
tombstone_stale archived 50 ARBITRARY older observations whenever ANY
correction existed in the 7-day window — no relationship between
correction and target. Now requires linkage: same session as the
correction OR its 1-hop CO_ACTIVATED_WITH neighborhood. Live check:
0 corrections in the current 7-day window, so both old and new scopes
are 0 RIGHT NOW — the hazard was conditional (any future correction
re-armed the old query against thousands of unrelated observations;
the new query bounds it to genuinely related nodes).
refresh_stale_edges bumped last_activated BEFORE the weight expression
read it → staleness=0 → the decay term vanished → every refresh was a
pure +0.1·log(count+1) boost. Staleness now captured via WITH before
SET; weights can genuinely decay. Both statements EXPLAIN-validated.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): counter-free confidence calibration — RSIC stops polluting its own signal (Epic 4)
RSIC-SK1 injected synthetic "followed"/"ignored" outcomes through
UpdateConfidence, incrementing total_surfaced/total_followed/
total_ignored — the exact counters GetConstraintEffectiveness reads
next cycle: measured effectiveness drove synthetic outcomes which drove
measured effectiveness (circular self-reinforcement). New
AdjustConfidenceDirect applies the clamp+archive confidence delta with
ZERO counter writes; the outcome counters now belong exclusively to
real guidance feedback. Provider interface + adapter + dispatcher use
the direct path with the configured boost/decay magnitudes; test mock
maps deltas back to outcome labels so existing assertions keep meaning.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-validate-001): Tier 3 verification + CHANGELOG + close (Epics 5-6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test(rsic-validate-001): integration seeds carry session linkage for the scoped tombstone contract
CI's TombstoneStaleEndToEnd + MultiActionDispatchAndMetrics failed
because the seeded observations had NO relationship to the seeded
corrections — under the old behavior they were archived anyway (the
memory-eroding bug the sprint removed); under the new correction-
linkage contract they are correctly spared. SeedObservationNodes now
stamps a per-space test session shared by corrections and their stale
peers, so the tests exercise the new contract. Query-level proof
against the exact seeded shape: 10/10 linked observations match the
scoped Cypher. (Local integration runs hit the 30s client timeout —
the loaded local stack's cycles take ~6 min; CI's arbitrates.)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rrf-scale-002): sprint plan — finish the score-scale contract (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): persistent rerank clients — failure alerting re-armed on the hottest LLM path (Epic 1)
doRerankWithOpenAI/doRerankWithOllama constructed a fresh llmclient per
call: the consecutive-failure counter reset every time, so
LLM_CONSECUTIVE_FAILURE_THRESHOLD could NEVER fire for
retrieval.rerank_cross / rerank_nli (a north-star distill task), and the
HTTP transport was discarded per call. Per-provider base clients now
init once (sync.Once); WithContext() shallow-copies and SHARES the
*atomic counter + breaker, so per-call contexts keep failure accounting.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): config-driven score thresholds — suggest revival, MCP tiers, guardrail floor (Epic 2)
Three score-literal leftovers from the RRF-SCALE-001 audit instruction:
(1) /v1/memory/suggest's hardcoded 0.5 min-confidence default filtered
nearly everything on a scale topping out ~0.58 → CONSULTING_SUGGEST_
MIN_CONFIDENCE (default 0.45, RRF-calibrated); (2) MCP memory_reflect
tiers 0.7/0.4 (high tier unreachable) → MCP_REFLECT_SCORE_HIGH/_MEDIUM
(0.45/0.25); (3) guardrail constraint-retrieval Cypher's hardcoded
sim > 0.3 → GUARDRAIL_CONSTRAINT_SIM_FLOOR via GuardrailConfig
(cosine-stable today but inside the class).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): CacheKey covers ALL result-affecting fields + two forcing functions (Epics 3-4)
CACHE-KEY-002: the key omitted result-affecting RetrieveRequest fields —
the audit named 5 (include/exclude_extensions, temporal_after/before,
policy_context); the new reflection forcing-function caught 8 MORE on
its first run: sparse-gate per-call overrides (SparseEnabled/
SparsePercentile/SparseOverridePresent/Category — the ?sparse= URL
params), pagination (Cursor/Limit), and the context-fingerprint params
(QueryContextFingerprint/StrictContextMode). All now keyed, plus a
caller-supplied query-embedding hash. Two requests differing in any of
these no longer collide on one cache entry.
Forcing functions: (1) reflection test — every RetrieveRequest field
must be in CacheKey or explicitly classified result-neutral with
justification (new fields fail until classified); (2) score-literal
scan — flags `.Score/score <op> 0.x` comparisons repo-wide outside a
justified allowlist (first run triaged 3 scale-local sites; clamp
guards excluded by pattern). The RRF-SCALE bug class is now CI-caught.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rrf-scale-002): live-calibrated suggest floor + CHANGELOG + close (Epics 5-6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): sprint plan — stable concept identity, two-PR delivery (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-churn-001): automated consolidation no longer skips LLM emergence (Epic A1)
dynamicEmergenceStep registers at phase 22, but RunConsolidation ran
hardcoded ranges (10,20) + (25,30) — phase 22 fell in the gap, so with
EMERGENCE_ENABLED=true the AUTOMATED path silently skipped LLM concept
emergence while the manual path (RunNodeCreationPipeline, 10–22 with an
emergence gate) ran it. RunConsolidation now delegates to
RunNodeCreationPipeline(cfg.EmergenceEnabled) — single range source; a
pin test fails if the step's phase ever leaves the range.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): stable theme identity — centroid match-or-create replaces the 5-minute churn (Epic A2)
ClusterConversations detached EVERY observation→theme edge, deleted
childless themes, and recreated all themes from scratch each ~5-min
cycle: new node_ids every run, evidence chains destroyed continuously,
recall flooded with stacks of near-identical concepts (observed live in
this session's own prompt headers).
New flow: cluster first → match each cluster to an EXISTING theme by
centroid cosine (HIDDEN_THEME_IDENTITY_SIM_THRESHOLD, default 0.90,
greedy with per-run claiming) → matched themes UPDATE in place
(props + theme-scoped member-edge rewire; node_id and all inbound
references survive) → unmatched clusters create as before → only themes
claimed by NO cluster are deleted. The global detach is gone.
ThemesUpdated added to the result. Tier 1: match/threshold/claimed/
best-of selection.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* chore(hidden-churn-001): remove the dead global-detach helper — the churn mechanism itself
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): PR-A verification + CHANGELOG (Epics A3-A4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): PR-B coverage retune — config ratio, density assignment, gauge + rule (Epic B1)
maxThemes was an inline ceil(n/10) equation → HIDDEN_THEME_TARGET_RATIO
(default preserves it). NOISE observations (previously dropped from the
hierarchy forever — the 94% coverage gap's mechanism) now density-assign
to their nearest theme when cosine ≥ HIDDEN_THEME_ASSIGN_SIM_THRESHOLD
(default 0.70; edges only, no new themes; below-floor stays unthemed
honestly). New per-space coverage gauge
mdemg_neo4j_conversation_coverage_ratio (collector Query 5) + evaluator
rule low_conversation_coverage (CONVERSATION_COVERAGE_ALERT_FLOOR 0.2,
6h ForDuration for convergence).
Audit bonus: caught WeightIntegrityRules querying metric_samples with
recorded_at — the column is `time`; the null-weight rule had been
silently erroring every evaluation since it shipped (Debug-only logging
— the SUPERVISOR-002 finding in action). Both rules fixed + a pin test
bans recorded_at against metric_samples.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): mdemg concepts repair + trace — grounding audit CLI (Epic B2)
repair: tombstones childless layer>=2 abstraction nodes (no inbound
ABSTRACTS_TO|GENERALIZES|GENERALIZES_TO — 10,395 live in mdemg-dev,
churn-era debris). Recoverable (is_archived=true + archived_reason),
batched, dry-run default, --limit for small-batch-first verification.
trace: per-node grounding audit — direct children, transitive per-layer
census, grounded/ungrounded verdict, sample path to L0.
Live data note: GENERALIZES alone over-counts (19,147) — ABSTRACTS_TO
is the hidden layer's actual child edge; pin test guards the predicate.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): surface themes_updated + noise_assigned in consolidate API + periodic log (Epic B3)
/v1/conversation/consolidate now reports themes_updated and
noise_assigned alongside themes_created. The periodic-consolidation
log condition also gains both — with stable theme identity (PR-A),
created is usually 0 on healthy cycles, which would have silenced the
success log entirely (the silent-success bug class).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-churn-001): live-smoke fixes — noise pool was structurally empty, clustering included archived debris, coverage gauge gated on min-obs
Three defects only the live run surfaced (Tier 3 forcing function):
1. KMeans never emits label -1, so the density-assignment hook received
an always-empty noise list; the min-samples/max-themes/nil-centroid
drops now feed their members into the noise pool instead of silently
excluding them from the hierarchy.
2. fetchClusterableConversationObservations had no is_archived filter —
it clustered 4,838 observations of which only 183 were live (MAINT-LIVE
tombstones), building themes on archived debris. Both fetch variants
now exclude archived. Live effect: 24 debris themes swept to 5 clean
ones; second cycle themes_updated=5/created=0 (stable identity on
real data).
3. Coverage gauge gated on CONVERSATION_COVERAGE_MIN_OBS (default 50,
DH-005 confidence-threshold pattern) — tiny scratch/test spaces
(2-13 observations) emitted 0.000 and would have alarmed forever
(born-firing alert hazard). Sentinel -1 skips emission.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): PR-B verification + CHANGELOG + CLAUDE.md — sprint complete (Epic B5)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(supervisor-002): sprint plan + background loop inventory (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): sliding-window restart budget + late registration (Epic 1)
The restart counter only ever incremented — a once-a-week transient
permanently killed a worker after 3 weeks. Budget is now a sliding
window (restarts older than the window are forgotten); permanent
failure requires >SUPERVISOR_MAX_RESTARTS within
SUPERVISOR_RESTART_WINDOW_MIN. New Go() registers+launches workers
after Start (the API server starts its loops late); nil return without
ctx cancellation now means intentional completion, not a restart.
Start() outlives dead workers so late workers stay supervised.
Config: SUPERVISOR_MAX_RESTARTS (3), SUPERVISOR_RESTART_WINDOW_MIN
(60), SUPERVISOR_BACKOFF_BASE_SEC (5).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): register the 12 unsupervised background loops (Epic 2)
Every scheduler/loop goroutine now runs under the goroutine supervisor
(panic recovery + sliding-window restart budget) instead of as a bare
go func() whose panic silently killed the subsystem forever:
- api.Server (6): periodic-consolidation, context-cooler,
space-prune-scheduler, weekly-gap-interviews, scheduled-sync,
rsic-macro-cron — via injected SetSupervisor(sup.Go) + goSupervised
helper (bgWg brackets each run; stop channels remain the graceful
path and return nil = no restart)
- ape (3): rsic-watchdog, rsic-store-flush, signal-learner-flush
- backup schedulers (2): neo4j-backup-scheduler, tsdb-backup-scheduler
— their NewServer construction-time Start() moved to
StartSupervisedBackground() (serve.go) so the hook exists first
- serve.go (1): llm-fastfail-burst-flush via sup.Go
All owners keep a nil-hook fallback (legacy bare goroutine), so tests
and non-server callers are unchanged. Buffered TSDB writers are
explicitly out of scope (TSDB-CONSUME-001 owns flush observability).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): rule-health meta-alert on evaluator query failures (Epic 3)
A rule whose SQL errors was a silently-disabled alert: failures were
logged at Debug and nothing watched the watcher — bitten twice in one
week (HIDDEN-WEIGHT-001 null-weight rule + the recorded_at column bug,
both found by accident in later sprints). Query failures now log at
Warn, and after ALERT_RULE_FAILURE_THRESHOLD (default 3) consecutive
failures a high-severity meta-alert fires directly via the dispatcher
(not via a rule — the meta-channel must not depend on the failing
mechanism). Service label is rule-health-<rule-id> so concurrent
failing rules don't cooldown-suppress each other; success re-arms.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(supervisor-002): recency-gate the RSIC llm_error_rate_spike insight (Epic 4)
Insight 26 computes the error rate over a 24h window with no recency
requirement, so a 35-min jiminy.synthesize timeout burst at 02:00 UTC
kept re-firing HIGH 'LLM error rate spike' (and escalating to CRITICAL
'Jiminy Pipeline Critical') every RSIC micro-cycle for 12+ hours after
the incident self-resolved (live, 2026-06-11). LLMPerformanceSummary
now carries LastErrorAt (MAX(time) over errored rows); the spike
insight fires only when the most recent error is within
RSIC_LLM_ERROR_RECENCY_MIN (default 60; 0 disables the gate). A zero
LastErrorAt (older data source) keeps legacy behavior — the gate never
widens silently.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(supervisor-002): nolint G118 on legacy-fallback loop launches
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): aggregate evaluator-degraded alert for global outages
A TSDB-level outage fails every rule at once; per-rule meta-alerts
would storm ~19 alerts duplicating the health prober's signal. At the
failure threshold the evaluator now distinguishes: other rules
succeeding recently → per-rule rule-health alert (broken SQL class);
nothing succeeding within threshold×interval → ONE
alert-evaluator-degraded alert per outage. Success re-arms both.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(jiminy): detach feedback outcome processing from the hook's connection lifetime
Live-smoke surprise during SUPERVISOR-002 Epic 5 (own fix-commit per
policy): jiminy.evaluate_llm was failing at 94.9% (657 'context
canceled' rows/24h). The post-tool-observe hook POSTs
/v1/jiminy/feedback with curl --max-time 5, but per-item Tier-2
outcome classification routinely outlives the connection — the request
ctx then cancelled every in-flight LLM call and outcomes silently
degraded to the keyword heuristic. Same defect class as
GUIDANCE-SYNTH-001's warm-path budget.
handleJiminyFeedback now uses context.WithoutCancel(r.Context()) with
its own server-side budget JIMINY_FEEDBACK_TIMEOUT_MS (default 60000,
0 = unbounded). The hook keeps its fire-and-forget 5s curl.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(supervisor-002): streak-relative global-outage discriminator (drill-caught)
The Epic 5 TSDB-stop drill caught the freshness-window heuristic
misclassifying outage ONSET: rules were succeeding seconds before the
stop, so lastAnySuccess was fresh when the first rules hit threshold —
2 per-rule alerts leaked before the aggregate fired. The discriminator
is now streak-relative: per-rule only when some other rule succeeded
AFTER this rule's failure streak began; otherwise global, once per
outage. Unit-pinned with the drill scenario
(TestRuleFailureStreak_OutageOnsetIsGlobal).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(supervisor-002): feature doc + verification + CHANGELOG + CLAUDE.md — sprint complete (Epic 6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(backup-restore-verify-001): sprint plan (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(backup-restore-verify-001): harden the restore path (Epics 1-4)
Epics 1-4 land together — they share the restore-path signatures.
1. Checksum gate (Epic 1): the manifest SHA-256 was written at backup
time and never read; a corrupted .mdemg restored silently. The gate
now fails closed before import; legacy manifests without a checksum
warn and proceed.
2. Snapshot completion polling (Epic 2): the pre-restore safety
snapshot was raced with time.Sleep(2s) against an async backup
goroutine. waitForBackupJob now polls the jobs queue until
completed, failing closed on failure/cancel/vanish/timeout
(BACKUP_SNAPSHOT_WAIT_TIMEOUT_SEC, default 300).
3. Count validation (Epic 3): manifest NodeCount/EdgeCount are
whole-database counts and cannot validate file contents (they
diverge on partial backups) — new additive file_node_count/
file_edge_count/file_observation_count manifest fields are counted
from the exported chunks; restore re-counts the file and hard-fails
on mismatch (truncation class). Importer accounting divergence
under CONFLICT_SKIP is warn-only, surfaced in a job-result
validation block.
4. dockerbin routing (Epic 4): the legacy .dump restore shelled out to
bare "docker" (the launchd-minimal-PATH class NOSILENT-001 fixed
for TSDB); now routes via dockerbin unless the operator set a
non-default FullCmd.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(backup-restore-verify-001): neo4j-backup jobhealth + generalized staleness rules (Epic 5)
The default-ON Neo4j backup scheduler had zero jobhealth coverage —
the inverse of NOSILENT-001 (which wired only tsdb-backup). The
scheduler now waits on each triggered job (its Trigger is queue-async;
a fire-and-forget report would always claim success) and reports
outcome via SetResultHook → jobhealth.Report with
job_name='neo4j-backup' (wired in SetTSDBClient next to the tsdb
hook). The staleness rule is generalized into a jobStalenessRule
factory; Neo4jBackupStalenessRule (neo4j_backup_no_recent_success,
Service scheduled-job-staleness-neo4j, window = partial interval × 2
unless BACKUP_JOB_STALENESS_HOURS overrides) registers when
BACKUP_ENABLED. The existing tsdb rule is pinned unchanged through the
refactor.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(backup-restore-verify-001): initial backup on start (rule honesty)
With the neo4j_backup_no_recent_success rule registered, a fresh
install would alarm honestly-but-noisily for up to 24h (the scheduler's
first tick). The scheduler now runs an initial partial backup
BACKUP_INITIAL_DELAY_MIN (default 5) minutes after start, so every
install has a backup — and a quiet staleness rule — within minutes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(backup-restore-verify-001): retention was deleting every backup it just made (drill-caught)
The Tier 3 round-trip exposed that the backup system was a no-op for
this database: BACKUP_RETENTION_MAX_STORAGE_GB had a comment/code
default drift (documented 50, code read 2), and with RunAfter=true the
quota pass deleted each 3-4 GB whole-database backup ~80 ms after it
completed (log: 'backup completed' → 'retention cleaned backups
deleted_count=1 freed_bytes=<exactly the new backup>'). Three fixes:
1. Quota retention NEVER deletes the newest backup of each type — a
quota smaller than one backup degrades to 'over quota, keep it'
with a loud warning, not 'delete the only backup'. Sparse-file unit
tests pin both the two-backup and only-backup-oversize shapes.
2. Default quota raised to the documented 50 GB.
3. BACKUP_SNAPSHOT_WAIT_TIMEOUT_SEC default 300 → 3600: the live
whole-database export runs ~15 min; the 5-min wait made the initial
scheduled run report failure (jobhealth correctly recorded it —
the wiring works) while the backup actually completed later.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(transfer): omit empty path/name on import — restores with observations always failed (drill-caught)
The Tier 3 round-trip's real restore failed with
ConstraintValidationFailed: conversation observations carry path=NULL
in Neo4j (which memorynode_path_unique (space_id, path) ignores), but
the exporter serializes NULL as the proto default "" and the importer
wrote the literal empty string unconditionally — so the second
observation node in any restore collided. Every restore containing 2+
observation nodes had always been broken; this was invisible because
no backup had ever been restore-tested (the sprint's premise,
demonstrated). nodeProps now omits empty path/name (null fidelity);
unit-pinned.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(backup-restore-verify-001): feature doc + verification + CHANGELOG + CLAUDE.md — sprint complete (Epic 7)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-storm-001): sprint plan with corrected burst attribution (Epic 0)
Triage correction baked in: the 5,397-node burst was the Context
Cooler via the session-start hook's /v1/conversation/graduate (uncapped
backlog sweep of pre-DH-004 graduation-bug victims), NOT RSIC —
mis-attributed because tombstone_stale stamps no metadata and two
archive-reason property names coexist. RSIC's own issues stand:
trigger-race cycle storm (~20-30k/day) and snapshot/executor predicate
drift (rollback restores nothing).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-storm-001): atomic trigger admission — reserve-on-allow (Epic A)
EvaluateTrigger checked activeCycles/lastTrigger, but both were written
only by RecordTrigger — which callers invoke AFTER RunCycle completes.
For a cycle's entire multi-second duration every concurrent trigger
passed every gate: ~20-30k micro cycles/day live (4 spawning within
50ms of each tool-use burst), the 300s cooldown effectively
nonexistent, llama-server saturated (the recurring synthesize/
evaluate_llm/intent_translate timeout cascades), and RSIC actions
dispatched at storm frequency.
Admission now reserves the active + cooldown (+dedupe) records under
the same lock that performs the checks; RecordTrigger updates the
reservation with the real cycle ID; CompleteCycle clears the active
slot; a failed cycle still cools down. Unit-pinned: 50 concurrent
triggers admit exactly one; cooldown holds from admission through
completion and failure.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-storm-001): attributable archival + unified tombstone predicate (Epics B+C)
Epic B — every archival is now attributable:
- tombstone_stale stamps archived_at + archive_reason
('rsic_tombstone_stale') + archived_cycle_id (bare is_archived made
the 2026-06-11 burst forensics mis-attribute the Context Cooler's
5,397-node sweep to RSIC for hours).
- Canonical property name is archive_reason; concepts.go (the one
archived_reason writer) migrates; historical rows keep the old name
(readers coalesce; no data migration).
- Context Cooler tombstone step capped per run
(COOLER_TOMBSTONE_MAX_PER_RUN, default 500; 0=unlimited) with a loud
cap-reached warning — the incident sweep was a single uncapped run
over the pre-DH-004 volatile backlog via the session-start hook's
graduate call.
Epic C — rollback restores the right nodes:
- The executor and the rollback snapshot now share ONE candidate
predicate (tombstoneStaleCandidates const). RSIC-VALIDATE-001 had
updated only the executor; the snapshot captured the old unlinked
set, so rollback restored nodes that were never archived
(restored_count=0 live). Drift class eliminated, pin-tested.
- Rollback also clears the new attribution fields on restore.
Epics share the tombstone Cypher — combined commit (disclosed).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-storm-001): feature doc + verification + rollback drill test + CHANGELOG + CLAUDE.md — sprint complete (Epic F)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-storm-001): commit ExecuteTombstoneStaleForTest wrapper missed from Epic F
The rollback drill test (committed in 2534a28) references this
test-support wrapper, but the Epic F git-add listed the test file and
not internal/ape/task_dispatch.go — CI's integration build failed on
the already-merged PR #435 while local builds passed (the method
existed uncommitted in the working tree). 7-line addition, no behavior
change.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(tsdb): initial backup on start — restart-resetting ticker meant zero backups ever
Alert triage on the (correctly firing) 'No Successful TSDB Backup In
Window' staleness alert found scheduled_job_events has ZERO tsdb-backup
rows: the scheduler's 24h ticker resets on every restart, and a server
restarted more often than the interval never backs up (8 restarts
today alone). Same gap BACKUP-RESTORE-VERIFY-001 fixed for the Neo4j
scheduler. The shared runOnce() now also fires
TSDB_BACKUP_INITIAL_DELAY_MIN (default 10; 0 disables) after start —
the staleness rule's 'never ran' guarantee did its job catching this.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(uats-gap-001): sprint plan (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(jiminy): reformulate returns 400 (not 500) for missing context
Live-caught while probing for the UATS-GAP-001 contract spec: the
service rejects an empty context but the handler surfaced that
request-validation failure as 500 'internal error'. Validate at the
edge like space_id. The /strict reformulation channel (prompt-context
hook in strict mode) was otherwise contract-clean.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(transfer): nil-guard edge identity assertions — one bad edge panicked the whole server (UATS-caught P0)
The UATS suite's backup_trigger spec ran a live export that hit an
edge whose endpoint returned nil fromId/toId — the unchecked
fromID.(string)/toID.(string)/relType.(string) assertions at
exporter.go:641-643 panicked, taking the entire server down mid-run
(launchd restarted it; 167 connection errors in the suite were the
visible symptom). An unexportable edge is now skipped with a warning;
the two parentVal assertion sites hardened with comma-ok for the same
class. An HTTP-triggerable request must never be able to kill the
process.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(uats-gap-001): 8 contract specs for the revived channels + suite hygiene (Epics 1-6)
New specs (27 cases, 100% live pass, hash-stamped): jiminy_strict,
jiminy_reformulate, jiminy_classify (incl. the fail-open contract),
jiminy_warm (202 warming|debounced union), jiminy_latest (warmth-state
union — the Follow-up C strict-JSON surface), admin_breakers_list,
admin_…
* feat(eventgraph-003): wire ApplyNegativeFeedback weaken path → reinforcement_events (Epic 3)
The weaken path (existing CO_ACTIVATED_WITH edge weakened by negWeight) now emits
trigger_path=apply_negative_feedback rows with a NEGATIVE delta_weight and
created_new_edge=false. The FOREACH writes (weaken SET + contradict MERGE) are
untouched; only the RETURN changed from aggregated `action,count(*)` to per-pair
rows — the Go side counts rows (sum = grouped count, NegativeFeedbackResult
preserved) and emits reinforcement events for weaken rows only. prevWeight is
captured before the FOREACH SET. Contradict path deliberately not emitted
(CONTRADICTS isn't traversed by the federation walk; deferred). EXPLAIN-validated;
build + lint clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(conversation): inject learning service so CoactivateSession actually runs
Discovered via EVENTGRAPH-003 live smoke: session co-activation
(CO_ACTIVATED_WITH edges between same-session conversation observations) had
NEVER fired — 0 such edges ever in mdemg-dev across 5495 conversation
observations. Root cause: conversation.NewServiceWithConfig sets
learningService=nil ("set via SetLearningService to avoid circular dependency"),
but SetLearningService had NO caller, so the `if s.learningService != nil` guard
in Observe() always skipped CoactivateSession. The function + its Cypher were
correct (verified by running it directly: 3 pairs, proper Hebbian weights) —
it was just never invoked.
Fix: convSvc.SetLearningService(lea) at construction. Live-verified: 3 distinct
observations in a session now create 6 CO_ACTIVATED_WITH edges + emit
coactivate_session reinforcement events. Standalone fix-commit per the
live-smoke precedent (surprise bugs don't get rolled into the sprint commit).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-003): Tier 3 verification + feature doc + CHANGELOG + close (Epic 4)
All four trigger_paths live-verified (apply_coactivation 50, apply_symbol_
coactivation 1000, apply_negative_feedback 1 negative-delta, coactivate_session
4 after the dormancy fix); federation CLI surfaces them. Feature doc updated to
all-four-paths + the trigger_path table; CHANGELOG Added (EVENTGRAPH-003) + Fixed
(CoactivateSession never-invoked); CLAUDE.md note + correction (CoactivateSession
was dead, not "writing via sidecar paths"); verification.md + post.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(eventgraph-004): sprint plan + CoactivateSession post-revival health review (Epic 0)
EVENTGRAPH-004 federates the last unfederated Hebbian write — the
ApplyNegativeFeedback contradict action — into reinforcement_events
(trigger_path=apply_negative_feedback_contradict). Data-decided scope:
reuse the existing V0022 sink (zero CONTRADICTS edges exist anywhere;
no producer calls /v1/learning/negative-feedback — instrument before
the producer arrives, the inverse of the dormancy pattern).
Also closes the EVENTGRAPH-003 follow-up: 30h post-fix health review of
the revived CoactivateSession path — no tuning needed, textbook session
cliques, pre-fix orphans stay as historical record (operator decision).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(eventgraph-004): wire ApplyNegativeFeedback contradict path → reinforcement_events (Epic 1)
The contradict action (no co-activation edge → MERGE CONTRADICTS) was the
last unfederated Hebbian write. The CONTRADICTS MERGE lived inside a
FOREACH, where the edge variable is invisible to RETURN — so the original
single statement is split into two statements in the SAME ExecuteWrite
transaction: (a) weaken (EVENTGRAPH-003 telemetry, RETURN unchanged) and
(b) contradict with a per-pair RETURN. Classification is identical: weaken
never deletes edges, so contradict's NOT EXISTS sees the same edge set the
original OPTIONAL MATCH did.
Contradict rows land with trigger_path=apply_negative_feedback_contradict.
created_new_edge detected via `c.updated_at IS NULL` (ON MATCH always sets
it; ON CREATE never does — invariant pinned by comment). delta_weight is
the CONTRADICTS edge's OWN weight delta (+negWeight on create, 0 on
re-match); negative-feedback semantics are carried by trigger_path, not
the sign.
Both statements EXPLAIN-validated against live Neo4j. Tier 1: 2 new parser
tests (create/re-match branches); learning suite green; lint clean.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): Tier 3 live verification — contradict create/re-match + weaken unchanged (Epic 2)
Live against the restarted Epic-1 binary: contradict create row
(+0.15, created_new_edge=true), re-match row (delta=0, evidence=2),
weaken row byte-equivalent to pre-split behavior (negative delta,
floor at 0). Federation CLI surfaces the new trigger_path with no
read-side change. UATS learning_negative_feedback 5/5 PASS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): feature doc + CHANGELOG + UATS pin + close (Epic 3)
Feature doc: 5-path trigger_path table + delta-semantics consumer
warning (contradict delta is the CONTRADICTS edge's own weight delta —
semantics live in trigger_path, not the sign). UATS spec extended:
zero-count equals assertions on nonexistent nodes (hash refreshed,
5/5 live). CLAUDE.md architecture note + producer-gap disclosure.
Sprint close in post.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci: auto-sync dev branch with main after each squash-merged PR
Squash merges never advance the dev branch's merge-base, so every
sprint touching CHANGELOG.md/CLAUDE.md hit CONFLICTING on its next PR
(first bitten: PR #419). New sync-dev-after-merge.yml merges main back
into the source *_dev* branch after each merged PR; the GITHUB_TOKEN
push triggers no other workflows, so it can never spawn an empty
auto-PR (the PR #420 failure mode). Conflicts fail loudly for manual
resolution; workflow_dispatch enables manual runs/live testing.
auto-pr.yml additionally skips PR creation when branch content is
identical to main — guards MANUAL sync pushes, verified against the
live repo state (current dev01 ≡ main → empty=true → skip).
actionlint clean (untrusted refs passed via env, not inline).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(roadmap): Q3 2026 vision-derived roadmap from 26-agent codebase deep-dive
Full-codebase review vs MDEMG's purpose (cognitive substrate / connection
layer): 19 map agents (3 vision + 16 subsystem), 3 cross-cutting assessors,
synthesizer + adversarial completeness critic (19 revisions applied).
Verdict: server-side substrate is mature, but the system is not currently
functioning as the assistant's internal dialogue — the per-prompt delivery
channel silently no-ops (hook reads .user_prompt, Claude Code sends
.prompt), 100% of GENERALIZES edges have NULL weight (22,170/22,170,
live-verified), scheduled decay/prune has been a permanent dry-run, RSIC
validates 16/17 actions vacuously, and supervision covers 3 of ~14
background loops. Every defect is the same disease: wired-looking seams
with no caller, wrong contract, or no reader.
4 phases ≈ 75 days committed: (1) reconnect the loop ends, (2) close the
learning loops, (3) survivability + class-ending forcing functions,
(4) FT frontier + release hygiene. Top-10 ranked; deferrals explicit.
Orchestrator spot-verification annex included (5 claims re-verified live).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): sprint plan — fix hook stdin contract, reconnect per-prompt channel (Epic 0)
Roadmap Q3 Phase 1 rank #1. Audit of all 6 hooks vs the actual Claude
Code stdin schemas: prompt-context.sh reads .user_prompt (CC sends
.prompt) → channel exits silently on every prompt; post-tool-observe.py
reads tool_output (CC sends tool_response) → false "Build/test
succeeded" observations with empty output; guidance wrongly coupled to
RESULT_COUNT>0; minor pre-compact transcript jq. session-start /
pre-bash-check / pre-write-check verified correct.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): prompt-context.sh reads .prompt — revive the per-prompt channel (Epic 1)
Claude Code's UserPromptSubmit stdin field is `prompt`; the hook read
`.user_prompt`, which is always empty → exit 0 → per-prompt CMS recall,
Jiminy guidance, /strict reformulation, the warm trigger, and the
retrieve-time Hebbian reinforcement have NEVER fired in any session.
Now reads `.prompt // .user_prompt` (legacy fallback kept).
Also decouples guidance from recall: the RESULT_COUNT=0 branch no longer
exits — it printed its notice then skipped guidance + warm + retrieval
reinforcement, coupling independent deliveries.
Both copies (live + installer template). Tier 1 simulated stdin: real
.prompt payload → first-ever guidance delivery (J17 T1 bootstrap + DICT,
5363 guidance bytes vs 0 forever); legacy fallback works; short/empty/
malformed payloads exit silently (fail-open preserved).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): post-tool-observe reads tool_response — end blind "succeeded" observations (Epic 2)
Claude Code's PostToolUse stdin field is `tool_response` (string or
object); the hook read `tool_output`, which is always absent → output_str
empty → error indicators never matched → every go build/go test/pytest
Bash call was recorded as "Build/test succeeded" sight-unseen, and real
errors were never observed.
Now reads tool_response (fallback tool_output) via _response_text(),
normalizing string|dict|list (stdout/stderr join). Success classification
requires NON-EMPTY clean output — a silent success records nothing rather
than fabricating; failures land as error observations with real stderr.
Both copies (template regenerated from fixed live, {{SPACE_ID}}
placeholder preserved, verified identical modulo placeholder). Tier 1
against real CMS: failing build → error obs with stderr; passing →
progress; empty → no record.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): pre-compact transcript extraction reads the real line shape (Epic 3)
Transcript lines are {type, message:{content:[{type, text|name, ...}]}};
the old top-level `.content` read always yielded empty, so pre-compaction
snapshots never carried recent-activity context. New jq walks
.message.content[] extracting .text/.name. Verified against this
session's real transcript (old: nothing; new: real activity). Both
copies, placeholders preserved.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): Tier 3 verification + CHANGELOG + CLAUDE.md contract pin + close (Epics 4-5)
Live in the real session: first-ever guidance delivery (J17 T1 bootstrap
+ DICT, 5363 bytes vs 0 forever); real failing build → error observation
with actual compiler output in CMS. PostToolUse success-only firing
documented as a limitation. Hook stdin contract pinned in CLAUDE.md.
Drift + clique-semantics findings logged for HOOKSYNC-001 / Phase 2.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): sprint plan — drift-proof + self-monitoring hook channel (Epic 0)
Roadmap Q3 Phase 1 rank #2. Investigation grounded all five findings:
template→live drift severed alert delivery (50-entry file actively
rotating today, never shown); no Cleared lifecycle (nothing sets the
field; no /v1/alert* endpoints); no absence detection for the channel
that just had a months-long silent outage; compose publishes 9999 on
0.0.0.0; neural sidecar binds 0.0.0.0:8101 via a 39-day-old process
serving pre-J17-fix code. 8 epics: reconcile, CI parity gate, clear
lifecycle, hook_events absence rule (reuses V0024 via jobhealth),
hooks doctor, PORT-TRUTH rider, Tier 3, docs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): reconcile bidirectional hook drift — alert delivery restored to live (Epic 1)
Live hooks adopted from templates (SPACE_ID substituted): restores the
alert-display blocks (all-pending per prompt; critical/high + degraded
healthz at session start) that the live copies lacked — the NOSILENT
last mile. Reverse drift caught during reconcile: the live hook's T1/T2
bootstrap-detection block (MAX_TIER → /v1/jiminy/bootstrap → ACTIVE
CONSTRAINTS header) never existed in the template and was nearly lost —
now single-sourced in the template and regenerated into live.
Live-verified: one prompt now renders alerts (50 pending incl. live
CRITICALs) + recall + J17:INIT bootstrap + guidance + synergy footer,
coexisting. All 6 hooks byte-identical to templates modulo {{SPACE_ID}}.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci(hooksync-001): hook-template parity gate — live hooks must match templates (Epic 2)
Mirrors the compose/launchd parity pattern: every *.sh/*.py template
must byte-match .claude/hooks/ modulo the {{SPACE_ID}} placeholder.
Proven locally: passes clean, fails (with a bounded diff dump) on
deliberate drift. Ends the bidirectional-drift class that severed
alert delivery and nearly lost the T1 bootstrap block.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): alert Cleared lifecycle — display once, then delivered (Epic 3)
Alert.Cleared existed but nothing ever set it: once hooks rendered the
file, the same entries would re-render every prompt forever. New:
FileBackend.Clear (ids and/or all_before cutoff, idempotent, under the
existing lock) → Dispatcher.ClearAlerts → POST /v1/alerts/clear. Hooks
now clear exactly what they displayed (fire-and-forget, fail-open);
cleared = delivered-to-operator, not resolved — persisting conditions
re-fire via the evaluator. Alert IDs now CUIDv2 per the identifier
standard (was UnixNano; old ids remain valid opaque strings).
Live-verified lifecycle: prompt 1 → "50 pending, showing 10" + 10
cleared; prompt 2 → "40 pending, showing 10" (next batch, no re-render)
→ 20 cleared. Tier 1: Clear by-id/by-time/idempotent/no-backend. UATS
alerts_clear 3/3 live (runner falsy-body inheritance discovered:
variant bodies must be non-empty objects).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): hook-channel absence detection — the channel now self-reports outages (Epic 4)
POST /v1/hooks/event records heartbeats into V0024 scheduled_job_events
via the jobhealth policy point (job_name hook:<name>; no new sink).
Two independent heartbeats: prompt-context fires per delivery (the
monitored channel); post-tool-observe fires throttled (HOOK_HEARTBEAT_
COOLDOWN_SEC, default 300 — proves sessions ACTIVE). Evaluator rule
hook_channel_silent (distinct service per the NOSILENT cooldown rule):
sessions active + zero prompt-context fires in HOOK_SILENT_LOOKBACK_
HOURS (24) → high alert. This is the "job never ran" guarantee applied
to the channel whose months-long outage HOOKWIRE-001 found only by
manual audit — the next contract drift self-reports.
Config: HOOK_HEALTH_ALERT_ENABLED (true), HOOK_SILENT_LOOKBACK_HOURS
(24), HOOK_ACTIVITY_MIN_EVENTS (5). Live-verified: real hook fires land
rows (session metadata, latency); throttle holds; rule SQL positive +
negative branches proven against the real table; UATS hooks_event 3/3.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): mdemg hooks doctor — one-shot hook-channel triage (Epic 5)
11 checks: per-hook template parity (the CI gate's local twin),
settings registration, server healthz, a stdin-contract self-test
piping a real-shape UserPromptSubmit payload through the installed
hook (asserts the always-present synergy footer), alert-file state
(pending/total), and the last hook:prompt-context heartbeat age from
scheduled_job_events (SKIP when TSDB unreachable). Table or --json;
non-zero exit on any FAIL.
Live: 11/11 PASS on this machine ("last fire 5s ago" — fed by the
doctor's own self-test); correctly fails (exit 1) on deliberate drift.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): PORT-TRUTH — loopback bind defaults + sidecar zombie replaced (Epic 6)
Compose published the API on 0.0.0.0 (unauthenticated admin/destructive
routes exposed off-host): now "${MDEMG_BIND_ADDR:-127.0.0.1}:${MDEMG_PORT}
:9999" — wide bind is an explicit opt-in (both compose copies, CI-synced).
Neural sidecar bound 0.0.0.0:8101 via config.py default AND the plist
arg: both now 127.0.0.1 (both plist copies, CI-synced; SIDECAR_HOST env
overrides). Operational: the 39-day-old sidecar process (started
2026-05-02, serving pre-J17-fix code) replaced — fresh process verified
on 127.0.0.1:8101, both models loaded, health 200.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): Tier 3 verification + feature doc + CHANGELOG + close (Epics 7-8)
Live-verified across the sprint: alert backlog drained 50→2 on real
prompts (display-then-clear); evaluator rules 15→16 (hook_channel_silent
loaded); doctor 11/11 + correct failure mode; sidecar fresh on
127.0.0.1:8101 (NLI 234ms). Feature doc docs/features/hook-channel-
health.md (config table incl. MDEMG_BIND_ADDR + SIDECAR_HOST). Findings:
packaging plists are templates (raw copy → launchd exit 78; service
install is canonical); UATS falsy-variant-body inheritance pinned.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(uats): jiminy_guide_sanitized timeout 30s → 90s — stale vs synthesis latency
Caught in the HOOKSYNC-001 full-suite regression: the synchronous
/v1/jiminy/guide includes local-model synthesis (~43s observed quiet,
~50s typical per GUIDANCE-SYNTH-001) — the spec's 30s timeout has been
silently erroring since synthesis latency grew. Aligned with the
JIMINY_WARM_COMPUTE_TIMEOUT_MS budget (90s); hash refreshed; passes
live. Pre-existing — not a HOOKSYNC regression (Guide path untouched).
The other 3 suite errors were load-induced flakes (pass individually):
suite-vs-llama-server slot contention, noted for UXTS-CI-001.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ci): track .claude/hooks/pre-write-check.py so hook-parity check passes
Root cause: the new 'Verify live hooks match hook templates' CI step
(HOOKSYNC-001) diffs every internal/cli/hook_templates/*.{sh,py} against
.claude/hooks/<name>, but the .gitignore allowlist only un-ignored the
5 original hooks. pre-write-check.py gained a template in this sprint
while its live counterpart stayed gitignored, so CI checked out a tree
without it and failed with 'MISSING live hook:
.claude/hooks/pre-write-check.py'.
Fix: add '!.claude/hooks/pre-write-check.py' to the allowlist and commit
the live hook (already byte-identical to its template modulo SPACE_ID),
preserving the full parity guarantee instead of weakening the CI step.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): sprint plan — real weights on the abstraction hierarchy (Epic 0)
Roadmap Q3 Phase 1 rank #3. Live investigation: point.distance() returns
NULL on embedding lists (proven: NULL where vector.similarity.cosine
returns 0.627 on the same pair); 3 creation sites affected incl. an
ABSTRACTS_TO site the audit missed. Scale worse than audited and
growing: 28,332/28,332 GENERALIZES + 36,110/37,996 ABSTRACTS_TO = 64,442
NULL-weight abstraction edges. Neo4j cosine returns [0,1] directly —
drop-in. Plan: fix sites (+ CUIDv2 edge ids), LIMIT-5-then-batched
backfill, null-weight gauge + alert rule via the existing graph-stats →
metric_samples path, UVTS-quick regression guard.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-weight-001): abstraction-edge weights — vector.similarity.cosine replaces point.distance (Epic 1)
point.distance() is a spatial-Point function: on embedding lists it
returns NULL, so every weight at the 3 abstraction-edge creation sites
was never set (100% of GENERALIZES + 95% of ABSTRACTS_TO weightless;
the CASE guards passed on good embeddings, then the THEN expr evaluated
NULL — edges with good embeddings got nothing while embedding-less ones
got the 0.5 fallback). vector.similarity.cosine returns [0,1] directly
(live-verified: identical=1.0, orthogonal=0.5, opposite=0.0). Site 1
(theme GENERALIZES) gains the null-guard it never had.
Also: edge_id randomUUID() → CUIDv2 per the identifier standard, minted
Go-side via memberEdgePairs (Cypher can't generate CUIDv2) and zipped
with member ids for UNWIND. All 3 statements EXPLAIN-validated live.
Tier 1: pair-builder tests (uniqueness, CUID format, empty input).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): mdemg graph backfill-weights — heal 56k NULL abstraction weights (Epic 2)
Standalone subcommand (deliberately NOT folded into `graph repair`,
whose orphan sweep would delete the pre-fix orphan observations the
operator chose to keep). Weight = vector.similarity.cosine(endpoint
embeddings) when both exist, else 0.5 (the creation sites' fallback);
similarity_score set alongside; idempotent (pure function of
embeddings); batched (default 1000/txn) with --limit for trials.
Executed per the small-batch-first rule: dry-run count → LIMIT-5 live
trial → hand-verified (stored ≡ independently recomputed to 6dp) →
distribution preview over 2000 (min 0.704, mean 0.96; the ~50% near-1.0
mass is single-member-cluster degeneracy — centroid ≡ member embedding,
HIDDEN-CHURN-001 territory, faithfully encoded) → full runs. Mid-run
the count GREW: the running server predated Epic 1 and kept minting
NULL edges — restarted on the fixed binary, swept stragglers, then
whk-wms (8,755) + linear (199). Final: 0 NULL / 57,395 edges globally.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): null-weight gauge + regression alert rule (Epic 3)
Query 4 in the graph-stats collector counts NULL-weight GENERALIZES/
ABSTRACTS_TO edges per space → new gauge
mdemg_neo4j_graph_null_weight_edges → metric_samples → evaluator rule
null_weight_abstraction_edges (service graph-weight-integrity, distinct
per the cooldown rule; NULL_WEIGHT_EDGE_ALERT_THRESHOLD default 100,
ForDuration 10m). Steady state post-backfill is 0; sustained
reappearance = the point.distance bug class regressed at a creation
site — it self-reports instead of waiting for the next audit.
Live: evaluator rules 16→17; gauge rows persisting at value 0 across
all spaces.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ingest): config-driven consolidation timeout — was sharing the 300s batch budget
Caught live during the HIDDEN-WEIGHT-001 corpus reingest: the post-ingest
/v1/memory/consolidate call used the shared batch-ingest client
(--timeout, 300s); consolidating a ~10k-node space exceeds that, so the
client reported failure while the server completed the work — the
GUIDANCE-SYNTH-001 bug class (long graph/LLM work needs its own budget).
New --consolidate-timeout flag / INGEST_CONSOLIDATE_TIMEOUT_SEC env
(default 1800s) with a dedicated client. Live-verified: "running
consolidation timeout_sec=1800" → complete.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): Tier 3 verification + corpus restoration + UVTS harness audit + close (Epics 4-5)
Tier 3: real consolidation minted edges with varied cosine weights
(0.83-0.94) + CUIDv2 ids; at-scale via the corpus reingest (9,500 edges,
0 NULL, mean 0.923); gauge holds 0; evaluator rules 16→17.
UVTS harness: corpus space lnl-demo-whk had been deleted with zero trace
(no UVTS run since 2026-05-04 measured anything real); restored by
operator-directed full reingest. A fresh baseline NUMBER remains blocked
by further live-found harness rot — grader/persist breakage, expected-
path format drift, vector post-filter dilution (service.go:1137 global
top-K then space filter) amplified by the duplicate whk-wms space —
complete defect inventory handed to UXTS-CI-001. Retrieval ranking on
the restored corpus verified correct (expected files at ranks 1-4).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(maint-live-001): sprint plan — scheduled maintenance actually runs (Epic 0)
Roadmap Q3 Phase 1 rank #4. Weekly decay+prune has never executed
(--dry-run defaults true; plist passes no override) while reporting
success — NOSILENT's blind spot. Tonight's Memory Bloat alerts (79k+
nodes) are the accumulated backlog. Safety verified in code before
planning: nodes are tombstoned (never deleted) with abstraction-chain/
degree/recency protections; edge deletion is the designed near-zero-
weight lifecycle, meaningful now that HIDDEN-WEIGHT made weights real.
Plan: live-by-default plist (+installed refresh), dry_run in job-event
metadata (no schema change — disclosed), maintenance_no_live_run
evaluator rule, darwin upgrade refreshes plists/hooks, first-ever live
run with preview-first protocol.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(maint-live-001): scheduled maintenance runs live — plist passes --dry-run=false (Epic 1)
The weekly LaunchAgent ran `mdemg maintenance` with no dry-run override;
the CLI defaults --dry-run=true, so every scheduled cycle previewed and
reported success — decay+prune NEVER executed (the 79k-node Memory
Bloat backlog). Both plist copies now pass --dry-run=false (the CLI
default stays true for safe manual previews — the SCHEDULE is what must
not silently no-op); installed plist refreshed + agent reloaded.
reportScheduledJobMeta threads job metadata into V0024; maintenance
records dry_run so the only-ever-dry-runs pattern is queryable
(metadata JSONB — no schema change, disclosed in the plan).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): maintenance_no_live_run evaluator rule (Epic 2)
Fires when maintenance rows exist in MAINT_LIVE_LOOKBACK_DAYS (default
8) but none ran live (success + metadata dry_run=false) — the only-
ever-dry-runs pattern self-reports instead of hiding inside "the job
ran". Distinct service maintenance-liveness per the cooldown rule.
Config: MAINT_LIVE_ALERT_ENABLED (true), MAINT_LIVE_LOOKBACK_DAYS (8).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): mdemg upgrade refreshes installed LaunchAgents + hooks (darwin) (Epic 3)
Plist/hook fixes shipped in releases but never reached installed
machines — the maintenance dry-run override would have sat unreachable
next to upgraded binaries forever. Upgrade now re-renders ALREADY-
INSTALLED mdemg LaunchAgents from the new binary's embedded templates
(refresh-only — never installs new services) + re-syncs mdemg-managed
Claude hooks in the current project (marker-checked). Substitution
logic single-sourced into renderLaunchdTemplate (Install + Refresh —
the drift class that exit-78'd the sidecar during HOOKSYNC live smoke).
Mirrors the existing Linux systemd-unit refresh.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): context-dependent orphan policy — --exclude-role-types (Epic 4a)
Orphan disposition is context-dependent (operator, 2026-06-11): a
uniform degree/age rule conflates governance constraints, conversation
history, test junk, and hierarchy debris. New --exclude-role-types on
prune + maintenance (env PRUNE_EXCLUDE_ROLE_TYPES) makes the policy
expressible; the scheduled plist ships
constraint,conversation_observation excluded per the operator's call
(constraints are load-bearing governance rules at any degree;
conversation observations differ by SESSION which the knob can't
express yet). Aged hierarchy debris stays eligible — that's the
lifecycle working. Candidate census that drove the decision: 5,388
conv-obs (9 eligible tonight under the 90d shield), 11 constraints,
238 hierarchy nodes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(prune): orphan sweeps use implicit transactions for batched deletes
Caught by the FIRST-EVER live maintenance run (MAINT-LIVE-001 Tier 3):
Neo4j raises TransactionStartFailed when a batched CALL-IN-TRANSACTIONS
statement executes inside an explicit transaction. Both orphan sweeps
(SymbolNode + Observation) ran their batched delete via ExecuteWrite;
the dry-run path never executes the deleting statement, so no preview
or unit test could surface it — only live execution. Switched to
session.Run (implicit tx). The failure ALSO proved the NOSILENT chain
live: the run fired "Scheduled job failed: maintenance" before exiting.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(maint-live-001): first live run verification + feature doc + CHANGELOG + close (Epics 4b-5)
First live maintenance in MDEMG history: 20,236 orphan SymbolNodes
deleted; all 5,010 tombstone candidates protected (recency + operator
exclusions); liveness rule born-firing → silenced by the real run; the
3-row job-event story (preview/true → failure/false alerted →
success/false) proves the dry_run plumbing through every path.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs: CLAUDE.md architecture note for MAINT-LIVE-001
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(embed-wire-001): sprint plan — embedder wiring + ingest exec resolution (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(embed-wire-001): breaker + recorder reach the real embedder through the wrapper chain (Epic 1)
The embedding circuit breaker was NEVER wired in any default deployment:
embeddings.New returns *CachedEmbedder when EMBEDDING_CACHE_ENABLED=true
(the default), so the server's emb.(*embeddings.OpenAI)/(*Ollama)
assertions on the OUTERMOST value failed silently (no else branch). The
recorder assertion had the inverse fragility (cache off → training-data
recording silently dies).
New: Unwrap() chain (CachedEmbedder joins RateLimitedEmbedder's existing
one) + embeddings.Base() / FindCached() interface-driven walkers — any
future wrapper joins by adding Unwrap(), no type lists. Wiring now walks
to the base for the breaker and to the cache layer for the recorder,
with LOUD warns when nothing matches. Tier 1 pins the production shape
(ratelimit(cache(provider))) plus cache-off and bare chains.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ingest-exec-001): server-triggered ingest resolves the mdemg binary — was hardcoded ./bin/mdemg (Epic 2)
Both ingest-job exec sites ran a relative "./bin/mdemg": broken in
Docker (the documented-primary deployment — binary at /usr/local/bin,
no repo checkout) and any CWD other than the repo root. New
resolveMdemgBin(): MDEMG_BIN env → os.Executable() (the server IS the
binary) → PATH → ./bin/mdemg legacy fallback; cached; Tier 1 pins the
order. Scheduled-sync jobs now report outcomes to scheduled_job_events
via jobhealth (job_name codebase-sync) — an unattended sync that keeps
failing is never silent; manual API jobs stay queue-visible only.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(embed-wire-001): live verification + CHANGELOG + CLAUDE.md + close (Epics 3-4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): sprint plan — documentation matches reality (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): CLAUDE.md FT section rewritten to post-pivot reality (Epic 1)
The section presented the abandoned Qwen3.6-35B-A3B MoE target, two-tier
MoE-Sieve strategy, and Sprint A→E critical path as CURRENT — all
superseded by the 2026-04-22 MoE→dense pivot; this stale text seeded the
Q3 roadmap audit with a dead architecture. Rewritten: shipped state
(dense Qwen3-14B mdemg-llm-v1, 0.8389, llama-server runtime), superseded
plan documented with the pivot rationale (never deleted — supersede-with-
pointer), guardrail llmclient exception marked CLOSED (re-verified in
code), memo-07 provenance break disclosed (the file never existed;
00_README_v2.md is canonical), open FT work named (FT-CLASSIFY-002 +
recursive-retraining trigger). Adapter env-name drift fixed
(MDEMG_ADAPTER_BASE, not MDEMG_MODEL_ADAPTER_BASE).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(doc-truth-001): operator-facing text matches the Phase 13.5 reality (Epic 2)
preflight errors directed operators to start the DECOMMISSIONED
mlx_lm.server on :8101 — following them reintroduces the crash-looping
stack Phase 13.5 replaced. Now: llama-server :8102 guidance (managed
service install + manual command), backend-agnostic wording. model.go
help text dropped three stale "deferred to MODEL-DIST-002" mentions
(shipped 2026-05-25). Operationally (untracked .env): removed the
J17_SIDECAR_TIMEOUT_MS=200 override that re-pinned the exact value
DH-004 remediated — the 1000ms default now applies; server restarted.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): 00_README STATUS block + AGENT_HANDOFF retired (Epic 3)
00_README_v2.md gains a top-of-file STATUS block: shipped-through-
cutover state, superseded MoE plan (FT-2 skip + FT-3 supersession +
R-LT-4 prototype-discipline adjudication recorded), the NOT-STARTED
recursive-retraining loop with its FT-CLASSIFY-002 trigger, and
provenance notes (memo-07 never existed; the spec is untracked pending
FG-2). AGENT_HANDOFF.md (stale since 2026-05-06) retired to a pointer
stub — handoff state lives in CLAUDE.md/roadmap/CHANGELOG/CMS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): grep-sweep proof + CHANGELOG + close (Epic 4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): last stale --adapter help string (sweep straggler)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-validate-001): sprint plan — fail-closed self-improvement (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): honest criteria evaluation — populated keys + fail-closed mutations (Epics 1-2)
The cycle baseline populated 10 metric keys while task criteria
referenced ~15 others (only volatile_count + correction_rate
intersected) → missing_data → skip → ~16/17 actions validated
vacuously; criteria-driven rollback was unreachable. The
SelfAssessmentReport already carried nearly every needed key — they
were never copied into the maps.
New single source reportMetricsMap() feeds BOTH MetricsBefore and
MetricsAfter (the mismatch class cannot recur), resolving
edges_below_threshold, total_edges, consolidation_age_sec,
avg_edge_weight, guidance_health, protocol_health + 13 more. Fail-
closed rule: for MUTATING actions (15-entry registry) a criterion with
missing evidence counts as NOT met ("missing_data_failclosed") — an
unverifiable mutation must never be recorded as success; observational
actions keep advisory semantics. The prior test pinned the vacuous
pass as the contract — updated to the honest one + advisory companion.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): tombstone_stale scoped to correction-linked nodes; refresh_stale_edges decays for real (Epic 3)
tombstone_stale archived 50 ARBITRARY older observations whenever ANY
correction existed in the 7-day window — no relationship between
correction and target. Now requires linkage: same session as the
correction OR its 1-hop CO_ACTIVATED_WITH neighborhood. Live check:
0 corrections in the current 7-day window, so both old and new scopes
are 0 RIGHT NOW — the hazard was conditional (any future correction
re-armed the old query against thousands of unrelated observations;
the new query bounds it to genuinely related nodes).
refresh_stale_edges bumped last_activated BEFORE the weight expression
read it → staleness=0 → the decay term vanished → every refresh was a
pure +0.1·log(count+1) boost. Staleness now captured via WITH before
SET; weights can genuinely decay. Both statements EXPLAIN-validated.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): counter-free confidence calibration — RSIC stops polluting its own signal (Epic 4)
RSIC-SK1 injected synthetic "followed"/"ignored" outcomes through
UpdateConfidence, incrementing total_surfaced/total_followed/
total_ignored — the exact counters GetConstraintEffectiveness reads
next cycle: measured effectiveness drove synthetic outcomes which drove
measured effectiveness (circular self-reinforcement). New
AdjustConfidenceDirect applies the clamp+archive confidence delta with
ZERO counter writes; the outcome counters now belong exclusively to
real guidance feedback. Provider interface + adapter + dispatcher use
the direct path with the configured boost/decay magnitudes; test mock
maps deltas back to outcome labels so existing assertions keep meaning.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-validate-001): Tier 3 verification + CHANGELOG + close (Epics 5-6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test(rsic-validate-001): integration seeds carry session linkage for the scoped tombstone contract
CI's TombstoneStaleEndToEnd + MultiActionDispatchAndMetrics failed
because the seeded observations had NO relationship to the seeded
corrections — under the old behavior they were archived anyway (the
memory-eroding bug the sprint removed); under the new correction-
linkage contract they are correctly spared. SeedObservationNodes now
stamps a per-space test session shared by corrections and their stale
peers, so the tests exercise the new contract. Query-level proof
against the exact seeded shape: 10/10 linked observations match the
scoped Cypher. (Local integration runs hit the 30s client timeout —
the loaded local stack's cycles take ~6 min; CI's arbitrates.)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rrf-scale-002): sprint plan — finish the score-scale contract (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): persistent rerank clients — failure alerting re-armed on the hottest LLM path (Epic 1)
doRerankWithOpenAI/doRerankWithOllama constructed a fresh llmclient per
call: the consecutive-failure counter reset every time, so
LLM_CONSECUTIVE_FAILURE_THRESHOLD could NEVER fire for
retrieval.rerank_cross / rerank_nli (a north-star distill task), and the
HTTP transport was discarded per call. Per-provider base clients now
init once (sync.Once); WithContext() shallow-copies and SHARES the
*atomic counter + breaker, so per-call contexts keep failure accounting.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): config-driven score thresholds — suggest revival, MCP tiers, guardrail floor (Epic 2)
Three score-literal leftovers from the RRF-SCALE-001 audit instruction:
(1) /v1/memory/suggest's hardcoded 0.5 min-confidence default filtered
nearly everything on a scale topping out ~0.58 → CONSULTING_SUGGEST_
MIN_CONFIDENCE (default 0.45, RRF-calibrated); (2) MCP memory_reflect
tiers 0.7/0.4 (high tier unreachable) → MCP_REFLECT_SCORE_HIGH/_MEDIUM
(0.45/0.25); (3) guardrail constraint-retrieval Cypher's hardcoded
sim > 0.3 → GUARDRAIL_CONSTRAINT_SIM_FLOOR via GuardrailConfig
(cosine-stable today but inside the class).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): CacheKey covers ALL result-affecting fields + two forcing functions (Epics 3-4)
CACHE-KEY-002: the key omitted result-affecting RetrieveRequest fields —
the audit named 5 (include/exclude_extensions, temporal_after/before,
policy_context); the new reflection forcing-function caught 8 MORE on
its first run: sparse-gate per-call overrides (SparseEnabled/
SparsePercentile/SparseOverridePresent/Category — the ?sparse= URL
params), pagination (Cursor/Limit), and the context-fingerprint params
(QueryContextFingerprint/StrictContextMode). All now keyed, plus a
caller-supplied query-embedding hash. Two requests differing in any of
these no longer collide on one cache entry.
Forcing functions: (1) reflection test — every RetrieveRequest field
must be in CacheKey or explicitly classified result-neutral with
justification (new fields fail until classified); (2) score-literal
scan — flags `.Score/score <op> 0.x` comparisons repo-wide outside a
justified allowlist (first run triaged 3 scale-local sites; clamp
guards excluded by pattern). The RRF-SCALE bug class is now CI-caught.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rrf-scale-002): live-calibrated suggest floor + CHANGELOG + close (Epics 5-6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): sprint plan — stable concept identity, two-PR delivery (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-churn-001): automated consolidation no longer skips LLM emergence (Epic A1)
dynamicEmergenceStep registers at phase 22, but RunConsolidation ran
hardcoded ranges (10,20) + (25,30) — phase 22 fell in the gap, so with
EMERGENCE_ENABLED=true the AUTOMATED path silently skipped LLM concept
emergence while the manual path (RunNodeCreationPipeline, 10–22 with an
emergence gate) ran it. RunConsolidation now delegates to
RunNodeCreationPipeline(cfg.EmergenceEnabled) — single range source; a
pin test fails if the step's phase ever leaves the range.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): stable theme identity — centroid match-or-create replaces the 5-minute churn (Epic A2)
ClusterConversations detached EVERY observation→theme edge, deleted
childless themes, and recreated all themes from scratch each ~5-min
cycle: new node_ids every run, evidence chains destroyed continuously,
recall flooded with stacks of near-identical concepts (observed live in
this session's own prompt headers).
New flow: cluster first → match each cluster to an EXISTING theme by
centroid cosine (HIDDEN_THEME_IDENTITY_SIM_THRESHOLD, default 0.90,
greedy with per-run claiming) → matched themes UPDATE in place
(props + theme-scoped member-edge rewire; node_id and all inbound
references survive) → unmatched clusters create as before → only themes
claimed by NO cluster are deleted. The global detach is gone.
ThemesUpdated added to the result. Tier 1: match/threshold/claimed/
best-of selection.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* chore(hidden-churn-001): remove the dead global-detach helper — the churn mechanism itself
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): PR-A verification + CHANGELOG (Epics A3-A4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): PR-B coverage retune — config ratio, density assignment, gauge + rule (Epic B1)
maxThemes was an inline ceil(n/10) equation → HIDDEN_THEME_TARGET_RATIO
(default preserves it). NOISE observations (previously dropped from the
hierarchy forever — the 94% coverage gap's mechanism) now density-assign
to their nearest theme when cosine ≥ HIDDEN_THEME_ASSIGN_SIM_THRESHOLD
(default 0.70; edges only, no new themes; below-floor stays unthemed
honestly). New per-space coverage gauge
mdemg_neo4j_conversation_coverage_ratio (collector Query 5) + evaluator
rule low_conversation_coverage (CONVERSATION_COVERAGE_ALERT_FLOOR 0.2,
6h ForDuration for convergence).
Audit bonus: caught WeightIntegrityRules querying metric_samples with
recorded_at — the column is `time`; the null-weight rule had been
silently erroring every evaluation since it shipped (Debug-only logging
— the SUPERVISOR-002 finding in action). Both rules fixed + a pin test
bans recorded_at against metric_samples.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): mdemg concepts repair + trace — grounding audit CLI (Epic B2)
repair: tombstones childless layer>=2 abstraction nodes (no inbound
ABSTRACTS_TO|GENERALIZES|GENERALIZES_TO — 10,395 live in mdemg-dev,
churn-era debris). Recoverable (is_archived=true + archived_reason),
batched, dry-run default, --limit for small-batch-first verification.
trace: per-node grounding audit — direct children, transitive per-layer
census, grounded/ungrounded verdict, sample path to L0.
Live data note: GENERALIZES alone over-counts (19,147) — ABSTRACTS_TO
is the hidden layer's actual child edge; pin test guards the predicate.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): surface themes_updated + noise_assigned in consolidate API + periodic log (Epic B3)
/v1/conversation/consolidate now reports themes_updated and
noise_assigned alongside themes_created. The periodic-consolidation
log condition also gains both — with stable theme identity (PR-A),
created is usually 0 on healthy cycles, which would have silenced the
success log entirely (the silent-success bug class).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-churn-001): live-smoke fixes — noise pool was structurally empty, clustering included archived debris, coverage gauge gated on min-obs
Three defects only the live run surfaced (Tier 3 forcing function):
1. KMeans never emits label -1, so the density-assignment hook received
an always-empty noise list; the min-samples/max-themes/nil-centroid
drops now feed their members into the noise pool instead of silently
excluding them from the hierarchy.
2. fetchClusterableConversationObservations had no is_archived filter —
it clustered 4,838 observations of which only 183 were live (MAINT-LIVE
tombstones), building themes on archived debris. Both fetch variants
now exclude archived. Live effect: 24 debris themes swept to 5 clean
ones; second cycle themes_updated=5/created=0 (stable identity on
real data).
3. Coverage gauge gated on CONVERSATION_COVERAGE_MIN_OBS (default 50,
DH-005 confidence-threshold pattern) — tiny scratch/test spaces
(2-13 observations) emitted 0.000 and would have alarmed forever
(born-firing alert hazard). Sentinel -1 skips emission.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): PR-B verification + CHANGELOG + CLAUDE.md — sprint complete (Epic B5)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(supervisor-002): sprint plan + background loop inventory (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): sliding-window restart budget + late registration (Epic 1)
The restart counter only ever incremented — a once-a-week transient
permanently killed a worker after 3 weeks. Budget is now a sliding
window (restarts older than the window are forgotten); permanent
failure requires >SUPERVISOR_MAX_RESTARTS within
SUPERVISOR_RESTART_WINDOW_MIN. New Go() registers+launches workers
after Start (the API server starts its loops late); nil return without
ctx cancellation now means intentional completion, not a restart.
Start() outlives dead workers so late workers stay supervised.
Config: SUPERVISOR_MAX_RESTARTS (3), SUPERVISOR_RESTART_WINDOW_MIN
(60), SUPERVISOR_BACKOFF_BASE_SEC (5).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): register the 12 unsupervised background loops (Epic 2)
Every scheduler/loop goroutine now runs under the goroutine supervisor
(panic recovery + sliding-window restart budget) instead of as a bare
go func() whose panic silently killed the subsystem forever:
- api.Server (6): periodic-consolidation, context-cooler,
space-prune-scheduler, weekly-gap-interviews, scheduled-sync,
rsic-macro-cron — via injected SetSupervisor(sup.Go) + goSupervised
helper (bgWg brackets each run; stop channels remain the graceful
path and return nil = no restart)
- ape (3): rsic-watchdog, rsic-store-flush, signal-learner-flush
- backup schedulers (2): neo4j-backup-scheduler, tsdb-backup-scheduler
— their NewServer construction-time Start() moved to
StartSupervisedBackground() (serve.go) so the hook exists first
- serve.go (1): llm-fastfail-burst-flush via sup.Go
All owners keep a nil-hook fallback (legacy bare goroutine), so tests
and non-server callers are unchanged. Buffered TSDB writers are
explicitly out of scope (TSDB-CONSUME-001 owns flush observability).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): rule-health meta-alert on evaluator query failures (Epic 3)
A rule whose SQL errors was a silently-disabled alert: failures were
logged at Debug and nothing watched the watcher — bitten twice in one
week (HIDDEN-WEIGHT-001 null-weight rule + the recorded_at column bug,
both found by accident in later sprints). Query failures now log at
Warn, and after ALERT_RULE_FAILURE_THRESHOLD (default 3) consecutive
failures a high-severity meta-alert fires directly via the dispatcher
(not via a rule — the meta-channel must not depend on the failing
mechanism). Service label is rule-health-<rule-id> so concurrent
failing rules don't cooldown-suppress each other; success re-arms.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(supervisor-002): recency-gate the RSIC llm_error_rate_spike insight (Epic 4)
Insight 26 computes the error rate over a 24h window with no recency
requirement, so a 35-min jiminy.synthesize timeout burst at 02:00 UTC
kept re-firing HIGH 'LLM error rate spike' (and escalating to CRITICAL
'Jiminy Pipeline Critical') every RSIC micro-cycle for 12+ hours after
the incident self-resolved (live, 2026-06-11). LLMPerformanceSummary
now carries LastErrorAt (MAX(time) over errored rows); the spike
insight fires only when the most recent error is within
RSIC_LLM_ERROR_RECENCY_MIN (default 60; 0 disables the gate). A zero
LastErrorAt (older data source) keeps legacy behavior — the gate never
widens silently.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(supervisor-002): nolint G118 on legacy-fallback loop launches
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): aggregate evaluator-degraded alert for global outages
A TSDB-level outage fails every rule at once; per-rule meta-alerts
would storm ~19 alerts duplicating the health prober's signal. At the
failure threshold the evaluator now distinguishes: other rules
succeeding recently → per-rule rule-health alert (broken SQL class);
nothing succeeding within threshold×interval → ONE
alert-evaluator-degraded alert per outage. Success re-arms both.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(jiminy): detach feedback outcome processing from the hook's connection lifetime
Live-smoke surprise during SUPERVISOR-002 Epic 5 (own fix-commit per
policy): jiminy.evaluate_llm was failing at 94.9% (657 'context
canceled' rows/24h). The post-tool-observe hook POSTs
/v1/jiminy/feedback with curl --max-time 5, but per-item Tier-2
outcome classification routinely outlives the connection — the request
ctx then cancelled every in-flight LLM call and outcomes silently
degraded to the keyword heuristic. Same defect class as
GUIDANCE-SYNTH-001's warm-path budget.
handleJiminyFeedback now uses context.WithoutCancel(r.Context()) with
its own server-side budget JIMINY_FEEDBACK_TIMEOUT_MS (default 60000,
0 = unbounded). The hook keeps its fire-and-forget 5s curl.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(supervisor-002): streak-relative global-outage discriminator (drill-caught)
The Epic 5 TSDB-stop drill caught the freshness-window heuristic
misclassifying outage ONSET: rules were succeeding seconds before the
stop, so lastAnySuccess was fresh when the first rules hit threshold —
2 per-rule alerts leaked before the aggregate fired. The discriminator
is now streak-relative: per-rule only when some other rule succeeded
AFTER this rule's failure streak began; otherwise global, once per
outage. Unit-pinned with the drill scenario
(TestRuleFailureStreak_OutageOnsetIsGlobal).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(supervisor-002): feature doc + verification + CHANGELOG + CLAUDE.md — sprint complete (Epic 6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(backup-restore-verify-001): sprint plan (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(backup-restore-verify-001): harden the restore path (Epics 1-4)
Epics 1-4 land together — they share the restore-path signatures.
1. Checksum gate (Epic 1): the manifest SHA-256 was written at backup
time and never read; a corrupted .mdemg restored silently. The gate
now fails closed before import; legacy manifests without a checksum
warn and proceed.
2. Snapshot completion polling (Epic 2): the pre-restore safety
snapshot was raced with time.Sleep(2s) against an async backup
goroutine. waitForBackupJob now polls the jobs queue until
completed, failing closed on failure/cancel/vanish/timeout
(BACKUP_SNAPSHOT_WAIT_TIMEOUT_SEC, default 300).
3. Count validation (Epic 3): manifest NodeCount/EdgeCount are
whole-database counts and cannot validate file contents (they
diverge on partial backups) — new additive file_node_count/
file_edge_count/file_observation_count manifest fields are counted
from the exported chunks; restore re-counts the file and hard-fails
on mismatch (truncation class). Importer accounting divergence
under CONFLICT_SKIP is warn-only, surfaced in a job-result
validation block.
4. dockerbin routing (Epic 4): the legacy .dump restore shelled out to
bare "docker" (the launchd-minimal-PATH class NOSILENT-001 fixed
for TSDB); now routes via dockerbin unless the operator set a
non-default FullCmd.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(backup-restore-verify-001): neo4j-backup jobhealth + generalized staleness rules (Epic 5)
The default-ON Neo4j backup scheduler had zero jobhealth coverage —
the inverse of NOSILENT-001 (which wired only tsdb-backup). The
scheduler now waits on each triggered job (its Trigger is queue-async;
a fire-and-forget report would always claim success) and reports
outcome via SetResultHook → jobhealth.Report with
job_name='neo4j-backup' (wired in SetTSDBClient next to the tsdb
hook). The staleness rule is generalized into a jobStalenessRule
factory; Neo4jBackupStalenessRule (neo4j_backup_no_recent_success,
Service scheduled-job-staleness-neo4j, window = partial interval × 2
unless BACKUP_JOB_STALENESS_HOURS overrides) registers when
BACKUP_ENABLED. The existing tsdb rule is pinned unchanged through the
refactor.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(backup-restore-verify-001): initial backup on start (rule honesty)
With the neo4j_backup_no_recent_success rule registered, a fresh
install would alarm honestly-but-noisily for up to 24h (the scheduler's
first tick). The scheduler now runs an initial partial backup
BACKUP_INITIAL_DELAY_MIN (default 5) minutes after start, so every
install has a backup — and a quiet staleness rule — within minutes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(backup-restore-verify-001): retention was deleting every backup it just made (drill-caught)
The Tier 3 round-trip exposed that the backup system was a no-op for
this database: BACKUP_RETENTION_MAX_STORAGE_GB had a comment/code
default drift (documented 50, code read 2), and with RunAfter=true the
quota pass deleted each 3-4 GB whole-database backup ~80 ms after it
completed (log: 'backup completed' → 'retention cleaned backups
deleted_count=1 freed_bytes=<exactly the new backup>'). Three fixes:
1. Quota retention NEVER deletes the newest backup of each type — a
quota smaller than one backup degrades to 'over quota, keep it'
with a loud warning, not 'delete the only backup'. Sparse-file unit
tests pin both the two-backup and only-backup-oversize shapes.
2. Default quota raised to the documented 50 GB.
3. BACKUP_SNAPSHOT_WAIT_TIMEOUT_SEC default 300 → 3600: the live
whole-database export runs ~15 min; the 5-min wait made the initial
scheduled run report failure (jobhealth correctly recorded it —
the wiring works) while the backup actually completed later.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(transfer): omit empty path/name on import — restores with observations always failed (drill-caught)
The Tier 3 round-trip's real restore failed with
ConstraintValidationFailed: conversation observations carry path=NULL
in Neo4j (which memorynode_path_unique (space_id, path) ignores), but
the exporter serializes NULL as the proto default "" and the importer
wrote the literal empty string unconditionally — so the second
observation node in any restore collided. Every restore containing 2+
observation nodes had always been broken; this was invisible because
no backup had ever been restore-tested (the sprint's premise,
demonstrated). nodeProps now omits empty path/name (null fidelity);
unit-pinned.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(backup-restore-verify-001): feature doc + verification + CHANGELOG + CLAUDE.md — sprint complete (Epic 7)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-storm-001): sprint plan with corrected burst attribution (Epic 0)
Triage correction baked in: the 5,397-node burst was the Context
Cooler via the session-start hook's /v1/conversation/graduate (uncapped
backlog sweep of pre-DH-004 graduation-bug victims), NOT RSIC —
mis-attributed because tombstone_stale stamps no metadata and two
archive-reason property names coexist. RSIC's own issues stand:
trigger-race cycle storm (~20-30k/day) and snapshot/executor predicate
drift (rollback restores nothing).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-storm-001): atomic trigger admission — reserve-on-allow (Epic A)
EvaluateTrigger checked activeCycles/lastTrigger, but both were written
only by RecordTrigger — which callers invoke AFTER RunCycle completes.
For a cycle's entire multi-second duration every concurrent trigger
passed every gate: ~20-30k micro cycles/day live (4 spawning within
50ms of each tool-use burst), the 300s cooldown effectively
nonexistent, llama-server saturated (the recurring synthesize/
evaluate_llm/intent_translate timeout cascades), and RSIC actions
dispatched at storm frequency.
Admission now reserves the active + cooldown (+dedupe) records under
the same lock that performs the checks; RecordTrigger updates the
reservation with the real cycle ID; CompleteCycle clears the active
slot; a failed cycle still cools down. Unit-pinned: 50 concurrent
triggers admit exactly one; cooldown holds from admission through
completion and failure.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-storm-001): attributable archival + unified tombstone predicate (Epics B+C)
Epic B — every archival is now attributable:
- tombstone_stale stamps archived_at + archive_reason
('rsic_tombstone_stale') + archived_cycle_id (bare is_archived made
the 2026-06-11 burst forensics mis-attribute the Context Cooler's
5,397-node sweep to RSIC for hours).
- Canonical property name is archive_reason; concepts.go (the one
archived_reason writer) migrates; historical rows keep the old name
(readers coalesce; no data migration).
- Context Cooler tombstone step capped per run
(COOLER_TOMBSTONE_MAX_PER_RUN, default 500; 0=unlimited) with a loud
cap-reached warning — the incident sweep was a single uncapped run
over the pre-DH-004 volatile backlog via the session-start hook's
graduate call.
Epic C — rollback restores the right nodes:
- The executor and the rollback snapshot now share ONE candidate
predicate (tombstoneStaleCandidates const). RSIC-VALIDATE-001 had
updated only the executor; the snapshot captured the old unlinked
set, so rollback restored nodes that were never archived
(restored_count=0 live). Drift class eliminated, pin-tested.
- Rollback also clears the new attribution fields on restore.
Epics share the tombstone Cypher — combined commit (disclosed).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-storm-001): feature doc + verification + rollback drill test + CHANGELOG + CLAUDE.md — sprint complete (Epic F)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-storm-001): commit ExecuteTombstoneStaleForTest wrapper missed from Epic F
The rollback drill test (committed in 2534a28) references this
test-support wrapper, but the Epic F git-add listed the test file and
not internal/ape/task_dispatch.go — CI's integration build failed on
the already-merged PR #435 while local builds passed (the method
existed uncommitted in the working tree). 7-line addition, no behavior
change.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(tsdb): initial backup on start — restart-resetting ticker meant zero backups ever
Alert triage on the (correctly firing) 'No Successful TSDB Backup In
Window' staleness alert found scheduled_job_events has ZERO tsdb-backup
rows: the scheduler's 24h ticker resets on every restart, and a server
restarted more often than the interval never backs up (8 restarts
today alone). Same gap BACKUP-RESTORE-VERIFY-001 fixed for the Neo4j
scheduler. The shared runOnce() now also fires
TSDB_BACKUP_INITIAL_DELAY_MIN (default 10; 0 disables) after start —
the staleness rule's 'never ran' guarantee did its job catching this.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(uats-gap-001): sprint plan (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(jiminy): reformulate returns 400 (not 500) for missing context
Live-caught while probing for the UATS-GAP-001 contract spec: the
service rejects an empty context but the handler surfaced that
request-validation failure as 500 'internal error'. Validate at the
edge like space_id. The /strict reformulation channel (prompt-context
hook in strict mode) was otherwise contract-clean.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(transfer): nil-guard edge identity assertions — one bad edge panicked the whole server (UATS-caught P0)
The UATS suite's backup_trigger spec ran a live export that hit an
edge whose endpoint returned nil fromId/toId — the unchecked
fromID.(string)/toID.(string)/relType.(string) assertions at
exporter.go:641-643 panicked, taking the entire server down mid-run
(launchd restarted it; 167 connection errors in the suite were the
visible symptom). An unexportable edge is now skipped with a warning;
the two parentVal assertion sites hardened with comma-ok for the same
class. An HTTP-triggerable request must never be able to kill the
process.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(uats-gap-001): 8 contract specs for the revived channels + suite hygiene (Epics 1-6)
New specs (27 cases, 100% live pass, hash-stamped): jiminy_strict,
jiminy_reformulate, jiminy_classify (incl. the fail-open contract),
jiminy_warm (202 warming|debounced union), jiminy_latest (warmth-state
union — the Follow-up C strict-JSON surface), admin_breakers_list,
admin_breakers_reset (404-with-available[] discoverability +
mutation-safe reset round-trip), memory_retrieve_sparse_context (first
specs referencing ?sparse=/?sparse_percentile=/?context=auto/
?strict_context= — pins the sparse-debug-fields-are-optional client
contract).
Suite hygiene: 8 pre-existing env-conditional *_disabled variants
converted to skip variants (they pass only with Jiminy/J17 OFF; the
live stack runs them ON), and j17_protocol_learn's missing-fields
guard fixed — the runner deep-merges variant bodies, so the original
key-omission body merged into a complete request and asserted nothing;
now empty-string overrides. Conventions recorded in CLAUDE.md.
make test-api: 0 failed, 0 errors, 425 hashes verified.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(uats-gap-001): tag breaker reset round-trip llm_required — CI has no openai-embeddings breaker
PR #437 CI caught the deployment dependence the spec description
already flagged: the reset_round_trip variant targets the
openai-embeddings breaker, which exists only when the openai embedding
provider is configured — CI runs provider-less and 404'd. Tagged with
the suite's established llm_required exclude tag: CI (which excludes
it) skips the variant; the live stack still exercises the full
round-trip. Verified both modes locally (27/27 untagged; 26/0 with the
CI exclude set).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(uxts-ci-001): sprint plan (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(uxts-ci-001): TSDB in CI, un-zombie UOTS, delete UVTS step, ULTS hash gate live (Epics 1-3a)
CI: timescaledb service (compose-pinned image) + TSDB_* env; tsdb tag
un-excluded from UATS; UOTS converted from its since-inception
continue-on-error zombie to merge-blocking; UVTS step DELETED per the
roadmap's 'un-zombie or delete' option (semantic grading with a stub
embedder and no seeded corpus measures nothing — UVTS gates retrieval
changes via the documented live flow); new merge-blocking gates: ULTS
prompt-hash verification + UBENCH dataset↔holdout contract.
Arming the gates immediately caught real rot, each fixed here:
- ULTS runner could not parse annotated source locations
('file.go:74 (rendered with ...)') — 5/17 specs failed on a parser
artifact; the 'dynamic' sentinel was rejected inside hash arrays.
- With the parser fixed…
* docs(eventgraph-004): sprint plan + CoactivateSession post-revival health review (Epic 0)
EVENTGRAPH-004 federates the last unfederated Hebbian write — the
ApplyNegativeFeedback contradict action — into reinforcement_events
(trigger_path=apply_negative_feedback_contradict). Data-decided scope:
reuse the existing V0022 sink (zero CONTRADICTS edges exist anywhere;
no producer calls /v1/learning/negative-feedback — instrument before
the producer arrives, the inverse of the dormancy pattern).
Also closes the EVENTGRAPH-003 follow-up: 30h post-fix health review of
the revived CoactivateSession path — no tuning needed, textbook session
cliques, pre-fix orphans stay as historical record (operator decision).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(eventgraph-004): wire ApplyNegativeFeedback contradict path → reinforcement_events (Epic 1)
The contradict action (no co-activation edge → MERGE CONTRADICTS) was the
last unfederated Hebbian write. The CONTRADICTS MERGE lived inside a
FOREACH, where the edge variable is invisible to RETURN — so the original
single statement is split into two statements in the SAME ExecuteWrite
transaction: (a) weaken (EVENTGRAPH-003 telemetry, RETURN unchanged) and
(b) contradict with a per-pair RETURN. Classification is identical: weaken
never deletes edges, so contradict's NOT EXISTS sees the same edge set the
original OPTIONAL MATCH did.
Contradict rows land with trigger_path=apply_negative_feedback_contradict.
created_new_edge detected via `c.updated_at IS NULL` (ON MATCH always sets
it; ON CREATE never does — invariant pinned by comment). delta_weight is
the CONTRADICTS edge's OWN weight delta (+negWeight on create, 0 on
re-match); negative-feedback semantics are carried by trigger_path, not
the sign.
Both statements EXPLAIN-validated against live Neo4j. Tier 1: 2 new parser
tests (create/re-match branches); learning suite green; lint clean.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): Tier 3 live verification — contradict create/re-match + weaken unchanged (Epic 2)
Live against the restarted Epic-1 binary: contradict create row
(+0.15, created_new_edge=true), re-match row (delta=0, evidence=2),
weaken row byte-equivalent to pre-split behavior (negative delta,
floor at 0). Federation CLI surfaces the new trigger_path with no
read-side change. UATS learning_negative_feedback 5/5 PASS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(eventgraph-004): feature doc + CHANGELOG + UATS pin + close (Epic 3)
Feature doc: 5-path trigger_path table + delta-semantics consumer
warning (contradict delta is the CONTRADICTS edge's own weight delta —
semantics live in trigger_path, not the sign). UATS spec extended:
zero-count equals assertions on nonexistent nodes (hash refreshed,
5/5 live). CLAUDE.md architecture note + producer-gap disclosure.
Sprint close in post.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci: auto-sync dev branch with main after each squash-merged PR
Squash merges never advance the dev branch's merge-base, so every
sprint touching CHANGELOG.md/CLAUDE.md hit CONFLICTING on its next PR
(first bitten: PR #419). New sync-dev-after-merge.yml merges main back
into the source *_dev* branch after each merged PR; the GITHUB_TOKEN
push triggers no other workflows, so it can never spawn an empty
auto-PR (the PR #420 failure mode). Conflicts fail loudly for manual
resolution; workflow_dispatch enables manual runs/live testing.
auto-pr.yml additionally skips PR creation when branch content is
identical to main — guards MANUAL sync pushes, verified against the
live repo state (current dev01 ≡ main → empty=true → skip).
actionlint clean (untrusted refs passed via env, not inline).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(roadmap): Q3 2026 vision-derived roadmap from 26-agent codebase deep-dive
Full-codebase review vs MDEMG's purpose (cognitive substrate / connection
layer): 19 map agents (3 vision + 16 subsystem), 3 cross-cutting assessors,
synthesizer + adversarial completeness critic (19 revisions applied).
Verdict: server-side substrate is mature, but the system is not currently
functioning as the assistant's internal dialogue — the per-prompt delivery
channel silently no-ops (hook reads .user_prompt, Claude Code sends
.prompt), 100% of GENERALIZES edges have NULL weight (22,170/22,170,
live-verified), scheduled decay/prune has been a permanent dry-run, RSIC
validates 16/17 actions vacuously, and supervision covers 3 of ~14
background loops. Every defect is the same disease: wired-looking seams
with no caller, wrong contract, or no reader.
4 phases ≈ 75 days committed: (1) reconnect the loop ends, (2) close the
learning loops, (3) survivability + class-ending forcing functions,
(4) FT frontier + release hygiene. Top-10 ranked; deferrals explicit.
Orchestrator spot-verification annex included (5 claims re-verified live).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): sprint plan — fix hook stdin contract, reconnect per-prompt channel (Epic 0)
Roadmap Q3 Phase 1 rank #1. Audit of all 6 hooks vs the actual Claude
Code stdin schemas: prompt-context.sh reads .user_prompt (CC sends
.prompt) → channel exits silently on every prompt; post-tool-observe.py
reads tool_output (CC sends tool_response) → false "Build/test
succeeded" observations with empty output; guidance wrongly coupled to
RESULT_COUNT>0; minor pre-compact transcript jq. session-start /
pre-bash-check / pre-write-check verified correct.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): prompt-context.sh reads .prompt — revive the per-prompt channel (Epic 1)
Claude Code's UserPromptSubmit stdin field is `prompt`; the hook read
`.user_prompt`, which is always empty → exit 0 → per-prompt CMS recall,
Jiminy guidance, /strict reformulation, the warm trigger, and the
retrieve-time Hebbian reinforcement have NEVER fired in any session.
Now reads `.prompt // .user_prompt` (legacy fallback kept).
Also decouples guidance from recall: the RESULT_COUNT=0 branch no longer
exits — it printed its notice then skipped guidance + warm + retrieval
reinforcement, coupling independent deliveries.
Both copies (live + installer template). Tier 1 simulated stdin: real
.prompt payload → first-ever guidance delivery (J17 T1 bootstrap + DICT,
5363 guidance bytes vs 0 forever); legacy fallback works; short/empty/
malformed payloads exit silently (fail-open preserved).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): post-tool-observe reads tool_response — end blind "succeeded" observations (Epic 2)
Claude Code's PostToolUse stdin field is `tool_response` (string or
object); the hook read `tool_output`, which is always absent → output_str
empty → error indicators never matched → every go build/go test/pytest
Bash call was recorded as "Build/test succeeded" sight-unseen, and real
errors were never observed.
Now reads tool_response (fallback tool_output) via _response_text(),
normalizing string|dict|list (stdout/stderr join). Success classification
requires NON-EMPTY clean output — a silent success records nothing rather
than fabricating; failures land as error observations with real stderr.
Both copies (template regenerated from fixed live, {{SPACE_ID}}
placeholder preserved, verified identical modulo placeholder). Tier 1
against real CMS: failing build → error obs with stderr; passing →
progress; empty → no record.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hookwire-001): pre-compact transcript extraction reads the real line shape (Epic 3)
Transcript lines are {type, message:{content:[{type, text|name, ...}]}};
the old top-level `.content` read always yielded empty, so pre-compaction
snapshots never carried recent-activity context. New jq walks
.message.content[] extracting .text/.name. Verified against this
session's real transcript (old: nothing; new: real activity). Both
copies, placeholders preserved.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hookwire-001): Tier 3 verification + CHANGELOG + CLAUDE.md contract pin + close (Epics 4-5)
Live in the real session: first-ever guidance delivery (J17 T1 bootstrap
+ DICT, 5363 bytes vs 0 forever); real failing build → error observation
with actual compiler output in CMS. PostToolUse success-only firing
documented as a limitation. Hook stdin contract pinned in CLAUDE.md.
Drift + clique-semantics findings logged for HOOKSYNC-001 / Phase 2.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): sprint plan — drift-proof + self-monitoring hook channel (Epic 0)
Roadmap Q3 Phase 1 rank #2. Investigation grounded all five findings:
template→live drift severed alert delivery (50-entry file actively
rotating today, never shown); no Cleared lifecycle (nothing sets the
field; no /v1/alert* endpoints); no absence detection for the channel
that just had a months-long silent outage; compose publishes 9999 on
0.0.0.0; neural sidecar binds 0.0.0.0:8101 via a 39-day-old process
serving pre-J17-fix code. 8 epics: reconcile, CI parity gate, clear
lifecycle, hook_events absence rule (reuses V0024 via jobhealth),
hooks doctor, PORT-TRUTH rider, Tier 3, docs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): reconcile bidirectional hook drift — alert delivery restored to live (Epic 1)
Live hooks adopted from templates (SPACE_ID substituted): restores the
alert-display blocks (all-pending per prompt; critical/high + degraded
healthz at session start) that the live copies lacked — the NOSILENT
last mile. Reverse drift caught during reconcile: the live hook's T1/T2
bootstrap-detection block (MAX_TIER → /v1/jiminy/bootstrap → ACTIVE
CONSTRAINTS header) never existed in the template and was nearly lost —
now single-sourced in the template and regenerated into live.
Live-verified: one prompt now renders alerts (50 pending incl. live
CRITICALs) + recall + J17:INIT bootstrap + guidance + synergy footer,
coexisting. All 6 hooks byte-identical to templates modulo {{SPACE_ID}}.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* ci(hooksync-001): hook-template parity gate — live hooks must match templates (Epic 2)
Mirrors the compose/launchd parity pattern: every *.sh/*.py template
must byte-match .claude/hooks/ modulo the {{SPACE_ID}} placeholder.
Proven locally: passes clean, fails (with a bounded diff dump) on
deliberate drift. Ends the bidirectional-drift class that severed
alert delivery and nearly lost the T1 bootstrap block.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): alert Cleared lifecycle — display once, then delivered (Epic 3)
Alert.Cleared existed but nothing ever set it: once hooks rendered the
file, the same entries would re-render every prompt forever. New:
FileBackend.Clear (ids and/or all_before cutoff, idempotent, under the
existing lock) → Dispatcher.ClearAlerts → POST /v1/alerts/clear. Hooks
now clear exactly what they displayed (fire-and-forget, fail-open);
cleared = delivered-to-operator, not resolved — persisting conditions
re-fire via the evaluator. Alert IDs now CUIDv2 per the identifier
standard (was UnixNano; old ids remain valid opaque strings).
Live-verified lifecycle: prompt 1 → "50 pending, showing 10" + 10
cleared; prompt 2 → "40 pending, showing 10" (next batch, no re-render)
→ 20 cleared. Tier 1: Clear by-id/by-time/idempotent/no-backend. UATS
alerts_clear 3/3 live (runner falsy-body inheritance discovered:
variant bodies must be non-empty objects).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): hook-channel absence detection — the channel now self-reports outages (Epic 4)
POST /v1/hooks/event records heartbeats into V0024 scheduled_job_events
via the jobhealth policy point (job_name hook:<name>; no new sink).
Two independent heartbeats: prompt-context fires per delivery (the
monitored channel); post-tool-observe fires throttled (HOOK_HEARTBEAT_
COOLDOWN_SEC, default 300 — proves sessions ACTIVE). Evaluator rule
hook_channel_silent (distinct service per the NOSILENT cooldown rule):
sessions active + zero prompt-context fires in HOOK_SILENT_LOOKBACK_
HOURS (24) → high alert. This is the "job never ran" guarantee applied
to the channel whose months-long outage HOOKWIRE-001 found only by
manual audit — the next contract drift self-reports.
Config: HOOK_HEALTH_ALERT_ENABLED (true), HOOK_SILENT_LOOKBACK_HOURS
(24), HOOK_ACTIVITY_MIN_EVENTS (5). Live-verified: real hook fires land
rows (session metadata, latency); throttle holds; rule SQL positive +
negative branches proven against the real table; UATS hooks_event 3/3.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hooksync-001): mdemg hooks doctor — one-shot hook-channel triage (Epic 5)
11 checks: per-hook template parity (the CI gate's local twin),
settings registration, server healthz, a stdin-contract self-test
piping a real-shape UserPromptSubmit payload through the installed
hook (asserts the always-present synergy footer), alert-file state
(pending/total), and the last hook:prompt-context heartbeat age from
scheduled_job_events (SKIP when TSDB unreachable). Table or --json;
non-zero exit on any FAIL.
Live: 11/11 PASS on this machine ("last fire 5s ago" — fed by the
doctor's own self-test); correctly fails (exit 1) on deliberate drift.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hooksync-001): PORT-TRUTH — loopback bind defaults + sidecar zombie replaced (Epic 6)
Compose published the API on 0.0.0.0 (unauthenticated admin/destructive
routes exposed off-host): now "${MDEMG_BIND_ADDR:-127.0.0.1}:${MDEMG_PORT}
:9999" — wide bind is an explicit opt-in (both compose copies, CI-synced).
Neural sidecar bound 0.0.0.0:8101 via config.py default AND the plist
arg: both now 127.0.0.1 (both plist copies, CI-synced; SIDECAR_HOST env
overrides). Operational: the 39-day-old sidecar process (started
2026-05-02, serving pre-J17-fix code) replaced — fresh process verified
on 127.0.0.1:8101, both models loaded, health 200.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hooksync-001): Tier 3 verification + feature doc + CHANGELOG + close (Epics 7-8)
Live-verified across the sprint: alert backlog drained 50→2 on real
prompts (display-then-clear); evaluator rules 15→16 (hook_channel_silent
loaded); doctor 11/11 + correct failure mode; sidecar fresh on
127.0.0.1:8101 (NLI 234ms). Feature doc docs/features/hook-channel-
health.md (config table incl. MDEMG_BIND_ADDR + SIDECAR_HOST). Findings:
packaging plists are templates (raw copy → launchd exit 78; service
install is canonical); UATS falsy-variant-body inheritance pinned.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(uats): jiminy_guide_sanitized timeout 30s → 90s — stale vs synthesis latency
Caught in the HOOKSYNC-001 full-suite regression: the synchronous
/v1/jiminy/guide includes local-model synthesis (~43s observed quiet,
~50s typical per GUIDANCE-SYNTH-001) — the spec's 30s timeout has been
silently erroring since synthesis latency grew. Aligned with the
JIMINY_WARM_COMPUTE_TIMEOUT_MS budget (90s); hash refreshed; passes
live. Pre-existing — not a HOOKSYNC regression (Guide path untouched).
The other 3 suite errors were load-induced flakes (pass individually):
suite-vs-llama-server slot contention, noted for UXTS-CI-001.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ci): track .claude/hooks/pre-write-check.py so hook-parity check passes
Root cause: the new 'Verify live hooks match hook templates' CI step
(HOOKSYNC-001) diffs every internal/cli/hook_templates/*.{sh,py} against
.claude/hooks/<name>, but the .gitignore allowlist only un-ignored the
5 original hooks. pre-write-check.py gained a template in this sprint
while its live counterpart stayed gitignored, so CI checked out a tree
without it and failed with 'MISSING live hook:
.claude/hooks/pre-write-check.py'.
Fix: add '!.claude/hooks/pre-write-check.py' to the allowlist and commit
the live hook (already byte-identical to its template modulo SPACE_ID),
preserving the full parity guarantee instead of weakening the CI step.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): sprint plan — real weights on the abstraction hierarchy (Epic 0)
Roadmap Q3 Phase 1 rank #3. Live investigation: point.distance() returns
NULL on embedding lists (proven: NULL where vector.similarity.cosine
returns 0.627 on the same pair); 3 creation sites affected incl. an
ABSTRACTS_TO site the audit missed. Scale worse than audited and
growing: 28,332/28,332 GENERALIZES + 36,110/37,996 ABSTRACTS_TO = 64,442
NULL-weight abstraction edges. Neo4j cosine returns [0,1] directly —
drop-in. Plan: fix sites (+ CUIDv2 edge ids), LIMIT-5-then-batched
backfill, null-weight gauge + alert rule via the existing graph-stats →
metric_samples path, UVTS-quick regression guard.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-weight-001): abstraction-edge weights — vector.similarity.cosine replaces point.distance (Epic 1)
point.distance() is a spatial-Point function: on embedding lists it
returns NULL, so every weight at the 3 abstraction-edge creation sites
was never set (100% of GENERALIZES + 95% of ABSTRACTS_TO weightless;
the CASE guards passed on good embeddings, then the THEN expr evaluated
NULL — edges with good embeddings got nothing while embedding-less ones
got the 0.5 fallback). vector.similarity.cosine returns [0,1] directly
(live-verified: identical=1.0, orthogonal=0.5, opposite=0.0). Site 1
(theme GENERALIZES) gains the null-guard it never had.
Also: edge_id randomUUID() → CUIDv2 per the identifier standard, minted
Go-side via memberEdgePairs (Cypher can't generate CUIDv2) and zipped
with member ids for UNWIND. All 3 statements EXPLAIN-validated live.
Tier 1: pair-builder tests (uniqueness, CUID format, empty input).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): mdemg graph backfill-weights — heal 56k NULL abstraction weights (Epic 2)
Standalone subcommand (deliberately NOT folded into `graph repair`,
whose orphan sweep would delete the pre-fix orphan observations the
operator chose to keep). Weight = vector.similarity.cosine(endpoint
embeddings) when both exist, else 0.5 (the creation sites' fallback);
similarity_score set alongside; idempotent (pure function of
embeddings); batched (default 1000/txn) with --limit for trials.
Executed per the small-batch-first rule: dry-run count → LIMIT-5 live
trial → hand-verified (stored ≡ independently recomputed to 6dp) →
distribution preview over 2000 (min 0.704, mean 0.96; the ~50% near-1.0
mass is single-member-cluster degeneracy — centroid ≡ member embedding,
HIDDEN-CHURN-001 territory, faithfully encoded) → full runs. Mid-run
the count GREW: the running server predated Epic 1 and kept minting
NULL edges — restarted on the fixed binary, swept stragglers, then
whk-wms (8,755) + linear (199). Final: 0 NULL / 57,395 edges globally.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-weight-001): null-weight gauge + regression alert rule (Epic 3)
Query 4 in the graph-stats collector counts NULL-weight GENERALIZES/
ABSTRACTS_TO edges per space → new gauge
mdemg_neo4j_graph_null_weight_edges → metric_samples → evaluator rule
null_weight_abstraction_edges (service graph-weight-integrity, distinct
per the cooldown rule; NULL_WEIGHT_EDGE_ALERT_THRESHOLD default 100,
ForDuration 10m). Steady state post-backfill is 0; sustained
reappearance = the point.distance bug class regressed at a creation
site — it self-reports instead of waiting for the next audit.
Live: evaluator rules 16→17; gauge rows persisting at value 0 across
all spaces.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ingest): config-driven consolidation timeout — was sharing the 300s batch budget
Caught live during the HIDDEN-WEIGHT-001 corpus reingest: the post-ingest
/v1/memory/consolidate call used the shared batch-ingest client
(--timeout, 300s); consolidating a ~10k-node space exceeds that, so the
client reported failure while the server completed the work — the
GUIDANCE-SYNTH-001 bug class (long graph/LLM work needs its own budget).
New --consolidate-timeout flag / INGEST_CONSOLIDATE_TIMEOUT_SEC env
(default 1800s) with a dedicated client. Live-verified: "running
consolidation timeout_sec=1800" → complete.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-weight-001): Tier 3 verification + corpus restoration + UVTS harness audit + close (Epics 4-5)
Tier 3: real consolidation minted edges with varied cosine weights
(0.83-0.94) + CUIDv2 ids; at-scale via the corpus reingest (9,500 edges,
0 NULL, mean 0.923); gauge holds 0; evaluator rules 16→17.
UVTS harness: corpus space lnl-demo-whk had been deleted with zero trace
(no UVTS run since 2026-05-04 measured anything real); restored by
operator-directed full reingest. A fresh baseline NUMBER remains blocked
by further live-found harness rot — grader/persist breakage, expected-
path format drift, vector post-filter dilution (service.go:1137 global
top-K then space filter) amplified by the duplicate whk-wms space —
complete defect inventory handed to UXTS-CI-001. Retrieval ranking on
the restored corpus verified correct (expected files at ranks 1-4).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(maint-live-001): sprint plan — scheduled maintenance actually runs (Epic 0)
Roadmap Q3 Phase 1 rank #4. Weekly decay+prune has never executed
(--dry-run defaults true; plist passes no override) while reporting
success — NOSILENT's blind spot. Tonight's Memory Bloat alerts (79k+
nodes) are the accumulated backlog. Safety verified in code before
planning: nodes are tombstoned (never deleted) with abstraction-chain/
degree/recency protections; edge deletion is the designed near-zero-
weight lifecycle, meaningful now that HIDDEN-WEIGHT made weights real.
Plan: live-by-default plist (+installed refresh), dry_run in job-event
metadata (no schema change — disclosed), maintenance_no_live_run
evaluator rule, darwin upgrade refreshes plists/hooks, first-ever live
run with preview-first protocol.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(maint-live-001): scheduled maintenance runs live — plist passes --dry-run=false (Epic 1)
The weekly LaunchAgent ran `mdemg maintenance` with no dry-run override;
the CLI defaults --dry-run=true, so every scheduled cycle previewed and
reported success — decay+prune NEVER executed (the 79k-node Memory
Bloat backlog). Both plist copies now pass --dry-run=false (the CLI
default stays true for safe manual previews — the SCHEDULE is what must
not silently no-op); installed plist refreshed + agent reloaded.
reportScheduledJobMeta threads job metadata into V0024; maintenance
records dry_run so the only-ever-dry-runs pattern is queryable
(metadata JSONB — no schema change, disclosed in the plan).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): maintenance_no_live_run evaluator rule (Epic 2)
Fires when maintenance rows exist in MAINT_LIVE_LOOKBACK_DAYS (default
8) but none ran live (success + metadata dry_run=false) — the only-
ever-dry-runs pattern self-reports instead of hiding inside "the job
ran". Distinct service maintenance-liveness per the cooldown rule.
Config: MAINT_LIVE_ALERT_ENABLED (true), MAINT_LIVE_LOOKBACK_DAYS (8).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): mdemg upgrade refreshes installed LaunchAgents + hooks (darwin) (Epic 3)
Plist/hook fixes shipped in releases but never reached installed
machines — the maintenance dry-run override would have sat unreachable
next to upgraded binaries forever. Upgrade now re-renders ALREADY-
INSTALLED mdemg LaunchAgents from the new binary's embedded templates
(refresh-only — never installs new services) + re-syncs mdemg-managed
Claude hooks in the current project (marker-checked). Substitution
logic single-sourced into renderLaunchdTemplate (Install + Refresh —
the drift class that exit-78'd the sidecar during HOOKSYNC live smoke).
Mirrors the existing Linux systemd-unit refresh.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(maint-live-001): context-dependent orphan policy — --exclude-role-types (Epic 4a)
Orphan disposition is context-dependent (operator, 2026-06-11): a
uniform degree/age rule conflates governance constraints, conversation
history, test junk, and hierarchy debris. New --exclude-role-types on
prune + maintenance (env PRUNE_EXCLUDE_ROLE_TYPES) makes the policy
expressible; the scheduled plist ships
constraint,conversation_observation excluded per the operator's call
(constraints are load-bearing governance rules at any degree;
conversation observations differ by SESSION which the knob can't
express yet). Aged hierarchy debris stays eligible — that's the
lifecycle working. Candidate census that drove the decision: 5,388
conv-obs (9 eligible tonight under the 90d shield), 11 constraints,
238 hierarchy nodes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(prune): orphan sweeps use implicit transactions for batched deletes
Caught by the FIRST-EVER live maintenance run (MAINT-LIVE-001 Tier 3):
Neo4j raises TransactionStartFailed when a batched CALL-IN-TRANSACTIONS
statement executes inside an explicit transaction. Both orphan sweeps
(SymbolNode + Observation) ran their batched delete via ExecuteWrite;
the dry-run path never executes the deleting statement, so no preview
or unit test could surface it — only live execution. Switched to
session.Run (implicit tx). The failure ALSO proved the NOSILENT chain
live: the run fired "Scheduled job failed: maintenance" before exiting.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(maint-live-001): first live run verification + feature doc + CHANGELOG + close (Epics 4b-5)
First live maintenance in MDEMG history: 20,236 orphan SymbolNodes
deleted; all 5,010 tombstone candidates protected (recency + operator
exclusions); liveness rule born-firing → silenced by the real run; the
3-row job-event story (preview/true → failure/false alerted →
success/false) proves the dry_run plumbing through every path.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs: CLAUDE.md architecture note for MAINT-LIVE-001
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(embed-wire-001): sprint plan — embedder wiring + ingest exec resolution (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(embed-wire-001): breaker + recorder reach the real embedder through the wrapper chain (Epic 1)
The embedding circuit breaker was NEVER wired in any default deployment:
embeddings.New returns *CachedEmbedder when EMBEDDING_CACHE_ENABLED=true
(the default), so the server's emb.(*embeddings.OpenAI)/(*Ollama)
assertions on the OUTERMOST value failed silently (no else branch). The
recorder assertion had the inverse fragility (cache off → training-data
recording silently dies).
New: Unwrap() chain (CachedEmbedder joins RateLimitedEmbedder's existing
one) + embeddings.Base() / FindCached() interface-driven walkers — any
future wrapper joins by adding Unwrap(), no type lists. Wiring now walks
to the base for the breaker and to the cache layer for the recorder,
with LOUD warns when nothing matches. Tier 1 pins the production shape
(ratelimit(cache(provider))) plus cache-off and bare chains.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(ingest-exec-001): server-triggered ingest resolves the mdemg binary — was hardcoded ./bin/mdemg (Epic 2)
Both ingest-job exec sites ran a relative "./bin/mdemg": broken in
Docker (the documented-primary deployment — binary at /usr/local/bin,
no repo checkout) and any CWD other than the repo root. New
resolveMdemgBin(): MDEMG_BIN env → os.Executable() (the server IS the
binary) → PATH → ./bin/mdemg legacy fallback; cached; Tier 1 pins the
order. Scheduled-sync jobs now report outcomes to scheduled_job_events
via jobhealth (job_name codebase-sync) — an unattended sync that keeps
failing is never silent; manual API jobs stay queue-visible only.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(embed-wire-001): live verification + CHANGELOG + CLAUDE.md + close (Epics 3-4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): sprint plan — documentation matches reality (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): CLAUDE.md FT section rewritten to post-pivot reality (Epic 1)
The section presented the abandoned Qwen3.6-35B-A3B MoE target, two-tier
MoE-Sieve strategy, and Sprint A→E critical path as CURRENT — all
superseded by the 2026-04-22 MoE→dense pivot; this stale text seeded the
Q3 roadmap audit with a dead architecture. Rewritten: shipped state
(dense Qwen3-14B mdemg-llm-v1, 0.8389, llama-server runtime), superseded
plan documented with the pivot rationale (never deleted — supersede-with-
pointer), guardrail llmclient exception marked CLOSED (re-verified in
code), memo-07 provenance break disclosed (the file never existed;
00_README_v2.md is canonical), open FT work named (FT-CLASSIFY-002 +
recursive-retraining trigger). Adapter env-name drift fixed
(MDEMG_ADAPTER_BASE, not MDEMG_MODEL_ADAPTER_BASE).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(doc-truth-001): operator-facing text matches the Phase 13.5 reality (Epic 2)
preflight errors directed operators to start the DECOMMISSIONED
mlx_lm.server on :8101 — following them reintroduces the crash-looping
stack Phase 13.5 replaced. Now: llama-server :8102 guidance (managed
service install + manual command), backend-agnostic wording. model.go
help text dropped three stale "deferred to MODEL-DIST-002" mentions
(shipped 2026-05-25). Operationally (untracked .env): removed the
J17_SIDECAR_TIMEOUT_MS=200 override that re-pinned the exact value
DH-004 remediated — the 1000ms default now applies; server restarted.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): 00_README STATUS block + AGENT_HANDOFF retired (Epic 3)
00_README_v2.md gains a top-of-file STATUS block: shipped-through-
cutover state, superseded MoE plan (FT-2 skip + FT-3 supersession +
R-LT-4 prototype-discipline adjudication recorded), the NOT-STARTED
recursive-retraining loop with its FT-CLASSIFY-002 trigger, and
provenance notes (memo-07 never existed; the spec is untracked pending
FG-2). AGENT_HANDOFF.md (stale since 2026-05-06) retired to a pointer
stub — handoff state lives in CLAUDE.md/roadmap/CHANGELOG/CMS.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): grep-sweep proof + CHANGELOG + close (Epic 4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(doc-truth-001): last stale --adapter help string (sweep straggler)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-validate-001): sprint plan — fail-closed self-improvement (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): honest criteria evaluation — populated keys + fail-closed mutations (Epics 1-2)
The cycle baseline populated 10 metric keys while task criteria
referenced ~15 others (only volatile_count + correction_rate
intersected) → missing_data → skip → ~16/17 actions validated
vacuously; criteria-driven rollback was unreachable. The
SelfAssessmentReport already carried nearly every needed key — they
were never copied into the maps.
New single source reportMetricsMap() feeds BOTH MetricsBefore and
MetricsAfter (the mismatch class cannot recur), resolving
edges_below_threshold, total_edges, consolidation_age_sec,
avg_edge_weight, guidance_health, protocol_health + 13 more. Fail-
closed rule: for MUTATING actions (15-entry registry) a criterion with
missing evidence counts as NOT met ("missing_data_failclosed") — an
unverifiable mutation must never be recorded as success; observational
actions keep advisory semantics. The prior test pinned the vacuous
pass as the contract — updated to the honest one + advisory companion.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): tombstone_stale scoped to correction-linked nodes; refresh_stale_edges decays for real (Epic 3)
tombstone_stale archived 50 ARBITRARY older observations whenever ANY
correction existed in the 7-day window — no relationship between
correction and target. Now requires linkage: same session as the
correction OR its 1-hop CO_ACTIVATED_WITH neighborhood. Live check:
0 corrections in the current 7-day window, so both old and new scopes
are 0 RIGHT NOW — the hazard was conditional (any future correction
re-armed the old query against thousands of unrelated observations;
the new query bounds it to genuinely related nodes).
refresh_stale_edges bumped last_activated BEFORE the weight expression
read it → staleness=0 → the decay term vanished → every refresh was a
pure +0.1·log(count+1) boost. Staleness now captured via WITH before
SET; weights can genuinely decay. Both statements EXPLAIN-validated.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-validate-001): counter-free confidence calibration — RSIC stops polluting its own signal (Epic 4)
RSIC-SK1 injected synthetic "followed"/"ignored" outcomes through
UpdateConfidence, incrementing total_surfaced/total_followed/
total_ignored — the exact counters GetConstraintEffectiveness reads
next cycle: measured effectiveness drove synthetic outcomes which drove
measured effectiveness (circular self-reinforcement). New
AdjustConfidenceDirect applies the clamp+archive confidence delta with
ZERO counter writes; the outcome counters now belong exclusively to
real guidance feedback. Provider interface + adapter + dispatcher use
the direct path with the configured boost/decay magnitudes; test mock
maps deltas back to outcome labels so existing assertions keep meaning.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-validate-001): Tier 3 verification + CHANGELOG + close (Epics 5-6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test(rsic-validate-001): integration seeds carry session linkage for the scoped tombstone contract
CI's TombstoneStaleEndToEnd + MultiActionDispatchAndMetrics failed
because the seeded observations had NO relationship to the seeded
corrections — under the old behavior they were archived anyway (the
memory-eroding bug the sprint removed); under the new correction-
linkage contract they are correctly spared. SeedObservationNodes now
stamps a per-space test session shared by corrections and their stale
peers, so the tests exercise the new contract. Query-level proof
against the exact seeded shape: 10/10 linked observations match the
scoped Cypher. (Local integration runs hit the 30s client timeout —
the loaded local stack's cycles take ~6 min; CI's arbitrates.)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rrf-scale-002): sprint plan — finish the score-scale contract (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): persistent rerank clients — failure alerting re-armed on the hottest LLM path (Epic 1)
doRerankWithOpenAI/doRerankWithOllama constructed a fresh llmclient per
call: the consecutive-failure counter reset every time, so
LLM_CONSECUTIVE_FAILURE_THRESHOLD could NEVER fire for
retrieval.rerank_cross / rerank_nli (a north-star distill task), and the
HTTP transport was discarded per call. Per-provider base clients now
init once (sync.Once); WithContext() shallow-copies and SHARES the
*atomic counter + breaker, so per-call contexts keep failure accounting.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): config-driven score thresholds — suggest revival, MCP tiers, guardrail floor (Epic 2)
Three score-literal leftovers from the RRF-SCALE-001 audit instruction:
(1) /v1/memory/suggest's hardcoded 0.5 min-confidence default filtered
nearly everything on a scale topping out ~0.58 → CONSULTING_SUGGEST_
MIN_CONFIDENCE (default 0.45, RRF-calibrated); (2) MCP memory_reflect
tiers 0.7/0.4 (high tier unreachable) → MCP_REFLECT_SCORE_HIGH/_MEDIUM
(0.45/0.25); (3) guardrail constraint-retrieval Cypher's hardcoded
sim > 0.3 → GUARDRAIL_CONSTRAINT_SIM_FLOOR via GuardrailConfig
(cosine-stable today but inside the class).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rrf-scale-002): CacheKey covers ALL result-affecting fields + two forcing functions (Epics 3-4)
CACHE-KEY-002: the key omitted result-affecting RetrieveRequest fields —
the audit named 5 (include/exclude_extensions, temporal_after/before,
policy_context); the new reflection forcing-function caught 8 MORE on
its first run: sparse-gate per-call overrides (SparseEnabled/
SparsePercentile/SparseOverridePresent/Category — the ?sparse= URL
params), pagination (Cursor/Limit), and the context-fingerprint params
(QueryContextFingerprint/StrictContextMode). All now keyed, plus a
caller-supplied query-embedding hash. Two requests differing in any of
these no longer collide on one cache entry.
Forcing functions: (1) reflection test — every RetrieveRequest field
must be in CacheKey or explicitly classified result-neutral with
justification (new fields fail until classified); (2) score-literal
scan — flags `.Score/score <op> 0.x` comparisons repo-wide outside a
justified allowlist (first run triaged 3 scale-local sites; clamp
guards excluded by pattern). The RRF-SCALE bug class is now CI-caught.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rrf-scale-002): live-calibrated suggest floor + CHANGELOG + close (Epics 5-6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): sprint plan — stable concept identity, two-PR delivery (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-churn-001): automated consolidation no longer skips LLM emergence (Epic A1)
dynamicEmergenceStep registers at phase 22, but RunConsolidation ran
hardcoded ranges (10,20) + (25,30) — phase 22 fell in the gap, so with
EMERGENCE_ENABLED=true the AUTOMATED path silently skipped LLM concept
emergence while the manual path (RunNodeCreationPipeline, 10–22 with an
emergence gate) ran it. RunConsolidation now delegates to
RunNodeCreationPipeline(cfg.EmergenceEnabled) — single range source; a
pin test fails if the step's phase ever leaves the range.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): stable theme identity — centroid match-or-create replaces the 5-minute churn (Epic A2)
ClusterConversations detached EVERY observation→theme edge, deleted
childless themes, and recreated all themes from scratch each ~5-min
cycle: new node_ids every run, evidence chains destroyed continuously,
recall flooded with stacks of near-identical concepts (observed live in
this session's own prompt headers).
New flow: cluster first → match each cluster to an EXISTING theme by
centroid cosine (HIDDEN_THEME_IDENTITY_SIM_THRESHOLD, default 0.90,
greedy with per-run claiming) → matched themes UPDATE in place
(props + theme-scoped member-edge rewire; node_id and all inbound
references survive) → unmatched clusters create as before → only themes
claimed by NO cluster are deleted. The global detach is gone.
ThemesUpdated added to the result. Tier 1: match/threshold/claimed/
best-of selection.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* chore(hidden-churn-001): remove the dead global-detach helper — the churn mechanism itself
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): PR-A verification + CHANGELOG (Epics A3-A4)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): PR-B coverage retune — config ratio, density assignment, gauge + rule (Epic B1)
maxThemes was an inline ceil(n/10) equation → HIDDEN_THEME_TARGET_RATIO
(default preserves it). NOISE observations (previously dropped from the
hierarchy forever — the 94% coverage gap's mechanism) now density-assign
to their nearest theme when cosine ≥ HIDDEN_THEME_ASSIGN_SIM_THRESHOLD
(default 0.70; edges only, no new themes; below-floor stays unthemed
honestly). New per-space coverage gauge
mdemg_neo4j_conversation_coverage_ratio (collector Query 5) + evaluator
rule low_conversation_coverage (CONVERSATION_COVERAGE_ALERT_FLOOR 0.2,
6h ForDuration for convergence).
Audit bonus: caught WeightIntegrityRules querying metric_samples with
recorded_at — the column is `time`; the null-weight rule had been
silently erroring every evaluation since it shipped (Debug-only logging
— the SUPERVISOR-002 finding in action). Both rules fixed + a pin test
bans recorded_at against metric_samples.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): mdemg concepts repair + trace — grounding audit CLI (Epic B2)
repair: tombstones childless layer>=2 abstraction nodes (no inbound
ABSTRACTS_TO|GENERALIZES|GENERALIZES_TO — 10,395 live in mdemg-dev,
churn-era debris). Recoverable (is_archived=true + archived_reason),
batched, dry-run default, --limit for small-batch-first verification.
trace: per-node grounding audit — direct children, transitive per-layer
census, grounded/ungrounded verdict, sample path to L0.
Live data note: GENERALIZES alone over-counts (19,147) — ABSTRACTS_TO
is the hidden layer's actual child edge; pin test guards the predicate.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(hidden-churn-001): surface themes_updated + noise_assigned in consolidate API + periodic log (Epic B3)
/v1/conversation/consolidate now reports themes_updated and
noise_assigned alongside themes_created. The periodic-consolidation
log condition also gains both — with stable theme identity (PR-A),
created is usually 0 on healthy cycles, which would have silenced the
success log entirely (the silent-success bug class).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(hidden-churn-001): live-smoke fixes — noise pool was structurally empty, clustering included archived debris, coverage gauge gated on min-obs
Three defects only the live run surfaced (Tier 3 forcing function):
1. KMeans never emits label -1, so the density-assignment hook received
an always-empty noise list; the min-samples/max-themes/nil-centroid
drops now feed their members into the noise pool instead of silently
excluding them from the hierarchy.
2. fetchClusterableConversationObservations had no is_archived filter —
it clustered 4,838 observations of which only 183 were live (MAINT-LIVE
tombstones), building themes on archived debris. Both fetch variants
now exclude archived. Live effect: 24 debris themes swept to 5 clean
ones; second cycle themes_updated=5/created=0 (stable identity on
real data).
3. Coverage gauge gated on CONVERSATION_COVERAGE_MIN_OBS (default 50,
DH-005 confidence-threshold pattern) — tiny scratch/test spaces
(2-13 observations) emitted 0.000 and would have alarmed forever
(born-firing alert hazard). Sentinel -1 skips emission.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(hidden-churn-001): PR-B verification + CHANGELOG + CLAUDE.md — sprint complete (Epic B5)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(supervisor-002): sprint plan + background loop inventory (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): sliding-window restart budget + late registration (Epic 1)
The restart counter only ever incremented — a once-a-week transient
permanently killed a worker after 3 weeks. Budget is now a sliding
window (restarts older than the window are forgotten); permanent
failure requires >SUPERVISOR_MAX_RESTARTS within
SUPERVISOR_RESTART_WINDOW_MIN. New Go() registers+launches workers
after Start (the API server starts its loops late); nil return without
ctx cancellation now means intentional completion, not a restart.
Start() outlives dead workers so late workers stay supervised.
Config: SUPERVISOR_MAX_RESTARTS (3), SUPERVISOR_RESTART_WINDOW_MIN
(60), SUPERVISOR_BACKOFF_BASE_SEC (5).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): register the 12 unsupervised background loops (Epic 2)
Every scheduler/loop goroutine now runs under the goroutine supervisor
(panic recovery + sliding-window restart budget) instead of as a bare
go func() whose panic silently killed the subsystem forever:
- api.Server (6): periodic-consolidation, context-cooler,
space-prune-scheduler, weekly-gap-interviews, scheduled-sync,
rsic-macro-cron — via injected SetSupervisor(sup.Go) + goSupervised
helper (bgWg brackets each run; stop channels remain the graceful
path and return nil = no restart)
- ape (3): rsic-watchdog, rsic-store-flush, signal-learner-flush
- backup schedulers (2): neo4j-backup-scheduler, tsdb-backup-scheduler
— their NewServer construction-time Start() moved to
StartSupervisedBackground() (serve.go) so the hook exists first
- serve.go (1): llm-fastfail-burst-flush via sup.Go
All owners keep a nil-hook fallback (legacy bare goroutine), so tests
and non-server callers are unchanged. Buffered TSDB writers are
explicitly out of scope (TSDB-CONSUME-001 owns flush observability).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): rule-health meta-alert on evaluator query failures (Epic 3)
A rule whose SQL errors was a silently-disabled alert: failures were
logged at Debug and nothing watched the watcher — bitten twice in one
week (HIDDEN-WEIGHT-001 null-weight rule + the recorded_at column bug,
both found by accident in later sprints). Query failures now log at
Warn, and after ALERT_RULE_FAILURE_THRESHOLD (default 3) consecutive
failures a high-severity meta-alert fires directly via the dispatcher
(not via a rule — the meta-channel must not depend on the failing
mechanism). Service label is rule-health-<rule-id> so concurrent
failing rules don't cooldown-suppress each other; success re-arms.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(supervisor-002): recency-gate the RSIC llm_error_rate_spike insight (Epic 4)
Insight 26 computes the error rate over a 24h window with no recency
requirement, so a 35-min jiminy.synthesize timeout burst at 02:00 UTC
kept re-firing HIGH 'LLM error rate spike' (and escalating to CRITICAL
'Jiminy Pipeline Critical') every RSIC micro-cycle for 12+ hours after
the incident self-resolved (live, 2026-06-11). LLMPerformanceSummary
now carries LastErrorAt (MAX(time) over errored rows); the spike
insight fires only when the most recent error is within
RSIC_LLM_ERROR_RECENCY_MIN (default 60; 0 disables the gate). A zero
LastErrorAt (older data source) keeps legacy behavior — the gate never
widens silently.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(supervisor-002): nolint G118 on legacy-fallback loop launches
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(supervisor-002): aggregate evaluator-degraded alert for global outages
A TSDB-level outage fails every rule at once; per-rule meta-alerts
would storm ~19 alerts duplicating the health prober's signal. At the
failure threshold the evaluator now distinguishes: other rules
succeeding recently → per-rule rule-health alert (broken SQL class);
nothing succeeding within threshold×interval → ONE
alert-evaluator-degraded alert per outage. Success re-arms both.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(jiminy): detach feedback outcome processing from the hook's connection lifetime
Live-smoke surprise during SUPERVISOR-002 Epic 5 (own fix-commit per
policy): jiminy.evaluate_llm was failing at 94.9% (657 'context
canceled' rows/24h). The post-tool-observe hook POSTs
/v1/jiminy/feedback with curl --max-time 5, but per-item Tier-2
outcome classification routinely outlives the connection — the request
ctx then cancelled every in-flight LLM call and outcomes silently
degraded to the keyword heuristic. Same defect class as
GUIDANCE-SYNTH-001's warm-path budget.
handleJiminyFeedback now uses context.WithoutCancel(r.Context()) with
its own server-side budget JIMINY_FEEDBACK_TIMEOUT_MS (default 60000,
0 = unbounded). The hook keeps its fire-and-forget 5s curl.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(supervisor-002): streak-relative global-outage discriminator (drill-caught)
The Epic 5 TSDB-stop drill caught the freshness-window heuristic
misclassifying outage ONSET: rules were succeeding seconds before the
stop, so lastAnySuccess was fresh when the first rules hit threshold —
2 per-rule alerts leaked before the aggregate fired. The discriminator
is now streak-relative: per-rule only when some other rule succeeded
AFTER this rule's failure streak began; otherwise global, once per
outage. Unit-pinned with the drill scenario
(TestRuleFailureStreak_OutageOnsetIsGlobal).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(supervisor-002): feature doc + verification + CHANGELOG + CLAUDE.md — sprint complete (Epic 6)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(backup-restore-verify-001): sprint plan (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(backup-restore-verify-001): harden the restore path (Epics 1-4)
Epics 1-4 land together — they share the restore-path signatures.
1. Checksum gate (Epic 1): the manifest SHA-256 was written at backup
time and never read; a corrupted .mdemg restored silently. The gate
now fails closed before import; legacy manifests without a checksum
warn and proceed.
2. Snapshot completion polling (Epic 2): the pre-restore safety
snapshot was raced with time.Sleep(2s) against an async backup
goroutine. waitForBackupJob now polls the jobs queue until
completed, failing closed on failure/cancel/vanish/timeout
(BACKUP_SNAPSHOT_WAIT_TIMEOUT_SEC, default 300).
3. Count validation (Epic 3): manifest NodeCount/EdgeCount are
whole-database counts and cannot validate file contents (they
diverge on partial backups) — new additive file_node_count/
file_edge_count/file_observation_count manifest fields are counted
from the exported chunks; restore re-counts the file and hard-fails
on mismatch (truncation class). Importer accounting divergence
under CONFLICT_SKIP is warn-only, surfaced in a job-result
validation block.
4. dockerbin routing (Epic 4): the legacy .dump restore shelled out to
bare "docker" (the launchd-minimal-PATH class NOSILENT-001 fixed
for TSDB); now routes via dockerbin unless the operator set a
non-default FullCmd.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(backup-restore-verify-001): neo4j-backup jobhealth + generalized staleness rules (Epic 5)
The default-ON Neo4j backup scheduler had zero jobhealth coverage —
the inverse of NOSILENT-001 (which wired only tsdb-backup). The
scheduler now waits on each triggered job (its Trigger is queue-async;
a fire-and-forget report would always claim success) and reports
outcome via SetResultHook → jobhealth.Report with
job_name='neo4j-backup' (wired in SetTSDBClient next to the tsdb
hook). The staleness rule is generalized into a jobStalenessRule
factory; Neo4jBackupStalenessRule (neo4j_backup_no_recent_success,
Service scheduled-job-staleness-neo4j, window = partial interval × 2
unless BACKUP_JOB_STALENESS_HOURS overrides) registers when
BACKUP_ENABLED. The existing tsdb rule is pinned unchanged through the
refactor.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(backup-restore-verify-001): initial backup on start (rule honesty)
With the neo4j_backup_no_recent_success rule registered, a fresh
install would alarm honestly-but-noisily for up to 24h (the scheduler's
first tick). The scheduler now runs an initial partial backup
BACKUP_INITIAL_DELAY_MIN (default 5) minutes after start, so every
install has a backup — and a quiet staleness rule — within minutes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(backup-restore-verify-001): retention was deleting every backup it just made (drill-caught)
The Tier 3 round-trip exposed that the backup system was a no-op for
this database: BACKUP_RETENTION_MAX_STORAGE_GB had a comment/code
default drift (documented 50, code read 2), and with RunAfter=true the
quota pass deleted each 3-4 GB whole-database backup ~80 ms after it
completed (log: 'backup completed' → 'retention cleaned backups
deleted_count=1 freed_bytes=<exactly the new backup>'). Three fixes:
1. Quota retention NEVER deletes the newest backup of each type — a
quota smaller than one backup degrades to 'over quota, keep it'
with a loud warning, not 'delete the only backup'. Sparse-file unit
tests pin both the two-backup and only-backup-oversize shapes.
2. Default quota raised to the documented 50 GB.
3. BACKUP_SNAPSHOT_WAIT_TIMEOUT_SEC default 300 → 3600: the live
whole-database export runs ~15 min; the 5-min wait made the initial
scheduled run report failure (jobhealth correctly recorded it —
the wiring works) while the backup actually completed later.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(transfer): omit empty path/name on import — restores with observations always failed (drill-caught)
The Tier 3 round-trip's real restore failed with
ConstraintValidationFailed: conversation observations carry path=NULL
in Neo4j (which memorynode_path_unique (space_id, path) ignores), but
the exporter serializes NULL as the proto default "" and the importer
wrote the literal empty string unconditionally — so the second
observation node in any restore collided. Every restore containing 2+
observation nodes had always been broken; this was invisible because
no backup had ever been restore-tested (the sprint's premise,
demonstrated). nodeProps now omits empty path/name (null fidelity);
unit-pinned.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(backup-restore-verify-001): feature doc + verification + CHANGELOG + CLAUDE.md — sprint complete (Epic 7)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-storm-001): sprint plan with corrected burst attribution (Epic 0)
Triage correction baked in: the 5,397-node burst was the Context
Cooler via the session-start hook's /v1/conversation/graduate (uncapped
backlog sweep of pre-DH-004 graduation-bug victims), NOT RSIC —
mis-attributed because tombstone_stale stamps no metadata and two
archive-reason property names coexist. RSIC's own issues stand:
trigger-race cycle storm (~20-30k/day) and snapshot/executor predicate
drift (rollback restores nothing).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-storm-001): atomic trigger admission — reserve-on-allow (Epic A)
EvaluateTrigger checked activeCycles/lastTrigger, but both were written
only by RecordTrigger — which callers invoke AFTER RunCycle completes.
For a cycle's entire multi-second duration every concurrent trigger
passed every gate: ~20-30k micro cycles/day live (4 spawning within
50ms of each tool-use burst), the 300s cooldown effectively
nonexistent, llama-server saturated (the recurring synthesize/
evaluate_llm/intent_translate timeout cascades), and RSIC actions
dispatched at storm frequency.
Admission now reserves the active + cooldown (+dedupe) records under
the same lock that performs the checks; RecordTrigger updates the
reservation with the real cycle ID; CompleteCycle clears the active
slot; a failed cycle still cools down. Unit-pinned: 50 concurrent
triggers admit exactly one; cooldown holds from admission through
completion and failure.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-storm-001): attributable archival + unified tombstone predicate (Epics B+C)
Epic B — every archival is now attributable:
- tombstone_stale stamps archived_at + archive_reason
('rsic_tombstone_stale') + archived_cycle_id (bare is_archived made
the 2026-06-11 burst forensics mis-attribute the Context Cooler's
5,397-node sweep to RSIC for hours).
- Canonical property name is archive_reason; concepts.go (the one
archived_reason writer) migrates; historical rows keep the old name
(readers coalesce; no data migration).
- Context Cooler tombstone step capped per run
(COOLER_TOMBSTONE_MAX_PER_RUN, default 500; 0=unlimited) with a loud
cap-reached warning — the incident sweep was a single uncapped run
over the pre-DH-004 volatile backlog via the session-start hook's
graduate call.
Epic C — rollback restores the right nodes:
- The executor and the rollback snapshot now share ONE candidate
predicate (tombstoneStaleCandidates const). RSIC-VALIDATE-001 had
updated only the executor; the snapshot captured the old unlinked
set, so rollback restored nodes that were never archived
(restored_count=0 live). Drift class eliminated, pin-tested.
- Rollback also clears the new attribution fields on restore.
Epics share the tombstone Cypher — combined commit (disclosed).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(rsic-storm-001): feature doc + verification + rollback drill test + CHANGELOG + CLAUDE.md — sprint complete (Epic F)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(rsic-storm-001): commit ExecuteTombstoneStaleForTest wrapper missed from Epic F
The rollback drill test (committed in 2534a28) references this
test-support wrapper, but the Epic F git-add listed the test file and
not internal/ape/task_dispatch.go — CI's integration build failed on
the already-merged PR #435 while local builds passed (the method
existed uncommitted in the working tree). 7-line addition, no behavior
change.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(tsdb): initial backup on start — restart-resetting ticker meant zero backups ever
Alert triage on the (correctly firing) 'No Successful TSDB Backup In
Window' staleness alert found scheduled_job_events has ZERO tsdb-backup
rows: the scheduler's 24h ticker resets on every restart, and a server
restarted more often than the interval never backs up (8 restarts
today alone). Same gap BACKUP-RESTORE-VERIFY-001 fixed for the Neo4j
scheduler. The shared runOnce() now also fires
TSDB_BACKUP_INITIAL_DELAY_MIN (default 10; 0 disables) after start —
the staleness rule's 'never ran' guarantee did its job catching this.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(uats-gap-001): sprint plan (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(jiminy): reformulate returns 400 (not 500) for missing context
Live-caught while probing for the UATS-GAP-001 contract spec: the
service rejects an empty context but the handler surfaced that
request-validation failure as 500 'internal error'. Validate at the
edge like space_id. The /strict reformulation channel (prompt-context
hook in strict mode) was otherwise contract-clean.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(transfer): nil-guard edge identity assertions — one bad edge panicked the whole server (UATS-caught P0)
The UATS suite's backup_trigger spec ran a live export that hit an
edge whose endpoint returned nil fromId/toId — the unchecked
fromID.(string)/toID.(string)/relType.(string) assertions at
exporter.go:641-643 panicked, taking the entire server down mid-run
(launchd restarted it; 167 connection errors in the suite were the
visible symptom). An unexportable edge is now skipped with a warning;
the two parentVal assertion sites hardened with comma-ok for the same
class. An HTTP-triggerable request must never be able to kill the
process.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(uats-gap-001): 8 contract specs for the revived channels + suite hygiene (Epics 1-6)
New specs (27 cases, 100% live pass, hash-stamped): jiminy_strict,
jiminy_reformulate, jiminy_classify (incl. the fail-open contract),
jiminy_warm (202 warming|debounced union), jiminy_latest (warmth-state
union — the Follow-up C strict-JSON surface), admin_breakers_list,
admin_breakers_reset (404-with-available[] discoverability +
mutation-safe reset round-trip), memory_retrieve_sparse_context (first
specs referencing ?sparse=/?sparse_percentile=/?context=auto/
?strict_context= — pins the sparse-debug-fields-are-optional client
contract).
Suite hygiene: 8 pre-existing env-conditional *_disabled variants
converted to skip variants (they pass only with Jiminy/J17 OFF; the
live stack runs them ON), and j17_protocol_learn's missing-fields
guard fixed — the runner deep-merges variant bodies, so the original
key-omission body merged into a complete request and asserted nothing;
now empty-string overrides. Conventions recorded in CLAUDE.md.
make test-api: 0 failed, 0 errors, 425 hashes verified.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(uats-gap-001): tag breaker reset round-trip llm_required — CI has no openai-embeddings breaker
PR #437 CI caught the deployment dependence the spec description
already flagged: the reset_round_trip variant targets the
openai-embeddings breaker, which exists only when the openai embedding
provider is configured — CI runs provider-less and 404'd. Tagged with
the suite's established llm_required exclude tag: CI (which excludes
it) skips the variant; the live stack still exercises the full
round-trip. Verified both modes locally (27/27 untagged; 26/0 with the
CI exclude set).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(uxts-ci-001): sprint plan (Epic 0)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(uxts-ci-001): TSDB in CI, un-zombie UOTS, delete UVTS step, ULTS hash gate live (Epics 1-3a)
CI: timescaledb service (compose-pinned image) + TSDB_* env; tsdb tag
un-excluded from UATS; UOTS converted from its since-inception
continue-on-error zombie to merge-blocking; UVTS step DELETED per the
roadmap's 'un-zombie or delete' option (semantic grading with a stub
embedder and no seeded corpus measures nothing — UVTS gates retrieval
changes via the documented live flow); new merge-blocking gates: ULTS
prompt-hash verification + UBENCH dataset↔holdout contract.
Arming the gates immediately caught real rot, each fixed here:
- ULTS runner could not parse annotated source locations
('file.go:74 (rendered with ...)') — 5/17 specs failed on a parser
artifact; the 'dynamic' sentinel was rejected inside hash arrays.
- With the parser fixed, FOUR genuine prompt drifts surfaced
(ape.reflect, hidden.reclassify, jiminy.evaluate_llm,
retrieval.rerank_cross) — prompts evolved through shipped sprints
with no spec updates; re-pinned, and the two full+compact array
specs now carry per-variant source locations (the single shared
source made the compact hash unverifiable by construction).
- UOTS grafana_neo4j_dashboard spec was stale against the
GRAFANA-AUDIT-001 dashboard restructure (renamed panels, per-space
graph_* queries no longer on that dashboard) — invisible while the
step was a zombie; refreshed to current truth.
- UOTS 'artifact' spec type was unimplemented (the runner honestly
failed tsdb_backup_manifest); implemented: glob + required-field
type/pattern/RFC3339 validation, absence = pass for
merge_blocking=false (artifacts come from scheduled jobs).
Local proof: ULTS 17/17 exit 0; UOTS 11/11 exit 0; UBENCH contract
green.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(uxts-ci-001): neural pytest+ruff CI job — first run caught two silent failures (Epic 3b)
~610 neural/training tests had zero CI. The new job (ubuntu; mlx-only
modules excluded — Apple Silicon dependency; torch-heavy sidecar app
tests stay local, disclosed) immediately surfaced rot:
- test_distill_driver expected the MoE-era teacher label
'synth-qwen3.6-local'; the code moved to qwen3-14b at the 2026-04-22
pivot — the test had been failing silently ever since.
- test_quality_filter assumed hidden.reclassify's prompt hash is the
scalar 'dynamic'; the spec evolved to a [hash, 'dynamic'] array.
Plus ruff cleanup across neural/ (32 findings → 0): dead batched
mx.arrays in the GRPO adapter (superseded by per-item stash reads),
dead retry-status stores, semicolon statements, ambiguous names.
610 passed / ruff clean locally.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(uxts-ci-001): drift checker covers all 15 frameworks, matrix refreshed, gate un-zombied (Epic 4)
verify_uxts_drift.py gains the 5 omitted frameworks (ULTS, UITS, UTDS,
UAITS, UBENCH — UNTS remains registry-based with its own merge-blocking
verify-hashes step in ci.yml). UXTS_FRAMEWORK_MATRIX.md refreshed from
51 days stale to on-disk truth (UATS 124→220, USTS pilot→active 3→5,
UVTS 1→3 + live-gated status, +UBENCH row, 7 rows corrected). The
drift step in uxts-canonical-specs.yml was ITSELF a continue-on-error
zombie — now merge-blocking; local run passes across …
Implement the
ApplyCoactivation()function inmdemg_build/service/internal/learning/service.goto create and strengthen CO_ACTIVATED_WITH edges between retrieved nodes using Hebbian learning principles ("neurons that fire together wire together").This function is called after retrieval returns results and is responsible for the learning/memory consolidation aspect of the MDEMG system.
Summary by CodeRabbit
Release Notes
New Features
Improvements
Tests
✏️ Tip: You can customize this high-level summary in your review settings.