Daemon self-terminating TTL + global status, honest HNSW/init footguns (carry-forward from pacphi/ruflo-machine-ref) #2360

@shaal

Summary

Carry-forward of the still-open, fixable-in-ruflo findings from pacphi/ruflo-machine-ref — a rigorous, reproduction-driven investigation kit by @pacphi that documents reliability and token-consumption problems in ruflo on modern Node (24/26) and the honesty of its self-learning/security surface.

🙏 Huge thanks to @pacphi for the depth here: root-cause docs, Docker reproductions, a held-out multi-seed statistical eval harness, and a cross-session token audit. Most of the kit's findings have already landed upstream (see the reconciliation table below); this issue tracks only what remains and is fixable in this repo.

A PR accompanies this issue.

Already fixed upstream (no action needed — verified against 3.10.42 source)

| Finding | Status |
| --- | --- |
| route feedback never persisted (no saveModel()) | ✅ 3.10.6 (#2222, @pacphi credited) |
| Negative-reward inversion (-r -1.0 parsed as +1.0) | ✅ 3.10.7 |
| Stale route cache hid learning / --explore false ignored | ✅ 3.10.8 |
| #2219 Node 24/26 better-sqlite3 silent WASM fallback → data loss | ✅ 3.10.6 (better-sqlite3 ≥12.8.0 override) |
| #2239 / F3 Q-state encoder collapse (keyword block discarded) | 3.10.11 (FNV-1a lossless fold) |
| security defend .color crash after detection | ✅ already guarded (` |
| security cve --list silent empty | ✅ already prints honest "no DB → use npm audit" |
| .mcp.json writes ruv-swarm+flow-nexus by default | ✅ default init writes only claude-flow (cloud is --full-only) |
| Fabricated Flash-Attention metric | ✅ removed 3.10.7 |

Still open and fixable in ruflo (addressed by the PR)

1. Background daemon has no self-terminating lifecycle → multi-day token leak (primary)

The daemon runs interval workers (audit ~30m, optimize/testgaps ~60m, map/consolidate/…), each spawning a headless claude --print sweep. The daemon never self-terminates and daemon status only inspects the current workspace, so leaked daemons in other projects are invisible.

@pacphi's token audit traced a Max-plan quota burned in 1–2 days to 6 immortal daemons (oldest 19 days), and a later recurrence to 17 per-project daemons (34,533 total worker runs) — ~94% of token spend was background Haiku/Sonnet machinery vs ~6% interactive. Evidence: token-consumption-findings, recurrence.

Fix: self-terminating TTL (default 12h) + opt-in idle shutdown inside the daemon process, a --ttl flag (0 disables), and daemon status --all for a global, leak-surfacing view.
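A minimal sketch of how the TTL and idle checks could be wired inside the daemon process. The 12h default, the "0 disables" rule, and the RUFLO_DAEMON_TTL_SECS variable come from this issue. Everything else is an illustrative assumption, not the PR's actual code: the function and type names, the flag-over-env precedence, and the fallback to the default on unparseable input.

```typescript
// Hypothetical names; not taken from the ruflo source.
const DEFAULT_TTL_SECS = 12 * 60 * 60; // 12h default stated in this issue

interface LifecycleOptions {
  ttlSecs: number;   // 0 disables the TTL
  idleSecs?: number; // opt-in; undefined or 0 disables idle shutdown
}

type Verdict = "run" | "ttl-expired" | "idle";

/** Resolve the TTL. Assumed precedence: --ttl flag, then env var, then default. */
function resolveTtlSecs(
  flag: string | undefined,
  env: Record<string, string | undefined>,
): number {
  const raw = flag ?? env["RUFLO_DAEMON_TTL_SECS"];
  if (raw === undefined || raw.trim() === "") return DEFAULT_TTL_SECS;
  const n = Number(raw);
  // Assumption: unparseable or negative input falls back to the default.
  return Number.isFinite(n) && n >= 0 ? Math.floor(n) : DEFAULT_TTL_SECS;
}

/** Pure decision function, so the policy can be tested without timers. */
function lifecycleVerdict(
  opts: LifecycleOptions,
  startedAtMs: number,
  lastActivityMs: number,
  nowMs: number,
): Verdict {
  if (opts.ttlSecs > 0 && nowMs - startedAtMs >= opts.ttlSecs * 1000) {
    return "ttl-expired";
  }
  const idle = opts.idleSecs ?? 0;
  if (idle > 0 && nowMs - lastActivityMs >= idle * 1000) return "idle";
  return "run";
}

/** Periodically re-evaluate the verdict and call onExit once when it is not "run". */
function startLifecycleWatch(
  opts: LifecycleOptions,
  onExit: (reason: Verdict) => void,
  checkEveryMs = 60_000,
) {
  const startedAt = Date.now();
  let lastActivity = startedAt;
  const timer = setInterval(() => {
    const verdict = lifecycleVerdict(opts, startedAt, lastActivity, Date.now());
    if (verdict !== "run") {
      clearInterval(timer);
      onExit(verdict);
    }
  }, checkEveryMs);
  return {
    touch: () => { lastActivity = Date.now(); }, // call after each worker run
    stop: () => clearInterval(timer),
  };
}
```

Keeping the decision in a pure function means the expiry rules can be unit-tested with fixed timestamps, and the check lives in the daemon itself rather than relying on an external supervisor to reap it.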

2. neural status falsely reports "HNSW Not loaded"

getHNSWStatus() reports availability off a lazy in-process singleton that neural status never warms, so it prints "Not loaded — @ruvector/core not available" even when the package is installed and exposes VectorDb. Cosmetic but misleading (it reads as real dormancy). Fix: report real capability (@ruvector/core resolvable) separately from in-process load.
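One way to separate the two signals, sketched under assumptions: "installed" is approximated by whether the package resolves from a given directory, and "loaded" is whatever the in-process singleton reports. The names isResolvable, HnswStatus, and describeHnsw, and the status wording, are hypothetical and not the existing getHNSWStatus() API.

```typescript
import { createRequire } from "node:module";
import { join } from "node:path";

// Hypothetical shape; the real status object in ruflo may differ.
interface HnswStatus {
  installed: boolean; // package can be resolved from the project
  loaded: boolean;    // lazy in-process singleton has been initialised
}

/** True when `pkg` resolves from `fromDir`, without importing it. */
function isResolvable(pkg: string, fromDir: string = process.cwd()): boolean {
  try {
    // createRequire needs an absolute file path; the file need not exist.
    createRequire(join(fromDir, "noop.js")).resolve(pkg);
    return true;
  } catch {
    // Note: a package with a restrictive "exports" map can also land here.
    return false;
  }
}

/** Render a status line that does not conflate "not warmed" with "not installed". */
function describeHnsw(status: HnswStatus): string {
  if (status.loaded) return "HNSW: Loaded";
  if (status.installed) return "HNSW: Available (not loaded in this process)";
  return "HNSW: Not available (@ruvector/core not installed)";
}
```

With this split, neural status could call something like isResolvable("@ruvector/core") for the capability column and leave the lazy singleton untouched, so reporting status never forces a load.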

3. ruflo init cloud-MCP + CLAUDE.md token footguns

--full still writes auth-gated cloud MCP servers (ruv-swarm, flow-nexus) into a committed .mcp.json, and each one adds per-session MCP tool-definition token cost. The generated per-project CLAUDE.md also still uses the legacy claude mcp add claude-flow -- npx -y @claude-flow/cli@latest, which is inconsistent with the ruflo@latest mcp start command that the actual .mcp.json now emits. Fix: gate cloud servers behind an explicit --cloud-mcp even under --full; emit the ruflo@latest command and a note about the daemon's token cost.
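The gating can be sketched as a small builder in which --full alone never adds cloud servers. The mcpServers layout follows the usual .mcp.json shape. The npx argument list for claude-flow is an assumption based on the ruflo@latest mcp start command mentioned above, and the cloud server entries are deliberate placeholders, since their real commands are not given in this issue.

```typescript
// Hypothetical types and builder; not the actual ruflo init code.
interface McpServer { command: string; args: string[] }
interface McpConfig { mcpServers: Record<string, McpServer> }
interface InitFlags { full: boolean; cloudMcp: boolean }

function buildMcpConfig(flags: InitFlags): McpConfig {
  const mcpServers: Record<string, McpServer> = {
    // Assumed invocation form for "ruflo@latest mcp start".
    "claude-flow": { command: "npx", args: ["-y", "ruflo@latest", "mcp", "start"] },
  };

  // flags.full is intentionally not consulted here: cloud servers need the
  // explicit opt-in even when --full is passed.
  if (flags.cloudMcp) {
    // Placeholder args; the real start commands are not stated in the issue.
    mcpServers["ruv-swarm"] = { command: "npx", args: ["<ruv-swarm start command>"] };
    mcpServers["flow-nexus"] = { command: "npx", args: ["<flow-nexus start command>"] };
  }
  return { mcpServers };
}
```

Under this sketch, buildMcpConfig({ full: true, cloudMcp: false }) yields only the claude-flow entry, which matches the acceptance criterion that --full no longer writes cloud servers on its own.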

Tracked, but NOT fixable in this repo

F4 — SONA learn→inference loop is a no-op stub (dependency-only)

The trained LoRA delta is never consumed in any decision (deltaNorm stays 0): processInstantLearning in the published @ruvector/ruvllm is an empty stub and WasmSonaEngine::learn_from_feedback is a no-op. This lives in ruvnet/ruvector (RuVector#553; #519 closed without a published fix), not here. ruflo consumes the API write-only (intelligence.ts recordTrajectory/runBackgroundLoop), so the loop comes alive automatically once the stub is replaced — ruflo's only action is a future dep bump of @ruvector/ruvllm in v3/@claude-flow/providers/package.json. No ruflo code change is useful until then.

Separately worth noting: ruflo's own JS-fallback LoRA in services/ruvector-training.ts (adapt_with_reward) is a random perturbation rather than a real gradient on the WASM-unavailable path. Different code path from F4; can be tracked on its own if desired.

Acceptance criteria

  • A daemon left running self-terminates after its TTL (default 12h; --ttl 0 / RUFLO_DAEMON_TTL_SECS=0 disables); optional idle shutdown.
  • daemon status --all lists daemons across all workspaces with age + TTL and flags stale ones.
  • neural status no longer prints a false "HNSW Not loaded" when @ruvector/core is installed.
  • --full no longer writes cloud MCP servers unless --cloud-mcp is passed; generated CLAUDE.md uses ruflo@latest.
  • F4 documented as a ruvector dependency item.
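To make the daemon status --all criterion concrete, here is a hedged sketch of the stale-flagging step. The issue does not define "stale" or say where the cross-workspace daemon records are stored, so both are assumptions: a daemon is treated as stale when it has outlived its own TTL, or, with the TTL disabled, when it is older than a warning threshold that defaults to 12h. All names are hypothetical.

```typescript
// Hypothetical record shape; the real registry format is not specified here.
interface DaemonRecord {
  workspace: string;
  pid: number;
  startedAtMs: number;
  ttlSecs: number; // 0 means the TTL is disabled
}

interface DaemonRow {
  workspace: string;
  pid: number;
  ageSecs: number;
  ttlSecs: number;
  stale: boolean;
}

const WARN_AGE_SECS = 12 * 60 * 60; // assumed threshold for TTL-less daemons

/** Build the rows for a global view, oldest daemon first. */
function summarizeDaemons(
  records: DaemonRecord[],
  nowMs: number,
  warnAgeSecs: number = WARN_AGE_SECS,
): DaemonRow[] {
  return records
    .map((r) => {
      const ageSecs = Math.max(0, Math.floor((nowMs - r.startedAtMs) / 1000));
      // Past its own TTL it should already have exited; with no TTL, fall
      // back to the warning threshold so immortal daemons still surface.
      const limit = r.ttlSecs > 0 ? r.ttlSecs : warnAgeSecs;
      return {
        workspace: r.workspace,
        pid: r.pid,
        ageSecs,
        ttlSecs: r.ttlSecs,
        stale: ageSecs >= limit,
      };
    })
    .sort((a, b) => b.ageSecs - a.ageSecs);
}
```

Sorting oldest first puts multi-day leaks like the 19-day daemon described above at the top of the listing.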

Filed alongside a PR. All credit for the underlying investigation, reproductions, and the token-audit method belongs to @pacphi: https://github.com/pacphi/ruflo-machine-ref
