feat(multi-process): per-bot process isolation via PM2 + BOT_NAME #244

Open
uestney wants to merge 3 commits into xvirobotics:main from uestney:feat/multi-process-bot-architecture

Conversation

@uestney

@uestney uestney commented May 8, 2026

Contributor

Summary

Three coordinated changes that together let each bot run in its own Node.js process instead of multiplexing all bots in one:

  1. feat(session): support METABOT_DATA_DIR for per-bot data isolation

    • session-manager.ts reads the METABOT_DATA_DIR env var when SESSION_STORE_DIR is unset (falls back to ~/.metabot)
    • Each bot's PM2 entry sets it to ~/.metabot/<name>/ so session files don't collide
  2. feat(multi-process): per-bot PM2 process isolation via BOT_NAME filter

    • src/config.ts — when BOT_NAME env is set, loadAppConfig() filters all platform arrays (feishu/telegram/web/wechat) down to just the matching bot
    • ecosystem.config.cjs — replaces the single metabot PM2 entry with one app per bot read from bots.json. Each gets its own port range, data dir, log files
    • Single-process mode still works (just leave BOT_NAME unset)
  3. feat(bridge): handle deferred project switch via pending-switch.json

    • In multi-process mode, switching a bot's working directory must coordinate across processes
    • pending-switch.json is written by the orchestrator before it restarts the target bot; the bot reads it on startup and applies the switch, deferring to the next incoming message when chatId is unknown
    • Avoids tearing down/restarting other unrelated bots

Why these go together

The three changes are strongly coupled — METABOT_DATA_DIR is the bridge between the PM2 config and per-bot session storage; without it the multi-process setup either crashes or shares state. pending-switch.json is only meaningful once you have multiple processes.

Compatibility

Backward-compatible: with no BOT_NAME env var and the original single-app PM2 entry, behavior is identical to today.

Test plan

  • npm run build passes
  • npm test — 219/219 real tests pass
  • PM2 startup with bots.json containing 7 bots → confirm 7 separate processes, distinct ports
  • Restart single bot → other bots untouched
  • Switch bot working directory while other bots are mid-task → no disruption

uestney and others added 3 commits May 8, 2026 14:58
When running multiple MetaBot instances as separate processes (e.g. via PM2
with a per-bot ecosystem entry), users may want each bot's session metadata
stored under a dedicated subdirectory: ~/.metabot/<bot-name>/sessions-<bot>.json.

Previously the data directory was hardcoded to SESSION_STORE_DIR or ~/.metabot.
This patch adds METABOT_DATA_DIR as an additional fallback, allowing a per-bot
ecosystem to set this env var without affecting other bots or the default mode.

Resolution order: SESSION_STORE_DIR → METABOT_DATA_DIR → ~/.metabot
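
The resolution order above can be sketched as a small pure function. This is a minimal illustration, not the actual session-manager.ts code: the name `resolveDataDir` and its parameters are hypothetical, and the environment and home directory are passed in so the example does not depend on the host machine.

```typescript
// Hypothetical helper; names are illustrative, not taken from session-manager.ts.
type Env = Record<string, string | undefined>;

function resolveDataDir(env: Env, homeDir: string): string {
  // SESSION_STORE_DIR wins, then METABOT_DATA_DIR, then the ~/.metabot default.
  // Treating an empty string as "unset" is an assumption of this sketch.
  return env.SESSION_STORE_DIR || env.METABOT_DATA_DIR || `${homeDir}/.metabot`;
}

// A per-bot PM2 entry would set only METABOT_DATA_DIR:
console.log(resolveDataDir({ METABOT_DATA_DIR: "/home/u/.metabot/alpha" }, "/home/u"));
```

Because SESSION_STORE_DIR is checked first, an existing deployment that already sets it keeps its current behavior.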

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds support for running each Feishu bot in its own Node.js process instead of
multiplexing all bots in one. Useful when:
- Restarting a single bot shouldn't disrupt others
- Bots have very different memory/CPU profiles
- Different bots need to be on different code branches/versions

Two coordinated changes:

1. src/config.ts — When BOT_NAME env var is set, loadAppConfig() filters all
   four platform arrays (feishu/telegram/web/wechat) down to just the matching
   bot. Throws if no bot matches. Existing single-process mode (no BOT_NAME)
   is unaffected.

2. ecosystem.config.cjs — Replaces the single 'metabot' entry with one PM2
   app per bot read from bots.json. Each app gets:
   - BOT_NAME=<name> for filtering
   - Distinct API_PORT (base + index*3) and MEMORY_PORT (base + index*3 + 1)
   - METABOT_DATA_DIR=~/.metabot/<name>/ for data isolation
   - MEMORY_DATABASE_DIR + META_MEMORY_URL pointing at the bot's own port
   - Per-bot log files (logs/<name>-{out,error}.log)
   - API_PORT_BASE configurable via .env (default 10001)
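
The per-bot allocation listed above can be sketched as follows. This is an illustration under assumptions, not the contents of ecosystem.config.cjs: `buildBotApps` and `BotApp` are hypothetical names, the `http://localhost:<port>` form of META_MEMORY_URL is a guess, and MEMORY_DATABASE_DIR is left out because the commit message does not say where it points. `name`, `env`, `out_file`, and `error_file` are standard PM2 app fields.

```typescript
// Hypothetical shape of one generated PM2 app entry.
interface BotApp {
  name: string;
  out_file: string;
  error_file: string;
  env: {
    BOT_NAME: string;
    API_PORT: number;
    MEMORY_PORT: number;
    METABOT_DATA_DIR: string;
    META_MEMORY_URL: string;
  };
}

function buildBotApps(botNames: string[], homeDir: string, apiPortBase = 10001): BotApp[] {
  return botNames.map((name, index) => {
    // Each bot owns a stride of three ports starting at base + index*3.
    const apiPort = apiPortBase + index * 3;
    const memoryPort = apiPortBase + index * 3 + 1;
    return {
      name,
      out_file: `logs/${name}-out.log`,
      error_file: `logs/${name}-error.log`,
      env: {
        BOT_NAME: name,
        API_PORT: apiPort,
        MEMORY_PORT: memoryPort,
        METABOT_DATA_DIR: `${homeDir}/.metabot/${name}`,
        META_MEMORY_URL: `http://localhost:${memoryPort}`, // assumed URL format
      },
    };
  });
}

for (const app of buildBotApps(["alpha", "beta"], "/home/u")) {
  console.log(app.name, app.env.API_PORT, app.env.MEMORY_PORT);
}
```

With the default base of 10001, the first bot gets ports 10001 and 10002 and the second gets 10004 and 10005. The third port in each stride is not described in the commit message.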

Single-process mode still works — just leave BOT_NAME unset and run with the
old ecosystem entry pattern.
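
The BOT_NAME filter described in item 1 can be sketched like this. The `AppConfig` shape, the `name` field on each bot, and the function name `filterByBotName` are assumptions made for illustration; the real loadAppConfig() in src/config.ts may be structured differently.

```typescript
// Assumed config shape: one array of bots per platform, each bot with a name.
interface BotConfig { name: string }
interface AppConfig {
  feishu: BotConfig[];
  telegram: BotConfig[];
  web: BotConfig[];
  wechat: BotConfig[];
}

function filterByBotName(config: AppConfig, botName: string | undefined): AppConfig {
  // Single-process mode: BOT_NAME unset, so every bot stays in the config.
  if (!botName) return config;

  const keep = (bots: BotConfig[]): BotConfig[] => bots.filter((b) => b.name === botName);
  const filtered: AppConfig = {
    feishu: keep(config.feishu),
    telegram: keep(config.telegram),
    web: keep(config.web),
    wechat: keep(config.wechat),
  };

  // Fail fast when the name matches nothing, as the commit message describes.
  const total =
    filtered.feishu.length + filtered.telegram.length + filtered.web.length + filtered.wechat.length;
  if (total === 0) {
    throw new Error(`BOT_NAME="${botName}" does not match any configured bot`);
  }
  return filtered;
}
```

Throwing on a missing match surfaces a typo in a PM2 entry at startup rather than leaving an idle process with no bots.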

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds handling for an external workflow (e.g. an `mb switch <bot>` script or an
`/api/bots/:name/switch` endpoint) that wants to restart a bot pointed at a new
working directory while resuming a specific Claude session and surfacing the
last few messages back to the user.

Mechanism:

1. The external tool writes ~/.metabot/<bot>/pending-switch.json before
   restarting the bot:
     { chatId?, workDir, sessionId, recentHistory: [{role, content}, ...] }

2. On startup, src/index.ts reads the file and either:
   - injects sessionId immediately if chatId is known + pushes a notice card
     listing the recent history, OR
   - if chatId is unknown, defers the notice into bridge.pendingSwitchNotice
     to be consumed by the next incoming message.

3. handleMessage() consumes pendingSwitchNotice on the first message after a
   switch, injecting the sessionId before the message goes through the normal
   command/query pipeline.

4. The file is unlinked after processing so the next restart is clean.
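
Steps 2 and 3 above can be sketched with an in-memory stand-in for the bridge. Everything here beyond the payload fields and the `pendingSwitchNotice` name is hypothetical: `BridgeSketch`, its methods, the string type of `chatId`, and the notice format are illustrative only, and reading and unlinking the file (steps 1 and 4) are left out.

```typescript
interface HistoryEntry { role: string; content: string }

// Payload fields as listed in step 1; chatId is optional.
interface PendingSwitch {
  chatId?: string;
  workDir: string;
  sessionId: string;
  recentHistory: HistoryEntry[];
}

// In-memory stand-in; the real bridge class is not shown in this PR text.
class BridgeSketch {
  pendingSwitchNotice: PendingSwitch | null = null;
  sessions = new Map<string, string>(); // chatId -> sessionId
  notices: string[] = [];

  // Step 2: called once at startup with the parsed file contents.
  applyOnStartup(p: PendingSwitch): void {
    if (p.chatId !== undefined) {
      this.inject(p.chatId, p); // chat known: apply immediately
    } else {
      this.pendingSwitchNotice = p; // chat unknown: defer to the next message
    }
  }

  // Step 3: consume a deferred switch before the normal pipeline runs.
  handleMessage(chatId: string, text: string): string {
    if (this.pendingSwitchNotice) {
      const p = this.pendingSwitchNotice;
      this.pendingSwitchNotice = null;
      this.inject(chatId, p);
    }
    return `${this.sessions.get(chatId) ?? "new-session"}: ${text}`;
  }

  private inject(chatId: string, p: PendingSwitch): void {
    this.sessions.set(chatId, p.sessionId);
    const lines = p.recentHistory.map((h) => `${h.role}: ${h.content}`);
    this.notices.push(`Switched to ${p.workDir}\n${lines.join("\n")}`);
  }
}
```

Clearing `pendingSwitchNotice` before injecting means only the first message after a switch triggers the notice, which matches the "consumed by the next incoming message" wording.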

This is the missing half of a project-switch feature: ecosystem.config.cjs
already supports per-bot data dirs, but until now Claude wouldn't resume the
prior session after a workdir change, and the user got no indication that a
switch had happened.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@floodsung

Contributor

Architecturally clean — BOT_NAME filter, pending-switch.json lifecycle, and per-bot port allocation all hold up under inspection. Two blockers and one design question before merge:

Blocker 1: Inherits the Windows tsx wrapper bug

The new ecosystem.config.cjs still spawns tsx via path.join(ROOT, 'node_modules', '.bin', 'tsx') — same broken pattern that #234 reports and that just got fixed in #245 (node --import tsx cross-platform invocation). If this PR lands as-is, it re-introduces the Windows EINVAL on every bot's interpreter. Please rebase onto main once #245 is merged and switch to interpreter: 'node', interpreter_args: '--import tsx' for each generated app entry.
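
A generated app entry using the requested invocation might look like the sketch below. `interpreter` and `interpreter_args` are standard PM2 app options; the helper name `makeBotApp`, the use of the bot name as the PM2 app name, and the `src/index.ts` entry script are assumptions for illustration.

```typescript
// Hypothetical per-bot entry; only the interpreter fields come from the review.
interface Pm2App {
  name: string;
  script: string;
  interpreter: string;
  interpreter_args: string;
  env: Record<string, string>;
}

function makeBotApp(botName: string): Pm2App {
  return {
    name: botName,
    script: "src/index.ts", // assumed entry point
    // Run node directly and load tsx through --import, instead of spawning
    // the node_modules/.bin/tsx wrapper.
    interpreter: "node",
    interpreter_args: "--import tsx",
    env: { BOT_NAME: botName },
  };
}

console.log(makeBotApp("alpha"));
```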

Blocker 2: Should land after #243

PR #243 removes the SQLite registry. If #244 lands first, every per-bot process holds its own SQLite handle on the same sessions.db, and the WAL contention is exactly what #243 was meant to eliminate. Recommended merge order: #243, then #244.

Design question: default deployment model

This rewrites the default ecosystem.config.cjs to spawn one process per bot. That's a real change in deployment semantics — single-process is still possible (just don't use this file), but anyone who runs pm2 start ecosystem.config.cjs after the upgrade will silently get N processes instead of 1.

Is that the intended default, or should multi-process be opt-in (e.g. ship ecosystem.multi.config.cjs separately and keep the existing single-process file)? Either is defensible — single-process is simpler for first-time installs, multi-process is more robust for production. Need a call from @floodsung before this merges. The README and CLAUDE.md will also need an update to document the new mode.

Smaller things (non-blocking)

  • The pending-switch.json lifecycle is fine for the documented flow, but a brief comment in src/index.ts explaining "if the process crashes between read and unlink, the file persists and is re-applied on next restart — idempotent" would save the next reader 5 minutes.
  • No tests added for the deferred-switch path. Worth one integration test that drops a pending-switch.json, starts a bot, sends a message, asserts the switch was applied.
