feat(multi-process): per-bot process isolation via PM2 + BOT_NAME by uestney · Pull Request #244 · xvirobotics/metabot

uestney · 2026-05-08T15:05:12Z

Summary

Three coordinated changes that together let each bot run in its own Node.js process instead of multiplexing all bots in one:

1. feat(session): support METABOT_DATA_DIR for per-bot data isolation
   - session-manager.ts resolves the data directory as SESSION_STORE_DIR → METABOT_DATA_DIR → ~/.metabot
   - Each bot's PM2 entry sets METABOT_DATA_DIR to ~/.metabot/<name>/ so session files don't collide
2. feat(multi-process): per-bot PM2 process isolation via BOT_NAME filter
   - src/config.ts: when the BOT_NAME env var is set, loadAppConfig() filters all platform arrays (feishu/telegram/web/wechat) down to just the matching bot
   - ecosystem.config.cjs: replaces the single metabot PM2 entry with one app per bot read from bots.json. Each app gets its own port range, data dir, and log files
   - Single-process mode still works (just leave BOT_NAME unset)
3. feat(bridge): handle deferred project switch via pending-switch.json
   - In multi-process mode, switching a bot's working directory must coordinate across processes
   - pending-switch.json is dropped by the orchestrator before it restarts the target bot; the bot reads it on startup and applies the switch either immediately or on the next incoming message
   - Avoids tearing down or restarting unrelated bots

Why these go together

The three changes are strongly coupled — METABOT_DATA_DIR is the bridge between the PM2 config and per-bot session storage; without it the multi-process setup either crashes or shares state. pending-switch.json is only meaningful once you have multiple processes.

Compatibility

Backward-compatible: with no BOT_NAME env var and the original single-app PM2 entry, behavior is identical to today.
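
The compatibility claim rests on the filter being a no-op when BOT_NAME is unset. A minimal sketch of that behavior, assuming a simplified config shape; `BotEntry`, `AppConfig`, and `filterByBotName` are hypothetical names for illustration and not the actual internals of loadAppConfig():

```typescript
// Hypothetical, simplified shapes; the real config in src/config.ts has more fields.
interface BotEntry {
  name: string;
}

interface AppConfig {
  feishu: BotEntry[];
  telegram: BotEntry[];
  web: BotEntry[];
  wechat: BotEntry[];
}

// Returns the config unchanged when botName is unset (single-process mode).
// Otherwise keeps only the matching bot in every platform array and throws
// when no platform contains a bot with that name.
function filterByBotName(config: AppConfig, botName?: string): AppConfig {
  if (!botName) return config;

  const keep = (bots: BotEntry[]): BotEntry[] =>
    bots.filter((bot) => bot.name === botName);

  const filtered: AppConfig = {
    feishu: keep(config.feishu),
    telegram: keep(config.telegram),
    web: keep(config.web),
    wechat: keep(config.wechat),
  };

  const matches =
    filtered.feishu.length +
    filtered.telegram.length +
    filtered.web.length +
    filtered.wechat.length;
  if (matches === 0) {
    throw new Error(`BOT_NAME=${botName} does not match any configured bot`);
  }
  return filtered;
}
```

Returning the same object when `botName` is unset keeps the single-process path free of any copying or filtering, which is one way to make "identical to today" easy to verify.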

Test plan

- npm run build passes
- npm test: 219/219 real tests pass
- PM2 startup with bots.json containing 7 bots → confirm 7 separate processes, distinct ports
- Restart single bot → other bots untouched
- Switch bot working directory while other bots are mid-task → no disruption

When running multiple MetaBot instances as separate processes (e.g. via PM2 with a per-bot ecosystem entry), users may want each bot's session metadata stored under a dedicated subdirectory: ~/.metabot/<bot-name>/sessions-<bot>.json.

Previously the data directory was hardcoded to SESSION_STORE_DIR or ~/.metabot. This patch adds METABOT_DATA_DIR as an additional fallback, allowing a per-bot ecosystem to set this env var without affecting other bots or the default mode.

Resolution order: SESSION_STORE_DIR → METABOT_DATA_DIR → ~/.metabot

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
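
The resolution order can be sketched as a small pure function. This is an illustration only: `resolveDataDir` is a hypothetical name, the environment and home directory are passed in as arguments to keep the sketch testable, and treating an empty string as "unset" is an assumption that the commit message does not confirm.

```typescript
// Resolves the session data directory in the order given in the commit message:
// SESSION_STORE_DIR, then METABOT_DATA_DIR, then <home>/.metabot.
// Assumption: an empty-string value is treated the same as an unset variable.
function resolveDataDir(
  env: Record<string, string | undefined>,
  homeDir: string,
): string {
  return (
    env["SESSION_STORE_DIR"] ||
    env["METABOT_DATA_DIR"] ||
    `${homeDir}/.metabot`
  );
}
```

In a Node.js process this would typically be called as `resolveDataDir(process.env, os.homedir())`. Because SESSION_STORE_DIR still wins, an existing deployment that sets it keeps its current directory even if METABOT_DATA_DIR is also present.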

Adds support for running each Feishu bot in its own Node.js process instead of multiplexing all bots in one. Useful when:
- Restarting a single bot shouldn't disrupt others
- Bots have very different memory/CPU profiles
- Different bots need to be on different code branches/versions

Two coordinated changes:

1. src/config.ts: When the BOT_NAME env var is set, loadAppConfig() filters all four platform arrays (feishu/telegram/web/wechat) down to just the matching bot. Throws if no bot matches. Existing single-process mode (no BOT_NAME) is unaffected.
2. ecosystem.config.cjs: Replaces the single 'metabot' entry with one PM2 app per bot read from bots.json. Each app gets:
   - BOT_NAME=<name> for filtering
   - Distinct API_PORT (base + index*3) and MEMORY_PORT (base + index*3 + 1)
   - METABOT_DATA_DIR=~/.metabot/<name>/ for data isolation
   - MEMORY_DATABASE_DIR + META_MEMORY_URL pointing at the bot's own port
   - Per-bot log files (logs/<name>-{out,error}.log)
   - API_PORT_BASE configurable via .env (default 10001)

Single-process mode still works: just leave BOT_NAME unset and run with the old ecosystem entry pattern.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
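
The port arithmetic above can be made concrete with a short sketch. `portsForBot` and `envForBot` are hypothetical helpers, not code from ecosystem.config.cjs; the sketch assumes both ports share the same base (API_PORT_BASE, default 10001) and omits MEMORY_DATABASE_DIR and META_MEMORY_URL because their exact values are not given in the commit message.

```typescript
interface BotPorts {
  apiPort: number;
  memoryPort: number;
}

// Each bot gets a stride of 3 ports starting at apiPortBase:
// API_PORT = base + index*3, MEMORY_PORT = base + index*3 + 1.
function portsForBot(index: number, apiPortBase: number = 10001): BotPorts {
  const start = apiPortBase + index * 3;
  return { apiPort: start, memoryPort: start + 1 };
}

// Builds the subset of per-bot environment variables whose values are
// fully specified in the commit message.
function envForBot(
  name: string,
  index: number,
  homeDir: string,
  apiPortBase: number = 10001,
): Record<string, string> {
  const { apiPort, memoryPort } = portsForBot(index, apiPortBase);
  return {
    BOT_NAME: name,
    API_PORT: String(apiPort),
    MEMORY_PORT: String(memoryPort),
    METABOT_DATA_DIR: `${homeDir}/.metabot/${name}/`,
  };
}
```

With the default base, the first bot (index 0) gets ports 10001 and 10002, and the seventh bot (index 6) gets 10019 and 10020. The commit message does not say what the third port in each stride is reserved for.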

Adds handling for an external workflow (e.g. an `mb switch <bot>` script or an `/api/bots/:name/switch` endpoint) that wants to restart a bot pointed at a new working directory while resuming a specific Claude session and surfacing the last few messages back to the user.

Mechanism:

1. The external tool writes ~/.metabot/<bot>/pending-switch.json before restarting the bot: { chatId?, workDir, sessionId, recentHistory: [{role, content}, ...] }
2. On startup, src/index.ts reads the file and either:
   - injects sessionId immediately if chatId is known + pushes a notice card listing the recent history, OR
   - if chatId is unknown, defers the notice into bridge.pendingSwitchNotice to be consumed by the next incoming message.
3. handleMessage() consumes pendingSwitchNotice on the first message after a switch, injecting the sessionId before the message goes through the normal command/query pipeline.
4. The file is unlinked after processing so the next restart is clean.

This is the missing half of a project-switch feature: ecosystem.config.cjs already supports per-bot data dirs, but until now Claude wouldn't resume the prior session after a workdir change, and the user got no indication that a switch had happened.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
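
Steps 2 and 3 of the mechanism can be sketched in memory, without touching the filesystem. The `SwitchState` class, its `sessions` map, and the plain-string notices are hypothetical stand-ins for the bridge, its session store, and the notice card; chat IDs are assumed to be strings.

```typescript
interface HistoryEntry {
  role: string;
  content: string;
}

// Shape of pending-switch.json as described in step 1.
interface PendingSwitch {
  chatId?: string;
  workDir: string;
  sessionId: string;
  recentHistory: HistoryEntry[];
}

class SwitchState {
  // chatId -> resumed sessionId (stand-in for the real session store)
  readonly sessions = new Map<string, string>();
  // Notices "sent" to users (stand-in for the notice card)
  readonly notices: string[] = [];
  pendingSwitchNotice: PendingSwitch | undefined;

  // Step 2: apply immediately when chatId is known, otherwise defer.
  applyOnStartup(pending: PendingSwitch): void {
    if (pending.chatId !== undefined) {
      this.inject(pending.chatId, pending);
    } else {
      this.pendingSwitchNotice = pending;
    }
  }

  // Step 3: the first message after a switch consumes the deferred notice
  // before the normal command/query pipeline would run.
  handleMessage(chatId: string): void {
    const pending = this.pendingSwitchNotice;
    if (pending !== undefined) {
      this.pendingSwitchNotice = undefined;
      this.inject(chatId, pending);
    }
  }

  private inject(chatId: string, pending: PendingSwitch): void {
    this.sessions.set(chatId, pending.sessionId);
    this.notices.push(
      `Switched to ${pending.workDir}; ${pending.recentHistory.length} recent message(s)`,
    );
  }
}
```

Clearing `pendingSwitchNotice` before injecting means a second message in the same process never re-applies the switch, which matches the "first message after a switch" wording in step 3.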

floodsung · 2026-05-09T03:43:30Z

Architecturally clean — BOT_NAME filter, pending-switch.json lifecycle, and per-bot port allocation all hold up under inspection. Two blockers and one design question before merge:

Blocker 1: Inherits the Windows tsx wrapper bug

The new ecosystem.config.cjs still spawns tsx via path.join(ROOT, 'node_modules', '.bin', 'tsx') — same broken pattern that #234 reports and that just got fixed in #245 (node --import tsx cross-platform invocation). If this PR lands as-is, it re-introduces the Windows EINVAL on every bot's interpreter. Please rebase onto main once #245 is merged and switch to interpreter: 'node', interpreter_args: '--import tsx' for each generated app entry.
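
A sketch of what a generated app entry could look like after that change. The `Pm2App` type and `appEntry` helper are hypothetical, and using the bot name as the PM2 app name is an assumption; only the `interpreter` and `interpreter_args` values come from this review.

```typescript
// Minimal subset of a PM2 app declaration used in this sketch.
interface Pm2App {
  name: string;
  script: string;
  interpreter: string;
  interpreter_args: string;
  env: Record<string, string>;
}

// Builds one app entry that runs the script through `node --import tsx`
// instead of the platform-specific node_modules/.bin/tsx wrapper.
function appEntry(botName: string, script: string): Pm2App {
  return {
    name: botName,
    script,
    interpreter: "node",
    interpreter_args: "--import tsx",
    env: { BOT_NAME: botName },
  };
}
```

Because the interpreter is plain `node`, the same entry should work on Windows and POSIX systems without resolving a wrapper path.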

Blocker 2: Should land after #243

PR #243 removes the SQLite registry. If #244 lands first, every per-bot process holds its own SQLite handle on the same sessions.db, and the WAL contention is exactly what #243 was meant to eliminate. Recommend merge order: #243 → #244.

Design question: default deployment model

This rewrites the default ecosystem.config.cjs to spawn one process per bot. That's a real change in deployment semantics — single-process is still possible (just don't use this file), but anyone who runs pm2 start ecosystem.config.cjs after the upgrade will silently get N processes instead of 1.

Is that the intended default, or should multi-process be opt-in (e.g. ship ecosystem.multi.config.cjs separately and keep the existing single-process file)? Either is defensible — single-process is simpler for first-time installs, multi-process is more robust for production. Need a call from @floodsung before this merges. The README and CLAUDE.md will also need an update to document the new mode.

Smaller things (non-blocking)

- The pending-switch.json lifecycle is fine for the documented flow, but a brief comment in src/index.ts explaining "if the process crashes between read and unlink, the file persists and is re-applied on next restart (idempotent)" would save the next reader 5 minutes.
- No tests added for the deferred-switch path. Worth one integration test that drops a pending-switch.json, starts a bot, sends a message, and asserts the switch was applied.
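
As a starting point for such a test, the read-then-unlink lifecycle could be isolated in a helper like the one below. This is a sketch under assumptions: `consumePendingSwitch` is a hypothetical function, not the code in src/index.ts, and a throwing `apply` callback stands in for a crash between read and unlink.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

interface PendingSwitchFile {
  chatId?: string;
  workDir: string;
  sessionId: string;
  recentHistory: { role: string; content: string }[];
}

// Reads pending-switch.json from dataDir, hands it to `apply`, then unlinks it.
// Returns false when no file exists. If `apply` throws, the unlink is never
// reached, so the file persists and is re-applied on the next call.
function consumePendingSwitch(
  dataDir: string,
  apply: (pending: PendingSwitchFile) => void,
): boolean {
  const file = path.join(dataDir, "pending-switch.json");
  if (!fs.existsSync(file)) return false;

  const pending = JSON.parse(fs.readFileSync(file, "utf8")) as PendingSwitchFile;
  apply(pending);
  fs.unlinkSync(file);
  return true;
}
```

A test can write the file into a temporary directory, call the helper once with a throwing callback and once with a recording callback, and assert that the switch is applied exactly once and the file is gone afterwards.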

uestney and others added 3 commits May 8, 2026 14:58

This was referenced May 9, 2026

fix(ecosystem): use 'node --import tsx' for cross-platform PM2 spawn (closes #234) #245

Merged

refactor(session): replace SQLite registry with file-backed impl #243

Open

floodsung mentioned this pull request May 9, 2026

feat: multi-session support: /sessions, /switch, /session commands #174

Open

4 tasks

hahhforest mentioned this pull request May 14, 2026

refactor(session): migrate SessionRegistry from SQLite to JSON file persistence #278

Open

3 tasks

uestney wants to merge 3 commits into xvirobotics:main from uestney:feat/multi-process-bot-architecture
