Tags: isair/jarvis
Tags
fix(mcp): keep MCP server sessions alive across tool calls (#381) * fix(mcp): keep MCP server sessions alive across tool calls Each MCP tool invocation previously spawned the server subprocess, ran the call, then tore it down via asyncio.run. For stateful servers that own external resources, e.g. chrome-devtools-mcp launching Chrome on first navigation, killing the server also killed its child processes. The user-visible symptom: Chrome opened on a navigate call and closed the moment the tool returned. Introduce a persistent MCP runtime: one background asyncio loop, one long-lived worker task per server holding the stdio session open and draining a call_tool queue. MCPClient.invoke_tool now routes through the runtime, and the daemon shuts the runtime down cleanly so child processes go with it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(mcp): address review feedback - resilience, idle-reaping, spec, tests Polish pass on the persistent MCP runtime in response to PR review: - Convert post-enqueue TimeoutError to _WorkerDeadError when the worker died after we put a request on the queue, so the runtime's retry path takes over instead of surfacing a misleading timeout - Wrap the private _WorkerDeadError as a public MCPServerSessionError at the MCPClient boundary - Replace asserts with explicit raises so invariants survive python -O - Quote the command in the login-shell `which` fallback with shlex - Cancel a stuck worker task on shutdown if the polite sentinel path stalls; surface previously-swallowed cleanup errors via debug_log - Comment the `BaseException` catches in _run (anyio task-group cancellation surfaces as BaseExceptionGroup) - Document the per-server serialisation invariant and timeout/retry semantics in the module docstring and invoke() docstring - Add opt-in `idle_timeout_sec` per server so stateless MCPs can free their subprocess after inactivity (chrome-devtools-mcp leaves it unset, which is the default) - Route MCPClient.list_tools through the persistent runtime so discovery and the first invocation share one session - Collapse the double cross-thread sync in invoke (use call_soon_threadsafe + put_nowait, single hop) - Rename test fixture to drop the leading underscore Tests added: - runtime retries exactly once after _WorkerDeadError - worker is replaced when server config changes - subprocess startup failure propagates rather than hanging - distinct servers each get their own worker - list_tools and the first invoke_tool share a session Also adds mcp_runtime.spec.md (registered in CLAUDE.md) and a brief README note about session persistence and the idle-timeout opt-in. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
fix(planner): normalise URL-shaped tool args before dispatch (#365) The planner's resolver was handing chrome-devtools__navigate_page values like `[youtube.com](http://youtube.com)` (markdown wrapper bleeding in from the small-model planner's training priors) and bare `youtube.com` (no scheme). Puppeteer's Page.navigate rejects both with "Cannot navigate to invalid URL", which the chat model paraphrased as "the protocol encountered an error". Add `_normalise_url_args` in `src/jarvis/reply/planner.py`, applied at both the fast-path and slow-path return sites of `resolve_next_tool_call`. It strips `[text](url)` markdown wrappers and prepends `https://` to scheme-less bare domains. Scoped to URL-shaped keys (`url`, `uri`, `href`, `link`, `address`, `target_url`, `page_url`, `location`) so unrelated string args (e.g. a `query='youtube.com tutorials'` step) stay literal. Also surface the resolved args JSON and tool errors on the planner direct-exec path so future "tool ran but did nothing" regressions are diagnosable from the log alone. Tests: 5 new cases in `TestUrlArgNormalisation` cover markdown stripping, scheme prepending on both paths, qualified-URL passthrough, and the negative case. Verified live: with the fix, the previously failing args produce a successful navigation. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
fix(planner+nutrition): direct-exec carryover bug + logMeal schema + … …review fixes (#328) * fix(planner): direct-exec skipped when prior-query tool carryover in messages Two related bugs caused tool calls to silently fail after any earlier query in the same session had used a tool (weather, web search, etc.): 1. **_tool_results_so_far baseline** — the direct-exec gate counted ALL tool_name messages in `messages`, including those carried over from prior queries via dialogue memory. If a previous query used one tool, `_tool_results_so_far` started at 1 ≥ len(plan_tool_steps), so the gate condition `0 ≤ N < len(steps)` was False and direct-exec silently skipped. The LLM then ran with no tool result context and produced an empty reply → "Sorry, I had trouble processing that." Fix: establish `_plan_steps_baseline` before the loop and subtract it from the tool_name count in both direct-exec and the LLM-path progress nudge, so only steps executed in the current reply are counted. 2. **logMeal schema / key mismatch** — the planner emits `logMeal meal='Big Mac'` but the schema declared `description` as the parameter key. The fast-path parser rejected `meal` as unknown and fell to the LLM resolver, which couldn't fill in all the required nutrition fields (calories_kcal, protein_g, etc.) so returned null → direct-exec skipped again. Fix: simplify the schema to a single optional `meal` parameter; the tool internally calls extract_and_log_meal with that text (or falls back to context.redacted_text). Fast-path now accepts `logMeal meal='...'` deterministically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(nutrition): address PR review — spec, dead code, fence, guard, readability - Add log_meal.spec.md documenting the single-parameter schema contract, extraction-input precedence, untrusted-data fence, and reply shape; register it in the CLAUDE.md spec registry. - Delete log_meal_from_args (dead code — removed from run() in the prior commit, confirmed unused codebase-wide). - Wrap original_text in <<<BEGIN/END UNTRUSTED USER TEXT>>> fence inside extract_and_log_meal, mirroring the web_search prompt-injection defence. - Add early-return guard in run() when both meal arg and redacted_text are empty — avoids a no-op LLM call and surfaces a friendly message. - Split meal_arg extraction into meal_text / redacted / extract_text intermediate variables for readability. - Add three new tests: fence presence, empty-both guard, and redacted_text=None safety. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(memory): dedupe graph appends across diary re-flushes (#282) * fix(memory): dedupe graph appends across diary re-flushes The daily conversation summary is cumulative, so every diary flush re-extracts the same facts and, without a check, appended them to the graph again. When the diary flushed twice in quick succession (e.g. a TTS echo triggered a second flush), the user saw the same "learned N new facts" block twice and the graph ended up with duplicate fact lines under the same node. Add `GraphMemoryStore.normalise_fact` and `node_contains_fact` so the data-store layer owns the equality primitive. Normalisation uses `unicodedata.NFKC` + `str.casefold` rather than `str.lower` so the dedupe is locale-safe — Turkish dotted-I (İ/i̇) and German ß/ss fold to the same key. Without this, the same fact written by different extractors (or by the same extractor on different days) could leak through as a duplicate. In `update_graph_from_dialogue`, after `find_best_node` resolves the target node within the fact's tagged branch, skip the append if the fact is already present on that node. The skipped fact is logged but not reported to the user as newly learned, and we deliberately do not call `touch_node`: a re-extraction from a cumulative summary isn't fresh reinforcement and shouldn't bump the access score. The check only covers the picker's chosen node, so a stale picker that routes a repeat fact to a different node within the same branch on a later flush can still leak a duplicate. Documented in code and spec as an accepted trade-off versus scanning the whole subtree per fact. Tests cover the store-level primitive (exact match, case/whitespace folding, Turkish İ, German ß, substring non-match, empty inputs) and the end-to-end orchestrator path (dedupe across flushes; locale-safe folding for Turkish and German). * test(memory): cover dedupe on child node after auto-split + clarify step labels Address second review pass: - Add test_dedupe_on_child_after_split: pre-populate a child under world, force the picker to descend, re-extract the same fact, assert no dupe. - Renumber inline step comments in update_graph_from_dialogue to align with graph.spec.md (1 Extract, 2 Place, 3 Dedupe, 4 Append, 5 Split). - Expand normalise_fact docstring to note that the whitespace regex also collapses embedded newlines on the candidate side. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
feat(listening): friendlier startup banner with varied example queries The single "Try: How's the weather, Jarvis?" line undersold the assistant's range. Replace it with a short menu of distinct example queries that touch different capabilities (weather, nutrition, introspection, personal knowledge), and drop the 👍/👎 contrast framing for the small-model and Chrome MCP tips in favour of a single positive example each. The small-model tip now sits above its more involved example and warns users to "spell out the steps" because gemma4:e2b collapses vague compound requests into a single search. The Chrome MCP tip keeps the "name the URL" guidance with one concrete example instead of contrasting pairs. Spec updated to reflect the new banner shape. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fix(output): stop thinking tune without cross-thread abort race (#286) On macOS, calling ``stream.abort()`` from the caller's thread and ``stream.close()`` from the tune thread races inside PortAudio/CoreAudio: the AudioObject is torn down twice and stderr spams ``||PaMacCore (AUHAL)|| Error … err=''!obj''`` on every reply. Fix: only the tune thread touches the stream. ``stop_tune()`` sets the stop event and joins; the tune thread's finally block calls ``stream.close()``, which discards pending buffers cleanly — no abort needed. Drops the now-dead ``_stream_lock`` / ``_stream`` fields. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
feat(output): pluck-style thinking tune with phase-locked listening r… …ings (#285) Follow-up to #280 after more live iteration: Sound - Replace the continuous ambient pad with a short click-and-decay pluck of the same A3/C#4/E4 triad. 2s cycle per loop (1s tone + 1s silent rest), 5 cycles per 10s buffer. - Fast 8ms linear attack + ~1s exponential decay gives the pluck its slight "click" feel; pure sines with ±1 Hz unison detune keep the techy triad character from #280 but let the ear rest between pulses so long thinking runs aren't fatiguing. - Volume bumped to 0.38 since each pulse is now brief. Listening rings - Rewire the bell-echo ring timing off frame counting onto wall time. Frame counting drifts against the 44.1 kHz audio clock so rings slowly wander out of phase with the pluck; anchoring spawns to monotonic() elapsed-since-LISTENING keeps them locked. - First ring now spawns at elapsed=0 (with the first audible click), not after a 60-frame warm-up, so the animation lands on the first beep instead of the second. - Transition detection lives in the state-polling block in _animate (_on_state_changed is bypassed by that polling path, which is why the earlier signal-based hook never fired).
feat(listening): tiny-model tip + interest-flavoured query eval (#276) * feat(listening): warn on tiny model and suggest direct phrasing When the configured chat model is gemma4:e2b or gemma4:e4b, print a disclaimer after the "Listening!" line advising users to keep queries direct and clear, with a good/bad example. Small models struggle with indirect, multi-step phrasing, so nudging the user up-front yields better first-time results. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(listening): tie tiny-model examples to same task Show that the tiny model can still handle complex flows when the user spells out the steps, rather than implying it just can't do complex things. Good/bad examples now target the same underlying task. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * style(listening): trim tiny-model good example for readability Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(listening): use detect_model_size for tiny-model disclaimer Replace hardcoded gemma4:e2b/e4b check with detect_model_size from model_variants so the disclaimer tracks the canonical SMALL classifier. Keeps behaviour consistent when supported models change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * style(listening): move wake word to end of good example Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(evals): recall-then-search field regression case Captures the 2026-04-24 gemma4:e2b field failure: user said "Recall my interests and search the web for news on them, Jarvis." The intent judge paraphrased the recall step away, enrichment returned unrelated diary entries, and the model bounced the question back instead of grounding on seeded interests. New live eval seeds AI/space interests via mocked enrichment, sends the explicit two-step imperative, and asserts the model either (a) searches with an interest-flavoured query or (b) names a seeded interest in the reply, and never bounces the question back. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(evals): recall-then-search must hard-fail, not xfail on gemma4 The bar is that we make the small model handle this via planner or prompt work, not that we accept the failure. Remove the gemma4 xfail fallbacks so the eval fails loudly until fixed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * style(listening): 3-step weather/activity example for tiny-model tip Swap the recall-interests example (which depends on pre-seeded memory) for a weather/activity chain that only needs built-in tools. Makes the spell-out-the-steps point with three concrete steps vs. a short sentence a smart model could have synthesised. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(evals): parametrise interest-flavoured phrasings + strengthen enrichment example User pointed out that 'recall my interests', 'of interest to me', 'that would interest me', and 'interesting for me' are semantically the same intent: the reply should be related to the user's stored interests. Parametrise the live eval across all four phrasings so enrichment + planner are validated as a unit, not just the one phrasing from the field log. Add a multi-phrasing example to the enrichment extractor prompt so the small model picks 'interests' keywords for all variants. All four variants pass on gemma4:e2b end-to-end: enrichment surfaces interests, planner weaves them into per-interest search steps, reply names the seeded interests and never bounces back. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * style(listening): weather-and-events example for tiny-model tip Swap to the weather + events example the user suggested. Planner trace shows the short phrasing produces 1 step (events only), the spelled-out phrasing produces 3 (weather → events → outdoor activities), which is exactly the point of the tip. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
perf(config): default tool-result digest off; default diary recall to… … 5 (#271) The tool-result digest was auto-on for small models, but in practice the extra LLM pass added per-tool-call latency and frequently dropped salient numbers/names the main model would otherwise have grounded on. Default to ``false`` so small setups pay neither cost by default; callers who still want the digest can set ``true`` to force on or ``null`` to opt back into the auto-on-for-SMALL behaviour. Also lower ``memory_enrichment_max_results`` default from 10 to 5: the enrichment snippets land in the system prompt on every turn, so the marginal recall value of entries 6–10 rarely pays for their prompt-token cost on small models.
fix(evaluator): prevent direct-exec infinite loop on small models (#267) * fix(evaluator): terminate loop when direct-exec would duplicate a recent tool call Field log showed gemma4:e2b looping: the chat model replied in prose after direct-exec, the evaluator kept returning the same tool_call (sometimes with a case-flipped arg key like 'url' vs 'URL'), and the engine re-ran the tool every turn until agentic_max_turns. The engine now normalises argument keys to lower-case and compares against recent_tool_signatures before stashing pending_tool_call. On a duplicate, the loop terminates with the latest candidate reply instead of re-executing. Added regression test covering the case-flipped-arg scenario. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(evaluator): feed invoked-tools history so judge stops re-requesting completed calls Root cause of the field loop (chrome-devtools__navigate_page fired 8x on gemma4:e2b): the evaluator only saw the current prose turn and the static toolbox, never the tool results in the conversation history. After a successful direct-exec, the chat model replied in prose, the evaluator saw the prose + the tool in the toolbox, and kept returning "continue" with the same tool_call because it had no way to know the tool had already run. Fix: thread an `invoked_tools` list of (name, args, result) through `evaluate_turn` and into the prompt as a "TOOLS ALREADY INVOKED THIS REPLY" block. Rubric updated to tell the judge to mark terminal when a tool covering the user's action has already run successfully. The duplicate guard from the previous commit stays as defence-in-depth and is documented as such — the invoked-tools context is now the primary mechanism. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
PreviousNext