Skip to content

Tags: isair/jarvis

Tags

v1.34.1

Toggle v1.34.1's commit message
fix(mcp): keep MCP server sessions alive across tool calls (#381)

* fix(mcp): keep MCP server sessions alive across tool calls

Each MCP tool invocation previously spawned the server subprocess, ran
the call, then tore it down via asyncio.run. For stateful servers that
own external resources, e.g. chrome-devtools-mcp launching Chrome on
first navigation, killing the server also killed its child processes.
The user-visible symptom: Chrome opened on a navigate call and closed
the moment the tool returned.

Introduce a persistent MCP runtime: one background asyncio loop, one
long-lived worker task per server holding the stdio session open and
draining a call_tool queue. MCPClient.invoke_tool now routes through
the runtime, and the daemon shuts the runtime down cleanly so child
processes go with it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(mcp): address review feedback - resilience, idle-reaping, spec, tests

Polish pass on the persistent MCP runtime in response to PR review:

- Convert post-enqueue TimeoutError to _WorkerDeadError when the worker
  died after we put a request on the queue, so the runtime's retry
  path takes over instead of surfacing a misleading timeout
- Wrap the private _WorkerDeadError as a public MCPServerSessionError
  at the MCPClient boundary
- Replace asserts with explicit raises so invariants survive python -O
- Quote the command in the login-shell `which` fallback with shlex
- Cancel a stuck worker task on shutdown if the polite sentinel path
  stalls; surface previously-swallowed cleanup errors via debug_log
- Comment the `BaseException` catches in _run (anyio task-group
  cancellation surfaces as BaseExceptionGroup)
- Document the per-server serialisation invariant and timeout/retry
  semantics in the module docstring and invoke() docstring
- Add opt-in `idle_timeout_sec` per server so stateless MCPs can free
  their subprocess after inactivity (chrome-devtools-mcp leaves it
  unset, which is the default)
- Route MCPClient.list_tools through the persistent runtime so
  discovery and the first invocation share one session
- Collapse the double cross-thread sync in invoke (use
  call_soon_threadsafe + put_nowait, single hop)
- Rename test fixture to drop the leading underscore

Tests added:
- runtime retries exactly once after _WorkerDeadError
- worker is replaced when server config changes
- subprocess startup failure propagates rather than hanging
- distinct servers each get their own worker
- list_tools and the first invoke_tool share a session

Also adds mcp_runtime.spec.md (registered in CLAUDE.md) and a brief
README note about session persistence and the idle-timeout opt-in.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

v1.34.0

Toggle v1.34.0's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
fix(planner): normalise URL-shaped tool args before dispatch (#365)

The planner's resolver was handing chrome-devtools__navigate_page values
like `[youtube.com](http://youtube.com)` (markdown wrapper bleeding in
from the small-model planner's training priors) and bare `youtube.com`
(no scheme). Puppeteer's Page.navigate rejects both with "Cannot
navigate to invalid URL", which the chat model paraphrased as "the
protocol encountered an error".

Add `_normalise_url_args` in `src/jarvis/reply/planner.py`, applied at
both the fast-path and slow-path return sites of
`resolve_next_tool_call`. It strips `[text](url)` markdown wrappers and
prepends `https://` to scheme-less bare domains. Scoped to URL-shaped
keys (`url`, `uri`, `href`, `link`, `address`, `target_url`, `page_url`,
`location`) so unrelated string args (e.g. a `query='youtube.com
tutorials'` step) stay literal.

Also surface the resolved args JSON and tool errors on the planner
direct-exec path so future "tool ran but did nothing" regressions are
diagnosable from the log alone.

Tests: 5 new cases in `TestUrlArgNormalisation` cover markdown
stripping, scheme prepending on both paths, qualified-URL passthrough,
and the negative case. Verified live: with the fix, the previously
failing args produce a successful navigation.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

v1.33.0

Toggle v1.33.0's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
fix(planner+nutrition): direct-exec carryover bug + logMeal schema + …

…review fixes (#328)

* fix(planner): direct-exec skipped when prior-query tool carryover in messages

Two related bugs caused tool calls to silently fail after any earlier query
in the same session had used a tool (weather, web search, etc.):

1. **_tool_results_so_far baseline** — the direct-exec gate counted ALL
   tool_name messages in `messages`, including those carried over from
   prior queries via dialogue memory. If a previous query used one tool,
   `_tool_results_so_far` started at 1 ≥ len(plan_tool_steps), so the
   gate condition `0 ≤ N < len(steps)` was False and direct-exec silently
   skipped. The LLM then ran with no tool result context and produced an
   empty reply → "Sorry, I had trouble processing that."

   Fix: establish `_plan_steps_baseline` before the loop and subtract it
   from the tool_name count in both direct-exec and the LLM-path progress
   nudge, so only steps executed in the current reply are counted.

2. **logMeal schema / key mismatch** — the planner emits
   `logMeal meal='Big Mac'` but the schema declared `description` as the
   parameter key. The fast-path parser rejected `meal` as unknown and fell
   to the LLM resolver, which couldn't fill in all the required nutrition
   fields (calories_kcal, protein_g, etc.) so returned null → direct-exec
   skipped again.

   Fix: simplify the schema to a single optional `meal` parameter; the
   tool internally calls extract_and_log_meal with that text (or falls back
   to context.redacted_text). Fast-path now accepts `logMeal meal='...'`
   deterministically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(nutrition): address PR review — spec, dead code, fence, guard, readability

- Add log_meal.spec.md documenting the single-parameter schema contract,
  extraction-input precedence, untrusted-data fence, and reply shape; register
  it in the CLAUDE.md spec registry.
- Delete log_meal_from_args (dead code — removed from run() in the prior
  commit, confirmed unused codebase-wide).
- Wrap original_text in <<<BEGIN/END UNTRUSTED USER TEXT>>> fence inside
  extract_and_log_meal, mirroring the web_search prompt-injection defence.
- Add early-return guard in run() when both meal arg and redacted_text are
  empty — avoids a no-op LLM call and surfaces a friendly message.
- Split meal_arg extraction into meal_text / redacted / extract_text
  intermediate variables for readability.
- Add three new tests: fence presence, empty-both guard, and
  redacted_text=None safety.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

v1.32.1

Toggle v1.32.1's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
fix(memory): dedupe graph appends across diary re-flushes (#282)

* fix(memory): dedupe graph appends across diary re-flushes

The daily conversation summary is cumulative, so every diary flush
re-extracts the same facts and, without a check, appended them to the
graph again. When the diary flushed twice in quick succession (e.g. a
TTS echo triggered a second flush), the user saw the same "learned N
new facts" block twice and the graph ended up with duplicate fact
lines under the same node.

Add `GraphMemoryStore.normalise_fact` and `node_contains_fact` so the
data-store layer owns the equality primitive. Normalisation uses
`unicodedata.NFKC` + `str.casefold` rather than `str.lower` so the
dedupe is locale-safe — Turkish dotted-I (İ/i̇) and German ß/ss fold
to the same key. Without this, the same fact written by different
extractors (or by the same extractor on different days) could leak
through as a duplicate.

In `update_graph_from_dialogue`, after `find_best_node` resolves the
target node within the fact's tagged branch, skip the append if the
fact is already present on that node. The skipped fact is logged but
not reported to the user as newly learned, and we deliberately do not
call `touch_node`: a re-extraction from a cumulative summary isn't
fresh reinforcement and shouldn't bump the access score.

The check only covers the picker's chosen node, so a stale picker that
routes a repeat fact to a different node within the same branch on a
later flush can still leak a duplicate. Documented in code and spec
as an accepted trade-off versus scanning the whole subtree per fact.

Tests cover the store-level primitive (exact match, case/whitespace
folding, Turkish İ, German ß, substring non-match, empty inputs) and
the end-to-end orchestrator path (dedupe across flushes; locale-safe
folding for Turkish and German).

* test(memory): cover dedupe on child node after auto-split + clarify step labels

Address second review pass:
- Add test_dedupe_on_child_after_split: pre-populate a child under world,
  force the picker to descend, re-extract the same fact, assert no dupe.
- Renumber inline step comments in update_graph_from_dialogue to align
  with graph.spec.md (1 Extract, 2 Place, 3 Dedupe, 4 Append, 5 Split).
- Expand normalise_fact docstring to note that the whitespace regex also
  collapses embedded newlines on the candidate side.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

v1.32.0

Toggle v1.32.0's commit message

Verified

This commit was signed with the committer’s verified signature.
isair Baris Sencan
feat(listening): friendlier startup banner with varied example queries

The single "Try: How's the weather, Jarvis?" line undersold the
assistant's range. Replace it with a short menu of distinct example
queries that touch different capabilities (weather, nutrition,
introspection, personal knowledge), and drop the 👍/👎 contrast framing
for the small-model and Chrome MCP tips in favour of a single positive
example each.

The small-model tip now sits above its more involved example and warns
users to "spell out the steps" because gemma4:e2b collapses vague
compound requests into a single search. The Chrome MCP tip keeps the
"name the URL" guidance with one concrete example instead of contrasting
pairs.

Spec updated to reflect the new banner shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

v1.31.1

Toggle v1.31.1's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
fix(output): stop thinking tune without cross-thread abort race (#286)

On macOS, calling ``stream.abort()`` from the caller's thread and
``stream.close()`` from the tune thread races inside
PortAudio/CoreAudio: the AudioObject is torn down twice and stderr
spams ``||PaMacCore (AUHAL)|| Error … err=''!obj''`` on every reply.

Fix: only the tune thread touches the stream. ``stop_tune()`` sets the
stop event and joins; the tune thread's finally block calls
``stream.close()``, which discards pending buffers cleanly — no abort
needed. Drops the now-dead ``_stream_lock`` / ``_stream`` fields.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

v1.31.0

Toggle v1.31.0's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
feat(output): pluck-style thinking tune with phase-locked listening r…

…ings (#285)

Follow-up to #280 after more live iteration:

Sound
- Replace the continuous ambient pad with a short click-and-decay pluck
  of the same A3/C#4/E4 triad. 2s cycle per loop (1s tone + 1s silent
  rest), 5 cycles per 10s buffer.
- Fast 8ms linear attack + ~1s exponential decay gives the pluck its
  slight "click" feel; pure sines with ±1 Hz unison detune keep the
  techy triad character from #280 but let the ear rest between pulses
  so long thinking runs aren't fatiguing.
- Volume bumped to 0.38 since each pulse is now brief.

Listening rings
- Rewire the bell-echo ring timing off frame counting onto wall time.
  Frame counting drifts against the 44.1 kHz audio clock so rings
  slowly wander out of phase with the pluck; anchoring spawns to
  monotonic() elapsed-since-LISTENING keeps them locked.
- First ring now spawns at elapsed=0 (with the first audible click),
  not after a 60-frame warm-up, so the animation lands on the first
  beep instead of the second.
- Transition detection lives in the state-polling block in _animate
  (_on_state_changed is bypassed by that polling path, which is why
  the earlier signal-based hook never fired).

v1.30.0

Toggle v1.30.0's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
feat(listening): tiny-model tip + interest-flavoured query eval (#276)

* feat(listening): warn on tiny model and suggest direct phrasing

When the configured chat model is gemma4:e2b or gemma4:e4b, print a
disclaimer after the "Listening!" line advising users to keep queries
direct and clear, with a good/bad example. Small models struggle with
indirect, multi-step phrasing, so nudging the user up-front yields
better first-time results.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(listening): tie tiny-model examples to same task

Show that the tiny model can still handle complex flows when the user
spells out the steps, rather than implying it just can't do complex
things. Good/bad examples now target the same underlying task.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* style(listening): trim tiny-model good example for readability

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(listening): use detect_model_size for tiny-model disclaimer

Replace hardcoded gemma4:e2b/e4b check with detect_model_size from
model_variants so the disclaimer tracks the canonical SMALL classifier.
Keeps behaviour consistent when supported models change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* style(listening): move wake word to end of good example

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(evals): recall-then-search field regression case

Captures the 2026-04-24 gemma4:e2b field failure: user said "Recall my
interests and search the web for news on them, Jarvis." The intent
judge paraphrased the recall step away, enrichment returned unrelated
diary entries, and the model bounced the question back instead of
grounding on seeded interests.

New live eval seeds AI/space interests via mocked enrichment, sends
the explicit two-step imperative, and asserts the model either (a)
searches with an interest-flavoured query or (b) names a seeded
interest in the reply, and never bounces the question back.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(evals): recall-then-search must hard-fail, not xfail on gemma4

The bar is that we make the small model handle this via planner or
prompt work, not that we accept the failure. Remove the gemma4 xfail
fallbacks so the eval fails loudly until fixed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* style(listening): 3-step weather/activity example for tiny-model tip

Swap the recall-interests example (which depends on pre-seeded memory)
for a weather/activity chain that only needs built-in tools. Makes the
spell-out-the-steps point with three concrete steps vs. a short
sentence a smart model could have synthesised.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(evals): parametrise interest-flavoured phrasings + strengthen enrichment example

User pointed out that 'recall my interests', 'of interest to me', 'that
would interest me', and 'interesting for me' are semantically the same
intent: the reply should be related to the user's stored interests.

Parametrise the live eval across all four phrasings so enrichment +
planner are validated as a unit, not just the one phrasing from the
field log. Add a multi-phrasing example to the enrichment extractor
prompt so the small model picks 'interests' keywords for all variants.

All four variants pass on gemma4:e2b end-to-end: enrichment surfaces
interests, planner weaves them into per-interest search steps, reply
names the seeded interests and never bounces back.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* style(listening): weather-and-events example for tiny-model tip

Swap to the weather + events example the user suggested. Planner trace
shows the short phrasing produces 1 step (events only), the spelled-out
phrasing produces 3 (weather → events → outdoor activities), which is
exactly the point of the tip.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

v1.29.1

Toggle v1.29.1's commit message
perf(config): default tool-result digest off; default diary recall to…

… 5 (#271)

The tool-result digest was auto-on for small models, but in practice the
extra LLM pass added per-tool-call latency and frequently dropped salient
numbers/names the main model would otherwise have grounded on. Default
to ``false`` so small setups pay neither cost by default; callers who
still want the digest can set ``true`` to force on or ``null`` to opt
back into the auto-on-for-SMALL behaviour.

Also lower ``memory_enrichment_max_results`` default from 10 to 5: the
enrichment snippets land in the system prompt on every turn, so the
marginal recall value of entries 6–10 rarely pays for their prompt-token
cost on small models.

v1.29.0

Toggle v1.29.0's commit message

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
fix(evaluator): prevent direct-exec infinite loop on small models (#267)

* fix(evaluator): terminate loop when direct-exec would duplicate a recent tool call

Field log showed gemma4:e2b looping: the chat model replied in prose
after direct-exec, the evaluator kept returning the same tool_call
(sometimes with a case-flipped arg key like 'url' vs 'URL'), and the
engine re-ran the tool every turn until agentic_max_turns.

The engine now normalises argument keys to lower-case and compares
against recent_tool_signatures before stashing pending_tool_call. On
a duplicate, the loop terminates with the latest candidate reply
instead of re-executing. Added regression test covering the
case-flipped-arg scenario.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(evaluator): feed invoked-tools history so judge stops re-requesting completed calls

Root cause of the field loop (chrome-devtools__navigate_page fired 8x on
gemma4:e2b): the evaluator only saw the current prose turn and the static
toolbox, never the tool results in the conversation history. After a
successful direct-exec, the chat model replied in prose, the evaluator
saw the prose + the tool in the toolbox, and kept returning "continue"
with the same tool_call because it had no way to know the tool had
already run.

Fix: thread an `invoked_tools` list of (name, args, result) through
`evaluate_turn` and into the prompt as a "TOOLS ALREADY INVOKED THIS
REPLY" block. Rubric updated to tell the judge to mark terminal when a
tool covering the user's action has already run successfully.

The duplicate guard from the previous commit stays as defence-in-depth
and is documented as such — the invoked-tools context is now the
primary mechanism.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>