
Phase 4: §2 multi-client socket daemon + §6 ARM64e xref coverage extension + CI fix #21

Merged: zachgenius merged 29 commits into master from release/phase-4 on May 16, 2026

@zachgenius

Owner

Bundled merge of two reviewed phase-4 branches plus a Linux portability fix that closes the CI break from PR #20.

What landed

§2 phase 2 — socket daemon multi-client + supporting work

Builds on §2 phase 1's single-client persistent socket. Adds:

  • Per-connection NotificationSink routing via shared_ptr subscriber set (closes a TSan-confirmed use-after-free in NonStopRuntime::emit_stopped_ that the opus reviewer demonstrated under listener-vs-disconnect race scheduling).
  • Multi-client concurrent connections — one thread per accepted connection, shared Dispatcher instance, serialised via a new Dispatcher::dispatch_mu_ recursive_mutex (recursive because session.replay re-enters dispatch on the same thread).
  • Auto-spawn from the client: ldb --socket PATH on ECONNREFUSED/ENOENT forks+execs ldbd --listen unix:PATH and retries. $LDB_LDBD_SPAWN accepts an explicit binary path; it is validated via a --version probe to fail fast on bad paths.
  • daemon.shutdown RPC + --listen-idle-timeout N + signal-driven accept-loop wake-up via self-pipe pattern. Workers gate on g_shutdown so a misbehaving peer can't keep the daemon alive after daemon.shutdown.
  • SO_SNDTIMEO on accepted fds (60s) so a slow-reader peer doesn't head-of-line block the listener thread.
  • Cosmetic: atomic stderr lines for auto-spawn collisions, O_NOFOLLOW lockfile, smoke-test docstring honesty.

§6 phase 4 — ARM64e xref coverage extension

Builds on §6 phases 1-3's chained-fixup parser + ADRP-pair resolver. Adds:

  • Conditional-branch boundary: b.cond / cbz / cbnz / tbz / tbnz. Cross-function targets are recorded in function_starts; the fall-through path preserves register state per spec.
  • Architectural shift: clobber-by-default destination registers — closes CSEL/CSET/CSINC/CSINV/CSNEG (common compiler "pick between two strings" idiom) AND LDP/LDPSW/LDXR/LDAR/LDXP/LDAXR (prologue/epilogue patterns). Replaces the previous whitelist of clobber-source mnemonics with parse-destination-and-clear-by-default; propagation is the explicit allowlist now.
  • FAT Mach-O triple-aware slice selection — matches against SBTarget's triple before falling back to phase-3's arm64e > arm64 preference. Triple match wins even when the slice has no chained fixups (closes the silent wrong-slice fallback the opus reviewer demonstrated).
  • Stripped-binary function_starts backstop — records B/BR targets so gate 1 catches function boundaries when LLDB returns empty function_name_at on both sides.
  • Pre/post-indexed LDR writeback — clears the base register after the load to prevent false matches through the now-mutated address.
  • STR / STUR / STRH / STRB / STP / LDUR as xref consumers — closes the false-negative class for "what writes to this global?".
  • PC-relative literal-load provenance — diagnostic counter for the loads phase-4 still can't resolve.
  • MOV from XZR / WZR explicit — replaces the prefix-character heuristic with explicit token matching.
  • BindInfo schema in ChainedFixupMap — phase 4 ships the type; phase 5 will populate via imports-table walk.
  • provenance.warnings field plumbed through xref.address AND string.xref so an agent can see when the heuristic conservatively skipped a load.
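
The clobber-by-default shift in the list above can be pictured with a small Python sketch (the resolver itself is C++). Everything here is hypothetical: the `step` and `canon` helpers, the `tracked` map, and the three mnemonic sets are illustrative only. The real coverage is also narrower than "every instruction", as the deferred phase-5 audit notes.

```python
# Hypothetical mini-scanner; the mnemonic sets are illustrative, not the project's.
PROPAGATE = {"mov"}                              # explicit allowlist: copies a tracked value
PAIR_DEST = {"ldp", "ldpsw", "ldxp", "ldaxp"}    # write two destination registers
NO_DEST = {"str", "stur", "strh", "strb", "stp", "cmp", "b", "ret"}


def canon(reg):
    """Map wN/xN to one key so a 32-bit write also clobbers the 64-bit view."""
    if reg[:1] in ("w", "x") and reg[1:].isdigit():
        return "x" + reg[1:]
    return reg


def step(tracked, mnemonic, operands):
    """Update `tracked` (register -> ADRP page) for one decoded instruction."""
    m = mnemonic.lower()
    if m in NO_DEST or not operands:
        return
    if m == "adrp" and len(operands) >= 2:
        tracked[canon(operands[0])] = operands[1]
        return
    if m in PROPAGATE and len(operands) >= 2 and canon(operands[1]) in tracked:
        tracked[canon(operands[0])] = tracked[canon(operands[1])]
        return
    # Default: whatever this instruction writes is no longer a known page.
    ndest = 2 if m in PAIR_DEST else 1
    for reg in operands[:ndest]:
        tracked.pop(canon(reg), None)
```

With this shape, a `csel x0, x1, x3, eq` after `adrp x0, ...` drops `x0` without `csel` ever being listed anywhere, which is the point of inverting the list.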

CI portability fix (final commit)

Two issues found by master's post-PR-#20 CI run:

  1. getpeereid() is BSD/macOS-only; glibc and musl don't ship it. Wrapped in a #if defined(__linux__) / #else branch — Linux uses getsockopt(SO_PEERCRED) returning struct ucred, BSD keeps getpeereid.
  2. gcc's -Wunused-result (treated as error in the warning-clean build) wasn't silenced by (void) casts on ::ftruncate/::pwrite. Replaced with if (call() != 0) {} idioms.

Constituent commits

  • release/phase-4 itself is 4 commits ahead of master:
    • e6c8e3c Merge fix/socket-daemon-phase2 (7 commits underneath)
    • 2c1ad49 Merge fix/chained-fixups-phase4 (12 commits underneath)
    • 81d2b97 ci(daemon): Linux portability fixes
  • Each constituent branch went through implementation agent → opus reviewer (xhigh effort) → cleanup agent applying every reviewer-flagged blocker + nit. Both reviewers built adversarial test binaries; the §6 review caught CSEL + LDP destination clobber as a phase-4-introduced regression class that the cleanup branch fixed via the architectural shift to clobber-by-default.

Test plan

  • ctest --test-dir build --output-on-failure on the merged release tip → 98/98 PASS on Darwin-arm64 (189s)
  • Build warning-clean under macOS Apple Clang + -Wall -Wextra -Wpedantic -Wconversion -Wsign-conversion -Wshadow -Wnon-virtual-dtor -Wold-style-cast -Wcast-align -Wunused -Woverloaded-virtual -Wnull-dereference -Wdouble-promotion -Wformat=2 -Wmisleading-indentation
  • CI Linux paths traced through manually for the SO_PEERCRED branch; standard kernel API since 2.6.17
  • Linux CI (verify on merge) — predicted: green. Token-budget Linux baseline drift may need regen if the per-platform total moves > 10% from tests/baselines/agent_workflow_tokens.json's Linux-x86_64 entry. If so, a one-line follow-up with LDB_UPDATE_BASELINE=1.

Deferred to phase 5 (documented in docs/35-field-report-followups.md)

§2 phase 3:

  • Server-side target_id-aware notification routing (today's behaviour is broadcast-to-all subscribers).
  • True per-connection dispatch parallelism (current dispatch_mu_ is the bottleneck for non-target-scoped work).
  • Workers list reaping.
  • SBAPI cancellation (LLDB ABI doesn't currently permit it).

§6 phase 5:

  • Full imports-table walk populating ChainedFixupMap::binds (schema landed).
  • function_starts backward boundary detection.
  • Complete clobber-by-default audit across every ARM64 instruction (CSEL/LDP family covered; MADD/MSUB/UMULL/SMULL/EOR/ORR/AND/ASR-imm/LSL-imm/EXTR/BFI/BFM/UBFX/SBFX/FMOV remain whitelist-only).
  • Indirect-dispatch entry points (vtables, jump tables, ObjC dispatch).
  • Real iOS .ipa CI smoke against dyld_info --fixups output.

🤖 Generated with Claude Code

zachgenius and others added 29 commits May 16, 2026 18:56
… item 5)

Move MovSrcKind + classify_mov_source from lldb_backend.cpp's anonymous
namespace to xref_arm64_parsers so unit tests can pin the alias-name-
first match order without a live LLDB target. The prior implementation
worked by accident — `lr` / `xzr` / `wzr` happened to land in the right
switch arm via fall-through, but a future refactor that touched the
prefix-check could silently regress.

Phase 4 item 5 from docs/35-field-report-followups.md §3: token-compare
against the alias spellings BEFORE any prefix heuristic. New unit tests
pin classify_mov_source's behaviour for the zero (xzr/wzr/#0), stack
pointer (sp/wsp), link register (lr), xN/wN width-distinguishing, and
malformed-input arms.

No behaviour change against existing fixtures — the lifted function is
byte-identical to the previous in-place implementation.
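
A rough Python rendering of the alias-first match order, for readers without the C++ at hand. The returned kind names (`"zero"`, `"stack_pointer"`, `"reg64"`, `"reg32"`, `"malformed"`) and the tuple shape are assumptions; the source names only the `MovSrcKind` type, not its enumerators.

```python
import re

# xN / wN with N in 0..30; x31 is not a general-purpose register name.
_REG = re.compile(r"([xw])([0-9]|[12][0-9]|30)")


def classify_mov_source(token):
    """Classify a MOV source operand; alias spellings are compared first."""
    t = token.strip().lower()
    # 1. Exact alias tokens, before any prefix-based reasoning.
    if t in ("xzr", "wzr", "#0"):
        return ("zero", None)
    if t in ("sp", "wsp"):
        return ("stack_pointer", None)
    if t == "lr":
        return ("reg64", 30)          # lr is the alias for x30
    # 2. Only now fall back to the xN / wN shape.
    m = _REG.fullmatch(t)
    if m:
        kind = "reg64" if m.group(1) == "x" else "reg32"
        return (kind, int(m.group(2)))
    return ("malformed", None)
```

Because `xzr` and `wzr` are matched as whole tokens in step 1, they can never be mistaken for an `x`/`w` register by the prefix check in step 2.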

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase-1 socket mode points a single NotificationSink at the dispatcher
on accept() and clears it on disconnect. That is race-free only because
phase 1 is strictly one connection at a time — no other sink is alive
to receive a notification belonging to a different connection.

Phase 2 needs to accept multiple concurrent connections, which breaks
the single-sink design: connection A's stop event would either route
to connection B's OutputChannel (after B's accept re-pointed the sink),
or vanish (after A's disconnect cleared it but before B's accept).
Either outcome corrupts the JSON-RPC stream that every client sees.

NonStopRuntime now owns a subscriber SET, guarded by `sinks_mu_`. Each
connection that wants notifications calls `add_notification_sink` on
accept and `remove_notification_sink` on disconnect. emit_stopped_
snapshots the subscriber list under a shared lock, drops the lock,
then fans the notification out — so a slow sink (one whose
OutputChannel's mutex is contended) doesn't stall the other
subscribers' deliveries.
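
The snapshot-then-fan-out pattern described above looks roughly like this in Python. This is a sketch of the pattern, not the daemon's C++: the `Runtime` class is hypothetical, sinks are modelled as plain callables, and a plain `threading.Lock` stands in for the shared lock because the standard library has no reader/writer mutex.

```python
import threading


class Runtime:
    """Hypothetical stand-in for NonStopRuntime's subscriber handling."""

    def __init__(self):
        self._sinks_mu = threading.Lock()
        self._sinks = []

    def add_notification_sink(self, sink):
        with self._sinks_mu:
            if sink not in self._sinks:
                self._sinks.append(sink)

    def remove_notification_sink(self, sink):
        with self._sinks_mu:
            if sink in self._sinks:
                self._sinks.remove(sink)

    def set_notification_sink(self, sink):
        # Back-compat shim: replace the whole subscriber set.
        with self._sinks_mu:
            self._sinks = [] if sink is None else [sink]

    def emit_stopped(self, event):
        # Copy the list under the lock, then deliver with the lock released,
        # so a slow sink cannot block add/remove or other emitters.
        with self._sinks_mu:
            snapshot = list(self._sinks)
        for sink in snapshot:
            sink(event)
```

Since delivery happens outside the lock, a sink may even call `remove_notification_sink` on itself from inside its callback without deadlocking.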

`set_notification_sink(sink)` is kept as a back-compat shim with new
"replace the entire subscriber set with this one" semantics. Stdio
mode (main.cpp) still calls it once at startup and gets the same
behaviour as before. Phase-2 socket_loop.cpp migrates to add/remove
so multiple connections coexist without disturbing one another.

The runtime's single emit funnel point (set_stopped → emit_stopped_)
is the only call site for thread.event notifications in the daemon
today; the NonStopListener forwards parsed RSP stop replies through
runtime.set_stopped, and probe / breakpoint events use no separate
emission path. The subscriber set therefore covers every async
notification the dispatcher fires.

Tests:
- New unit cases in `tests/unit/test_nonstop_runtime.cpp` pin the
  fan-out, the remove behaviour, and the set/clear back-compat
  semantics. All four failed-as-expected before the implementation
  and pass after.
- The existing `set_notification_sink` callers in test_nonstop_listener
  and test_dispatcher_nonstop still work — the new "replace all"
  semantics match what those tests assume (one sink, no others).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 served one connection at a time: accept() → serve_one_connection
in the calling thread → close → next accept. An agent script wanting
to fire two `ldb` invocations in parallel against the same daemon had
to serialise them externally, or each call paid the spawn cost.

This commit accepts a connection, spawns a std::thread per connection
that owns its fd for its entire lifetime, and the main thread goes
straight back to accept(). The Dispatcher is shared; concurrent RPC
service is serialised through its new `dispatch_mu_` outer lock.

Concurrency audit (recorded for the next reviewer):

- `LldbBackend::Impl::mu` already guards every public method's SBAPI
  access. Every public LldbBackend method acquires it; nothing
  changed in this commit. The phase-3 chained-fixups branch's
  drop-mu-during-file-IO pattern still holds.
- `ProbeOrchestrator` has its own `mu_`. Every public method takes
  it; callback paths re-acquire when re-entering the orchestrator.
- `SessionStore` and `ArtifactStore` each have their own internal
  mutex around sqlite access (single-writer assumption preserved by
  WAL).
- `NonStopRuntime` has its own per-instance shared_mutex (state map)
  and the subscriber set lock added in the prereq commit.
- `Dispatcher`'s OWN mutable state — target_main_module_, diff_cache_
  + diff_cache_index_, cost_samples_, python_unwinders_, rsp_channels_,
  active_session_writer_, active_session_id_ — was NOT thread-safe.
  `dispatch_mu_` covers all of it under one outer lock for the
  duration of every dispatch() call.

Strategy: serialise via dispatch_mu_ around the entire dispatch
lifetime. Correct, dumb, and low-throughput in the multi-client
case (one RPC at a time across all connections). Per-target
sharding is the natural phase-3 refinement; the dispatcher's
mutable state would have to migrate to a per-target map first.
Documented in `dispatcher.h`.

Shutdown sequence: signal handler sets g_shutdown; accept() returns
EINTR; the main loop notices the flag and exits the accept loop.
On the way out we join every outstanding worker thread. In-flight
RPCs run to completion (LldbBackend's SBAPI calls aren't
interruptible from outside); a separate item in §2 phase-2 plans a
self-pipe + poll() refinement for finer-grained cancellation.

Tests:
- New `tests/smoke/test_socket_multiclient.py`: two Python threads
  each open a socket, run `target.open` (with its module list as a
  side effect — see handle_target_open), sync on a barrier, then
  run `module.list`. The barrier times out at 10s; phase-1 serial
  service would deadlock there because the second connection's
  accept() blocks until the first disconnects.
- Failed against the pre-fix daemon (barrier timeout, observed in
  the RED ctest run). Passes after the thread-per-connection
  refactor.
- All existing socket tests (lifecycle, collision, perms) still
  pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…se 4 item 1)

Phase 3 resets adrp_regs on RET / unconditional B / BR only.
Conditional branches (b.cond / cbz / cbnz / tbz / tbnz) whose target
sits in a different function are tail-call-like handoffs; on the
symbolized side, gate 1's function_name_at check catches the leak when
the scanner steps into the target function, but on the stripped side
gate 1 silently misses it (both adjacent functions return "" from
function_name_at).

Implement option (b) from docs/35-field-report-followups.md §3 phase 4:
parse the conditional's target operand inline (LLDB renders it as
`0xNNNNNNN`), resolve to a function name, and reset adrp_regs when
that name differs from the current function. Skip the parse when
adrp_regs is empty (the function_name_at call dominates cost; mirrors
gate 1's same optimisation). Bump a new
provenance.adrp_pair_cond_branch_reset counter so callers can see when
the heuristic conservatively dropped tracking — in stripped binaries
this is the only signal.

Provenance schema additions (forward-compatible):
- adrp_pair_cond_branch_reset (item 1)
- adrp_pair_function_start_reset (item 3 — wired in a subsequent commit)
- adrp_pair_unresolvable_load (item 4 — wired in a subsequent commit)

The two not-yet-populated counters are exposed on the wire now so the
dispatcher's serialisation path doesn't need a second pass when later
commits populate them.

TDD: tests/fixtures/asm/xref_condbranch.s + test_xref_condbranch.py.
The fixture is symbolized so gate 1 also covers the leak, but the
test pins provenance.adrp_pair_cond_branch_reset > 0 to prove the new
path fired — a future refactor that silently deletes the path would
flip the assertion red.

ctest: 10/10 xref smoke tests pass. No regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e 4 item 2)

Phase 3's FAT picker preferred arm64e > arm64 unconditionally. When
LLDB loaded the arm64 slice of a FAT binary that ALSO had an arm64e
slice, the picker still returned the arm64e map — different
image_base, zero matches in xref_address.

Phase 4 item 2 closes the loop: extract_chained_fixups_from_macho()
gains an optional std::string_view triple parameter. The dispatcher
calls SBTarget::GetTriple() and passes it through; the FAT picker
classifies the triple ("arm64e-" / "arm64-" / "x86_64-") into the
preferred (cpu_type, cpu_subtype) pair and tries the matching slice
first. Falls back to the phase-3 preference order when:
  - triple is empty (existing callers haven't been migrated yet)
  - triple names an unknown arch
  - the matching slice exists but has no chained fixups
This keeps the existing behaviour for any caller that doesn't yet
plumb the triple through; new callers see exact-match selection.

ARM64_ALL (subtype 0) match also accepts ARM64_V8 (subtype 1) — the
LLDB triple "arm64-" can map to either subtype depending on the slice
the linker tagged. Skip when the triple demanded arm64e (V8 is not
arm64e).
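
A minimal Python sketch of the triple-to-slice selection, assuming a hypothetical `Slice` record and `pick_slice` helper. The CPU constants are the standard Mach-O values; matching x86_64 on CPU type alone is a simplification made here. The "matching slice has no chained fixups" case is left out because this commit and the PR summary describe it differently.

```python
from collections import namedtuple

# Standard Mach-O constants (mach/machine.h).
CPU_TYPE_X86_64 = 0x01000007
CPU_TYPE_ARM64 = 0x0100000C
SUB_ARM64_ALL, SUB_ARM64_V8, SUB_ARM64E = 0, 1, 2
SUBTYPE_MASK = 0x00FFFFFF   # strip capability bits such as 0x80000000

# Hypothetical slice record; image_base only identifies the slice in tests.
Slice = namedtuple("Slice", "cpu_type cpu_subtype image_base")


def wanted_from_triple(triple):
    """Map a target triple to (cpu_type, accepted subtypes), or None."""
    if triple.startswith("arm64e-"):
        return (CPU_TYPE_ARM64, {SUB_ARM64E})
    if triple.startswith("arm64-"):
        # "arm64-" may be tagged ARM64_ALL or ARM64_V8, never arm64e.
        return (CPU_TYPE_ARM64, {SUB_ARM64_ALL, SUB_ARM64_V8})
    if triple.startswith("x86_64-"):
        return (CPU_TYPE_X86_64, None)   # None: any subtype (simplification)
    return None


def _find(slices, cpu_type, subtypes):
    for s in slices:
        sub = s.cpu_subtype & SUBTYPE_MASK
        if s.cpu_type == cpu_type and (subtypes is None or sub in subtypes):
            return s
    return None


def pick_slice(slices, triple=""):
    wanted = wanted_from_triple(triple)
    if wanted is not None:
        hit = _find(slices, *wanted)
        if hit is not None:
            return hit
    # Phase-3 fallback order: arm64e first, then arm64.
    return (_find(slices, CPU_TYPE_ARM64, {SUB_ARM64E})
            or _find(slices, CPU_TYPE_ARM64, {SUB_ARM64_ALL, SUB_ARM64_V8}))
```

An empty or unrecognised triple takes the fallback path unchanged, which is what keeps unmigrated callers on the phase-3 behaviour.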

TDD: 4 new unit tests under [chained_fixups][macho][fat][triple] in
tests/unit/test_chained_fixups.cpp pin: arm64 triple picks arm64
slice (image_base proves it), arm64e triple picks arm64e slice,
empty triple falls back to phase-3 default, missing-matching-slice
falls back too. 15/15 [chained_fixups] tests pass; 10/10 xref smoke
tests still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase-1 expected the operator to run `ldbd --listen unix:PATH` once
manually before issuing any `ldb --socket PATH` invocations; a stale
or missing daemon surfaced as a bare "could not connect" error. For
shell scripts that want the persistent-state property without the
ceremony of managing the daemon lifecycle by hand, the obvious
ergonomic ask is "just start one if it isn't running."

`_SocketProc` now detects the ECONNREFUSED / ENOENT / ENXIO subset
of connect() failures, fork+execs `ldbd --listen unix:PATH` with
`start_new_session=True` (setsid), waits up to ~3s for the socket
to start accepting, and retries the connect. The auto-spawned
daemon outlives the client process so the next CLI invocation
reuses it without re-spawning.

The ldbd binary is resolved through a three-step search:
  1. $LDB_LDBD_SPAWN — explicit override; tests use this to pin the
     build's ldbd binary without depending on $PATH discovery.
  2. shutil.which("ldbd") — global install.
  3. _find_ldbd_sibling() — the in-tree heuristic that the §1
     sibling-lookup commit established for `--ldbd`.

stdin/stdout/stderr are ALL redirected to /dev/null in the
daemon. The earlier sketch (which inherited the client's stderr
to preserve diagnostics) caused a subtle test-runner hang: when
a caller wrapped `ldb --socket ...` with subprocess.run
capture_output=True, the daemon inherited the captured stderr
pipe and held it open across the client's exit — the wrapper
never saw EOF and blocked indefinitely. Operators who want the
diagnostics now set $LDB_LDBD_LOG_FILE; the spawn redirects
stderr to that path instead.
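
The connect, spawn, retry flow can be sketched as below. This is not the `_SocketProc` code: `connect_with_autospawn`, `spawn_ldbd`, and their parameters are hypothetical names, and the `$LDB_LDBD_LOG_FILE` redirect is omitted. Only the errno subset, the `--listen unix:PATH` argument, `start_new_session=True`, the /dev/null redirection, and the roughly 3s wait come from the description above.

```python
import errno
import socket
import subprocess
import time

# connect() failures that mean "no daemon is listening at this path".
SPAWNABLE = {errno.ECONNREFUSED, errno.ENOENT, errno.ENXIO}


def _connect(path):
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(path)
    except OSError:
        s.close()
        raise
    return s


def spawn_ldbd(ldbd, path):
    """Detach the daemon: new session, all stdio on /dev/null so it never
    holds a caller's captured pipe open."""
    subprocess.Popen(
        [ldbd, "--listen", f"unix:{path}"],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )


def connect_with_autospawn(path, spawn, wait_s=3.0, poll_s=0.05):
    """Connect; on a 'no daemon' errno call spawn(path) once and retry."""
    try:
        return _connect(path)
    except OSError as e:
        if e.errno not in SPAWNABLE:
            raise
    spawn(path)
    deadline = time.monotonic() + wait_s
    while True:
        try:
            return _connect(path)
        except OSError as e:
            if e.errno not in SPAWNABLE or time.monotonic() >= deadline:
                raise
        time.sleep(poll_s)
```

A caller would pass something like `lambda p: spawn_ldbd("/path/to/ldbd", p)` as `spawn`; keeping the spawner injectable is what lets a test substitute an in-process listener.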

Help text updated to document the auto-spawn flow.

Tests:
- New `tests/smoke/test_socket_autospawn.py`:
  * Picks a fresh tempdir socket path; no daemon running.
  * Invokes `ldb --socket $path target.open ...`. Asserts rc=0
    and a valid target_id.
  * Invokes a second `ldb --socket $path module.list
    target_id=$N`. Asserts rc=0 — proves the daemon persisted.
  * Kills the daemon by pid recovered from $sock.lock; asserts
    socket inode unlinked.
- Failed RED before the implementation (the daemon never
  spawned; the test's `expect(rc == 0)` tripped immediately).
  Passes after.
- The four existing socket tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… 4 item 3)

Phase 3's gate 1 uses function_name_at() to detect function boundaries.
On a stripped Mach-O without LC_SYMTAB local symbols, function_name_at
would return "" for adjacent functions and gate 1 silently treats them
as one — adrp_regs leaks across.

(On macOS / Apple-silicon, LLDB synthesises ___lldb_unnamed_symbol_<addr>
per-address names so gate 1 still works; the leak fires on platforms
where LLDB doesn't synthesise OR when the bytes between two functions
look like raw code with no function-context lookup hit. Real
WeChat-class iOS binaries have hit this pattern in the field.)

Phase 4 item 3 records every B / BL / conditional-branch target inside
the current code section as a function-start hint. The check fires
BEFORE gate 1: when the scanner reaches an instruction whose address
is in the function_starts set, adrp_regs is reset and the new
provenance.adrp_pair_function_start_reset counter bumps. The two paths
are complementary — either is sufficient, the union is the
discriminating signal.

Lift the hex-token parser used by the cbz-target check (item 1) into a
shared lambda parse_last_hex_in_operands so both paths use the same
logic. Single-pass / forward-only: a branch at file_addr X to target Y
only takes effect for Y > X (the common case in compiler-emitted code;
backward-only-reached functions still miss).
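
A small Python sketch of the shared hex-token parse and the forward-only recording rule. The real helper is a C++ lambda; the name `note_branch_target` and the string form of the operands are assumptions, and the "inside the current code section" bounds check is omitted.

```python
import re

_HEX = re.compile(r"0x[0-9a-fA-F]+")


def parse_last_hex_in_operands(operands):
    """Return the last 0x... token in an operand string as an int, or None."""
    hits = _HEX.findall(operands)
    return int(hits[-1], 16) if hits else None


def note_branch_target(function_starts, insn_addr, operands):
    """Record a branch target as a function-start hint.

    Single-pass and forward-only: a target at or before the branch has
    already been scanned, so recording it would have no effect.
    """
    target = parse_last_hex_in_operands(operands)
    if target is not None and target > insn_addr:
        function_starts.add(target)
    return target
```

Taking the last hex token matters for `tbz`/`tbnz`, whose operands can carry an earlier immediate before the target address.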

TDD fixture: tests/fixtures/asm/xref_stripped_fnleak.s — two adjacent
non-globl functions linked through `bl`, with `strip -x` applied
post-link to remove the local function symbols. x19 (callee-saved per
AAPCS64) holds an ADRP page across the BL so phase 3's caller-saved
clear can't mask the leak. The smoke test asserts zero false-positive
matches; documents that on macOS gate 1's synthesised names also cover
the boundary, so the test doesn't strictly require the
function_start_reset path to fire (correctness is what matters).

Bumps the worktree's smoke-test count from 82 to 83. ctest 100% green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two phase-2 items that share their plumbing:

  4. SIGTERM mid-accept must wake the listener within milliseconds.
     Phase-1 polled g_shutdown only between connections; a daemon
     idle in accept() saw EINTR on signal and exited, but only by
     accident — bare accept() returned EINTR and the loop's flag
     check fired on the next iteration. Adding poll() with
     non-blocking accept() makes that explicit and gives us a
     second wakeable fd for #6 below.

  6. `daemon.shutdown` RPC: a connected client can ask the daemon
     to exit cleanly. The handler returns `{ok:true}` and triggers
     the same wake mechanism that SIGTERM uses, so an orchestrator
     can drain the daemon without spawning a "kill by pid" step.

The shared mechanism is a self-pipe. Both ends are CLOEXEC and
non-blocking. The signal handler writes a byte (write(2) is
async-signal-safe per POSIX); the daemon.shutdown callback writes
the same byte from the worker thread. The main accept loop's
poll() monitors srv + pipe[0]; on POLLIN of pipe[0] it drains the
pipe (non-blocking read, so the drain terminates with EAGAIN once
empty — the prior blocking-read attempt deadlocked here, only
discovered by tracing the daemon.shutdown test failure) and
checks g_shutdown.

Bug found while writing this: the read-end of the self-pipe must
also be O_NONBLOCK, not just the write end. The drain loop reads
in a loop until read() returns ≤ 0; with a blocking read end, the
SECOND iteration (pipe empty after consuming the wake byte)
blocks forever. The non-blocking flag makes it return EAGAIN
instead.
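
The self-pipe mechanics, including the non-blocking drain, can be reproduced in a few lines of Python. This is a sketch of the pattern only, not the daemon's C++; all function names are hypothetical. `os.pipe()` already returns close-on-exec descriptors, and `select.poll` is available on Linux and macOS.

```python
import os
import select


def make_self_pipe():
    r, w = os.pipe()              # non-inheritable (CLOEXEC) by default
    os.set_blocking(r, False)     # the read end must be non-blocking too
    os.set_blocking(w, False)
    return r, w


def wake(w):
    try:
        os.write(w, b"\x01")
    except BlockingIOError:
        pass                      # pipe full: a wake-up is already pending


def drain(r):
    """Read until the pipe is empty; return the number of bytes consumed."""
    n = 0
    while True:
        try:
            chunk = os.read(r, 64)
        except BlockingIOError:   # EAGAIN: empty, stop instead of blocking
            return n
        if not chunk:             # write end closed
            return n
        n += len(chunk)


def wait_for_event(pipe_r, listen_fd=None, timeout_ms=None):
    """Return "wake", "accept" or "timeout" for one poll() round."""
    poller = select.poll()
    poller.register(pipe_r, select.POLLIN)
    if listen_fd is not None:
        poller.register(listen_fd, select.POLLIN)
    ready = {fd for fd, _ in poller.poll(timeout_ms)}
    if pipe_r in ready:
        drain(pipe_r)
        return "wake"
    return "accept" if ready else "timeout"
```

Dropping the `os.set_blocking(r, False)` line reproduces the bug described above: the second `os.read` in `drain` would block on the now-empty pipe.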

Scope clarification (per docs §2 "in-flight RPC interruption"):
this commit only stops accepting new RPCs immediately and lets
the currently-executing dispatch run to completion. Cancelling an
in-flight LldbBackend SBAPI call from outside is genuinely
impossible against the LLDB ABI; the test
`test_socket_interruption.py` documents that scope by closing the
client socket so the worker sees EOF cleanly. The shutdown
callback is wired only in listen mode; stdio mode's
`daemon.shutdown` returns -32002 with a "use stdin EOF or SIGTERM"
message.

describe.endpoints catalog grew one entry for `daemon.shutdown`.
Schema is trivial (no params; returns `{ok: bool}`).

Tests:
- New `tests/smoke/test_daemon_shutdown_rpc.py`: connects, sends
  daemon.shutdown, verifies ok=true reply, closes client, asserts
  daemon exits within 10s with rc=0 and the socket/lockfile gone.
- New `tests/smoke/test_socket_interruption.py`: connects,
  completes one describe.endpoints call, sends SIGTERM to the
  daemon, closes the client, asserts daemon exits within 5s with
  rc=0. Pre-fix daemon hung in the accept loop until the signal
  arrived AND a new connection event happened (or the bare
  accept's EINTR fired) — the poll-based path makes it
  deterministic.
- All five prior socket tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3's gate 7 bumps adrp_pair_skipped for register-offset LDRs with
a tracked base (`[xN, xM]` / `[xN, xM, lsl #imm]`). Phase 4 item 4
extends the family: PC-relative literal loads (`ldr xN, #imm` /
`ldr xN, 0xNNNN`) bypass the ADRP+pair pattern entirely — they load
the slot's value via PC-relative addressing, not through a register
the scanner tracked.

The literal-pool slot might hold a pointer to a string or constant in
__TEXT/__cstring or __DATA_CONST. The scanner can't statically
dereference it (would need to re-read the segment data at file_addr
+ pcrel_imm). Phase 4 bumps the new adrp_pair_unresolvable_load
counter so callers see this happened, instead of the load silently
disappearing.

Detection shape: in the "memop didn't match resolve_adrp_consumer"
fallback, after the existing `[xN, ...]` register-offset branch,
check for an immediate-shaped operand (`#imm` / `0xNNN` / `-imm`).
Only `ldr` / `ldrsw` produce literal-pool loads on arm64 — stores
and short loads use different addressing modes.

The new counter (and the matching adrp_pair_function_start_reset for
item 3) is exposed on the wire by the dispatcher path that already
serialises the other adrp_pair_* fields.

TDD: tests/fixtures/asm/xref_pcrel_literal.s — `ldr x0, _pcrel_const`
where _pcrel_const is a quad inside __TEXT/__text. The smoke test
asserts provenance.adrp_pair_unresolvable_load >= 1. xref.addr
against `_pcrel_data` returns 0 matches today (the heuristic gives
up on the literal); the counter is the contract that surfaces this
to the caller. 12/12 xref smoke tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… item 6)

The phase-4 spec for bind resolution (docs/35-field-report-followups.md
§3 item 6) allows shipping only the schema if the imports-table walk
becomes too complex for one branch. The parse / walk itself spans:
  - dyld_chained_fixups_header::imports_offset / imports_count /
    imports_format (three formats: DYLD_CHAINED_IMPORT,
    _IMPORT_ADDEND, _IMPORT_ADDEND64)
  - Indexing into the imports table by the bind's ordinal field
    (24-bit or wider depending on format)
  - String-table lookup via name_offset into the symbols region
  - Optional SBTarget::FindSymbols(name) for resolved_addr when a
    process is loaded

That's ~150 LOC of byte-level parsing across three import formats. To
keep this branch tight, ship only the schema additions:
  - new BindInfo struct: name, addend, ordinal, resolved_addr (opt).
  - new ChainedFixupMap::binds map: rva → BindInfo, populated by the
    phase-5 walk; today's parser leaves it empty for every fixture.

Three new unit tests pin the schema:
  - BindInfo default-constructible with empty fields
  - ChainedFixupMap.binds empty by default
  - parse_chained_fixups leaves binds empty on a rebase-only payload

The phase-5 commit that wires the walk in will populate binds for
test vectors that carry imports_count > 0 and flip the third
assertion. Today's 18/18 [chained_fixups] tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
An orchestrator that auto-spawns the daemon (the §2 phase-2 client-
side auto-spawn lands an `ldbd --listen unix:PATH` if no daemon is
running) probably wants that daemon to die quietly after the burst
of activity finishes. Otherwise every interactive session leaves a
lingering ldbd, and the operator has to clean it up by hand.

`--listen-idle-timeout N` gates the daemon's shutdown on the
accept-loop's poll() returning 0 (timeout elapsed) AND a "no live
workers" check. Both conditions are necessary: a long-lived agent
session might idle on a connected socket for >N seconds while the
user thinks; pulling the daemon down would surface as a
mysterious disconnect.

Implementation:
- New static atomic `g_live_workers` tracks the count of running
  per-connection worker threads. The accept loop increments
  BEFORE std::thread construction (so a poll wake-up that races
  with this spawn can't observe zero workers); the worker
  decrements on exit.
- `poll()` takes `idle_timeout_sec * 1000ms` as its timeout
  argument when `idle_timeout > 0 && live_workers == 0`,
  otherwise -1 (block forever). On `poll()` returning 0 the loop
  rechecks live_workers (catching the case where a worker
  emerged during the gap) and, if still zero, sets g_shutdown
  and breaks. The existing teardown path (close listener,
  unlink socket + lockfile, join workers) runs unchanged.
- Workers write a wake byte to the self-pipe when they exit so
  the accept loop re-evaluates the timeout. Linux's poll
  resets the timeout per-call but macOS's preserves it across
  spurious returns; the explicit wake makes the behaviour
  uniform without depending on the platform's poll semantics.
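
The two decisions in the accept loop reduce to a pair of small predicates. The Python below is a sketch with hypothetical function names; it mirrors only the conditions stated in the list above.

```python
def poll_timeout_ms(idle_timeout_sec, live_workers):
    """Timeout for the next poll(): finite only when the idle timer is
    enabled and no worker is running; otherwise -1 (block until woken)."""
    if idle_timeout_sec > 0 and live_workers == 0:
        return idle_timeout_sec * 1000
    return -1


def should_shut_down(poll_result, live_workers):
    """After poll() returns 0 (timeout), recheck the worker count so a
    connection accepted during the gap keeps the daemon alive."""
    return poll_result == 0 and live_workers == 0
```

For example, `--listen-idle-timeout 2` with no workers yields a 2000 ms poll, while a single connected but silent client switches the loop back to blocking indefinitely.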

Tests:
- New `tests/smoke/test_socket_idle_timeout.py`: starts
  `ldbd --listen-idle-timeout 2 --listen ...`, waits 8s with no
  clients, asserts the daemon exited rc=0 and the
  socket/lockfile are gone. Pre-fix daemon (no idle timeout)
  hangs in poll() forever; the test would time out at 30s.
- All six prior socket tests still pass.

`ldbd --help` text grew a paragraph documenting the flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ase 4 item 7)

Phase 4 item 7 (docs/35-field-report-followups.md §3) asks for a
moderate-size C-compiled fixture that exercises the resolver in
shapes closer to real iOS app binaries than the hand-assembled
phase-3 fixtures. Add tests/fixtures/c/real_world_xref.c:

  1. static const char *const k_string_table[3]: selref-style
     ADRP+LDR through a __DATA_CONST chained-fixup slot.
  2. Multiple functions in one TU exercising function-boundary
     reset (RET-clear + name-based + function_starts).
  3. Conditional-branch tail-call (`if (which == 0) return
     real_xref_pick(0); return k_string_table[which];`) — proves
     phase 4 item 1's cross-function reset doesn't eat the
     legitimate same-function fall-through xref.
  4. extern malloc / free imports — exercises the chained-fixup
     binds path (BindInfo schema; resolution is phase 5).

Build: -arch arm64 -O1 -Wl,-fixup_chains so the linker emits
LC_DYLD_CHAINED_FIXUPS with __DATA_CONST rebases for the string
table. Apple-silicon-arm64 only.

Smoke test asserts:
  - Every entry in k_string_table[] surfaces at least one xref
    instruction via string.xref (slot-indirection path live).
  - A non-pointer literal (0x1122334455667788) surfaces zero
    matches (false-positive density on a 4-function TU is the
    noise-floor metric).

Spot-check against /usr/bin/uname (host-dependent, not automated):
  triple = arm64e-apple-macosx26.3.0; FAT slice picker (item 2)
  selected arm64e correctly. 8 sampled strings each returned 1
  xref with empty provenance — no skips, no warnings, no false
  positives. Documented as a manual probe; not a CI assertion
  because the binary changes across macOS versions.

ctest: 84/84 (was 83) all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move docs/35-field-report-followups.md §3 "Phase 4 — carried forward"
subsection into "Phase 4 — what shipped" with commit SHAs and
acceptance evidence for each item. New "Phase 5 — carried forward"
subsection captures the items still deferred (full bind walk, auth-
rebase key-class filtering, on-disk cache, correlate.* wire-up,
multi-module xref, full dataflow, CI assertions on real iOS
binaries).

Worklog entry pins the seven phase-4 commits, the decisions behind
option (b) for conditional-branch handling, the schema-only ship for
bind resolution, and the manual /usr/bin/uname spot-check that
replaced the spec's /usr/bin/grep suggestion (grep's __cstring is
empty — strings come from the shared cache, not the binary).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
session.replay's per-row loop calls dispatch() re-entrantly so the
replayed request goes through the full outer wrapper — provenance
decoration + per-RPC cost recording still fire, while the session-
log append no-ops because replay suspends the writer. The multi-
client commit's std::mutex deadlocked there; the ctest
smoke_session_replay run pinned this within seconds.

std::recursive_mutex restores correctness without losing the
cross-thread serialisation property. Same-thread re-entry is now
free; cross-thread overlap still queues at the lock. The overhead
per-acquisition vs std::mutex is negligible compared to the work
inside any real RPC.
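
The re-entrancy problem is easy to reproduce in Python, where `threading.RLock` plays the role of `std::recursive_mutex`. The `Dispatcher` class and request shape below are hypothetical stand-ins for the real dispatcher.

```python
import threading


class Dispatcher:
    """Hypothetical dispatcher with one outer, re-entrant lock."""

    def __init__(self):
        self._dispatch_mu = threading.RLock()
        self._depth = 0
        self.max_depth = 0        # deepest same-thread nesting observed

    def dispatch(self, request):
        with self._dispatch_mu:
            self._depth += 1
            self.max_depth = max(self.max_depth, self._depth)
            try:
                if request["method"] == "session.replay":
                    # Each replayed row goes back through the full wrapper
                    # on the same thread, re-acquiring the lock.
                    return [self.dispatch(row) for row in request["rows"]]
                return {"ok": True, "method": request["method"]}
            finally:
                self._depth -= 1
```

Swapping `threading.RLock()` for `threading.Lock()` makes the nested `dispatch` call block forever on the lock its own thread already holds, which is the deadlock the smoke test caught.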

Also folds in:
- `docs/35-field-report-followups.md §2`: "Phase 2 — what shipped"
  subsection records the six items that landed (multi-subscriber
  sinks, multi-client listener, auto-spawn, signal-driven wakeup,
  daemon.shutdown, idle timeout) with the concurrency audit notes.
  "Phase 3 — carried forward" enumerates the deferred items
  (token auth, per-target dispatcher sharding, true in-flight
  cancellation, worker reaping mid-flight, TLS, single-client
  RPC multiplexing).
- `docs/WORKLOG.md`: new dated entry summarising the goals,
  per-commit deliverables, key decisions, surprises (the
  capture_output / stderr-inheritance hang; both-ends-non-blocking
  for the self-pipe; phase4-xref-improvements worktree contamination),
  and verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…AF fix)

The phase-2 NonStopRuntime stored raw NotificationSink* in its subscriber
vector. emit_stopped_ snapshotted the raw pointers under a shared lock,
dropped the lock, then dereferenced — a concurrent remove_notification_sink
(on a connection-worker thread) racing with sink destruction (the
worker's stack-local StreamNotificationSink going out of scope on
disconnect) could free the sink while the listener thread still held
the raw pointer in its snapshot. Reviewer reproduced it with TSan
(vptr race) and ASan (heap-use-after-free) on a focused multi-threaded
unit test.

Fix: migrate subscriber storage from `NotificationSink*` to
`std::shared_ptr<NotificationSink>`. emit_stopped_'s snapshot now
copies shared_ptrs, bumping refcounts; every sink in the snapshot
stays alive across the iteration regardless of concurrent remove.
On the connection-worker side, the per-connection StreamNotificationSink
is allocated via std::make_shared so the runtime's strong ref and any
in-flight emit's snapshot ref both keep it alive past the worker's
return.

remove_notification_sink and set_notification_sink move the doomed
sinks out of the vector under the lock and drop them AFTER releasing,
so a sink destructor that might re-enter the runtime can't deadlock on
sinks_mu_.
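
The ownership pattern can be sketched in Python, where ordinary object
references play the role of `shared_ptr` and `weakref` plays the role of
`weak_ptr`. This is an analogue under those assumptions; `Runtime` and
`RecordingSink` are hypothetical names, not the project's classes.

```python
import threading
import weakref


class RecordingSink:
    """Hypothetical sink that remembers every event it receives."""

    def __init__(self):
        self.events = []

    def on_event(self, event):
        self.events.append(event)


class Runtime:
    def __init__(self):
        self._mu = threading.Lock()
        self._sinks = []  # strong references, like a vector of shared_ptr

    def add_sink(self, sink):
        with self._mu:
            self._sinks.append(sink)

    def remove_sink(self, sink):
        with self._mu:
            doomed = [s for s in self._sinks if s is sink]
            self._sinks = [s for s in self._sinks if s is not sink]
        # The removed references are dropped only after the lock is released,
        # so a finaliser that re-enters the runtime cannot deadlock on _mu.
        del doomed

    def emit_stopped(self, event):
        with self._mu:
            snapshot = list(self._sinks)  # copying references pins each sink
        for sink in snapshot:  # iterate with the lock released
            sink.on_event(event)
        return len(snapshot)
```

The snapshot holds its own reference to every sink, so a concurrent
`remove_sink` can shrink the list without invalidating the iteration.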

Test: tests/unit/test_nonstop_runtime.cpp adds a 200ms-budgeted
concurrent stress test (emitter thread vs add/remove churn thread) and
a synchronous "runtime keeps sink alive across emit even if caller
drops its ref" test using weak_ptr observation. Both pass TSan
(`-fsanitize=thread`, sibling `build-tsan/` dir).

Updated existing call sites: main.cpp (stdio sink → make_shared),
socket_loop.cpp (per-connection sink → make_shared), test_nonstop_*.cpp
(local sinks → shared_ptr).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix: after `daemon.shutdown` or SIGTERM set `g_shutdown`, the
accept loop stopped accepting NEW connections, but already-connected
workers kept reading and dispatching RPCs as long as the peer kept
sending. The phase-2 doc claims "shutdown stops accepting new RPCs
immediately"; in reality only new connections were refused, and the
daemon process would linger long after the accept loop had exited
because workers were still dispatching.

Fix: `serve_one_connection` takes an optional `is_shutdown` predicate.
Between read and dispatch, if the predicate returns true, the worker
synthesises a kBadState ("daemon shutting down") response — echoing
the request id for correlation — and breaks out of the loop. The
worker returns, the accept-loop join unblocks, the daemon exits.

Stdio mode keeps the default (empty predicate evaluates as false) so
its single-client semantics are unchanged. The socket loop passes a
closure over the file-scope `g_shutdown` atomic.
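
A rough Python sketch of the gate between read and dispatch. The
request/response dictionaries, the `"kBadState"` string code, and the
use of `threading.Event` in place of the `g_shutdown` atomic are all
assumptions made for illustration.

```python
import threading

# Stand-in for the file-scope g_shutdown atomic.
shutdown = threading.Event()


def serve_one_connection(requests, dispatch, is_shutdown=None):
    """Serve requests until they run out or shutdown is observed.

    requests:    iterable of dicts with "id" and "method" (stands in for
                 reads from the socket).
    is_shutdown: optional zero-argument predicate; None means "never".
    """
    responses = []
    for req in requests:
        # The gate sits between read and dispatch.
        if is_shutdown is not None and is_shutdown():
            responses.append({
                "id": req["id"],  # echo the request id for correlation
                "error": {"code": "kBadState",
                          "message": "daemon shutting down"},
            })
            break
        responses.append({"id": req["id"], "result": dispatch(req)})
    return responses
```

Passing `shutdown.is_set` as the predicate mirrors the socket loop's
closure; omitting it mirrors stdio mode, where the gate never fires.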

Test: `tests/smoke/test_socket_shutdown_active_clients.py` exercises
the cross-cutting promise — two clients A and B; B sends
daemon.shutdown; A's next RPC must surface a shutdown error (or clean
EOF), NOT a normal success response; daemon exits within a generous
window. Without the fix the test fails because A's hello is dispatched
successfully past the shutdown latch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A connected-but-not-reading peer let the kernel send buffer fill;
the daemon's `::write(2)` in `FdStreambuf::sync` then blocked
indefinitely. The listener thread serving notifications calls the
same write path through `OutputChannel`, so an indefinitely-blocked
write held the inner backend's `map_mu_` shared. A second client's
`target.close` then wants `map_mu_` UNIQUE while holding
`dispatch_mu_`, so the whole daemon wedges: it still accepts new
connections but cannot service any RPC behind the dead-peer write.

Fix: mirror the existing `SO_RCVTIMEO` setsockopt block. 60 seconds
is far past any benign reply round-trip but tight enough that a
wedge doesn't keep the daemon unresponsive for minutes. On EAGAIN
the streambuf latches `write_failed_`, `write_response` throws
`protocol::Error`, and the worker exits cleanly via the existing
error-handling path.
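
For illustration, the same socket option can be set from Python. This
is a sketch, not the daemon's C++ code; it assumes a 64-bit Linux or
macOS host where a `struct timeval` can be packed as two native longs,
and `set_send_timeout` is a hypothetical helper name.

```python
import socket
import struct


def set_send_timeout(sock, seconds):
    """Bound how long a blocking send may wait on a full send buffer."""
    # struct timeval { tv_sec, tv_usec }, packed as two native longs.
    timeval = struct.pack("ll", int(seconds), 0)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDTIMEO, timeval)
```

Once the timeout elapses, the blocked send fails with EAGAIN rather
than waiting indefinitely, which is the condition the streambuf latches
on.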

Test: `tests/smoke/test_socket_slow_reader.py` — client A connects
with a small SO_RCVBUF, fires a stream of `describe.endpoints` RPCs
(~50KB reply each), never reads. Client B concurrently does a tiny
hello and must get a response in well under 30s. After tearing down
A, the daemon must exit on SIGTERM within 15s — pre-fix it could
sit on the blocked write to A indefinitely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase-4 item 1 (commit 311c439) introduced two silent-wrong-result
regressions against phase 3.

C1 (fall-through clobber): the implementation unconditionally cleared
adrp_regs on a cross-function cond branch, including the source-side
fall-through path. The spec literally reads "Fall-through path:
preserve state". An `add x0, x8, _t@PAGEOFF` after `cbz x9, _other_fn`
is in the source function by definition; clearing x8 silently lost the
xref.

C2 (same-fn target poisons function_starts): the cond-branch block
also unconditionally inserted the target into function_starts. A
same-function cbz to a local label (Lhere, loop backedges, basic-block
merges) then triggered gate 3 to reset adrp_regs at the label, killing
the post-label consumer's xref.

Fix: rework the cond-branch block.
- No more source-side adrp_regs.clear(). The fall-through stays tracked.
- function_starts.insert() and the provenance bump fire only when the
  target's function differs from the current function. Same-fn targets
  no longer poison function_starts.
- Counter renamed: adrp_pair_cond_branch_reset → _recorded (we record
  a target hint now, we don't reset state).
- Move the cond-branch bookkeeping outside the `!adrp_regs.empty()`
  guard (I4): the function_start hint is valuable for LATER iterations
  once an ADRP becomes tracked, even if no ADRP is tracked at the cbz
  site.
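
The reworked bookkeeping can be sketched as follows. This is a
simplified Python model, not the resolver's C++; `record_cond_branch`
and the `function_of` lookup are hypothetical names introduced for the
example.

```python
def record_cond_branch(target, current_fn, function_of,
                       function_starts, provenance):
    """Bookkeeping for b.cond / cbz / cbnz / tbz / tbnz.

    Runs whether or not any ADRP is currently tracked, and never touches
    the ADRP register map, so the fall-through path keeps its state.
    """
    if function_of(target) != current_fn:
        # Cross-function target: record a function-start hint.
        function_starts.add(target)
        key = "adrp_pair_cond_branch_recorded"
        provenance[key] = provenance.get(key, 0) + 1
        return True
    # Same-function target (local label, loop backedge): record nothing.
    return False
```

Keeping same-function targets out of `function_starts` is what stops a
local label from later being treated as a function boundary.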

Updated dispatcher schema (I2 partial): the existing schema only
declared two counters; bring it up to date with the five the code
emits, with docstrings explaining each one's semantics.

TDD evidence: two new fixtures + smokes
(xref_cond_fallthrough.s, xref_cond_same_fn.s) failed RED against
2b170ce with the diagnostic "the legitimate xref against … vanished"
and the matches list empty. Post-fix both pass; the existing
xref_condbranch smoke (the cross-fn case) continues to pass with the
renamed counter.

ctest 87/87 (85 prior + 2 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
I4: when N clients race-spawn N daemons against the same socket path,
the (N-1) losers all write diagnostic lines to the SAME stderr (often
via LDB_LDBD_LOG_FILE redirection). Pre-fix each line was emitted as
a chain of `std::cerr << "ldbd: ..." << pid << ... << "\n"` shifts;
std::cerr is unit-buffered, so each shift is flushed as its own
write(2) syscall, and concurrent processes interleave the bytes
mid-line. Operators saw
"ldbd: another daemon is already lis ldbd: another daemon is alr".

Fix: introduce `log_err_line(std::string)` which emits the line with
a single `std::fwrite(..., stderr)`. POSIX guarantees that a write of
≤PIPE_BUF bytes (at least 512; 4096 on Linux) to a pipe or FIFO is
atomic w.r.t. concurrent writers. For a regular file POSIX makes no
PIPE_BUF promise, but a single write(2) to a descriptor opened with
O_APPEND is appended as a unit. Convert every multi-shift stderr
line in this file to use it.
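
The single-write idea, sketched in Python with `os.write`. The function
name mirrors `log_err_line` but this is an illustrative analogue, and
the `fd` parameter is an assumption added so the sketch can be pointed
at a pipe.

```python
import os


def log_err_line(line, fd=2):
    """Emit one diagnostic line with a single write(2) call."""
    data = (line.rstrip("\n") + "\n").encode("utf-8")
    # One syscall for the whole line: there is no point mid-line at which
    # another process's output can be interleaved by this writer.
    return os.write(fd, data)
```

Building the full line in memory first is the important part; the
choice of `os.write` over buffered I/O just makes the one-syscall
property explicit.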

Test (`tests/smoke/test_socket_autospawn_logs.py`): launch 10
daemons against the same socket path with stderr aimed at a single
log file. Exactly one wins the bind race; the rest exit with a
diagnostic. Verify every non-empty line in the log starts with
`ldbd: ` — i.e. no diagnostic got torn across a write boundary.

N3: `g_shutdown_pipe[1]` is read in the signal handler. While the
plain (non-atomic) int read is harmless on aarch64 in practice,
strict conformance requires an `std::atomic<int>` for the
cross-thread publish/load. Introduce `g_shutdown_pipe_write` atomic, published
under release-store AFTER FD_CLOEXEC + O_NONBLOCK are set, cleared
to -1 BEFORE the close in teardown. A late signal arriving during
shutdown now observes the sentinel and skips the write — pre-fix
it could (rarely) write to a closed fd or, worse, a recycled fd
of an unrelated open.

N4: workers list grows for daemon lifetime. Reviewer flagged this
as legitimately phase-3-deferable; add an explicit
`TODO(phase 3 / N4)` comment next to the list declaration so a
future maintainer doesn't rediscover it cold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The phase-3/4 ADRP-pair resolver maintained a WHITELIST of mnemonics
in the post-emit state-mutation block: ADD/SUB/ADDS/SUBS clobbered
the dst, MOV variants ran apply_mov_state, calls clobbered the
AAPCS64 caller-saved set, returns cleared the map. Every OTHER
register-writing instruction silently left dst tracking intact.

That whitelist was the wrong invariant. CSEL / CSET / CSINC / CSINV
/ CSNEG, LDP / LDPSW / LDXP / LDAR / LDAXR, MADD / MSUB, EXTR /
BFI / BFM / UBFX / SBFX / UBFM / SBFM, ORR / AND / EOR / EON with
shifted-reg, FMOV (to GPR), SDIV / UDIV, REV / CLZ, ASR / LSL / LSR
/ ROR shifts — every one of them writes a destination register but
none of them appeared in the whitelist. After any of them ran, the
destination register kept whatever ADRP page it previously held and
the next LDR or ADD through that register produced a silent false
positive.

C3 (CSEL): the "pick between two strings" compiler idiom emits
  adrp x8, _str_a@PAGE
  adrp x9, _str_b@PAGE
  ...
  csel x8, x9, x8, gt
  ldr  x0, [x8, #0x10]
xref.addr(_str_a + 0x10) falsely matched the LDR — this is the most
common false-positive vector in real iOS / macOS binaries.

C4 (LDP): an `ldp x8, x9, [sp]` (a reload from the stack) rewrites
x8 from memory; any prior ADRP into x8 is gone. The phase-3 resolver
didn't model paired loads at all, so the post-LDP ADD false-matched.

Architectural shift: clobber-by-default. Introduce a new helper
parse_destination_registers(mnemonic, operands) in
xref_arm64_parsers that returns the canonical x-register names an
instruction writes. The post-emit pass runs explicit propagation
paths first (ADRP records, MOV propagates, calls clobber caller-
saved, returns/B clears all), then the new pass erases every
destination register that wasn't already handled by an explicit
arm. dst_already_handled gates the second pass so legitimate
ADRP/MOV tracking isn't undone.

The helper handles 14 mnemonic categories, including:
- Stores (STR/STP/STUR/STRH/STRB/STLR/STNP/...) — no dst.
- Compares (CMP/CMN/TST/CCMP/CCMN/FCMP/...) — no dst.
- Branches & returns (B/BL/BR/BLR/CBZ/TBZ/B.cond/...) — no dst.
- System (NOP/YIELD/WFE/DMB/DSB/ISB/MSR/...) — no dst.
- Paired loads (LDP/LDPSW/LDXP/LDAXP/LDNP) — two dsts.
- Default: first operand register is the destination.

The default catches CSEL/CSET/CSINC/CSINV/CSNEG/MADD/MSUB/ORR/
AND/EOR/EXTR/BFI/UBFX/etc. without enumeration.
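
A simplified Python model of the clobber-by-default pass. The real
helper lives in C++ and covers more categories; the mnemonic sets
below are a partial, illustrative subset, `clobber_destinations` is a
hypothetical name, and pre/post-index writeback is not modelled.

```python
import re

# Partial lists for illustration; the real helper covers more mnemonics.
_NO_DST = {
    "str", "stp", "stur", "strh", "strb", "stlr", "stnp",    # stores
    "cmp", "cmn", "tst", "ccmp", "ccmn", "fcmp",              # compares
    "b", "bl", "br", "blr", "ret", "cbz", "cbnz", "tbz", "tbnz",
    "nop", "yield", "wfe", "dmb", "dsb", "isb", "msr",        # system
}
_PAIRED_LOADS = {"ldp", "ldpsw", "ldxp", "ldaxp", "ldnp"}
_GPR = re.compile(r"^[wx](\d+)$")


def _canonical(token):
    """Map w<N>/x<N> to x<N>; anything else (sp, d0, ...) yields None."""
    m = _GPR.match(token.strip().lower())
    if m and int(m.group(1)) <= 30:
        return "x" + m.group(1)
    return None


def parse_destination_registers(mnemonic, operands):
    name = mnemonic.lower()
    if name in _NO_DST or name.startswith("b."):
        return []
    ops = operands.split(",")
    count = 2 if name in _PAIRED_LOADS else 1  # default: first operand
    regs = (_canonical(op) for op in ops[:count])
    return [r for r in regs if r is not None]


def clobber_destinations(adrp_regs, mnemonic, operands, already_handled=()):
    """Second pass: forget every written register not handled explicitly."""
    for reg in parse_destination_registers(mnemonic, operands):
        if reg not in already_handled:
            adrp_regs.pop(reg, None)
```

For the CSEL idiom above, `clobber_destinations` drops the stale page
for x8 while leaving x9 tracked, so the following LDR no longer
matches.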

clobber_arith_destination is removed — ADD/SUB/ADDS/SUBS now fall
through to the generic pass which produces identical behaviour.

TDD evidence: two new fixtures + smokes
(xref_csel.s, xref_ldp_clobber.s) failed RED against ced9f17 with
the diagnostic "the LDR/ADD through stale x8 matched against …".
Post-fix both pass; 16 new unit test cases pin
parse_destination_registers behaviour across CSEL, LDP, LDPSW,
LDXP/LDAXP, LDR family, ADD/SUB family, STR/STP family, CMP/TST
family, branches, MADD/MSUB family, ORR/AND/EOR shifted-reg, EXTR/
BFI/UBFX bitfield family, NOP/YIELD/barrier, w→x canonicalisation,
unrecognised-mnemonic default.

ctest 89/89 (87 + 2 new smokes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hained fixups (phase 4 C5)

The phase-4 FAT-aware slice picker had a silent-wrong-result bug:
when the caller-supplied triple matched a slice in the FAT, the
picker returned that slice's parse only if `resolved` was non-empty.
If the matched slice was a classic LC_DYLD_INFO_ONLY binary with no
chained fixups (resolved.empty()), control fell through to the
phase-3 preference order — which could land on a DIFFERENT slice
(e.g. arm64e) with a totally different image_base. The caller's
xref scan then resolved every ADRP page through the wrong slice's
image_base and silently produced garbage.

LLDB's choice of slice is the source of truth. If the triple matched
ANY slice in the FAT, honour it — including the empty-chained-fixup
case. The caller gets an empty ChainedFixupMap (no chained-fixup
xref resolution) and the literal-operand / ADRP-pair scan runs
against the CORRECT image_base. Only fall through to preference when
NO slice in the FAT matches the triple at all (the legitimate
"triple says x86_64 but FAT is arm64-only" path).

The pre-existing unit test "triple-matching slice missing falls back
to preference order" is correct under both pre- and post-fix
behaviour because it exercises the legitimate "no triple match"
fallback path. Its comments are updated to clarify the distinction.
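
The selection rule, sketched in Python. The slice dictionaries and the
preference tuple are hypothetical; the source only says a "phase-3
preference order" exists, so the order used here is an assumption.

```python
def pick_slice(slices, triple_arch,
               preference=("arm64e", "arm64", "x86_64")):
    """Pick the FAT slice to parse.

    A slice matching the caller's triple always wins, even when its
    chained-fixup map is empty. The preference order is consulted only
    when no slice matches the triple at all.
    """
    for s in slices:
        if s["arch"] == triple_arch:
            return s  # honoured even if s["resolved"] is empty
    for arch in preference:
        for s in slices:
            if s["arch"] == arch:
                return s
    return None
```

The pre-fix behaviour corresponds to adding `and s["resolved"]` to the
first condition, which is how an arm64 request could silently end up
with the arm64e slice's image_base.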

TDD evidence: new unit test "triple-matched slice WITHOUT chained
fixups wins (C5 silent-wrong-result fix)" constructs a FAT with an
arm64 slice (no LC_DYLD_CHAINED_FIXUPS, image_base 0x100000000) and
an arm64e slice (with chained fixups, image_base 0x200000000). With
triple=arm64 the test asserts:
  - resolved.empty() (arm64 has no fixups)
  - image_base != 0x200000000 (must NOT fall through to arm64e)
Against pre-fix code both assertions FAIL (resolved size 2 from
arm64e fall-through, image_base=0x200000000). Against post-fix code
both PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…+N5)

Pre-fix `_resolve_autospawn_ldbd()` accepted any X_OK path in
`$LDB_LDBD_SPAWN`. A mistyped env var landing on a real but unrelated
executable (e.g. `/usr/bin/yes`, `/bin/echo`) would spawn that
binary; the spawned child never bound the socket; the client burned
~3s of connect retries before surfacing "auto-spawned ldbd never
began accepting" with zero hint that the env var was the problem.

I5: `_looks_like_ldbd()` runs `<path> --version` with a 2s timeout
and checks the output contains the literal "ldbd". Rejected paths
get a clear "LDB_LDBD_SPAWN=... does not look like ldbd" line on
stderr at resolve-time; resolution then falls through to
`shutil.which("ldbd")` and the sibling-of-ldb heuristic. The
operator sees the actual failure mode 100ms in, not 3s in.
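
A sketch of what such a probe might look like. This is not the
project's `_looks_like_ldbd()`; the `needle` parameter and the choice
to search both stdout and stderr are assumptions made for
illustration.

```python
import subprocess
import sys


def looks_like_ldbd(path, needle="ldbd", timeout=2.0):
    """Run `<path> --version` and check the output mentions `needle`."""
    try:
        proc = subprocess.run(
            [path, "--version"],
            stdin=subprocess.DEVNULL,
            capture_output=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary, not executable, or hung past the timeout.
        return False
    return needle.encode() in (proc.stdout + proc.stderr)


if __name__ == "__main__":
    # The Python interpreter is a real executable that is not ldbd.
    print(looks_like_ldbd(sys.executable))
```

Comparing bytes avoids decode errors when a mistyped path points at a
binary that prints non-UTF-8 output.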

Coupled daemon change: `ldbd --version` now prints "ldbd <version>"
instead of just "<version>". The I5 probe greps for "ldbd" in the
output; without this the probe rejects the real daemon. Matches
`ldb-dap --version`'s convention and the `ldbd --help` first-line
format. No tests pinned the old bare-semver output.

Bundled cleanups:

- N1: `_autospawn_daemon`'s docstring claimed stderr was inherited
  from the parent process. Wrong since phase-2; the daemon's stderr
  goes to /dev/null by default and to `$LDB_LDBD_LOG_FILE` when set.
  Doc text now matches the code.
- N2: retry-loop comment said "200ms * 10 retries (~2s)" but the
  loop was `range(15)`. One-line factual fix to "200ms * 15
  retries (~3s)."
- N5: socket re-created inside the retry loop on each iteration.
  POSIX leaves a socket whose `connect()` failed in an unspecified
  state for further `connect()` calls; reusing it works on Linux
  and macOS today but is not guaranteed by the standard. Fresh socket
  per iteration is one extra syscall per retry and removes the corner
  case.
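
The N2 and N5 items together suggest a retry loop shaped roughly like
the sketch below. `connect_with_retry` and its error message are
hypothetical; only the 15 × 200ms budget and the fresh-socket-per-
iteration rule come from the notes above.

```python
import socket
import time


def connect_with_retry(path, attempts=15, delay=0.2):
    """Connect to a unix socket, retrying while the daemon starts up."""
    last_err = None
    for _ in range(attempts):
        # Fresh socket each time: after a failed connect() the old one
        # is in an unspecified state, so it is closed rather than reused.
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return sock
        except (ConnectionRefusedError, FileNotFoundError) as err:
            sock.close()
            last_err = err
            time.sleep(delay)
    raise ConnectionError(
        f"daemon never began accepting on {path}") from last_err
```

With the defaults this gives up after about 3 seconds, matching the
corrected comment.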

Test: `tests/smoke/test_socket_autospawn_validates_binary.py` pins
`$LDB_LDBD_SPAWN=/bin/echo`, strips $PATH down to python+coreutils
(no ldbd discoverable that way), runs `ldb --socket ... target.open`
from a temp CWD outside the repo. Asserts the CLI succeeds via
sibling fallback under 2.5s with the expected stderr diagnostic
mentioning `LDB_LDBD_SPAWN` and "does not look like ldbd."
TDD-verified red: pre-fix the test fails at 3.08s with the
"never began accepting" message — confirms it pins the regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The test docstring claimed it validates "concurrent dispatch" — the
review correctly flagged this as overstated. `dispatch_mu_` serialises
overlapping RPCs in phase-2, so two clients hitting the daemon at
the same time queue at the dispatcher. What the test actually pins:

  - Accept-level concurrency. Two unix-socket connections held open
    simultaneously. The pre-phase-2 single-client accept loop would
    block worker B's connect() until worker A disconnected; the
    barrier between target.open and module.list would deadlock.
  - Per-connection target_id state persistence. Each worker opens
    its own target, both succeed, both find their target_id still
    alive on the second RPC.

Docstring, in-test comment on the barrier, and success message all
rewritten to match. CMake test name kept as `smoke_socket_multiclient`:
it is accurate at the file level, and churning history for a
naming-only change isn't worth it. True per-connection dispatch
parallelism is a phase-3 item (per-target dispatcher sharding); listed
in `docs/35-field-report-followups.md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`docs/35-field-report-followups.md`:

- §2 phase-2 item 1 (Multi-subscriber notification sinks) rewritten
  to say "broadcast-to-all; per-target filtering happens at the
  client." Pre-fix the doc claimed "without cross-talk," which
  implied server-side target_id routing that doesn't exist in
  phase-2. The post-review C1 shared_ptr migration also recorded
  here so anyone reading the design doc sees the UAF fix in
  context.
- "Phase 3 — carried forward" gains a new bullet for target_id-
  aware notification routing (the server-side filtering that
  phase-2 ducked). The existing "per-target dispatcher sharding"
  bullet reworded to call out the dispatch-parallelism dimension
  specifically: today two clients on independent target_ids still
  queue at `dispatch_mu_`. SBAPI cancellation and worker-list
  reaping items were already in the list and unchanged.

`docs/WORKLOG.md`: new top entry summarising the phase-2 cleanup —
the four pre-existing commits (`2e6f4ed` C1, `bad8f90` I2,
`2978590` I3, `8c03765` I4+N3+N4) plus the new ones (`9397c03`
I5+N1+N2+N5 with `ldbd --version` companion change and the new
TDD-verified smoke test, `716689b` N6 test naming honesty,
this commit). Decisions, surprises, and the verification stanza
record the rationale for future-me / future agents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
I1: find_string_xrefs's prior signature took no provenance — every
ADRP-pair resolver diagnostic produced by the underlying xref_address
scans (adrp_pair_skipped, adrp_pair_writeback_cleared,
adrp_pair_cond_branch_recorded, adrp_pair_function_start_reset,
adrp_pair_unresolvable_load, warnings) was silently dropped when
an agent reached the resolver via string.xref instead of xref.addr.
The agent then couldn't see "the heuristic skipped N loads on this
binary" and had no signal to fall back to symbol-index correlate.

Thread an optional XrefProvenance* through find_string_xrefs.
Counters and warnings accumulate across every per-StringMatch
xref_address invocation; the dispatcher attaches the aggregate to
the string.xref response on the same emission policy as xref.addr
(only when something fired).

Phase-3 gate-7 warning emission moved to a baseline-delta scheme so
sharing one provenance across N xref_address calls doesn't produce
"skipped 0" duplicates — only the actual increment from each call
generates a warning string.
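
The baseline-delta scheme can be sketched like this. `run_scan` is a
hypothetical stand-in for one xref_address call, and the warning
wording is illustrative rather than the real diagnostic text.

```python
def run_scan(prov, newly_skipped):
    """One scan sharing `prov` with earlier scans.

    `newly_skipped` stands in for the loads this particular scan skips.
    """
    baseline = prov["adrp_pair_skipped"]      # counter value on entry
    prov["adrp_pair_skipped"] += newly_skipped
    delta = prov["adrp_pair_skipped"] - baseline
    if delta > 0:
        # Warn about this call's increment only, never the running total.
        prov["warnings"].append(f"skipped {delta} ADRP pairs")
    return delta
```

A call that skips nothing adds no warning, so N scans over one shared
provenance produce at most N warnings and no "skipped 0" entries.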

I2 (string.xref half): the dispatcher schema for string.xref now
documents the same five counters + warnings array as xref.addr,
each described as "aggregate across every underlying xref scan."
xref.addr's schema was updated in commit ced9f17 (C1+C2) with the
three phase-4-added counters (cond_branch_recorded under its renamed
form, function_start_reset, unresolvable_load).

Backend interface: virtual signature change ripples through the
GDB/MI stub (returns empty, no behaviour change) and every test
mock backend's override (8 test files updated).

New unit test pins that the threaded signature works against the
real fixture binary; the identical-result invariant holds whether
provenance is nullptr or supplied.

ctest 89/89.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…anup tail)

Bundle of the remaining items from the opus phase-4 review that the
two cleanup agents got partway through before hitting rate limits:

- I3: parse_last_hex_in_operands → lifted to xref_arm64_parsers as
  parse_branch_target. Picks the last comma-separated operand and
  parses hex from there, instead of "rightmost hex token in the
  whole operand string." Closes the tbz w0,#0x10,_far_label case
  where 0x10 (bit position) was being picked as a branch target.
- I4: function_starts insert lifted above the !adrp_regs.empty()
  guard so the hint is recorded even when no ADRP is currently
  tracked.
- I5: tests/smoke/test_xref_pcrel_literal.py comment now matches
  the fixture's actual assembly (a magic .quad rather than a
  pcrel_data reference); the test continues to validate the
  provenance counter bump.
- N1: xref_condbranch.s rewritten to actually reproduce the
  cross-function-cbz + fall-through-ADRP-ADD pattern that the
  ced9f17 fix closes. The new fixture FAILS against pre-cleanup
  master and passes here.
- N2: xref_stripped_fnleak.s comments updated to acknowledge that
  it exercises gate 1 (function_name_at) rather than gate 3
  (function_starts) on Apple silicon, where LLDB synthesises
  ___lldb_unnamed_symbol_<addr>. Phase-5 follow-up captured.
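
For the I3 item, a Python sketch of last-operand parsing. The real
`parse_branch_target` is C++; returning `None` for a symbolic label is
an assumption of this sketch, as is the operand text format.

```python
def parse_branch_target(operands):
    """Parse a branch target from the LAST comma-separated operand only."""
    last = operands.rsplit(",", 1)[-1].strip()
    token = last.split()[0] if last else ""
    token = token.lstrip("#")
    if not token.lower().startswith("0x"):
        return None  # symbolic label or empty operand: no numeric target
    try:
        return int(token, 16)
    except ValueError:
        return None
```

Because only the final operand is considered, the bit position in
`tbz w0, #0x10, <target>` can never be mistaken for the target.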

All 18 xref + chained-fixup tests pass. Build warning-clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§2 phase 2: multi-client socket + per-connection notification routing,
auto-spawn, in-flight RPC interruption via self-pipe, idle timeout,
daemon.shutdown RPC, recursive_mutex dispatch serialisation. Plus
post-review cleanup: shared_ptr-owned NotificationSinks (UAF fix),
worker shutdown gate, SO_SNDTIMEO, atomic stderr lines,
LDB_LDBD_SPAWN binary validation.
§6 phase 4: chained-fixup + ADRP coverage extension — conditional-branch
boundary, fat_arch_64 triple-aware slice selection, stripped-binary
function_starts backstop, PC-relative literal-load provenance, MOV
from XZR/WZR explicit, BindInfo schema (deferred imports walk), real
ARM64 C fixture. Plus post-review cleanup: cond-branch fall-through
correctness, same-fn cbz no-poison, clobber-by-default destination
register tracking (closes CSEL/LDP false-positive class), FAT picker
triple match honored, string.xref provenance plumbing, parser
hardening + adversarial fixture rewrites.

# Conflicts:
#	docs/WORKLOG.md
CI on Ubuntu / Linux x86-64 + Linux arm64 had been failing since
PR #20 merged. Two issues:

1. getpeereid() is BSD-only (also on macOS). glibc and musl don't
   ship it. Wrap the peer-cred retrieval in a #if __linux__ /
   else branch: on Linux, getsockopt(SO_PEERCRED) returns a
   struct ucred; on the BSDs, keep the existing getpeereid call.
   peer_gid is preserved on both branches for API parity with a
   single (void) cast to silence -Wunused-variable.

2. The two ::ftruncate(fd, 0) and ::pwrite(...) calls in
   acquire_lock are documented as best-effort (a failed pid stamp
   degrades the collision diagnostic but doesn't break exclusion).
   gcc's -Wunused-result, treated as an error in the warning-clean
   build, isn't silenced by a plain (void) cast — the standard
   workaround is `if (call() != 0) {}`. Use that.
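
For item 1, the Linux side of the peer-credential lookup can be
sketched in Python. This is an analogue of the C++ `#if __linux__`
branch, not the actual code; `peer_credentials` is a hypothetical
helper, and the `"iII"` layout assumes `struct ucred` is a signed pid
followed by unsigned uid and gid.

```python
import socket
import struct


def peer_credentials(sock):
    """Return (pid, uid, gid) of the connected peer, or None off Linux."""
    if not hasattr(socket, "SO_PEERCRED"):
        # BSD / macOS: the C++ code keeps its getpeereid() call instead.
        return None
    fmt = "iII"  # struct ucred { pid_t pid; uid_t uid; gid_t gid; }
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED,
                          struct.calcsize(fmt))
    return struct.unpack(fmt, raw)
```

On Linux this returns the credentials captured when the connection was
made; on platforms without `SO_PEERCRED` the sketch simply reports
`None`.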

98/98 ctest green on Darwin-arm64 post-fix. The Linux build path is
expected to compile cleanly via the new ifdef branch; this was checked
by tracing through the SO_PEERCRED path (standard on every Linux
since 2.6.17), not by a Linux build. Linux CI on merge will confirm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zachgenius zachgenius merged commit 285c77d into master May 16, 2026
0 of 4 checks passed
@zachgenius zachgenius deleted the release/phase-4 branch May 16, 2026 11:57
zachgenius added a commit that referenced this pull request May 16, 2026
Bundles PRs #20 + #21 — the full RE-engineer field report and its
phase-3/phase-4 hardening cycle. Original 6-item report is closed;
phase-5 work is enhancement scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>