v0.4: live-fork via userfaultfd write-protect

Status: IMPLEMENTED — the design described below is wired up end-to-end on the user surface (Phases 6 + 7, May 2026). REST mode: "live", CLI --live, Python / TypeScript / MCP SDKs, and forkd doctor capability checks all shipped via PRs #194#207. The vendored Firecracker dependency lives at deeplethe/firecracker:forkd-v0.4-mem-backend-shared-v1.12; upstream proposal is open (FIRECRACKER-UPSTREAM-PROPOSAL.md). Clean-parent bench (bench/live-fork-pause-window.md) still pending — Phase 6 E2E saw pause_ms = 41-48 ms, but on a parent with pre-baked guest Oopses contaminating the measurement.

The original DRAFT below is preserved verbatim as the architecture record; the implementation tracks it closely. Tracking issue: #101

Motivation

v0.3.4 BRANCH (diff snapshot) takes ~150–300 ms on ext4 + SSD, of which essentially all is a hard pause window — the source VM cannot execute guest code while memory.bin is being written. For an agent that does interactive inference, 150 ms straddles the perceptible-delay boundary. For an agent that BRANCHes often (speculative-execution patterns, live-rollout evaluation), it compounds: every branch point freezes the parent.

The pause is structural in v0.3.4. The daemon issues Firecracker's Snapshot.Create, which:

  1. Pauses the source VM (microseconds).
  2. Writes vmstate JSON (KB-scale, microseconds).
  3. Writes memory.bin (500 MiB+ for a typical Python+JIT parent, tens of milliseconds even on tmpfs, hundreds of milliseconds on ext4 — see bench/pause-window/PROBE-multi-branch-anomaly.md for the v0.3.4 fix story).
  4. Resumes the source VM.

Step 3 dominates. As long as memory.bin is written synchronously inside the pause, we can only optimize within the disk-write cost. v0.3.4 squeezed out the ext4 metadata penalty via posix_fallocate; that's about as far as the synchronous path can go.

Goal

Reduce the BRANCH pause window from ~150 ms to < 10 ms by removing the synchronous memory write entirely. The vCPU + device state dump still requires a pause (KVM_GET_REGS, KVM_GET_SREGS, virtio descriptor snapshotting, kvmclock fixup), but that's a few KB of state and tens of microseconds, not hundreds of milliseconds.

Stretch goal: pause < 1 ms.

Non-goals

  • Cross-host BRANCH (deferred to v0.5).
  • Non-Linux backends (libkrun port is its own multi-month effort).
  • Reducing child-spawn latency (already ~20 ms/child, not the bottleneck — children just mmap(MAP_PRIVATE) the snapshot).
  • Lazy-restore on the child side (children already inherit memory via CoW, the cost is in BRANCH not in spawn).

Proposed approach

Three building blocks:

1. memfd_create for source RAM

Replace the current file-backed guest memory mmap with anonymous memfd. This is necessary because UFFDIO_WRITEPROTECT is supported on anonymous and shmem-backed VMAs but not on arbitrary host-filesystem-backed mmaps. memfd is technically tmpfs-backed and qualifies. (Reference: kernel commit 1df319f0837c, "userfaultfd: wp: add WP support for shmem".)

Practically this is a swap of the backing in forkd-vmm's memory setup — the guest still sees a contiguous physical address space, the host backing just changes from a file to a memfd.

2. UFFDIO_WRITEPROTECT on the source memfd before BRANCH

Register a userfaultfd against the source's memory region, then issue UFFDIO_WRITEPROTECT over the full guest physical address space in one syscall. The source VM continues running. Any subsequent guest write to a still-WP'd page traps into the userspace handler before the write commits.

The WP-arming cost scales roughly linearly with the size of the VMA, since the kernel has to walk the page tables that cover it. On tested kernels (6.14, 5.7+) this is sub-millisecond for multi-GiB regions when THPs are split appropriately.
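As a rough illustration of what arming (and later un-protecting) looks like at the ioctl-argument level, the sketch below builds the request payloads in Rust. The struct layouts and mode bits mirror `struct uffdio_writeprotect` in `<linux/userfaultfd.h>`; the helper names `arm_request` and `unprotect_request` are hypothetical, not forkd's API, and issuing the actual `ioctl` (which needs a binding such as the `libc` crate) is deliberately left out.

```rust
/// Mirrors `struct uffdio_range` from <linux/userfaultfd.h>.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct UffdioRange {
    pub start: u64,
    pub len: u64,
}

/// Mirrors `struct uffdio_writeprotect` from <linux/userfaultfd.h>.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct UffdioWriteprotect {
    pub range: UffdioRange,
    pub mode: u64,
}

/// Set to write-protect the range; leave clear to remove protection.
pub const UFFDIO_WRITEPROTECT_MODE_WP: u64 = 1 << 0;
/// When removing protection, do not wake threads blocked on the range.
pub const UFFDIO_WRITEPROTECT_MODE_DONTWAKE: u64 = 1 << 1;

/// Hypothetical helper: one request covering the whole guest RAM mapping.
/// Returns None when the range is empty or not page-aligned, which the
/// kernel would reject anyway.
pub fn arm_request(base: u64, len: u64, page_size: u64) -> Option<UffdioWriteprotect> {
    if !page_size.is_power_of_two() || len == 0 {
        return None;
    }
    if base % page_size != 0 || len % page_size != 0 {
        return None;
    }
    Some(UffdioWriteprotect {
        range: UffdioRange { start: base, len },
        mode: UFFDIO_WRITEPROTECT_MODE_WP,
    })
}

/// Hypothetical helper: request that clears WP on the single page that
/// contains `fault_addr` (mode 0, so blocked writers are woken).
/// Assumes `page_size` is a power of two.
pub fn unprotect_request(fault_addr: u64, page_size: u64) -> UffdioWriteprotect {
    let start = fault_addr & !(page_size - 1);
    UffdioWriteprotect {
        range: UffdioRange { start, len: page_size },
        mode: 0,
    }
}
```

Keeping the arm step to a single request over the whole region matches the "one syscall" framing above; the per-page request is what a fault handler would issue after saving the page.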

3. Async dirty-page copier

A handler thread polls the uffd file descriptor. For each WP fault:

1. Read the page out of the source memfd at (faulting_addr - base).
2. Append the page (with its offset) to the in-flight snapshot file.
3. Clear the WP bit for that page (UFFDIO_WRITEPROTECT with mode=0).
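4. Wake the faulting thread (UFFDIO_WAKE). This is only needed if step 3 passed UFFDIO_WRITEPROTECT_MODE_DONTWAKE; otherwise clearing the WP bit wakes the waiter itself.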
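Steps 1 and 2 can be sketched with the standard library alone. This is a minimal sketch under two assumptions that the draft does not fix: the source RAM is visible to the handler as a byte slice, and the snapshot stores each page at the same offset it has in guest RAM (a sparse layout). The names `page_offset` and `save_faulting_page` are hypothetical.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

pub const PAGE_SIZE: u64 = 4096;

/// Offset of the page containing `fault_addr`, relative to the region base.
/// Returns None if the fault address lies below the base.
pub fn page_offset(fault_addr: u64, base: u64) -> Option<u64> {
    let delta = fault_addr.checked_sub(base)?;
    Some(delta & !(PAGE_SIZE - 1))
}

/// Copies the still-unmodified page out of `ram` into `snapshot` at the
/// same offset, and returns that offset. The caller would then clear WP
/// for the page so the blocked guest write can proceed.
pub fn save_faulting_page(
    ram: &[u8],
    snapshot: &File,
    fault_addr: u64,
    base: u64,
) -> io::Result<u64> {
    let bad = |msg: &'static str| io::Error::new(io::ErrorKind::InvalidInput, msg);
    let off = page_offset(fault_addr, base).ok_or_else(|| bad("fault below region base"))?;
    let start = usize::try_from(off).map_err(|_| bad("offset does not fit usize"))?;
    let end = start
        .checked_add(PAGE_SIZE as usize)
        .ok_or_else(|| bad("offset overflow"))?;
    let page = ram.get(start..end).ok_or_else(|| bad("fault past region end"))?;
    // Positional write: no shared file cursor, so concurrent writers
    // touching different pages do not interfere with each other.
    snapshot.write_all_at(page, off)?;
    Ok(off)
}
```

Using a positional write rather than a seek plus write is a design choice for this sketch: it lets the fault handler and a bulk copier share one file descriptor without coordinating a cursor.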

In parallel, a bulk copier reads still-clean pages from the source memfd directly (no faulting involved, the memfd is just memory) and writes them to the snapshot file. The two flows coordinate through a per-page state map (clean / dirty-copying / final) so each page is written exactly once.

The snapshot file is therefore complete some time after the BRANCH pause exits, but it represents the consistent point-in-time view from the moment WP was armed.
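The per-page state map that lets the fault handler and the bulk copier write each page exactly once can be modelled with one atomic per page. The sketch below is an assumption about how such a map might look, not the forkd implementation; the type names and the three-way `Claim` result are hypothetical.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const CLEAN: u8 = 0; // not yet written to the snapshot
const COPYING: u8 = 1; // someone is writing it right now
const FINAL: u8 = 2; // safely in the snapshot

/// Outcome of trying to take ownership of a page.
#[derive(Debug, PartialEq, Eq)]
pub enum Claim {
    /// Caller owns the page and must copy it, then call `finish`.
    Won,
    /// Another thread is mid-copy; a fault handler must wait for
    /// `finish` before clearing WP on this page.
    InFlight,
    /// Already in the snapshot; nothing to copy.
    Done,
}

pub struct PageStates {
    states: Vec<AtomicU8>,
}

impl PageStates {
    pub fn new(pages: usize) -> Self {
        Self {
            states: (0..pages).map(|_| AtomicU8::new(CLEAN)).collect(),
        }
    }

    /// Atomically moves CLEAN -> COPYING; exactly one caller wins per page.
    pub fn claim(&self, page: usize) -> Claim {
        match self.states[page].compare_exchange(
            CLEAN,
            COPYING,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => Claim::Won,
            Err(COPYING) => Claim::InFlight,
            Err(_) => Claim::Done,
        }
    }

    /// Marks a page as written; only the thread that won the claim calls this.
    pub fn finish(&self, page: usize) {
        self.states[page].store(FINAL, Ordering::Release);
    }

    /// True once every page is in the snapshot, i.e. the file is complete.
    pub fn all_final(&self) -> bool {
        self.states.iter().all(|s| s.load(Ordering::Acquire) == FINAL)
    }
}
```

The `InFlight` case matters for correctness: if the bulk copier is halfway through reading a page when the guest faults on it, the handler must not release the write until that copy has finished, or the snapshot could capture a torn page.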

What the pause window contains

After the changes above, the BRANCH critical section reduces to:

  • vCPU dump: KVM_GET_REGS + KVM_GET_SREGS + a few model-specific registers, microseconds.
  • Device state dump: virtio descriptor heads, MMIO state, microseconds.
  • WP arming: UFFDIO_WRITEPROTECT over the whole RAM region, target sub-millisecond.
  • kvmclock + TSC offset snapshot for guest time continuity, microseconds.

Total: well under 10 ms, and most of it independent of guest RAM size.

Alternatives considered

A) Status quo: pause-based snapshot

What we have today. Simple, robust, well-understood. Cost: ~150 ms pause per BRANCH on ext4 + SSD. Becomes prohibitive when BRANCHing >1/s, which is exactly the speculative-execution pattern this project exists to enable.

B) Pre-copy (à la live migration)

Iteratively dirty-track pages via KVM_GET_DIRTY_LOG and copy them in rounds while the source keeps running, ending with a small "stop and copy" final pass. This is the standard cross-host VM migration design (Clark et al. NSDI 2005).

Downsides for our use case:

  • KVM_GET_DIRTY_LOG requires KVM_MEM_LOG_DIRTY_PAGES to be set on memslots, which has its own per-KVM_RUN overhead.
  • The "convergence" problem: if the guest's dirty rate exceeds copy bandwidth, pre-copy never finishes. Some agent workloads (memset-heavy initialization, large allocations during training) hit this regime.
  • More implementation surface than uffd_wp.
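The convergence problem can be made concrete with a toy model: each round copies everything currently dirty, and while it does so the guest dirties more memory at a fixed rate. This is a simplified sketch (constant dirty rate, constant copy bandwidth, hypothetical function name), not a model of any particular hypervisor.

```rust
/// Returns the number of pre-copy rounds needed before the remaining dirty
/// set is small enough for the final stop-and-copy, or None if that does
/// not happen within `max_rounds`.
///
/// All sizes are in bytes, rates in bytes per second.
pub fn precopy_rounds(
    ram_bytes: f64,
    dirty_rate: f64,
    copy_bw: f64,
    stop_bytes: f64,
    max_rounds: u32,
) -> Option<u32> {
    let mut to_copy = ram_bytes; // round 0 has to copy all of RAM
    for round in 0..max_rounds {
        if to_copy <= stop_bytes {
            return Some(round);
        }
        let secs = to_copy / copy_bw; // how long this round takes
        // Memory dirtied meanwhile becomes the next round's work; it can
        // never exceed the size of RAM itself.
        to_copy = (dirty_rate * secs).min(ram_bytes);
    }
    None
}
```

In this model the dirty set shrinks by a factor of `dirty_rate / copy_bw` per round, so with 1 GiB of RAM, 100 MiB/s of dirtying, 1 GiB/s of copy bandwidth and a 1 MiB stop threshold it converges after three rounds, while any dirty rate at or above the copy bandwidth never converges.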

C) Full memcpy-out-then-snapshot

Pause briefly, memcpy() the entire guest RAM into a second buffer, resume the guest, then async-write the buffer to disk. Pause cost: memcpy time, roughly 5 ms/GiB on modern DDR. Memory cost: 2× peak RAM usage.

The 2× RAM cost is a dealbreaker for the AI fan-out use case, where parent VMs are routinely 4-8 GiB and the host already runs many of them.

D) Block-device CoW (LVM, dm-snapshot, btrfs reflink)

Snapshot the underlying block device, not the RAM. Doesn't apply: guest RAM lives in memfd/file mappings, not on a block device. The disk-backed virtio-blk content could be CoW'd this way, but that's a separate problem from RAM snapshots.

uffd_wp is the right choice because it's the only mechanism that gives us per-page lazy copy with no pause for clean pages and no second memory buffer.

Open questions

These are genuine unknowns. Reach out via issue if you have experience here:

  1. Behavior of UFFD_WP on memfd-backed VMAs under KVM_RUN. Are there any KVM paths that bypass userspace faulting and access guest memory directly (e.g., for MMIO emulation, virtio descriptor walking, kvmclock updates from the host side)? If so, do those paths get UFFD_WP write-faults, or do they silently violate the WP invariant? My current reading of kvm_main.c is that gfn_to_hva_* paths do go through the WP, but I haven't verified empirically.

  2. Interaction with transparent hugepages. If the source memfd is backed by THPs, UFFD_WP works at the 4 KiB level — does the kernel split the hugepage on the first WP-fault, or does it WP the whole 2 MiB region? Splitting on each fault could be expensive for sparse-write workloads. May need to disable THP for source VMAs explicitly.

  3. vCPU dirty-bitmap vs uffd_wp. KVM tracks its own dirty pages via KVM_GET_DIRTY_LOG. Is there value in combining both (e.g., pre-write the KVM-dirty subset eagerly, then arm WP only on the clean remainder) or does uffd_wp on the whole region subsume it? The combined approach saves faults for the hottest pages but doubles the bookkeeping.

  4. Snapshot file format compatibility. v0.3.4's snapshot is vmstate JSON + memory.bin (contiguous raw 4 KiB pages). v0.4 needs either (a) sparse memory.bin with page offsets, or (b) a chunked/segmented memory.bin format. Leaning (a): a sparse file keeps every page at its contiguous offset, which is what stock Firecracker's restore expects, whereas (b) breaks restore compatibility.

  5. Children spawned mid-BRANCH. A child could in principle start mmap'ing the snapshot file before all dirty pages have been flushed, since the parent's pre-BRANCH state is consistent the moment WP is armed. Implementation requires the snapshot reader to block on in-flight pages with proper synchronization. Out of scope for v0.4 first cut, but a fast follow.

Implementation phases

Phase 1: standalone PoC (Week 1-2)

A separate Rust binary, not yet integrated with forkd. Allocates a 1 GiB memfd, populates with patterns, registers uffd, arms WP, forks a writer process that randomly writes the memfd, captures faults, copies dirty pages to a snapshot file, validates that the snapshot is a consistent point-in-time view. Goal: prove the kernel mechanics work as expected outside the KVM context.

Phase 2: integrate into forkd-uffd crate (Week 3-4)

Extend the existing crates/forkd-uffd/ (currently used for restore-side lazy paging) with a snapshot-side WP path. Plumb the new flow through forkd-controller::branch_sandbox. Add a --live-fork feature flag (default off) so the v0.3.4 pause-based path remains available during stabilization.

Phase 3: pause-window benchmarking (Week 5)

Reproduce the v0.3.4 multi-BRANCH sweep (bench/pause-window/sweep-diff.sh) but with --live-fork. Target: pause < 10 ms across all 10 consecutive BRANCHes. Compare distribution, not just mean — the v0.3.4 fix was a story about tail behavior.
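To show why the distribution matters more than the mean, here is a small sketch of a nearest-rank percentile over pause samples. The function names are hypothetical and the sample values in the note that follows are made up for illustration; they are not measurements.

```rust
/// Nearest-rank percentile of `samples_ms` for `p` in 0..=100.
/// Returns None for an empty slice or an out-of-range `p`.
pub fn percentile(samples_ms: &[f64], p: f64) -> Option<f64> {
    if samples_ms.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples_ms.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest-rank: smallest value with at least p% of samples at or below it.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// Arithmetic mean, or None for an empty slice.
pub fn mean(samples_ms: &[f64]) -> Option<f64> {
    if samples_ms.is_empty() {
        return None;
    }
    Some(samples_ms.iter().sum::<f64>() / samples_ms.len() as f64)
}

/// The Phase 3 target as stated: every BRANCH in the sweep under the limit.
pub fn all_within_target(samples_ms: &[f64], target_ms: f64) -> bool {
    samples_ms.iter().all(|&s| s < target_ms)
}
```

For an invented sweep of ten pauses such as 3, 4, 3, 5, 4, 3, 4, 3, 4 and 48 ms, the mean is below 10 ms and the median is 4 ms, yet the sweep fails the target because of the single 48 ms outlier, which only the high percentiles expose.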

Phase 4: hardening (Week 6-7)

Edge cases to specifically test:

  • Write-heavy guest (stress-ng --vm 1 --vm-bytes 90% running inside).
  • NUMA cross-node guest RAM (force memfd allocations across nodes).
  • Concurrent BRANCHes on different parents (shared uffd handler thread pool? Or one handler per BRANCH?).
  • Kernel < 5.7 (no UFFD_WP) — graceful detection + fallback to v0.3.4 pause-based path.
  • THP enabled/disabled.
  • Memory pressure during BRANCH (host actively swapping).
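For the kernel-version edge case, a first-pass check can parse the release string (as found in `/proc/sys/kernel/osrelease`) and compare it with 5.7. This is a heuristic sketch with hypothetical function names; distribution kernels may backport features, so negotiating features on the userfaultfd itself via `UFFDIO_API` remains the authoritative test.

```rust
/// Extracts (major, minor) from a release string such as
/// "5.4.0-42-generic" or "6.14.0-15-generic".
pub fn parse_kernel_release(release: &str) -> Option<(u32, u32)> {
    let mut parts = release.trim().split('.');
    let major = parts.next()?.parse().ok()?;
    // The minor field may carry a suffix (e.g. "10-rc1"); keep leading digits.
    let minor: String = parts
        .next()?
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    Some((major, minor.parse().ok()?))
}

/// True if the release is at least 5.7, the minimum the draft names for
/// UFFD_WP. Unparseable strings are treated as unsupported so the caller
/// falls back to the pause-based path.
pub fn kernel_may_support_uffd_wp(release: &str) -> bool {
    matches!(parse_kernel_release(release), Some(v) if v >= (5, 7))
}
```

Treating an unparseable release as "unsupported" is a conservative choice for this sketch: the worst outcome is a slower but known-good BRANCH.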

Phase 5: launch (Week 8)

  • Switch --live-fork to default-on after a stabilization pass.
  • Write up the implementation as a post-mortem-style article (same cadence as the v0.3.4 ext4 story).
  • Ship v0.4.
  • File any upstream kernel/Firecracker issues discovered along the way.

Risks

  • Kernel < 5.7 doesn't have UFFDIO_WRITEPROTECT. Mitigation: detect at startup, fall back to v0.3.4 path, document minimum supported kernel. Ubuntu 20.04 LTS has 5.4 — that's a real deployment hit. Possible workaround: backport detection so 5.4 users transparently get v0.3.4 behavior.

  • Write-fault storms. A guest scribbling all of RAM during BRANCH generates one fault per page. At 4 KiB pages × 1 GiB RAM that's 262,144 faults. Each fault costs microseconds of kernel + userspace work, so draining them all takes on the order of 1 s, which is worse than the v0.3.4 pause for this pathological case. Mitigation: measure, document the regime, add a "give up, fall back to pause" escape hatch when fault rate exceeds threshold.

  • Snapshot consistency under uffd_wp ordering. Need careful proof that the snapshot represents a consistent point-in-time even with async page copying. Plan: write a model + property test using loom or similar to fuzz the page-state machine.

  • Restore-time regression. The new snapshot format (if it ends up different from v0.3.4) might restore slower. Need to bench both paths under the same workload before declaring v0.4 a win end-to-end.
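The write-fault-storm estimate above is simple arithmetic, sketched here together with one possible shape for the escape hatch. The per-fault cost of 4 µs used in the note below and the rate threshold are assumptions chosen to match the "~1 s" figure, not measurements, and the function names are hypothetical.

```rust
use std::time::Duration;

/// Number of WP faults if the guest writes every page once.
/// Returns None when `page_size` is zero.
pub fn fault_count(ram_bytes: u64, page_size: u64) -> Option<u64> {
    ram_bytes.checked_div(page_size)
}

/// Upper bound on drain time if faults are handled one at a time.
pub fn worst_case_drain(faults: u64, per_fault: Duration) -> Duration {
    let n = u32::try_from(faults).unwrap_or(u32::MAX);
    per_fault.saturating_mul(n)
}

/// Escape-hatch predicate: true when the observed fault rate over `window`
/// exceeds `max_faults_per_sec`, signalling a fallback to the pause path.
pub fn should_fall_back(faults_in_window: u64, window: Duration, max_faults_per_sec: f64) -> bool {
    let secs = window.as_secs_f64();
    secs > 0.0 && (faults_in_window as f64 / secs) > max_faults_per_sec
}
```

With 1 GiB of RAM and 4 KiB pages this gives 262,144 faults, and at an assumed 4 µs per fault the serial drain bound is about 1.05 s, consistent with the estimate in the list above.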

References