Status: IMPLEMENTED — the design described below is wired up end-to-end
on the user surface (Phases 6 + 7, May 2026). REST mode: "live", CLI
--live, Python / TypeScript / MCP SDKs, and forkd doctor capability
checks all shipped via PRs
#194–#207.
The vendored Firecracker dependency lives at
deeplethe/firecracker:forkd-v0.4-mem-backend-shared-v1.12;
upstream proposal is open
(FIRECRACKER-UPSTREAM-PROPOSAL.md).
Clean-parent bench (bench/live-fork-pause-window.md) still pending —
Phase 6 E2E saw pause_ms = 41-48 ms, but on a parent with pre-baked
guest Oopses contaminating the measurement.
The original DRAFT below is preserved verbatim as the architecture record; the implementation tracks it closely. Tracking issue: #101
v0.3.4 BRANCH (diff snapshot) takes ~150–300 ms on ext4 + SSD, of
which essentially all is a hard pause window — the source VM cannot
execute guest code while memory.bin is being written. For an agent
that does interactive inference, 150 ms straddles the perceptible-delay
boundary. For an agent that BRANCHes often (speculative-execution
patterns, live-rollout evaluation), it compounds: every branch point
freezes the parent.
The pause is structural in v0.3.4. The daemon issues Firecracker's
Snapshot.Create, which:
1. Pauses the source VM (microseconds).
2. Writes vmstate JSON (KB-scale, microseconds).
3. Writes memory.bin (500 MiB+ for a typical Python+JIT parent; tens of milliseconds even on tmpfs, hundreds of milliseconds on ext4; see bench/pause-window/PROBE-multi-branch-anomaly.md for the v0.3.4 fix story).
4. Resumes the source VM.
Step 3 dominates. As long as memory.bin is written synchronously
inside the pause, we can only optimize within the disk-write cost.
v0.3.4 squeezed out the ext4 metadata penalty via posix_fallocate;
that's about as far as the synchronous path can go.
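As a back-of-the-envelope illustration of why the synchronous path is bounded by write bandwidth, the sketch below models the pause as memory size divided by effective write bandwidth. The function name and the 2 GiB/s figure used in the usage note are assumptions for illustration, not measured forkd values.

```rust
/// Rough model: the pause lasts as long as it takes to write memory.bin.
/// `write_bw_bytes_per_sec` is an assumed effective bandwidth, not a measurement.
fn sync_pause_ms(mem_bytes: u64, write_bw_bytes_per_sec: u64) -> f64 {
    (mem_bytes as f64 / write_bw_bytes_per_sec as f64) * 1000.0
}
```

With a 512 MiB parent and an assumed 2 GiB/s of effective write bandwidth, `sync_pause_ms(512 << 20, 2 << 30)` gives 250 ms, which sits inside the ~150–300 ms range quoted above. The pause grows linearly with guest RAM, which is the property the rest of this design removes.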
Reduce the BRANCH pause window from ~150 ms to < 10 ms by removing the synchronous memory write entirely. The vCPU + device state dump still requires a pause (KVM_GET_REGS, KVM_GET_SREGS, virtio descriptor snapshotting, kvmclock fixup), but that's a few KB of state and tens of microseconds, not hundreds of milliseconds.
Stretch goal: pause < 1 ms.
- Cross-host BRANCH (deferred to v0.5).
- Non-Linux backends (libkrun port is its own multi-month effort).
- Reducing child-spawn latency (already ~20 ms/child, not the bottleneck; children just mmap(MAP_PRIVATE) the snapshot).
- Lazy-restore on the child side (children already inherit memory via CoW; the cost is in BRANCH, not in spawn).
Three building blocks:
Replace the current file-backed guest memory mmap with an anonymous memfd.
This is necessary because UFFDIO_WRITEPROTECT is supported on
anonymous VMAs (since Linux 5.7) and on shmem-backed VMAs (since Linux
5.19, advertised as UFFD_FEATURE_WP_HUGETLBFS_SHMEM), but not on arbitrary
host-filesystem-backed mmaps. A memfd is shmem (tmpfs) backed and
therefore qualifies. (Reference: kernel commit 1df319f0837c, "userfaultfd: wp:
add WP support for shmem".)
Practically this is a swap of the backing in forkd-vmm's memory
setup — the guest still sees a contiguous physical address space, the
host backing just changes from a file to a memfd.
Register a userfaultfd against the source's memory region, then
issue UFFDIO_WRITEPROTECT over the full guest physical address space
in one syscall. The source VM continues running. Any subsequent guest
write to a still-WP'd page traps into the userspace handler before
the write commits.
The WP-arming cost scales with the number of page-table entries covered, roughly O(VMA size / page size). On tested kernels (6.14, 5.7+) this is sub-millisecond for multi-GiB regions when THPs are split appropriately.
A handler thread polls the uffd file descriptor. For each WP fault:
1. Read the page out of the source memfd at (faulting_addr - base).
2. Append the page (with its offset) to the in-flight snapshot file.
3. Clear the WP bit for that page (UFFDIO_WRITEPROTECT with mode=0).
4. Wake the faulting thread (UFFDIO_WAKE). This is only needed if step 3 passed UFFDIO_WRITEPROTECT_MODE_DONTWAKE; otherwise clearing WP already wakes the thread.
In parallel, a bulk copier reads still-clean pages from the source memfd directly (no faulting involved, the memfd is just memory) and writes them to the snapshot file. The two flows coordinate through a per-page state map (clean / dirty-copying / final) so each page is written exactly once.
The snapshot file is therefore complete some time after the BRANCH pause exits, but it represents the consistent point-in-time view from the moment WP was armed.
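The coordination between the fault handler and the bulk copier can be sketched as a per-page atomic state machine. This is a minimal model, not the forkd implementation: `PageMap`, `try_claim`, `finish`, `wait_final`, and `simulate` are hypothetical names, and a counter stands in for the actual write to the snapshot file.

```rust
use std::sync::atomic::{AtomicU32, AtomicU8, Ordering};
use std::thread;

const CLEAN: u8 = 0; // not yet in the snapshot
const COPYING: u8 = 1; // one flow owns the page and is writing it
const FINAL: u8 = 2; // page is in the snapshot; WP may be cleared

/// Hypothetical per-page state map shared by the fault handler and bulk copier.
struct PageMap {
    states: Vec<AtomicU8>,
}

impl PageMap {
    fn new(pages: usize) -> Self {
        Self { states: (0..pages).map(|_| AtomicU8::new(CLEAN)).collect() }
    }

    /// Returns true if the caller won ownership and must write the page.
    fn try_claim(&self, page: usize) -> bool {
        self.states[page]
            .compare_exchange(CLEAN, COPYING, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    /// Marks the page as safely stored in the snapshot.
    fn finish(&self, page: usize) {
        self.states[page].store(FINAL, Ordering::Release);
    }

    /// A flow that loses the race waits until the winner has stored the page,
    /// because WP must not be cleared before the page is FINAL.
    fn wait_final(&self, page: usize) {
        while self.states[page].load(Ordering::Acquire) != FINAL {
            thread::yield_now();
        }
    }
}

/// Runs `workers` threads that all try to copy every page and returns how
/// many times each page was "written" (the counter stands in for file I/O).
fn simulate(pages: usize, workers: usize) -> Vec<u32> {
    let map = PageMap::new(pages);
    let writes: Vec<AtomicU32> = (0..pages).map(|_| AtomicU32::new(0)).collect();
    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| {
                for page in 0..pages {
                    if map.try_claim(page) {
                        writes[page].fetch_add(1, Ordering::Relaxed);
                        map.finish(page);
                    } else {
                        map.wait_final(page);
                    }
                }
            });
        }
    });
    writes.into_iter().map(|w| w.into_inner()).collect()
}
```

The compare-and-exchange from CLEAN to COPYING is what gives the exactly-once property: whichever flow loses the race never writes the page, it only waits for FINAL before letting the guest write proceed.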
After the changes above, the BRANCH critical section reduces to:
- vCPU dump: KVM_GET_REGS + KVM_GET_SREGS + a few model-specific registers, microseconds.
- Device state dump: virtio descriptor heads, MMIO state, microseconds.
- WP arming: UFFDIO_WRITEPROTECT over the whole RAM region, target sub-millisecond.
- kvmclock + TSC offset snapshot for guest time continuity, microseconds.
Total: well under 10 ms, and most of it independent of guest RAM size.
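One way to keep this budget honest during benchmarking is to record each component separately and check the sum against the target. The sketch below is illustrative only: `PauseBudget` is a hypothetical type, and the durations in `example_budget` are placeholders chosen to match the orders of magnitude listed above, not measurements.

```rust
use std::time::Duration;

/// Hypothetical breakdown of the BRANCH critical section.
struct PauseBudget {
    vcpu_dump: Duration,
    device_dump: Duration,
    wp_arming: Duration,
    clock_snapshot: Duration,
}

impl PauseBudget {
    /// Sum of all components that run while the source VM is paused.
    fn total(&self) -> Duration {
        self.vcpu_dump + self.device_dump + self.wp_arming + self.clock_snapshot
    }

    /// True if the whole critical section fits under `target`.
    fn within(&self, target: Duration) -> bool {
        self.total() < target
    }
}

/// Placeholder numbers only; real values must come from the bench.
fn example_budget() -> PauseBudget {
    PauseBudget {
        vcpu_dump: Duration::from_micros(50),
        device_dump: Duration::from_micros(50),
        wp_arming: Duration::from_micros(800),
        clock_snapshot: Duration::from_micros(10),
    }
}
```

Under these placeholder values the total is 910 µs, so the budget passes the 10 ms goal but would only just meet the 1 ms stretch goal. In this breakdown WP arming is the only term expected to grow with guest RAM size.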
What we have today. Simple, robust, well-understood. Cost: ~150 ms pause per BRANCH on ext4 + SSD. This becomes prohibitive when BRANCHing at 1/s or faster, which is exactly the speculative-execution pattern this project exists to enable.
Iteratively dirty-track pages via KVM_GET_DIRTY_LOG and copy them in
rounds while the source keeps running, ending with a small "stop and
copy" final pass. This is the standard cross-host VM migration design
(Clark et al. NSDI 2005).
Downsides for our use case:
- KVM_GET_DIRTY_LOG requires KVM_MEM_LOG_DIRTY_PAGES to be set on memslots, which has its own per-KVM_RUN overhead.
- The "convergence" problem: if the guest's dirty rate exceeds copy bandwidth, pre-copy never finishes. Some agent workloads (memset-heavy initialization, large allocations during training) hit this regime.
- More implementation surface than uffd_wp.
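The convergence problem can be made concrete with a simplified model of iterative pre-copy: each round copies the current dirty set, and the guest dirties new pages at a constant rate while that copy runs. The function name and all rates below are assumptions for illustration; real dirty rates are bursty rather than constant.

```rust
const MIB: f64 = 1024.0 * 1024.0;
const GIB: f64 = 1024.0 * MIB;

/// Returns how many pre-copy rounds run before the remaining dirty set fits
/// under `stop_threshold` bytes, or None if that never happens within
/// `max_rounds`. Rates are in bytes per second.
fn precopy_rounds(
    mem_bytes: f64,
    dirty_rate: f64,
    copy_bw: f64,
    stop_threshold: f64,
    max_rounds: u32,
) -> Option<u32> {
    let mut remaining = mem_bytes;
    for round in 0..=max_rounds {
        if remaining <= stop_threshold {
            return Some(round);
        }
        // While this round copies `remaining` bytes, the guest keeps dirtying.
        let copy_secs = remaining / copy_bw;
        remaining = (dirty_rate * copy_secs).min(mem_bytes);
    }
    None
}
```

With 1 GiB of RAM, 1 GiB/s of copy bandwidth, and a 16 MiB stop threshold, a guest dirtying 0.5 GiB/s halves the remaining set each round and converges after 6 rounds. A guest dirtying 2 GiB/s never shrinks the set, so the function returns `None`; that is the regime described in the second bullet.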
Pause briefly, memcpy() the entire guest RAM into a second buffer,
resume the guest, then async-write the buffer to disk. Pause cost:
memcpy time, roughly 5 ms/GiB on modern DDR. Memory cost: 2× peak
RAM usage.
The 2× RAM cost is a dealbreaker for the AI fan-out use case, where parent VMs are routinely 4-8 GiB and the host already runs many of them.
Snapshot the underlying block device, not the RAM. Doesn't apply: guest RAM lives in memfd/file mappings, not on a block device. The disk-backed virtio-blk content could be CoW'd this way, but that's a separate problem from RAM snapshots.
uffd_wp is the right choice because it's the only mechanism that gives us per-page lazy copy with no pause for clean pages and no second memory buffer.
These are genuine unknowns. Reach out via issue if you have experience here:
- Behavior of UFFD_WP on memfd-backed VMAs under KVM_RUN. Are there any KVM paths that bypass userspace faulting and access guest memory directly (e.g., for MMIO emulation, virtio descriptor walking, kvmclock updates from the host side)? If so, do those paths get UFFD_WP write-faults, or do they silently violate the WP invariant? My current reading of kvm_main.c is that gfn_to_hva_* paths do go through the WP, but I haven't verified empirically.
- Interaction with transparent hugepages. If the source memfd is backed by THPs, UFFD_WP works at the 4 KiB level: does the kernel split the hugepage on the first WP-fault, or does it WP the whole 2 MiB region? Splitting on each fault could be expensive for sparse-write workloads. May need to disable THP for source VMAs explicitly.
- vCPU dirty-bitmap vs uffd_wp. KVM tracks its own dirty pages via KVM_GET_DIRTY_LOG. Is there value in combining both (e.g., pre-write the KVM-dirty subset eagerly, then arm WP only on the clean remainder) or does uffd_wp on the whole region subsume it? The combined approach saves faults for the hottest pages but doubles the bookkeeping.
- Snapshot file format compatibility. v0.3.4's snapshot is vmstate JSON + memory.bin (contiguous raw 4 KiB pages). v0.4 needs either (a) sparse memory.bin with page offsets, or (b) a chunked/segmented memory.bin format. Leaning (a) since stock Firecracker's restore expects contiguous; (b) breaks restore compatibility.
- Children spawned mid-BRANCH. A child could in principle start mmap'ing the snapshot file before all dirty pages have been flushed, since the parent's pre-BRANCH state is consistent the moment WP is armed. Implementation requires the snapshot reader to block on in-flight pages with proper synchronization. Out of scope for v0.4 first cut, but a fast follow.
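For the file-format question, one reading of option (a) is to size memory.bin up front and write each page at its guest-physical offset in whatever order the fault handler and bulk copier produce it, so the finished file keeps the contiguous layout that restore expects. The sketch below assumes that reading; the helper names are hypothetical and only standard-library positional I/O is used.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::os::unix::fs::FileExt;
use std::path::Path;

const PAGE_SIZE: usize = 4096;

/// Creates a memory.bin-style file sized for `pages` pages.
/// Ranges that are never written read back as zeros.
fn create_snapshot(path: &Path, pages: usize) -> io::Result<File> {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(true)
        .open(path)?;
    file.set_len((pages * PAGE_SIZE) as u64)?;
    Ok(file)
}

/// Writes one page at its guest-physical offset, independent of arrival order.
fn write_page(file: &File, page_index: usize, data: &[u8; PAGE_SIZE]) -> io::Result<()> {
    file.write_all_at(data, (page_index * PAGE_SIZE) as u64)
}

/// Reads one page back from its guest-physical offset.
fn read_page(file: &File, page_index: usize) -> io::Result<[u8; PAGE_SIZE]> {
    let mut buf = [0u8; PAGE_SIZE];
    file.read_exact_at(&mut buf, (page_index * PAGE_SIZE) as u64)?;
    Ok(buf)
}
```

Positional writes need no shared file cursor, so the two flows can write concurrently through the same file handle without seeking. Whether the on-disk file ends up sparse depends on the filesystem and on how many pages are actually written.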
A separate Rust binary, not yet integrated with forkd. Allocates a 1 GiB memfd, populates with patterns, registers uffd, arms WP, forks a writer process that randomly writes the memfd, captures faults, copies dirty pages to a snapshot file, validates that the snapshot is a consistent point-in-time view. Goal: prove the kernel mechanics work as expected outside the KVM context.
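Before involving the kernel at all, the consistency property the prototype validates can be checked against a single-threaded model. In the sketch below, `mem` plays the memfd, `snapshot` plays the snapshot file, and a guest write to a not-yet-saved page first copies the old contents, which is what the WP fault handler does. All names are hypothetical and no userfaultfd calls are made.

```rust
const PAGE: usize = 4096;

/// Single-threaded model of copy-before-write snapshotting.
/// `copied[p]` means page p is saved and its (simulated) WP bit is cleared.
struct LiveSnapshot {
    mem: Vec<u8>,
    snapshot: Vec<u8>,
    copied: Vec<bool>,
}

impl LiveSnapshot {
    /// "Arms WP": from here on every page must be saved before it changes.
    /// Assumes `mem.len()` is a multiple of PAGE.
    fn arm(mem: Vec<u8>) -> Self {
        let pages = mem.len() / PAGE;
        Self { snapshot: vec![0; mem.len()], copied: vec![false; pages], mem }
    }

    /// Saves a page into the snapshot unless it has been saved already.
    fn save_page(&mut self, page: usize) {
        if !self.copied[page] {
            let range = page * PAGE..(page + 1) * PAGE;
            self.snapshot[range.clone()].copy_from_slice(&self.mem[range]);
            self.copied[page] = true;
        }
    }

    /// A guest write: the simulated WP fault saves the old page first.
    fn guest_write(&mut self, addr: usize, value: u8) {
        self.save_page(addr / PAGE);
        self.mem[addr] = value;
    }

    /// Bulk copier: saves every page the writer has not touched yet.
    fn finish(&mut self) {
        for page in 0..self.copied.len() {
            self.save_page(page);
        }
    }
}
```

Whatever the interleaving of `guest_write` and `finish`, the finished `snapshot` equals the memory contents at the moment `arm` was called, while `mem` carries the guest's later writes. The real prototype checks the same invariant with a forked writer process and kernel-delivered faults.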
Extend the existing crates/forkd-uffd/ (currently used for
restore-side lazy paging) with a snapshot-side WP path. Plumb the new
flow through forkd-controller::branch_sandbox. Add a --live-fork
feature flag (default off) so the v0.3.4 pause-based path remains
available during stabilization.
Reproduce the v0.3.4 multi-BRANCH sweep
(bench/pause-window/sweep-diff.sh) but with --live-fork. Target:
pause < 10 ms across all 10 consecutive BRANCHes. Compare distribution,
not just mean — the v0.3.4 fix was a story about tail behavior.
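Since the comparison is about tails, a small percentile helper is enough to report p50 and the maximum alongside the mean. This is a generic nearest-rank sketch, not the existing sweep tooling, and the sample values in the usage note are made up.

```rust
/// Nearest-rank percentile over pause samples in milliseconds.
/// `p` is in (0, 100]; returns None for an empty slice.
fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank: the smallest 1-based rank covering p percent of samples.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}
```

For ten made-up samples `[4.0, 5.0, 4.5, 6.0, 5.5, 4.2, 4.8, 5.1, 48.0, 4.9]`, the median is 4.9 ms while the maximum is 48.0 ms; the mean of 9.2 ms describes neither, which is why the sweep should report the distribution.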
Edge cases to specifically test:
- Write-heavy guest (stress-ng --vm 1 --vm-bytes 90% running inside).
- NUMA cross-node guest RAM (force memfd allocations across nodes).
- Concurrent BRANCHes on different parents (shared uffd handler thread pool? Or one handler per BRANCH?).
- Kernel < 5.7 (no UFFD_WP): graceful detection + fallback to v0.3.4 pause-based path.
- THP enabled/disabled.
- Memory pressure during BRANCH (host actively swapping).
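For the kernel-version edge case, a first-pass check could parse the release string and pick the BRANCH path accordingly. The sketch below uses hypothetical names and the 5.7 floor from the list above. A version comparison is only a heuristic: probing the features reported by UFFDIO_API is more robust, and write-protect on shmem-backed memory is advertised through a separate feature bit.

```rust
/// Minimum (major, minor) assumed here for UFFD_WP, per the edge-case list.
const MIN_KERNEL: (u32, u32) = (5, 7);

#[derive(Debug, PartialEq)]
enum BranchMode {
    LiveFork,
    PauseBased,
}

/// Parses the leading "major.minor" from a release string such as
/// "5.4.0-150-generic" (the format found in /proc/sys/kernel/osrelease).
fn parse_kernel_release(release: &str) -> Option<(u32, u32)> {
    let mut parts = release.trim().split(|c: char| !c.is_ascii_digit());
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    Some((major, minor))
}

/// Falls back to the pause-based path when the version is too old or unparseable.
fn select_branch_mode(release: &str) -> BranchMode {
    match parse_kernel_release(release) {
        Some(version) if version >= MIN_KERNEL => BranchMode::LiveFork,
        _ => BranchMode::PauseBased,
    }
}
```

Treating an unparseable release string as "too old" keeps the failure mode conservative: the worst outcome is the v0.3.4 pause, never a BRANCH that arms WP on a kernel that cannot honor it.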
- Switch --live-fork to default-on after a stabilization pass.
- Write up the implementation as a post-mortem-style article (same cadence as the v0.3.4 ext4 story).
- Ship v0.4.
- File any upstream kernel/Firecracker issues discovered along the way.
- Kernel < 5.7 doesn't have UFFDIO_WRITEPROTECT. Mitigation: detect at startup, fall back to v0.3.4 path, document minimum supported kernel. Ubuntu 20.04 LTS has 5.4, which is a real deployment hit. Possible workaround: backport detection so 5.4 users transparently get v0.3.4 behavior.
- Write-fault storms. A guest scribbling all of RAM during BRANCH generates one fault per page. With 4 KiB pages and 1 GiB of RAM that's 262,144 faults. Each fault is microseconds of kernel + userspace work, so the bound is ~1 s to drain, which is worse than the v0.3.4 pause for this pathological case. Mitigation: measure, document the regime, add a "give up, fall back to pause" escape hatch when fault rate exceeds threshold.
- Snapshot consistency under uffd_wp ordering. Need careful proof that the snapshot represents a consistent point-in-time even with async page copying. Plan: write a model + property test using loom or similar to fuzz the page-state machine.
- Restore-time regression. The new snapshot format (if it ends up different from v0.3.4) might restore slower. Need to bench both paths under the same workload before declaring v0.4 a win end-to-end.
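The write-fault-storm arithmetic and the proposed escape hatch can be sketched as follows. The 4 µs per-fault cost and the rate threshold in the usage note are assumptions chosen to match the "microseconds" and "~1 s" figures above, and the function names are hypothetical.

```rust
use std::time::Duration;

/// Worst case: every page of guest RAM takes exactly one WP fault.
fn worst_case_faults(ram_bytes: u64, page_size: u64) -> u64 {
    ram_bytes / page_size
}

/// Time to drain `faults` serially at an assumed fixed cost per fault.
fn drain_time(faults: u64, per_fault_nanos: u64) -> Duration {
    Duration::from_nanos(faults * per_fault_nanos)
}

/// Hypothetical escape hatch: give up on live fork when the fault rate
/// observed over the last window exceeds a configured threshold.
fn should_fall_back(faults_in_window: u64, window: Duration, max_faults_per_sec: f64) -> bool {
    faults_in_window as f64 / window.as_secs_f64() > max_faults_per_sec
}
```

For 1 GiB of RAM and 4 KiB pages, `worst_case_faults` gives 262,144; at an assumed 4 µs per fault that drains in about 1.05 s. With a threshold of, say, 200,000 faults/s, seeing 50,000 faults in a 100 ms window (500,000 faults/s) would trigger the fallback.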
- Linux kernel docs: Documentation/admin-guide/mm/userfaultfd.rst
- userfaultfd(2), ioctl_userfaultfd(2) man pages
- CRIU lazy-migration implementation: github.com/checkpoint-restore/criu (especially criu/lib/uffd.c)
- Firecracker UFFD restore support: github.com/firecracker-microvm/firecracker (src/vmm/src/persist.rs)
- "Live Migration of Virtual Machines", Clark et al., NSDI 2005 (the original pre-copy paper, for the alternative-design comparison)
- forkd v0.3.4 ext4 fix retrospective: bench/pause-window/PROBE-multi-branch-anomaly.md
- Tracking issue: #101