Motivation
In embedded single-writer deployments, backup_full is a stop-the-world operation for the duration of the copy. do_backup_full checkpoints and then copies the whole .grafeo container through the locked file handle (GrafeoFileManager::copy_to), and an embedded host that serializes access through a single writer must hold that writer off for the entire checkpoint-plus-copy. For a small database this is a blip; for a multi-GB container it is a multi-second (or longer) write stall that scales with database size — every writer blocks until the full-file copy finishes.
That stall is the cost we keep running into: a scheduled full backup of a live embedded database freezes all writes for as long as the copy takes. Incremental backups help bound steady-state cost, but the periodic full still has to copy the whole container under the lock.
Why this looks architecturally reachable
The README already advertises the two ingredients an online backup would need:
- "MVCC transactions with snapshot isolation" (Features): a transaction can already read a consistent view of the database as of a committed epoch while other transactions advance past it.
- "Async storage (async-storage feature): non-blocking WAL and snapshot I/O via tokio": there is already machinery for non-blocking snapshot I/O.
If a backup could pin an MVCC read view at the current epoch and stream the sections constituting that snapshot while writers continue committing at later epochs (copy-on-write / section-versioned, rather than a locked whole-file copy), the write stall would collapse to the cost of establishing the read view rather than the cost of copying the whole container. The existing restore_to_epoch is already the natural restore counterpart: an online backup that records its end_epoch restores through the exact same epoch-bounded path that backup_full outputs feed today.
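To make the idea concrete, here is a minimal sketch of an epoch-pinned read view over copy-on-write sections. Everything in it (ToyStore, Snapshot, pin, commit, backup_online) is a hypothetical stand-in written for this issue, not Grafeo's actual API or storage layout; it only assumes that sections can be shared as refcounted immutable byte slices.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};

/// Hypothetical immutable section bytes, shared by reference count.
type Section = Arc<[u8]>;

/// Hypothetical snapshot: the sections visible at one committed epoch.
#[derive(Clone)]
struct Snapshot {
    epoch: u64,
    sections: Arc<BTreeMap<String, Section>>,
}

/// Toy single-writer store (an assumption, not Grafeo's LayeredStore).
struct ToyStore {
    current: RwLock<Snapshot>,
}

impl ToyStore {
    fn new() -> Self {
        ToyStore {
            current: RwLock::new(Snapshot {
                epoch: 0,
                sections: Arc::new(BTreeMap::new()),
            }),
        }
    }

    /// Pin a read view. The lock is held only while an Arc and an epoch
    /// are cloned, not while any section bytes are copied.
    fn pin(&self) -> Snapshot {
        self.current.read().unwrap().clone()
    }

    /// Commit copy-on-write: build a new section map and swap it in.
    /// Pinned snapshots keep pointing at the old map and old bytes.
    fn commit(&self, name: &str, bytes: &[u8]) -> u64 {
        let mut cur = self.current.write().unwrap();
        // Clones the map of Arc handles, not the section bytes themselves.
        let mut next = (*cur.sections).clone();
        next.insert(name.to_string(), Arc::from(bytes));
        let epoch = cur.epoch + 1;
        *cur = Snapshot { epoch, sections: Arc::new(next) };
        epoch
    }
}

/// Stream a pinned snapshot into a sink without holding the store lock.
/// Returns the epoch that a manifest would record as end_epoch.
fn backup_online(store: &ToyStore, sink: &mut Vec<u8>) -> u64 {
    let view = store.pin();
    for bytes in view.sections.values() {
        sink.extend_from_slice(bytes);
    }
    view.epoch
}
```

In this toy model a writer that commits after pin() returns never disturbs the bytes the backup is streaming, so the writer-visible stall is the cost of pin() rather than the cost of the copy. Whether Grafeo's real sections can be shared this way is exactly the open question.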
Proposed direction (for discussion, not a fixed design)
A non-blocking full-backup entry point, e.g.:
db.backup_full_online("/backups/full")
-> pins a read view at the current committed epoch
-> streams the sections of that snapshot to the backup dir while writes continue
-> records end_epoch in the manifest (same shape backup_full writes today)
-> restores via the existing restore_to_epoch(output, target_epoch)
Open questions worth maintainer input:
- Can the section/LayeredStore model expose a consistent epoch-pinned read view to a backup reader without holding the writer lock for the copy (the v2 paged/packed section formats and swap_to_mmap look like they may already share refcounted byte slices that could serve this)?
- Does the async-storage non-blocking snapshot I/O path already get partway there, or is it orthogonal (WAL/checkpoint I/O rather than a consistent point-in-time reader)?
- Is a fully online full backup in scope for the embedded single-writer target, or is the preferred answer "take frequent incrementals and accept the periodic full's stall"?
Duplicate check
Checked existing issues: this is a new feature request, distinct from the prior backup bugs:
- backup_full() fails on Windows and read-only databases (closed): correctness, not blocking behavior.
- backup_incremental always fails with "no new WAL records since last backup" (#267, closed).
- WAL not replayed on database reopen (closed).
None proposes an online / non-blocking / MVCC-read-view backup.
Offer to contribute
Happy to help here: I've contributed upstream before (reported #323 with byte-level analysis from a production embedded database, and #324's read_snapshot v2 fix landed cherry-picked). If there's appetite for an online-backup path, I'm glad to prototype against the section/MVCC model and bring a design or PR, guided by your sense of where the read-view boundary should sit.