perf(md): derive WorkspaceIndex from op log instead of walking .md files #81

@avelino

Description

Reframes #33 / #34 against invariant #1 (the op log is the source of truth; .md is a projection).

WorkspaceIndex::build(workspace_root) (crates/outl-md/src/index.rs:73) does:

  1. walkdir pages/ + journals/
  2. read_to_string on every .md
  3. parse (comrak) each one
  4. read the sidecar to map IDs
  5. populates PageEntry + BlockIndex

Called on boot by:

  • outl-tui on a background thread (spawn_index_rebuild, crates/outl-tui/src/actions/lifecycle/index_build.rs:37)
  • outl-cli/cmd/doctor.rs:234
  • outl-cli/cmd/search.rs:67
  • outl-cli/cmd/backlinks.rs:86
  • outl-cli/mcp/mod.rs:128

Cost: ~250 ms in a debug build on 1500 pages (crates/outl-md/tests/index_perf.rs:18).

The problem

Everything the index needs already lives in the tree CRDT after the op-log replay:

Data                                   Already in the tree
-------------------------------------  ------------------------------------------------------------
Page list                              Tree::iter_nodes filtering root children with slug::
slug / title / icon / pinned / type    Tree::property(page_id, key)
is_journal                             kind:: property on the page node (PageKind::Journal)
Block text (search + ref extraction)   Workspace::block_text(id)
((blk-XXXXXX)) handle ↔ NodeId         derive_ref_handle(NodeId) (deterministic, no sidecar needed)
Reverse refs and backlinks             outl_md::inline::tokenize over block_text

None of this needs comrak, walkdir, or sidecar reads.
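
To illustrate how reverse refs can come out of block text alone, here is a minimal sketch. `extract_refs` and `build_backlinks` are hypothetical names, and the scanner is only a stand-in for `outl_md::inline::tokenize`, whose real signature and behavior are not shown in this issue. Block IDs are plain `u64` values here for brevity.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the inline tokenizer: returns every handle
/// written as `((handle))` in a block's text. Unclosed refs are ignored.
fn extract_refs(text: &str) -> Vec<&str> {
    let mut refs = Vec::new();
    let mut rest = text;
    while let Some(start) = rest.find("((") {
        let after = &rest[start + 2..];
        match after.find("))") {
            Some(end) => {
                let handle = &after[..end];
                if !handle.is_empty() {
                    refs.push(handle);
                }
                rest = &after[end + 2..];
            }
            // No closing `))`: nothing more to extract.
            None => break,
        }
    }
    refs
}

/// Maps each referenced handle to the IDs of the blocks that mention it.
/// Input is (block id, block text), as it would come from the tree.
fn build_backlinks<'a>(blocks: &[(u64, &'a str)]) -> BTreeMap<&'a str, Vec<u64>> {
    let mut map: BTreeMap<&'a str, Vec<u64>> = BTreeMap::new();
    for &(id, text) in blocks {
        for handle in extract_refs(text) {
            map.entry(handle).or_default().push(id);
        }
    }
    map
}
```

Nothing in this sketch touches the filesystem: the only input is text that is already in memory.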

The index derives from the projection (.md) instead of from the source of truth (op log). That's the actual bug.

Proposal

impl WorkspaceIndex {
    // before
    pub fn build(workspace_root: &Path) -> Self { /* walkdir + parse + sidecar */ }

    // after
    pub fn derive(workspace: &Workspace) -> Self { /* tree walk + block_text + tokenize */ }
}
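
As a rough sketch of what the body of `derive` could look like, the following uses a toy in-memory tree. Every type and field here (`Node`, `props`, the `"slug"` / `"title"` / `"kind"` keys, the `node` helper) is an assumption made for illustration; the real `Workspace`, `Tree`, and `PageEntry` types are not reproduced.

```rust
use std::collections::BTreeMap;

type NodeId = u64;

/// Toy stand-in for a tree node; field names are hypothetical.
struct Node {
    id: NodeId,
    parent: Option<NodeId>, // None means "child of the root"
    props: BTreeMap<String, String>,
}

/// Toy stand-in for the replayed workspace.
struct Workspace {
    nodes: Vec<Node>,
}

#[derive(Debug, PartialEq)]
struct PageEntry {
    id: NodeId,
    slug: String,
    title: String,
    is_journal: bool,
}

#[derive(Debug, Default)]
struct WorkspaceIndex {
    pages: Vec<PageEntry>,
    blocks: Vec<NodeId>,
}

impl WorkspaceIndex {
    /// Walks the in-memory tree once. Root children that carry a slug
    /// become pages; every other node is recorded as a block.
    fn derive(ws: &Workspace) -> Self {
        let mut index = WorkspaceIndex::default();
        for node in &ws.nodes {
            match (node.parent, node.props.get("slug")) {
                (None, Some(slug)) => index.pages.push(PageEntry {
                    id: node.id,
                    slug: slug.clone(),
                    // Assumption: fall back to the slug when no title is set.
                    title: node.props.get("title").cloned().unwrap_or_else(|| slug.clone()),
                    is_journal: node.props.get("kind").map(String::as_str) == Some("journal"),
                }),
                _ => index.blocks.push(node.id),
            }
        }
        index
    }
}

/// Small fixture helper for building nodes in tests.
fn node(id: NodeId, parent: Option<NodeId>, props: &[(&str, &str)]) -> Node {
    let props = props.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect();
    Node { id, parent, props }
}
```

The same fixture style would also suit the migrated tests: build a few nodes in memory, call `derive`, and assert on the result without a tempdir.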

Boot becomes:

  1. Replay op log → tree (already runs).
  2. WorkspaceIndex::derive(&workspace) → zero I/O, zero global parse.
  3. Lazy parse of .md still happens only in load_current (one page at a time, when the user opens it).
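
The first two boot steps can be sketched with a toy linear log. This is only an illustration under stated assumptions: the `Op` variants, `Tree` layout, and function names are hypothetical, and a plain in-order fold stands in for the real CRDT replay, which has to handle concurrent operations.

```rust
use std::collections::BTreeMap;

/// Toy op types; the real op-log format is not shown in this issue.
enum Op {
    Create { id: u64, parent: Option<u64> },
    SetText { id: u64, text: String },
    Delete { id: u64 },
}

/// Toy in-memory state produced by replay.
#[derive(Default)]
struct Tree {
    parent: BTreeMap<u64, Option<u64>>,
    text: BTreeMap<u64, String>,
}

/// Step 1: fold the log into in-memory state, in log order.
/// Deletes are not cascaded to children in this sketch.
fn replay(ops: &[Op]) -> Tree {
    let mut tree = Tree::default();
    for op in ops {
        match op {
            Op::Create { id, parent } => {
                tree.parent.insert(*id, *parent);
            }
            Op::SetText { id, text } => {
                // Ignore text for nodes that do not exist (or were deleted).
                if tree.parent.contains_key(id) {
                    tree.text.insert(*id, text.clone());
                }
            }
            Op::Delete { id } => {
                tree.parent.remove(id);
                tree.text.remove(id);
            }
        }
    }
    tree
}

/// Step 2: derive a parent -> children map from the tree alone, with no I/O.
fn derive_children(tree: &Tree) -> BTreeMap<u64, Vec<u64>> {
    let mut children: BTreeMap<u64, Vec<u64>> = BTreeMap::new();
    for (id, parent) in &tree.parent {
        if let Some(p) = parent {
            children.entry(*p).or_default().push(*id);
        }
    }
    children
}
```

Because the derived structure is a pure function of the replayed state, it cannot drift from the op log the way a filesystem walk can.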

Edge cases

  • .md edited externally (vim, iCloud peer drop): the orphan scanner detects the change → reconcile_md → ops → tree → the derive-based version of patch_page, which picks up the refreshed tree rather than the filesystem.
  • Fresh workspace post Logseq import: outl_actions::ingest_md_file already creates ops; the index derives normally after that.
  • text_fold (lowercased cache for search_block_text): populated from the tree's block_text.
  • parse_warnings: stays per-page, only when a .md is opened. Doesn't enter the index.
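
For the text_fold case, a minimal sketch of a lowercased cache filled from block text might look like this. `TextFold`, `from_blocks`, and `search` are hypothetical names; the real `search_block_text` signature is not shown in this issue, and `to_lowercase` is an assumed folding rule.

```rust
use std::collections::BTreeMap;

/// Hypothetical lowercased cache keyed by block id.
struct TextFold {
    folded: BTreeMap<u64, String>,
}

impl TextFold {
    /// Builds the cache from (block id, block text) pairs, as they would be
    /// read from the tree rather than from .md files.
    fn from_blocks<'a>(blocks: impl IntoIterator<Item = (u64, &'a str)>) -> Self {
        let folded = blocks
            .into_iter()
            .map(|(id, text)| (id, text.to_lowercase()))
            .collect();
        TextFold { folded }
    }

    /// Case-insensitive substring search; returns matching ids in ascending order.
    fn search(&self, query: &str) -> Vec<u64> {
        let needle = query.to_lowercase();
        self.folded
            .iter()
            .filter(|(_, text)| text.contains(needle.as_str()))
            .map(|(id, _)| *id)
            .collect()
    }
}
```

Folding once at build time keeps each query to a single lowercase of the needle plus substring checks.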

Migration

  • Replace WorkspaceIndex::build(path) with WorkspaceIndex::derive(&workspace) at every caller.
  • CLI / MCP load the workspace first (they already do) and derive the index instead of walking + parsing.
  • Tests in crates/outl-md/tests/workspace_index.rs migrate to building an in-memory workspace fixture + derive, instead of writing .md files in a tempdir.

Non-goals

  • No persisted cache on disk (.outl/cache/).
  • No sidecar change.
  • No op-log format change.
  • No LRU AST cache.

Related

Metadata
Labels

  • area:md (outl-md: parse, render, sidecar, matching)
  • kind:perf (Performance improvement)
  • priority:P2 (Important, next batch)