#embedding #umap #manifold #deterministic #dimension-reduction

holomap

Deterministic UMAP — the bulk, on the boundary. Same input, same params, same seed: bit-identical embeddings.

2 unstable releases

Uses new Rust 2024

0.2.0 Jun 8, 2026
0.1.0 Jun 7, 2026

#1410 in Algorithms

MIT/Apache

100KB
2K SLoC

holomap

crates.io docs.rs CI

Deterministic UMAP in Rust. The bulk, on the boundary.

The holographic principle says the information of an N-dimensional volume can be encoded on its (N−1)-dimensional surface. holomap does that for your data: UMAP-class dimensionality reduction whose defining feature is not speed — it's the contract.

The contract

Same input + same params + same seed → bit-identical embedding.

On the same platform/toolchain, two runs produce byte-equal output, verified in CI by running twice and comparing raw bytes. Cross-platform, embeddings are structurally identical (floats may differ at ULP level). There is no unseeded constructorseed: u64 is a required builder argument, by design, forever.

Why: every UMAP crate in Rust is non-deterministic

This isn't a gap we guessed at — we read the source. As of 2026-06, no UMAP-class Rust crate exposes a seed:

Crate Version Why it can't be reproduced
annembed 0.1.6 EmbedderParams has no seed field; the embedder draws from rand::rng() (OS-entropy thread-local) at init and throughout the gradient loop; the hnsw_rs kNN backend seeds from OS entropy too
umap-rs 0.4 no seed in the public API
fast-umap 1.6 no seed in the public API
petal-decomposition 0.7 PCA only (and needs system LAPACK) — not UMAP-class

Python's umap-learn has had random_state since the start. The Rust ecosystem never did.

What that costs you

Non-determinism is invisible until it isn't:

  • You can't regression-test an embedding. "Did my refactor change the output?" is unanswerable when every run differs anyway.
  • You can't reproduce research. A paper, a notebook, a result that says "embed with these params" doesn't replay.
  • Eval harnesses flake. Anything downstream of the embedding — clustering counts, neighbourhood metrics, a golden-row suite — inherits the noise and starts failing intermittently.
  • Debugging is a guessing game. You can't bisect a quality regression when the baseline moves under you.

The usual workaround — run it many times and average, or pin a process and never touch it — is a tax you pay forever. A seed removes the tax.

How holomap makes the contract hold

Determinism here is structural, not bolted on:

  • One PRNG, one place. All pipeline randomness — SGD negative sampling, optional random init, the spectral init's noise — comes from a single seeded PCG64 stream in a fixed draw order. There is no second source of entropy to forget about.
  • No unseeded path exists. seed: u64 is a required builder argument. You cannot construct a run that draws from the OS, because the type system doesn't offer one.
  • Deterministic by construction where it can be. Exact brute-force kNN (ties broken by index); a fixed Lanczos start vector for the spectral init (no RNG in the eigensolve at all); edge iteration over sorted CSR structure, never hash-map order.

The result is checked, not asserted: CI runs fit_transform twice and compares raw bytes, and a property test asserts byte-identity across 64 randomized seeds/shapes/params each run.

The honest envelope

The trade for exactness is scale: brute-force kNN is O(N²·d), so the honest ceiling is ≤ ~50k points. Same-platform output is byte-identical; cross-platform it's structurally identical — and in practice the staged intermediates match the umap-learn/scipy references to within 1e-5 on Linux, macOS, and Windows alike (the parity suite runs on all three). Seeded approximate-NN for larger N is a future direction, not a v1 promise.

holomap exists because we hit this wall building a concept-formation clusterer and needed the contract immediately. We filled the gap rather than working around it.

Install

cargo add holomap
use holomap::Holomap;

let embedding = Holomap::builder(42)   // the seed is a required argument
    .n_neighbors(15)
    .min_dist(0.1)
    .fit_transform(&data, n_features)?;

Status: v0.2.0 — shipped

Milestone Exit test
M1 exact kNN + fuzzy simplicial set stage intermediates match umap-learn 0.5.12 on fixtures
M2 spectral (Lanczos) initialization eigenvector parity vs scipy; deterministic double-run
M3 seeded SGD + end-to-end fit_transform trustworthiness vs umap-learn on blobs/swiss-roll; bit-identity CI gate
M4 API polish, docs, crates.io publish

v0.2.0 adds: non-finite input rejection, an optional ndarray front door (fit_transform_array), a property-tested determinism invariant, and a cross-platform CI matrix (Linux/macOS/Windows) backing the parity claim with evidence.

Measured (k=15 trustworthiness, same data, same params): blobs 0.954 vs umap-learn's 0.955; swiss roll 0.991 vs 0.990. Wall-clock at 1k×50-d points: ~3 s release-mode vs umap-learn's ~28 s on the same machine; at 10k×50-d, ~26 s vs ~69 s (cargo run --release --example bench -- 10000).

Scope (v1)

  • fit_transform via a builder: n_components, n_neighbors, min_dist, spread, metric (euclidean | cosine), n_epochs, init (spectral | random), seed (required)
  • Exact brute-force kNN — deterministic by construction; honest envelope is ≤ ~50k points
  • Serial seeded SGD (single PCG64 stream — all pipeline randomness lives in one place)
  • Dependencies: rand + rand_pcg, nalgebra (pure-Rust eigensolves for the spectral init; Lanczos itself is in-crate). No BLAS, no LAPACK, no C.
  • Optional features: serde (serialize the config, seed included — a stored config replays bit-identically); ndarray (fit_transform_array taking ArrayView2<f32>Array2<f32>).

Deliberately out of scope: GPU, parametric/supervised UMAP, densMAP, plotting, unseeded code paths. The crate's identity is small, auditable, deterministic — generality is resisted on purpose.

License

MIT OR Apache-2.0, at your option.

Dependencies

~12MB
~315K SLoC