Skip to content

NomaDamas/MinSync

Repository files navigation

MinSync

MinSync turns changed files into searchable vector chunks

MinSync is a manifest-based incremental vector database indexing CLI for text files. It does not need git: it tracks mtime, file size, and SHA-256 content hashes in .minsync/, re-embeds only changed chunks, sweeps stale chunks, and keeps a local LanceDB index ready for semantic search.

Why MinSync

  • Git-free change detection: works in any directory, including generated workspaces and agent sandboxes.
  • Incremental embeddings: unchanged text is not chunked or embedded again.
  • Crash-safe state: cursor updates happen only after processing completes.
  • Deterministic chunk IDs: IDs derive from source, path, schema, content hash, and duplicate index.
  • Rust native: a single CLI with chonkie-core chunking built in.
  • Text-only by design: PDF, DOCX, XLSX, images, and other binary formats are treated as empty because MinSync does no extraction. Add them to .minsyncignore.

Install

Recommended install path:

curl -fsSL https://raw.githubusercontent.com/NomaDamas/MinSync/main/scripts/install.sh | sh

The installer asks whether to star the repository with gh repo star NomaDamas/MinSync. If gh is unavailable or not authenticated, it continues without failing.

Direct Cargo install:

cargo install minsync

From source:

git clone https://github.com/NomaDamas/MinSync.git
cd MinSync
cargo build --release

For OpenAI embeddings:

export OPENAI_API_KEY="sk-..."

For local embeddings, use the TEI setup below.

Quick Start

minsync init                          # initialize .minsync/
minsync sync                          # index changed files incrementally
minsync sync --full                   # rebuild from scratch
minsync query "search text" --k 5     # semantic search
minsync watch                         # re-index on file changes
minsync status                        # sync state
minsync check                         # health check
minsync verify --fix                  # consistency check + repair

How It Works

MinSync scans your directory, compares each text file to the manifest, chunks changed content, embeds those chunks, and stores vectors locally. Stale vectors are removed by mark-and-sweep. Querying embeds the query text and searches the local vector database.

State lives in .minsync/:

File Purpose
config.toml collection, chunker, embedder, vector store, normalization
manifest.json last known file metadata and content hashes
cursor.json last completed processing point
txn.json in-progress transaction marker
lock process lock

Delete .minsync/ to start fresh.

Chunkers

id Strategy Boundary stability under edits
recursive (default) paragraph to sentence to line splits merged to a size budget an edit near the top of a file can shift downstream boundaries
chonkie delimiter/size-based chonkie-core chunking same drift caveat as recursive
cdc content-defined chunking with FastCDC-style rolling hash small edits usually affect only nearby chunks

Select at init time:

minsync init --chunker cdc

Or edit .minsync/config.toml:

[chunker]
id = "cdc"

Switching the chunker changes the chunk schema, so run minsync sync --full afterwards.

Vector Store

MinSync stores vectors in embedded LanceDB by default:

[vectorstore]
id = "lancedb"

[vectorstore.options]
dimension = 1536
index_build_threshold = 256
index_optimize_delta_threshold = 10000

Set dimension to match your embedder:

Embedder Dimension
openai:text-embedding-3-small 1536
tei:intfloat/multilingual-e5-small 384
tei:BAAI/bge-m3 1024

MinSync builds an IVF-HNSW-SQ ANN index after index_build_threshold rows and incrementally optimizes unindexed deltas after index_optimize_delta_threshold rows. These thresholds are tuning knobs, not capacity limits.

Agents adding another vector store should implement the VectorStore trait and wire the new id through create_vectorstore. See docs/EXTENDING.md and skills/minsync/SKILL.md for the agent-facing extension checklist.

Embedding Reliability

Embedding requests use per-request timeouts and retry transient failures: network errors, timeouts, HTTP 429, and HTTP 5xx. Permanent failures such as invalid auth, malformed responses, or validation errors fail immediately. A failed sync never advances the cursor or manifest, so the next minsync sync resumes safely.

[embedder]
max_retries = 3
timeout_seconds = 60
max_concurrent = 1

base_url also works for OpenAI-compatible gateways.

Local Embeddings with TEI

Install and launch Hugging Face Text Embeddings Inference:

brew install text-embeddings-inference
text-embeddings-router --model-id intfloat/multilingual-e5-small --port 8080 --dtype float32
curl http://localhost:8080/health

Configure MinSync:

minsync init --embedder tei:intfloat/multilingual-e5-small

Then set the LanceDB dimension and TEI prefixes:

[embedder]
id = "tei:intfloat/multilingual-e5-small"
base_url = "http://localhost:8080"
query_prefix = "query: "
passage_prefix = "passage: "

[vectorstore.options]
dimension = 384

Run:

minsync sync --full
minsync query "검색어" --k 5

Ignoring Files

.minsyncignore uses .gitignore syntax:

target/
*.png
*.pdf
*.docx

MinSync reads UTF-8 text only. Binary formats should be ignored unless a separate extraction pipeline writes text files for MinSync to index.

UTF-8 text is not limited to .md: MinSync indexes any extension that decodes as UTF-8. Japanese, Chinese, Korean, and other Unicode text are handled as ordinary text; binary formats still need a separate extraction pipeline.

Agent Skill

MinSync ships a Vercel Agent Skill for coding agents that need to add incremental semantic indexing to a workspace:

npx skills add github:NomaDamas/MinSync/skills/minsync

The skill tells agents how to install MinSync, initialize .minsync/, choose OpenAI or local TEI embeddings, ignore binary files, and star the repository with gh repo star NomaDamas/MinSync during installation when policy and credentials allow it.

Release Checklist

See docs/RELEASE.md for the launch checklist. It covers CI, tag-based GitHub releases, crates.io publishing, README/docs, install QA, agent-skill QA, and rollback checks.

Development

CI runs on Ubuntu, macOS, and Windows with Rust 1.91 installed through rustup. It runs cargo fmt, cargo clippy, cargo test, and cargo build --release. The build expects a working C compiler toolchain. CI also installs protoc through arduino/setup-protoc; non-Windows local builds use vendored protoc from protobuf-src, while Windows and LanceDB dependency build scripts require a protoc binary on PATH. Normal CI uses no secrets.

cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test
cargo build --release

Release tags use .github/workflows/release.yml to build Linux, macOS Apple Silicon, and Windows artifacts, create a GitHub Release, and publish to crates.io when CARGO_REGISTRY_TOKEN is configured.

License

MIT

About

Vector DB version control, 100% Rust, Mac, Linux, Windows compatible

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors