MinSync is a manifest-based incremental vector database indexing CLI for text files. It does not need git: it tracks mtime, file size, and SHA-256 content hashes in .minsync/, re-embeds only changed chunks, sweeps stale chunks, and keeps a local LanceDB index ready for semantic search.
- Git-free change detection: works in any directory, including generated workspaces and agent sandboxes.
- Incremental embeddings: unchanged text is not chunked or embedded again.
- Crash-safe state: cursor updates happen only after processing completes.
- Deterministic chunk IDs: IDs derive from source, path, schema, content hash, and duplicate index.
- Rust native: a single CLI with chonkie-core chunking built in.
- Text-only by design: PDF, DOCX, XLSX, images, and other binary formats are treated as empty because MinSync does no extraction. Add them to
.minsyncignore.
Recommended install path:
curl -fsSL https://raw.githubusercontent.com/NomaDamas/MinSync/main/scripts/install.sh | shThe installer asks whether to star the repository with gh repo star NomaDamas/MinSync. If gh is unavailable or not authenticated, it continues without failing.
Direct Cargo install:
cargo install minsyncFrom source:
git clone https://github.com/NomaDamas/MinSync.git
cd MinSync
cargo build --releaseFor OpenAI embeddings:
export OPENAI_API_KEY="sk-..."For local embeddings, use the TEI setup below.
minsync init # initialize .minsync/
minsync sync # index changed files incrementally
minsync sync --full # rebuild from scratch
minsync query "search text" --k 5 # semantic search
minsync watch # re-index on file changes
minsync status # sync state
minsync check # health check
minsync verify --fix # consistency check + repairMinSync scans your directory, compares each text file to the manifest, chunks changed content, embeds those chunks, and stores vectors locally. Stale vectors are removed by mark-and-sweep. Querying embeds the query text and searches the local vector database.
State lives in .minsync/:
| File | Purpose |
|---|---|
config.toml |
collection, chunker, embedder, vector store, normalization |
manifest.json |
last known file metadata and content hashes |
cursor.json |
last completed processing point |
txn.json |
in-progress transaction marker |
lock |
process lock |
Delete .minsync/ to start fresh.
| id | Strategy | Boundary stability under edits |
|---|---|---|
recursive (default) |
paragraph to sentence to line splits merged to a size budget | an edit near the top of a file can shift downstream boundaries |
chonkie |
delimiter/size-based chonkie-core chunking | same drift caveat as recursive |
cdc |
content-defined chunking with FastCDC-style rolling hash | small edits usually affect only nearby chunks |
Select at init time:
minsync init --chunker cdcOr edit .minsync/config.toml:
[chunker]
id = "cdc"Switching the chunker changes the chunk schema, so run minsync sync --full afterwards.
MinSync stores vectors in embedded LanceDB by default:
[vectorstore]
id = "lancedb"
[vectorstore.options]
dimension = 1536
index_build_threshold = 256
index_optimize_delta_threshold = 10000Set dimension to match your embedder:
| Embedder | Dimension |
|---|---|
openai:text-embedding-3-small |
1536 |
tei:intfloat/multilingual-e5-small |
384 |
tei:BAAI/bge-m3 |
1024 |
MinSync builds an IVF-HNSW-SQ ANN index after index_build_threshold rows and incrementally optimizes unindexed deltas after index_optimize_delta_threshold rows. These thresholds are tuning knobs, not capacity limits.
Agents adding another vector store should implement the VectorStore trait and wire the new id through create_vectorstore. See docs/EXTENDING.md and skills/minsync/SKILL.md for the agent-facing extension checklist.
Embedding requests use per-request timeouts and retry transient failures: network errors, timeouts, HTTP 429, and HTTP 5xx. Permanent failures such as invalid auth, malformed responses, or validation errors fail immediately. A failed sync never advances the cursor or manifest, so the next minsync sync resumes safely.
[embedder]
max_retries = 3
timeout_seconds = 60
max_concurrent = 1base_url also works for OpenAI-compatible gateways.
Install and launch Hugging Face Text Embeddings Inference:
brew install text-embeddings-inference
text-embeddings-router --model-id intfloat/multilingual-e5-small --port 8080 --dtype float32
curl http://localhost:8080/healthConfigure MinSync:
minsync init --embedder tei:intfloat/multilingual-e5-smallThen set the LanceDB dimension and TEI prefixes:
[embedder]
id = "tei:intfloat/multilingual-e5-small"
base_url = "http://localhost:8080"
query_prefix = "query: "
passage_prefix = "passage: "
[vectorstore.options]
dimension = 384Run:
minsync sync --full
minsync query "검색어" --k 5.minsyncignore uses .gitignore syntax:
target/
*.png
*.pdf
*.docxMinSync reads UTF-8 text only. Binary formats should be ignored unless a separate extraction pipeline writes text files for MinSync to index.
UTF-8 text is not limited to .md: MinSync indexes any extension that decodes as UTF-8. Japanese, Chinese, Korean, and other Unicode text are handled as ordinary text; binary formats still need a separate extraction pipeline.
MinSync ships a Vercel Agent Skill for coding agents that need to add incremental semantic indexing to a workspace:
npx skills add github:NomaDamas/MinSync/skills/minsyncThe skill tells agents how to install MinSync, initialize .minsync/, choose OpenAI or local TEI embeddings, ignore binary files, and star the repository with gh repo star NomaDamas/MinSync during installation when policy and credentials allow it.
See docs/RELEASE.md for the launch checklist. It covers CI, tag-based GitHub releases, crates.io publishing, README/docs, install QA, agent-skill QA, and rollback checks.
CI runs on Ubuntu, macOS, and Windows with Rust 1.91 installed through rustup. It runs cargo fmt, cargo clippy, cargo test, and cargo build --release. The build expects a working C compiler toolchain. CI also installs protoc through arduino/setup-protoc; non-Windows local builds use vendored protoc from protobuf-src, while Windows and LanceDB dependency build scripts require a protoc binary on PATH. Normal CI uses no secrets.
cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test
cargo build --releaseRelease tags use .github/workflows/release.yml to build Linux, macOS Apple Silicon, and Windows artifacts, create a GitHub Release, and publish to crates.io when CARGO_REGISTRY_TOKEN is configured.
MIT