QMD — Query Markup Documents

A hybrid search engine for your markdown knowledge base — local-first, cloud-capable, fully observable.

QMD indexes your markdown notes, meeting transcripts, documentation, and journals, then lets you search them with keywords, semantic similarity, or a hybrid of both. It is designed for agentic workflows (LLMs calling it over MCP), interactive use (CLI), and embedded use (SDK).

This repository is an enhanced fork of the original tobi/qmd by Tobi Lutke. It preserves every feature of the upstream project and adds, on top of it, a polymorphic provider backend (Jina AI integration), a full observability layer (usage tracking, quota warnings, benchmarking, histograms), and a secrets hygiene workflow (.env loader + pre-commit secret scanner). All enhancements are opt-in — if you set no new environment variables and touch no new YAML fields, QMD behaves identically to upstream.

For a detailed walkthrough of the design rationale behind every change in this fork, see docs/01-architecture-changes.md.

What QMD Does

At its core, QMD is a hybrid retrieval pipeline:

BM25 full-text search via SQLite FTS5 — fast exact-keyword matching.
Vector semantic search via sqlite-vec — finds conceptually related content even when words don't match.
LLM query expansion — rewrites your query into alternative phrasings to improve recall.
Reciprocal Rank Fusion (RRF) — merges ranked lists from all queries, with a top-rank bonus that preserves exact matches.
LLM re-ranking — a cross-encoder scores the top candidates for final precision, with position-aware blending so the reranker can't destroy a high-confidence retrieval result.

Every stage runs locally by default, backed by small GGUF models loaded via node-llama-cpp. The result is a search engine that is private, offline-capable, agent-friendly, and accurate — without sending a single byte of your corpus to the cloud.

The upstream project was designed around a single principle: keep everything on-device. This fork relaxes that principle to "on-device by default", with clean escape hatches for when local resources aren't enough.

What This Fork Adds (at a glance)

Capability	Upstream QMD	This fork
Embedding backend	Local `embeddinggemma-300M` only (384 dim, 2048 ctx)	Local + Jina v3 (1024 dim, 8192 ctx, Matryoshka truncation, 89 languages)
Rerank backend	Local `qwen3-reranker-0.6b` only	Local + Jina reranker v2 multilingual
Vietnamese / CJK quality	Limited (English-biased embedder)	Strong via Jina v3
Context window	2048 tokens (silent truncation)	2048 (local) or 8192 (Jina)
Usage observability	None	`qmd usage` — text, JSON, CSV, ASCII histogram
Quota tracking	None	`QMD_JINA_QUOTA=1B` with severity levels (ok → warn → critical → over)
Performance measurement	Search quality bench only	`qmd bench jina` — latency + throughput with `--runs N` stddev statistics
Secret handling	Shell env vars only	`.env` auto-load + gitignored template + pre-commit secret scanner
Test count	~399	467 (+68 new tests)
Runtime dependencies added	—	0 (zero new deps)
Regressions	—	0

Key design principle: additive, never subtractive. Every enhancement is opt-in. If you don't set any new environment variable and don't touch any new YAML field, this fork behaves identically to upstream.

Read the full design rationale in docs/01-architecture-changes.md.

Quick Start

# Install globally (Node or Bun)
npm install -g @tobilu/qmd
# or
bun install -g @tobilu/qmd

# Or run directly
npx @tobilu/qmd ...
bunx @tobilu/qmd ...

# Create collections for your notes, docs, and meeting transcripts
qmd collection add ~/notes --name notes
qmd collection add ~/Documents/meetings --name meetings
qmd collection add ~/work/docs --name docs

# Add context so that sub-documents inherit descriptive metadata.
# Context is returned alongside search results and helps LLMs pick
# documents contextually. Do not skip this step — it is the secret
# sauce of QMD's retrieval quality.
qmd context add qmd://notes "Personal notes and ideas"
qmd context add qmd://meetings "Meeting transcripts and notes"
qmd context add qmd://docs "Work documentation"

# Generate embeddings for semantic search
qmd embed

# Search across everything
qmd search "project timeline"           # Fast BM25 keyword search
qmd vsearch "how to deploy"              # Vector semantic search
qmd query "quarterly planning process"   # Hybrid + reranking (best quality)

# Retrieve documents
qmd get "meetings/2024-01-15.md"         # By path
qmd get "#abc123"                         # By docid (shown in search results)
qmd multi-get "journals/2025-05*.md"     # Glob pattern

# Scope search to a specific collection
qmd search "API" -c notes

# Export all matches above a threshold (for agents)
qmd search "API" --all --files --min-score 0.3

Optional: use Jina AI for embedding / reranking

# 1. Copy the template and fill in your key
cp .env.example .env
# Edit .env:  JINA_API_KEY=jina_xxx
#             QMD_EMBED_PROVIDER=jina
#             QMD_RERANK_PROVIDER=jina

# 2. Run QMD — the .env file is auto-loaded, no `source` needed
qmd status                  # Shows active providers
qmd embed -f                # Re-embed with the new backend
qmd query "auth flow"       # Same CLI, remote backend
qmd usage                   # See how many tokens you have consumed

See Remote Providers: Jina AI and Secrets & .env Workflow for the full picture.

Using QMD with AI Agents

QMD's --json and --files output formats are designed for agentic workflows where an LLM calls the tool programmatically.

# Structured results for an LLM
qmd search "authentication" --json -n 10

# All relevant files above a threshold
qmd query "error handling" --all --files --min-score 0.4

# Full document content
qmd get "docs/api-reference.md" --full

MCP Server

QMD exposes a Model Context Protocol server so agents (Claude Desktop, Claude Code, Cursor, etc.) can call it natively.

Tools exposed:

query — Search with typed sub-queries (lex/vec/hyde), combined via RRF + reranking
get — Retrieve a document by path or docid (with fuzzy matching suggestions)
multi_get — Batch retrieve by glob pattern, comma-separated list, or docids
status — Index health and collection info

Claude Desktop

In ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "qmd": {
      "command": "qmd",
      "args": ["mcp"]
    }
  }
}

Claude Code

Install the plugin:

claude plugin marketplace add tobi/qmd
claude plugin install qmd@qmd

Or configure MCP manually in ~/.claude/settings.json:

{
  "mcpServers": {
    "qmd": {
      "command": "qmd",
      "args": ["mcp"]
    }
  }
}

HTTP Transport

By default, QMD's MCP server uses stdio (launched as a subprocess by each client). For a shared, long-lived server that avoids repeated model loading, use the HTTP transport:

# Foreground (Ctrl-C to stop)
qmd mcp --http                    # localhost:8181
qmd mcp --http --port 8080        # custom port

# Background daemon
qmd mcp --http --daemon           # writes PID to ~/.cache/qmd/mcp.pid
qmd mcp stop                      # stop via PID file
qmd status                        # shows "MCP: running (PID ...)"

Endpoints:

POST /mcp — MCP Streamable HTTP (JSON responses, stateless)
GET /health — liveness check with uptime

Models stay loaded in VRAM across requests. Embedding and reranking contexts are disposed after 5 minutes idle and transparently recreated on the next request (~1 s penalty; the models themselves remain loaded).

SDK / Library Usage

QMD can be embedded in any Node.js or Bun application.

Installation

npm install @tobilu/qmd

Quick Start

import { createStore } from '@tobilu/qmd'

const store = await createStore({
  dbPath: './my-index.sqlite',
  config: {
    collections: {
      docs: { path: '/path/to/docs', pattern: '**/*.md' },
    },
  },
})

const results = await store.search({ query: 'authentication flow' })
console.log(results.map((r) => `${r.title} (${Math.round(r.score * 100)}%)`))

await store.close()

Store Creation

createStore() supports three modes:

// 1. Inline config — no files needed besides the DB
const store = await createStore({
  dbPath: './index.sqlite',
  config: {
    collections: {
      docs: { path: '/path/to/docs', pattern: '**/*.md' },
      notes: { path: '/path/to/notes' },
    },
  },
})

// 2. YAML config file
const store2 = await createStore({
  dbPath: './index.sqlite',
  configPath: './qmd.yml',
})

// 3. DB-only — reopen a previously configured store
const store3 = await createStore({ dbPath: './index.sqlite' })

Search

The unified search() method handles simple and pre-expanded queries:

// Simple query — auto-expanded via LLM, then BM25 + vector + reranking
const results = await store.search({ query: 'authentication flow' })

// With options
const results2 = await store.search({
  query: 'rate limiting',
  intent: 'API throttling and abuse prevention',
  collection: 'docs',
  limit: 5,
  minScore: 0.3,
  explain: true,
})

// Pre-expanded queries — skip auto-expansion, control each sub-query
const results3 = await store.search({
  queries: [
    { type: 'lex', query: '"connection pool" timeout -redis' },
    { type: 'vec', query: 'why do database connections time out under load' },
  ],
  collections: ['docs', 'notes'],
})

// Skip reranking for faster results
const fast = await store.search({ query: 'auth', rerank: false })

For direct backend access:

const lexResults = await store.searchLex('auth middleware', { limit: 10 })
const vecResults = await store.searchVector('how users log in', { limit: 10 })
const expanded = await store.expandQuery('auth flow', { intent: 'user login' })
const results4 = await store.search({ queries: expanded })

Retrieval

const doc = await store.get('docs/readme.md')
const byId = await store.get('#abc123')

if (!('error' in doc)) {
  console.log(doc.title, doc.displayPath, doc.context)
}

const body = await store.getDocumentBody('docs/readme.md', {
  fromLine: 50,
  maxLines: 100,
})

const { docs, errors } = await store.multiGet('docs/**/*.md', {
  maxBytes: 20480,
})

Collections

await store.addCollection('myapp', {
  path: '/src/myapp',
  pattern: '**/*.ts',
  ignore: ['node_modules/**', '*.test.ts'],
})

const collections = await store.listCollections()
const defaults = await store.getDefaultCollectionNames()
await store.removeCollection('myapp')
await store.renameCollection('old-name', 'new-name')

Context

Context adds descriptive metadata that improves search relevance and is returned alongside results:

await store.addContext('docs', '/api', 'REST API reference documentation')
await store.setGlobalContext('Internal engineering documentation')
const contexts = await store.listContexts()
await store.removeContext('docs', '/api')

Indexing

// Re-scan filesystem and update the index
const result = await store.update({
  collections: ['docs'],
  onProgress: ({ collection, file, current, total }) => {
    console.log(`[${collection}] ${current}/${total} ${file}`)
  },
})

// Generate vector embeddings
const embedResult = await store.embed({
  force: false,
  chunkStrategy: 'auto', // 'regex' | 'auto' (AST-aware for code files)
  onProgress: ({ current, total, collection }) => {
    console.log(`Embedding ${current}/${total}`)
  },
})

Types

import type {
  QMDStore,
  SearchOptions,
  HybridQueryResult,
  SearchResult,
  DocumentResult,
  MultiGetResult,
  EmbedResult,
  StoreOptions,
  CollectionConfig,
  IndexStatus,
} from '@tobilu/qmd'

The SDK requires an explicit dbPath — no default paths are assumed. This makes it safe to embed in any application without side effects.

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                         QMD Hybrid Search Pipeline                          │
└─────────────────────────────────────────────────────────────────────────────┘

                              ┌─────────────────┐
                              │   User Query    │
                              └────────┬────────┘
                                       │
                        ┌──────────────┴──────────────┐
                        ▼                             ▼
               ┌────────────────┐            ┌────────────────┐
               │ Query Expansion│            │ Original Query │
               │  (fine-tuned)  │            │   (×2 weight)  │
               └───────┬────────┘            └───────┬────────┘
                       │                             │
                       │ 2 alternative queries       │
                       └──────────────┬──────────────┘
                                      │
              ┌───────────────────────┼───────────────────────┐
              ▼                       ▼                       ▼
     ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
     │ Original Query  │     │ Expanded Query 1│     │ Expanded Query 2│
     └────────┬────────┘     └────────┬────────┘     └────────┬────────┘
              │                       │                       │
      ┌───────┴───────┐       ┌───────┴───────┐       ┌───────┴───────┐
      ▼               ▼       ▼               ▼       ▼               ▼
  ┌───────┐       ┌───────┐ ┌───────┐     ┌───────┐ ┌───────┐     ┌───────┐
  │ BM25  │       │Vector │ │ BM25  │     │Vector │ │ BM25  │     │Vector │
  │(FTS5) │       │Search │ │(FTS5) │     │Search │ │(FTS5) │     │Search │
  └───┬───┘       └───┬───┘ └───┬───┘     └───┬───┘ └───┬───┘     └───┬───┘
      │               │         │             │         │             │
      └───────┬───────┘         └──────┬──────┘         └──────┬──────┘
              │                        │                       │
              └────────────────────────┼───────────────────────┘
                                       │
                                       ▼
                          ┌───────────────────────┐
                          │   RRF Fusion + Bonus  │
                          │  Original query: ×2   │
                          │  Top-rank bonus: +0.05│
                          │     Top 30 Kept       │
                          └───────────┬───────────┘
                                      │
                                      ▼
                          ┌───────────────────────┐
                          │    LLM Re-ranking     │
                          │  (qwen3-reranker or   │
                          │   jina-reranker-v2)   │
                          └───────────┬───────────┘
                                      │
                                      ▼
                          ┌───────────────────────┐
                          │  Position-Aware Blend │
                          │  Top 1-3:  75% RRF    │
                          │  Top 4-10: 60% RRF    │
                          │  Top 11+:  40% RRF    │
                          └───────────────────────┘

Vector search uses embeddinggemma-300M locally by default, or jina-embeddings-v3 over the cloud API when opted in. Re-ranking uses qwen3-reranker-0.6b locally by default, or jina-reranker-v2-base-multilingual when opted in. Query expansion always runs locally, for latency reasons.

Score Normalization & Fusion

Search Backends

Backend	Raw Score	Conversion	Range
FTS (BM25)	SQLite FTS5 BM25	`Math.abs(score)`	0 to ~25+
Vector	Cosine distance	`1 / (1 + distance)`	0.0 to 1.0
Reranker	LLM 0-10 rating	`score / 10`	0.0 to 1.0
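
The conversions in the table can be sketched as three small functions. The function names are hypothetical and are not part of QMD's SDK; only the formulas come from the table.

```typescript
// Hypothetical helper names; only the formulas are taken from the table above.

// SQLite FTS5's bm25() returns lower (negative) values for better matches,
// so taking the magnitude turns it into "higher is better".
function normalizeBm25(raw: number): number {
  return Math.abs(raw);
}

// Cosine distance 0 (identical) maps to 1.0; larger distances approach 0.
function normalizeVector(distance: number): number {
  return 1 / (1 + distance);
}

// A 0-10 reranker rating is rescaled to the 0.0-1.0 range.
function normalizeRerank(rating: number): number {
  return rating / 10;
}

console.log(normalizeBm25(-12.5), normalizeVector(0.25), normalizeRerank(7));
```

Note that the BM25 magnitude is unbounded, which is one reason rank-based fusion is used to combine the backends rather than adding raw scores.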

Fusion Strategy

The query command uses Reciprocal Rank Fusion (RRF) with position-aware blending:

Query Expansion: Original query (×2 for weighting) + LLM-generated variations (two in the pipeline diagram above).
Parallel Retrieval: Each query searches both FTS and vector indexes.
RRF Fusion: Combine all result lists using score = Σ(1/(k+rank+1)) where k=60.
Top-Rank Bonus: Documents ranking #1 in any list get +0.05, #2-3 get +0.02.
Top-K Selection: Take top 30 candidates for reranking.
Re-ranking: LLM scores each document (yes/no with logprob confidence).
Position-Aware Blending:
- RRF rank 1-3: 75% retrieval, 25% reranker (preserves exact matches)
- RRF rank 4-10: 60% retrieval, 40% reranker
- RRF rank 11+: 40% retrieval, 60% reranker (trust reranker more)
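
The fusion, bonus, and top-K steps above can be sketched as follows. This is a minimal illustration, not QMD's implementation: it assumes 0-based ranks, assumes the ×2 weighting is applied by doubling the original query's contributions, and assumes the bonus is awarded once per document based on its best rank in any list. The `RankedList` type and `rrfFuse` name are hypothetical.

```typescript
// Hypothetical shape: one ranked list of document ids per (query, backend) pair.
interface RankedList {
  ids: string[]; // best match first
  weight: number; // assumption: 2 for lists from the original query, 1 otherwise
}

const RRF_K = 60;

function rrfFuse(lists: RankedList[], topK = 30): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  const bestRank = new Map<string, number>();

  for (const list of lists) {
    list.ids.forEach((id, rank) => {
      // rank is 0-based, so the top hit contributes weight / (k + 1).
      const contribution = list.weight / (RRF_K + rank + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
      bestRank.set(id, Math.min(bestRank.get(id) ?? Infinity, rank));
    });
  }

  const fused: Array<{ id: string; score: number }> = [];
  scores.forEach((score, id) => {
    // Top-rank bonus: +0.05 for a #1 placement in any list, +0.02 for #2-3.
    const best = bestRank.get(id) ?? Infinity;
    const bonus = best === 0 ? 0.05 : best <= 2 ? 0.02 : 0;
    fused.push({ id, score: score + bonus });
  });

  // Keep only the strongest candidates for the reranker.
  return fused.sort((a, b) => b.score - a.score).slice(0, topK);
}

const exampleLists: RankedList[] = [
  { ids: ["a", "b", "c"], weight: 2 }, // original query, BM25
  { ids: ["b", "a"], weight: 2 }, // original query, vector
  { ids: ["c", "d"], weight: 1 }, // expanded query, BM25
];
console.log(rrfFuse(exampleLists));
```

Because the bonus (0.05) is larger than any single RRF contribution at k=60 (at most 2/61 ≈ 0.033 here), a document that tops any list stays competitive even when the expanded queries miss it.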

Why this approach. Pure RRF can dilute exact matches when expanded queries don't match. The top-rank bonus preserves documents that score #1 for the original query. Position-aware blending prevents the reranker from destroying high-confidence retrieval results.
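
A minimal sketch of the position-aware blend, assuming 1-based RRF ranks and that both the retrieval score and the reranker score have already been normalized to the 0 to 1 range. The function names are hypothetical, and how QMD derives the retrieval score that enters the blend is not specified here.

```typescript
// Share of the final score that comes from retrieval, by 1-based RRF rank.
function retrievalWeight(rrfRank: number): number {
  if (rrfRank <= 3) return 0.75; // protect likely exact matches
  if (rrfRank <= 10) return 0.6;
  return 0.4; // trust the reranker more further down the list
}

// Assumption: both inputs are on a 0-1 scale.
function blendScore(rrfRank: number, retrievalScore: number, rerankScore: number): number {
  const w = retrievalWeight(rrfRank);
  return w * retrievalScore + (1 - w) * rerankScore;
}

// The same pair of scores yields a different final score depending on rank.
console.log(blendScore(1, 1.0, 0.2), blendScore(12, 1.0, 0.2));
```

With a retrieval score of 1.0 and a reranker score of 0.2, the blend gives roughly 0.8 at rank 1 but roughly 0.52 at rank 12, which is the "reranker cannot demote a top hit too far" behaviour described above.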

Score Interpretation

Score	Meaning
0.8 – 1.0	Highly relevant
0.5 – 0.8	Moderately relevant
0.2 – 0.5	Somewhat relevant
0.0 – 0.2	Low relevance

Requirements

Node.js ≥ 22
Bun ≥ 1.0.0 (optional but recommended for development)
macOS: Homebrew SQLite for extension support
```
brew install sqlite
```

Local Models

QMD uses three local GGUF models (auto-downloaded on first use):

Model	Purpose	Size
`embeddinggemma-300M-Q8_0`	Vector embeddings (default)	~300 MB
`qwen3-reranker-0.6b-q8_0`	Re-ranking	~640 MB
`qmd-query-expansion-1.7B-q4_k_m`	Query expansion (fine-tuned)	~1.1 GB

Models are downloaded from HuggingFace and cached in ~/.cache/qmd/models/.

Custom Embedding Model

Override the default local embedding model via QMD_EMBED_MODEL. This is useful for multilingual corpora where embeddinggemma-300M has limited coverage:

# Use Qwen3-Embedding-0.6B for multilingual (CJK) support
export QMD_EMBED_MODEL="hf:Qwen/Qwen3-Embedding-0.6B-GGUF/Qwen3-Embedding-0.6B-Q8_0.gguf"

# Re-embed all collections after switching
qmd embed -f

Supported local model families:

embeddinggemma (default) — English-optimized, small footprint
Qwen3-Embedding — Multilingual (119 languages including CJK), MTEB top-ranked

Note. When switching embedding models, you must re-index with qmd embed -f — vectors are not cross-compatible between models. The prompt format is auto-adjusted for each model family.

Remote Providers: Jina AI

QMD can delegate embedding and/or reranking to the Jina AI cloud API instead of running local GGUF models. You can enable either or both independently; query expansion always stays local.

When is this worth it?

The target machine is CPU-only and local models are too slow to index a large corpus.
You want stronger multilingual coverage (jina-embeddings-v3 supports 89 languages, including Vietnamese).
You want an 8192-token context window (vs 2048 for embeddinggemma), which matters for chunks containing long code blocks or conversational transcripts.
You are deploying QMD inside a constrained environment (serverless, minimal container) where bundling a GGUF model is wasteful.

Embedding only

export JINA_API_KEY="jina_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
export QMD_EMBED_PROVIDER=jina
qmd embed -f   # re-embed required on provider switch (dimensions differ)

Rerank only

export JINA_API_KEY="jina_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
export QMD_RERANK_PROVIDER=jina
# No re-embed needed — rerank runs only at query time

Both (fully remote for vector operations)

export JINA_API_KEY="jina_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
export QMD_EMBED_PROVIDER=jina
export QMD_RERANK_PROVIDER=jina
qmd embed -f

Per-index YAML config

Instead of env vars, you can specify jina:* model URIs directly in the per-index YAML (~/.config/qmd/<index>.yml). The API key must still come from JINA_API_KEY:

collections:
  notes:
    path: ~/Documents/notes
    pattern: "**/*.md"
models:
  embed: "jina:jina-embeddings-v3"
  rerank: "jina:jina-reranker-v2-base-multilingual"

This lets you run different indexes with different providers — e.g. a sensitive personal index on local models, and a public research index on Jina — by switching with --index <name>.

Dimension constraint. Within a single index, all collections must share the same vector dimension. You cannot mix jina-embeddings-v3 (1024 dim) and embeddinggemma (384 dim) in one index — the vectors_vec table has a fixed schema. Use separate indexes if you need mixed providers.

All Jina-related env vars

Variable	Default	Purpose
`QMD_EMBED_PROVIDER`	(unset)	Set to `jina` to enable remote embedding
`QMD_RERANK_PROVIDER`	(unset)	Set to `jina` to enable remote reranking
`JINA_API_KEY` (or `QMD_JINA_API_KEY`)	—	Required when either provider is `jina`
`QMD_JINA_MODEL`	`jina-embeddings-v3`	Jina embedding model name
`QMD_JINA_DIMENSION`	`1024`	Output dimension (Matryoshka truncation on v3)
`QMD_JINA_RERANK_MODEL`	`jina-reranker-v2-base-multilingual`	Jina reranker model name
`QMD_JINA_BASE_URL`	`https://api.jina.ai/v1`	Override for self-hosted / proxy endpoints
`QMD_JINA_BATCH`	`128`	Inputs per embed HTTP request
`QMD_JINA_CONCURRENCY`	`4`	Parallel requests in a batch
`QMD_JINA_TIMEOUT_MS`	`60000`	Per-request timeout
`QMD_JINA_MAX_RETRIES`	`4`	Retries on 429/5xx/network errors (exponential backoff + jitter)
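
For readers unfamiliar with the retry policy named in the last row, here is a sketch of exponential backoff with jitter. The base delay, the cap, the "full jitter" variant, and the helper names are all assumptions for illustration; the table only states that 429/5xx/network errors are retried with exponential backoff and jitter.

```typescript
// Assumption: a null status stands for a network error with no HTTP response.
function isRetryable(status: number | null): boolean {
  return status === null || status === 429 || (status >= 500 && status <= 599);
}

// "Full jitter" backoff: the ceiling doubles with each attempt (0-based) up
// to a cap, and the actual delay is drawn uniformly below that ceiling.
// baseMs and capMs are assumed values, not QMD's.
function backoffDelayMs(
  attempt: number,
  random: () => number = Math.random,
  baseMs = 500,
  capMs = 30000,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Delays for the four retries allowed by the default QMD_JINA_MAX_RETRIES.
for (let attempt = 0; attempt < 4; attempt++) {
  console.log(`retry ${attempt + 1}: wait ${backoffDelayMs(attempt)} ms`);
}
```

The jitter matters when several concurrent requests hit a 429 at once: randomized delays keep them from retrying in lockstep.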

Security. The API key is read from the environment only — never store it in a committed file. If you ever paste a key into a shared session or accidentally commit one, rotate it immediately at https://jina.ai/.

Caveats

Documents leave your machine — do not use with sensitive or proprietary content.
Requires network access; fully offline workflows must stay on local backends.
Switching the embedding provider requires re-indexing (qmd embed -f) because vector dimensions differ.
Query expansion always runs locally — Jina does not replace it.

Speed trade-offs

Batch indexing on a CPU-only machine: Jina is typically 5–20× faster. The API runs on dedicated GPUs and batches up to 128 inputs per request.
Query-time embedding (per search): Jina adds network latency (~200–500 ms round-trip). Local embeddings are faster for interactive search.
Query-time reranking: Jina reranker-v2 may be faster than local qwen3-reranker on weak hardware, but adds RTT. Consider keeping rerank local if you have a GPU.
GPU machine: Batch throughput is roughly comparable; local wins on latency.

Don't guess — measure. See qmd bench jina below.

Secrets & `.env` Workflow

This fork adds a disciplined secrets workflow so you can enable cloud providers without worrying about accidentally leaking keys to git.

`.env` file (recommended for local development)

QMD auto-loads a .env file from the current working directory or any parent directory (up to 5 levels) on every CLI invocation. .env is gitignored by default, so a plain git add will not stage it.

# 1. Copy the template
cp .env.example .env

# 2. Edit with your real values
$EDITOR .env

# 3. Run QMD — no `source` needed, the CLI picks it up automatically
qmd status

Example .env:

# Activate Jina for embeddings + reranking
QMD_EMBED_PROVIDER=jina
QMD_RERANK_PROVIDER=jina
JINA_API_KEY=jina_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Track your monthly quota
QMD_JINA_QUOTA=1B
QMD_JINA_WARN_PCT=70

Precedence. Shell environment variables always win over .env file values. This lets you override a single setting ad-hoc:

# Use the local backend for this one command, even though .env says jina
QMD_EMBED_PROVIDER=local qmd query "sensitive notes"

Custom location. Set QMD_ENV_FILE=/path/to/other.env to load from a non-default location — useful for per-environment configs:

QMD_ENV_FILE=.env.production qmd embed
QMD_ENV_FILE=.env.staging qmd query "..."

Format. The loader supports a subset of standard dotenv syntax:

# Simple
KEY=value

# Quoted (preserves whitespace, `#`, etc.)
KEY="value with spaces and # symbol"

# Single-quoted (raw, no escape interpretation)
KEY='literal \n stays as backslash-n'

# Escape sequences inside double quotes: \n \t \r \" \\
MULTILINE="line1\nline2"

# Optional `export` prefix (shell-compatible)
export KEY=value

# Comments
# This is a comment
KEY=value   # trailing comment on unquoted values

The loader is a zero-dependency custom implementation (~180 lines) in src/dotenv.ts, with 18 unit tests covering the parser's corner cases.
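
To make the format rules concrete, here is a compact parser sketch that handles the cases listed above, plus a merge step where shell values win. It is not the code in src/dotenv.ts: the function names are hypothetical, and the rule that a trailing comment needs whitespace before the `#` is an assumption.

```typescript
// Escape sequences recognised inside double quotes.
const ESCAPES: Record<string, string> = { n: "\n", t: "\t", r: "\r", '"': '"', "\\": "\\" };

function parseDotenv(text: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const rawLine of text.split(/\r?\n/)) {
    // Drop surrounding whitespace and an optional shell-style `export` prefix.
    const line = rawLine.trim().replace(/^export\s+/, "");
    if (line === "" || line.charAt(0) === "#") continue;

    const eq = line.indexOf("=");
    if (eq <= 0) continue; // skip lines that are not KEY=value
    const key = line.slice(0, eq).trim();
    const rest = line.slice(eq + 1).trim();

    const dq = /^"((?:\\.|[^"\\])*)"/.exec(rest);
    const sq = /^'([^']*)'/.exec(rest);
    if (dq) {
      // Double quotes: keep `#` and spaces, interpret the escapes above.
      out[key] = dq[1].replace(/\\(.)/g, (_m, ch: string) => ESCAPES[ch] ?? ch);
    } else if (sq) {
      // Single quotes: raw text, no escape interpretation.
      out[key] = sq[1];
    } else {
      // Unquoted: a `#` preceded by whitespace starts a trailing comment.
      out[key] = rest.replace(/\s+#.*$/, "");
    }
  }
  return out;
}

// Precedence: values already present in the shell environment are kept.
function mergeEnv(
  shell: Record<string, string | undefined>,
  file: Record<string, string>,
): Record<string, string | undefined> {
  const merged: Record<string, string | undefined> = { ...file };
  for (const key of Object.keys(shell)) {
    if (shell[key] !== undefined) merged[key] = shell[key];
  }
  return merged;
}

console.log(parseDotenv('export QMD_EMBED_PROVIDER=jina  # remote\nNOTE="a # b"'));
```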

Secret scanner

QMD ships a pre-commit hook (scripts/scan-secrets.sh) that blocks commits containing recognisable API key patterns:

Jina (jina_…)
OpenAI (sk-…)
Anthropic (sk-ant-…)
Voyage (pa-…)
Cohere (co-…)
GitHub tokens (ghp_…, ghs_…, gho_…, ghu_…, ghr_…)
AWS access keys (AKIA…)
PEM private key blocks
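
The idea behind prefix-based scanning can be sketched as below. The real scanner is the shell script scripts/scan-secrets.sh; this TypeScript version is only an illustration. The regular expressions and minimum lengths are assumptions, and the Voyage and Cohere prefixes are left out because their key shapes are not described here.

```typescript
interface Finding {
  file: string;
  line: number; // 1-based
  rule: string;
}

// Illustrative patterns only. Anthropic comes before OpenAI because a
// `sk-ant-` key would also match the broader `sk-` prefix.
const RULES: Array<{ rule: string; pattern: RegExp }> = [
  { rule: "jina", pattern: /jina_[A-Za-z0-9]{20,}/ },
  { rule: "anthropic", pattern: /sk-ant-[A-Za-z0-9_-]{20,}/ },
  { rule: "openai", pattern: /sk-[A-Za-z0-9_-]{20,}/ },
  { rule: "github", pattern: /gh[pousr]_[A-Za-z0-9]{20,}/ },
  { rule: "aws", pattern: /AKIA[0-9A-Z]{16}/ },
  { rule: "pem", pattern: /-----BEGIN [A-Z ]*PRIVATE KEY-----/ },
];

// Returns where a secret-like string appears, never the string itself.
function scanText(file: string, text: string): Finding[] {
  const findings: Finding[] = [];
  text.split(/\r?\n/).forEach((content, index) => {
    for (const { rule, pattern } of RULES) {
      if (pattern.test(content)) {
        findings.push({ file, line: index + 1, rule });
        break; // one finding per line is enough to block a commit
      }
    }
  });
  return findings;
}

function formatFinding(f: Finding): string {
  return `${f.file}:${f.line}  [${f.rule}]`;
}

const demo = scanText("config.ts", 'const a = 1;\nconst key = "jina_' + "x".repeat(32) + '";');
console.log(demo.map(formatFinding).join("\n"));
```

Reporting only the file, line, and rule name is what keeps the output safe to print in shared logs.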

Install the hooks:

./scripts/install-hooks.sh

Run the scanner manually at any time:

scripts/scan-secrets.sh                 # scan staged files (pre-commit mode)
scripts/scan-secrets.sh --tracked       # scan all files tracked by git
scripts/scan-secrets.sh --all <path>    # scan an arbitrary directory

The scanner reports file:line without re-leaking the key value. This is deliberate — running the scanner in a CI log stream does not re-expose the secret.

Defense in depth

Layer	Mechanism	Catches
0	Rotate key immediately at provider on any exposure	Keys already leaked
1	`.gitignore` excludes `.env`, `.env.*`, `*.key`, `*.pem`, `secrets/`, `credentials.json`	Accidental `git add`
2	`.env.example` template shows shape, never values	Copying the wrong file
3	Code reads env vars only — no path writes a key to a file	Logic errors persisting secrets
4	Secret scanner matches eight provider prefixes	Hardcoded keys in any file
5	Pre-commit hook auto-invokes the scanner on staged files	Forgetting to run the scanner manually

No single layer is 100%. Stacked, they approach it.

Usage Tracking & Quota Warnings

Every successful Jina API call records its token consumption to a local SQLite table (jina_usage). You can inspect your consumption with qmd usage.

`qmd usage` — human-readable summary

qmd usage

QMD Jina Usage

Embed provider:  jina (jina-embeddings-v3)
Rerank provider: jina (jina-reranker-v2-base-multilingual)

Quota (30d window)
  Used:       473.6K / 1.00B  (0.0%)  OK
  Remaining:  999.53M
  Warn threshold: 80%

Tokens consumed
  Last 24h:  473.6K
  Last 7d:   473.6K
  Last 30d:  473.6K
  All time:  473.6K

By operation
  embed_passage  jina-embeddings-v3                      142 calls    10.2M tok
  rerank         jina-reranker-v2-base-multilingual       89 calls     2.1M tok
  embed_query    jina-embeddings-v3                      412 calls    520.0K tok

`qmd usage --json` — stable schema for scripting

{
  "schema": "qmd.usage.v1",
  "generated_at": "2026-04-08T09:20:58.673Z",
  "provider": { "embed": "jina (jina-embeddings-v3)", "rerank": null },
  "totals": {
    "last_24h": 473650,
    "last_7d": 473650,
    "last_30d": 473650,
    "all_time": 473650
  },
  "by_operation": [
    { "operation": "embed_passage", "model": "jina-embeddings-v3", "calls": 142, "total_tokens": 10200000 }
  ],
  "quota": {
    "limit": 1000000000,
    "window": "30d",
    "used": 473650,
    "used_fraction": 0.00047365,
    "remaining": 999526350,
    "warn_fraction": 0.8,
    "severity": "ok"
  }
}

The severity field is designed for CI gates:

# Block a deploy when quota is critical or over
qmd usage --json | jq -e '.quota.severity == "ok" or .quota.severity == "warn"'

`qmd usage --csv` — spreadsheet export

operation,model,calls,total_tokens
embed_passage,jina-embeddings-v3,142,10200000
rerank,jina-reranker-v2-base-multilingual,89,2100000
embed_query,jina-embeddings-v3,412,520000

For totals and quota state, use --json — it includes everything.

`qmd usage chart` — ASCII histogram

qmd usage chart                 # last 30 days (default)
qmd usage chart --days 7        # last 7 days
qmd usage chart --json          # histogram as JSON

Jina daily usage (last 30 days, UTC)

  03-10  ███████▊                                     76.6K
  03-11                                                   0
  03-13  █████▎                                       51.6K
  03-24  █████████████████████████▌                  249.1K
  04-03  ████████████████████████████████████████    390.1K
  04-07  ████████                                     78.4K
  04-08                                                   0

Total: 1.86M  |  Active days: 21/30  |  Peak: 390.1K  |  Avg/active: 88.4K

Bars scale proportionally to the peak day so the shape is readable at any magnitude.
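
Peak-relative scaling can be sketched with Unicode eighth-block characters, as below. The 40-cell width and the round-down behaviour are assumptions inferred from the sample chart, and `renderBar` is a hypothetical name rather than QMD's function.

```typescript
// Left-aligned partial blocks for 1/8 through 7/8 of a cell (index 0 = none).
const PARTIALS = ["", "▏", "▎", "▍", "▌", "▋", "▊", "▉"];

// Render a bar whose length is value / peak of the full width.
function renderBar(value: number, peak: number, width = 40): string {
  if (peak <= 0 || value <= 0) return "";
  // Work in eighths of a cell and round down (assumption).
  const eighths = Math.floor((value / peak) * width * 8);
  return "█".repeat(Math.floor(eighths / 8)) + PARTIALS[eighths % 8];
}

const days: Array<[string, number]> = [
  ["03-10", 76600],
  ["03-24", 249100],
  ["04-03", 390100],
];
const peak = Math.max(...days.map(([, tokens]) => tokens));
for (const [day, tokens] of days) {
  console.log(`  ${day}  ${renderBar(tokens, peak)}`);
}
```

Because every bar is a fraction of the peak day, a month with a 390K peak and a month with a 3.9B peak produce the same shape.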

Quota warnings

Set QMD_JINA_QUOTA to your monthly token budget to enable quota warnings. QMD evaluates the quota against a rolling window (default 30d to match typical monthly plans):

export QMD_JINA_QUOTA=1B            # 1 billion tokens
export QMD_JINA_QUOTA_WINDOW=30d    # 24h | 7d | 30d | all
export QMD_JINA_WARN_PCT=80         # warn threshold (percent)

QMD_JINA_QUOTA accepts plain integers and short suffixes: 1000000000, 1B, 500M, 10k.
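
A sketch of how such values could be parsed. The `parseQuota` name is hypothetical, and treating suffixes as case-insensitive and allowing decimals such as `1.5M` are assumptions beyond the four examples given.

```typescript
// Multipliers for the short suffixes (matched case-insensitively).
const QUOTA_MULTIPLIERS: Record<string, number> = { k: 1e3, m: 1e6, b: 1e9 };

function parseQuota(input: string): number {
  const match = /^(\d+(?:\.\d+)?)([kKmMbB])?$/.exec(input.trim());
  if (!match) throw new Error(`Invalid quota value: ${input}`);
  const suffix = match[2] ? match[2].toLowerCase() : "";
  const multiplier = suffix ? QUOTA_MULTIPLIERS[suffix] : 1;
  // Round so that decimal inputs still produce a whole number of tokens.
  return Math.round(Number(match[1]) * multiplier);
}

for (const value of ["1000000000", "1B", "500M", "10k"]) {
  console.log(value, "=>", parseQuota(value));
}
```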

Severity levels

Level	Trigger	Display
`ok`	below warn threshold	green, no alert
`warn`	≥ warn threshold (default 80%)	yellow ⚠ with remaining
`critical`	≥ 95%	red ⚠ with remaining
`over`	> 100%	red, "Over quota by N"
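
The thresholds in the table translate into a small classification function. This is a sketch under the assumption that levels are checked from most to least severe; `quotaSeverity` is a hypothetical name.

```typescript
type Severity = "ok" | "warn" | "critical" | "over";

// Classify usage against a quota. Comparisons are done on used * 100 versus
// limit * pct to avoid floating-point surprises right at a threshold.
function quotaSeverity(used: number, limit: number, warnPct = 80): Severity {
  if (used > limit) return "over"; // strictly more than 100%
  if (used * 100 >= limit * 95) return "critical";
  if (used * 100 >= limit * warnPct) return "warn";
  return "ok";
}

console.log(quotaSeverity(473650, 1000000000)); // usage from the sample report
```

Note that exactly 100% of the quota is still `critical`; `over` only applies once the limit has been exceeded.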

`qmd usage reset`

Clear the usage history. Useful when you start a new billing period and want a clean slate:

qmd usage reset

qmd status shows a compact quota summary with colour-coded warnings when a remote provider is active and quota is configured.

Benchmarking (`qmd bench jina`)

Run a reproducible latency + throughput benchmark against your machine and your network. Designed to answer a single question:

"Is Jina actually faster for my setup?"

Because the answer depends on local hardware (CPU / GPU / RAM), geographic network RTT, workload (batch vs query-time), and content mix (code / prose / CJK), any universal claim in documentation will be wrong for some subset of users. qmd bench jina lets you stop trusting documentation and start trusting your own numbers.

Basic usage

qmd bench jina                        # both local and jina, 50 docs
qmd bench jina --size 200             # larger workload
qmd bench jina --provider jina        # remote only
qmd bench jina --provider local       # local only (zero API cost)
qmd bench jina --runs 5               # 5 runs with statistical summary
qmd bench jina --json                 # machine-readable output

Flags

Flag	Default	Purpose
`--size <n>`	`50`	Number of synthetic documents
`--doc-len <n>`	`500`	Characters per synthetic doc
`--provider <p>`	`both`	`local` \| `jina` \| `both`
`--runs <n>`	`1`	Repeat the full workload N times to reduce noise (max 100)
`--skip-rerank`	off	Skip the rerank stage
`--skip-single`	off	Skip the single-embed latency stage
`--json`	off	Emit `qmd.bench.jina.v1` JSON

What it measures

Three stages per provider:

embed_single — median single-embed latency over N samples (default 5).
embed_batch — wall time + throughput for one batch of all documents.
rerank — latency for one rerank call with the query + all documents.

With --runs > 1, every stage's samples are flattened across runs and summarised with median, mean, standard deviation, p95, min, and max. Stages with stddev > 20% of median are highlighted in yellow ("this measurement is noisy — re-run with more samples").
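
A sketch of the summary statistics described above. Which stddev variant (sample vs population) and which percentile method QMD uses is not stated, so the sample standard deviation and nearest-rank p95 below are assumptions, as are the function names.

```typescript
interface StageSummary {
  median: number;
  mean: number;
  stddev: number;
  p95: number;
  min: number;
  max: number;
  noisy: boolean; // stddev above 20% of the median
}

// Flatten the per-run sample arrays into a single list.
function flattenRuns(runs: number[][]): number[] {
  return runs.reduce<number[]>((all, run) => all.concat(run), []);
}

function summarize(samplesMs: number[]): StageSummary {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((sum, x) => sum + x, 0) / n;
  const mid = Math.floor(n / 2);
  const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // Sample standard deviation (divides by n - 1); 0 for a single sample.
  const sumSq = sorted.reduce((sum, x) => sum + (x - mean) * (x - mean), 0);
  const stddev = n > 1 ? Math.sqrt(sumSq / (n - 1)) : 0;
  // Nearest-rank percentile: the smallest value covering 95% of samples.
  const p95 = sorted[Math.min(n - 1, Math.ceil(0.95 * n) - 1)];
  return { median, mean, stddev, p95, min: sorted[0], max: sorted[n - 1], noisy: stddev > 0.2 * median };
}

console.log(summarize(flattenRuns([[310, 295], [330, 305], [420, 300]])));
```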

Example output

jina (jina-embeddings-v3)
  stage            median       p95      stddev   throughput    items
  embed (single)     312.4ms    405.2ms    ±42.8ms         3.2/s       5
  embed (batch)      981.3ms   1024.5ms    ±87.1ms        51.0/s      50
  rerank             287.1ms    312.8ms    ±19.2ms       174.2/s      50
  Total wall time: 7890ms

Stats computed over 5 runs. Stddev in yellow = high variance (>20% of median).

Comparison mode

When both providers run, a Comparison table shows the speedup ratio per stage and picks a winner:

Comparison (local / jina)
  stage            local       jina       speedup    winner
  embed (single)    45.2ms    312.4ms       0.14×    local
  embed (batch)   8432.1ms    981.3ms       8.59×    jina
  rerank           654.3ms    287.1ms       2.28×    jina
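
The speedup column is the local time divided by the Jina time, so values above 1 favour Jina. A minimal sketch that recomputes the table's ratios; `compareStage` is a hypothetical name, and giving ties to local is an assumption.

```typescript
interface StageComparison {
  speedup: number; // local / jina; above 1 means jina was faster
  winner: "local" | "jina";
}

function compareStage(localMs: number, jinaMs: number): StageComparison {
  const speedup = localMs / jinaMs;
  return { speedup, winner: speedup > 1 ? "jina" : "local" };
}

const stages: Array<[string, number, number]> = [
  ["embed (single)", 45.2, 312.4],
  ["embed (batch)", 8432.1, 981.3],
  ["rerank", 654.3, 287.1],
];
for (const [stage, localMs, jinaMs] of stages) {
  const { speedup, winner } = compareStage(localMs, jinaMs);
  console.log(`${stage}: ${speedup.toFixed(2)}x, winner ${winner}`);
}
```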

⚠ API cost

qmd bench jina with --provider jina or --provider both hits the real Jina API and consumes tokens from your quota. The default workload (50 × 500 chars) costs under 10k tokens, well under a cent on typical plans, but if you raise --size substantially the consumption will show up in qmd usage afterwards.

Installation

End users

npm install -g @tobilu/qmd
# or
bun install -g @tobilu/qmd

Development

git clone https://github.com/tobi/qmd
cd qmd
bun install            # or npm install
./scripts/install-hooks.sh   # installs pre-commit secret scanner + pre-push validator
cp .env.example .env   # configure local settings (optional)
bun link               # makes the `qmd` command available globally

Run tests:

bun test --preload ./src/test-preload.ts test/
# or
npx vitest run --reporter=verbose test/

Type-check:

node_modules/.bin/tsc -p tsconfig.build.json --noEmit

Command Reference

Collections

qmd collection add <path> --name <name> [--mask <glob>]
qmd collection list
qmd collection remove <name>
qmd collection rename <old> <new>
qmd ls [collection[/path]]

Context

qmd context add [path] "description"
qmd context add qmd://collection/path "description"
qmd context list
qmd context check                 # find collections/paths missing context
qmd context rm <path>

Documents

qmd get <file>[:line] [-l <max-lines>] [--from <line>]
qmd get "#abc123"                 # by docid
qmd multi-get "glob/pattern/*.md" [--max-bytes N]

Search

qmd search "query"                # BM25 full-text only (fast)
qmd vsearch "query"               # Vector semantic only (no rerank)
qmd query "query"                 # Hybrid + query expansion + rerank (best quality)

Common flags:

-n <num>              Number of results (default 5, 20 for --files/--json)
-c, --collection      Restrict to one or more collections
--all                 Return all matches (use with --min-score)
--min-score <num>     Minimum score threshold
--full                Show full document content
--line-numbers        Add line numbers to output
--explain             Include retrieval score traces
--index <name>        Use a named index (~/.config/qmd/<name>.yml)
--json | --csv | --md | --xml | --files   Output formats

Index maintenance

qmd status                        # health, collections, provider, quota
qmd update [--pull]               # re-scan filesystem (optionally git pull first)
qmd embed [-f] [--chunk-strategy auto|regex]
qmd pull [--refresh]              # download GGUF models (skips remote providers)
qmd cleanup                       # clear caches, vacuum DB

Usage & benchmarking (this fork)

qmd usage [--json|--csv]          # token consumption summary
qmd usage chart [--days N]        # ASCII daily histogram
qmd usage reset                   # clear usage history
qmd bench jina [options]          # latency/throughput benchmark

MCP server

qmd mcp                           # stdio transport (default)
qmd mcp --http [--port N]         # HTTP transport
qmd mcp --http --daemon           # background daemon
qmd mcp stop                      # stop background daemon

Data Storage

The index is stored at ~/.cache/qmd/index.sqlite by default (the cache directory follows XDG_CACHE_HOME). Schema overview:

collections     -- Indexed directories with name and glob patterns
path_contexts   -- Context descriptions by virtual path (qmd://...)
documents       -- Markdown content with metadata and docid (6-char hash)
documents_fts   -- FTS5 full-text index
content_vectors -- Embedding chunks (hash, seq, pos, 900 tokens each)
vectors_vec     -- sqlite-vec vector index (hash_seq key)
llm_cache       -- Cached LLM responses (query expansion, rerank scores)
jina_usage      -- THIS FORK: append-only log of Jina API token usage

The jina_usage table is additive — upstream users who don't enable remote providers will never write a row to it.

Environment Variables

Variable	Default	Description
Core
`XDG_CACHE_HOME`	`~/.cache`	Cache directory location
`QMD_EMBED_MODEL`	`embeddinggemma-300M`	Override the local embedding model URI (HF format)
`QMD_GENERATE_MODEL`	(fine-tuned query expansion)	Override local generation model
`QMD_RERANK_MODEL`	`qwen3-reranker-0.6B`	Override local rerank model
`QMD_EMBED_CONTEXT_SIZE`	`2048`	Local embed model context window
`QMD_LLAMA_GPU`	`auto`	Set to `false` to force CPU inference
Remote providers (this fork)
`QMD_EMBED_PROVIDER`	(unset)	Set to `jina` to use Jina AI instead of a local model
`QMD_RERANK_PROVIDER`	(unset)	Set to `jina` to use Jina reranker
`JINA_API_KEY`	—	Required when any `*_PROVIDER=jina`
`QMD_JINA_MODEL`	`jina-embeddings-v3`	Jina embedding model name
`QMD_JINA_RERANK_MODEL`	`jina-reranker-v2-base-multilingual`	Jina reranker model name
`QMD_JINA_DIMENSION`	`1024`	Output dimension (Matryoshka truncation on v3)
`QMD_JINA_BASE_URL`	`https://api.jina.ai/v1`	Override for self-hosted / proxy
`QMD_JINA_BATCH`	`128`	Inputs per embed HTTP request
`QMD_JINA_CONCURRENCY`	`4`	Parallel requests in a batch
`QMD_JINA_TIMEOUT_MS`	`60000`	Per-request timeout
`QMD_JINA_MAX_RETRIES`	`4`	Retries on 429/5xx errors
Quota tracking (this fork)
`QMD_JINA_QUOTA`	(unset)	Monthly token quota (e.g. `1B`, `500M`, `10k`)
`QMD_JINA_QUOTA_WINDOW`	`30d`	Rolling window: `24h` | `7d` | `30d` | `all`
`QMD_JINA_WARN_PCT`	`80`	Warn threshold as percent of quota
Dotenv (this fork)
`QMD_ENV_FILE`	(auto-search)	Explicit path to a `.env` file; skips the upward walk
Misc
`QMD_EDITOR_URI`	`vscode://file/{path}:{line}:{col}`	Editor link template for TTY output
`NO_COLOR`	(unset)	Disable colourised output
`CI`	(unset)	Set to `true` to disable all LLM operations
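To illustrate how the quota variables fit together, here is a minimal sketch of parsing a value such as `1B`, `500M`, or `10k` and checking it against the warn threshold. `parseQuota` and `shouldWarn` are hypothetical helpers, not QMD's API; the accepted suffixes and the "greater than or equal" comparison are assumptions.

```typescript
// Hypothetical parser for values such as "1B", "500M", "10k"; not QMD's actual code.
const SUFFIX_MULTIPLIERS: Record<string, number> = { k: 1e3, m: 1e6, b: 1e9 };

function parseQuota(raw: string): number {
  const match = /^(\d+(?:\.\d+)?)([kmb])?$/i.exec(raw.trim());
  if (!match) throw new Error(`invalid quota: ${raw}`);
  const value = Number(match[1]);
  const suffix = match[2]?.toLowerCase();
  const multiplier = suffix ? (SUFFIX_MULTIPLIERS[suffix] ?? 1) : 1;
  return Math.round(value * multiplier);
}

// Assumption: warn once usage reaches warnPct percent of the quota.
function shouldWarn(usedTokens: number, quota: number, warnPct = 80): boolean {
  return usedTokens >= (quota * warnPct) / 100;
}

const quota = parseQuota("500M");
console.log(shouldWarn(410_000_000, quota)); // true: 410M is past 80% of 500M
```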

How It Works

Indexing flow

Collection ──► Glob Pattern ──► Markdown Files ──► Parse Title ──► Hash Content
    │                                                   │              │
    │                                                   │              ▼
    │                                                   │         Generate docid
    │                                                   │         (6-char hash)
    │                                                   │              │
    └──────────────────────────────────────────────────►└──► Store in SQLite
                                                                       │
                                                                       ▼
                                                                  FTS5 Index

Embedding flow

Documents are chunked into ~900-token pieces with 15% overlap, using smart boundary detection that prefers markdown headings over arbitrary text positions:

Document ──► Smart Chunk (~900 tokens) ──► Format chunk ──► Embed ──► Store Vectors
                │                          "title | text"     │
                │                                               │
                │                       (local: node-llama-cpp)
                │                       (remote: Jina API)
                │
                └─► Each chunk stored with:
                    - hash: document hash
                    - seq:  chunk sequence (0, 1, 2, …)
                    - pos:  character position in original
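The overlap arithmetic and the "title | text" formatting can be sketched as below. This is a simplified illustration with fixed-size windows measured in tokens; it ignores the smart boundary search, and `chunkWindows` and `formatChunk` are hypothetical names rather than QMD functions.

```typescript
// Illustrative only: fixed-size windows, ignoring the smart boundary search.
const CHUNK_TOKENS = 900;
const OVERLAP_RATIO = 0.15;

interface ChunkWindow {
  seq: number;   // chunk sequence (0, 1, 2, ...)
  start: number; // first token offset, inclusive
  end: number;   // last token offset, exclusive
}

function chunkWindows(totalTokens: number): ChunkWindow[] {
  // With 15% overlap, each window starts 765 tokens after the previous one.
  const stride = Math.round(CHUNK_TOKENS * (1 - OVERLAP_RATIO));
  const windows: ChunkWindow[] = [];
  for (let start = 0, seq = 0; start < totalTokens; start += stride, seq++) {
    const end = Math.min(start + CHUNK_TOKENS, totalTokens);
    windows.push({ seq, start, end });
    if (end === totalTokens) break; // the last window reached the end of the document
  }
  return windows;
}

// The text sent to the embedder is prefixed with the document title.
function formatChunk(title: string, text: string): string {
  return `${title} | ${text}`;
}

console.log(chunkWindows(2000));
console.log(formatChunk("Meeting notes", "Action items for next week"));
```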

Smart chunking

Instead of cutting at hard token boundaries, QMD uses a scoring algorithm to find natural markdown break points. This keeps semantic units (sections, paragraphs, code blocks) together.

Break point scores:

Pattern	Score	Description
`# Heading`	100	H1 — major section
`## Heading`	90	H2 — subsection
`### Heading`	80	H3
`#### Heading`	70	H4
`##### Heading`	60	H5
`###### Heading`	50	H6
`` ``` ``	80	Code block boundary
`---` / `***`	60	Horizontal rule
Blank line	20	Paragraph boundary
`- item` / `1. item`	5	List item
Line break	1	Minimal break

Algorithm:

Scan the document for all break points with scores.
When approaching the 900-token target, search a 200-token window before the cutoff.
Score each break point: finalScore = baseScore × (1 - (distance/window)² × 0.7).
Cut at the highest-scoring break point.

The squared-distance decay means an H1 heading 200 tokens back (score 100 × 0.3 = 30) still beats a simple line break at the target (score 1), but a closer heading wins over a distant one.
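The scoring rule can be sketched in a few lines. The formula and the 200-token window come from the steps above; the `BreakPoint` type and the function names are hypothetical, and the sketch measures distance in tokens from the target cutoff.

```typescript
// Hypothetical sketch of the scoring rule; names are illustrative, not QMD's API.
interface BreakPoint {
  pos: number;       // token offset of the candidate break
  baseScore: number; // e.g. 100 for H1, 20 for a blank line, 1 for a line break
}

const WINDOW = 200; // tokens searched before the cutoff
const DECAY = 0.7;

function finalScore(bp: BreakPoint, target: number): number {
  const distance = target - bp.pos;
  return bp.baseScore * (1 - (distance / WINDOW) ** 2 * DECAY);
}

function bestBreak(points: BreakPoint[], target: number): BreakPoint | undefined {
  let best: BreakPoint | undefined;
  let bestScore = -Infinity;
  for (const bp of points) {
    if (bp.pos > target || bp.pos < target - WINDOW) continue; // outside the window
    const score = finalScore(bp, target);
    if (score > bestScore) {
      best = bp;
      bestScore = score;
    }
  }
  return best;
}

// An H1 at the far edge of the window still beats a bare line break at the target.
const choice = bestBreak(
  [
    { pos: 700, baseScore: 100 },
    { pos: 900, baseScore: 1 },
  ],
  900,
);
console.log(choice?.pos); // 700
```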

Code fence protection: break points inside code blocks are ignored — code stays together. If a code block exceeds the chunk size, it is kept whole when possible.
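One way to picture code fence protection is to mark every line that sits between an opening and closing fence, then drop candidate breaks on those lines. The sketch below is an assumption about how this could be done, not QMD's implementation; `insideFence` and `allowedBreaks` are hypothetical names, and it only handles triple-backtick fences.

```typescript
// Hypothetical sketch, not QMD's code: flag lines that sit inside a fenced block.
const FENCE = "`".repeat(3); // built at runtime to avoid a literal fence in this example

function insideFence(lines: string[]): boolean[] {
  let open = false;
  return lines.map((line) => {
    if (line.trim().indexOf(FENCE) === 0) {
      open = !open;
      return false; // fence lines themselves remain valid break points
    }
    return open;
  });
}

// Keep only candidate break lines that are not inside a code block.
function allowedBreaks(lines: string[], candidates: number[]): number[] {
  const inside = insideFence(lines);
  return candidates.filter((lineIndex) => !inside[lineIndex]);
}

const sample = ["# Title", FENCE + "ts", "const a = 1;", "", "const b = 2;", FENCE, "", "Text"];
console.log(allowedBreaks(sample, [0, 3, 6])); // [ 0, 6 ]
```

The blank line at index 3 is rejected because it falls inside the fenced block, so the code stays in one chunk.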

AST-aware chunking (code files)

For supported code files (.ts, .tsx, .js, .jsx, .py, .go, .rs), QMD optionally parses the source with tree-sitter and adds AST-derived break points that are merged with the regex scores above:

AST node	Score
Class / interface / struct / impl / trait	100
Function / method	90
Type alias / enum	80
Import / use declaration	60

Enable with --chunk-strategy auto. Markdown and other file types always use regex chunking regardless of the strategy flag.

Note. Tree-sitter grammars are optional dependencies. If they are not installed, --chunk-strategy auto falls back to regex-only chunking automatically. Tested on both Node.js and Bun.
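How the AST break points are merged with the regex scores is not spelled out here. One plausible reading, shown purely as an assumption, is to keep the higher score when both sources propose a break at the same position. `ScoredBreak` and `mergeBreakPoints` are hypothetical names.

```typescript
// Assumption: when regex and AST both mark a position, the higher score wins.
interface ScoredBreak {
  pos: number;   // offset of the candidate break
  score: number; // base score before distance decay
}

function mergeBreakPoints(regexPoints: ScoredBreak[], astPoints: ScoredBreak[]): ScoredBreak[] {
  const byPos = new Map<number, number>();
  for (const p of regexPoints.concat(astPoints)) {
    const existing = byPos.get(p.pos);
    if (existing === undefined || p.score > existing) byPos.set(p.pos, p.score);
  }
  return Array.from(byPos, ([pos, score]) => ({ pos, score })).sort((a, b) => a.pos - b.pos);
}

// A blank line (20) that coincides with a function start (90) is promoted to 90.
const merged = mergeBreakPoints(
  [{ pos: 40, score: 20 }, { pos: 10, score: 1 }],
  [{ pos: 40, score: 90 }, { pos: 75, score: 100 }],
);
console.log(merged);
```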

Query flow (hybrid)

Query ──► LLM Expansion ──► [Original, Variant 1, Variant 2]
                │
      ┌─────────┴─────────┐
      ▼                   ▼
   For each query:     FTS (BM25)
      │                   │
      ▼                   ▼
   Vector Search      Ranked List
      │
      ▼
   Ranked List
      │
      └─────────┬─────────┘
                ▼
         RRF Fusion (k=60)
         Original query ×2 weight
         Top-rank bonus: +0.05/#1, +0.02/#2–3
                │
                ▼
         Top 30 candidates
                │
                ▼
         LLM Re-ranking
         (yes/no + logprob confidence)
                │
                ▼
         Position-Aware Blend
         Rank 1-3:  75% RRF / 25% reranker
         Rank 4-10: 60% RRF / 40% reranker
         Rank 11+:  40% RRF / 60% reranker
                │
                ▼
         Final Results
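The fusion and blend stages in the diagram can be sketched as follows. The constants (k=60, ×2 weight for the original query, the +0.05 and +0.02 bonuses, and the blend ratios) come from the diagram; everything else is an assumption made for illustration: the names `rrfFuse` and `blend`, applying the bonus to a document's best rank in any list, and both scores being normalised to [0, 1] before blending.

```typescript
// Hypothetical sketch of the fusion and blend stages; not QMD's actual implementation.
interface RankedList {
  ids: string[];  // document ids, best first
  weight: number; // 2 for lists from the original query, 1 for expansions
}

const RRF_K = 60;

function rrfFuse(lists: RankedList[]): Map<string, number> {
  const scores = new Map<string, number>();
  const bestRank = new Map<string, number>();
  for (const list of lists) {
    list.ids.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + list.weight / (RRF_K + rank));
      bestRank.set(id, Math.min(bestRank.get(id) ?? Infinity, rank));
    });
  }
  // Assumption: the bonus is keyed on a document's best rank in any list.
  bestRank.forEach((rank, id) => {
    const bonus = rank === 1 ? 0.05 : rank <= 3 ? 0.02 : 0;
    scores.set(id, (scores.get(id) ?? 0) + bonus);
  });
  return scores;
}

// rrfRank is the 1-based position after fusion; both scores are assumed to lie in [0, 1].
function blend(rrfRank: number, rrfScore: number, rerankScore: number): number {
  const rrfWeight = rrfRank <= 3 ? 0.75 : rrfRank <= 10 ? 0.6 : 0.4;
  return rrfWeight * rrfScore + (1 - rrfWeight) * rerankScore;
}

const fused = rrfFuse([
  { ids: ["a", "b"], weight: 2 }, // original query, BM25
  { ids: ["b", "a"], weight: 1 }, // expansion variant, vector search
]);
console.log(fused);
console.log(blend(1, 0.9, 0.2), blend(12, 0.9, 0.2));
```

The blend weights mean the reranker can reorder the tail freely but has limited influence over the top three fused results.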

Credits & License

Original QMD

The hybrid retrieval pipeline, smart chunking, MCP server, SDK, and every pre-existing piece of this codebase are the work of Tobi Lutke — upstream author of tobi/qmd. This fork would not exist without that foundation. Huge credit.

Fork enhancements

The polymorphic provider backend, Jina AI integration (embedding + reranking), .env loader, secret scanner, usage tracking, quota warnings, ASCII histogram, benchmarking harness, and this documentation were added by:

Nguyen Ngoc Tuan (Founder, Transform Group; Lark Platinum Partner)

All changes are additive and opt-in — if you set no new environment variables and touch no new YAML fields, this fork behaves identically to upstream. See docs/01-architecture-changes.md for the full design rationale behind each change.

License

MIT — identical to upstream. See LICENSE.

