zulcrawl mirrors Zulip org data into local SQLite and provides fast FTS5 search.
- TOML config (`~/.zulcrawl/config.toml`) + env overrides (`ZULIP_URL`, `ZULIP_EMAIL`, `ZULIP_API_KEY`)
- Zulip API client (auth, streams, users, messages + narrow filters + pagination + rate-limit retries)
- SQLite schema from `SPEC.md` (including FTS5 tables + sync state)
- Cobra CLI commands:
  - `init`
  - `doctor`
  - `sync --full --streams --since [--quiet] [--with-media]`
  - `search`
  - `topics`: list topics or topic-level hybrid search
  - `topics search`: hybrid search using FTS on topic names + message content (see below)
  - `stats`
  - `sql`
  - `messages`: direct archive slice queries (see below)
  - `digest`: local-only topic activity digest (see below)
  - `attachments` / `attachments fetch`: list indexed uploads and cache Zulip media locally
  - `backfill-indexes`: rebuild mention/attachment indexes for existing messages
  - `embeddings backfill`: build/update Ollama topic embeddings (disabled by default)
  - `embeddings status`: show embedding coverage
  - `topics search --semantic`: vector/semantic topic search when embeddings are enabled
- Syncer with parallel stream workers and incremental sync (`sync_state`)
- HTML-to-text conversion for indexing
- Search ranking with FTS BM25, resolved-topic boost, and recency boost
- Topic-level hybrid search: two-leg FTS (topic names + message content) with activity/recency scoring
- Mention and attachment indexes parsed from rendered HTML
Attachment indexing covers /user_uploads/... links found in Zulip-rendered
message HTML, plus any text already present in inline preview blocks. zulcrawl
can also download the original upload bytes into a local-only cache using the
same Zulip API credentials:
zulcrawl attachments # list indexed attachments and cache status
zulcrawl attachments --status pending --stream engineering
zulcrawl attachments fetch # download pending/error uploads
zulcrawl sync --with-media # sync, then fetch media

Cached media defaults to ~/.zulcrawl/media and is tracked in SQLite with path,
status, fetched timestamp, content type, byte count, and last error. It is never
published or uploaded by zulcrawl.
Full file-content extraction/parsing (PDFs, images, Office documents, etc.) is still a future enhancement; downloaded media is cached as bytes only.
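
As an illustration of the attachment indexing described above, here is a minimal Go sketch that pulls `/user_uploads/...` paths out of rendered message HTML. It is not zulcrawl's actual parser: the function name `extractUploadPaths` and the regular-expression approach are assumptions made for the example, and a real implementation may walk the HTML tree instead.

```go
package main

import (
	"fmt"
	"regexp"
)

// uploadAttr matches href or src attributes whose value starts with
// /user_uploads/. Using a regexp here is an assumption for brevity.
var uploadAttr = regexp.MustCompile(`(?:href|src)="(/user_uploads/[^"]+)"`)

// extractUploadPaths (hypothetical helper) returns the distinct
// /user_uploads/... paths found in rendered HTML, in order of first appearance.
func extractUploadPaths(renderedHTML string) []string {
	seen := make(map[string]bool)
	var paths []string
	for _, m := range uploadAttr.FindAllStringSubmatch(renderedHTML, -1) {
		if !seen[m[1]] {
			seen[m[1]] = true
			paths = append(paths, m[1])
		}
	}
	return paths
}

func main() {
	html := `<p><a href="/user_uploads/2/ab/report.pdf">report.pdf</a></p>`
	fmt.Println(extractUploadPaths(html))
}
```

Links that point elsewhere are ignored, and a path referenced twice in one message is indexed once.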
go build ./cmd/zulcrawl

~/.zulcrawl/config.toml:
[zulip]
url = "https://your-org.zulipchat.com"
email = "bot@your-org.zulipchat.com"
api_key = ""
[database]
path = "~/.zulcrawl/zulcrawl.db"
[media]
cache_dir = "~/.zulcrawl/media"
[sync]
concurrency = 4
streams = []
exclude_streams = ["social", "random"]

Environment variables override file values:
- `ZULIP_URL`
- `ZULIP_EMAIL`
- `ZULIP_API_KEY`
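
The override order can be sketched in Go as follows. This is an illustration only: the `ZulipConfig` struct and `applyEnvOverrides` function are hypothetical names, and treating an empty environment variable as "not set" is an assumption rather than documented zulcrawl behaviour.

```go
package main

import (
	"fmt"
	"os"
)

// ZulipConfig mirrors the [zulip] table of config.toml (hypothetical type).
type ZulipConfig struct {
	URL    string
	Email  string
	APIKey string
}

// applyEnvOverrides returns cfg with any non-empty ZULIP_* variable applied
// on top of the file values. lookup is injectable so it can be tested
// without touching the real environment.
func applyEnvOverrides(cfg ZulipConfig, lookup func(string) (string, bool)) ZulipConfig {
	if v, ok := lookup("ZULIP_URL"); ok && v != "" {
		cfg.URL = v
	}
	if v, ok := lookup("ZULIP_EMAIL"); ok && v != "" {
		cfg.Email = v
	}
	if v, ok := lookup("ZULIP_API_KEY"); ok && v != "" {
		cfg.APIKey = v
	}
	return cfg
}

func main() {
	fileCfg := ZulipConfig{
		URL:   "https://your-org.zulipchat.com",
		Email: "bot@your-org.zulipchat.com",
	}
	cfg := applyEnvOverrides(fileCfg, os.LookupEnv)
	// Avoid printing the API key itself; only report whether one is set.
	fmt.Println(cfg.URL, cfg.Email, cfg.APIKey != "")
}
```

Keeping `api_key` empty in the file and supplying `ZULIP_API_KEY` from the environment keeps the secret out of the config file.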
# Initialize config + DB
zulcrawl init --url https://your-org.zulipchat.com
# Health checks
zulcrawl doctor
# Full sync of all selected streams
zulcrawl sync --full
# Sync only specific streams
zulcrawl sync --streams general,engineering
# Sync only messages from date onward (YYYY-MM-DD)
zulcrawl sync --since 2026-01-01
# Sync silently (suppress progress lines, show only final summary or errors)
zulcrawl sync --quiet
# Sync and then cache indexed Zulip upload bytes locally
zulcrawl sync --with-media
# List and fetch attachment media
zulcrawl attachments --status pending
zulcrawl attachments fetch
# Search
zulcrawl search "database migration"
zulcrawl search --stream engineering --resolved "deploy script"
# Topic listing
zulcrawl topics --stream engineering
zulcrawl topics --unresolved
# Topic-level hybrid search (FTS on topic names + message content)
# Finds topics whose name or messages match the query
zulcrawl topics search "database migration"
zulcrawl topics search --stream engineering "deploy script"
zulcrawl topics search --unresolved --limit 10 "onboarding"
# Local topic digest
zulcrawl digest --stream engineering --since 2026-05-01
zulcrawl digest --stream engineering --since 2026-05-01 --until 2026-05-15 --json
# Stats
zulcrawl stats
# Raw SQL
zulcrawl sql "SELECT stream_id, COUNT(*) FROM messages GROUP BY stream_id ORDER BY 2 DESC LIMIT 10"

zulcrawl topics search <query> performs a hybrid topic-level search over the
local archive. It runs two legs:
- Topic-name FTS (`topics_fts`): finds topics whose name matches the query.
- Message-content FTS (`messages_fts`), aggregated by topic: finds topics that contain relevant messages.
Results from both legs are merged and de-duplicated by topic ID, then scored:
| Signal | Weight |
|---|---|
| BM25 relevance (best of name/message leg) | base |
| Activity bonus: log₂(1 + message_count) × 0.1 | up to ~0.7 for large topics |
| Recency decay: 0.5 / (1 + days_since_last_msg / 30) | 0–0.5 |
| Resolved bonus | +0.15 |
No remote embedding APIs are called. This is a local FTS/hybrid approach. Vector-embedding provider integration is deferred to a future chunk (issue #9).
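
The merge step and the scoring table above can be sketched in Go as follows. This is a rough illustration under stated assumptions, not the actual implementation: the type and function names are hypothetical, the signals are assumed to be summed, and `Relevance` is assumed to be normalized so that higher is better (SQLite's raw FTS5 `bm25()` value is lower-is-better).

```go
package main

import (
	"fmt"
	"math"
)

// Hit is one FTS result from either leg (hypothetical type).
type Hit struct {
	TopicID   int64
	Relevance float64 // assumed: higher is better
}

// Topic holds the per-topic signals used for scoring (hypothetical type).
type Topic struct {
	MessageCount     int
	DaysSinceLastMsg float64
	Resolved         bool
}

// mergeLegs de-duplicates hits by topic ID, keeping the best relevance
// seen in either the topic-name leg or the message-content leg.
func mergeLegs(nameHits, messageHits []Hit) map[int64]float64 {
	best := make(map[int64]float64)
	for _, leg := range [][]Hit{nameHits, messageHits} {
		for _, h := range leg {
			if cur, ok := best[h.TopicID]; !ok || h.Relevance > cur {
				best[h.TopicID] = h.Relevance
			}
		}
	}
	return best
}

// topicScore adds the bonuses from the table to the base relevance.
// Summing the signals is an assumption; the table lists weights only.
func topicScore(relevance float64, t Topic) float64 {
	activity := math.Log2(1+float64(t.MessageCount)) * 0.1
	recency := 0.5 / (1 + t.DaysSinceLastMsg/30)
	score := relevance + activity + recency
	if t.Resolved {
		score += 0.15
	}
	return score
}

func main() {
	nameHits := []Hit{{TopicID: 1, Relevance: 1.2}}
	messageHits := []Hit{{TopicID: 1, Relevance: 0.8}, {TopicID: 2, Relevance: 1.0}}
	topics := map[int64]Topic{
		1: {MessageCount: 127, DaysSinceLastMsg: 0, Resolved: true},
		2: {MessageCount: 3, DaysSinceLastMsg: 60},
	}
	for id, rel := range mergeLegs(nameHits, messageHits) {
		fmt.Printf("topic %d: %.3f\n", id, topicScore(rel, topics[id]))
	}
}
```

With these weights a topic with 127 messages earns an activity bonus of 0.7, and a topic whose last message is 30 days old earns a recency bonus of 0.25.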
Flags
| Flag | Description |
|---|---|
| `--stream NAME` | Filter by stream name |
| `--unresolved` | Exclude resolved topics |
| `--limit N` | Maximum topics to return (default 20) |
Output format
#stream > topic name [resolved] (N msgs, last YYYY-MM-DD)
…best matching message snippet…
zulcrawl digest summarizes a bounded slice of the local SQLite archive without
hitting the Zulip API or any remote summarization service. It groups matching
messages by topic and reports counts, first/last timestamps, participants, and
a latest-message preview. It also highlights mention-heavy and attachment-heavy
topics using the local message_mentions and message_attachments indexes.
--stream and --since are required so digest runs stay intentionally scoped.
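
The grouping step described above might look roughly like the following Go sketch. The `Message` and `TopicDigest` types, the `digestByTopic` function, and the ordering (most messages first) are all assumptions for illustration; the real command reads from SQLite and also consults the mention and attachment indexes.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// Message is a simplified archive row (hypothetical type).
type Message struct {
	Topic  string
	Sender string
	SentAt time.Time
	Text   string
}

// TopicDigest is the per-topic summary (hypothetical type).
type TopicDigest struct {
	Topic        string
	Count        int
	First, Last  time.Time
	Participants []string // distinct senders, sorted
	Latest       string   // text of the most recent message
}

// ts parses an RFC3339 timestamp and panics on bad input (example helper).
func ts(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// digestByTopic groups messages by topic and orders topics by message
// count (descending), then by name. The ordering is an assumption.
func digestByTopic(msgs []Message) []TopicDigest {
	byTopic := make(map[string]*TopicDigest)
	senders := make(map[string]map[string]bool)
	for _, m := range msgs {
		d, ok := byTopic[m.Topic]
		if !ok {
			d = &TopicDigest{Topic: m.Topic, First: m.SentAt, Last: m.SentAt, Latest: m.Text}
			byTopic[m.Topic] = d
			senders[m.Topic] = make(map[string]bool)
		}
		d.Count++
		if m.SentAt.Before(d.First) {
			d.First = m.SentAt
		}
		if !m.SentAt.Before(d.Last) {
			d.Last, d.Latest = m.SentAt, m.Text
		}
		senders[m.Topic][m.Sender] = true
	}
	out := make([]TopicDigest, 0, len(byTopic))
	for topic, d := range byTopic {
		for s := range senders[topic] {
			d.Participants = append(d.Participants, s)
		}
		sort.Strings(d.Participants)
		out = append(out, *d)
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Topic < out[j].Topic
	})
	return out
}

func main() {
	msgs := []Message{
		{Topic: "deploys", Sender: "Bob", SentAt: ts("2026-05-02T10:00:00Z"), Text: "rolled back"},
		{Topic: "deploys", Sender: "Alice", SentAt: ts("2026-05-01T09:00:00Z"), Text: "deploying now"},
	}
	for _, d := range digestByTopic(msgs) {
		fmt.Printf("%s (%d messages)\n  participants: %s\n  latest: %s\n",
			d.Topic, d.Count, strings.Join(d.Participants, ", "), d.Latest)
	}
}
```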
Flags
| Flag | Description |
|---|---|
| `--stream NAME` | Stream name to summarize (required) |
| `--since DATE` | Include messages at or after this time (RFC3339 or YYYY-MM-DD; required) |
| `--until DATE` | Include messages at or before this time (RFC3339 or YYYY-MM-DD) |
| `--limit N` | Maximum topics to return (default 20) |
| `--json` | Emit machine-readable JSON instead of text |
Examples
zulcrawl digest --stream engineering --since 2026-05-01
zulcrawl digest --stream support --since 2026-05-01T00:00:00Z --until 2026-05-15T23:59:59Z --limit 10
zulcrawl digest --stream engineering --since 2026-05-01 --json

Text output format:
#stream > topic (N messages, YYYY-MM-DD HH:MM to YYYY-MM-DD HH:MM)
participants: Alice, Bob
latest: latest message preview
Mention-heavy topics:
#stream > topic (N mentions, last YYYY-MM-DD HH:MM)
Attachment-heavy topics:
#stream > topic (N attachments, last YYYY-MM-DD HH:MM)
JSON output is self-contained: the top-level object includes stream, topics,
mention_heavy_topics, and attachment_heavy_topics; each topic row also
includes its stream.
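
A consumer of the `--json` output could decode the top-level object along these lines. Only the four top-level keys come from the description above; the field types are not specified here, so this sketch keeps `stream` and each row as raw JSON and assumes the three topic fields are arrays. The `Digest` type, the `parseDigest` function, and the sample payload are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Digest captures the documented top-level keys. Rows are left as raw JSON
// because their inner fields are not documented in this README.
type Digest struct {
	Stream                json.RawMessage   `json:"stream"`
	Topics                []json.RawMessage `json:"topics"`
	MentionHeavyTopics    []json.RawMessage `json:"mention_heavy_topics"`
	AttachmentHeavyTopics []json.RawMessage `json:"attachment_heavy_topics"`
}

// parseDigest decodes the output of `zulcrawl digest ... --json`.
func parseDigest(data []byte) (Digest, error) {
	var d Digest
	if err := json.Unmarshal(data, &d); err != nil {
		return Digest{}, fmt.Errorf("decode digest JSON: %w", err)
	}
	return d, nil
}

func main() {
	// Made-up sample payload; real rows carry more fields.
	sample := []byte(`{"stream":"engineering","topics":[{"stream":"engineering"}],` +
		`"mention_heavy_topics":[],"attachment_heavy_topics":[]}`)
	d, err := parseDigest(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("stream=%s topics=%d mention-heavy=%d attachment-heavy=%d\n",
		d.Stream, len(d.Topics), len(d.MentionHeavyTopics), len(d.AttachmentHeavyTopics))
}
```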
zulcrawl messages queries the local archive without hitting the Zulip API.
At least one narrowing filter is required to prevent accidental full-archive dumps.
Flags
| Flag | Description |
|---|---|
| `--stream NAME` | Filter by stream name (exact match) |
| `--topic NAME` | Filter by topic name (exact match) |
| `--sender NAME` | Filter by sender full name (substring match) |
| `--since DATE` | Only messages at or after this time (RFC3339 or YYYY-MM-DD) |
| `--until DATE` | Only messages at or before this time (RFC3339 or YYYY-MM-DD) |
| `--days N` | Only messages from the last N days |
| `--hours N` | Only messages from the last N hours |
| `--last N` | Return the N most recent messages (oldest-first output) |
| `--limit N` | Maximum messages to return (default 200) |
| `--all` | Remove safety limit and return all matching messages |
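
The two accepted formats for `--since` and `--until` can be parsed with Go's standard `time` package, as in the sketch below. The function name `parseTimeFlag` is hypothetical, and interpreting a bare `YYYY-MM-DD` as midnight UTC is an assumption; zulcrawl may resolve date-only values differently (for example in local time).

```go
package main

import (
	"fmt"
	"time"
)

// parseTimeFlag accepts either an RFC3339 timestamp or a YYYY-MM-DD date.
// A date-only value is returned as midnight UTC (assumption).
func parseTimeFlag(s string) (time.Time, error) {
	if t, err := time.Parse(time.RFC3339, s); err == nil {
		return t, nil
	}
	if t, err := time.Parse("2006-01-02", s); err == nil {
		return t, nil
	}
	return time.Time{}, fmt.Errorf("invalid time %q: want RFC3339 or YYYY-MM-DD", s)
}

func main() {
	for _, in := range []string{"2026-01-01", "2026-05-01T12:30:00+02:00", "yesterday"} {
		t, err := parseTimeFlag(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(in, "=>", t.UTC().Format(time.RFC3339))
	}
}
```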
Examples
# Last 7 days in the #general stream
zulcrawl messages --stream general --days 7
# All messages in a specific topic
zulcrawl messages --stream engineering --topic "Q2 roadmap"
# Messages from a specific sender since a date
zulcrawl messages --sender "Alice" --since 2026-01-01
# Last 50 messages across the entire archive
zulcrawl messages --last 50
# Messages in the last 4 hours across all streams
zulcrawl messages --hours 4
# Combined: sender + date range, higher limit
zulcrawl messages --stream dev --sender "Bob" --since 2026-03-01 --until 2026-03-31 --limit 500
# Remove the safety cap entirely (use with care on large archives)
zulcrawl messages --stream general --days 365 --all

Output format per message:
[YYYY-MM-DD HH:MM:SS] #stream > topic | Sender Name
message content text...
- Uses `modernc.org/sqlite` (pure Go, no CGO)
- Current implementation focuses on core mirror/search flow
- Planned later (per spec): tail/event queue, MCP server, summarization, Q&A extraction
Embeddings are disabled by default. No embedding calls are ever made unless you explicitly enable them.
# 1. Add to ~/.zulcrawl/config.toml
cat >> ~/.zulcrawl/config.toml <<'EOF'
[embeddings]
enabled = true
model = "nomic-embed-text-v2-moe" # recommended; ~550 MB
# provider = "ollama" # only supported provider
# ollama_base = "http://localhost:11434" # default
# batch_size = 32 # texts per /api/embed request
# sample_messages = 50 # messages sampled per topic
EOF
# 2. Start Ollama and pull the model
ollama serve &
ollama pull nomic-embed-text-v2-moe
# 3. Build embeddings for all topics
zulcrawl embeddings backfill
# 4. Semantic topic search
zulcrawl topics search --semantic "database migration strategy"

- First-run latency: embedding a large archive (thousands of topics) may take several minutes on first run. Subsequent incremental runs only embed new topics.
- No cloud calls: all embeddings are generated locally via Ollama.
- No Python: pure Go implementation using Ollama's HTTP API.
- No sqlite-vec / HNSW: brute-force cosine similarity is used. This is fast enough for typical Zulip archive sizes.
- Heavier alternative: `bge-m3` can be used instead of `nomic-embed-text-v2-moe` for potentially better recall on multilingual corpora; pull the model and update the config.
- Model change: if you switch models, re-embed everything with `zulcrawl embeddings backfill --force`.
- Dimension mismatch: mixing vectors from different models produces wrong results. zulcrawl detects this and emits a clear error asking you to re-embed.
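
The brute-force search and the dimension check mentioned in the notes above can be sketched in Go as follows. This is a minimal illustration, not zulcrawl's code: `cosine`, `rankTopics`, and `Scored` are hypothetical names, and the error wording is invented for the example.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Scored pairs a topic with its similarity to the query (hypothetical type).
type Scored struct {
	TopicID int64
	Sim     float64
}

// cosine returns the cosine similarity of a and b. Vectors of different
// lengths (for example from two different models) are rejected.
func cosine(a, b []float32) (float64, error) {
	if len(a) != len(b) {
		return 0, fmt.Errorf("dimension mismatch: %d vs %d; re-embed with --force", len(a), len(b))
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0, nil // a zero vector has no direction
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb)), nil
}

// rankTopics compares the query against every stored vector (no index)
// and returns the top matches, most similar first.
func rankTopics(query []float32, vectors map[int64][]float32, limit int) ([]Scored, error) {
	scored := make([]Scored, 0, len(vectors))
	for id, v := range vectors {
		sim, err := cosine(query, v)
		if err != nil {
			return nil, fmt.Errorf("topic %d: %w", id, err)
		}
		scored = append(scored, Scored{TopicID: id, Sim: sim})
	}
	sort.Slice(scored, func(i, j int) bool {
		if scored[i].Sim != scored[j].Sim {
			return scored[i].Sim > scored[j].Sim
		}
		return scored[i].TopicID < scored[j].TopicID
	})
	if limit > 0 && len(scored) > limit {
		scored = scored[:limit]
	}
	return scored, nil
}

func main() {
	vectors := map[int64][]float32{1: {1, 0}, 2: {0, 1}, 3: {1, 1}}
	top, err := rankTopics([]float32{1, 0}, vectors, 2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range top {
		fmt.Printf("topic %d: %.3f\n", s.TopicID, s.Sim)
	}
}
```

The scan is linear in the number of topics, so every query touches every stored vector; that trade-off is what avoiding sqlite-vec or HNSW implies.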
# Build/update embeddings
zulcrawl embeddings backfill
zulcrawl embeddings backfill --batch 16 # smaller batches for low RAM
zulcrawl embeddings backfill --limit 500 # embed at most N topics
zulcrawl embeddings backfill --force # re-embed everything (model change)
# Check embedding coverage
zulcrawl embeddings status
# Semantic topic search
zulcrawl topics search --semantic "deploy pipeline"
zulcrawl topics search --semantic --stream engineering "API rate limit"

| Flag | Description |
|---|---|
| `--stream NAME` | Filter by stream name |
| `--unresolved` | Exclude resolved topics |
| `--limit N` | Maximum topics to return (default 20) |
| `--semantic` | Use Ollama vector search (requires embeddings enabled + backfilled) |
FTS-only behaviour is unchanged when --semantic is not set.