Codexa is an offline-first semantic indexing engine for large document collections.
- 🔍 Semantic search (embeddings, entities, clusters) + hybrid dense / sparse (corpus BM25) fusion (RRF default, weighted optional;
search.hybrid.quality_weight_enabledcan opt weighted fusion intoquality_scorediscounting) - 📚 Multi-format ingestion
- Documents — PDF, DOCX, PPTX, PPT, XLSX, XLS, ODT, ODG, EPUB, FB2, MOBI, AZW, AZW3, HTML / HTM / XHTML, CHM, PS, RTF. RTF needs the optional
[rtf]extra (pip install -e ".[rtf]"→striprtf); without it.rtfsoft-skips to empty rather than indexing raw control words. - Images (OCR + object detection) — PNG, JPG / JPEG / JFIF, TIFF / TIF, BMP, GIF, WEBP, JP2 / J2K (JPEG 2000), CBZ — all Pillow-native, no extra. CBR needs the optional
[cbr]extra plusunrar,unar, orbsdtar; without it.cbrsoft-skips instead of falling through to binary text. HEIC / HEIF (iPhone default since iOS 11) and AVIF decode with the optional[heif]extra (pip install -e ".[heif]"); without it those three soft-decode to empty rather than crashing extraction. OCR pulls words off the image; with the optional[vision]extra and an operator-supplied YOLOv8-class.onnx, an[objects: …]line of detected objects is appended so text-free photos stay searchable. - DjVu — DJVU / DJV scanned documents extract their embedded text layer via the DjVuLibre
djvutxtsystem tool, falling back to per-page OCR when the document has no text layer. Install DjVuLibre (apt install djvulibre-bin/brew install djvulibre); without it DjVu files soft-skip rather than crashing the indexer. - SVG — stdlib-XML text by default; with the optional
[svg]extra (pip install -e ".[svg]"), SVGs whose text is path-outlined or whose payload is an embedded raster are rasterised and OCR'd as a fallback. - Text & data — TXT, MD, RST, CSV, JSON, YAML / YML, LOG
- ZIM archives — drop a
.zim(Wikipedia / Kiwix bundle) underdata_dirsand Codexa walks every article via libzim and indexes them just like any other corpus file. Needs the[zim]extra (pip install -e ".[zim]"). - Plus anything you register via the Extractor Strategy — plugins can add their own formats with
register_extractor(extensions).
- Documents — PDF, DOCX, PPTX, PPT, XLSX, XLS, ODT, ODG, EPUB, FB2, MOBI, AZW, AZW3, HTML / HTM / XHTML, CHM, PS, RTF. RTF needs the optional
- ⚡ GPU-accelerated OCR and embeddings, with auto-batch sizing under RAM pressure and CPU-only chunking workers
- 🧱 Incremental indexing with hashing + SQLite manifest connection pool for >50k-file corpora
- 🗂️ Sharded embedding cache with hit/miss telemetry + wired caches track the six-point AGENTS.md cache contract (cap · TTL · lock · reset hook · stats accessor) through
AppContext.reset; the ColBERT scorer cache remains the documented open exception underCACHE-33 - 📊 Streamlit UI (search · semantic structure · index · chatbot ops · diagnostics)
- 🛡️ Health checks, config validation, structured JSON logs, atomic durable writes (
utils.io.atomic_write_text— tmp + fsync + rename + parent-dir fsync) - 🧠 Optional LLM support via Ollama OR an embedded
llama.cpp(local) provider, with cross-provider fallback - 🔁 Multi-stage RAG dispatch: query rewrite (BOT-1) → query decomposition / HyDE / multilingual / synonym fan-out → dense retrieval → horizon + quality gates → Jaccard + semantic dedup → PRF expansion → dense+sparse fusion → cross-encoder + ColBERT rerank → expand-to-parent → scoring + plugin filter → context-budget cut → prompt build → streaming or blocking LLM → orphan-citation strip + hallucination flag + RAG-7 regenerate-when-ungrounded
- 🛟 Multi-tier RAG enrichment: bundled Wikipedia seed (25 English topics plus pt/es/fr overlays) + on-demand libzim full-text + opt-in online fallback
- 💬 Multi-turn sessions with bounded prompt history, optional LLM-summarised drop-off, JSON or SQLite backend; per-passage pin / exclude / thumbs feedback + locale-aware follow-up suggestions
- 🤖 Chatbot operations dashboard — declared
ChatbotObjective(statement / owner / review cadence / success metrics), aggregated session counters, pilot-batch tracking (pilot_runs.jsonlwith O(1) tombstone-overlay updates) - 🌍 i18n (English, Spanish, French, Brazilian Portuguese — runtime-loaded
.pocatalogs) — locale directive threads into every LLM prompt (HyDE / decomp / history-summary / follow-ups / answer) so a pt_BR session never gets an English sub-query / recap / answer - 🧩 Plugin system (discovery, sandboxing, hook plugins, and
register_*seams for extractors / stores / OCR / LLM / chunkers) - 🔬 Semantic toolkit: entity normalization & linking, RAKE + embedding keyphrases, co-occurrence graphs, TF-IDF cluster labels, embedding near-dup detection, PRF query expansion, LexRank extractive summarization
- 🔐 Prompt-injection mitigation: user-supplied queries are fenced before LLM interpolation across
query_decomp/query_rewrite/hyde/fallback/history_compress; each prompt builder pins a regression test for the fence + locale + NO_ANSWER + length rule - 📈 Eval harness with ablation toggles: per-stage on/off (
run_case_with_stage_toggle), regen-vs-baseline comparison (run_case_with_regen_comparison), tuned-vs-baseline overlay sweep (compare_tuned_vs_baseline)
- Quick start
- Configuration
- Architecture → docs/architecture.md
- UI → docs/ui.md
- LLM providers → docs/llm.md
- Wikipedia / ZIM → docs/wikipedia.md
- Semantic toolkit → docs/semantics.md
- Performance → docs/performance.md
- Testing → docs/testing.md
- Troubleshooting & Logging → docs/troubleshooting.md
pip install -e .
cp config-example.yml config.yml # then edit data_dirs / store paths
codexa-validate-config --config config.yml # sanity-check first
codexa-index --config config.yml # build/refresh the index
codexa-ui # launch the dashboardconfig-example.yml is the committed template with generic placeholder paths. Copy it to config.yml (git-ignored) and set your real data_dirs + store.persist_dir — your private paths never get committed. --config defaults to the nearest config.yml above the cwd (or $CODEXA_CONFIG).
The order is load-bearing: codexa-ui boots a search-only frontend — without an existing index it opens but every query returns zero hits. You must run codexa-index at least once before searches return anything. Subsequent runs are incremental (mtime + hash diff against the manifest), so the second invocation only touches files that actually changed.
Every command exists in two equivalent forms — the legacy flat shim (codexa-index, codexa-ui, …) and the unified codexa <subcommand> group (codexa index, codexa ui, …). Pick whichever fits muscle memory; both bind to the same Click commands.
| command | what it does |
|---|---|
codexa index |
scan, extract, chunk, embed, persist (--force-reindex to rebuild from scratch) |
codexa ui |
Streamlit dashboard (search + diagnostics) |
codexa validate-config |
check config.yml for missing paths / invalid values |
codexa check-deps |
scan source for missing imports, optionally auto-install |
codexa cache stats |
print embedding-cache size, entries, hit rate |
codexa plugins list |
list plugins discovered in the current environment |
codexa fetch model |
download a curated GGUF for the local LLM provider |
codexa fetch embedding |
pre-cache the SentenceTransformer embedder under <persist_dir>/hf_cache |
codexa fetch zim |
download a Kiwix Wikipedia ZIM and wire cfg.search.wikipedia.zim_path |
codexa extract-strings |
scan source for _("…") calls and emit a .pot template |
codexa clear-stale-lock |
clear orphan chroma.sqlite3-shm / -wal after a crashed indexer (--dry-run to preview) |
codexa eval baseline / run / docs-baseline |
run live-index, case-file retrieval/RAG, or clean-checkout repo-doc quality gates |
To run straight from a checkout without pip install-ing the package, install just the third-party deps into a venv and invoke each command via python -m codexa.cli.<module>:
# from the repo root
python -m venv .venv && source .venv/bin/activate
# Pull in current core dependencies from [project.dependencies] without
# maintaining a second hand-written package list in this README.
python - <<'PY'
import subprocess
import sys
import tomllib
from pathlib import Path
deps = tomllib.loads(Path("pyproject.toml").read_text(encoding="utf-8"))["project"]["dependencies"]
subprocess.check_call([sys.executable, "-m", "pip", "install", *deps])
PY
# Now run each command via `python -m` instead of the console scripts.
python -m codexa.cli.config_validator --config config.yml
python -m codexa.cli.indexer_cli --config config.yml # ← REQUIRED before any UI search
python -m codexa.cli.cache_cli --config config.yml
python -m codexa.cli.plugins_cli --config config.yml
python -m codexa.cli.eval_cli baseline --config config.yml
python -m codexa.cli.deps_cli
python -m codexa.cli.fetch_model # interactive GGUF picker
python -m codexa.cli.zim_fetch # download a Wikipedia ZIM
# The UI is a Streamlit app — launch via the package entry point:
python -m codexa.uiThe python -m codexa.cli.* form maps 1:1 to the codexa-* console scripts above and accepts the same flags. Index before searching: codexa ui / python -m codexa.ui only reads the persisted Chroma store under cfg.store.persist_dir; never crawls data_dirs. Run python -m codexa.cli.indexer_cli --config config.yml (or codexa-index --config config.yml when installed) at least once before opening the dashboard, otherwise every query returns zero hits.
The UI ships in three equivalent launch forms — same Streamlit app, same codexa/ui/app.py entry point:
codexa ui/codexa-ui— Click console script (installed builds only). Thin wrapper aroundcodexa.ui.app:main.python -m codexa.ui— package entry point (codexa/ui/__main__.py). Samemain()as the console script; works withoutpip installas long as the repo root is onPYTHONPATH.python -m streamlit run codexa/ui/app.py— drop down to the Streamlit CLI directly. Useful when you want to pass Streamlit's own flags (--server.port,--server.headless,--browser.gatherUsageStats false, …) without threading them through Click — anything afterrun …goes straight to Streamlit.
All three honour the same cfg.yml resolution (relative to the working directory unless CODEXA_CONFIG is set).
config.yml is the single source of truth. Run codexa-validate-config to sanity-check it. The committed config-example.yml is the broad template; this README keeps the core operator keys and points at the plugin seams where a registry can extend them.
# ---- Locale ------------------------------------------------------------
# UI / CLI language. Resolved cfg → CODEXA_LOCALE env → LANG env → "en".
# Ships catalogs for: en, es, fr, pt_BR. A sub-tag (e.g. pt_BR) falls
# back to its base (pt) when its catalog is missing. Internal log event
# keys stay English regardless.
locale: "en"
# ---- Corpus ------------------------------------------------------------
# Directories to index. Multiple roots allowed. Symlinks followed.
data_dirs:
- "/path/to/corpus"
# ---- Vector store ------------------------------------------------------
store:
backend: "chroma" # built-in backend; plugins can
# register more with
# codexa.store.register_store_backend
persist_dir: "./chroma_db"
collection_name: "documents" # main chunk collection
file_collection: "files" # per-file metadata collection
entity_collection: "entities" # per-entity rollups
# ROB-1: when true, an HNSW segment-corrupt error from query() raises
# ChromaSegmentCorruptError instead of returning an empty hit list.
# Off by default so a single bad collection doesn't take the panel down.
raise_on_segment_corrupt: false
# ---- Embeddings --------------------------------------------------------
embeddings:
# HF identifier OR an absolute on-disk path. Identifier → pre-cache once
# with `codexa fetch embedding` (or opt a first run into download with
# allow_model_download / CODEXA_ALLOW_MODEL_DOWNLOAD); subsequent loads
# pin HF_HUB_OFFLINE=1 against the `<persist_dir>/hf_cache` snapshot.
# Point at an absolute path for a strictly-offline first run.
model_path: "sentence-transformers/all-MiniLM-L6-v2"
batch_size: 32 # encoder batch ceiling
auto_batch: true # halve at ≥60% RAM (≥80% VRAM on
# CUDA); double back to the ceiling
# at ≤40% RAM (≤55% VRAM); floor 1.
# "auto" probes cuda → mps → xpu → cpu. Override with "cuda"/"cuda:1"/
# "mps"/"xpu"/"cpu". AMD ROCm rides the cuda branch. Failed probes log
# device_probe_failed DEBUG so `--verbose` ops can tell torch-missing
# from ROCm-broken from ipex-import-explodes.
device: "auto"
# Cap PyTorch's per-process VRAM allocator (0.0 < x < 1.0) so a
# re-index can't starve the rest of the desktop of GPU memory.
# Logged once at init as cuda_memory_capped.
cuda_memory_fraction: 0.7
# BLAS / tokenizer / torch intra-op cap for the main-process embedder.
# Dominant throughput lever on CPU-only hosts; default/template value 8
# suits a large CPU host.
# Operator env vars (OMP_NUM_THREADS=…) win.
main_thread_cap: 8
# On-disk shard cache. Content-addressed; sha256(text) → vector.
cache:
enabled: true
dir: "{store.persist_dir}/emb_cache"
shards: 16
# ---- OCR ---------------------------------------------------------------
ocr:
enabled: true
# "both" (Paddle + Tesseract merged) | "paddle" | "tesseract".
# Paddle adapter targets PaddleOCR 3.x (use_textline_orientation,
# predict(path)); legacy use_gpu is gone — paddlepaddle picks
# CPU/GPU from the installed wheel.
engine: "tesseract"
languages: "eng" # Tesseract codes (ISO 639-2/T)
pdf_dpi: 200 # render DPI for image-only PDF pages
# When a PDF's text-layer averages fewer than this many chars per
# page, fall back to inline OCR. Lower = OCR less aggressively; higher
# = OCR more aggressively. Only consulted when ocr.enabled.
pdf_text_threshold_chars: 512
# Per-stream FlateDecode cap (MiB) used on retry when a PDF page blows
# past pypdf's 75 MB default. Per-call only — global pypdf cap stays
# safe. 4 GiB covers every legitimate textbook we've seen.
pdf_decompress_limit_mb: 4096
# Content-addressed OCR cache. Keyed on sha256(file_bytes) + (engine,
# languages, pdf_dpi). Default-off; flip on for OCR-heavy corpora.
cache:
enabled: false
dir: "{store.persist_dir}/ocr_cache"
shards: 8
# ---- SVG ---------------------------------------------------------------
svg:
# When an SVG's stdlib-XML pass returns < text_threshold_chars chars
# AND ocr.enabled, rasterise (render_dpi) and OCR the bitmap. Needs the
# [svg] extra (svglib + reportlab) or a system cairosvg; without one
# the fallback is a silent no-op.
render_fallback: true
render_dpi: 150
text_threshold_chars: 16
# ---- Vision (object detection) -----------------------------------------
# Appends an "[objects: …]" line to image extraction via an ONNX-Runtime
# detector. Independent of ocr.enabled (run either / both / neither).
# Operator-supplied model — no runtime fetch. Needs the [vision] extra
# (onnxruntime, no torch). Extra or model absent → silent no-op.
vision:
enabled: false
model_path: "" # "" → {persist_dir}/vision/model.onnx
class_names_path: "" # "" → bundled COCO-80
score_threshold: 0.35
max_labels: 12
# ---- LLM ---------------------------------------------------------------
llm:
provider: "none" # "none" | "ollama" | "local"
fallback: "local" # retried when primary unreachable
# Ollama (external server)
model: "llama3"
host: "http://localhost:11434"
# Intra-Ollama model chain — when `model` times out/errors the
# dispatcher walks each entry before falling over to `fallback`.
# Each miss emits one ollama_model_failed WARNING.
model_fallbacks: []
# Per-Ollama-call timeout (seconds). Default 300 covers cold-start
# weight loads on CPU-only hosts. Drop to 30–60 on fast-GPU boxes.
# Garbage values silently fall back to default.
timeout: 300
# Fast /api/tags availability probe timeout. Generation still uses
# `timeout`; raise this for slow remote Ollama hosts.
probe_timeout: 3.0
# Embedded llama.cpp — deterministic (temperature=0, top_k=1, fixed
# seed). `pip install -e ".[local-llm]"`, then point `model_path` at
# a GGUF on disk. Curated GGUFs land under `{persist_dir}/models/` via
# `codexa fetch model --name <entry>` (override dir with $CODEXA_MODEL_DIR
# or --out). `codexa fetch model --recommend` probes the host and
# ✓/✗-marks every catalog entry; `--largest` downloads the biggest
# entry that fits (size × 1.2 ≤ RAM).
#
# Catalog entries (8-bit / Q4_K_M GGUFs):
# name size license notes
# ------------------------ ------ -------------------- -----------------------------------------------
# qwen2.5-0.5b (default) 0.4 GB Apache-2.0 smallest / fastest first run
# llama-3.2-1b 0.8 GB Llama 3.2 Community 131 K context, runs anywhere
# qwen2.5-1.5b 1.0 GB Apache-2.0 balanced 32 K context
# qwen3-30b-a3b 19 GB Apache-2.0 MoE (3 B active) — fastest big-class on CPU
# qwen3-32b 20 GB Apache-2.0 dense, hybrid thinking/non-thinking modes
# llama-3.3-70b 42 GB Llama 3.3 Community highest-quality dense for ≥64 GB hosts
# deepseek-r1-distill-70b 42 GB MIT reasoning-tuned (chain-of-thought baked in)
# qwen2.5-72b 47 GB Apache-2.0 multi-part download (auto-concatenated)
# mixtral-8x22b 85 GB Apache-2.0 multi-part; MoE (~39 B active); RAM-hungry
local:
model_path: "{store.persist_dir}/models/qwen2.5-0.5b-instruct-q4_k_m.gguf"
n_ctx: 4096 # raise per-model when catalog suggests
seed: 1337 # fixed seed → reproducible answers
max_tokens: 512
max_summary_chars: 500 # truncate LLM-generated corpus summary
# ---- Search / retrieval ------------------------------------------------
search:
history_turns: 6 # 3 user + 3 assistant; 0 disables
history_compress: false # LLM-summarise dropped turns; opt-in
highlight_entities: false # bold recognised entities in passages
passage_thumbnail_dpi: 100 # PDF page render DPI in expanders
# Cross-encoder reranker over the retrieved top-K. Disabled by default
# — costs ~50-200 ms per query on CPU (sub-50 ms on GPU) but catches
# nuance the bi-encoder misses. Auto-downloads on first use.
reranker:
enabled: false
model_id: "BAAI/bge-reranker-base"
top_k_keep: 5
device: "auto" # "auto" → cuda when available
# Drop paraphrase near-duplicates after the Jaccard dedup. Bounded to
# the over-fetched ≤30 rows. Costs one extra embed pass; set false on
# latency-tight hosts.
semantic_dedup:
enabled: true
threshold: 0.93 # cosine ≥ this ⇒ near-duplicate
# Pseudo-relevance feedback (Rocchio). Results merged, never replaced,
# so PRF only ever adds recall. Costs one extra embed + store query.
prf:
enabled: true
top_m: 5 # survivors fed into the centroid
alpha: 1.0 # original-query weight
beta: 0.75 # feedback-centroid weight
# Compound-question splitter — RRF-fuses 2–3 sub-query results.
# User query is wrapped in <<<USER_INPUT>>> fence so a hostile
# multi-line query can't break out of its role slot.
query_decomp:
enabled: false
# HyDE — embed a hypothetical answer paragraph as a second retrieval
# pass. Same fencing as query_decomp.
hyde:
enabled: false
n: 1 # clamped to [1, 5]
# Conversational rewrite — when history exists and the query looks like
# a follow-up, build a standalone query with the same prompt fences and
# locale directive as HyDE / query_decomp.
query_rewrite:
enabled: false
ab_fuse: false # RRF-fuse original + rewrite
# Optional fan-out variants. All ride the same batched sub-query embed
# + concurrent store.query path before RRF merge.
multilingual:
enabled: false
query_expansion:
enabled: false
synonyms_path: "" # YAML/JSON alias map
# FLARE — watch the streaming answer for low-confidence sentences and
# fire additional retrieval rounds. Each trigger logs flare_retrigger
# + flare_passages_added DEBUG events.
flare:
enabled: false
min_logprob: -2.5 # per-token logprob threshold
max_rounds: 3 # hard cap on re-retrieval rounds
top_k_extra: 5 # passages per re-retrieval
# Offline "🧠 Passage insights" expander (LexRank TL;DR + KeyBERT
# phrases) computed with the search embedder already in memory.
insights:
enabled: false
max_sentences: 3
top_keyphrases: 8
# Offline encyclopedic enrichment. Tier walk: ZIM → seed → online.
# codexa/data/wikipedia_seed*.json ships 25 English topics plus
# localized pt/es/fr overlays.
wikipedia:
enabled: true
data_dirs: [] # extra operator-supplied seed JSONs
zim_path: "" # single .zim or list of paths
zim_dir: "" # glob *.zim under a folder
online: false # opt-in fallback; breaks offline-first
lang: "auto" # "auto" derives from cfg.locale
# Optional hierarchical retrieval: indexer.hierarchical emits L0/L1/L2
# rows; this stage swaps L2 child hits for their L1 parent when present.
expand_to_parent:
enabled: false
# ---- Metadata / sessions -----------------------------------------------
metadata:
# Both manifest_file and sessions_file pick a backend from the suffix:
# `.json` → JSON (full-file rewrite per save), `.db`/`.sqlite` → SQLite
# (per-row UPSERT in one tx). On hosts without POSIX fcntl locks,
# `.json` sessions route to a sibling `.db` automatically for
# cross-process safety. Switch to SQLite explicitly at ~50k files /
# 1k saved threads.
manifest_file: "{store.persist_dir}/manifest.json"
sessions_file: "{store.persist_dir}/sessions.json"
index_info_file: "{store.persist_dir}/index_info.json"
enable_entity_extraction: true
tag_count: 50
# ---- Indexer -----------------------------------------------------------
indexer:
# Upper bounds — effective values are cores-aware capped (chunker:
# cores//4, floor 2, ceiling 8; OCR: 2). Over-cap values are silently
# downscaled and emit a workers_downscaled WARNING.
workers: 4
ocr_workers: 2
skip_dirs: [] # extra dir names to skip
# Optional ramp-up cap. 0 = process everything. Mirrors --max-files.
max_files: 0
# Manifest checkpoint cadence during the embed loop. Saves happen
# every N files OR every M seconds, whichever comes first.
checkpoint_every_files: 10
checkpoint_every_seconds: 5.0
# Optional L0/L1/L2 chunk tree. Default off keeps the flat chunk path.
hierarchical:
enabled: false
# ---- Chunking ----------------------------------------------------------
chunking:
# Per-chunk token cap. Overlap is computed dynamically per file
# (10% of token count, min 20). The optional hierarchical producer
# rebuilds flat chunks into L0/L1/L2 rows when indexer.hierarchical
# is enabled.
chunk_size: 512
# ---- Format gates ------------------------------------------------------
formats:
enabled: [] # allow-list; empty = all
disabled: [] # deny-list (always wins)
# When no registered extractor matches a file's extension, read its
# bytes as UTF-8 with errors='ignore'. True is backwards-compatible
# but unsafe on mixed-binary corpora — flip false to make unknown
# extensions return "" instead.
text_fallback: true
# Cap (MiB) on the text-fallback read; a multi-GB binary under
# data_dirs is decoded to its prefix only and logged
# text_fallback_truncated.
text_fallback_max_mb: 64
# ---- Plugins -----------------------------------------------------------
plugins:
disabled: [] # block individual plugin names
# Cap on the timeout-dispatch thread pool. Only matters when plugins
# opt into a `timeout_s`.
executor_workers: 4
# slow_threshold_ms: 1000 # WARN when a plugin hook exceeds
# Per-plugin slices land as siblings of `disabled`:
# plugins:
# word_counter:
# min_words: 100Two pipelines (indexing + search) over one persistent Chroma vector store, strategy-registry dispatch keyed off a single cfg dict, plugin hooks at pre_extract / chunk_filter / search_filter / post_index / ui_panel. ASCII diagram, layer map, cross-cutting conventions, and the plugin-system walkthrough (entry-point group, hook contracts, lifecycle) live in docs/architecture.md.
codexa ui boots a Streamlit dashboard. Layout: sidebar (saved-chat threads + filter) → main (search box + RAG answer + retrieved passages with source-aware thumbnails) → right-hand tab strip (🧬 Semantic Structure / 📂 Index / 🩺 Diagnostics).
Full walkthrough — passage rendering, image augment per source type, saved-chat schema + sidebar behaviour, Semantic Structure tab caching, Index tab controls (Run Indexer / Force reindex / stale-lock cleanup), and i18n catalog wiring (en / es / fr / pt_BR) — lives in docs/ui.md.
Three values of cfg.llm.provider: "none" (passages only), "ollama" (external server), "local" (embedded llama.cpp, deterministic). Cross-provider cfg.llm.fallback retries the unreachable primary. Curated 9-entry GGUF catalog with codexa fetch model --recommend / --largest host-probe helpers and multi-part download support lives in docs/llm.md.
Two distinct surfaces: search-time RAG enrichment (cfg.search.wikipedia.* walks ZIM → seed → online; bundled 25-topic seed always available) and ZIM-as-corpus ingestion (drop a .zim under data_dirs for full chunk+embed). codexa-fetch-zim / codexa-check-deps --fetch-zims walkthroughs and the precedence table live in docs/wikipedia.md.
codexa.semantic.* — pure-Python offline primitives: entity normalization, RAKE / KeyBERT keyphrases, co-occurrence graphs, TF-IDF cluster labels, embedding near-dup filter, Rocchio PRF, LexRank summary. Statistical and embedding-aware variants, plus where each surfaces in the UI (🧠 Passage insights, Semantic Structure graph, query-expansion captions, inline entity highlight) live in docs/semantics.md.
GPU device-probe order + per-vendor install hints, throughput knobs (auto-batch, cache stats, cross-file dedup, mtime-first change detection, SQLite manifest, cores-aware worker caps, CPU-only chunkers, CUDA OOM backoff, effective_budget pre-flight log), and the ramp-up benchmark (scripts/bench_indexer.py) live in docs/performance.md.
Local pytest invocations (pytest, --cov, -m bench), offline-by-default conftest discipline, the bench marker, GitHub Actions workflow (lint + test, cov ≥ 90%), and SonarCloud wiring live in docs/testing.md.
Chroma sqlite lock errors (ChromaInitError / database is locked / ChromaSegmentCorruptError) with the recovery steps, plus the structured JSON-lines log surface (file rotation, stable event keys like run_start / effective_budget / workers_downscaled / embed_oom_backoff, CODEXA_LOG_DIR override) live in docs/troubleshooting.md.