Codexa

Codexa is an offline-first semantic indexing engine for large document collections.

🔍 Semantic search (embeddings, entities, clusters) + hybrid dense / sparse (corpus BM25) fusion (RRF default, weighted optional; search.hybrid.quality_weight_enabled can opt weighted fusion into quality_score discounting)
📚 Multi-format ingestion
- Documents — PDF, DOCX, PPTX, PPT, XLSX, XLS, ODT, ODG, EPUB, FB2, MOBI, AZW, AZW3, HTML / HTM / XHTML, CHM, PS, RTF. RTF needs the optional [rtf] extra (pip install -e ".[rtf]" → striprtf); without it .rtf soft-skips to empty rather than indexing raw control words.
- Images (OCR + object detection) — PNG, JPG / JPEG / JFIF, TIFF / TIF, BMP, GIF, WEBP, JP2 / J2K (JPEG 2000), CBZ — all Pillow-native, no extra. CBR needs the optional [cbr] extra plus unrar, unar, or bsdtar; without it .cbr soft-skips instead of falling through to binary text. HEIC / HEIF (iPhone default since iOS 11) and AVIF decode with the optional [heif] extra (pip install -e ".[heif]"); without it those three soft-decode to empty rather than crashing extraction. OCR pulls words off the image; with the optional [vision] extra and an operator-supplied YOLOv8-class .onnx, an [objects: …] line of detected objects is appended so text-free photos stay searchable.
- DjVu — DJVU / DJV scanned documents extract their embedded text layer via the DjVuLibre djvutxt system tool, falling back to per-page OCR when the document has no text layer. Install DjVuLibre (apt install djvulibre-bin / brew install djvulibre); without it DjVu files soft-skip rather than crashing the indexer.
- SVG — stdlib-XML text by default; with the optional [svg] extra (pip install -e ".[svg]"), SVGs whose text is path-outlined or whose payload is an embedded raster are rasterised and OCR'd as a fallback.
- Text & data — TXT, MD, RST, CSV, JSON, YAML / YML, LOG
- ZIM archives — drop a .zim (Wikipedia / Kiwix bundle) under data_dirs and Codexa walks every article via libzim and indexes them just like any other corpus file. Needs the [zim] extra (pip install -e ".[zim]").
- Plus anything you register via the Extractor Strategy — plugins can add their own formats with register_extractor(extensions).
⚡ GPU-accelerated OCR and embeddings, with auto-batch sizing under RAM pressure and CPU-only chunking workers
🧱 Incremental indexing with hashing + SQLite manifest connection pool for >50k-file corpora
🗂️ Sharded embedding cache with hit/miss telemetry + wired caches track the six-point AGENTS.md cache contract (cap · TTL · lock · reset hook · stats accessor) through AppContext.reset; the ColBERT scorer cache remains the documented open exception under CACHE-33
📊 Streamlit UI (search · semantic structure · index · chatbot ops · diagnostics)
🛡️ Health checks, config validation, structured JSON logs, atomic durable writes (utils.io.atomic_write_text — tmp + fsync + rename + parent-dir fsync)
🧠 Optional LLM support via Ollama OR an embedded llama.cpp (local) provider, with cross-provider fallback
🔁 Multi-stage RAG dispatch: query rewrite (BOT-1) → query decomposition / HyDE / multilingual / synonym fan-out → dense retrieval → horizon + quality gates → Jaccard + semantic dedup → PRF expansion → dense+sparse fusion → cross-encoder + ColBERT rerank → expand-to-parent → scoring + plugin filter → context-budget cut → prompt build → streaming or blocking LLM → orphan-citation strip + hallucination flag + RAG-7 regenerate-when-ungrounded
🛟 Multi-tier RAG enrichment: bundled Wikipedia seed (25 English topics plus pt/es/fr overlays) + on-demand libzim full-text + opt-in online fallback
💬 Multi-turn sessions with bounded prompt history, optional LLM-summarised drop-off, JSON or SQLite backend; per-passage pin / exclude / thumbs feedback + locale-aware follow-up suggestions
🤖 Chatbot operations dashboard — declared ChatbotObjective (statement / owner / review cadence / success metrics), aggregated session counters, pilot-batch tracking (pilot_runs.jsonl with O(1) tombstone-overlay updates)
🌍 i18n (English, Spanish, French, Brazilian Portuguese — runtime-loaded .po catalogs) — locale directive threads into every LLM prompt (HyDE / decomp / history-summary / follow-ups / answer) so a pt_BR session never gets an English sub-query / recap / answer
🧩 Plugin system (discovery, sandboxing, hook plugins, and register_* seams for extractors / stores / OCR / LLM / chunkers)
🔬 Semantic toolkit: entity normalization & linking, RAKE + embedding keyphrases, co-occurrence graphs, TF-IDF cluster labels, embedding near-dup detection, PRF query expansion, LexRank extractive summarization
🔐 Prompt-injection mitigation: user-supplied queries are fenced before LLM interpolation across query_decomp / query_rewrite / hyde / fallback / history_compress; each prompt builder pins a regression test for the fence + locale + NO_ANSWER + length rule
📈 Eval harness with ablation toggles: per-stage on/off (run_case_with_stage_toggle), regen-vs-baseline comparison (run_case_with_regen_comparison), tuned-vs-baseline overlay sweep (compare_tuned_vs_baseline)
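
The "atomic durable writes" bullet above names a four-step pattern: tmp + fsync + rename + parent-dir fsync. A minimal sketch of that pattern follows. It is not the body of utils.io.atomic_write_text; the function name atomic_write_text_sketch and its signature are assumptions made for illustration.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text_sketch(path: Path, text: str, encoding: str = "utf-8") -> None:
    """Hypothetical helper: readers see the old file or the new one, never a partial one."""
    path = Path(path)
    directory = path.parent
    # 1. Write to a temp file in the SAME directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=directory, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=encoding) as handle:
            handle.write(text)
            handle.flush()
            # 2. Force the file contents to disk before the rename.
            os.fsync(handle.fileno())
        # 3. Atomically swap the temp file into place.
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename completed.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    # 4. fsync the parent directory so the rename itself survives a crash (POSIX only).
    if os.name == "posix":
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

The temp file is created next to the target because os.replace is only atomic within a single filesystem.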

Quick start

pip install -e .
cp config-example.yml config.yml             # then edit data_dirs / store paths
codexa-validate-config --config config.yml   # sanity-check first
codexa-index --config config.yml             # build/refresh the index
codexa-ui                                    # launch the dashboard

config-example.yml is the committed template with generic placeholder paths. Copy it to config.yml (git-ignored) and set your real data_dirs + store.persist_dir — your private paths never get committed. --config defaults to the nearest config.yml above the cwd (or $CODEXA_CONFIG).
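
The --config default described above can be pictured as an upward directory walk. The sketch below is a rough illustration only: the function name resolve_config_path is hypothetical, and the precedence it uses (explicit flag, then $CODEXA_CONFIG, then the nearest config.yml at or above the working directory) is an assumption, since the README does not spell out the order.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_config_path(explicit: Optional[str] = None,
                        start: Optional[Path] = None) -> Optional[Path]:
    """Hypothetical resolver; the precedence order here is an assumption."""
    # An explicit --config value wins outright.
    if explicit:
        return Path(explicit)
    # Next, the environment override named in the README.
    env_value = os.environ.get("CODEXA_CONFIG")
    if env_value:
        return Path(env_value)
    # Finally, walk from the starting directory up to the filesystem root.
    here = (start or Path.cwd()).resolve()
    for directory in (here, *here.parents):
        candidate = directory / "config.yml"
        if candidate.is_file():
            return candidate
    return None
```

Returning None rather than raising leaves it to the caller to decide whether a missing config is fatal.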

The order is load-bearing: codexa-ui boots a search-only frontend — without an existing index it opens but every query returns zero hits. You must run codexa-index at least once before searches return anything. Subsequent runs are incremental (mtime + hash diff against the manifest), so the second invocation only touches files that actually changed.
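
To make the "mtime + hash diff against the manifest" idea concrete, here is a minimal sketch of an mtime-first change check. The manifest shape (a dict keyed by path with mtime_ns and sha256 fields) and the names file_sha256 and needs_reindex are assumptions for illustration, not Codexa's actual schema.

```python
import hashlib
from pathlib import Path
from typing import Dict, TypedDict


class ManifestEntry(TypedDict):
    # Hypothetical manifest row; the real manifest may store more fields.
    mtime_ns: int
    sha256: str


def file_sha256(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_reindex(path: Path, manifest: Dict[str, ManifestEntry]) -> bool:
    """Return True when the file is new or its content differs from the manifest."""
    entry = manifest.get(str(path))
    if entry is None:
        return True  # never indexed
    mtime_ns = path.stat().st_mtime_ns
    if mtime_ns == entry["mtime_ns"]:
        return False  # cheap path: unchanged mtime, skip hashing
    if file_sha256(path) == entry["sha256"]:
        entry["mtime_ns"] = mtime_ns  # touched but identical content
        return False
    return True
```

Checking mtime first means an unchanged corpus costs one stat call per file; hashing only runs for files whose timestamp moved.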

Every command exists in two equivalent forms — the legacy flat shim (codexa-index, codexa-ui, …) and the unified codexa <subcommand> group (codexa index, codexa ui, …). Pick whichever fits muscle memory; both bind to the same Click commands.

| command | what it does |
| --- | --- |
| `codexa index` | scan, extract, chunk, embed, persist (`--force-reindex` to rebuild from scratch) |
| `codexa ui` | Streamlit dashboard (search + diagnostics) |
| `codexa validate-config` | check `config.yml` for missing paths / invalid values |
| `codexa check-deps` | scan source for missing imports, optionally auto-install |
| `codexa cache stats` | print embedding-cache size, entries, hit rate |
| `codexa plugins list` | list plugins discovered in the current environment |
| `codexa fetch model` | download a curated GGUF for the local LLM provider |
| `codexa fetch embedding` | pre-cache the SentenceTransformer embedder under `<persist_dir>/hf_cache` |
| `codexa fetch zim` | download a Kiwix Wikipedia ZIM and wire `cfg.search.wikipedia.zim_path` |
| `codexa extract-strings` | scan source for `_("…")` calls and emit a `.pot` template |
| `codexa clear-stale-lock` | clear orphan `chroma.sqlite3-shm` / `-wal` after a crashed indexer (`--dry-run` to preview) |
| `codexa eval baseline` / `run` / `docs-baseline` | run live-index, case-file retrieval/RAG, or clean-checkout repo-doc quality gates |

To run straight from a checkout without pip install-ing the package, install just the third-party deps into a venv and invoke each command via python -m codexa.cli.<module>:

# from the repo root
python -m venv .venv && source .venv/bin/activate

# Pull in current core dependencies from [project.dependencies] without
# maintaining a second hand-written package list in this README.
python - <<'PY'
import subprocess
import sys
import tomllib
from pathlib import Path

deps = tomllib.loads(Path("pyproject.toml").read_text(encoding="utf-8"))["project"]["dependencies"]
subprocess.check_call([sys.executable, "-m", "pip", "install", *deps])
PY

# Now run each command via `python -m` instead of the console scripts.
python -m codexa.cli.config_validator --config config.yml
python -m codexa.cli.indexer_cli      --config config.yml    # ← REQUIRED before any UI search
python -m codexa.cli.cache_cli        --config config.yml
python -m codexa.cli.plugins_cli      --config config.yml
python -m codexa.cli.eval_cli baseline --config config.yml
python -m codexa.cli.deps_cli
python -m codexa.cli.fetch_model      # interactive GGUF picker
python -m codexa.cli.zim_fetch        # download a Wikipedia ZIM

# The UI is a Streamlit app — launch via the package entry point:
python -m codexa.ui

The python -m codexa.cli.* form maps 1:1 to the codexa-* console scripts above and accepts the same flags. Index before searching: codexa ui / python -m codexa.ui only reads the persisted Chroma store under cfg.store.persist_dir and never crawls data_dirs. Run python -m codexa.cli.indexer_cli --config config.yml (or codexa-index --config config.yml when installed) at least once before opening the dashboard; otherwise every query returns zero hits.

The UI ships in three equivalent launch forms — same Streamlit app, same codexa/ui/app.py entry point:

- codexa ui / codexa-ui: Click console script (installed builds only). Thin wrapper around codexa.ui.app:main.
- python -m codexa.ui: package entry point (codexa/ui/__main__.py). Same main() as the console script; works without pip install as long as the repo root is on PYTHONPATH.
- python -m streamlit run codexa/ui/app.py: drops down to the Streamlit CLI directly. Useful when you want to pass Streamlit's own flags (--server.port, --server.headless, --browser.gatherUsageStats false, …) without threading them through Click; anything after run … goes straight to Streamlit.

All three honour the same config.yml resolution (relative to the working directory unless CODEXA_CONFIG is set).

Configuration

config.yml is the single source of truth. Run codexa-validate-config to sanity-check it. The committed config-example.yml is the broad template; this README keeps the core operator keys and points at the plugin seams where a registry can extend them.

# ---- Locale ------------------------------------------------------------
# UI / CLI language. Resolved cfg → CODEXA_LOCALE env → LANG env → "en".
# Ships catalogs for: en, es, fr, pt_BR. A sub-tag (e.g. pt_BR) falls
# back to its base (pt) when its catalog is missing. Internal log event
# keys stay English regardless.
locale: "en"

# ---- Corpus ------------------------------------------------------------
# Directories to index. Multiple roots allowed. Symlinks followed.
data_dirs:
  - "/path/to/corpus"

# ---- Vector store ------------------------------------------------------
store:
  backend: "chroma"                    # built-in backend; plugins can
                                       # register more with
                                       # codexa.store.register_store_backend
  persist_dir: "./chroma_db"
  collection_name: "documents"         # main chunk collection
  file_collection: "files"             # per-file metadata collection
  entity_collection: "entities"        # per-entity rollups
  # ROB-1: when true, an HNSW segment-corrupt error from query() raises
  # ChromaSegmentCorruptError instead of returning an empty hit list.
  # Off by default so a single bad collection doesn't take the panel down.
  raise_on_segment_corrupt: false

# ---- Embeddings --------------------------------------------------------
embeddings:
  # HF identifier OR an absolute on-disk path. Identifier → pre-cache once
  # with `codexa fetch embedding` (or opt a first run into download with
  # allow_model_download / CODEXA_ALLOW_MODEL_DOWNLOAD); subsequent loads
  # pin HF_HUB_OFFLINE=1 against the `<persist_dir>/hf_cache` snapshot.
  # Point at an absolute path for a strictly-offline first run.
  model_path: "sentence-transformers/all-MiniLM-L6-v2"
  batch_size: 32                       # encoder batch ceiling
  auto_batch: true                     # halve at ≥60% RAM (≥80% VRAM on
                                       # CUDA); double back to the ceiling
                                       # at ≤40% RAM (≤55% VRAM); floor 1.
  # "auto" probes cuda → mps → xpu → cpu. Override with "cuda"/"cuda:1"/
  # "mps"/"xpu"/"cpu". AMD ROCm rides the cuda branch. Failed probes log
  # device_probe_failed DEBUG so `--verbose` ops can tell torch-missing
  # from ROCm-broken from ipex-import-explodes.
  device: "auto"
  # Cap PyTorch's per-process VRAM allocator (0.0 < x < 1.0) so a
  # re-index can't starve the rest of the desktop of GPU memory.
  # Logged once at init as cuda_memory_capped.
  cuda_memory_fraction: 0.7
  # BLAS / tokenizer / torch intra-op cap for the main-process embedder.
  # Dominant throughput lever on CPU-only hosts; default/template value 8
  # suits a large CPU host.
  # Operator env vars (OMP_NUM_THREADS=…) win.
  main_thread_cap: 8
  # On-disk shard cache. Content-addressed; sha256(text) → vector.
  cache:
    enabled: true
    dir: "{store.persist_dir}/emb_cache"
    shards: 16

# ---- OCR ---------------------------------------------------------------
ocr:
  enabled: true
  # "both" (Paddle + Tesseract merged) | "paddle" | "tesseract".
  # Paddle adapter targets PaddleOCR 3.x (use_textline_orientation,
  # predict(path)); legacy use_gpu is gone — paddlepaddle picks
  # CPU/GPU from the installed wheel.
  engine: "tesseract"
  languages: "eng"                     # Tesseract codes (ISO 639-2/T)
  pdf_dpi: 200                         # render DPI for image-only PDF pages
  # When a PDF's text-layer averages fewer than this many chars per
  # page, fall back to inline OCR. Lower = OCR less aggressively; higher
  # = OCR more aggressively. Only consulted when ocr.enabled.
  pdf_text_threshold_chars: 512
  # Per-stream FlateDecode cap (MiB) used on retry when a PDF page blows
  # past pypdf's 75 MB default. Per-call only — global pypdf cap stays
  # safe. 4 GiB covers every legitimate textbook we've seen.
  pdf_decompress_limit_mb: 4096
  # Content-addressed OCR cache. Keyed on sha256(file_bytes) + (engine,
  # languages, pdf_dpi). Default-off; flip on for OCR-heavy corpora.
  cache:
    enabled: false
    dir: "{store.persist_dir}/ocr_cache"
    shards: 8

# ---- SVG ---------------------------------------------------------------
svg:
  # When an SVG's stdlib-XML pass returns < text_threshold_chars chars
  # AND ocr.enabled, rasterise (render_dpi) and OCR the bitmap. Needs the
  # [svg] extra (svglib + reportlab) or a system cairosvg; without one
  # the fallback is a silent no-op.
  render_fallback: true
  render_dpi: 150
  text_threshold_chars: 16

# ---- Vision (object detection) -----------------------------------------
# Appends an "[objects: …]" line to image extraction via an ONNX-Runtime
# detector. Independent of ocr.enabled (run either / both / neither).
# Operator-supplied model — no runtime fetch. Needs the [vision] extra
# (onnxruntime, no torch). Extra or model absent → silent no-op.
vision:
  enabled: false
  model_path: ""                       # "" → {persist_dir}/vision/model.onnx
  class_names_path: ""                 # "" → bundled COCO-80
  score_threshold: 0.35
  max_labels: 12

# ---- LLM ---------------------------------------------------------------
llm:
  provider: "none"                     # "none" | "ollama" | "local"
  fallback: "local"                    # retried when primary unreachable
  # Ollama (external server)
  model: "llama3"
  host: "http://localhost:11434"
  # Intra-Ollama model chain — when `model` times out/errors the
  # dispatcher walks each entry before falling over to `fallback`.
  # Each miss emits one ollama_model_failed WARNING.
  model_fallbacks: []
  # Per-Ollama-call timeout (seconds). Default 300 covers cold-start
  # weight loads on CPU-only hosts. Drop to 30–60 on fast-GPU boxes.
  # Garbage values silently fall back to default.
  timeout: 300
  # Fast /api/tags availability probe timeout. Generation still uses
  # `timeout`; raise this for slow remote Ollama hosts.
  probe_timeout: 3.0
  # Embedded llama.cpp — deterministic (temperature=0, top_k=1, fixed
  # seed). `pip install -e ".[local-llm]"`, then point `model_path` at
  # a GGUF on disk. Curated GGUFs land under `{persist_dir}/models/` via
  # `codexa fetch model --name <entry>` (override dir with $CODEXA_MODEL_DIR
  # or --out). `codexa fetch model --recommend` probes the host and
  # ✓/✗-marks every catalog entry; `--largest` downloads the biggest
  # entry that fits (size × 1.2 ≤ RAM).
  #
  # Catalog entries (4-bit / Q4_K_M GGUFs):
  #   name                      size    license               notes
  #   ------------------------  ------  --------------------  -----------------------------------------------
  #   qwen2.5-0.5b   (default)  0.4 GB  Apache-2.0            smallest / fastest first run
  #   llama-3.2-1b              0.8 GB  Llama 3.2 Community   131 K context, runs anywhere
  #   qwen2.5-1.5b              1.0 GB  Apache-2.0            balanced 32 K context
  #   qwen3-30b-a3b              19 GB  Apache-2.0            MoE (3 B active) — fastest big-class on CPU
  #   qwen3-32b                  20 GB  Apache-2.0            dense, hybrid thinking/non-thinking modes
  #   llama-3.3-70b              42 GB  Llama 3.3 Community   highest-quality dense for ≥64 GB hosts
  #   deepseek-r1-distill-70b    42 GB  MIT                   reasoning-tuned (chain-of-thought baked in)
  #   qwen2.5-72b                47 GB  Apache-2.0            multi-part download (auto-concatenated)
  #   mixtral-8x22b              85 GB  Apache-2.0            multi-part; MoE (~39 B active); RAM-hungry
  local:
    model_path: "{store.persist_dir}/models/qwen2.5-0.5b-instruct-q4_k_m.gguf"
    n_ctx: 4096                        # raise per-model when catalog suggests
    seed: 1337                         # fixed seed → reproducible answers
    max_tokens: 512
  max_summary_chars: 500               # truncate LLM-generated corpus summary

# ---- Search / retrieval ------------------------------------------------
search:
  history_turns: 6                     # 3 user + 3 assistant; 0 disables
  history_compress: false              # LLM-summarise dropped turns; opt-in
  highlight_entities: false            # bold recognised entities in passages
  passage_thumbnail_dpi: 100           # PDF page render DPI in expanders

  # Cross-encoder reranker over the retrieved top-K. Disabled by default
  # — costs ~50-200 ms per query on CPU (sub-50 ms on GPU) but catches
  # nuance the bi-encoder misses. Auto-downloads on first use.
  reranker:
    enabled: false
    model_id: "BAAI/bge-reranker-base"
    top_k_keep: 5
    device: "auto"                     # "auto" → cuda when available

  # Drop paraphrase near-duplicates after the Jaccard dedup. Bounded to
  # the over-fetched ≤30 rows. Costs one extra embed pass; set false on
  # latency-tight hosts.
  semantic_dedup:
    enabled: true
    threshold: 0.93                    # cosine ≥ this ⇒ near-duplicate

  # Pseudo-relevance feedback (Rocchio). Results merged, never replaced,
  # so PRF only ever adds recall. Costs one extra embed + store query.
  prf:
    enabled: true
    top_m: 5                           # survivors fed into the centroid
    alpha: 1.0                         # original-query weight
    beta: 0.75                         # feedback-centroid weight

  # Compound-question splitter — RRF-fuses 2–3 sub-query results.
  # User query is wrapped in <<<USER_INPUT>>> fence so a hostile
  # multi-line query can't break out of its role slot.
  query_decomp:
    enabled: false

  # HyDE — embed a hypothetical answer paragraph as a second retrieval
  # pass. Same fencing as query_decomp.
  hyde:
    enabled: false
    n: 1                               # clamped to [1, 5]

  # Conversational rewrite — when history exists and the query looks like
  # a follow-up, build a standalone query with the same prompt fences and
  # locale directive as HyDE / query_decomp.
  query_rewrite:
    enabled: false
    ab_fuse: false                     # RRF-fuse original + rewrite

  # Optional fan-out variants. All ride the same batched sub-query embed
  # + concurrent store.query path before RRF merge.
  multilingual:
    enabled: false
  query_expansion:
    enabled: false
    synonyms_path: ""                  # YAML/JSON alias map

  # FLARE — watch the streaming answer for low-confidence sentences and
  # fire additional retrieval rounds. Each trigger logs flare_retrigger
  # + flare_passages_added DEBUG events.
  flare:
    enabled: false
    min_logprob: -2.5                  # per-token logprob threshold
    max_rounds: 3                      # hard cap on re-retrieval rounds
    top_k_extra: 5                     # passages per re-retrieval

  # Offline "🧠 Passage insights" expander (LexRank TL;DR + KeyBERT
  # phrases) computed with the search embedder already in memory.
  insights:
    enabled: false
    max_sentences: 3
    top_keyphrases: 8

  # Offline encyclopedic enrichment. Tier walk: ZIM → seed → online.
  # codexa/data/wikipedia_seed*.json ships 25 English topics plus
  # localized pt/es/fr overlays.
  wikipedia:
    enabled: true
    data_dirs: []                      # extra operator-supplied seed JSONs
    zim_path: ""                       # single .zim or list of paths
    zim_dir: ""                        # glob *.zim under a folder
    online: false                      # opt-in fallback; breaks offline-first
    lang: "auto"                       # "auto" derives from cfg.locale

  # Optional hierarchical retrieval: indexer.hierarchical emits L0/L1/L2
  # rows; this stage swaps L2 child hits for their L1 parent when present.
  expand_to_parent:
    enabled: false

# ---- Metadata / sessions -----------------------------------------------
metadata:
  # Both manifest_file and sessions_file pick a backend from the suffix:
  # `.json` → JSON (full-file rewrite per save), `.db`/`.sqlite` → SQLite
  # (per-row UPSERT in one tx). On hosts without POSIX fcntl locks,
  # `.json` sessions route to a sibling `.db` automatically for
  # cross-process safety. Switch to SQLite explicitly at ~50k files /
  # 1k saved threads.
  manifest_file: "{store.persist_dir}/manifest.json"
  sessions_file: "{store.persist_dir}/sessions.json"
  index_info_file: "{store.persist_dir}/index_info.json"
  enable_entity_extraction: true
  tag_count: 50

# ---- Indexer -----------------------------------------------------------
indexer:
  # Upper bounds: effective values are capped by core count (chunker:
  # cores//4, floor 2, ceiling 8; OCR: 2). Over-cap values are downscaled
  # automatically and emit a workers_downscaled WARNING.
  workers: 4
  ocr_workers: 2
  skip_dirs: []                        # extra dir names to skip
  # Optional ramp-up cap. 0 = process everything. Mirrors --max-files.
  max_files: 0
  # Manifest checkpoint cadence during the embed loop. Saves happen
  # every N files OR every M seconds, whichever comes first.
  checkpoint_every_files: 10
  checkpoint_every_seconds: 5.0
  # Optional L0/L1/L2 chunk tree. Default off keeps the flat chunk path.
  hierarchical:
    enabled: false

# ---- Chunking ----------------------------------------------------------
chunking:
  # Per-chunk token cap. Overlap is computed dynamically per file
  # (10% of token count, min 20). The optional hierarchical producer
  # rebuilds flat chunks into L0/L1/L2 rows when indexer.hierarchical
  # is enabled.
  chunk_size: 512

# ---- Format gates ------------------------------------------------------
formats:
  enabled: []                          # allow-list; empty = all
  disabled: []                         # deny-list (always wins)
  # When no registered extractor matches a file's extension, read its
  # bytes as UTF-8 with errors='ignore'. True is backwards-compatible
  # but unsafe on mixed-binary corpora — flip false to make unknown
  # extensions return "" instead.
  text_fallback: true
  # Cap (MiB) on the text-fallback read; a multi-GB binary under
  # data_dirs is decoded to its prefix only and logged
  # text_fallback_truncated.
  text_fallback_max_mb: 64

# ---- Plugins -----------------------------------------------------------
plugins:
  disabled: []                         # block individual plugin names
  # Cap on the timeout-dispatch thread pool. Only matters when plugins
  # opt into a `timeout_s`.
  executor_workers: 4
  # slow_threshold_ms: 1000            # WARN when a plugin hook exceeds this
  # Per-plugin slices land as siblings of `disabled`:
  # plugins:
  #   word_counter:
  #     min_words: 100
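
Several search options in the config above (query_decomp, query_rewrite.ab_fuse, the fan-out variants) merge multiple ranked result lists with RRF. For readers unfamiliar with reciprocal rank fusion, a minimal sketch follows. The function name rrf_fuse and the constant k=60 are assumptions (60 is the value commonly used in the literature); the README does not state which constant Codexa uses.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence


def rrf_fuse(rankings: Iterable[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]
sparse_hits = ["b", "d", "a"]
fused = rrf_fuse([dense_hits, sparse_hits])
print(fused)  # ['b', 'a', 'd', 'c']
```

Because RRF only looks at ranks, it needs no score calibration between the dense and sparse retrievers.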

Architecture

Two pipelines (indexing + search) over one persistent Chroma vector store, strategy-registry dispatch keyed off a single cfg dict, plugin hooks at pre_extract / chunk_filter / search_filter / post_index / ui_panel. ASCII diagram, layer map, cross-cutting conventions, and the plugin-system walkthrough (entry-point group, hook contracts, lifecycle) live in docs/architecture.md.
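
As a rough picture of what "strategy-registry dispatch" means for extractors, the sketch below keeps a dict from file extension to extractor function and dispatches on the suffix. Everything here is hypothetical: register_extractor_sketch, extract_text, and the decorator form are illustrative assumptions, not the signature of Codexa's real register_extractor seam (see docs/architecture.md for that).

```python
from pathlib import Path
from typing import Callable, Dict, Iterable

Extractor = Callable[[Path], str]

# Hypothetical registry: normalised extension (no dot, lowercase) -> extractor.
_EXTRACTORS: Dict[str, Extractor] = {}


def register_extractor_sketch(extensions: Iterable[str]) -> Callable[[Extractor], Extractor]:
    """Decorator that registers one function for several extensions."""
    def decorator(func: Extractor) -> Extractor:
        for ext in extensions:
            _EXTRACTORS[ext.lower().lstrip(".")] = func
        return func
    return decorator


def extract_text(path: Path) -> str:
    """Dispatch on the file suffix; unknown extensions yield an empty string."""
    extractor = _EXTRACTORS.get(path.suffix.lower().lstrip("."))
    if extractor is None:
        return ""  # mirrors the formats.text_fallback: false behaviour
    return extractor(path)


@register_extractor_sketch(["txt", ".md"])
def plain_text(path: Path) -> str:
    return path.read_text(encoding="utf-8", errors="ignore")
```

Keying the registry on a normalised extension keeps dispatch a single dict lookup and lets a plugin override a built-in by registering the same key.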

UI

codexa ui boots a Streamlit dashboard. Layout: sidebar (saved-chat threads + filter) → main (search box + RAG answer + retrieved passages with source-aware thumbnails) → right-hand tab strip (🧬 Semantic Structure / 📂 Index / 🩺 Diagnostics).

Full walkthrough — passage rendering, image augment per source type, saved-chat schema + sidebar behaviour, Semantic Structure tab caching, Index tab controls (Run Indexer / Force reindex / stale-lock cleanup), and i18n catalog wiring (en / es / fr / pt_BR) — lives in docs/ui.md.

LLM providers

Three values of cfg.llm.provider: "none" (passages only), "ollama" (external server), "local" (embedded llama.cpp, deterministic). cfg.llm.fallback names the provider that is tried when the primary is unreachable. The curated 9-entry GGUF catalog, the codexa fetch model --recommend / --largest host-probe helpers, and multi-part download support are documented in docs/llm.md.
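
The --largest rule quoted in the Configuration section (size × 1.2 ≤ RAM) can be illustrated with the catalog sizes listed there. This is a sketch under assumptions: the function name largest_fitting is hypothetical, "RAM" is treated as a single number of gigabytes supplied by the caller, and ties between equally sized entries are broken by catalog order, which the README does not specify.

```python
from typing import Dict, Optional

# Sizes in GB, copied from the catalog table in the Configuration section.
CATALOG_GB: Dict[str, float] = {
    "qwen2.5-0.5b": 0.4,
    "llama-3.2-1b": 0.8,
    "qwen2.5-1.5b": 1.0,
    "qwen3-30b-a3b": 19.0,
    "qwen3-32b": 20.0,
    "llama-3.3-70b": 42.0,
    "deepseek-r1-distill-70b": 42.0,
    "qwen2.5-72b": 47.0,
    "mixtral-8x22b": 85.0,
}

HEADROOM = 1.2  # the "size x 1.2 <= RAM" factor from the README


def largest_fitting(ram_gb: float, catalog: Dict[str, float] = CATALOG_GB) -> Optional[str]:
    """Return the biggest entry whose size times HEADROOM fits in ram_gb, else None."""
    fitting = {name: size for name, size in catalog.items() if size * HEADROOM <= ram_gb}
    if not fitting:
        return None
    # max() keeps the first entry on ties, so catalog order decides (an assumption).
    return max(fitting, key=lambda name: fitting[name])


print(largest_fitting(32.0))  # qwen3-32b
```

On a 32 GB host, qwen3-32b needs 20 × 1.2 = 24 GB and fits, while the 42 GB entries need 50.4 GB and do not.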

Wikipedia / ZIM

Two distinct surfaces: search-time RAG enrichment (cfg.search.wikipedia.* walks ZIM → seed → online; bundled 25-topic seed always available) and ZIM-as-corpus ingestion (drop a .zim under data_dirs for full chunk+embed). codexa-fetch-zim / codexa-check-deps --fetch-zims walkthroughs and the precedence table live in docs/wikipedia.md.

Semantic toolkit

codexa.semantic.* — pure-Python offline primitives: entity normalization, RAKE / KeyBERT keyphrases, co-occurrence graphs, TF-IDF cluster labels, embedding near-dup filter, Rocchio PRF, LexRank summary. Statistical and embedding-aware variants, plus where each surfaces in the UI (🧠 Passage insights, Semantic Structure graph, query-expansion captions, inline entity highlight) live in docs/semantics.md.
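
For the Rocchio PRF primitive, the search.prf config keys (alpha: 1.0 original-query weight, beta: 0.75 feedback-centroid weight) suggest the classic positive-feedback form q' = alpha·q + beta·centroid(top_m). A minimal pure-Python sketch follows; the function name rocchio_expand is hypothetical, and whether Codexa re-normalises the expanded vector is not stated in the README, so no normalisation is applied here.

```python
from typing import List, Sequence


def rocchio_expand(query: Sequence[float],
                   feedback: Sequence[Sequence[float]],
                   alpha: float = 1.0,
                   beta: float = 0.75) -> List[float]:
    """Blend the query vector with the centroid of the top-m feedback vectors."""
    if not feedback:
        return list(query)  # nothing to learn from, keep the original query
    dim = len(query)
    centroid = [sum(vec[i] for vec in feedback) / len(feedback) for i in range(dim)]
    return [alpha * q + beta * c for q, c in zip(query, centroid)]


expanded = rocchio_expand([1.0, 0.0], [[0.0, 1.0], [0.0, 3.0]])
print(expanded)  # [1.0, 1.5]
```

In a pipeline like the one described, the expanded vector would drive a second store query whose hits are merged with the first pass rather than replacing it.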

Performance

GPU device-probe order + per-vendor install hints, throughput knobs (auto-batch, cache stats, cross-file dedup, mtime-first change detection, SQLite manifest, cores-aware worker caps, CPU-only chunkers, CUDA OOM backoff, effective_budget pre-flight log), and the ramp-up benchmark (scripts/bench_indexer.py) live in docs/performance.md.
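
Two of the throughput knobs above have simple numeric rules in the Configuration section: auto-batch (halve at ≥60% RAM, double back toward the ceiling at ≤40% RAM, floor 1) and the cores-aware chunker cap (cores//4, floor 2, ceiling 8). The sketch below restates those rules as code. The function names are hypothetical, only the RAM thresholds are modelled (not the VRAM ones), and reading "double back to the ceiling" as one doubling per check is an assumption.

```python
def next_batch_size(current: int, ceiling: int, ram_used_pct: float,
                    high: float = 60.0, low: float = 40.0) -> int:
    """One auto-batch step: shrink under pressure, grow back when RAM frees up."""
    if ram_used_pct >= high:
        return max(1, current // 2)       # halve, never below the floor of 1
    if ram_used_pct <= low:
        return min(ceiling, current * 2)  # double, never above embeddings.batch_size
    return current                        # in the dead band: leave unchanged


def effective_chunk_workers(requested: int, cores: int) -> int:
    """Apply the cores-aware cap to a requested chunker worker count."""
    cap = min(8, max(2, cores // 4))      # cores//4, floor 2, ceiling 8
    return min(requested, cap)


print(next_batch_size(32, 32, 75.0))   # 16
print(effective_chunk_workers(4, 16))  # 4
```

The gap between the two RAM thresholds acts as hysteresis, so the batch size does not oscillate when memory use hovers near a single cut-off.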

Testing

Local pytest invocations (pytest, --cov, -m bench), offline-by-default conftest discipline, the bench marker, GitHub Actions workflow (lint + test, cov ≥ 90%), and SonarCloud wiring live in docs/testing.md.

Troubleshooting & Logging

Chroma sqlite lock errors (ChromaInitError / database is locked / ChromaSegmentCorruptError) with the recovery steps, plus the structured JSON-lines log surface (file rotation, stable event keys like run_start / effective_budget / workers_downscaled / embed_oom_backoff, CODEXA_LOG_DIR override) live in docs/troubleshooting.md.
