geraldo-netto/codexa

Codexa

Codexa is an offline-first semantic indexing engine for large document collections.

  • 🔍 Semantic search (embeddings, entities, clusters) + hybrid dense / sparse (corpus BM25) fusion (RRF default, weighted optional; search.hybrid.quality_weight_enabled can opt weighted fusion into quality_score discounting)
  • 📚 Multi-format ingestion
    • Documents — PDF, DOCX, PPTX, PPT, XLSX, XLS, ODT, ODG, EPUB, FB2, MOBI, AZW, AZW3, HTML / HTM / XHTML, CHM, PS, RTF. RTF needs the optional [rtf] extra (pip install -e ".[rtf]", which installs striprtf); without it .rtf soft-skips to empty rather than indexing raw control words.
    • Images (OCR + object detection) — PNG, JPG / JPEG / JFIF, TIFF / TIF, BMP, GIF, WEBP, JP2 / J2K (JPEG 2000), CBZ — all Pillow-native, no extra. CBR needs the optional [cbr] extra plus unrar, unar, or bsdtar; without it .cbr soft-skips instead of falling through to binary text. HEIC / HEIF (iPhone default since iOS 11) and AVIF decode with the optional [heif] extra (pip install -e ".[heif]"); without it those three soft-decode to empty rather than crashing extraction. OCR pulls words off the image; with the optional [vision] extra and an operator-supplied YOLOv8-class .onnx, an [objects: …] line of detected objects is appended so text-free photos stay searchable.
    • DjVu — DJVU / DJV scanned documents extract their embedded text layer via the DjVuLibre djvutxt system tool, falling back to per-page OCR when the document has no text layer. Install DjVuLibre (apt install djvulibre-bin / brew install djvulibre); without it DjVu files soft-skip rather than crashing the indexer.
    • SVG — stdlib-XML text by default; with the optional [svg] extra (pip install -e ".[svg]"), SVGs whose text is path-outlined or whose payload is an embedded raster are rasterised and OCR'd as a fallback.
    • Text & data — TXT, MD, RST, CSV, JSON, YAML / YML, LOG
    • ZIM archives — drop a .zim (Wikipedia / Kiwix bundle) under data_dirs and Codexa walks every article via libzim and indexes them just like any other corpus file. Needs the [zim] extra (pip install -e ".[zim]").
    • Plus anything you register via the Extractor Strategy — plugins can add their own formats with register_extractor(extensions).
  • ⚡ GPU-accelerated OCR and embeddings, with auto-batch sizing under RAM pressure and CPU-only chunking workers
  • 🧱 Incremental indexing with hashing + SQLite manifest connection pool for >50k-file corpora
  • 🗂️ Sharded embedding cache with hit/miss telemetry + wired caches track the six-point AGENTS.md cache contract (cap · TTL · lock · reset hook · stats accessor) through AppContext.reset; the ColBERT scorer cache remains the documented open exception under CACHE-33
  • 📊 Streamlit UI (search · semantic structure · index · chatbot ops · diagnostics)
  • 🛡️ Health checks, config validation, structured JSON logs, atomic durable writes (utils.io.atomic_write_text — tmp + fsync + rename + parent-dir fsync)
  • 🧠 Optional LLM support via Ollama OR an embedded llama.cpp (local) provider, with cross-provider fallback
  • 🔁 Multi-stage RAG dispatch: query rewrite (BOT-1) → query decomposition / HyDE / multilingual / synonym fan-out → dense retrieval → horizon + quality gates → Jaccard + semantic dedup → PRF expansion → dense+sparse fusion → cross-encoder + ColBERT rerank → expand-to-parent → scoring + plugin filter → context-budget cut → prompt build → streaming or blocking LLM → orphan-citation strip + hallucination flag + RAG-7 regenerate-when-ungrounded
  • 🛟 Multi-tier RAG enrichment: bundled Wikipedia seed (25 English topics plus pt/es/fr overlays) + on-demand libzim full-text + opt-in online fallback
  • 💬 Multi-turn sessions with bounded prompt history, optional LLM-summarised drop-off, JSON or SQLite backend; per-passage pin / exclude / thumbs feedback + locale-aware follow-up suggestions
  • 🤖 Chatbot operations dashboard — declared ChatbotObjective (statement / owner / review cadence / success metrics), aggregated session counters, pilot-batch tracking (pilot_runs.jsonl with O(1) tombstone-overlay updates)
  • 🌍 i18n (English, Spanish, French, Brazilian Portuguese — runtime-loaded .po catalogs) — locale directive threads into every LLM prompt (HyDE / decomp / history-summary / follow-ups / answer) so a pt_BR session never gets an English sub-query / recap / answer
  • 🧩 Plugin system (discovery, sandboxing, hook plugins, and register_* seams for extractors / stores / OCR / LLM / chunkers)
  • 🔬 Semantic toolkit: entity normalization & linking, RAKE + embedding keyphrases, co-occurrence graphs, TF-IDF cluster labels, embedding near-dup detection, PRF query expansion, LexRank extractive summarization
  • 🔐 Prompt-injection mitigation: user-supplied queries are fenced before LLM interpolation across query_decomp / query_rewrite / hyde / fallback / history_compress; each prompt builder pins a regression test for the fence + locale + NO_ANSWER + length rule
  • 📈 Eval harness with ablation toggles: per-stage on/off (run_case_with_stage_toggle), regen-vs-baseline comparison (run_case_with_regen_comparison), tuned-vs-baseline overlay sweep (compare_tuned_vs_baseline)
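The durable-write sequence named in the feature list (tmp + fsync + rename + parent-dir fsync) can be sketched as follows. This is a minimal illustration of the general pattern, not the body of utils.io.atomic_write_text; the name atomic_write_text_sketch and its signature are assumptions made for this example.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text_sketch(path, text, encoding="utf-8"):
    """Write text so readers see the old file or the new one, never a partial one.

    Hypothetical helper illustrating tmp + fsync + rename + parent-dir fsync.
    """
    path = Path(path)
    parent = path.parent
    # 1. Temp file in the same directory, so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=encoding) as handle:
            handle.write(text)
            handle.flush()
            # 2. Push the file contents to stable storage.
            os.fsync(handle.fileno())
        # 3. Atomically swap the temp file into place.
        os.replace(tmp_name, path)
    except BaseException:
        # Best-effort cleanup of the temp file on any failure.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    # 4. fsync the parent directory so the rename itself survives a crash.
    #    O_DIRECTORY is POSIX-only, so this step is skipped elsewhere.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(parent, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Writing the temp file next to the target matters: os.replace is only atomic when source and destination live on the same filesystem.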

Table of contents

  1. Quick start
  2. Configuration
  3. Architecture → docs/architecture.md
  4. UI → docs/ui.md
  5. LLM providers → docs/llm.md
  6. Wikipedia / ZIM → docs/wikipedia.md
  7. Semantic toolkit → docs/semantics.md
  8. Performance → docs/performance.md
  9. Testing → docs/testing.md
  10. Troubleshooting & Logging → docs/troubleshooting.md

Quick start

pip install -e .
cp config-example.yml config.yml             # then edit data_dirs / store paths
codexa-validate-config --config config.yml   # sanity-check first
codexa-index --config config.yml             # build/refresh the index
codexa-ui                                    # launch the dashboard

config-example.yml is the committed template with generic placeholder paths. Copy it to config.yml (git-ignored) and set your real data_dirs + store.persist_dir — your private paths never get committed. --config defaults to the nearest config.yml above the cwd (or $CODEXA_CONFIG).

The order is load-bearing: codexa-ui boots a search-only frontend — without an existing index it opens but every query returns zero hits. You must run codexa-index at least once before searches return anything. Subsequent runs are incremental (mtime + hash diff against the manifest), so the second invocation only touches files that actually changed.
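As a rough illustration of the incremental check described above (mtime first, then a content hash compared against the manifest), here is a minimal sketch. The manifest shape (a dict keyed by path holding mtime_ns and sha256) and the function names are hypothetical; Codexa's real manifest schema is not shown in this README.

```python
import hashlib
from pathlib import Path


def sha256_file(path, block_size=1 << 20):
    """Hash a file in fixed-size blocks so large files are never fully in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_reindex(path, manifest):
    """Return True when the file is new or its bytes changed; refresh its entry."""
    path = Path(path)
    key = str(path)
    mtime_ns = path.stat().st_mtime_ns
    entry = manifest.get(key)
    # Cheap check first: an unchanged mtime means the recorded hash is trusted.
    if entry is not None and entry["mtime_ns"] == mtime_ns:
        return False
    # mtime moved (or the file is new): fall back to the content hash.
    content_hash = sha256_file(path)
    changed = entry is None or entry["sha256"] != content_hash
    # Record the new mtime either way so a bare `touch` is not re-hashed next run.
    manifest[key] = {"mtime_ns": mtime_ns, "sha256": content_hash}
    return changed
```

The hash step is what keeps a touched-but-identical file from being re-embedded; the mtime step is what keeps an unchanged corpus from being re-read at all.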

Every command exists in two equivalent forms — the legacy flat shim (codexa-index, codexa-ui, …) and the unified codexa <subcommand> group (codexa index, codexa ui, …). Pick whichever fits muscle memory; both bind to the same Click commands.

command                                     what it does
------------------------------------------  ------------------------------------------------------------
codexa index                                scan, extract, chunk, embed, persist (--force-reindex to rebuild from scratch)
codexa ui                                   Streamlit dashboard (search + diagnostics)
codexa validate-config                      check config.yml for missing paths / invalid values
codexa check-deps                           scan source for missing imports, optionally auto-install
codexa cache stats                          print embedding-cache size, entries, hit rate
codexa plugins list                         list plugins discovered in the current environment
codexa fetch model                          download a curated GGUF for the local LLM provider
codexa fetch embedding                      pre-cache the SentenceTransformer embedder under <persist_dir>/hf_cache
codexa fetch zim                            download a Kiwix Wikipedia ZIM and wire cfg.search.wikipedia.zim_path
codexa extract-strings                      scan source for _("…") calls and emit a .pot template
codexa clear-stale-lock                     clear orphan chroma.sqlite3-shm / -wal after a crashed indexer (--dry-run to preview)
codexa eval baseline / run / docs-baseline  run live-index, case-file retrieval/RAG, or clean-checkout repo-doc quality gates

To run straight from a checkout without pip install-ing the package, install just the third-party deps into a venv and invoke each command via python -m codexa.cli.<module>:

# from the repo root
python -m venv .venv && source .venv/bin/activate

# Pull in current core dependencies from [project.dependencies] without
# maintaining a second hand-written package list in this README.
python - <<'PY'
import subprocess
import sys
import tomllib
from pathlib import Path

deps = tomllib.loads(Path("pyproject.toml").read_text(encoding="utf-8"))["project"]["dependencies"]
subprocess.check_call([sys.executable, "-m", "pip", "install", *deps])
PY

# Now run each command via `python -m` instead of the console scripts.
python -m codexa.cli.config_validator --config config.yml
python -m codexa.cli.indexer_cli      --config config.yml    # ← REQUIRED before any UI search
python -m codexa.cli.cache_cli        --config config.yml
python -m codexa.cli.plugins_cli      --config config.yml
python -m codexa.cli.eval_cli baseline --config config.yml
python -m codexa.cli.deps_cli
python -m codexa.cli.fetch_model      # interactive GGUF picker
python -m codexa.cli.zim_fetch        # download a Wikipedia ZIM

# The UI is a Streamlit app — launch via the package entry point:
python -m codexa.ui

The python -m codexa.cli.* form maps 1:1 to the codexa-* console scripts above and accepts the same flags. Index before searching: codexa ui / python -m codexa.ui only reads the persisted Chroma store under cfg.store.persist_dir and never crawls data_dirs. Run python -m codexa.cli.indexer_cli --config config.yml (or codexa-index --config config.yml when installed) at least once before opening the dashboard; otherwise every query returns zero hits.

The UI ships in three equivalent launch forms — same Streamlit app, same codexa/ui/app.py entry point:

  • codexa ui / codexa-ui — Click console script (installed builds only). Thin wrapper around codexa.ui.app:main.
  • python -m codexa.ui — package entry point (codexa/ui/__main__.py). Same main() as the console script; works without pip install as long as the repo root is on PYTHONPATH.
  • python -m streamlit run codexa/ui/app.py — drop down to the Streamlit CLI directly. Useful when you want to pass Streamlit's own flags (--server.port, --server.headless, --browser.gatherUsageStats false, …) without threading them through Click — anything after run … goes straight to Streamlit.

All three honour the same config.yml resolution (relative to the working directory unless CODEXA_CONFIG is set).

Configuration

config.yml is the single source of truth. Run codexa-validate-config to sanity-check it. The committed config-example.yml is the broad template; this README keeps the core operator keys and points at the plugin seams where a registry can extend them.

# ---- Locale ------------------------------------------------------------
# UI / CLI language. Resolved cfg → CODEXA_LOCALE env → LANG env → "en".
# Ships catalogs for: en, es, fr, pt_BR. A sub-tag (e.g. pt_BR) falls
# back to its base (pt) when its catalog is missing. Internal log event
# keys stay English regardless.
locale: "en"

# ---- Corpus ------------------------------------------------------------
# Directories to index. Multiple roots allowed. Symlinks followed.
data_dirs:
  - "/path/to/corpus"

# ---- Vector store ------------------------------------------------------
store:
  backend: "chroma"                    # built-in backend; plugins can
                                       # register more with
                                       # codexa.store.register_store_backend
  persist_dir: "./chroma_db"
  collection_name: "documents"         # main chunk collection
  file_collection: "files"             # per-file metadata collection
  entity_collection: "entities"        # per-entity rollups
  # ROB-1: when true, an HNSW segment-corrupt error from query() raises
  # ChromaSegmentCorruptError instead of returning an empty hit list.
  # Off by default so a single bad collection doesn't take the panel down.
  raise_on_segment_corrupt: false

# ---- Embeddings --------------------------------------------------------
embeddings:
  # HF identifier OR an absolute on-disk path. Identifier → pre-cache once
  # with `codexa fetch embedding` (or opt a first run into download with
  # allow_model_download / CODEXA_ALLOW_MODEL_DOWNLOAD); subsequent loads
  # pin HF_HUB_OFFLINE=1 against the `<persist_dir>/hf_cache` snapshot.
  # Point at an absolute path for a strictly-offline first run.
  model_path: "sentence-transformers/all-MiniLM-L6-v2"
  batch_size: 32                       # encoder batch ceiling
  auto_batch: true                     # halve at ≥60% RAM (≥80% VRAM on
                                       # CUDA); double back to the ceiling
                                       # at ≤40% RAM (≤55% VRAM); floor 1.
  # "auto" probes cuda → mps → xpu → cpu. Override with "cuda"/"cuda:1"/
  # "mps"/"xpu"/"cpu". AMD ROCm rides the cuda branch. Failed probes log
  # device_probe_failed DEBUG so `--verbose` ops can tell torch-missing
  # from ROCm-broken from ipex-import-explodes.
  device: "auto"
  # Cap PyTorch's per-process VRAM allocator (0.0 < x < 1.0) so a
  # re-index can't starve the rest of the desktop of GPU memory.
  # Logged once at init as cuda_memory_capped.
  cuda_memory_fraction: 0.7
  # BLAS / tokenizer / torch intra-op cap for the main-process embedder.
  # Dominant throughput lever on CPU-only hosts; default/template value 8
  # suits a large CPU host.
  # Operator env vars (OMP_NUM_THREADS=…) win.
  main_thread_cap: 8
  # On-disk shard cache. Content-addressed; sha256(text) → vector.
  cache:
    enabled: true
    dir: "{store.persist_dir}/emb_cache"
    shards: 16

# ---- OCR ---------------------------------------------------------------
ocr:
  enabled: true
  # "both" (Paddle + Tesseract merged) | "paddle" | "tesseract".
  # Paddle adapter targets PaddleOCR 3.x (use_textline_orientation,
  # predict(path)); legacy use_gpu is gone — paddlepaddle picks
  # CPU/GPU from the installed wheel.
  engine: "tesseract"
  languages: "eng"                     # Tesseract codes (ISO 639-2/T)
  pdf_dpi: 200                         # render DPI for image-only PDF pages
  # When a PDF's text-layer averages fewer than this many chars per
  # page, fall back to inline OCR. Lower = OCR less aggressively; higher
  # = OCR more aggressively. Only consulted when ocr.enabled.
  pdf_text_threshold_chars: 512
  # Per-stream FlateDecode cap (MiB) used on retry when a PDF page blows
  # past pypdf's 75 MB default. Per-call only — global pypdf cap stays
  # safe. 4 GiB covers every legitimate textbook we've seen.
  pdf_decompress_limit_mb: 4096
  # Content-addressed OCR cache. Keyed on sha256(file_bytes) + (engine,
  # languages, pdf_dpi). Default-off; flip on for OCR-heavy corpora.
  cache:
    enabled: false
    dir: "{store.persist_dir}/ocr_cache"
    shards: 8

# ---- SVG ---------------------------------------------------------------
svg:
  # When an SVG's stdlib-XML pass returns < text_threshold_chars chars
  # AND ocr.enabled, rasterise (render_dpi) and OCR the bitmap. Needs the
  # [svg] extra (svglib + reportlab) or a system cairosvg; without one
  # the fallback is a silent no-op.
  render_fallback: true
  render_dpi: 150
  text_threshold_chars: 16

# ---- Vision (object detection) -----------------------------------------
# Appends an "[objects: …]" line to image extraction via an ONNX-Runtime
# detector. Independent of ocr.enabled (run either / both / neither).
# Operator-supplied model — no runtime fetch. Needs the [vision] extra
# (onnxruntime, no torch). Extra or model absent → silent no-op.
vision:
  enabled: false
  model_path: ""                       # "" → {persist_dir}/vision/model.onnx
  class_names_path: ""                 # "" → bundled COCO-80
  score_threshold: 0.35
  max_labels: 12

# ---- LLM ---------------------------------------------------------------
llm:
  provider: "none"                     # "none" | "ollama" | "local"
  fallback: "local"                    # retried when primary unreachable
  # Ollama (external server)
  model: "llama3"
  host: "http://localhost:11434"
  # Intra-Ollama model chain — when `model` times out/errors the
  # dispatcher walks each entry before falling over to `fallback`.
  # Each miss emits one ollama_model_failed WARNING.
  model_fallbacks: []
  # Per-Ollama-call timeout (seconds). Default 300 covers cold-start
  # weight loads on CPU-only hosts. Drop to 30–60 on fast-GPU boxes.
  # Garbage values silently fall back to default.
  timeout: 300
  # Fast /api/tags availability probe timeout. Generation still uses
  # `timeout`; raise this for slow remote Ollama hosts.
  probe_timeout: 3.0
  # Embedded llama.cpp — deterministic (temperature=0, top_k=1, fixed
  # seed). `pip install -e ".[local-llm]"`, then point `model_path` at
  # a GGUF on disk. Curated GGUFs land under `{persist_dir}/models/` via
  # `codexa fetch model --name <entry>` (override dir with $CODEXA_MODEL_DIR
  # or --out). `codexa fetch model --recommend` probes the host and
  # ✓/✗-marks every catalog entry; `--largest` downloads the biggest
  # entry that fits (size × 1.2 ≤ RAM).
  #
  # Catalog entries (8-bit / Q4_K_M GGUFs):
  #   name                      size    license               notes
  #   ------------------------  ------  --------------------  -----------------------------------------------
  #   qwen2.5-0.5b   (default)  0.4 GB  Apache-2.0            smallest / fastest first run
  #   llama-3.2-1b              0.8 GB  Llama 3.2 Community   131 K context, runs anywhere
  #   qwen2.5-1.5b              1.0 GB  Apache-2.0            balanced 32 K context
  #   qwen3-30b-a3b              19 GB  Apache-2.0            MoE (3 B active) — fastest big-class on CPU
  #   qwen3-32b                  20 GB  Apache-2.0            dense, hybrid thinking/non-thinking modes
  #   llama-3.3-70b              42 GB  Llama 3.3 Community   highest-quality dense for ≥64 GB hosts
  #   deepseek-r1-distill-70b    42 GB  MIT                   reasoning-tuned (chain-of-thought baked in)
  #   qwen2.5-72b                47 GB  Apache-2.0            multi-part download (auto-concatenated)
  #   mixtral-8x22b              85 GB  Apache-2.0            multi-part; MoE (~39 B active); RAM-hungry
  local:
    model_path: "{store.persist_dir}/models/qwen2.5-0.5b-instruct-q4_k_m.gguf"
    n_ctx: 4096                        # raise per-model when catalog suggests
    seed: 1337                         # fixed seed → reproducible answers
    max_tokens: 512
  max_summary_chars: 500               # truncate LLM-generated corpus summary

# ---- Search / retrieval ------------------------------------------------
search:
  history_turns: 6                     # 3 user + 3 assistant; 0 disables
  history_compress: false              # LLM-summarise dropped turns; opt-in
  highlight_entities: false            # bold recognised entities in passages
  passage_thumbnail_dpi: 100           # PDF page render DPI in expanders

  # Cross-encoder reranker over the retrieved top-K. Disabled by default
  # — costs ~50-200 ms per query on CPU (sub-50 ms on GPU) but catches
  # nuance the bi-encoder misses. Auto-downloads on first use.
  reranker:
    enabled: false
    model_id: "BAAI/bge-reranker-base"
    top_k_keep: 5
    device: "auto"                     # "auto" → cuda when available

  # Drop paraphrase near-duplicates after the Jaccard dedup. Bounded to
  # the over-fetched ≤30 rows. Costs one extra embed pass; set false on
  # latency-tight hosts.
  semantic_dedup:
    enabled: true
    threshold: 0.93                    # cosine ≥ this ⇒ near-duplicate

  # Pseudo-relevance feedback (Rocchio). Results merged, never replaced,
  # so PRF only ever adds recall. Costs one extra embed + store query.
  prf:
    enabled: true
    top_m: 5                           # survivors fed into the centroid
    alpha: 1.0                         # original-query weight
    beta: 0.75                         # feedback-centroid weight

  # Compound-question splitter — RRF-fuses 2–3 sub-query results.
  # User query is wrapped in <<<USER_INPUT>>> fence so a hostile
  # multi-line query can't break out of its role slot.
  query_decomp:
    enabled: false

  # HyDE — embed a hypothetical answer paragraph as a second retrieval
  # pass. Same fencing as query_decomp.
  hyde:
    enabled: false
    n: 1                               # clamped to [1, 5]

  # Conversational rewrite — when history exists and the query looks like
  # a follow-up, build a standalone query with the same prompt fences and
  # locale directive as HyDE / query_decomp.
  query_rewrite:
    enabled: false
    ab_fuse: false                     # RRF-fuse original + rewrite

  # Optional fan-out variants. All ride the same batched sub-query embed
  # + concurrent store.query path before RRF merge.
  multilingual:
    enabled: false
  query_expansion:
    enabled: false
    synonyms_path: ""                  # YAML/JSON alias map

  # FLARE — watch the streaming answer for low-confidence sentences and
  # fire additional retrieval rounds. Each trigger logs flare_retrigger
  # + flare_passages_added DEBUG events.
  flare:
    enabled: false
    min_logprob: -2.5                  # per-token logprob threshold
    max_rounds: 3                      # hard cap on re-retrieval rounds
    top_k_extra: 5                     # passages per re-retrieval

  # Offline "🧠 Passage insights" expander (LexRank TL;DR + KeyBERT
  # phrases) computed with the search embedder already in memory.
  insights:
    enabled: false
    max_sentences: 3
    top_keyphrases: 8

  # Offline encyclopedic enrichment. Tier walk: ZIM → seed → online.
  # codexa/data/wikipedia_seed*.json ships 25 English topics plus
  # localized pt/es/fr overlays.
  wikipedia:
    enabled: true
    data_dirs: []                      # extra operator-supplied seed JSONs
    zim_path: ""                       # single .zim or list of paths
    zim_dir: ""                        # glob *.zim under a folder
    online: false                      # opt-in fallback; breaks offline-first
    lang: "auto"                       # "auto" derives from cfg.locale

  # Optional hierarchical retrieval: indexer.hierarchical emits L0/L1/L2
  # rows; this stage swaps L2 child hits for their L1 parent when present.
  expand_to_parent:
    enabled: false

# ---- Metadata / sessions -----------------------------------------------
metadata:
  # Both manifest_file and sessions_file pick a backend from the suffix:
  # `.json` → JSON (full-file rewrite per save), `.db`/`.sqlite` → SQLite
  # (per-row UPSERT in one tx). On hosts without POSIX fcntl locks,
  # `.json` sessions route to a sibling `.db` automatically for
  # cross-process safety. Switch to SQLite explicitly at ~50k files /
  # 1k saved threads.
  manifest_file: "{store.persist_dir}/manifest.json"
  sessions_file: "{store.persist_dir}/sessions.json"
  index_info_file: "{store.persist_dir}/index_info.json"
  enable_entity_extraction: true
  tag_count: 50

# ---- Indexer -----------------------------------------------------------
indexer:
  # Upper bounds — effective values are cores-aware capped (chunker:
  # cores//4, floor 2, ceiling 8; OCR: 2). Over-cap values are downscaled
  # automatically and emit a workers_downscaled WARNING.
  workers: 4
  ocr_workers: 2
  skip_dirs: []                        # extra dir names to skip
  # Optional ramp-up cap. 0 = process everything. Mirrors --max-files.
  max_files: 0
  # Manifest checkpoint cadence during the embed loop. Saves happen
  # every N files OR every M seconds, whichever comes first.
  checkpoint_every_files: 10
  checkpoint_every_seconds: 5.0
  # Optional L0/L1/L2 chunk tree. Default off keeps the flat chunk path.
  hierarchical:
    enabled: false

# ---- Chunking ----------------------------------------------------------
chunking:
  # Per-chunk token cap. Overlap is computed dynamically per file
  # (10% of token count, min 20). The optional hierarchical producer
  # rebuilds flat chunks into L0/L1/L2 rows when indexer.hierarchical
  # is enabled.
  chunk_size: 512

# ---- Format gates ------------------------------------------------------
formats:
  enabled: []                          # allow-list; empty = all
  disabled: []                         # deny-list (always wins)
  # When no registered extractor matches a file's extension, read its
  # bytes as UTF-8 with errors='ignore'. True is backwards-compatible
  # but unsafe on mixed-binary corpora — flip false to make unknown
  # extensions return "" instead.
  text_fallback: true
  # Cap (MiB) on the text-fallback read; a multi-GB binary under
  # data_dirs is decoded to its prefix only and logged
  # text_fallback_truncated.
  text_fallback_max_mb: 64

# ---- Plugins -----------------------------------------------------------
plugins:
  disabled: []                         # block individual plugin names
  # Cap on the timeout-dispatch thread pool. Only matters when plugins
  # opt into a `timeout_s`.
  executor_workers: 4
  # slow_threshold_ms: 1000            # WARN when a plugin hook exceeds
  # Per-plugin slices land as siblings of `disabled`:
  # plugins:
  #   word_counter:
  #     min_words: 100
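Several comments in the configuration above refer to RRF (reciprocal rank fusion) for merging sub-query and dense / sparse result lists. A minimal sketch of the standard formula follows. The constant k = 60 is the conventional default from the literature and is an assumption here; this README does not state Codexa's value, and rrf_fuse is a hypothetical name.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document ids with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per document, with rank starting at 1.
    Ties are broken by document id so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]   # e.g. embedding-similarity order
sparse_hits = ["b", "d", "a"]  # e.g. BM25 order
print(rrf_fuse([dense_hits, sparse_hits]))  # ['b', 'a', 'd', 'c']
```

Because RRF only looks at ranks, it needs no score normalisation between the dense and sparse retrievers, which is one common reason to prefer it as a default over weighted fusion.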

Architecture

Two pipelines (indexing + search) over one persistent Chroma vector store, strategy-registry dispatch keyed off a single cfg dict, plugin hooks at pre_extract / chunk_filter / search_filter / post_index / ui_panel. ASCII diagram, layer map, cross-cutting conventions, and the plugin-system walkthrough (entry-point group, hook contracts, lifecycle) live in docs/architecture.md.
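To make the strategy-registry idea concrete, here is a toy extension-keyed registry with the text fallback that formats.text_fallback describes. It is a sketch under stated assumptions, not Codexa's code: register_extractor_sketch, its decorator form, the dotted-extension keys, and the extract function are all hypothetical, and the real register_extractor(extensions) signature should be taken from docs/architecture.md.

```python
from pathlib import Path

# Hypothetical registry: lower-cased file suffix -> extractor callable.
_EXTRACTORS = {}


def register_extractor_sketch(extensions):
    """Decorator that registers one extractor for several file suffixes."""
    def decorator(func):
        for ext in extensions:
            _EXTRACTORS[ext.lower()] = func
        return func
    return decorator


def extract(path, text_fallback=True):
    """Dispatch on the file suffix; fall back to lossy UTF-8 or an empty string."""
    path = Path(path)
    extractor = _EXTRACTORS.get(path.suffix.lower())
    if extractor is not None:
        return extractor(path)
    if text_fallback:
        # Unknown extension: decode the bytes as UTF-8, dropping invalid sequences.
        return path.read_bytes().decode("utf-8", errors="ignore")
    return ""


@register_extractor_sketch([".txt", ".md"])
def plain_text(path):
    return Path(path).read_text(encoding="utf-8")
```

Keeping dispatch in a single dictionary lookup means a plugin can add a format without touching the indexer's control flow.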

UI

codexa ui boots a Streamlit dashboard. Layout: sidebar (saved-chat threads + filter) → main (search box + RAG answer + retrieved passages with source-aware thumbnails) → right-hand tab strip (🧬 Semantic Structure / 📂 Index / 🩺 Diagnostics).

Full walkthrough — passage rendering, image augment per source type, saved-chat schema + sidebar behaviour, Semantic Structure tab caching, Index tab controls (Run Indexer / Force reindex / stale-lock cleanup), and i18n catalog wiring (en / es / fr / pt_BR) — lives in docs/ui.md.

LLM providers

cfg.llm.provider takes three values: "none" (passages only), "ollama" (external server), "local" (embedded llama.cpp, deterministic). cfg.llm.fallback names the provider that is retried when the primary is unreachable. The curated 9-entry GGUF catalog, the codexa fetch model --recommend / --largest host-probe helpers, and multi-part download support are documented in docs/llm.md.
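The fallback order implied by the configuration (primary model, then each llm.model_fallbacks entry, then the llm.fallback provider) can be pictured as an ordered list of attempts. The sketch below assumes each provider is a plain callable that raises on failure; ProviderUnavailable and generate_with_fallback are hypothetical names, not Codexa's dispatcher.

```python
class ProviderUnavailable(Exception):
    """Raised by a (hypothetical) provider callable that cannot answer."""


def generate_with_fallback(prompt, attempts):
    """Try each (label, callable) pair in order and return the first success.

    Returns (label, answer). Raises ProviderUnavailable listing every failure
    when no attempt succeeds.
    """
    errors = []
    for label, call in attempts:
        try:
            return label, call(prompt)
        except ProviderUnavailable as exc:
            # Remember why this attempt failed, then move down the chain.
            errors.append(f"{label}: {exc}")
    raise ProviderUnavailable("; ".join(errors))


def ollama_down(prompt):
    raise ProviderUnavailable("connection refused")


def local_stub(prompt):
    return f"stub answer for: {prompt}"


label, answer = generate_with_fallback(
    "what is RRF?",
    [("ollama:llama3", ollama_down), ("local", local_stub)],
)
print(label)  # local
```

Returning the label alongside the answer lets the caller log or display which provider actually produced the text.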

Wikipedia / ZIM

Two distinct surfaces: search-time RAG enrichment (cfg.search.wikipedia.* walks ZIM → seed → online; bundled 25-topic seed always available) and ZIM-as-corpus ingestion (drop a .zim under data_dirs for full chunk+embed). codexa-fetch-zim / codexa-check-deps --fetch-zims walkthroughs and the precedence table live in docs/wikipedia.md.

Semantic toolkit

codexa.semantic.* — pure-Python offline primitives: entity normalization, RAKE / KeyBERT keyphrases, co-occurrence graphs, TF-IDF cluster labels, embedding near-dup filter, Rocchio PRF, LexRank summary. Statistical and embedding-aware variants, plus where each surfaces in the UI (🧠 Passage insights, Semantic Structure graph, query-expansion captions, inline entity highlight) live in docs/semantics.md.
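For readers unfamiliar with Rocchio PRF, the update blends the original query vector with the centroid of the top-ranked passages. The sketch below uses the alpha = 1.0 and beta = 0.75 defaults from search.prf in the configuration; rocchio_expand is a hypothetical name, and whether Codexa re-normalises the result afterwards is not stated in this README.

```python
def rocchio_expand(query_vec, feedback_vecs, alpha=1.0, beta=0.75):
    """Return alpha * query + beta * centroid(feedback) as a new list.

    query_vec and each feedback vector are equal-length sequences of floats.
    With no feedback vectors the query is returned unchanged.
    """
    if not feedback_vecs:
        return list(query_vec)
    count = len(feedback_vecs)
    # Component-wise mean of the feedback vectors.
    centroid = [
        sum(vec[i] for vec in feedback_vecs) / count
        for i in range(len(query_vec))
    ]
    return [alpha * q + beta * c for q, c in zip(query_vec, centroid)]


expanded = rocchio_expand([1.0, 0.0], [[0.0, 1.0], [0.0, 3.0]])
print(expanded)  # [1.0, 1.5]
```

The expanded vector drives a second store query whose hits are merged with the first pass, which matches the "results merged, never replaced" note in the configuration comments.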

Performance

GPU device-probe order + per-vendor install hints, throughput knobs (auto-batch, cache stats, cross-file dedup, mtime-first change detection, SQLite manifest, cores-aware worker caps, CPU-only chunkers, CUDA OOM backoff, effective_budget pre-flight log), and the ramp-up benchmark (scripts/bench_indexer.py) live in docs/performance.md.
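The auto-batch rule from the embeddings.auto_batch comment (halve at ≥60% RAM, double back toward the ceiling at ≤40% RAM, floor 1) reduces to a small pure function. This is a sketch of that stated rule only; next_batch_size is a hypothetical name, and how Codexa samples memory pressure is not covered here.

```python
def next_batch_size(current, ceiling, used_fraction, high=0.60, low=0.40):
    """Pick the next encoder batch size from a memory-usage fraction in [0, 1].

    high / low default to the RAM thresholds; pass 0.80 / 0.55 for the VRAM
    thresholds quoted for CUDA.
    """
    if used_fraction >= high:
        # Under pressure: halve, but never drop below one item per batch.
        return max(1, current // 2)
    if used_fraction <= low:
        # Pressure released: double, but never exceed the configured ceiling.
        return min(ceiling, current * 2)
    # Between the two thresholds: hold steady to avoid oscillation.
    return current


size = 32
for usage in (0.72, 0.65, 0.50, 0.35, 0.30):
    size = next_batch_size(size, ceiling=32, used_fraction=usage)
print(size)  # 32
```

The gap between the two thresholds acts as hysteresis: a host hovering around 50% usage keeps its current batch size instead of flapping between halving and doubling.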

Testing

Local pytest invocations (pytest, --cov, -m bench), offline-by-default conftest discipline, the bench marker, GitHub Actions workflow (lint + test, cov ≥ 90%), and SonarCloud wiring live in docs/testing.md.

Troubleshooting & Logging

Chroma sqlite lock errors (ChromaInitError / database is locked / ChromaSegmentCorruptError) with the recovery steps, plus the structured JSON-lines log surface (file rotation, stable event keys like run_start / effective_budget / workers_downscaled / embed_oom_backoff, CODEXA_LOG_DIR override) live in docs/troubleshooting.md.
