A fast, vectorless reasoning-RAG engine in Go — a rewrite of PageIndex.
pindex builds a hierarchical tree index from a PDF and answers questions by having an LLM reason over that structure — no embeddings, no fixed chunking. Retrieval walks the tree (section summaries → relevant branch → tight page range) and answers cite the specific pages they used.
Status: working v1.
`index`, `ask`, and `eval` all run live against OpenAI and Anthropic. Validated on FinanceBench, an accumulating open-source benchmark (see results). What's deferred is in roadmap.md; the design is in docs/PLAN.md.
📖 Full documentation: jjfantini.github.io/pindex —
installation, getting started, guides, architecture, and the generated CLI reference. A worked
example (input PDF, generated tree index, real Q&A transcript) lives in examples/.
Indexing is LLM-bound, so Go doesn't make the model faster — it buys the engineering envelope the Python original lacks: bounded concurrency + rate limiting, resumable batch indexing (never re-index a finished doc), a prompt-hash response cache (re-runs and crash recovery are nearly free; it doesn't burn the retry budget), graceful degradation when a provider stalls (a dead account circuit-breaks instead of draining retries), and a single binary an LLM can drive.
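To make the prompt-hash response cache concrete, here is a minimal in-memory sketch. It is not pindex's actual code: the names `promptKey`, `Cache`, and `GetOrCall` are hypothetical, and keying on a SHA-256 of model name plus prompt text is an assumption about what "prompt-hash" means.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// promptKey derives a stable cache key from the model name and prompt text.
// Hypothetical helper: a real key may cover more fields (schema, temperature).
func promptKey(model, prompt string) string {
	sum := sha256.Sum256([]byte(model + "\x00" + prompt))
	return hex.EncodeToString(sum[:])
}

// Cache memoizes LLM responses by prompt hash (in memory, for illustration).
type Cache struct {
	mu   sync.Mutex
	data map[string]string
}

func NewCache() *Cache { return &Cache{data: make(map[string]string)} }

// GetOrCall returns the cached response when present (hit == true).
// Otherwise it invokes call and stores a successful result.
// Errors are never cached, so a failed call can be retried later.
func (c *Cache) GetOrCall(model, prompt string, call func() (string, error)) (resp string, hit bool, err error) {
	key := promptKey(model, prompt)
	c.mu.Lock()
	v, ok := c.data[key]
	c.mu.Unlock()
	if ok {
		return v, true, nil
	}
	v, err = call()
	if err != nil {
		return "", false, err
	}
	c.mu.Lock()
	c.data[key] = v
	c.mu.Unlock()
	return v, false, nil
}

func main() {
	c := NewCache()
	calls := 0
	llm := func() (string, error) { calls++; return "summary", nil }
	c.GetOrCall("gpt-4o", "Summarize page 7", llm)
	_, hit, _ := c.GetOrCall("gpt-4o", "Summarize page 7", llm)
	fmt.Println(hit, calls) // true 1
}
```

The second identical request is served from the cache without calling the model, which is why a re-run after a crash costs almost nothing for work that already completed.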
Homebrew (prebuilt, no toolchain needed):
```sh
brew install jjfantini/humbl/pindex       # default: cgo/MuPDF, full fidelity (AGPL-3.0)
brew install jjfantini/humbl/pindex-lite  # pure-Go, portable, no MuPDF (Apache-2.0)
```

From source:

```sh
go build -o pindex ./cmd/pindex                # default: go-fitz/MuPDF (needs a C toolchain)
CGO_ENABLED=0 go build -o pindex ./cmd/pindex  # fully-static pure-Go build (then set extractor: purego)
```

The default MuPDF extractor (gen2brain/go-fitz) links a bundled static libmupdf via cgo: no
system MuPDF is needed, but you do need a C compiler. The CGO_ENABLED=0 build excludes go-fitz entirely
(this is the pindex-lite artifact), so set extractor: purego (see Extractors). Licensing
differs by build; see License and LICENSING.md.
index/ask/eval call a live LLM. Provide keys via the environment or a gitignored .env
(copy .env.example). The .env overrides the process environment, so it's
the reliable channel:
```sh
cp .env.example .env  # then fill in OPENAI_API_KEY and/or ANTHROPIC_API_KEY
```

The model name picks the provider: claude* → Anthropic, otherwise OpenAI.
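The routing rule is simple enough to sketch in a few lines. The helper name `providerFor` and the case-sensitive prefix match are assumptions for illustration, not pindex's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// providerFor applies the documented rule: model names starting with
// "claude" go to Anthropic, everything else goes to OpenAI.
// Hypothetical helper; case-sensitive matching is an assumption.
func providerFor(model string) string {
	if strings.HasPrefix(model, "claude") {
		return "anthropic"
	}
	return "openai"
}

func main() {
	for _, m := range []string{"claude-haiku-4-5-20251001", "gpt-4o", "gpt-4o-mini"} {
		fmt.Printf("%s -> %s\n", m, providerFor(m))
	}
}
```

Under this rule any unrecognized model name falls through to OpenAI, so a typo in a Claude model name would be sent to the wrong provider.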
```sh
# Index a PDF into a workspace (prints the tree; persists to .pindex/workspace)
pindex index report.pdf --model gpt-4o

# Ask a question: pindex tree-searches, fetches tight pages, answers with citations
pindex ask "What was full-year revenue?" --model gpt-4o
# → "Full-year revenue was $10.2B."  cited pages: [7]

# Batch-index a whole directory, in parallel, resumable (skips already-indexed docs)
pindex index ./filings/ --model gpt-4o-mini --concurrency 8

# Debug: dump per-page extracted text
pindex extract report.pdf --backend mupdf
```

| Command | What it does |
|---|---|
| `pindex index <pdf\|dir>` | Build a tree index for a file (prints tree) or a directory (batch, resumable). |
| `pindex ask <question>` | Reason over an indexed doc's tree, fetch pages, answer with citations. |
| `pindex eval` | Run the FinanceBench harness over a pre-indexed workspace. |
| `pindex extract <pdf>` | Dump per-page extracted text (extractor debugging). |
Common flags: --model, --workspace, --cache-dir, --env-file, --config.
The CLI separates payload from presentation so both humans and agents can drive it:
- stdout carries only machine-readable payload (JSON trees, answer text, extracted pages) — safe to pipe or parse.
- stderr carries the presentation layer: on a TTY it animates (spinner, colors, summary boxes); when piped or sandboxed it automatically degrades to plain line-oriented progress with zero control codes. Force it with `--plain` or `PINDEX_PLAIN=1`; `NO_COLOR` is honored.
- `--verbose` streams under-the-hood diagnostics to stderr (per-call LLM timings, prompt-cache hits, retry backoff, and circuit-breaker state), so a failing run is diagnosable from its own output.
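The payload/presentation split can be sketched with the standard library alone. The helper names `isTerminal`, `plainMode`, and `useColor` are hypothetical, and the character-device check is one common TTY heuristic rather than necessarily what pindex uses (it also reports true for devices such as /dev/null).

```go
package main

import (
	"fmt"
	"os"
)

// isTerminal reports whether f is a character device, a common
// standard-library approximation of "is this a TTY".
func isTerminal(f *os.File) bool {
	info, err := f.Stat()
	if err != nil {
		return false
	}
	return info.Mode()&os.ModeCharDevice != 0
}

// plainMode decides whether progress output should be plain lines.
// Plain is forced by the flag or PINDEX_PLAIN=1, and is the automatic
// fallback whenever stderr is not a terminal.
func plainMode(flagPlain bool, env func(string) string, stderrIsTTY bool) bool {
	return flagPlain || env("PINDEX_PLAIN") == "1" || !stderrIsTTY
}

// useColor honors NO_COLOR: any non-empty value disables color.
func useColor(plain bool, env func(string) string) bool {
	return !plain && env("NO_COLOR") == ""
}

func main() {
	plain := plainMode(false, os.Getenv, isTerminal(os.Stderr))
	// Presentation goes to stderr only.
	fmt.Fprintln(os.Stderr, "plain:", plain, "color:", useColor(plain, os.Getenv))
	// Payload goes to stdout only (placeholder text, not real pindex output).
	fmt.Println("payload")
}
```

Because the decision functions take the environment lookup and TTY state as parameters, they can be tested without a real terminal.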
A YAML config (via --config) overrides the built-in defaults, which mirror PageIndex:
```yaml
model: gpt-4o-2024-11-20        # or claude-... ; routes provider by name
extractor: mupdf                # mupdf | poppler | purego
toc_check_page_num: 20
max_page_num_each_node: 10
max_token_num_each_node: 20000
if_add_node_id: true
if_add_node_summary: true
if_add_doc_description: false
if_add_node_text: false
```

Pluggable via extractor: (or --backend):
| Backend | Engine | Notes |
|---|---|---|
| `mupdf` (default) | go-fitz / MuPDF | Highest fidelity. Needs a cgo build (bundled static libmupdf). |
| `poppler` | `pdftotext -layout` | Strong on tables; needs poppler-utils on PATH. |
| `purego` | ledongthuc/pdf | 100% Go, lightest binary, the CGO_ENABLED=0 static path; lower table fidelity. |
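A pluggable backend of this kind is typically a small interface plus a name-keyed registry. The sketch below assumes a hypothetical `Extractor` interface and uses stub backends so it runs without MuPDF or poppler; pindex's real types and signatures may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// Extractor returns the text of each page of a PDF, in page order.
// Hypothetical interface for illustration only.
type Extractor interface {
	Pages(path string) ([]string, error)
}

// stubExtractor stands in for a real backend so the sketch is runnable.
type stubExtractor struct{ name string }

func (s stubExtractor) Pages(path string) ([]string, error) {
	return []string{fmt.Sprintf("[%s] page 1 of %s", s.name, path)}, nil
}

// backends maps the documented backend names to constructors.
var backends = map[string]func() Extractor{
	"mupdf":   func() Extractor { return stubExtractor{"mupdf"} },
	"poppler": func() Extractor { return stubExtractor{"poppler"} },
	"purego":  func() Extractor { return stubExtractor{"purego"} },
}

// ErrUnknownBackend is returned for names outside the table above.
var ErrUnknownBackend = errors.New("unknown extractor backend")

// newExtractor resolves a backend name; an empty name selects the default.
func newExtractor(name string) (Extractor, error) {
	if name == "" {
		name = "mupdf"
	}
	mk, ok := backends[name]
	if !ok {
		return nil, fmt.Errorf("%w: %q", ErrUnknownBackend, name)
	}
	return mk(), nil
}

func main() {
	ex, err := newExtractor("purego")
	if err != nil {
		panic(err)
	}
	pages, _ := ex.Pages("report.pdf")
	fmt.Println(pages[0]) // [purego] page 1 of report.pdf
}
```

In a CGO_ENABLED=0 build the `mupdf` entry would simply be absent from the registry, which is one way the pure-Go artifact could exclude go-fitz entirely.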
```text
cmd/pindex            CLI (index | ask | eval | extract)
internal/
  config              typed config (mirrors PageIndex defaults)
  extract             pluggable Extractor (mupdf/poppler/purego)
  tree                pure tree ops (list→tree, page ranges, renderer)
  prompts             the ~15 ported prompts + typed schemas
  llm                 Provider seam: OpenAI/Anthropic HTTP, resilience, prompt-hash cache, structured output
  index               the indexing engine (TOC-less structure generation → tree → enrich)
  store               SQLite catalog + per-doc JSON blobs
  retrieve            get_document / get_structure / get_page_content
  ask                 select-pages-then-answer retrieval loop
  pipeline            batch indexing (bounded concurrency + resume)
eval/financebench     FinanceBench harness (LLM-judge accuracy + evidence recall)
```
pindex benchmarks incrementally against the full FinanceBench set (150 questions, 84 documents).
One document at a time via ./eval/financebench/bench.sh <DOC_NAME> — fetch, index once, eval at
all four effort levels, fold into eval/financebench/results/. See
eval/financebench/results/README.md for layout, adjudication
workflow, and the live scoreboard (go run ./eval/financebench/aggregate regenerates it).
Accumulating scoreboard — claude-haiku-4-5-20251001 (index + ask), gpt-4o-2024-11-20 judge.
As of 2026-06-11: 7/84 docs, 18/150 questions.
| Effort | Raw accuracy | Adjusted accuracy | Evidence recall | Hallucination |
|---|---|---|---|---|
| low | 83.33% (15/18) | 100.0% | 88.89% | 16.67% |
| medium | 83.33% (15/18) | 100.0% | 88.89% | 16.67% |
| high | 94.44% (17/18) | 100.0% | 94.44% | 5.56% |
| ultra | 94.44% (17/18) | 100.0% | 94.44% | 5.56% |
- Raw = LLM-judge only. Adjusted = human-adjudicated relabels (`MVA`/`BE`/`SEDC` count correct; only `NAL` counts wrong), the process behind Mafin 2.5's published 98.7%.
- `high`/`ultra` beat `low`/`medium` on raw accuracy: the agentic loop (`high`) and verification pass (`ultra`) recover questions where fixed page selection misses.
- `medium` has matched `low` on every doc so far (refusal retry never fired).
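The relationship between the raw and adjusted columns follows directly from the relabeling rule. The sketch below uses a hypothetical `accuracies` function and treats the labels as opaque strings. It assumes a question the judge marked wrong with no adjudication label stays wrong, and the specific labels in `main` are made up to reproduce the shape of the `low` row.

```go
package main

import "fmt"

// accuracies computes raw and adjusted accuracy as percentages.
// judgeCorrect[i] is the LLM judge's verdict for question i; label[i] is
// the human adjudication label for a question the judge marked wrong
// ("" if none). MVA, BE, and SEDC count as correct; NAL counts as wrong.
// The two slices must have the same non-zero length.
func accuracies(judgeCorrect []bool, label []string) (raw, adjusted float64) {
	var rawOK, adjOK int
	for i, ok := range judgeCorrect {
		switch {
		case ok:
			rawOK++
			adjOK++
		case label[i] == "MVA" || label[i] == "BE" || label[i] == "SEDC":
			adjOK++
		}
	}
	n := float64(len(judgeCorrect))
	return 100 * float64(rawOK) / n, 100 * float64(adjOK) / n
}

func main() {
	// 18 questions, 15 judged correct, 3 relabeled by a human.
	judge := make([]bool, 18)
	labels := make([]string, 18)
	for i := 0; i < 15; i++ {
		judge[i] = true
	}
	labels[15], labels[16], labels[17] = "MVA", "BE", "SEDC" // illustrative labels
	raw, adj := accuracies(judge, labels)
	fmt.Printf("raw %.2f%% adjusted %.1f%%\n", raw, adj) // raw 83.33% adjusted 100.0%
}
```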
Ad-hoc harness over a pre-indexed workspace:

```sh
pindex index ./financebench/pdfs/SOME_DOC.pdf --model gpt-4o-mini --workspace ws
pindex eval --questions financebench_open_source.jsonl --workspace ws \
  --model gpt-4o --judge-model gpt-4o --limit 50
```

It reports LLM-judge answer accuracy (the permissive Mafin 2.5 rubric, for comparability) and evidence recall (does the cited page text contain the gold evidence; this check is alignment-free). A page-number recall is also printed but is alignment-sensitive (pindex's physical page index can differ from a filing's printed page label).
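One plausible reading of alignment-free evidence recall is a normalized containment check over the text of the cited pages. The sketch below is an assumption about the matching rule, not the harness's actual logic: `normalize` and `evidenceRecall` are hypothetical names, and the sample strings are invented.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize lowercases text and collapses whitespace runs, so matching is
// insensitive to line breaks and spacing in extracted PDF text.
func normalize(s string) string {
	return strings.Join(strings.Fields(strings.ToLower(s)), " ")
}

// evidenceRecall returns the fraction of gold evidence snippets found in
// the text of the cited pages. Page numbers are never compared, which is
// what makes the measure independent of page-label alignment.
func evidenceRecall(citedPageText, goldEvidence []string) float64 {
	if len(goldEvidence) == 0 {
		return 0
	}
	haystack := normalize(strings.Join(citedPageText, " "))
	found := 0
	for _, ev := range goldEvidence {
		if strings.Contains(haystack, normalize(ev)) {
			found++
		}
	}
	return float64(found) / float64(len(goldEvidence))
}

func main() {
	pages := []string{"Total net sales\nwere $10.2 billion for the year."}
	gold := []string{"total net sales were $10.2 billion", "gross margin of 39%"}
	fmt.Println(evidenceRecall(pages, gold)) // 0.5
}
```

Exact containment is strict: a gold snippet that the extractor renders with different punctuation or table layout would count as a miss, so a real harness may use a looser match.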
Smoke test on a single earnings release (ULTABEAUTY_2023Q4_EARNINGS, 9 pages, 4 questions):
| ask/judge model | answer accuracy | evidence recall |
|---|---|---|
| gpt-4o-mini | 0% | 0% |
| gpt-4o (same stored index) | 50% | 75% |
Swapping only the model (no re-index) recovered accuracy — the pipeline is sound and model-bound.
pindex is dual-licensed — the license depends on how the binary is built:
- First-party source code: Apache-2.0 (LICENSE).
- `pindex` (default build): AGPL-3.0-or-later. It links MuPDF via go-fitz, which is AGPL (LICENSE.AGPL).
- `pindex-lite` (pure-Go, `CGO_ENABLED=0`): Apache-2.0. It excludes go-fitz/MuPDF and links no AGPL code.
See LICENSING.md for the full explanation and dependency provenance.