pindex

A fast, vectorless reasoning-RAG engine in Go — a rewrite of PageIndex.

pindex builds a hierarchical tree index from a PDF and answers questions by having an LLM reason over that structure — no embeddings, no fixed chunking. Retrieval walks the tree (section summaries → relevant branch → tight page range) and answers cite the specific pages they used.
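The retrieval walk described above can be pictured with a small sketch. Everything here is an assumption made for illustration: the `Node` type, its field names, and the `pick` callback (which stands in for the LLM's choice among section summaries) are hypothetical and are not pindex's actual schema or API.

```go
package main

import "fmt"

// Node is a hypothetical tree-index node; the field names are assumptions,
// not pindex's real schema.
type Node struct {
	Title     string
	Summary   string
	StartPage int // inclusive
	EndPage   int // inclusive
	Children  []*Node
}

// Descend walks from the root toward a leaf. At each level, pick stands in
// for the LLM's choice among the children; it returns the chosen index, or
// an out-of-range value to stop at the current node.
func Descend(root *Node, pick func(children []*Node) int) *Node {
	cur := root
	for len(cur.Children) > 0 {
		i := pick(cur.Children)
		if i < 0 || i >= len(cur.Children) {
			break
		}
		cur = cur.Children[i]
	}
	return cur
}

// Pages expands a node's inclusive page range into a list of page numbers.
func Pages(n *Node) []int {
	var out []int
	for p := n.StartPage; p <= n.EndPage; p++ {
		out = append(out, p)
	}
	return out
}

func main() {
	root := &Node{Title: "Annual report", StartPage: 1, EndPage: 40, Children: []*Node{
		{Title: "Business overview", StartPage: 1, EndPage: 6},
		{Title: "Financial results", StartPage: 7, EndPage: 20, Children: []*Node{
			{Title: "Revenue", StartPage: 7, EndPage: 8},
			{Title: "Expenses", StartPage: 9, EndPage: 20},
		}},
	}}
	// A fixed picker replaces the model here: second branch, then first leaf.
	choices := []int{1, 0}
	step := 0
	pick := func(_ []*Node) int {
		i := choices[step]
		step++
		return i
	}
	leaf := Descend(root, pick)
	fmt.Println(leaf.Title, Pages(leaf)) // Revenue [7 8]
}
```

The point of the shape is that each step narrows the page range, so only a handful of pages ever need to be fetched for the final answer.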

Status: working v1. index, ask, and eval all run live against OpenAI and Anthropic. Validated on FinanceBench, an open-source benchmark, with results accumulating one document at a time (see results). Deferred work is listed in roadmap.md; the design is in docs/PLAN.md.

📖 Full documentation: jjfantini.github.io/pindex — installation, getting started, guides, architecture, and the generated CLI reference. A worked example (input PDF, generated tree index, real Q&A transcript) lives in examples/.

Why vectorless

Indexing is LLM-bound, so Go doesn't make the model faster. What it buys is the engineering envelope the Python original lacks: bounded concurrency plus rate limiting, resumable batch indexing (a finished doc is never re-indexed), a prompt-hash response cache (re-runs and crash recovery are nearly free, and cache hits don't consume the retry budget), graceful degradation when a provider stalls (a dead account circuit-breaks instead of draining retries), and a single binary an LLM can drive.
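As a rough illustration of the prompt-hash response cache mentioned above, here is a minimal in-memory sketch. The `Cache` type, the `Key` function, and the choice of SHA-256 over model plus prompt are all assumptions; the real cache is presumably persisted (there is a --cache-dir flag) and may key on more than these two fields.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// Cache is a hypothetical in-memory response cache keyed by a hash of the
// model name and prompt. It only illustrates the idea.
type Cache struct {
	mu      sync.Mutex
	entries map[string]string
}

func NewCache() *Cache { return &Cache{entries: make(map[string]string)} }

// Key hashes model and prompt together; the zero-byte separator keeps
// ("ab", "c") and ("a", "bc") from producing the same key.
func Key(model, prompt string) string {
	h := sha256.New()
	h.Write([]byte(model))
	h.Write([]byte{0})
	h.Write([]byte(prompt))
	return hex.EncodeToString(h.Sum(nil))
}

// Complete returns the cached response for (model, prompt) if present;
// otherwise it invokes call and stores a successful result. Concurrent
// misses on the same key may each invoke call; a fuller version might
// deduplicate them.
func (c *Cache) Complete(model, prompt string, call func() (string, error)) (string, error) {
	k := Key(model, prompt)
	c.mu.Lock()
	if v, ok := c.entries[k]; ok {
		c.mu.Unlock()
		return v, nil
	}
	c.mu.Unlock()
	v, err := call()
	if err != nil {
		return "", err // failures are not cached
	}
	c.mu.Lock()
	c.entries[k] = v
	c.mu.Unlock()
	return v, nil
}

func main() {
	c := NewCache()
	calls := 0
	llm := func() (string, error) { calls++; return "summary text", nil }
	c.Complete("gpt-4o", "Summarize section 1", llm)
	c.Complete("gpt-4o", "Summarize section 1", llm) // served from cache
	fmt.Println("provider calls:", calls)            // provider calls: 1
}
```

Because the key depends only on the request, a crashed or repeated run replays cached responses and only pays for the calls that never completed.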

Install

Homebrew (prebuilt, no toolchain needed):

brew install jjfantini/humbl/pindex        # default: cgo/MuPDF, full fidelity (AGPL-3.0)
brew install jjfantini/humbl/pindex-lite   # pure-Go, portable, no MuPDF (Apache-2.0)

From source:

go build -o pindex ./cmd/pindex                 # default: go-fitz/MuPDF (needs a C toolchain)
CGO_ENABLED=0 go build -o pindex ./cmd/pindex   # fully-static pure-Go build (then set extractor: purego)

The default MuPDF extractor (gen2brain/go-fitz) links a bundled static libmupdf via cgo — no system MuPDF needed, but you need a C compiler. The CGO_ENABLED=0 build excludes go-fitz entirely (the pindex-lite artifact); set extractor: purego (see Extractors). Licensing differs by build — see License and LICENSING.md.

API keys

index/ask/eval call a live LLM. Provide keys via the environment or a gitignored .env (copy .env.example). The .env overrides the process environment, so it's the reliable channel:

cp .env.example .env   # then fill in OPENAI_API_KEY and/or ANTHROPIC_API_KEY

The model name picks the provider: claude* → Anthropic, otherwise OpenAI.
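The routing rule is simple enough to sketch in a few lines. The `Provider` type and `ProviderFor` function are hypothetical names, and treating the match as a case-sensitive prefix check is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Provider is a hypothetical label for the backend an LLM call goes to.
type Provider string

const (
	OpenAI    Provider = "openai"
	Anthropic Provider = "anthropic"
)

// ProviderFor mirrors the rule stated above: model names starting with
// "claude" go to Anthropic, everything else to OpenAI. The case-sensitive
// prefix match is an assumption.
func ProviderFor(model string) Provider {
	if strings.HasPrefix(model, "claude") {
		return Anthropic
	}
	return OpenAI
}

func main() {
	for _, m := range []string{"gpt-4o", "claude-haiku-4-5-20251001"} {
		fmt.Println(m, "->", ProviderFor(m))
	}
}
```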

Quickstart

# Index a PDF into a workspace (prints the tree; persists to .pindex/workspace)
pindex index report.pdf --model gpt-4o

# Ask a question — pindex tree-searches, fetches tight pages, answers with citations
pindex ask "What was full-year revenue?" --model gpt-4o
#  → "Full-year revenue was $10.2B."   cited pages: [7]

# Batch-index a whole directory, in parallel, resumable (skips already-indexed docs)
pindex index ./filings/ --model gpt-4o-mini --concurrency 8

# Debug: dump per-page extracted text
pindex extract report.pdf --backend mupdf

Commands

| Command | What it does |
| --- | --- |
| `pindex index <pdf\|dir>` | Build a tree index for a file (prints tree) or a directory (batch, resumable). |
| `pindex ask <question>` | Reason over an indexed doc's tree, fetch pages, answer with citations. |
| `pindex eval` | Run the FinanceBench harness over a pre-indexed workspace. |
| `pindex extract <pdf>` | Dump per-page extracted text (extractor debugging). |

Common flags: --model, --workspace, --cache-dir, --env-file, --config.

Output modes

The CLI separates payload from presentation so both humans and agents can drive it:

stdout carries only machine-readable payload (JSON trees, answer text, extracted pages) — safe to pipe or parse.
stderr carries the presentation layer: on a TTY it animates (spinner, colors, summary boxes); when piped or sandboxed it automatically degrades to plain line-oriented progress with zero control codes. Force it with --plain or PINDEX_PLAIN=1; NO_COLOR is honored.
--verbose streams under-the-hood diagnostics to stderr — per-call LLM timings, prompt-cache hits, retry backoff, and circuit-breaker state — so a failing run is diagnosable from its own output.
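One way the stdout/stderr split and the automatic plain fallback could be wired up is sketched below using only the standard library. The function names are hypothetical, and the TTY check via `os.ModeCharDevice` is an approximation (it also reports true for devices such as /dev/null); a real implementation may use a dedicated terminal-detection package.

```go
package main

import (
	"fmt"
	"os"
)

// isTerminal reports whether f is a character device, a stdlib-only
// approximation of a TTY check.
func isTerminal(f *os.File) bool {
	info, err := f.Stat()
	if err != nil {
		return false
	}
	return info.Mode()&os.ModeCharDevice != 0
}

// usePlain decides whether progress output should be plain lines.
// The --plain flag or PINDEX_PLAIN=1 forces it; otherwise it is plain
// whenever stderr is not a terminal.
func usePlain(flagPlain bool, getenv func(string) string, tty bool) bool {
	if flagPlain || getenv("PINDEX_PLAIN") == "1" {
		return true
	}
	return !tty
}

// useColor honors NO_COLOR: any non-empty value disables color.
func useColor(plain bool, getenv func(string) string) bool {
	return !plain && getenv("NO_COLOR") == ""
}

func main() {
	plain := usePlain(false, os.Getenv, isTerminal(os.Stderr))
	// Presentation goes to stderr, payload to stdout.
	fmt.Fprintf(os.Stderr, "plain=%v color=%v\n", plain, useColor(plain, os.Getenv))
	fmt.Println(`{"answer":"payload goes to stdout"}`)
}
```

Passing `getenv` and `tty` in as parameters keeps the decision logic testable without touching the real environment.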

Configuration

A YAML config (via --config) overrides the built-in defaults, which mirror PageIndex:

model: gpt-4o-2024-11-20      # or claude-... ; routes provider by name
extractor: mupdf              # mupdf | poppler | purego
toc_check_page_num: 20
max_page_num_each_node: 10
max_token_num_each_node: 20000
if_add_node_id: true
if_add_node_summary: true
if_add_doc_description: false
if_add_node_text: false

Extractors

Pluggable via extractor: (or --backend):

| Backend | Engine | Notes |
| --- | --- | --- |
| `mupdf` (default) | go-fitz / MuPDF | Highest fidelity. Needs a cgo build (bundled static libmupdf). |
| `poppler` | `pdftotext -layout` | Strong on tables; needs `poppler-utils` on PATH. |
| `purego` | ledongthuc/pdf | 100% Go, lightest binary, the `CGO_ENABLED=0` static path; lower table fidelity. |

Architecture

cmd/pindex        CLI (index | ask | eval | extract)
internal/
  config          typed config (mirrors PageIndex defaults)
  extract         pluggable Extractor (mupdf/poppler/purego)
  tree            pure tree ops (list→tree, page ranges, renderer)
  prompts         the ~15 ported prompts + typed schemas
  llm             Provider seam: OpenAI/Anthropic HTTP, resilience, prompt-hash cache, structured output
  index           the indexing engine (TOC-less structure generation → tree → enrich)
  store           SQLite catalog + per-doc JSON blobs
  retrieve        get_document / get_structure / get_page_content
  ask             select-pages-then-answer retrieval loop
  pipeline        batch indexing (bounded concurrency + resume)
eval/financebench FinanceBench harness (LLM-judge accuracy + evidence recall)
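The "bounded concurrency + resume" behaviour attributed to the pipeline package can be sketched with a buffered-channel semaphore. This is an illustration under assumptions, not the package's code: `IndexAll`, the `done` set, and the `index` callback are hypothetical names, and the real pipeline presumably reads finished docs from the store rather than from a map.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// IndexAll runs index over docs with at most limit calls in flight and
// skips any doc already marked in done. It returns the docs indexed in
// this run (in completion order) and the errors for those that failed.
func IndexAll(docs []string, done map[string]bool, limit int, index func(doc string) error) (indexed []string, failed map[string]error) {
	if limit < 1 {
		limit = 1
	}
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		sem = make(chan struct{}, limit)
	)
	failed = make(map[string]error)
	for _, doc := range docs {
		if done[doc] {
			continue // resume: finished docs are not re-indexed
		}
		wg.Add(1)
		sem <- struct{}{} // blocks while limit workers are running
		go func(doc string) {
			defer wg.Done()
			defer func() { <-sem }()
			err := index(doc)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				failed[doc] = err
				return
			}
			indexed = append(indexed, doc)
		}(doc)
	}
	wg.Wait()
	return indexed, failed
}

func main() {
	done := map[string]bool{"a.pdf": true} // indexed by an earlier run
	indexed, failed := IndexAll(
		[]string{"a.pdf", "b.pdf", "c.pdf"}, done, 2,
		func(string) error { return nil },
	)
	sort.Strings(indexed)
	fmt.Println(indexed, len(failed)) // [b.pdf c.pdf] 0
}
```

A failed doc is reported but not marked done, so a later run picks it up again while leaving completed work alone.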

FinanceBench

pindex benchmarks incrementally against the full FinanceBench set (150 questions, 84 documents), one document at a time via ./eval/financebench/bench.sh <DOC_NAME>: fetch, index once, eval at all four effort levels, then fold the results into eval/financebench/results/. See eval/financebench/results/README.md for layout, adjudication workflow, and the live scoreboard (go run ./eval/financebench/aggregate regenerates it).

Accumulating scoreboard — claude-haiku-4-5-20251001 (index + ask), gpt-4o-2024-11-20 judge. As of 2026-06-11: 7/84 docs, 18/150 questions.

| Effort | Raw accuracy | Adjusted accuracy | Evidence recall | Hallucination |
| --- | --- | --- | --- | --- |
| low | 83.33% (15/18) | 100.0% | 88.89% | 16.67% |
| medium | 83.33% (15/18) | 100.0% | 88.89% | 16.67% |
| high | 94.44% (17/18) | 100.0% | 94.44% | 5.56% |
| ultra | 94.44% (17/18) | 100.0% | 94.44% | 5.56% |

Raw = LLM-judge only. Adjusted = human-adjudicated relabels (MVA / BE / SEDC count correct; only NAL counts wrong) — the process behind Mafin 2.5's published 98.7%.
high/ultra beat low/medium on raw accuracy: the agentic loop (high) and verification pass (ultra) recover questions where fixed page selection misses.
medium has matched low on every doc so far (refusal retry never fired).

Ad-hoc harness over a pre-indexed workspace:

pindex index ./financebench/pdfs/SOME_DOC.pdf --model gpt-4o-mini --workspace ws
pindex eval --questions financebench_open_source.jsonl --workspace ws \
  --model gpt-4o --judge-model gpt-4o --limit 50

It reports LLM-judge answer accuracy (the permissive Mafin 2.5 rubric, for comparability) and evidence recall (does the cited page text contain the gold evidence — alignment-free). A page-number recall is also printed but is alignment-sensitive (pindex's physical page index can differ from a filing's printed page label).
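To make "alignment-free" concrete, here is a sketch of an evidence-recall check that looks only at the text of the cited pages and never at page numbers. The matching rule (case-folded, whitespace-collapsed substring containment) and the names `EvidenceHit` and `Recall` are assumptions; the harness's actual matching logic is not described here and may be more tolerant.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize lowercases and collapses whitespace so layout differences
// between extracted page text and gold evidence don't cause false misses.
func normalize(s string) string {
	return strings.Join(strings.Fields(strings.ToLower(s)), " ")
}

// EvidenceHit reports whether any cited page's text contains the gold
// evidence string. Page numbers play no part in the check.
func EvidenceHit(citedPages []string, gold string) bool {
	g := normalize(gold)
	if g == "" {
		return false
	}
	for _, p := range citedPages {
		if strings.Contains(normalize(p), g) {
			return true
		}
	}
	return false
}

// Recall is the fraction of questions whose evidence was found.
func Recall(hits []bool) float64 {
	if len(hits) == 0 {
		return 0
	}
	n := 0
	for _, h := range hits {
		if h {
			n++
		}
	}
	return float64(n) / float64(len(hits))
}

func main() {
	pages := []string{"Net  sales were\n$10.2 billion in fiscal 2023."}
	hit := EvidenceHit(pages, "net sales were $10.2 billion")
	fmt.Println(hit, Recall([]bool{hit, false, true, true}))
}
```

Page-number recall, by contrast, would compare cited indices against gold page labels and so breaks whenever the physical index and the printed label differ.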

Smoke test on a single earnings release (ULTABEAUTY_2023Q4_EARNINGS, 9 pages, 4 questions):

| ask/judge model | answer accuracy | evidence recall |
| --- | --- | --- |
| gpt-4o-mini | 0% | 0% |
| gpt-4o (same stored index) | 50% | 75% |

Swapping only the model (no re-index) recovered accuracy, which suggests the stored index is sound and the remaining errors on this document are model-bound.

License

pindex is dual-licensed — the license depends on how the binary is built:

First-party source code: Apache-2.0 (LICENSE).
pindex (default build): AGPL-3.0-or-later — it links MuPDF via go-fitz, which is AGPL (LICENSE.AGPL).
pindex-lite (pure-Go, CGO_ENABLED=0): Apache-2.0 — excludes go-fitz/MuPDF, links no AGPL code.

See LICENSING.md for the full explanation and dependency provenance.
