index-extract

Deterministic extraction and scripting policies for Index documents

1 stable release

Uses new Rust 2024

new 1.0.0 May 14, 2026

Used in 4 crates

Unlicense

67KB
1K SLoC

Index is a terminal-native semantic browser with an adaptive transformer.

It does not try to clone a graphical browser. It treats the web as input and recomposes it into a calm, keyboard-first, scriptable terminal interface.

Product thesis

Modern pages are designed for pixels, animation, tracking, overlays, mouse input, and JavaScript-heavy state.

Index is designed for:

  • reading
  • navigation
  • forms
  • extraction
  • search
  • bookmarking
  • scripting
  • command-first workflows
  • terminal-native interaction

The core product is the transformer:

URL / HTML / Markdown / RSS / app snapshot
  -> parse
  -> classify
  -> extract semantic intent
  -> remove noise
  -> emit Index Document Model
  -> render in terminal

Current repository state

This workspace currently includes:

  • Rust workspace
  • core document model
  • static HTML parser backed by scraper
  • readability extractor with main-content detection
  • typestate transformer pipeline
  • orthogonal transform instruction set
  • terminal renderer backed by ratatui with semantic colors, reader profiles, symbols, syntax hiding, padded structural blocks, a high-cyan prompt, a mildly highlighted current line, structural sidebar modes, hidden response logs, and URL-history suggestions
  • bounded layout rhythm extraction from semantic block boundaries and simple CSS spacing hints
  • semantic page-region summaries that prioritize explicit main content and allow secondary navigation, related, comment, aside, and footer regions to be expanded from the sidebar
  • navigation state with history, bookmarks, session restore primitives, origin state, and cache keys
  • live HTTP fetching, redirect validation, and filesystem response cache primitives
  • static form extraction and semantic command submission actions
  • site adapter registry for supported task-oriented views
  • expanded knowledge adapter fixtures for code forges, docs, reference pages, forums, research abstracts, and archive items
  • robustness fixture matrix for malformed, sparse, code-heavy, table/list-heavy, navigation-heavy, accessibility-rich, and international page shapes
  • headless snapshot fallback abstraction with deterministic policies and accessibility-tree-first extraction when semantic roles are strong enough
  • origin-scoped authentication and cookie primitives with redaction
  • deterministic extraction and pipe confirmation policy
  • optional AI-assisted transformation boundary with offline fallback
  • hostile-input security policy with size limits and redirect validation
  • actionable failure diagnostics for failed or low-confidence transforms with likely causes and exact local commands
  • document quality scoring for adapter, strong generic, partial generic, fallback, and failed transform paths
  • local reader repair commands for cycling plausible main regions, hiding or showing noisy regions, and promoting a section into temporary focus
  • transformed document and renderer layout caches for repeated workflows
  • canonical runtime artifacts (index-artifact-v1) keyed by canonical URL and context with stale-while-revalidate behavior for repeated navigation
  • local knowledge shelf records with Markdown/JSON exports, citations, tags, and notes
  • offline shelf search across saved metadata and local Markdown exports
  • adapter contribution harness reports for fixture-backed adapter reviews
  • offline knowledge workflows for deterministic saves, citations, selected section export, batch extraction, and bookmark notes/tags
  • local capture artifacts for unsupported page shapes with credential redaction
  • TUI capture preview/save commands for local artifacts from the current page
  • runtime-compatible index.pack/v1 loading with deterministic precedence, local policy controls, and rollback snapshots
  • standalone index-compat-lab tooling for compatibility ingest, synthesis, linting, and override merge workflows
  • app icon, banners, and official terminal font guidance
  • installable package artifacts with man page and shell completions
  • production-readiness policies for compatibility, MSRV, adapters, diagnostics, benchmarks, and issue intake
  • CLI prototype with TUI, plain output, machine-readable extraction, and offline AI modes
  • unit and fixture tests
  • roadmap
  • changelog
  • RFCs
  • ADRs
  • AGENTS.md for AI-agent-driven development

The implementation remains transformer-first: pages are fetched or read, parsed into a semantic document model, and only then rendered in the terminal. Stateful browsing is represented through semantic session types while live URL loading and :open navigation are composed through fetcher and renderer action boundaries. Site adapters now run inside the transformer for recognized canonical URLs and emit task-oriented IndexDocument views before falling back to the generic reader.
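
The transformer-first rule and the typestate pipeline listed above can be illustrated with a small sketch. Every name here (Pipeline, Fetched, Parsed, IndexDocument as a plain struct, render) is hypothetical and is not taken from index-transformer; the "parser" is a toy that splits lines. The point is only that the stage is encoded in the type, so rendering unparsed input does not compile.

```rust
// Hypothetical names throughout: this is not the index-transformer API.

/// Raw input as fetched or read from disk.
pub struct Fetched {
    pub body: String,
}

/// Parsed input; only this state can be transformed.
pub struct Parsed {
    pub text_blocks: Vec<String>,
}

/// Stand-in for the Index Document Model.
#[derive(Debug, PartialEq)]
pub struct IndexDocument {
    pub blocks: Vec<String>,
}

/// The stage is a type parameter, so skipping a stage is a compile error.
pub struct Pipeline<S> {
    state: S,
}

impl Pipeline<Fetched> {
    pub fn new(body: &str) -> Self {
        Pipeline { state: Fetched { body: body.to_string() } }
    }

    /// Toy "parser": one block per non-empty line.
    pub fn parse(self) -> Pipeline<Parsed> {
        let text_blocks = self
            .state
            .body
            .lines()
            .map(str::trim)
            .filter(|line| !line.is_empty())
            .map(String::from)
            .collect();
        Pipeline { state: Parsed { text_blocks } }
    }
}

impl Pipeline<Parsed> {
    /// Consumes the parsed state and emits the document model.
    pub fn transform(self) -> IndexDocument {
        IndexDocument { blocks: self.state.text_blocks }
    }
}

/// A renderer only ever sees the document model, never the raw input.
pub fn render(doc: &IndexDocument) -> String {
    doc.blocks.join("\n")
}
```

In this sketch, Pipeline::new(html).parse().transform() is the only path to an IndexDocument, and render accepts nothing else.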

Workspace layout

.
├── AGENTS.md
├── CHANGELOG.md
├── ROADMAP.md
├── Makefile
├── Cargo.toml
├── assets
│   ├── black-banner.png
│   ├── black-icon.png
│   ├── white-banner.png
│   └── white-icon.png
├── crates
│   ├── index-cli
│   ├── index-ai
│   ├── index-capture
│   ├── index-compat-lab
│   ├── index-core
│   ├── index-dom
│   ├── index-extract
│   ├── index-headless
│   ├── index-http
│   ├── index-readability
│   ├── index-renderer
│   ├── index-security
│   └── index-transformer
├── docs
│   ├── ARCHITECTURE.md
│   ├── COMPATIBILITY.md
│   ├── DIAGNOSTICS.md
│   ├── MSRV.md
│   ├── PERFORMANCE.md
│   ├── SECURITY.md
│   ├── SPEC.md
│   ├── adr
│   ├── issue-templates
│   └── rfc
└── examples
    └── sample.html

Local commands

make fmt
make clippy
make test
make coverage
make coverage-catalog
make dogfood-corpus
make forum-corpus
make top100-corpus
make security-review
make compatibility
make compatibility-slo
make compatibility-slo-v2
make compatibility-backlog
make readability-lift-v2
make actionability-lift-v2
make failure-quality-v3
make index-idx-adoption-v1
make family-pack-expansion-v2
make compatibility-pack-runtime-v1
make compat-lab-bootstrap-v1
make compat-rule-synthesis-v1
make compat-pack-trust-v1
make compat-pack-hotswap-v1
make compat-pack-ci-v1
make compat-no-binary-release-v1
make live-variance-v1
make app-shell-recovery-v2
make auth-assist-v1
make challenge-failure-ux-v1
make layout-fidelity-v3
make international-text-v2
make structured-data-recovery-v1
make compat-data-plane-v2
make compatibility-recovery-gate
make performance-great
make security-best
make ux-great
make readiness-great
make security-closure-v1
make performance-capacity-v1
make ux-interaction-v1
make operability-evidence-v1
make contract-freeze-v1
make release-1-0-gate-v1
make robustness-gate
make beta-readiness
make stable-readiness
make audit
make verify
make package
make package-dry-run
make package-manifest
make package-smoke
make release-candidate-dry-run
make bench
make alpha-smoke
make run

  • make coverage enforces a minimum 93% line coverage using cargo-llvm-cov.
  • make coverage-catalog validates the fixture paths listed by the coverage program.
  • make dogfood-corpus validates committed and live dogfooding corpus manifests without fetching live URLs.
  • make forum-corpus validates forum target-domain tiers and fixture mappings.
  • make top100-corpus validates top-100 target-domain rows, tiers, known-limit classes, and fixture mappings.
  • make security-review validates the hostile-input abuse-case catalog and release security checklist.
  • make compatibility validates terminal compatibility and accessibility release notes.
  • make compatibility-slo scores top-100 and forum corpus compatibility against release floors for readability, actionability, and failure quality.
  • make compatibility-slo-v2 enforces global + per-family SLO thresholds and optional baseline delta reporting for release candidates.
  • make compatibility-backlog emits a deterministic top-N compatibility queue with recommended roadmap milestone linkage.
  • make readability-lift-v2 validates dense-root selection, boilerplate suppression, spacing, and code-preservation fixtures for generic extraction.
  • make actionability-lift-v2 validates link ranking/deduplication, forum next-step extraction, and form-default submission modeling.
  • make failure-quality-v3 validates blocked-flow taxonomy coverage, deterministic failure diagnostics, and unsupported-page no-silent-success guardrails.
  • make index-idx-adoption-v1 validates index idx lint, toolkit templates, and publisher guidance assets.
  • make security-closure-v1 enforces threat-model/abuse-case/risk-register closure and pack-trust fail-closed checks.
  • make performance-capacity-v1 composes strict benchmark budgets and runtime stage-policy coverage checks.
  • make ux-interaction-v1 enforces quickstart/help/start-page interaction contracts and progress semantics.
  • make operability-evidence-v1 validates deterministic release-evidence bundle generation and composed readiness evidence.
  • make contract-freeze-v1 enforces 1.x external contract policies for CLI, index.idx/v1, and index.pack/v1.
  • make release-1-0-gate-v1 composes M96-M100 gates with full verification and release-candidate dry-run checks for 1.0.0 decisions.
  • make family-pack-expansion-v2 validates family-pack confidence/fallback behavior and fixtures for app-shell, commerce cards, and mixed-media pages.
  • make compatibility-pack-runtime-v1 validates index.pack/v1 runtime schema/precedence and fail-closed behavior.
  • make compat-lab-bootstrap-v1 validates deterministic index-compat-lab ingest and scaffold workflows.
  • make compat-rule-synthesis-v1 validates deterministic rule synthesis, safety linting, and override-merge behavior.
  • make compat-pack-trust-v1 validates compatibility-pack signing and verification workflows.
  • make compat-pack-hotswap-v1 validates rollback snapshots and runtime reload attribution behavior.
  • make compat-pack-ci-v1 validates composed compatibility-pack canary gates.
  • make compat-no-binary-release-v1 validates the compatibility data-only release runbook.
  • make live-variance-v1 validates deterministic live-variance aggregation from opt-in run ledgers.
  • make app-shell-recovery-v2 validates app-shell recovery profile attribution, fallback order, and stage budgets.
  • make auth-assist-v1 validates session-aware auth diagnostics and local cookie import/export helpers.
  • make challenge-failure-ux-v1 validates deterministic blocked-flow challenge classification and reporting.
  • make layout-fidelity-v3 validates spacing rhythm and pre/code fidelity regression tests.
  • make international-text-v2 validates multilingual rendering/search regression checks.
  • make structured-data-recovery-v1 validates structured metadata extraction guardrails.
  • make compat-data-plane-v2 validates compatibility data-plane synthesis quality and strict linting.
  • make compatibility-recovery-gate validates composed SLO-v2 + live-variance recovery evidence for release decisions.
  • make robustness-gate validates robustness policy/report assets and composes local corpus, security, compatibility, package-manifest, and alpha-smoke checks into one deterministic command.
  • make beta-readiness validates beta support scope and composes the local coverage, dogfooding, security, compatibility, and package-manifest gates.
  • make stable-readiness validates stable support policy and fixture stewardship gates; it does not by itself mean a stable release has been earned.
  • make audit runs cargo audit for dependency advisory checks.
  • make package builds a local tarball with the binary, man page, completions, README, and license.
  • make package-manifest validates package source paths.
  • make package-smoke verifies the packaged binary outside the source tree.
  • make release-candidate-dry-run builds and smoke-tests a package, then writes SHA-256 checksums under dist/SHA256SUMS.
  • make bench runs a local release-binary smoke benchmark without hosted CI. index --benchmark <url-or-local-html-file> reports transform timing and cache reuse for one input.
  • make alpha-smoke runs the alpha hardening smoke gate for local file, extraction, capture, adapter, shelf, and benchmark paths. Live URL and bounded TUI startup checks are opt-in through environment variables documented in docs/ALPHA.md.
  • make performance-great enforces strict benchmark ceilings for first transform, cached transform, and release-binary average latency.
  • make security-best composes abuse-case review, advisory checks, deny checks, and targeted credential-redaction tests.
  • make ux-great validates quickstart/help usability paths and key command coverage for first-session usage.
  • make readiness-great composes strict performance, security, UX, compatibility, robustness, beta, and stable gates.

CLI prototype

cargo run -p index-cli
cargo run -p index-cli -- quickstart
cargo run -p index-cli -- --profile docs
cargo run -p index-cli -- https://example.org
cargo run -p index-cli -- example.org
cargo run -p index-cli -- examples/sample.html
index quickstart

With no arguments, the CLI opens a built-in start page that lists the core commands. Given an argument, it reads an http or https URL, a URL without a scheme, a local HTML file, or stdin (-) and opens the result in the terminal UI. URL inputs without an explicit scheme default to https://.
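
A minimal sketch of how such input handling could look. The Input enum and classify_input are hypothetical names, and checking for an existing local file before treating the argument as a schemeless URL is an assumption; this README does not say how the CLI tells the two apart.

```rust
/// Hypothetical classification of the CLI's positional argument.
#[derive(Debug, PartialEq)]
pub enum Input {
    StartPage,
    Stdin,
    LocalFile(String),
    Url(String),
}

/// `file_exists` is injected so the sketch stays testable. Preferring an
/// existing local file over a schemeless URL is an assumption.
pub fn classify_input(arg: Option<&str>, file_exists: impl Fn(&str) -> bool) -> Input {
    match arg {
        None => Input::StartPage,
        Some("-") => Input::Stdin,
        Some(a) if a.starts_with("http://") || a.starts_with("https://") => {
            Input::Url(a.to_string())
        }
        Some(a) if file_exists(a) => Input::LocalFile(a.to_string()),
        // No explicit scheme: default to https://, as described above.
        Some(a) => Input::Url(format!("https://{a}")),
    }
}
```

A host could pass |p| std::path::Path::new(p).is_file() as the file check.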

cat examples/sample.html | cargo run -p index-cli -- -

index-core now provides reusable navigation state for history, bookmarks, session restore, redirect tracking, per-origin data, redacted response-log entries, and persisted sidebar mode preference. index-http provides a blocking HTTP fetcher, form submission transport for GET and application/x-www-form-urlencoded POST responses, deterministic cache paths, a filesystem cache for text responses, and SecureFetcher policy enforcement before content reaches the transformer.
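
One way to get deterministic cache paths is to derive the file name from a hash whose output is fixed by its definition. The sketch below uses FNV-1a as an illustration; the hash choice, the cache_path name, and the <root>/<hex>.txt layout are all assumptions and not the index-http scheme.

```rust
/// FNV-1a 64-bit hash. Used here because its output is fixed by the
/// algorithm, whereas std's DefaultHasher is not guaranteed to stay stable
/// across Rust releases.
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for byte in bytes {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Hypothetical cache layout: <cache_root>/<16 hex digits>.txt
pub fn cache_path(cache_root: &str, url: &str) -> String {
    format!(
        "{}/{:016x}.txt",
        cache_root.trim_end_matches('/'),
        fnv1a64(url.as_bytes())
    )
}
```

The same URL then maps to the same path on every run and every machine, which is what makes cached responses reusable across sessions.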

index-headless defines the snapshot fallback boundary for JavaScript-heavy pages: timeout policy, script/network permissions, sandbox requirements, DOM snapshots, accessibility snapshots, and deterministic failure values. It does not embed a browser engine. When an accessibility snapshot carries enough semantic roles, the transformer uses it before DOM text, maps roles into Index nodes, merges rendered DOM links, and falls back to DOM extraction when the accessibility tree is sparse.

Authentication state is modeled in index-core: cookies are isolated by origin, secure cookies require HTTPS, login form actions are checked by origin policy, logout clears session cookies, and diagnostics can redact known credentials. Real transport remains a later integration point.
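
The origin rules above can be sketched with a small in-memory jar. Origin, Cookie, and CookieJar are hypothetical names, not the index-core types, and this sketch does not distinguish session cookies from persistent ones: logout simply drops everything stored for the origin.

```rust
use std::collections::HashMap;

/// Hypothetical origin key: scheme, host, and port together.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct Origin {
    pub scheme: String,
    pub host: String,
    pub port: u16,
}

impl Origin {
    pub fn new(scheme: &str, host: &str, port: u16) -> Self {
        Origin { scheme: scheme.to_string(), host: host.to_string(), port }
    }
}

pub struct Cookie {
    pub name: String,
    pub value: String,
    pub secure: bool,
}

#[derive(Default)]
pub struct CookieJar {
    by_origin: HashMap<Origin, Vec<Cookie>>,
}

impl CookieJar {
    /// Stores a cookie; secure cookies are rejected on non-HTTPS origins.
    pub fn set(&mut self, origin: &Origin, name: &str, value: &str, secure: bool) -> bool {
        if secure && origin.scheme != "https" {
            return false;
        }
        let slot = self.by_origin.entry(origin.clone()).or_default();
        slot.retain(|c| c.name != name); // replace an older cookie of the same name
        slot.push(Cookie { name: name.to_string(), value: value.to_string(), secure });
        true
    }

    /// Only cookies stored for exactly this origin are visible.
    pub fn get(&self, origin: &Origin, name: &str) -> Option<&str> {
        self.by_origin.get(origin)?.iter().find(|c| c.name == name).map(|c| c.value.as_str())
    }

    /// Simplification: clears every cookie for the origin.
    pub fn logout(&mut self, origin: &Origin) {
        self.by_origin.remove(origin);
    }

    /// Replaces every known cookie value in diagnostic text.
    pub fn redact(&self, text: &str) -> String {
        let mut out = text.to_string();
        for cookie in self.by_origin.values().flatten() {
            if !cookie.value.is_empty() {
                out = out.replace(&cookie.value, "[redacted]");
            }
        }
        out
    }
}
```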

Failure diagnostics are also modeled in index-core: failed or low-confidence fetch, headless, and generic transform paths can produce deterministic diagnostic documents with source, confidence, fallback information, suggested next actions, and redacted local text suitable for fixture review.

Document quality is recorded in document metadata and shown in the TUI status line. Quality categories are adapter, strong-generic, partial-generic, fallback, and failed; JSON extraction includes the category, score, and deterministic reasons so shell workflows can distinguish understood pages from fallback documents.
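
For a consumer of the JSON output, the five categories map naturally onto an enum. The sketch below is hypothetical: the Quality type and its methods are not the index-core API, and treating only adapter and strong-generic as "understood" is an assumption about how a shell workflow might draw the line.

```rust
/// The five quality categories named above, as a hypothetical enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Quality {
    Adapter,
    StrongGeneric,
    PartialGeneric,
    Fallback,
    Failed,
}

impl Quality {
    pub const ALL: [Quality; 5] = [
        Quality::Adapter,
        Quality::StrongGeneric,
        Quality::PartialGeneric,
        Quality::Fallback,
        Quality::Failed,
    ];

    /// The category label as written in this README.
    pub fn label(self) -> &'static str {
        match self {
            Quality::Adapter => "adapter",
            Quality::StrongGeneric => "strong-generic",
            Quality::PartialGeneric => "partial-generic",
            Quality::Fallback => "fallback",
            Quality::Failed => "failed",
        }
    }

    /// Parses a label read back from extraction output.
    pub fn parse(label: &str) -> Option<Quality> {
        Quality::ALL.into_iter().find(|q| q.label() == label)
    }

    /// Assumption: only these two categories count as "understood".
    pub fn is_understood(self) -> bool {
        matches!(self, Quality::Adapter | Quality::StrongGeneric)
    }
}
```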

Security hardening is modeled in index-security: content-size limits, decompression expansion checks, redirect-loop detection, and URL scheme policy tests are reusable by fetchers and entry points. index-http exposes SecureFetcher for applying those checks before content reaches the transformer.
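
A rough sketch of three of these checks, written as pure functions a fetcher could call before handing content on. Policy, PolicyError, and the function names are hypothetical, the limits are placeholders, and decompression expansion checks are left out; none of this is the index-security API.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum PolicyError {
    SchemeNotAllowed(String),
    RedirectLoop(String),
    TooManyRedirects,
    BodyTooLarge { limit: usize, actual: usize },
}

/// Hypothetical limits; the real values are not stated in this README.
pub struct Policy {
    pub max_redirects: usize,
    pub max_body_bytes: usize,
}

/// Allows only http and https URLs.
pub fn check_scheme(url: &str) -> Result<(), PolicyError> {
    match url.split_once("://") {
        Some((s, _)) if s.eq_ignore_ascii_case("http") || s.eq_ignore_ascii_case("https") => Ok(()),
        Some((s, _)) => Err(PolicyError::SchemeNotAllowed(s.to_string())),
        None => Err(PolicyError::SchemeNotAllowed(String::new())),
    }
}

/// `chain[0]` is the requested URL; each later entry is one redirect hop.
pub fn check_redirect_chain(policy: &Policy, chain: &[&str]) -> Result<(), PolicyError> {
    if chain.len().saturating_sub(1) > policy.max_redirects {
        return Err(PolicyError::TooManyRedirects);
    }
    let mut seen = HashSet::new();
    for url in chain {
        check_scheme(url)?;
        // Visiting the same URL twice means the chain loops.
        if !seen.insert(*url) {
            return Err(PolicyError::RedirectLoop((*url).to_string()));
        }
    }
    Ok(())
}

pub fn check_body_size(policy: &Policy, actual: usize) -> Result<(), PolicyError> {
    if actual > policy.max_body_bytes {
        return Err(PolicyError::BodyTooLarge { limit: policy.max_body_bytes, actual });
    }
    Ok(())
}
```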

The static reader currently extracts main content, headings, links, code blocks, structured tables, image alt text, canonical URLs, descriptions, and OpenGraph title/description metadata from fetched or local HTML input. It also preserves lists and simple forms as semantic actions with fields, buttons, methods, and resolved actions. In the TUI, e opens form editing, tab changes fields, and enter submits through the host fetch boundary. When live image bytes are reachable, image nodes are rendered as bounded black-and-white dither previews with an explicit source link.

For scriptable output:

cargo run -p index-cli -- --plain examples/sample.html
curl -sS https://example.org | cargo run -p index-cli -- --plain -
cat examples/sample.html | cargo run -p index-cli -- --plain -
cargo run -p index-cli -- --plain https://example.org
cargo run -p index-cli -- --plain example.org
cargo run -p index-cli -- --extract markdown examples/sample.html
cargo run -p index-cli -- --extract links examples/sample.html
cargo run -p index-cli -- --extract json examples/sample.html

index-extract emits deterministic Markdown, stable numeric link lists, and JSON shaped from the Index Document Model, including table headers and row labels derived from structured table rows. It also classifies :pipe commands without executing them: safe commands require :pipe --confirm <cmd>, while shell syntax and unapproved programs are denied.
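
The :pipe policy can be pictured as a pure classifier that never spawns a process. The sketch below is an illustration only: PipeDecision, classify_pipe, the allowlist, and the set of characters treated as shell syntax are all assumptions, since this README does not list the approved programs or the exact rules.

```rust
#[derive(Debug, PartialEq)]
pub enum PipeDecision {
    /// Approved program, but --confirm has not been given yet.
    NeedsConfirmation,
    /// Approved program and explicitly confirmed.
    Allowed,
    Denied(&'static str),
}

/// Hypothetical allowlist; the real approved set is not listed in this README.
const APPROVED: &[&str] = &["grep", "wc", "sort", "head", "less"];

/// Characters treated as shell syntax in this sketch.
const SHELL_SYNTAX: &[char] = &['|', '&', ';', '<', '>', '$', '`', '(', ')', '\n'];

/// Classifies the text after `:pipe`. Nothing is executed here.
pub fn classify_pipe(raw: &str) -> PipeDecision {
    let raw = raw.trim();
    let (confirmed, cmd) = match raw.strip_prefix("--confirm") {
        Some(rest) if rest.starts_with(char::is_whitespace) => (true, rest.trim_start()),
        _ => (false, raw),
    };
    if cmd.contains(SHELL_SYNTAX) {
        return PipeDecision::Denied("shell syntax");
    }
    match cmd.split_whitespace().next() {
        None => PipeDecision::Denied("empty command"),
        Some(program) if !APPROVED.contains(&program) => PipeDecision::Denied("unapproved program"),
        Some(_) if confirmed => PipeDecision::Allowed,
        Some(_) => PipeDecision::NeedsConfirmation,
    }
}
```

Keeping classification separate from execution means the decision can be unit-tested and shown to the user before anything runs.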

For offline knowledge workflows:

cargo run -p index-cli -- --save markdown examples/sample.html notes.md
cargo run -p index-cli -- --save json examples/sample.html notes.json
cargo run -p index-cli -- --citations examples/sample.html
cargo run -p index-cli -- --section "Overview" examples/sample.html
cargo run -p index-cli -- --batch-extract markdown examples/sample.html artifact.txt

--save writes deterministic Markdown or JSON to a local file. --citations emits stable TSV references for external HTTP(S) links. --section exports the first matching heading or section as Markdown. --batch-extract works only on local files and local capture artifacts; it does not fetch URLs.
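
Stable link numbers and TSV citations both depend on a numbering that is a pure function of the document. A minimal sketch, assuming ids follow first-seen document order, duplicate hrefs collapse to one id, and a TSV row is id, text, href; the Link struct, both function names, and the column set are hypothetical and may differ from what --citations actually emits.

```rust
use std::collections::HashSet;

/// A link as it might appear in the document model (hypothetical shape).
pub struct Link {
    pub text: String,
    pub href: String,
}

/// Assigns 1-based ids in first-seen document order, collapsing duplicate
/// hrefs, so the same document always yields the same numbering.
pub fn number_links(links: &[Link]) -> Vec<(usize, &Link)> {
    let mut seen = HashSet::new();
    links
        .iter()
        .filter(|link| seen.insert(link.href.as_str()))
        .enumerate()
        .map(|(i, link)| (i + 1, link))
        .collect()
}

/// Emits one TSV row per external HTTP(S) link. Tabs and newlines inside
/// the link text are replaced so each row stays on one line.
pub fn citations_tsv(links: &[Link]) -> String {
    number_links(links)
        .into_iter()
        .filter(|(_, l)| l.href.starts_with("http://") || l.href.starts_with("https://"))
        .map(|(id, l)| format!("{id}\t{}\t{}\n", l.text.replace(['\t', '\n'], " "), l.href))
        .collect()
}
```

Because relative links keep their id but are filtered out of the citation rows, the ids in the TSV still match the ids in the numeric link list.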

For deterministic local AI-style transforms:

cargo run -p index-cli -- --ai-offline explain examples/sample.html
cargo run -p index-cli -- --ai-offline summarize examples/sample.html
cargo run -p index-cli -- --ai-offline extract examples/sample.html

index-ai defines the provider trait, versioned prompt templates, privacy modes, mock provider, and offline fallback. It performs no network IO; external providers must be integrated explicitly by a host and receive content only after a user invokes an AI action.

For local performance checks:

cargo run -p index-cli -- --benchmark examples/sample.html
curl -sS https://example.org | cargo run -p index-cli -- --benchmark -

The benchmark report is local, machine-readable, and includes input bytes, document counts, transform timing, and transformed-cache reuse.

For local capture artifacts:

cargo run -p index-cli -- capture --redact https://example.org/page examples/sample.html
cargo run -p index-cli -- capture --redact example.org/page examples/sample.html
cargo run -p index-cli -- capture --preview --redact https://example.org/page examples/sample.html
cat artifact.txt | cargo run -p index-cli -- capture --validate -
cat examples/sample.html | cargo run -p index-cli -- capture --redact https://example.org/page -

index-capture validates the source URL, redacts credential-shaped URLs, cookies, form values, and diagnostics, then emits a deterministic local artifact for review. Preview output includes a redaction summary and fixture submission checklist; validation confirms local bundles remain parseable and redacted. It does not fetch or upload anything.

Open and submit workflows report real runtime stages with target context: queued, fetching, snapshotting, parsing, transforming, scoring, storing, done, and failed. See docs/ASYNC_STAGES.md.

Sites can optionally publish index.idx/v1 as a same-origin manifest for safe presentation hints (docs/INDEX_IDX_PROTOCOL.md). Manifest hints are bounded, validated, and fail closed.

For local manifest validation:

index idx lint docs/index-idx/examples/article.index.idx.json https://example.org/docs/page
index idx lint docs/index-idx/examples/search.index.idx.json https://example.org/search?q=index
index idx lint docs/index-idx/examples/forum.index.idx.json https://example.org/forum/thread/42

For runtime compatibility-pack operations:

index compatibility-pack lint docs/compat-packs/examples/social-community.pack.json https://news.ycombinator.com/item?id=1
index compatibility-pack inspect https://news.ycombinator.com/item?id=1
index compatibility-pack install docs/compat-packs/examples/social-community.pack.json --user
index compatibility-pack list

For compatibility data-plane authoring:

index-compat-lab ingest --top100 docs/top100-corpus/matrix.tsv --forum docs/forum-corpus/matrix.tsv
index-compat-lab synthesize --top100 docs/top100-corpus/matrix.tsv --forum docs/forum-corpus/matrix.tsv --family social-community
index-compat-lab scaffold --top100 docs/top100-corpus/matrix.tsv --forum docs/forum-corpus/matrix.tsv --family social-community
index-compat-lab lint docs/compat-packs/examples/social-community.pack.json

For compatibility recovery diagnostics:

index compatibility-live-variance --targets docs/compat-live/targets.tsv --runs docs/compat-live/runs.tsv --window 5
index compatibility-recovery-plan chatgpt.com
index compatibility-recovery-gate --top100 docs/top100-corpus/matrix.tsv --forum docs/forum-corpus/matrix.tsv --live-targets docs/compat-live/targets.tsv --live-runs docs/compat-live/runs.tsv
index auth-assist diagnose-submit https://news.ycombinator.com/login 403 "csrf token expired"
index challenge-diagnose https://example.org blocked-flow.html

For installed binary verification and runtime locations:

index --version
index --paths
index doctor
index artifact inspect https://example.org/docs

Runtime locations follow XDG conventions: $XDG_CONFIG_HOME/index, $XDG_CACHE_HOME/index, and $XDG_STATE_HOME/index, with $HOME/.config, $HOME/.cache, and $HOME/.local/state fallbacks. index doctor emits a local telemetry-free support report with redacted runtime paths, directory health checks, package/version guidance, and no network probe. index artifact inspect reports local artifact presence and freshness by context (live-get, live-submit, offline) from the cache artifact store.
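
The lookup order can be sketched as a pure function over an environment accessor. RuntimePaths and resolve_paths are hypothetical names, and treating an empty variable as unset follows the XDG Base Directory convention rather than anything stated here about Index itself.

```rust
use std::path::PathBuf;

pub struct RuntimePaths {
    pub config: PathBuf,
    pub cache: PathBuf,
    pub state: PathBuf,
}

/// `env` is injected (instead of calling std::env::var directly) so the
/// lookup can be tested without touching the process environment.
/// Returns None when neither the XDG variable nor HOME is available.
pub fn resolve_paths(env: impl Fn(&str) -> Option<String>) -> Option<RuntimePaths> {
    let home = env("HOME").map(PathBuf::from);
    let dir = |var: &str, fallback: &str| -> Option<PathBuf> {
        // Assumption: an empty variable is treated the same as an unset one.
        match env(var).filter(|value| !value.is_empty()) {
            Some(value) => Some(PathBuf::from(value).join("index")),
            None => home.as_ref().map(|h| h.join(fallback).join("index")),
        }
    };
    Some(RuntimePaths {
        config: dir("XDG_CONFIG_HOME", ".config")?,
        cache: dir("XDG_CACHE_HOME", ".cache")?,
        state: dir("XDG_STATE_HOME", ".local/state")?,
    })
}
```

In a real binary the accessor would be |key| std::env::var(key).ok().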

Reader profiles change terminal presentation without changing extracted semantic content:

index --profile reader
index --profile docs https://example.org/manual
index --profile links example.org

Inside the TUI, use :profile reader|docs|links|research|compact|verbose. Index starts in automatic profile mode and suggests docs, links, research, or reader from the current page intent. Use :profile auto to return to automatic selection after a manual override. Theme tokens cover semantic roles, markdown emphasis, diagnostics, links, and regions. True-color terminals get the richest palette; ANSI and monochrome terminals fall back to deterministic named colors and modifiers.

Packaging assets live in:

  • docs/man/index.1
  • completions/index.bash
  • completions/index.zsh
  • completions/index.fish
  • docs/packaging/CRATES_IO.md
  • docs/packaging/DISTROS.md
  • docs/packaging/CLEAN_INSTALL.md
  • docs/BRANDING.md
  • assets/white-icon.png
  • assets/black-icon.png
  • assets/white-banner.png
  • assets/black-banner.png

Branding:

  • App icon: assets/white-icon.png
  • README banner: assets/white-banner.png
  • Official interface font: JetBrainsMono Nerd Font Mono

Production-readiness policies live in:

  • docs/COMPATIBILITY.md
  • docs/COMPATIBILITY_VALIDATION.md
  • docs/ACCESSIBILITY.md
  • docs/MSRV.md
  • docs/ADAPTER_STABILITY.md
  • docs/ADAPTER_PRIORITY.md
  • docs/ADAPTER_HARNESS.md
  • docs/ADAPTER_DISCIPLINE.md
  • docs/DIAGNOSTICS.md
  • docs/DOCTOR.md
  • docs/FAILURE_HANDOFF.md
  • docs/INTERNATIONAL_TEXT.md
  • docs/ALPHA.md
  • docs/BETA.md
  • docs/BETA_READINESS_REPORT.md
  • docs/STABLE.md
  • docs/STABLE_READINESS_REPORT.md
  • docs/KNOWN_LIMITS.md
  • docs/DOGFOODING.md
  • docs/dogfooding/CORPUS.md
  • docs/RELEASE.md
  • docs/RELEASE_NOTES_TEMPLATE.md
  • docs/NETWORK.md
  • docs/SECURITY_REVIEW.md
  • docs/ABUSE_CASES.md
  • docs/PERFORMANCE.md
  • docs/QUALITY.md
  • docs/SNAPSHOT_POLICY.md
  • docs/ARTIFACT_RUNTIME.md
  • docs/CAPTURE.md
  • docs/OFFLINE.md
  • docs/issue-templates/

Coverage program docs live in:

  • docs/COVERAGE_PROGRAM.md
  • docs/COVERAGE_CATALOG.md
  • docs/FIXTURE_INTAKE.md
  • docs/FIXTURE_MATRIX.md
  • docs/SITE_FAMILY_PACKS.md
  • docs/CAPTURE.md
  • docs/forum-corpus/
  • docs/top100-corpus/

Local knowledge shelf commands:

index shelf save examples/sample.html
index shelf list
index shelf show <id>
index shelf search borrowing
index shelf search --format markdown borrowing
index shelf search --format json borrowing
index shelf tag <id> docs
index shelf note <id> "read before release"

Shelf metadata is stored under $XDG_STATE_HOME/index/shelf with $HOME/.local/state/index/shelf as the fallback. Markdown and JSON exports live under the shelf exports/ directory. Search is local-only and ranks title, tag, note, citation, source URL, Markdown heading, and Markdown body matches deterministically.
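
Deterministic ranking mostly comes down to a fixed scoring rule plus a total tie-break. The sketch below covers only four of the fields listed above, and the ShelfRecord shape, the weights, and the id tie-break are all assumptions for illustration; the actual shelf ranking is not specified in this README.

```rust
/// Hypothetical, reduced shelf record.
pub struct ShelfRecord {
    pub id: u32,
    pub title: String,
    pub tags: Vec<String>,
    pub note: String,
    pub body: String,
}

/// Hypothetical field weights: title > tag > note > body.
/// `query` must already be lowercased.
fn score(record: &ShelfRecord, query: &str) -> u32 {
    let hit = |s: &str| s.to_lowercase().contains(query);
    let mut total = 0;
    if hit(&record.title) {
        total += 8;
    }
    if record.tags.iter().any(|t| hit(t)) {
        total += 4;
    }
    if hit(&record.note) {
        total += 2;
    }
    if hit(&record.body) {
        total += 1;
    }
    total
}

/// Returns matching record ids, best match first.
pub fn search(records: &[ShelfRecord], query: &str) -> Vec<u32> {
    let query = query.to_lowercase();
    if query.is_empty() {
        return Vec::new();
    }
    let mut hits: Vec<(u32, u32)> = records
        .iter()
        .map(|r| (score(r, &query), r.id))
        .filter(|(s, _)| *s > 0)
        .collect();
    // Score descending, then id ascending, so ties never depend on input order.
    hits.sort_by(|a, b| b.0.cmp(&a.0).then(a.1.cmp(&b.1)));
    hits.into_iter().map(|(_, id)| id).collect()
}
```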

Adapter fixture review:

index adapter check crates/index-transformer/tests/fixtures/adapters/gitlab-project.html

The report is deterministic text with the detected adapter, support tier, quality, node/link/form/table/region counts, fallback reason, fixture checklist reference, and Markdown extraction snapshot.

TUI keys:

  • j/k scroll
  • gg/G top/bottom
  • / search
  • f link hints
  • l toggle the right sidebar
  • e edit the next form field near the current line
  • tab / shift-tab move between fields while editing a form
  • enter submit the current form while editing a form
  • esc cancel form editing
  • t toggle compact/detail table mode
  • [ / ] shift table columns while the sidebar is closed
  • [ / ] switch sidebar modes while the sidebar is open
  • 1-6 choose links, outline, forms, regions, search, or logs sidebar modes
  • j/k select sidebar items while the sidebar is open
  • enter open links, jump to outline/forms/search items, or expand/collapse selected regions
  • e edit the selected form while the forms sidebar is open
  • space expand/collapse the selected region in the regions sidebar
  • b go back to the previous page
  • :back go back to the previous page
  • :open <id> fetch and render a stable numeric link target
  • :open <url> fetch and render an explicit URL; tab completes from current-session URL history while typing
  • :logs show the hidden local response-log sidebar with redacted server response previews
  • :submit <form> field=value resolve a form submission action
  • :extract markdown|links|json request a document extraction action
  • :pipe <cmd> request a confirmed pipe action
  • :ai explain|summarize|extract request an explicit AI action
  • :profile reader|docs|links|research|compact|verbose|auto switch visual profile
  • :capture preview review local capture redactions for the current page
  • :capture save <path> save a local capture artifact for the current page
  • :main next / :main previous jump between plausible main regions
  • :hide region <id> / :show region <id> collapse or restore a region
  • :promote section <id> focus one region as the temporary main view
  • :quit quit

Architectural rule

All external formats must be converted into the Index Document Model before rendering.

No renderer should parse HTML directly. No adapter should write terminal escape sequences directly. No transformer should know about terminal layout constraints.

License

Unlicense.
