hightemp/foliant

Foliant

Foliant is a Go-native OCR/document intelligence project inspired by Surya and Marker.

The project is experimental. The current implementation includes:

  • a pure-Go model cache, safetensors reader, and tensor primitives;
  • image text detection, selected recognition parity fixtures, and image OCR wiring;
  • fixture-limited layout and table structure paths;
  • OCR JSON-to-Markdown rendering;
  • constrained image, scanned-PDF, and embedded-text-PDF conversion;
  • PDF metadata inspection;
  • a narrow image-only scanned-PDF path for pages backed by one full-page JPEG XObject, a simple 8-bit Flate RGB/gray image XObject with limited predictors, or an uncompressed 8-bit RGB/gray image XObject;
  • common xref streams, including a constrained Flate predictor DecodeParms subset;
  • a limited embedded-PDF-text path for simple hand-authored Type1/TrueType text fixtures.

Broad Surya compatibility and production speed are not claimed. Full PDF rasterization/text extraction is still deferred.

Runtime Constraints

  • Pure Go runtime.
  • No Python runtime.
  • No PyTorch runtime.
  • No ONNX runtime.
  • No GGUF conversion.
  • No llama.cpp.
  • No Ollama.
  • No C/C++ inference runtime.
  • No CGo for inference.
  • No model format conversion.
  • Original Surya artifacts must be downloaded and read directly.

Development-only Python scripts may be added later only for generating or comparing fixtures against upstream Surya. They must not become runtime dependencies.

Installation

Prerequisites:

  • Go 1.23 or newer.
  • A writable Go build cache and model cache.
  • Enough RAM for the selected experimental workflow; see docs/MEMORY.md.

Build the CLI:

make build
./bin/foliant version

Optionally install it somewhere on PATH:

make install
foliant version

make install installs to ~/.local/bin by default. Override the destination with PREFIX, for example make install PREFIX=/usr/local.

Release builds can inject metadata without changing source:

go build \
  -ldflags "-X main.buildVersion=v0.0.0-experimental -X main.buildCommit=$(git rev-parse --short HEAD) -X main.buildDate=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  -o ./bin/foliant ./cmd/foliant
./bin/foliant version --json
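
The -X linker flags above only take effect on package-level string variables. The following is a minimal sketch of how such variables could be declared and printed, not Foliant's actual source: the variable names come from the go build command above, while the versionInfo struct and its JSON field names are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Overridden at link time with, for example:
//   -ldflags "-X main.buildVersion=v0.0.0-experimental"
// -X only works on package-level string variables, not constants.
var (
	buildVersion = "dev"
	buildCommit  = "unknown"
	buildDate    = "unknown"
)

// versionInfo is a hypothetical payload shape; the real
// foliant.version.v1 fields are not listed in this README.
type versionInfo struct {
	Schema  string `json:"schema"`
	Version string `json:"version"`
	Commit  string `json:"commit"`
	Date    string `json:"date"`
}

// currentVersion collects the link-time values into one struct.
func currentVersion() versionInfo {
	return versionInfo{
		Schema:  "foliant.version.v1",
		Version: buildVersion,
		Commit:  buildCommit,
		Date:    buildDate,
	}
}

func main() {
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	if err := enc.Encode(currentVersion()); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Built without -ldflags, the sketch prints the fallback values, which makes an uninjected build easy to spot.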

No packaged binary target or OS support promise is declared yet. The project is pure Go and is expected to build on platforms supported by Go 1.23, but validation so far is limited to local CPU-only development runs, not a cross-platform release matrix. GPU/accelerator support is not implemented.

Quick Start

The examples below assume the built foliant binary is available on PATH. Use --model-cache when you want an explicit model cache location; otherwise Foliant uses the OS user cache under datalab/models.

Fetch the detection and recognition checkpoints before the first OCR run:

foliant models fetch --model text_detection --progress always
foliant models fetch --model recognition --progress always

Run OCR on an image and write JSON:

foliant ocr page.png --out result.json --progress always

Convert an image to Markdown and keep the OCR JSON sidecar:

foliant convert page.png --format markdown --out result.md --json-out result.json --progress always

Convert a supported scanned PDF page range:

foliant convert scanned.pdf --pages 1,3-5 --format markdown --out result.md --progress always

Commands

Version

foliant version
foliant version --json

Prints build metadata. --json emits foliant.version.v1.

Models

foliant models fetch --model text_detection
foliant models fetch --model recognition
foliant models validate --model text_detection
foliant inspect-model --model text_detection

models fetch downloads original Surya model artifacts into the local cache. models validate checks that cached files listed by manifest.json exist. inspect-model prints checkpoint config and safetensors metadata as JSON.

Common model flags:

--model-cache DIR   override the model cache root
--base-url URL      override the model artifact base URL
--workers N         parallel download workers
--no-download       fail if the required model is not already cached

Detect Text Lines

foliant detect page.png --out detection.json --progress always

detect loads the text detection checkpoint, runs image preprocessing and detection, then writes foliant.detection.v1 JSON with page bbox, text-line boxes, polygons, and confidences where available.

Useful flags:

--checkpoint MODEL_OR_DIR   detection checkpoint ref or local model directory
--model-cache DIR           model cache root
--out PATH                  write JSON to a file instead of stdout
--debug-dump DIR            write preprocessing debug files
--max-pixels N              reject images larger than N pixels; -1 disables
--processor-size N          experimental smoke/debug override
--log-level LEVEL           debug, info, warn, error
--progress MODE             auto, always, never
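
A pixel limit like --max-pixels can be enforced without decoding the full image by reading only the header. The sketch below uses the standard library's image.DecodeConfig for that purpose. It is an illustration under assumptions, not Foliant's implementation: checkPixelLimit and samplePNG are hypothetical names, and only the "-1 disables" convention is taken from the flag description above.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	_ "image/jpeg" // registers JPEG header decoding
	"image/png"    // registers PNG and provides Encode for the sample
	"io"
	"os"
)

// checkPixelLimit reads only the image header and returns an error when
// width*height exceeds maxPixels. A negative maxPixels disables the check.
// Hypothetical helper, not Foliant's API.
func checkPixelLimit(r io.Reader, maxPixels int64) error {
	cfg, format, err := image.DecodeConfig(r)
	if err != nil {
		return fmt.Errorf("read image header: %w", err)
	}
	if maxPixels < 0 {
		return nil
	}
	pixels := int64(cfg.Width) * int64(cfg.Height)
	if pixels > maxPixels {
		return fmt.Errorf("%s image has %d pixels, limit is %d", format, pixels, maxPixels)
	}
	return nil
}

// samplePNG builds a small in-memory grayscale PNG for the demo.
func samplePNG(w, h int) io.Reader {
	var buf bytes.Buffer
	if err := png.Encode(&buf, image.NewGray(image.Rect(0, 0, w, h))); err != nil {
		panic(err)
	}
	return &buf
}

func main() {
	if err := checkPixelLimit(samplePNG(64, 64), 1000); err != nil {
		fmt.Fprintln(os.Stderr, "rejected:", err)
	}
}
```

DecodeConfig consumes bytes from the reader, so a caller that goes on to decode the image needs to reopen the file or buffer the input first.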

OCR Images and Supported Scanned PDFs

foliant ocr page.png --out result.json --progress always
foliant ocr scanned.pdf --pages 1,3-5 --out result.json --progress always

ocr runs detection, crops detected lines, runs recognition, and writes foliant.ocr.v1 JSON. PDF input is limited to the supported scanned-PDF subset described below.

Useful flags:

--detection-model MODEL_OR_DIR
--recognition-model MODEL_OR_DIR
--model-cache DIR
--out PATH
--pages LIST                 PDF pages, for example 1,3-5
--pdf-max-bytes N            reject PDFs larger than N bytes; -1 disables
--max-pixels N               reject decoded images larger than N pixels
--max-tokens N               recognition token limit per line
--crop-padding N             extra pixels around detected line crops
--no-download
--log-level LEVEL
--progress MODE
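
The --pages value is a comma-separated list of 1-based pages and inclusive ranges, as in 1,3-5. A minimal sketch of parsing that syntax follows. parsePages and parsePage are hypothetical names, and details such as rejecting reversed ranges or keeping duplicates are assumptions rather than documented Foliant behavior.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePage parses a single 1-based page number.
func parsePage(s string) (int, error) {
	n, err := strconv.Atoi(strings.TrimSpace(s))
	if err != nil || n < 1 {
		return 0, fmt.Errorf("invalid page number %q", s)
	}
	return n, nil
}

// parsePages expands a list such as "1,3-5" into [1 3 4 5].
// Order is preserved and duplicates are not removed (an assumption).
func parsePages(spec string) ([]int, error) {
	var pages []int
	for _, part := range strings.Split(spec, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			return nil, fmt.Errorf("empty entry in page list %q", spec)
		}
		lo, hi, isRange := strings.Cut(part, "-")
		first, err := parsePage(lo)
		if err != nil {
			return nil, err
		}
		last := first
		if isRange {
			if last, err = parsePage(hi); err != nil {
				return nil, err
			}
		}
		if last < first {
			return nil, fmt.Errorf("reversed page range %q", part)
		}
		for p := first; p <= last; p++ {
			pages = append(pages, p)
		}
	}
	return pages, nil
}

func main() {
	pages, err := parsePages("1,3-5")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(pages) // prints [1 3 4 5]
}
```

A real implementation would likely also check each page against the document's page count before expanding large ranges.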

Convert to Markdown or JSON

foliant convert page.png --format markdown --out result.md --json-out result.json --progress always
foliant convert page.png --format json --out result.json --progress always
foliant convert scanned.pdf --pages 1,3-5 --format markdown --out result.md --progress always

convert is the main document-conversion command. It runs OCR for images or the supported scanned-PDF subset and renders Markdown or JSON. Markdown currently uses OCR line order by default.

Optional image-only experimental paths:

foliant convert page.png --layout --layout-max-tokens 1 --format markdown --out result.md
foliant convert table-page.png --tables --table-max-boxes 32 --format markdown --out result.md

--layout can attach layout blocks and use layout-aware Markdown when assignment is usable. --tables can attach fixture-limited table structure and render conservative Markdown pipe tables. These paths are disabled for PDFs.
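
For readers unfamiliar with Markdown pipe tables, the sketch below shows one way a grid of cell strings could be rendered into that form. It is a generic illustration, not Foliant's renderer: renderPipeTable and escapeCell are hypothetical names, and treating the first row as the header, padding short rows, and escaping pipe characters are assumptions about what a conservative rendering might do.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeCell keeps a cell on one line and escapes pipe characters so
// they are not read as column separators.
func escapeCell(s string) string {
	s = strings.ReplaceAll(s, "\n", " ")
	s = strings.ReplaceAll(s, "|", `\|`)
	return strings.TrimSpace(s)
}

// renderPipeTable renders rows as a Markdown pipe table. The first row
// is used as the header; short rows are padded with empty cells.
func renderPipeTable(rows [][]string) string {
	cols := 0
	for _, r := range rows {
		if len(r) > cols {
			cols = len(r)
		}
	}
	if cols == 0 {
		return ""
	}
	var b strings.Builder
	writeRow := func(cells []string) {
		b.WriteString("|")
		for i := 0; i < cols; i++ {
			cell := ""
			if i < len(cells) {
				cell = escapeCell(cells[i])
			}
			b.WriteString(" " + cell + " |")
		}
		b.WriteString("\n")
	}
	writeRow(rows[0])
	b.WriteString("|")
	for i := 0; i < cols; i++ {
		b.WriteString(" --- |")
	}
	b.WriteString("\n")
	for _, r := range rows[1:] {
		writeRow(r)
	}
	return b.String()
}

func main() {
	fmt.Print(renderPipeTable([][]string{
		{"Item", "Qty"},
		{"Bolt", "12"},
		{"Nut"},
	}))
}
```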

Render Existing OCR JSON

foliant render result.json --format markdown --out result.md

Renders existing foliant.ocr.v1 JSON to Markdown. This is useful when OCR has already been run and only the output format needs to change.

Layout and Table JSON

foliant layout page.png --out layout.json --max-tokens 1
foliant table table-crop.png --out table.json --max-boxes 32

layout emits experimental foliant.layout.v1 JSON for an image. table emits experimental foliant.table.v1 JSON for one full image treated as a table crop. Both commands are fixture-limited and should not be treated as broad Surya/Marker parity.

PDF Utilities

foliant pdf inspect document.pdf --out pdf.json
foliant pdf extract-images scanned.pdf --out-dir pages/ --pages 1,3-5

pdf inspect reports basic PDF metadata, page boxes, page count, rotation, and parser feature flags. pdf extract-images extracts page PNGs from the supported image-only scanned-PDF subset for debugging and fixture work.

Current Limitations

detect, ocr, layout, table, render, convert, pdf inspect, and pdf extract-images are experimental. Detection and recognition have selected opt-in fixture validation against upstream Surya, but broad full-size Surya parity is not claimed. Production speed is not claimed.

Image OCR supports PNG/JPEG inputs. PDF OCR/conversion supports only a constrained scanned-PDF subset where each page is represented by one full-page JPEG image XObject, a simple 8-bit FlateDecode DeviceRGB/DeviceGray image XObject with supported predictors, or an uncompressed 8-bit DeviceRGB/DeviceGray image XObject. Limited embedded PDF text extraction exists for simple fixtures.

Arbitrary PDFs, vector pages, full PDF rasterization, complex font/text extraction, forms, transparency, broad layout/table parity, PDF layout/table recognition, and broad Marker compatibility are not supported yet.

detect rejects very large images before full decode by default. Use --max-pixels to tune the limit for trusted inputs. Use --log-level error to suppress non-error warnings in scripted detection runs.

Use --progress auto|always|never on models fetch, models validate, detect, ocr, and convert to control dependency-free stage progress on stderr. auto is quiet for non-terminal writers so JSON/Markdown stdout remains machine-readable in tests and scripts; always is useful for long local CPU runs.

For local smoke/debug runs on the naive CPU backend, detect also has an experimental --processor-size override. Leaving it unset preserves the checkpoint's Surya processor size.
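
The --progress auto behavior described above depends on telling a terminal apart from a pipe or file. The sketch below shows one dependency-free way to do that with the standard library. It is an assumption about how such a check could work, not Foliant's code: isTerminal and progressEnabled are hypothetical names, and the character-device test is a heuristic (it also reports true for devices such as /dev/null).

```go
package main

import (
	"fmt"
	"os"
)

// isTerminal reports whether f looks like a character device, which is
// a common stdlib-only heuristic for "attached to a terminal".
func isTerminal(f *os.File) bool {
	info, err := f.Stat()
	if err != nil {
		return false
	}
	return info.Mode()&os.ModeCharDevice != 0
}

// progressEnabled maps a mode string (auto, always, never) to a
// decision for the given output file.
func progressEnabled(mode string, out *os.File) (bool, error) {
	switch mode {
	case "always":
		return true, nil
	case "never":
		return false, nil
	case "auto":
		return isTerminal(out), nil
	default:
		return false, fmt.Errorf("unknown progress mode %q", mode)
	}
}

func main() {
	on, err := progressEnabled("auto", os.Stderr)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	if on {
		fmt.Fprintln(os.Stderr, "progress: stage output enabled")
	}
}
```

Writing progress to stderr, as the README describes, keeps stdout free for JSON or Markdown regardless of the mode.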

Model Cache

By default Foliant uses the operating system user cache directory:

<user-cache-dir>/datalab/models/<model-name>/<version>

Example:

~/.cache/datalab/models/text_detection/2025_05_07

Use --model-cache to override the cache root.
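
The default location can be derived with os.UserCacheDir from the standard library. The sketch below builds the documented path layout. modelDir is a hypothetical helper, and the sketch assumes that the cache root overridden by --model-cache is the directory that directly contains the per-model folders.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// modelDir returns <root>/<name>/<version>. When cacheRoot is empty it
// falls back to <user-cache-dir>/datalab/models, as described above.
// Hypothetical helper; treating cacheRoot as the parent of the model
// directories is an assumption.
func modelDir(cacheRoot, name, version string) (string, error) {
	if cacheRoot == "" {
		base, err := os.UserCacheDir()
		if err != nil {
			return "", fmt.Errorf("locate user cache dir: %w", err)
		}
		cacheRoot = filepath.Join(base, "datalab", "models")
	}
	return filepath.Join(cacheRoot, name, version), nil
}

func main() {
	dir, err := modelDir("", "text_detection", "2025_05_07")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(dir)
}
```

On Linux, os.UserCacheDir returns $XDG_CACHE_HOME when it is set and $HOME/.cache otherwise, which matches the example path above.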

Foliant does not vendor model weights. Commands that need models either use the existing cache or download original upstream artifacts unless --no-download is available and set for that command. Release notes must point users to NOTICE.md before any model-backed workflow is described as usable outside local experiments.

Resource Expectations

The current CPU backend is correctness-oriented and expensive. Local opt-in fixtures have required multiple minutes and several GiB of RAM: the combined OCR/layout convert fixture reached about 5.4 GiB RSS, and the table CLI lossless fixture reached about 0.9 GiB RSS. Full-page OCR can take several minutes on a desktop CPU. Treat these numbers as development measurements, not production SLOs. Keep --max-pixels guards enabled for untrusted images and review docs/MEMORY.md plus docs/PERFORMANCE.md before publishing binaries.

Development

Required checks:

gofmt -w .
go test ./...
go vet ./...
golangci-lint run ./...
go test -race ./...
go build -o /tmp/foliant-build-check ./cmd/foliant
go list -m all
git diff --check

Large-model integration tests must be opt-in. Normal unit tests must not download Surya models. In sandboxed environments, use a writable Go cache such as GOCACHE=/tmp/foliant-gocache; use GOLANGCI_LINT_CACHE=/tmp/foliant-golangci-lint-cache for golangci-lint if the default cache is not writable.

Experimental Release Notes Template

Use this shape for any experimental release notes:

  • Version/build: include foliant version --json, commit, build date, target OS/arch, and Go version.
  • Supported scope: summarize only features marked implemented or partially validated in docs/COMPATIBILITY.md.
  • Fixture status: list opt-in fixtures that passed, with dates and resource notes when relevant.
  • Known gaps: broad Surya/Marker compatibility, production speed, arbitrary PDFs, full PDF rasterization/text extraction, broad layout/table parity, and model-license review.
  • Licensing/dependencies: link NOTICE.md and docs/DEPENDENCIES.md; state that model weights are not bundled.
  • Validation: include the exact go test, go vet, golangci-lint, go test -race, go build, go list -m all, and git diff --check commands used.

License

Project code license is pending. Model artifact license risks are documented in NOTICE.md.
