ocr-book — Book OCR Pipeline → Markdown

Digitizes an entire book into Markdown from page photos, PDFs, or EPUBs, using PaddleOCR-VL-1.5 via llama-server (local inference).

Prerequisites

miniforge or Anaconda
llama-server (Vulkan recommended on Windows)
GGUF model: PaddleOCR-VL-1.5-GGUF

Installation

python setup.py
conda activate ocr-livre

Then configure the paths to llama-server and the models. The easiest way is to copy .env.example to .env and edit it, but you can also use environment variables or CLI arguments — see docs/SETUP.md for all options.

cp .env.example .env
# Edit .env and set LLAMA_SERVER_PATH, MODEL_PATH and MMPROJ_PATH
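
As a quick sanity check before the first run, the sketch below parses .env-style text and reports which of the three variables named above are still unset. It is not the project's own loader (docs/SETUP.md describes that); the helper names parse_env and missing_keys are hypothetical.

```python
REQUIRED_KEYS = ("LLAMA_SERVER_PATH", "MODEL_PATH", "MMPROJ_PATH")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one layer of matching quotes around the value, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

Calling missing_keys(parse_env(open(".env").read())) should return an empty list once all three paths are filled in.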

Project Structure

ocr-livre/
├── src/
│   ├── main.py          # CLI entry point
│   ├── config.py        # Central configuration (dataclass)
│   ├── ocr_client.py    # OCR of an image via PaddleOCRVL
│   ├── postprocess.py   # OCR text cleanup
│   ├── obsidian.py      # Obsidian export (wikilinks, migration)
│   ├── images.py        # Image collection and renaming
│   ├── pipeline.py      # Full orchestration
│   ├── progress.py      # Logging and statistics
│   ├── pdf.py           # PDF processing (text extraction or render → OCR)
│   └── epub.py          # EPUB extraction (Pandoc-based)
├── docs/
│   ├── architecture/    # Architecture documentation
│   ├── dev/             # Patches and development notes
│   ├── SETUP.md         # Installation instructions
│   ├── tested.md        # Experiment results
│   └── issues.md        # Work in progress
├── photos/              # Source images (one per page)
├── output/              # Generated Markdown + logs + figures
├── environment.yml      # Conda dependencies
└── setup.py             # Automated installation script

Usage

Run from the project root:

# Default pipeline (reads photos from ./photos, writes output/book.md)
python main.py

# Specify folders
python main.py --images ./my_photos --out output/my_book.md

# PDF input
python main.py --images ./book.pdf --out output/book.md

# EPUB input
python main.py --images ./book.epub --out output/book.md

# Without layout detection
python main.py --no-layout

# Restart from the beginning
python main.py --no-resume

# Detailed logs
python main.py --verbose

# Dense tables — increase context if tables are truncated
python main.py --n-ctx 12288 --n-parallel 3

Example

A phone photo of a textbook page — charts, tables, and dense text — converted to clean Markdown in one command.

Left: original page photo. Right: the extracted Markdown, rendered.

PDF Processing

PDFs are automatically classified as text-based (native text layer) or image-based (scanned).

Text-based: text is extracted natively with pymupdf and figures are detected with the layout model; no VLM OCR is run.
Image-based: pages are rendered to images, then go through the normal OCR pipeline.
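
One plausible way to make this text-vs-image decision is to count the characters in each page's native text layer (for example len(page.get_text()) with pymupdf) and check how many pages carry real text. The sketch below shows that heuristic only; the function name and both thresholds are assumptions, not the values used in src/pdf.py.

```python
def classify_pdf(
    chars_per_page: list[int],
    min_chars: int = 50,      # assumed: fewer characters than this counts as "no text"
    min_ratio: float = 0.8,   # assumed: share of pages that must carry text
) -> str:
    """Return "text" if enough pages have a native text layer, else "image"."""
    if not chars_per_page:
        return "image"
    pages_with_text = sum(1 for n in chars_per_page if n >= min_chars)
    ratio = pages_with_text / len(chars_per_page)
    return "text" if ratio >= min_ratio else "image"
```

A scanned book typically yields zero or near-zero counts on every page, so it falls on the "image" side and goes through OCR.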

Choose the extraction method explicitly:

python main.py --images ./book.pdf --method text         # fast, native text only
python main.py --images ./book.pdf --method docling      # structured extraction
python main.py --images ./book.pdf --method paddleocrvl  # best quality, slowest

EPUB Extraction

EPUBs are converted to Markdown via Pandoc, with embedded figures extracted automatically.

python main.py --images ./book.epub --out output/book.md
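
For readers curious what a Pandoc-based conversion looks like, here is a minimal sketch that builds a Pandoc command line for an EPUB. The exact options used by src/epub.py are not documented here, so the output format (gfm) and the function name pandoc_command are assumptions; --extract-media is the standard Pandoc option that writes embedded images to a directory.

```python
from pathlib import Path


def pandoc_command(epub: Path, out_md: Path, media_dir: Path) -> list[str]:
    """Build a Pandoc invocation that converts an EPUB to Markdown."""
    return [
        "pandoc",
        epub.as_posix(),
        "--from", "epub",
        "--to", "gfm",  # assumed output flavour (GitHub-Flavored Markdown)
        f"--extract-media={media_dir.as_posix()}",  # save embedded figures here
        "--output", out_md.as_posix(),
    ]
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where Pandoc is installed.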

Obsidian Export

In obsidian mode, the pipeline:

converts figures to wikilinks ![[Files/image.jpg]]
saves the .md directly into the vault
copies figures to vault_path/vault_figures_dir/
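
The wikilink conversion in the first step can be sketched with a regular expression. This is an illustration under assumptions, not the code in src/obsidian.py: the function name to_wikilinks is hypothetical, and "Files" stands in for whatever vault_figures_dir is set to.

```python
import re
from pathlib import PurePosixPath

# Matches standard Markdown images: ![alt](path) with an optional "title".
MD_IMAGE = re.compile(r'!\[[^\]]*\]\(([^)\s]+)(?:\s+"[^"]*")?\)')


def to_wikilinks(markdown: str, figures_dir: str = "Files") -> str:
    """Rewrite ![alt](some/path/img.jpg) as ![[figures_dir/img.jpg]]."""

    def repl(match: re.Match) -> str:
        # Keep only the file name; the vault folder replaces the original path.
        name = PurePosixPath(match.group(1)).name
        return f"![[{figures_dir}/{name}]]"

    return MD_IMAGE.sub(repl, markdown)
```

Existing wikilinks do not match the pattern, so running the function twice leaves the text unchanged.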

Configure vault_path and vault_figures_dir in config.py, then:

# Full OCR + obsidian export
python main.py --mode obsidian

# Re-apply obsidian postprocess without re-running OCR
python main.py --mode obsidian --postprocess-only

# Migrate figures to the vault only
python main.py --migrate

Image Renaming

# Preview without modifying
python main.py --rename --dry-run

# Rename for real (→ page_001.jpg, page_002.jpg, …)
python main.py --rename

# Rename without running OCR
python main.py --rename-only

# Process subfolders by chapter
python main.py --rename-only --chapters "Chapter 1" "Chapter 2"
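
The page_001.jpg naming scheme and the dry-run behaviour can be illustrated as follows. This is a sketch under assumptions: it sorts file names lexicographically and accepts only a few image suffixes, which may differ from what src/images.py does, and the names plan_renames and apply_renames are hypothetical.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of accepted extensions


def plan_renames(names: list[str], prefix: str = "page", start: int = 1) -> list[tuple[str, str]]:
    """Pair each image name (sorted) with its prefix_NNN target name."""
    images = sorted(n for n in names if Path(n).suffix.lower() in IMAGE_SUFFIXES)
    return [
        (src, f"{prefix}_{start + i:03d}{Path(src).suffix.lower()}")
        for i, src in enumerate(images)
    ]


def apply_renames(folder: Path, plan: list[tuple[str, str]], dry_run: bool = True) -> None:
    """Print the plan; touch the disk only when dry_run is False."""
    for src, dst in plan:
        print(f"{src} -> {dst}")
        if dry_run or src == dst:
            continue
        if (folder / dst).exists():
            # Refuse to overwrite an existing file.
            raise FileExistsError(dst)
        (folder / src).rename(folder / dst)
```

Computing the plan separately from applying it is what makes a preview cheap: the same list is printed in both modes, and only the final rename call is skipped.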

Automatic Resume

If the pipeline is interrupted, simply re-run:

python main.py

Already processed pages are automatically skipped.
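
Resume logic of this kind usually boils down to remembering which pages are finished and filtering them out on the next run. The sketch below assumes a small JSON state file holding a list of page names; the file format and the function names are hypothetical, and the project may track progress differently.

```python
import json
from pathlib import Path


def load_done(state_file: Path) -> set[str]:
    """Read the set of finished pages; empty if no state file exists yet."""
    if not state_file.exists():
        return set()
    return set(json.loads(state_file.read_text(encoding="utf-8")))


def pending_pages(pages: list[str], done: set[str]) -> list[str]:
    """Keep the original page order, dropping pages already processed."""
    return [page for page in pages if page not in done]


def mark_done(state_file: Path, done: set[str], page: str) -> None:
    """Record a finished page right away so an interruption loses little work."""
    done.add(page)
    state_file.write_text(json.dumps(sorted(done)), encoding="utf-8")
```

Under this scheme, restarting from the beginning amounts to ignoring or deleting the state file.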

Full Options

--images PATH              Photo folder, PDF, or EPUB       (default: ./photos)
--out FILE                 Output Markdown file             (default: output/book.md)
--llama-server PATH        Path to llama-server executable  (env: LLAMA_SERVER_PATH)
--model PATH               Path to model .gguf              (env: MODEL_PATH)
--mmproj PATH              Path to mmproj .gguf             (env: MMPROJ_PATH)
--mode {base,obsidian}     Output mode                      (default: base)
--method {text,docling,paddleocrvl}  PDF extraction method  (default: paddleocrvl)
--no-layout                Disable layout detection
--no-resume                Restart from the beginning
--no-postprocess           Raw output without cleanup
--postprocess-only         Obsidian postprocess without OCR  (requires --mode obsidian)
--migrate                  Copy figures to the vault        (requires vault_path configured)
--dry-run                  Simulate without modifying
--verbose                  DEBUG logs
--rename                   Rename images before OCR
--rename-only [N]          Rename without running OCR       (N = starting number)
--rename-prefix P          Rename prefix                    (default: page)
--chapters NAME…           Subfolders to process (in order)
--dir-level                Folder-level order for --rename
--max-tokens N             Max tokens generated per page    (default: 4096)
--n-ctx N                  KV cache size (context window)   (default: 6144)
--n-parallel N             Intra-page parallel slots        (default: 3)

Exit Codes

Code  Meaning
0     Full success
1     Fatal error
2     Finished with errors on some pages
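
These codes make it easy to drive the pipeline from another script. A minimal wrapper sketch, assuming it is run from the project root; describe_exit and run_pipeline are hypothetical helper names:

```python
import subprocess
import sys

EXIT_MEANINGS = {
    0: "full success",
    1: "fatal error",
    2: "finished with errors on some pages",
}


def describe_exit(code: int) -> str:
    """Translate a pipeline exit code into the meaning listed above."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_pipeline(extra_args: list[str]) -> int:
    """Run main.py with the current interpreter and report how it ended."""
    result = subprocess.run([sys.executable, "main.py", *extra_args])
    print(describe_exit(result.returncode))
    return result.returncode
```

A batch job could, for instance, re-run a book on code 2 (already processed pages are skipped) and stop on code 1.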
