akmalayari/ocr-book

ocr-book — Book OCR Pipeline → Markdown

Digitizes an entire book into Markdown from page photos, PDFs, or EPUBs, using PaddleOCR-VL-1.5 via llama-server (local inference).


Prerequisites


Installation

python setup.py
conda activate ocr-livre

Then configure the paths to llama-server and the models. The easiest way is to copy .env.example to .env and edit it, but you can also use environment variables or CLI arguments — see docs/SETUP.md for all options.

cp .env.example .env
# Edit .env and set LLAMA_SERVER_PATH, MODEL_PATH and MMPROJ_PATH
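The three configuration sources could be combined roughly as follows. This is only a sketch: the precedence order (CLI argument, then environment variable, then .env) is an assumption, the helper names `parse_dotenv` and `resolve_setting` are hypothetical, and the paths are placeholders. See docs/SETUP.md for the project's actual behaviour.

```python
import os


def parse_dotenv(text):
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_setting(name, cli_value=None, environ=None, dotenv=None):
    """Return the first value found: CLI argument, environment, then .env.

    The precedence order is an assumption made for this sketch.
    """
    environ = os.environ if environ is None else environ
    dotenv = dotenv or {}
    if cli_value:
        return cli_value
    return environ.get(name) or dotenv.get(name)


# Placeholder paths, for illustration only.
dotenv = parse_dotenv(
    "# hypothetical paths\n"
    "LLAMA_SERVER_PATH=/opt/llama.cpp/llama-server\n"
    'MODEL_PATH="/models/model.gguf"\n'
)
print(resolve_setting("MODEL_PATH", environ={}, dotenv=dotenv))
```

With an empty environment and no CLI value, the lookup falls through to the .env entry and prints `/models/model.gguf`.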

Project Structure

ocr-livre/
├── src/
│   ├── main.py          # CLI entry point
│   ├── config.py        # Central configuration (dataclass)
│   ├── ocr_client.py    # OCR of an image via PaddleOCRVL
│   ├── postprocess.py   # OCR text cleanup
│   ├── obsidian.py      # Obsidian export (wikilinks, migration)
│   ├── images.py        # Image collection and renaming
│   ├── pipeline.py      # Full orchestration
│   ├── progress.py      # Logging and statistics
│   ├── pdf.py           # PDF processing (text extraction or render → OCR)
│   └── epub.py          # EPUB extraction (Pandoc-based)
├── docs/
│   ├── architecture/    # Architecture documentation
│   ├── dev/             # Patches and development notes
│   ├── SETUP.md         # Installation instructions
│   ├── tested.md        # Experiment results
│   └── issues.md        # Work in progress
├── photos/              # Source images (one per page)
├── output/              # Generated Markdown + logs + figures
├── environment.yml      # Conda dependencies
└── setup.py             # Automated installation script

Usage

Run from the project root:

# Default pipeline (photos in ./photos, output to output/book.md)
python main.py

# Specify folders
python main.py --images ./my_photos --out output/my_book.md

# PDF input
python main.py --images ./book.pdf --out output/book.md

# EPUB input
python main.py --images ./book.epub --out output/book.md

# Without layout detection
python main.py --no-layout

# Restart from the beginning
python main.py --no-resume

# Detailed logs
python main.py --verbose

# Dense tables — increase context if tables are truncated
python main.py --n-ctx 12288 --n-parallel 3

Example

A phone photo of a textbook page — charts, tables, and dense text — converted to clean Markdown in one command.

[Figure: OCR before/after. Left: original page photo. Right: extracted Markdown, rendered.]


PDF Processing

PDFs are automatically classified as text-based (native text layer) or image-based (scanned).

  • Text-based: extracts text natively with pymupdf and detects figures with the layout model; no VLM OCR is run.
  • Image-based: renders pages to images, then runs the normal OCR pipeline.
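One way to picture the classification step is a per-page character count. The sketch below is an illustration under assumptions, not the project's actual heuristic: the function names and the thresholds `min_chars` and `min_text_ratio` are hypothetical. The PyMuPDF helper needs the pymupdf package and is not called in the demo.

```python
def classify_pdf(chars_per_page, min_chars=50, min_text_ratio=0.8):
    """Label a PDF 'text' when most pages carry a usable text layer.

    chars_per_page: number of extracted characters for each page.
    min_chars and min_text_ratio are illustrative thresholds.
    """
    if not chars_per_page:
        return "image"
    text_pages = sum(1 for n in chars_per_page if n >= min_chars)
    ratio = text_pages / len(chars_per_page)
    return "text" if ratio >= min_text_ratio else "image"


def count_chars_with_pymupdf(pdf_path):
    """Collect per-page character counts with PyMuPDF (imported lazily)."""
    import pymupdf  # older PyMuPDF releases expose this module as `fitz`

    with pymupdf.open(pdf_path) as doc:
        return [len(page.get_text().strip()) for page in doc]


print(classify_pdf([1200, 950, 0, 1100, 1300]))  # 4 of 5 pages have text
print(classify_pdf([0, 3, 0, 12]))               # scanned pages, no text layer
```

The first call prints `text` and the second prints `image`. Keeping the decision in a pure function makes the thresholds easy to tune without opening a PDF.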

Choose the extraction method explicitly:

python main.py --images ./book.pdf --method text         # fast, native text only
python main.py --images ./book.pdf --method docling      # structured extraction
python main.py --images ./book.pdf --method paddleocrvl  # best quality, slowest

EPUB Extraction

EPUBs are converted to Markdown via Pandoc, with embedded figures extracted automatically.

python main.py --images ./book.epub --out output/book.md
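For orientation, a Pandoc call that does this kind of conversion might be assembled as below. This is a sketch only: epub.py may invoke Pandoc differently, and `build_pandoc_command` and the `output/figures` directory are hypothetical. The Pandoc options used (`--from`, `--to`, `--extract-media`, `--output`) are standard.

```python
def build_pandoc_command(epub_path, out_md, media_dir):
    """Build a Pandoc invocation that converts an EPUB to Markdown.

    --extract-media writes embedded images to media_dir and rewrites
    the image links in the output to point there.
    """
    return [
        "pandoc",
        str(epub_path),
        "--from", "epub",
        "--to", "markdown",
        f"--extract-media={media_dir}",
        "--output", str(out_md),
    ]


cmd = build_pandoc_command("book.epub", "output/book.md", "output/figures")
print(" ".join(cmd))
# To execute it: subprocess.run(cmd, check=True)  (requires pandoc on PATH)
```

Building the command as a list rather than a shell string avoids quoting problems with paths that contain spaces.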

Obsidian Export

In obsidian mode, the pipeline:

  • converts figures to wikilinks ![[Files/image.jpg]]
  • saves the .md directly into the vault
  • copies figures to vault_path/vault_figures_dir/

Configure vault_path and vault_figures_dir in config.py, then:

# Full OCR + obsidian export
python main.py --mode obsidian

# Re-apply obsidian postprocess without re-running OCR
python main.py --mode obsidian --postprocess-only

# Migrate figures to the vault only
python main.py --migrate
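The wikilink conversion listed above can be sketched with a regular expression. This is a minimal illustration, assuming figures appear in the OCR output as standard Markdown image links. The name `to_wikilinks` and the default folder `Files` (taken from the example above) are assumptions; obsidian.py may work differently.

```python
import re
from pathlib import PurePosixPath

# Matches standard Markdown images: ![alt](path "optional title")
IMAGE_LINK = re.compile(r'!\[[^\]]*\]\(\s*([^)\s]+)(?:\s+"[^"]*")?\s*\)')


def to_wikilinks(markdown, figures_dir="Files"):
    """Rewrite Markdown image links as Obsidian embeds under figures_dir."""
    def replace(match):
        # Keep only the file name; the vault folder replaces the old path.
        filename = PurePosixPath(match.group(1)).name
        return f"![[{figures_dir}/{filename}]]"
    return IMAGE_LINK.sub(replace, markdown)


print(to_wikilinks("See ![Figure 1](figures/image.jpg) below."))
# See ![[Files/image.jpg]] below.
```

The alt text is dropped because Obsidian embeds do not use it in this form. Remote image URLs would be rewritten too, so a real implementation would probably skip them.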

Image Renaming

# Preview without modifying
python main.py --rename --dry-run

# Rename for real (→ page_001.jpg, page_002.jpg, …)
python main.py --rename

# Rename without running OCR
python main.py --rename-only

# Process subfolders by chapter
python main.py --rename-only --chapters "Chapter 1" "Chapter 2"
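The numbering scheme and the dry-run behaviour can be sketched as a two-step plan-then-apply routine. The function names, the accepted suffixes, and the sorted-by-name ordering are assumptions for illustration; images.py may order and filter files differently.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # illustrative subset


def plan_renames(folder, prefix="page", start=1):
    """Return (old, new) path pairs numbering images in sorted-name order."""
    images = sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    return [
        (p, p.with_name(f"{prefix}_{i:03d}{p.suffix.lower()}"))
        for i, p in enumerate(images, start=start)
    ]


def apply_renames(plan, dry_run=True):
    """Print each rename; touch the disk only when dry_run is False."""
    for old, new in plan:
        print(f"{old.name} -> {new.name}")
        if not dry_run and old != new:
            old.rename(new)


# Demo in a throwaway directory: non-image files are ignored.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("IMG_0008.JPG", "IMG_0007.JPG", "notes.txt"):
        (Path(tmp) / name).touch()
    plan = plan_renames(tmp)
    apply_renames(plan, dry_run=True)
    new_names = [new.name for _, new in plan]
```

This sketch does not guard against a target name that already exists in the folder; a robust version would rename through temporary names first.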

Automatic Resume

If the pipeline is interrupted, simply re-run:

python main.py

Already processed pages are automatically skipped.
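A resume mechanism of this kind can be sketched with a small state file. How the pipeline actually records finished pages is not described here, so the JSON file, its name `progress.json`, and both function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_done(state_file):
    """Return the set of page names already recorded as processed."""
    path = Path(state_file)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def run_with_resume(pages, state_file, process_page):
    """Process pages in order, skipping any recorded in state_file."""
    done = load_done(state_file)
    processed = []
    for page in pages:
        if page in done:
            continue
        process_page(page)
        done.add(page)
        # Persist after every page so an interruption loses at most one page.
        Path(state_file).write_text(json.dumps(sorted(done)), encoding="utf-8")
        processed.append(page)
    return processed


with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "progress.json"
    pages = ["page_001.jpg", "page_002.jpg", "page_003.jpg"]
    # First run stops after two pages; the second run picks up the rest.
    first_run = run_with_resume(pages[:2], state, lambda p: None)
    second_run = run_with_resume(pages, state, lambda p: None)
    print(second_run)
```

The second run prints `['page_003.jpg']`, since the first two pages were already recorded. Deleting the state file would correspond to what `--no-resume` is described as doing.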


Full Options

--images PATH              Photo folder, PDF, or EPUB       (default: ./photos)
--out FILE                 Output Markdown file             (default: output/book.md)
--llama-server PATH        Path to llama-server executable  (env: LLAMA_SERVER_PATH)
--model PATH               Path to model .gguf              (env: MODEL_PATH)
--mmproj PATH              Path to mmproj .gguf             (env: MMPROJ_PATH)
--mode {base,obsidian}     Output mode                      (default: base)
--method {text,docling,paddleocrvl}  PDF extraction method  (default: paddleocrvl)
--no-layout                Disable layout detection
--no-resume                Restart from the beginning
--no-postprocess           Raw output without cleanup
--postprocess-only         Obsidian postprocess without OCR  (requires --mode obsidian)
--migrate                  Copy figures to the vault        (requires vault_path configured)
--dry-run                  Simulate without modifying
--verbose                  DEBUG logs
--rename                   Rename images before OCR
--rename-only [N]          Rename without running OCR       (N = starting number)
--rename-prefix P          Rename prefix                    (default: page)
--chapters NAME…           Subfolders to process (in order)
--dir-level                Folder-level order for --rename
--max-tokens N             Max tokens generated per page    (default: 4096)
--n-ctx N                  KV cache size (context window)   (default: 6144)
--n-parallel N             Intra-page parallel slots        (default: 3)
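As a reading aid, a few of the options above map onto argparse as follows. This is a partial sketch reconstructed from the option table, not the contents of main.py. How the project enforces the "requires" constraints is an assumption.

```python
import argparse


def build_parser():
    """Declare a subset of the documented options with their defaults."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--images", default="./photos")
    parser.add_argument("--out", default="output/book.md")
    parser.add_argument("--mode", choices=["base", "obsidian"], default="base")
    parser.add_argument(
        "--method",
        choices=["text", "docling", "paddleocrvl"],
        default="paddleocrvl",
    )
    parser.add_argument("--no-resume", action="store_true")
    parser.add_argument("--postprocess-only", action="store_true")
    parser.add_argument("--max-tokens", type=int, default=4096)
    parser.add_argument("--n-ctx", type=int, default=6144)
    parser.add_argument("--n-parallel", type=int, default=3)
    return parser


def parse(argv):
    """Parse argv and check one documented dependency between options."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.postprocess_only and args.mode != "obsidian":
        # Assumed enforcement of "(requires --mode obsidian)".
        parser.error("--postprocess-only requires --mode obsidian")
    return args


args = parse(["--images", "./book.pdf", "--method", "text"])
print(args.images, args.method, args.n_ctx)
```

The demo prints `./book.pdf text 6144`: the two values given on the command line plus the documented default for `--n-ctx`.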

Exit Codes

Code  Meaning
0     Full success
1     Fatal error
2     Finished with errors on some pages
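A wrapper script can branch on these codes. The sketch below assumes it is launched from the project root; `describe_exit` and `run_pipeline` are hypothetical helper names, and the meanings are copied from the table above.

```python
import subprocess
import sys

EXIT_MEANINGS = {
    0: "Full success",
    1: "Fatal error",
    2: "Finished with errors on some pages",
}


def describe_exit(code):
    """Translate a pipeline exit code into its documented meaning."""
    return EXIT_MEANINGS.get(code, f"Unknown exit code {code}")


def run_pipeline(extra_args):
    """Run main.py with extra arguments and return (code, meaning)."""
    result = subprocess.run([sys.executable, "main.py", *extra_args])
    return result.returncode, describe_exit(result.returncode)


print(describe_exit(2))
```

Code 2 is worth handling separately from code 1: the Markdown output exists, but some pages may need a second pass.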
