omrsangx/PDF-Reflow

PDF Reflow

Turn a fixed-layout PDF into a clean, reflowable reading experience in the browser — then export it back out as self-contained HTML or a reflowable PDF.

Most PDFs are built for print: rigid columns, fixed page breaks, pixel-pinned figures. That layout falls apart on phones and is painful to read at any zoom level other than the one the author picked. PDF Reflow extracts the semantic content (headings, paragraphs, figures, tables, equations, code, lists) and re-renders it as a responsive single-column document with light/dark mode, proper math typesetting, and mobile-friendly typography.

What it does

  • Uploads a PDF via a drag-and-drop web UI (100 MB limit, validated by magic bytes).
  • Extracts structured content using a multi-stage pipeline:
    • Layout and reading order via Docling, with a PyMuPDF fallback.
    • Images with bounding boxes and dedup via PyMuPDF.
    • Math regions detected and OCR'd to LaTeX via Surya (LayoutPredictor + TexifyPredictor).
    • Spatial merge (IOU-based bbox matching) that stitches images and equations back into Docling's reading order.
  • Caches results on disk keyed by SHA-256 of the PDF, so re-uploading the same file is instant.
  • Renders the structured document in a reader view with KaTeX for math, lazy-loaded images, horizontally scrollable tables, and responsive typography.
  • Exports as either a truly self-contained HTML file (images, KaTeX CSS/JS, and KaTeX fonts all inlined as base64 — no network dependency) or a reflowable PDF via WeasyPrint.
  • Degrades gracefully: if Surya isn't available math is skipped; if Docling crashes the pipeline falls back to plain-text PyMuPDF; if WeasyPrint is missing HTML export still works.
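
The upload check and cache key from the list above can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `validate_pdf` and `cache_key` are hypothetical, the check assumes the `%PDF-` signature sits at offset 0, and the 100 MB limit is assumed to mean 100 × 1024 × 1024 bytes.

```python
import hashlib

PDF_MAGIC = b"%PDF-"                   # signature at the start of a PDF file
MAX_UPLOAD_BYTES = 100 * 1024 * 1024   # assumed interpretation of "100 MB"


def validate_pdf(data: bytes) -> None:
    """Raise ValueError if the payload is too large or does not look like a PDF."""
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the upload limit")
    if not data.startswith(PDF_MAGIC):
        raise ValueError("file does not start with the %PDF- signature")


def cache_key(data: bytes) -> str:
    """Content-addressed key: identical bytes always produce the same key."""
    return hashlib.sha256(data).hexdigest()
```

Because the key depends only on the file's bytes, renaming a PDF and uploading it again would still hit the same cache entry.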

Tech stack

Backend — Python 3.10+

Frontend — no build step

  • Vanilla HTML/CSS/JS (frontend/index.html, style.css, upload.js, reader.js)
  • KaTeX (vendored under frontend/vendor/katex/) for LaTeX rendering
  • CSS variables for light/dark theme, responsive at 768px breakpoint

Architecture

PDF File
  │
  ▼
FastAPI backend
  │
  ├─ 1. Docling         ─ layout, headings, paragraphs, tables, reading order
  ├─ 2. PyMuPDF         ─ images with bounding boxes
  ├─ 3. Surya           ─ math region detection + LaTeX OCR
  └─ 4. Merge (IOU)     ─ spatial match to unify all three
  │
  ▼
DocumentResult (JSON)  ──►  disk cache (SHA-256 keyed)
  │
  ▼
Vanilla-JS reader        ──►  HTML export (self-contained)
                         └──► PDF export (WeasyPrint)

Each pipeline stage runs in asyncio.to_thread() and is wrapped in try/except so a single failing extractor doesn't take down the rest.
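
A rough sketch of that isolation pattern, assuming each extractor is a blocking function. The helper `run_stage` and the two stand-in extractors are hypothetical and exist only to show the shape; the real orchestrator lives in pipeline.py.

```python
import asyncio
from typing import Any, Callable, Optional


async def run_stage(name: str, fn: Callable[..., Any], *args: Any) -> Optional[Any]:
    """Run a blocking extractor in a worker thread; return None if it fails."""
    try:
        return await asyncio.to_thread(fn, *args)
    except Exception as exc:  # contain the failure to this one stage
        print(f"stage {name!r} failed: {exc}")
        return None


# Hypothetical stand-ins for the real extractors.
def extract_layout(path: str) -> list:
    return [{"type": "paragraph", "text": "hello"}]


def extract_math(path: str) -> list:
    raise RuntimeError("math extractor unavailable")


async def process(path: str) -> dict:
    layout = await run_stage("layout", extract_layout, path)
    math = await run_stage("math", extract_math, path)
    # A failed stage contributes an empty list instead of aborting the run.
    return {"layout": layout or [], "math": math or []}
```

Calling `asyncio.run(process("paper.pdf"))` here would still return the layout blocks even though the math stage raised.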

API

The server exposes /api/upload, /api/status/{id}, /api/document/{id}, /api/images/{id}/{img}, /api/export/{id}/html, /api/export/{id}/pdf, and /api/health. The frontend polls /api/status/{id} once per second while processing.

See docs/architecture.md for the full endpoint contract, data model, progress stages, merge algorithm, and cache layout.
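
For scripted clients, the once-per-second polling described above can be approximated with a small helper. This is only a sketch: `poll_until` is a hypothetical name, and the status payload's shape is deliberately left to the caller (via `fetch` and `is_done`) because the response schema is defined in docs/architecture.md, not here.

```python
import time
from typing import Callable


def poll_until(fetch: Callable[[], dict],
               is_done: Callable[[dict], bool],
               interval: float = 1.0,
               timeout: float = 900.0,
               sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch() every `interval` seconds until is_done(status) or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch()
        if is_done(status):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("processing did not finish in time")
        sleep(interval)
```

In practice `fetch` would issue a GET to /api/status/{id} and decode the JSON body, and `is_done` would inspect whichever field the server uses to signal completion.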

Quick start

  1. Create a venv and install Python deps (cross-platform — macOS, Linux, Windows):
    python3.12 -m venv venv
    source venv/bin/activate        # Windows: .\venv\Scripts\Activate.ps1
    pip install -r requirements.txt
  2. Install WeasyPrint's system libraries (Pango/Cairo/GDK-PixBuf) if you want PDF export. Per-OS commands are in docs/setup.md §3.
  3. Run the server:
    cd backend
    uvicorn main:app --host 127.0.0.1 --port 8000 --reload
  4. Open http://127.0.0.1:8000 and drop a PDF on the upload zone.

First run downloads ~1 GB of ML models from HuggingFace (cached after that) and can take ~10 minutes on CPU. Subsequent documents typically process in 30–60 s.

Full setup — including per-OS system deps, troubleshooting, environment variables, and running the test suite — lives in docs/setup.md.

Project structure

pdf_reflow/
├── backend/
│   ├── main.py              # FastAPI app, endpoints, capability detection
│   ├── models.py            # Pydantic ContentBlock union + DocumentResult
│   ├── models_registry.py   # Lazy singletons for Docling/Surya models
│   ├── security.py          # API-key dependency + CORS env parsing
│   ├── pipeline.py          # Async orchestrator with progress callbacks
│   ├── cache.py             # SHA-256 keyed on-disk cache with dedup
│   ├── exporter.py          # HTML and WeasyPrint PDF export
│   └── extractors/
│       ├── layout.py        # Docling primary + PyMuPDF fallback
│       ├── images.py        # PyMuPDF image extraction with dedup/size limits
│       ├── math_extract.py  # Surya layout + texify for LaTeX OCR
│       └── merge.py         # IOU-based spatial merge
├── frontend/
│   ├── index.html           # Single-page app shell
│   ├── style.css            # Theme variables, responsive typography
│   ├── upload.js            # Drag-and-drop + status polling
│   ├── reader.js            # Block dispatch, KaTeX, export menu
│   └── vendor/katex/        # Vendored KaTeX (CSS, JS, fonts)
├── docs/
│   ├── setup.md             # Cross-platform install and env config
│   └── architecture.md      # Deep dive: pipeline, data model, merge, caching
├── tests/                   # pytest suite (unit + integration markers)
├── requirements.txt
├── requirements-dev.txt
├── pytest.ini
├── uploads/                 # Incoming PDFs (auto-created, .gitignored)
└── cache/                   # Processed results + extracted images
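
The IOU-based matching attributed to merge.py above can be illustrated with a generic intersection-over-union sketch. The function names, the `(x0, y0, x1, y1)` box layout, and the 0.5 threshold are assumptions made for this example; the project's actual merge rules are described in docs/architecture.md.

```python
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def iou(a: BBox, b: BBox) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_match(target: BBox, candidates: List[BBox],
               threshold: float = 0.5) -> Optional[int]:
    """Index of the candidate that overlaps `target` most, or None if below threshold."""
    best_idx, best_score = None, threshold
    for i, cand in enumerate(candidates):
        score = iou(target, cand)
        if score >= best_score:
            best_idx, best_score = i, score
    return best_idx
```

A matcher like this lets an image or equation box found by one extractor be attached to the layout block it overlaps most, while unmatched boxes can be handled separately.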

Caveats

  • CPU is slow for math. Surya runs on CPU if no GPU is detected — expect ~10 min for the first document including model download. GPU cuts this dramatically.
  • Encrypted PDFs are rejected. Remove the password first with another tool.
  • Local/dev defaults. CORS is * and no API key is required unless you set CORS_ORIGINS and API_KEY env vars. Rate limiting is not in-process — front it with a reverse proxy in production.
  • Inline math placement is approximate. Inline equations are appended to the containing paragraph rather than inserted at the precise character offset. See the Known Limitations section in docs/architecture.md.
  • No frontend virtualization. Very large documents (500+ pages) may be slow to render in the reader.
  • torch / numpy pins are load-bearing. numpy<2 and surya-ocr 0.11.x are intentional — see docs/setup.md for why.
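
The environment-driven defaults mentioned in the list above might be read along these lines. This is a hedged sketch: the helper names are hypothetical, and treating CORS_ORIGINS as a comma-separated list is an assumption; docs/setup.md is the authority on the real format.

```python
import os
from typing import List, Mapping, Optional


def parse_cors_origins(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return allowed origins, falling back to the wildcard when unset."""
    raw = env.get("CORS_ORIGINS", "")
    # Assumed format: comma-separated origins, whitespace ignored.
    origins = [o.strip() for o in raw.split(",") if o.strip()]
    return origins or ["*"]


def required_api_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the configured API key, or None when no key is required."""
    return env.get("API_KEY") or None
```

With neither variable set, this yields `["*"]` and `None`, matching the local/dev defaults described above.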

Further reading

  • docs/architecture.md — API schemas, pipeline internals, merge algorithm, graceful-degradation matrix, performance characteristics, known limitations.
  • docs/setup.md — cross-platform install, WeasyPrint system deps, environment variables, running tests, troubleshooting.

License

The source code written for this project is released under the MIT License (see LICENSE at the repository root).

⚠️ Important — not MIT end-to-end. This project depends on PyMuPDF (AGPL-3.0) and surya-ocr (GPL-3.0-or-later) at runtime. While every file I wrote is MIT, any combined distribution (Docker image, release archive with a bundled venv/, hosted SaaS, published wheel including dependencies) inherits the strongest copyleft terms among its components — in this case AGPL-3.0. See Copyleft dependency note below for what that means in practice and how to opt out.

Third-party licenses

PDF Reflow depends on the components in the table below, and their licenses apply to any combined or redistributed build of the application. Transitive dependencies (torch, transformers, tokenizers, huggingface_hub, starlette, anyio, click, safetensors, etc.), some of which also appear in the table, are all permissive (BSD / Apache-2.0 / MIT). Full per-package detail is available via pip show <pkg> or by inspecting the LICENSE files under venv/lib/.../*.dist-info/.

| Component | License | Notes |
| --- | --- | --- |
| FastAPI, Pydantic, Docling (+ docling-core / docling-parse / docling-ibm-models) | MIT | Permissive |
| Uvicorn, NumPy, WeasyPrint, httpx, starlette, click, torch | BSD-3-Clause | Permissive |
| python-multipart, pytest-asyncio, transformers, huggingface_hub | Apache-2.0 | Permissive |
| anyio, pytest | MIT | Permissive (pytest is dev-only) |
| Pillow | HPND | Permissive, MIT-compatible |
| KaTeX (vendored under frontend/vendor/katex/) | MIT | Redistributed. KaTeX's LICENSE file ships alongside the vendored code (see frontend/vendor/katex/LICENSE) |
| KaTeX fonts (vendored under frontend/vendor/katex/fonts/) | MIT | Redistributed. The katex-fonts repo's LICENSE ships alongside the fonts (see frontend/vendor/katex/fonts/LICENSE). Khan Academy relicensed the fonts from SIL OFL 1.1 to MIT in 2018. |
| PyMuPDF | AGPL-3.0-or-later (or paid Artifex commercial license) | Strong copyleft, see below |
| surya-ocr | GPL-3.0-or-later | Strong copyleft, required for math-equation detection |
| Pango / Cairo / GDK-PixBuf | LGPL-2.1+ | System libraries, installed by the user for WeasyPrint. Not bundled. Dynamic linking against LGPL libraries is compatible with any license. |

Copyleft dependency note

This project intentionally keeps PyMuPDF and surya-ocr in its runtime dependencies because the closest permissive alternatives (pypdfium2, pdfplumber, pypdf) each give up a capability that noticeably hurts output quality for this use case — most importantly, lossless extraction of embedded image bytes and rich text-block structure. The tradeoff is made deliberately: better extraction quality, at the cost of inheriting copyleft terms for any combined distribution.

What this means for you if you use this project:

  • Cloning, running locally, modifying for personal/internal use — no special obligations beyond MIT.
  • Publishing the source to your own GitHub fork: no special obligations. Source-only redistribution of your own MIT-licensed code doesn't trigger the AGPL/GPL obligations, because you're not distributing the copyleft libraries themselves (pip fetches those from PyPI under their own licenses).
  • Distributing a prebuilt artifact that bundles the deps (Docker images, PyInstaller binaries, release zips containing venv/, wheels with bundled copyleft code) — the combined work is subject to AGPL-3.0 (because PyMuPDF is AGPL and AGPL is viral across combined works). You must offer the complete corresponding source of the combined work, preserve all notices, and license the whole under AGPL-3.0.
  • Hosting as a public network service — AGPL-3.0 §13 requires you to offer source to every user who interacts with the running service. In practice this means a visible "Source" link in the UI pointing to the exact code being run.

How to ship a permissive-only build (if needed)

If you need to distribute this project without AGPL/GPL obligations:

  1. Replace PyMuPDF with pypdfium2 (Apache-2.0 + BSD-3-Clause). Your stack becomes permissive end-to-end, but the migration is non-trivial — image extraction needs to re-encode pixel data, which costs ~1 s per PDF and can introduce mild generation loss on JPEG figures. The PyMuPDF fallback in layout.py loses structured text-block parsing, so you'd either reimplement paragraph clustering or accept "Docling or bust."
  2. Make surya-ocr optional. Move it out of requirements.txt into an extras file (requirements-math.txt) or a pyproject.toml extras group (pip install pdf-reflow[math]). The code already degrades gracefully when Surya is absent (main.py:56-60).
  3. Obtain a commercial PyMuPDF license from Artifex (https://artifex.com/licensing/) to release you from AGPL.
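
Making a dependency optional usually pairs with a capability check at startup. The sketch below shows one generic way to do that in Python; `has_module` and `CAPABILITIES` are hypothetical names and are not taken from main.py.

```python
import importlib.util


def has_module(name: str) -> bool:
    """True if a top-level module can be found, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


# surya-ocr installs the `surya` package; WeasyPrint installs `weasyprint`.
# Finding the module does not guarantee it loads: WeasyPrint can still fail
# at import time if its system libraries are missing.
CAPABILITIES = {
    "math": has_module("surya"),
    "pdf_export": has_module("weasyprint"),
}
```

A build that omits surya-ocr would then report `"math": False`, and the math stage could be skipped rather than crashing.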

Redistribution checklist (if you ship a bundled build)

When redistributing a combined artifact, ensure the following are included:

  • A copy of the AGPL-3.0 license text (typically at the artifact's root).
  • A copy of the GPL-3.0 license text (if surya-ocr is bundled).
  • Copies of each permissive license for bundled deps (MIT / BSD-3-Clause / Apache-2.0 / HPND) with their original copyright notices intact. The per-package LICENSE/COPYING files inside each *.dist-info/ directory are sufficient.
  • Written offer (or accompanying archive) for the complete corresponding source code of the combined work.
  • If hosting as a network service: a clearly visible "Source" link pointing to the exact revision deployed.
  • Preserve vendored KaTeX LICENSE and fonts/LICENSE notices in the build.

Disclaimer

This license summary is informational and not legal advice. For anything affecting redistribution at scale, commercial deployment, or corporate compliance, consult a lawyer. License obligations around Python import-time "linking" are well-established in FSF doctrine but have not been exhaustively tested in court; the consensus in the Python ecosystem is to treat AGPL/GPL imports as triggering license obligations for combined works.
