PDF Reflow

Turn a fixed-layout PDF into a clean, reflowable reading experience in the browser — then export it back out as self-contained HTML or a reflowable PDF.

Most PDFs are built for print: rigid columns, fixed page breaks, pixel-pinned figures. That layout falls apart on phones and is painful to read at any zoom level other than the one the author picked. PDF Reflow extracts the semantic content (headings, paragraphs, figures, tables, equations, code, lists) and re-renders it as a responsive single-column document with light/dark mode, proper math typesetting, and mobile-friendly typography.

What it does

- Uploads a PDF via a drag-and-drop web UI (100 MB limit, validated by magic bytes).
- Extracts structured content using a multi-stage pipeline:
  - Layout and reading order via Docling, with a PyMuPDF fallback.
  - Images with bounding boxes and deduplication via PyMuPDF.
  - Math regions detected and OCR'd to LaTeX via Surya (LayoutPredictor + TexifyPredictor).
  - Spatial merge (IoU-based bounding-box matching) that stitches images and equations back into Docling's reading order.
- Caches results on disk keyed by the SHA-256 of the PDF, so re-uploading the same file is instant.
- Renders the structured document in a reader view with KaTeX for math, lazy-loaded images, horizontally scrollable tables, and responsive typography.
- Exports as either a fully self-contained HTML file (images, KaTeX CSS/JS, and KaTeX fonts all inlined as base64, with no network dependency) or a reflowable PDF via WeasyPrint.
- Degrades gracefully: if Surya isn't available, math is skipped; if Docling crashes, the pipeline falls back to plain-text PyMuPDF; if WeasyPrint is missing, HTML export still works.
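
The upload check and the cache key described above can be sketched in a few lines. This is only an illustration, not the code in main.py or cache.py: the function names are hypothetical, the 100 MB limit is assumed to mean 100 × 1024 × 1024 bytes, and the signature check is the strict form that expects `%PDF-` at offset 0.

```python
import hashlib

PDF_MAGIC = b"%PDF-"                   # signature at the start of a PDF file
MAX_UPLOAD_BYTES = 100 * 1024 * 1024   # assumed interpretation of "100 MB"


def validate_pdf(data: bytes) -> None:
    """Raise ValueError if the upload is too large or does not look like a PDF."""
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError("upload exceeds the 100 MB limit")
    if not data.startswith(PDF_MAGIC):
        raise ValueError("file does not start with the %PDF- signature")


def cache_key(data: bytes) -> str:
    """Return the SHA-256 hex digest of the file contents (hypothetical helper)."""
    return hashlib.sha256(data).hexdigest()
```

Because the key depends only on the bytes of the file, renaming a PDF does not change its key, which is what makes a repeat upload a cache hit.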

Tech stack

Backend — Python 3.10+

FastAPI + Uvicorn (ASGI, async request handling with background tasks)
Pydantic for the DocumentResult / ContentBlock schema (discriminated union on type)
PyMuPDF, Docling, Surya, Pillow for extraction
WeasyPrint for PDF export

Frontend — no build step

Vanilla HTML/CSS/JS (frontend/index.html, style.css, upload.js, reader.js)
KaTeX (vendored under frontend/vendor/katex/) for LaTeX rendering
CSS variables for light/dark theme, responsive at a 768px breakpoint

Architecture

PDF File
  │
  ▼
FastAPI backend
  │
  ├─ 1. Docling         ─ layout, headings, paragraphs, tables, reading order
  ├─ 2. PyMuPDF         ─ images with bounding boxes
  ├─ 3. Surya           ─ math region detection + LaTeX OCR
  └─ 4. Merge (IOU)     ─ spatial match to unify all three
  │
  ▼
DocumentResult (JSON)  ──►  disk cache (SHA-256 keyed)
  │
  ▼
Vanilla-JS reader        ──►  HTML export (self-contained)
                         └──► PDF export (WeasyPrint)
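
Step 4 in the diagram relies on intersection-over-union (IoU) matching between bounding boxes. A minimal sketch of that idea follows. The `BBox` type, the `best_match` helper, and the 0.5 threshold are illustrative assumptions, not the contents of merge.py.

```python
from typing import NamedTuple, Optional, Sequence


class BBox(NamedTuple):
    """Axis-aligned box in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def iou(a: BBox, b: BBox) -> float:
    """Intersection area divided by union area; 0.0 when the boxes are disjoint."""
    ix0, iy0 = max(a.x0, b.x0), max(a.y0, b.y0)
    ix1, iy1 = min(a.x1, b.x1), min(a.y1, b.y1)
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a.x1 - a.x0) * (a.y1 - a.y0)
    area_b = (b.x1 - b.x0) * (b.y1 - b.y0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_match(target: BBox, candidates: Sequence[BBox],
               threshold: float = 0.5) -> Optional[int]:
    """Index of the candidate with the highest IoU at or above threshold, else None."""
    best_index, best_score = None, 0.0
    for i, candidate in enumerate(candidates):
        score = iou(target, candidate)
        if score >= threshold and score > best_score:
            best_index, best_score = i, score
    return best_index
```

In a merge of this kind, each image or equation box would be matched against the layout blocks on the same page and placed at the position of its best match.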

Each pipeline stage runs in a worker thread via asyncio.to_thread() and is wrapped in try/except, so a single failing extractor doesn't take down the rest.
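
The isolation pattern can be sketched as follows. The `run_stage` helper and the two toy extractors are hypothetical stand-ins, and the stages run one after another here purely for simplicity; the real orchestrator in pipeline.py may schedule them differently.

```python
import asyncio
from typing import Any, Callable


async def run_stage(name: str, func: Callable[[], Any], default: Any = None) -> Any:
    """Run a blocking extractor in a worker thread; return `default` if it raises."""
    try:
        return await asyncio.to_thread(func)
    except Exception as exc:  # keep the failure local to this stage
        print(f"stage {name!r} failed: {exc}")
        return default


def extract_layout() -> list[str]:
    """Toy stand-in for a layout extractor that succeeds."""
    return ["heading", "paragraph"]


def extract_math() -> list[str]:
    """Toy stand-in for a math extractor that fails."""
    raise RuntimeError("math model unavailable")


async def run_pipeline() -> dict[str, list[str]]:
    layout = await run_stage("layout", extract_layout, default=[])
    math = await run_stage("math", extract_math, default=[])
    return {"layout": layout, "math": math}


if __name__ == "__main__":
    print(asyncio.run(run_pipeline()))
```

The failing math stage yields an empty list instead of an exception, so the layout result still reaches the caller.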

API

The server exposes /api/upload, /api/status/{id}, /api/document/{id}, /api/images/{id}/{img}, /api/export/{id}/html, /api/export/{id}/pdf, and /api/health. The frontend polls /api/status/{id} once per second while processing.
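
A client that is not the bundled frontend could poll the same way. The sketch below is hedged: the README does not document the response body, so the `"status"` field and its `"complete"` / `"error"` values are assumptions to be checked against docs/architecture.md, and `wait_until_done` is a hypothetical helper.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://127.0.0.1:8000"  # the Quick start address


def fetch_status(doc_id: str) -> dict:
    """GET /api/status/{id} and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/api/status/{doc_id}") as resp:
        return json.load(resp)


def wait_until_done(doc_id: str,
                    fetch: Callable[[str], dict] = fetch_status,
                    interval: float = 1.0,
                    timeout: float = 900.0,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll once per `interval` seconds until a terminal status or the timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch(doc_id)
        # Assumed field name and terminal values; adjust to the real contract.
        if status.get("status") in {"complete", "error"}:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"document {doc_id} still processing after {timeout} s")
        sleep(interval)
```

Passing `fetch` and `sleep` as parameters keeps the loop testable without a running server.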

See docs/architecture.md for the full endpoint contract, data model, progress stages, merge algorithm, and cache layout.

Quick start

Create a venv and install Python deps (cross-platform: macOS, Linux, Windows). The backend targets Python 3.10+; python3.12 is the interpreter shown here:

python3.12 -m venv venv
source venv/bin/activate        # Windows: .\venv\Scripts\Activate.ps1
pip install -r requirements.txt

Install WeasyPrint's system libraries (Pango/Cairo/GDK-PixBuf) if you want PDF export. Per-OS commands are in docs/setup.md §3.

Run the server:

cd backend
uvicorn main:app --host 127.0.0.1 --port 8000 --reload

Open http://127.0.0.1:8000 and drop a PDF on the upload zone.

First run downloads ~1 GB of ML models from HuggingFace (cached after that) and can take ~10 minutes on CPU. Subsequent documents typically process in 30–60 s.

Full setup — including per-OS system deps, troubleshooting, environment variables, and running the test suite — lives in docs/setup.md.

Project structure

pdf_reflow/
├── backend/
│   ├── main.py              # FastAPI app, endpoints, capability detection
│   ├── models.py            # Pydantic ContentBlock union + DocumentResult
│   ├── models_registry.py   # Lazy singletons for Docling/Surya models
│   ├── security.py          # API-key dependency + CORS env parsing
│   ├── pipeline.py          # Async orchestrator with progress callbacks
│   ├── cache.py             # SHA-256 keyed on-disk cache with dedup
│   ├── exporter.py          # HTML and WeasyPrint PDF export
│   └── extractors/
│       ├── layout.py        # Docling primary + PyMuPDF fallback
│       ├── images.py        # PyMuPDF image extraction with dedup/size limits
│       ├── math_extract.py  # Surya layout + texify for LaTeX OCR
│       └── merge.py         # IOU-based spatial merge
├── frontend/
│   ├── index.html           # Single-page app shell
│   ├── style.css            # Theme variables, responsive typography
│   ├── upload.js            # Drag-and-drop + status polling
│   ├── reader.js            # Block dispatch, KaTeX, export menu
│   └── vendor/katex/        # Vendored KaTeX (CSS, JS, fonts)
├── docs/
│   ├── setup.md             # Cross-platform install and env config
│   └── architecture.md      # Deep dive: pipeline, data model, merge, caching
├── tests/                   # pytest suite (unit + integration markers)
├── requirements.txt
├── requirements-dev.txt
├── pytest.ini
├── uploads/                 # Incoming PDFs (auto-created, .gitignored)
└── cache/                   # Processed results + extracted images

Caveats

CPU is slow for math. Surya runs on CPU if no GPU is detected — expect ~10 min for the first document including model download. GPU cuts this dramatically.
Encrypted PDFs are rejected. Remove the password first with another tool.
Local/dev defaults. By default CORS allows any origin (*) and no API key is required; set the CORS_ORIGINS and API_KEY environment variables to restrict both. Rate limiting is not implemented in-process, so put a reverse proxy in front of the server in production.
Inline math placement is approximate. Inline equations are appended to the containing paragraph rather than inserted at the precise character offset. See the Known Limitations section in docs/architecture.md.
No frontend virtualization. Very large documents (500+ pages) may be slow to render in the reader.
torch / numpy pins are load-bearing. numpy<2 and surya-ocr 0.11.x are intentional — see docs/setup.md for why.
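
For the local/dev defaults caveat, environment handling of this kind might look like the sketch below. It is not the code in security.py: the comma-separated format for CORS_ORIGINS and all three helper names are assumptions.

```python
import hmac
import os
from typing import Mapping, Optional


def parse_cors_origins(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the allowed origins; fall back to the wildcard when unset."""
    raw = env.get("CORS_ORIGINS", "").strip()
    if not raw:
        return ["*"]
    # Assumed format: comma-separated list of origins.
    return [origin.strip() for origin in raw.split(",") if origin.strip()]


def required_api_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the configured key, or None when no key is enforced."""
    return env.get("API_KEY") or None


def api_key_matches(provided: Optional[str], expected: Optional[str]) -> bool:
    """Constant-time comparison; always True when no key is configured."""
    if expected is None:
        return True
    return hmac.compare_digest((provided or "").encode(), expected.encode())
```

Taking the environment as a mapping argument makes the defaults easy to exercise in tests without touching the real process environment.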

License

The source code written for this project is released under the MIT License (see LICENSE at the repository root).

⚠️ Important — not MIT end-to-end. This project depends on PyMuPDF (AGPL-3.0) and surya-ocr (GPL-3.0-or-later) at runtime. While every file I wrote is MIT, any combined distribution (Docker image, release archive with a bundled venv/, hosted SaaS, published wheel including dependencies) inherits the strongest copyleft terms among its components — in this case AGPL-3.0. See Copyleft dependency note below for what that means in practice and how to opt out.

Third-party licenses

PDF Reflow depends on the components in the table below, either directly or transitively. Their licenses apply to any combined/redistributed build of the application. The transitive dependencies (torch, transformers, tokenizers, huggingface_hub, starlette, anyio, click, safetensors, etc.) are all permissive (BSD / Apache-2.0 / MIT); full per-package detail is available via pip show <pkg> or by inspecting the LICENSE files under venv/lib/.../*.dist-info/.

Component	License	Notes
FastAPI, Pydantic, Docling (+ `docling-core` / `docling-parse` / `docling-ibm-models`)	MIT	Permissive
Uvicorn, NumPy, WeasyPrint, httpx, starlette, click, torch	BSD-3-Clause	Permissive
python-multipart, pytest-asyncio, transformers, huggingface_hub	Apache-2.0	Permissive
anyio, pytest	MIT	Permissive (pytest is dev-only)
Pillow	HPND	Permissive, MIT-compatible
KaTeX (vendored under `frontend/vendor/katex/`)	MIT	Redistributed — KaTeX's `LICENSE` file ships alongside the vendored code (see `frontend/vendor/katex/LICENSE`)
KaTeX fonts (vendored under `frontend/vendor/katex/fonts/`)	MIT	Redistributed — the `katex-fonts` repo's `LICENSE` ships alongside the fonts (see `frontend/vendor/katex/fonts/LICENSE`). Khan Academy relicensed the fonts from SIL OFL 1.1 to MIT in 2018.
PyMuPDF	AGPL-3.0-or-later (or paid Artifex commercial license)	Strong copyleft — see below
surya-ocr	GPL-3.0-or-later	Strong copyleft — required for math-equation detection
Pango / Cairo / GDK-PixBuf	LGPL-2.1+	System libraries, installed by the user for WeasyPrint. Not bundled. Dynamically linking against LGPL libraries does not by itself place copyleft obligations on the calling code.

Copyleft dependency note

This project intentionally keeps PyMuPDF and surya-ocr in its runtime dependencies because the closest permissive alternatives (pypdfium2, pdfplumber, pypdf) each give up a capability that noticeably hurts output quality for this use case — most importantly, lossless extraction of embedded image bytes and rich text-block structure. The tradeoff is made deliberately: better extraction quality, at the cost of inheriting copyleft terms for any combined distribution.

What this means for you if you use this project:

Cloning, running locally, modifying for personal/internal use: no special obligations beyond MIT.
Publishing the source to your own GitHub fork: no special obligations. Source-only redistribution of the MIT-licensed code doesn't trigger the AGPL/GPL machinery because you're not distributing the copyleft libraries themselves (pip fetches those from PyPI under their own licenses).
Distributing a prebuilt artifact that bundles the deps (Docker images, PyInstaller binaries, release zips containing venv/, wheels with bundled copyleft code): the combined work is subject to AGPL-3.0, because PyMuPDF is AGPL and the AGPL's terms extend to the whole combined work. You must offer the complete corresponding source of the combined work, preserve all notices, and license the whole under AGPL-3.0.
Hosting as a public network service: AGPL-3.0 §13 requires you to offer source to every user who interacts with the running service. In practice this means a visible "Source" link in the UI pointing to the exact code being run.

How to ship a permissive-only build (if needed)

If you need to distribute this project without AGPL/GPL obligations:

Replace PyMuPDF with pypdfium2 (Apache-2.0 + BSD-3-Clause). Your stack becomes permissive end-to-end, but the migration is non-trivial — image extraction needs to re-encode pixel data, which costs ~1 s per PDF and can introduce mild generation loss on JPEG figures. The PyMuPDF fallback in layout.py loses structured text-block parsing, so you'd either reimplement paragraph clustering or accept "Docling or bust."
Make surya-ocr optional. Move it out of requirements.txt into an extras file (requirements-math.txt) or a pyproject.toml extras group (pip install pdf-reflow[math]). The code already degrades gracefully when Surya is absent (main.py:56-60).
Obtain a commercial PyMuPDF license from Artifex (https://artifex.com/licensing/), which releases you from the AGPL for that dependency. surya-ocr's GPL terms are unaffected.
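
If surya-ocr becomes optional, the application needs a way to tell at startup which features are available. One common approach is sketched below. It is not a copy of the detection logic in main.py; the helper names are hypothetical, and the import names `surya` and `weasyprint` are assumed to match the installed packages.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def detect_capabilities() -> dict[str, bool]:
    """Map feature names to availability (hypothetical feature names)."""
    return {
        "math": is_installed("surya"),            # provided by surya-ocr
        "pdf_export": is_installed("weasyprint"),
    }
```

Note that locating a module is weaker than importing it: WeasyPrint can be installed while its system libraries are missing, so a production check would likely wrap the actual import in try/except as well.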

Redistribution checklist (if you ship a bundled build)

When redistributing a combined artifact, ensure the following are included:

A copy of the AGPL-3.0 license text (typically at the artifact's root).
A copy of the GPL-3.0 license text (if surya-ocr is bundled).
Copies of each permissive license for bundled deps (MIT / BSD-3-Clause / Apache-2.0 / HPND) with their original copyright notices intact. The per-package LICENSE/COPYING files inside each *.dist-info/ directory are sufficient.
A written offer (or accompanying archive) for the complete corresponding source code of the combined work.
If hosting as a network service: a clearly visible "Source" link pointing to the exact revision deployed.
The vendored KaTeX LICENSE and fonts/LICENSE notices, preserved in the build.

Disclaimer

This license summary is informational and not legal advice. For anything affecting redistribution at scale, commercial deployment, or corporate compliance, consult a lawyer. License obligations around Python import-time "linking" are well-established in FSF doctrine but have not been exhaustively tested in court; the consensus in the Python ecosystem is to treat AGPL/GPL imports as triggering license obligations for combined works.
