Turn a fixed-layout PDF into a clean, reflowable reading experience in the browser — then export it back out as self-contained HTML or a reflowable PDF.
Most PDFs are built for print: rigid columns, fixed page breaks, pixel-pinned figures. That layout falls apart on phones and is painful to read at any zoom level other than the one the author picked. PDF Reflow extracts the semantic content (headings, paragraphs, figures, tables, equations, code, lists) and re-renders it as a responsive single-column document with light/dark mode, proper math typesetting, and mobile-friendly typography.
- Uploads a PDF via a drag-and-drop web UI (100MB limit, validated by magic bytes).
- Extracts structured content using a multi-stage pipeline:
  - Layout and reading order via Docling, with a PyMuPDF fallback.
  - Images with bounding boxes and dedup via PyMuPDF.
  - Math regions detected and OCR'd to LaTeX via Surya (`LayoutPredictor` + `TexifyPredictor`).
  - Spatial merge (IOU-based bbox matching) that stitches images and equations back into Docling's reading order.
- Caches results on disk keyed by SHA-256 of the PDF, so re-uploading the same file is instant.
- Renders the structured document in a reader view with KaTeX for math, lazy-loaded images, horizontally scrollable tables, and responsive typography.
- Exports as either a truly self-contained HTML file (images, KaTeX CSS/JS, and KaTeX fonts all inlined as base64 — no network dependency) or a reflowable PDF via WeasyPrint.
- Degrades gracefully: if Surya isn't available, math is skipped; if Docling crashes, the pipeline falls back to plain-text PyMuPDF; if WeasyPrint is missing, HTML export still works.
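
The upload check and the cache key from the list above can be sketched in a few lines of standard-library Python. This is only an illustration, not the project's code: the function names are hypothetical, the limit is assumed to mean 100 × 1024 × 1024 bytes, and the check assumes the file begins directly with the `%PDF-` header.

```python
import hashlib

# Assumption: "100MB" is interpreted as 100 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 100 * 1024 * 1024
# A well-formed PDF begins with this header.
PDF_MAGIC = b"%PDF-"


def validate_pdf(data: bytes) -> None:
    """Raise ValueError if the upload is too large or does not look like a PDF."""
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the upload limit")
    if not data.startswith(PDF_MAGIC):
        raise ValueError("file does not start with the PDF magic bytes")


def cache_key(data: bytes) -> str:
    """Hex SHA-256 of the raw bytes; identical uploads map to the same key."""
    return hashlib.sha256(data).hexdigest()
```

Because the key depends only on the file's bytes, renaming a PDF and uploading it again would still hit the same cache entry.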
Backend — Python 3.10+
- FastAPI + Uvicorn (ASGI, async request handling with background tasks)
- Pydantic for the `DocumentResult` / `ContentBlock` schema (discriminated union on `type`)
- PyMuPDF, Docling, Surya, Pillow for extraction
- WeasyPrint for PDF export
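
To illustrate what "discriminated union on `type`" means, here is a standard-library stand-in that uses dataclasses and a lookup table instead of Pydantic, so it runs without dependencies. The block classes and their fields (`level`, `text`, `latex`) are hypothetical; the real schema lives in `models.py` and may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class Heading:
    level: int
    text: str
    type: str = "heading"


@dataclass
class Paragraph:
    text: str
    type: str = "paragraph"


@dataclass
class Equation:
    latex: str
    type: str = "equation"


ContentBlock = Union[Heading, Paragraph, Equation]

# Maps each discriminator value to the class that handles it.
BLOCK_TYPES = {"heading": Heading, "paragraph": Paragraph, "equation": Equation}


def parse_block(raw: Dict[str, Any]) -> ContentBlock:
    """Pick the block class from raw["type"] and build it from the other keys."""
    kind = raw.get("type")
    if kind not in BLOCK_TYPES:
        raise ValueError(f"unknown block type: {kind!r}")
    fields = {key: value for key, value in raw.items() if key != "type"}
    return BLOCK_TYPES[kind](**fields)


def parse_blocks(raw_blocks: List[Dict[str, Any]]) -> List[ContentBlock]:
    """Parse a list of raw JSON-like dicts into typed blocks."""
    return [parse_block(raw) for raw in raw_blocks]
```

The `type` field decides which class validates the rest of the payload, which is the same idea Pydantic applies when a union is declared with a discriminator.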
Frontend — no build step
- Vanilla HTML/CSS/JS (`frontend/index.html`, `style.css`, `upload.js`, `reader.js`)
- KaTeX (vendored under `frontend/vendor/katex/`) for LaTeX rendering
- CSS variables for light/dark theme, responsive at 768px breakpoint
PDF File
   │
   ▼
FastAPI backend
   │
   ├─ 1. Docling      ─ layout, headings, paragraphs, tables, reading order
   ├─ 2. PyMuPDF      ─ images with bounding boxes
   ├─ 3. Surya        ─ math region detection + LaTeX OCR
   └─ 4. Merge (IOU)  ─ spatial match to unify all three
   │
   ▼
DocumentResult (JSON) ──► disk cache (SHA-256 keyed)
   │
   ▼
Vanilla-JS reader ──► HTML export (self-contained)
                 └──► PDF export (WeasyPrint)
Each pipeline stage runs in asyncio.to_thread() and is wrapped in try/except so a single failing extractor doesn't take down the rest.
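
A minimal sketch of that pattern, assuming each stage is a plain blocking callable and that a failed stage simply yields `None`. The helper names and the sequential ordering are assumptions for illustration; the real orchestrator in `pipeline.py` also reports progress.

```python
import asyncio
from typing import Any, Callable, Dict, Optional


async def run_stage(name: str, func: Callable[[], Any]) -> Optional[Any]:
    """Run a blocking extractor in a worker thread; return None if it fails."""
    try:
        return await asyncio.to_thread(func)
    except Exception as exc:  # one failing stage must not abort the others
        print(f"stage {name!r} failed: {exc}")
        return None


async def run_pipeline(stages: Dict[str, Callable[[], Any]]) -> Dict[str, Any]:
    """Run the stages in order and collect whatever each one produced."""
    results: Dict[str, Any] = {}
    for name, func in stages.items():
        results[name] = await run_stage(name, func)
    return results


def broken_math_stage() -> None:
    """Stand-in for an extractor whose model is unavailable."""
    raise RuntimeError("model not available")


results = asyncio.run(
    run_pipeline(
        {
            "layout": lambda: ["heading", "paragraph"],
            "images": lambda: [],
            "math": broken_math_stage,
        }
    )
)
```

Here the `math` stage raises, yet `layout` and `images` still return their results, and the event loop stays free while each blocking call runs in a thread.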
The server exposes /api/upload, /api/status/{id}, /api/document/{id}, /api/images/{id}/{img}, /api/export/{id}/html, /api/export/{id}/pdf, and /api/health. The frontend polls /api/status/{id} once per second while processing.
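
The polling loop is implemented in the frontend's JavaScript, but the same idea can be sketched as a Python client. The status field `state` and its values `done` and `error` are hypothetical, since the actual response shape is defined in docs/architecture.md; `fetch_status` stands for whatever performs the GET against `/api/status/{id}`.

```python
import time
from typing import Any, Callable, Dict


def wait_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    interval: float = 1.0,
    max_polls: int = 600,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status every `interval` seconds until it reports a final state."""
    for _ in range(max_polls):
        status = fetch_status()
        # Hypothetical field and values; adapt to the real status schema.
        if status.get("state") in ("done", "error"):
            return status
        sleep(interval)
    raise TimeoutError("document was still processing after max_polls checks")
```

Passing `sleep` and `fetch_status` as parameters keeps the loop testable without a running server or real delays.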
See docs/architecture.md for the full endpoint contract, data model, progress stages, merge algorithm, and cache layout.
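
For readers unfamiliar with the IOU matching used in the merge stage, the core computation looks roughly like the sketch below. The `(x0, y0, x1, y1)` box layout, the `best_match` helper, and the 0.5 threshold are assumptions for illustration; the project's actual matching rules are in `merge.py`.

```python
from typing import Optional, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def iou(a: BBox, b: BBox) -> float:
    """Intersection area divided by union area; 0.0 when the boxes do not overlap."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_match(
    target: BBox, candidates: Sequence[BBox], threshold: float = 0.5
) -> Optional[int]:
    """Index of the candidate with the highest IOU, or None if none reaches threshold."""
    best_index, best_score = None, threshold
    for index, box in enumerate(candidates):
        score = iou(target, box)
        if score >= best_score:
            best_index, best_score = index, score
    return best_index
```

An image or equation box would be compared against the layout blocks on the same page, and the best-scoring block determines where it lands in the reading order.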
- Create a venv and install Python deps (cross-platform — macOS, Linux, Windows):
    python3.12 -m venv venv
    source venv/bin/activate   # Windows: .\venv\Scripts\Activate.ps1
    pip install -r requirements.txt
- Install WeasyPrint's system libraries (Pango/Cairo/GDK-PixBuf) if you want PDF export. Per-OS commands are in docs/setup.md §3.
- Run the server:
    cd backend
    uvicorn main:app --host 127.0.0.1 --port 8000 --reload
- Open http://127.0.0.1:8000 and drop a PDF on the upload zone.
First run downloads ~1 GB of ML models from HuggingFace (cached after that) and can take ~10 minutes on CPU. Subsequent documents typically process in 30–60 s.
Full setup — including per-OS system deps, troubleshooting, environment variables, and running the test suite — lives in docs/setup.md.
pdf_reflow/
├── backend/
│   ├── main.py              # FastAPI app, endpoints, capability detection
│   ├── models.py            # Pydantic ContentBlock union + DocumentResult
│   ├── models_registry.py   # Lazy singletons for Docling/Surya models
│   ├── security.py          # API-key dependency + CORS env parsing
│   ├── pipeline.py          # Async orchestrator with progress callbacks
│   ├── cache.py             # SHA-256 keyed on-disk cache with dedup
│   ├── exporter.py          # HTML and WeasyPrint PDF export
│   └── extractors/
│       ├── layout.py        # Docling primary + PyMuPDF fallback
│       ├── images.py        # PyMuPDF image extraction with dedup/size limits
│       ├── math_extract.py  # Surya layout + texify for LaTeX OCR
│       └── merge.py         # IOU-based spatial merge
├── frontend/
│   ├── index.html           # Single-page app shell
│   ├── style.css            # Theme variables, responsive typography
│   ├── upload.js            # Drag-and-drop + status polling
│   ├── reader.js            # Block dispatch, KaTeX, export menu
│   └── vendor/katex/        # Vendored KaTeX (CSS, JS, fonts)
├── docs/
│   ├── setup.md             # Cross-platform install and env config
│   └── architecture.md      # Deep dive: pipeline, data model, merge, caching
├── tests/                   # pytest suite (unit + integration markers)
├── requirements.txt
├── requirements-dev.txt
├── pytest.ini
├── uploads/                 # Incoming PDFs (auto-created, .gitignored)
└── cache/                   # Processed results + extracted images
- CPU is slow for math. Surya runs on CPU if no GPU is detected — expect ~10 min for the first document including model download. GPU cuts this dramatically.
- Encrypted PDFs are rejected. Remove the password first with another tool.
- Local/dev defaults. CORS is `*` and no API key is required unless you set the `CORS_ORIGINS` and `API_KEY` env vars. There is no in-process rate limiting; front the server with a reverse proxy in production.
- Inline math placement is approximate. Inline equations are appended to the containing paragraph rather than inserted at the precise character offset. See the Known Limitations section in docs/architecture.md.
- No frontend virtualization. Very large documents (500+ pages) may be slow to render in the reader.
- torch / numpy pins are load-bearing. `numpy<2` and `surya-ocr 0.11.x` are intentional; see docs/setup.md for why.
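
The local/dev defaults described above could be read from the environment along these lines. This is a sketch under assumptions: it treats `CORS_ORIGINS` as a comma-separated list (the real format is documented in docs/setup.md), and the function names are hypothetical rather than taken from `security.py`.

```python
import hmac
from typing import List, Mapping, Optional


def parse_cors_origins(env: Mapping[str, str]) -> List[str]:
    """Split CORS_ORIGINS on commas, falling back to the permissive "*" default."""
    raw = env.get("CORS_ORIGINS", "")
    origins = [item.strip() for item in raw.split(",") if item.strip()]
    return origins or ["*"]


def api_key_ok(env: Mapping[str, str], presented: Optional[str]) -> bool:
    """With no API_KEY configured every request passes; otherwise keys must match."""
    expected = env.get("API_KEY")
    if not expected:
        return True
    if presented is None:
        return False
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(expected.encode(), presented.encode())
```

In practice `env` would be `os.environ`; taking it as a parameter keeps both helpers easy to test.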
- docs/architecture.md — API schemas, pipeline internals, merge algorithm, graceful-degradation matrix, performance characteristics, known limitations.
- docs/setup.md — cross-platform install, WeasyPrint system deps, environment variables, running tests, troubleshooting.
The source code written for this project is released under the MIT License (see LICENSE at the repository root).
⚠️ Important: not MIT end-to-end. This project depends on PyMuPDF (AGPL-3.0) and surya-ocr (GPL-3.0-or-later) at runtime. While every file I wrote is MIT, any combined distribution (Docker image, release archive with a bundled `venv/`, hosted SaaS, published wheel including dependencies) inherits the strongest copyleft terms among its components, in this case AGPL-3.0. See Copyleft dependency note below for what that means in practice and how to opt out.
PDF Reflow directly depends on the components in the table below. Their licenses apply to any combined/redistributed build of the application. Transitive dependencies (torch, transformers, tokenizers, huggingface_hub, starlette, anyio, click, safetensors, etc.) are all permissive (BSD / Apache-2.0 / MIT); full per-package detail is available via pip show <pkg> or inspecting the LICENSE files under venv/lib/.../*.dist-info/.
| Component | License | Notes |
|---|---|---|
| FastAPI, Pydantic, Docling (+ docling-core / docling-parse / docling-ibm-models) | MIT | Permissive |
| Uvicorn, NumPy, WeasyPrint, httpx, starlette, click, torch | BSD-3-Clause | Permissive |
| python-multipart, pytest-asyncio, transformers, huggingface_hub | Apache-2.0 | Permissive |
| anyio, pytest | MIT | Permissive (pytest is dev-only) |
| Pillow | HPND | Permissive, MIT-compatible |
| KaTeX (vendored under `frontend/vendor/katex/`) | MIT | Redistributed: KaTeX's LICENSE file ships alongside the vendored code (see `frontend/vendor/katex/LICENSE`) |
| KaTeX fonts (vendored under `frontend/vendor/katex/fonts/`) | MIT | Redistributed: the katex-fonts repo's LICENSE ships alongside the fonts (see `frontend/vendor/katex/fonts/LICENSE`). Khan Academy relicensed the fonts from SIL OFL 1.1 to MIT in 2018. |
| PyMuPDF | AGPL-3.0-or-later (or paid Artifex commercial license) | Strong copyleft — see below |
| surya-ocr | GPL-3.0-or-later | Strong copyleft — required for math-equation detection |
| Pango / Cairo / GDK-PixBuf | LGPL-2.1+ | System libraries, installed by the user for WeasyPrint. Not bundled. Dynamic linking against LGPL libraries is compatible with any license. |
This project intentionally keeps PyMuPDF and surya-ocr in its runtime dependencies because the closest permissive alternatives (pypdfium2, pdfplumber, pypdf) each give up a capability that noticeably hurts output quality for this use case — most importantly, lossless extraction of embedded image bytes and rich text-block structure. The tradeoff is made deliberately: better extraction quality, at the cost of inheriting copyleft terms for any combined distribution.
What this means for you if you use this project:
- Cloning, running locally, modifying for personal/internal use — no special obligations beyond MIT.
- Publishing the source to your own GitHub fork — no special obligations; MIT and source-only redistribution of your own code doesn't trigger the AGPL/GPL machinery because you're not distributing the copyleft libraries themselves (pip does that, from PyPI, under their licenses).
- Distributing a prebuilt artifact that bundles the deps (Docker images, PyInstaller binaries, release zips containing `venv/`, wheels with bundled copyleft code): the combined work is subject to AGPL-3.0 (because PyMuPDF is AGPL and AGPL is viral across combined works). You must offer the complete corresponding source of the combined work, preserve all notices, and license the whole under AGPL-3.0.
- Hosting as a public network service: AGPL-3.0 §13 requires you to offer source to every user who interacts with the running service. In practice this means a visible "Source" link in the UI pointing to the exact code being run.
If you need to distribute this project without AGPL/GPL obligations:
- Replace PyMuPDF with `pypdfium2` (Apache-2.0 + BSD-3-Clause). Your stack becomes permissive end-to-end, but the migration is non-trivial: image extraction needs to re-encode pixel data, which costs ~1 s per PDF and can introduce mild generation loss on JPEG figures. The PyMuPDF fallback in `layout.py` loses structured text-block parsing, so you'd either reimplement paragraph clustering or accept "Docling or bust."
- Make surya-ocr optional. Move it out of `requirements.txt` into an extras file (`requirements-math.txt`) or a `pyproject.toml` extras group (`pip install pdf-reflow[math]`). The code already degrades gracefully when Surya is absent (`main.py:56-60`).
- Obtain a commercial PyMuPDF license from Artifex (https://artifex.com/licensing/) to release you from AGPL.
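
One common way to make a dependency optional is to probe for it at startup without importing it. The sketch below shows that general technique and is not a copy of the detection code in `main.py`; `detect_capabilities` is a hypothetical name, and it assumes top-level import names such as `surya` and `weasyprint`.

```python
import importlib.util
from typing import Dict, Iterable


def detect_capabilities(modules: Iterable[str]) -> Dict[str, bool]:
    """Report which top-level modules are installed, without importing them."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


# Example: decide at startup which optional features to enable.
capabilities = detect_capabilities(["surya", "weasyprint"])
math_enabled = capabilities["surya"]
pdf_export_enabled = capabilities["weasyprint"]
```

Note that finding a module spec only shows the package is installed. A library such as WeasyPrint can still fail at import time if its system libraries are missing, so a real check may also need a guarded import.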
When redistributing a combined artifact, ensure the following are included:
- A copy of the AGPL-3.0 license text (typically at the artifact's root).
- A copy of the GPL-3.0 license text (if surya-ocr is bundled).
- Copies of each permissive license for bundled deps (MIT / BSD-3-Clause / Apache-2.0 / HPND) with their original copyright notices intact. The per-package `LICENSE` / `COPYING` files inside each `*.dist-info/` directory are sufficient.
- Written offer (or accompanying archive) for the complete corresponding source code of the combined work.
- If hosting as a network service: a clearly visible "Source" link pointing to the exact revision deployed.
- Preserve vendored KaTeX `LICENSE` and `fonts/LICENSE` notices in the build.
This license summary is informational and not legal advice. For anything affecting redistribution at scale, commercial deployment, or corporate compliance, consult a lawyer. License obligations around Python import-time "linking" are well-established in FSF doctrine but have not been exhaustively tested in court; the consensus in the Python ecosystem is to treat AGPL/GPL imports as triggering license obligations for combined works.