PDF Flow Diff is a local web MVP for reviewing long-form PDF revisions when pagination reflow makes page-by-page comparison unusable.
It is designed for workflows where a tiny text change matters, but a single inserted sentence can push the next 100 pages into different physical positions. The project reconstructs a continuous logical text flow, coarse-aligns extracted lines with patiencediff, refines only changed windows with diff-match-patch, then projects review anchors back onto the original PDF coordinates for side-by-side visual review.
- Ignores false positives caused only by pagination reflow or layout overflow.
- Tracks real edits at word and character granularity.
- Projects every review anchor back to page coordinates for visual highlighting.
- Supports a review-friendly dual-pane UI with anchor jumping instead of fragile scroll lock.
- Separates explicit-grid table regions from body text so table edits can be reviewed at cell level.
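The projection goal above can be illustrated with a minimal sketch. It assumes each extracted text fragment carries its page index and bounding box; the `Atom` shape and the helper names `build_flow` and `project` are hypothetical and are not taken from the backend code.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """One extracted text fragment with its physical location (hypothetical shape)."""
    text: str
    page: int                                 # zero-based page index
    bbox: tuple[float, float, float, float]   # (x0, y0, x1, y1) in PDF points


def build_flow(atoms: list[Atom]) -> tuple[str, list[int]]:
    """Join atoms into one logical string and record each atom's start offset."""
    parts, starts, pos = [], [], 0
    for atom in atoms:
        starts.append(pos)
        parts.append(atom.text)
        pos += len(atom.text) + 1  # +1 accounts for the joining space
    return " ".join(parts), starts


def project(atoms: list[Atom], starts: list[int], begin: int, end: int):
    """Return (page, bbox) for every atom overlapping the flow range [begin, end)."""
    first = max(bisect_right(starts, begin) - 1, 0)
    hits = []
    for i in range(first, len(atoms)):
        if starts[i] >= end:
            break
        if starts[i] + len(atoms[i].text) <= begin:
            continue  # the range starts in the gap after this atom
        hits.append((atoms[i].page, atoms[i].bbox))
    return hits
```

Because the diff runs on the flow string rather than on pages, a range that straddles a page break simply resolves to boxes on two different pages.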
Supported:
- Text-based PDFs exported from Word or similar tools
- Mixed body text plus explicit-grid tables
- Local single-user review workflow
- Browser-based review on current Chromium-class browsers
Not supported in this MVP:
- Scanned PDFs or OCR
- Handwritten annotations
- Image diffs
- Multi-user jobs or persistent queues
- Audit export files such as JSON download or marked-up PDFs
The core pipeline is:
- Extract page geometry, text atoms, and table regions from each PDF with `PyMuPDF`.
- Reconstruct a cross-page logical text flow and normalize only what is needed for alignment.
- Coarse-align extracted text lines with `patiencediff` so hard `equal` windows can re-anchor the flow.
- Run local `diff-match-patch` only inside non-equal windows for word and character precision.
- Coalesce raw edit events into review-friendly anchors and mark risky large windows as low confidence.
- Project anchor ranges back to PDF page coordinates.
- Render side-by-side PDFs with anchor-linked highlights in the React UI.
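The coarse-then-fine alignment steps can be sketched roughly as follows. To keep the example runnable with the standard library alone, `difflib.SequenceMatcher` stands in for both `patiencediff` (line level) and `diff-match-patch` (character level); the function names are hypothetical and this is not the backend's actual code.

```python
from difflib import SequenceMatcher


def coarse_windows(old_lines, new_lines):
    """Yield (old_chunk, new_chunk) for every non-equal line window."""
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            yield old_lines[i1:i2], new_lines[j1:j2]


def refine(old_chunk, new_chunk):
    """Return character-level edits inside one changed window."""
    a, b = "\n".join(old_chunk), "\n".join(new_chunk)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def two_stage_diff(old_lines, new_lines):
    """Refine only the windows that the coarse line pass could not match."""
    edits = []
    for old_chunk, new_chunk in coarse_windows(old_lines, new_lines):
        edits.extend(refine(old_chunk, new_chunk))
    return edits
```

Restricting the expensive character-level pass to unmatched windows is what keeps a one-word change in a long document from triggering a full character diff of every line.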
See docs/architecture.md for the full module breakdown.
.
├── backend/ FastAPI service and diff engine
├── frontend/ React review UI
├── docs/ API, integration, and architecture docs
└── .github/workflows/ CI pipeline
- Backend: FastAPI, PyMuPDF, patiencediff, diff-match-patch, Pydantic
- Frontend: React, Vite, TypeScript, Tailwind, react-pdf
- Tooling: `uv` for Python dependency management, `pnpm` for frontend dependency management
Requirements:
- Python 3.11+
- Node.js 20+
- `uv`
- `pnpm`
cd backend
uv sync --extra dev
uv run uvicorn app.main:app --reload --port 8000
The API will be available at http://localhost:8000.
Useful endpoints while developing:
- `GET /health`
- `GET /docs` for Swagger UI
- `GET /redoc` for ReDoc
cd frontend
pnpm install
pnpm dev
The Vite dev server runs at http://localhost:5173 and expects the backend at http://localhost:8000.
The backend exposes an async job workflow:
- `POST /api/jobs` uploads `sourcePdf` and `modifiedPdf`.
- `GET /api/jobs/{id}` polls job progress.
- `GET /api/jobs/{id}/result` fetches the final diff anchors.
- `GET /api/jobs/{id}/files/{side}` streams the original uploaded PDF back for rendering.
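A client for this workflow might poll along these lines. The sketch takes the HTTP layer as an injected `get_json` callable so it stays independent of any particular client library; the `status` field and its `"done"` and `"failed"` values are assumptions made for illustration, and docs/api.md remains the authority on the real response shape.

```python
import time
from typing import Callable


def wait_for_result(
    job_id: str,
    get_json: Callable[[str], dict],
    *,
    interval: float = 0.5,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll the job endpoint until it finishes, then fetch the result."""
    deadline = time.monotonic() + timeout
    while True:
        job = get_json(f"/api/jobs/{job_id}")
        status = job.get("status")  # hypothetical field name
        if status == "done":        # hypothetical terminal value
            return get_json(f"/api/jobs/{job_id}/result")
        if status == "failed":      # hypothetical terminal value
            raise RuntimeError(f"job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish in {timeout}s")
        sleep(interval)
```

Injecting `get_json` and `sleep` also makes the loop easy to exercise in tests without a running backend.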
Primary anchor kinds:
- `insert`
- `delete`
- `replace`
- `reflow`
Primary source types:
- `text`
- `table`
Anchor confidence values:
- `high`
- `low`
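For consumers of the result payload, the three value sets above could be modeled as follows. This is a sketch using standard-library dataclasses rather than the backend's Pydantic models; the field names `kind` and `confidence` and both helper functions are assumptions, and a real anchor carries more fields than shown here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Literal, get_args

AnchorKind = Literal["insert", "delete", "replace", "reflow"]
SourceType = Literal["text", "table"]
Confidence = Literal["high", "low"]


@dataclass(frozen=True)
class Anchor:
    kind: AnchorKind
    source_type: SourceType
    confidence: Confidence


def parse_anchor(raw: dict) -> Anchor:
    """Validate a raw dict against the documented value sets."""
    checks = (("kind", AnchorKind), ("source_type", SourceType), ("confidence", Confidence))
    for field, allowed in checks:
        if raw.get(field) not in get_args(allowed):
            raise ValueError(f"unexpected {field}: {raw.get(field)!r}")
    return Anchor(raw["kind"], raw["source_type"], raw["confidence"])


def count_by_kind(anchors: list[Anchor]) -> Counter:
    """Summarize a result set, e.g. for a review sidebar."""
    return Counter(a.kind for a in anchors)
```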
Full request and response examples live in docs/api.md, and backend integration guidance lives in docs/backend-integration.md.
Backend checks:
cd backend
uv run pytest
Frontend build:
cd frontend
pnpm build
For PDFs with explicit ruling lines, the extractor uses PyMuPDF table detection to split those regions out of the main text flow. Table cells are then diffed independently and returned as `source_type="table"` anchors so the UI can highlight a single cell instead of collapsing the whole table into one paragraph-like change.
If a table cannot be matched reliably, the system intentionally falls back to a weaker structural signal instead of returning misleading cell-level results.
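The cell-level behavior and its fallback can be illustrated with a small sketch. It assumes both tables are already extracted as lists of rows of cell strings and treats a shape mismatch as "cannot be matched reliably"; the function name, the dict keys other than `source_type`, and the form of the fallback anchor are all hypothetical.

```python
def diff_table(old: list[list[str]], new: list[list[str]]) -> list[dict]:
    """Compare two extracted tables cell by cell."""
    same_shape = len(old) == len(new) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(old, new)
    )
    if not same_shape:
        # Fallback: report one coarse table-level change rather than
        # guessing which cells correspond to each other.
        return [{"source_type": "table", "kind": "replace",
                 "confidence": "low", "cell": None}]

    anchors = []
    for r, (row_a, row_b) in enumerate(zip(old, new)):
        for c, (a, b) in enumerate(zip(row_a, row_b)):
            if a != b:
                anchors.append({"source_type": "table", "kind": "replace",
                                "confidence": "high", "cell": (r, c),
                                "old": a, "new": b})
    return anchors
```

A production matcher would need to handle inserted or deleted rows more gracefully than a strict shape check; the sketch only shows why a weaker signal is preferable to misaligned cell pairs.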
- The project does use mature open-source diff libraries; it does not reimplement the core text diff algorithms.
- Custom logic focuses on:
  - text flow reconstruction across pages
  - line-level coarse anchoring with `patiencediff`
  - windowed local refinement with `diff-match-patch`
  - minimal normalization before alignment
  - review-anchor coalescing
  - low-confidence marking for risky large replace windows
  - reflow detection
  - coordinate projection
  - explicit-grid table extraction
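Two of the custom steps listed above, review-anchor coalescing and low-confidence marking, can be sketched together. The `Edit` shape, the `max_gap` and `low_conf_span` thresholds, and the merge rule are illustrative assumptions, not the values or rules the backend uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    start: int  # flow offset, inclusive
    end: int    # flow offset, exclusive
    kind: str   # "insert", "delete", or "replace"


def coalesce(edits: list[Edit], max_gap: int = 3, low_conf_span: int = 200) -> list[dict]:
    """Merge nearby raw edits into anchors and flag large replace windows."""
    merged: list[dict] = []
    for edit in sorted(edits, key=lambda e: e.start):
        if merged and edit.start - merged[-1]["end"] <= max_gap:
            last = merged[-1]
            last["end"] = max(last["end"], edit.end)
            if last["kind"] != edit.kind:
                last["kind"] = "replace"  # mixed edits read as one replacement
        else:
            merged.append({"start": edit.start, "end": edit.end, "kind": edit.kind})

    for anchor in merged:
        span = anchor["end"] - anchor["start"]
        risky = anchor["kind"] == "replace" and span >= low_conf_span
        anchor["confidence"] = "low" if risky else "high"
    return merged
```

Merging a delete and an adjacent insert into one `replace` gives the reviewer a single anchor to jump to instead of two overlapping highlights.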
This repository is released under GPL-2.0-only to stay compatible with the current patiencediff coarse-alignment dependency used by the backend. See LICENSE.
- API reference
- Backend integration guide
- Architecture notes
- GitHub release prep
- Contributing guide
- Security policy
- Better table matching for inserted/deleted sections across multiple pages
- Optional audit-mode output that preserves raw low-level edit events
- Exportable machine-readable results
- Broader browser compatibility validation
- Better handling for merged cells and borderless tables