A 100% Local, Browser-Based PDF to Markdown Converter.
LiteDoc is a zero-setup, client-side tool built to extract text, images, tables, and math from PDFs. Save your LLM tokens and avoid wrestling with heavy backend environments: just drop your file in the browser and get clean Markdown.
[Screenshots: Full Main UI with Files Loaded, Editor View, Explorer View, Loading Process, Settings]
There are incredible, industry-standard tools out there for PDF parsing, like Markitdown (Microsoft), Docling (IBM), and Marker. However, they are fundamentally built for automated backend pipelines, which introduces significant friction for average users.
| Feature | Markitdown / Docling / Marker | LiteDoc |
|---|---|---|
| Setup Required | pip install, Python environments, Docker | Zero. Just open a web page. |
| Target Audience | Backend Devs, Data Engineers, AI Pipelines | Everyone. Students, researchers, writers. |
| Processing | Local CLI or Server-side API | 100% Client-side (WASM + JS). |
| Privacy | Depends on your infrastructure setup | Absolute. Files never leave your device. |
LiteDoc is for people who just want their Markdown right now. No dependencies, no server uploads, no privacy concerns. It runs entirely on your local machine using your browser's resources.
- 100% Local & Private: Core extraction, OCR, and bundling run entirely on your local CPU/GPU inside the browser. Files never touch any server.
- Document Layout Analysis (DLA): Employs a recursive XY-Cut algorithm to map out and isolate sidebars, headers, and multi-column flows, preventing horizontal text mixing.
- Smart OCR & OSD Router: Runs a lightweight 400x400px pre-pass to auto-detect script direction and language, dynamically initializing WebAssembly Tesseract.js workers.
- Table & Vector Figures: Detects vector lines to construct pristine GitHub-Flavored Markdown tables (supporting complex merged cells) and crops diagrams/charts as JPEG assets.
- LaTeX Math Equations: Automatically detects math formula bounding boxes and renders them with KaTeX.
- Arabic & RTL Formatting: Native support for Right-to-Left scripts with automatic line alignment and typography routing.
- Local Decryption: Handles password-protected PDFs securely by prompt-unlocking them locally in the browser sandbox.
- Custom Font Fallbacks: Intercepts corrupted, custom-encoded "garbage" fonts, offering image-fallback options to ensure the document remains readable.
- Batch Queuing & Memory Protection: Processes large files in 10-page chunks and releases web canvas assets dynamically to prevent Out-Of-Memory (OOM) browser crashes.
- Fully Mobile Responsive: Overhauled layout designed to offer full-editor features, document navigation, and settings toggles on mobile screens.
- Queue & Formatting Control: Pause or skip processing tasks on demand, or use the "Unformat" action to strip markdown styling instantly.
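The 10-page batching mentioned in the list above could look roughly like the sketch below. It is an illustration only, not LiteDoc's actual code: `pageChunks`, `processInChunks`, and the `processPage` / `releasePage` callbacks are hypothetical names, and the real release step (freeing canvases) is left to the caller.

```javascript
// Split a document into consecutive page ranges of at most `chunkSize` pages.
// Pages are 1-indexed, matching how PDF pages are usually numbered.
function pageChunks(totalPages, chunkSize = 10) {
  const chunks = [];
  for (let first = 1; first <= totalPages; first += chunkSize) {
    const last = Math.min(first + chunkSize - 1, totalPages);
    chunks.push({ first, last });
  }
  return chunks;
}

// Process one chunk at a time, then release that chunk's resources before
// moving on. `processPage` and `releasePage` are hypothetical callbacks
// supplied by the caller (for example, render a page / drop its canvas).
async function processInChunks(totalPages, processPage, releasePage, chunkSize = 10) {
  const results = [];
  for (const { first, last } of pageChunks(totalPages, chunkSize)) {
    for (let page = first; page <= last; page++) {
      results.push(await processPage(page));
    }
    for (let page = first; page <= last; page++) {
      releasePage(page);
    }
    // Yield to the event loop between chunks so the page stays responsive.
    await new Promise((resolve) => setTimeout(resolve, 0));
  }
  return results;
}

// A 25-page file yields three chunks: pages 1-10, 11-20, and 21-25.
console.log(pageChunks(25));
```

Keeping the chunking logic in a small pure function makes it easy to check separately from the asynchronous rendering work.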
Because LiteDoc is a purely client-side web application, you don't need to install any dependencies to run it!
The Easiest Way:
- Go to the Releases page.
- Download the `index.html` file from the latest release.
- Open the downloaded `index.html` file in any modern web browser (Chrome, Edge, Firefox, Safari) and drag and drop your PDFs!
Run from Source:
- Clone or download this repository.
- Open `dist/index.html` in your browser.
If you want to modify the source code:
- Make changes inside the `src/` directory (includes modular CSS and JS).
- Bundle your changes into a single self-contained file by running `python scripts/build.py`.
- The compiled production bundle will be updated at `dist/index.html`.
Once processing finishes, you can preview the generated Markdown directly in the built-in Ace Editor.
Click Download Files (.zip) to get a neatly packaged archive containing your .md file and a folder with all extracted images, tables, and charts.
LiteDoc relies on a powerful stack of client-side libraries:
- PDF.js - Core parsing, rendering, and text-layer extraction.
- Tesseract.js - WebAssembly-based OCR for scanned document fallback.
- JSZip - Local, client-side ZIP packaging of extracted assets.
- KaTeX - Fast math typesetting in the Markdown previewer.
- Ace Editor - High-performance code editor for tweaking Markdown before export.
Unlike basic wrapper libraries that blindly extract text sequentially from top to bottom, LiteDoc uses Document Layout Analysis (DLA) and topological graph algorithms natively in your browser. This keeps structure and reading order intact for complex formats like multi-column scientific papers, journals, and math-heavy PDFs.
I employ a top-down Recursive XY-Cut Algorithm to cleanly divide pages into discrete rectangular regions.
- The algorithm projects text block coordinates onto the X and Y axes, building density histograms.
- It mathematically detects "valleys" (gutters or whitespace gaps) and slices the document recursively until it isolates individual columns, headers, and floating sidebars without cross-contamination.
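The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not LiteDoc's actual implementation: the block shape `{ id, x0, y0, x1, y1 }`, the function names, the one-unit histogram bins, and the `minGap` default of 10 units are all made up for the example.

```javascript
// Hypothetical block shape: { id, x0, y0, x1, y1 } in page units, y growing downward.

// Density histogram along one axis: how many blocks cover each one-unit bin.
function projectionHistogram(blocks, axis, start, end) {
  const [lo, hi] = axis === "x" ? ["x0", "x1"] : ["y0", "y1"];
  const hist = new Array(Math.max(0, Math.ceil(end - start))).fill(0);
  for (const block of blocks) {
    const from = Math.max(0, Math.floor(block[lo] - start));
    const to = Math.min(hist.length, Math.ceil(block[hi] - start));
    for (let i = from; i < to; i++) hist[i] += 1;
  }
  return hist;
}

// Widest interior run of empty bins (a "valley"), or null if none reaches minGap.
function widestValley(hist, start, minGap) {
  let best = null;
  let runStart = -1;
  for (let i = 0; i < hist.length; i++) {
    if (hist[i] === 0) {
      if (runStart < 0) runStart = i;
      continue;
    }
    // A run starting at bin 0 is an outer margin, not a gutter, so skip it.
    if (runStart > 0) {
      const width = i - runStart;
      if (width >= minGap && (best === null || width > best.width)) {
        best = { width, cut: start + runStart + width / 2 };
      }
    }
    runStart = -1;
  }
  return best;
}

// Recursively slice the block set at the widest valley until no valley remains.
function xyCut(blocks, minGap = 10) {
  if (blocks.length <= 1) return blocks.length ? [blocks] : [];
  const x0 = Math.min(...blocks.map((b) => b.x0));
  const x1 = Math.max(...blocks.map((b) => b.x1));
  const y0 = Math.min(...blocks.map((b) => b.y0));
  const y1 = Math.max(...blocks.map((b) => b.y1));
  const gutterX = widestValley(projectionHistogram(blocks, "x", x0, x1), x0, minGap);
  const gutterY = widestValley(projectionHistogram(blocks, "y", y0, y1), y0, minGap);
  if (gutterX === null && gutterY === null) return [blocks];

  // Cut along the wider valley; on a tie, prefer the valley found on the Y axis.
  const cutAlongY = gutterX === null || (gutterY !== null && gutterY.width >= gutterX.width);
  const [lo, hi, cut] = cutAlongY ? ["y0", "y1", gutterY.cut] : ["x0", "x1", gutterX.cut];
  const before = blocks.filter((b) => (b[lo] + b[hi]) / 2 < cut);
  const after = blocks.filter((b) => (b[lo] + b[hi]) / 2 >= cut);
  return [...xyCut(before, minGap), ...xyCut(after, minGap)];
}

// Example page: a full-width header above two columns of two paragraphs each.
const samplePage = [
  { id: "header", x0: 0, y0: 0, x1: 200, y1: 20 },
  { id: "left-1", x0: 0, y0: 40, x1: 90, y1: 60 },
  { id: "left-2", x0: 0, y0: 65, x1: 90, y1: 85 },
  { id: "right-1", x0: 110, y0: 40, x1: 200, y1: 60 },
  { id: "right-2", x0: 110, y0: 65, x1: 200, y1: 85 },
];
console.log(xyCut(samplePage).map((region) => region.map((b) => b.id)));
```

In this example the header spans the gutter, so the first cut has to be horizontal; only after the header is split off does the column gutter show up as an empty valley on the X axis. The 5-unit gap between paragraphs stays below `minGap`, so each column remains one region.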
After slicing the page, the geometric blocks are mapped into a Directed Acyclic Graph (DAG).
- I use Kahn's Topological Sort to determine the exact human reading order.
- Edges in the graph are defined by strict geometric constraints (e.g., prioritizing $x_{min}$ alignment and strict horizontal overlap margins). This eliminates "column interleaving" bugs where a right column might be accidentally read before the left.
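As a rough sketch of the ordering step above, the example below builds the graph from pairwise geometric tests and runs Kahn's algorithm over it. The edge rules in `precedes`, the 2-unit overlap `margin`, the tie-break on smallest `x0`, and the block shape `{ id, x0, y0, x1, y1 }` are assumptions chosen for illustration; LiteDoc's real constraints may differ.

```javascript
// Length of the overlap between intervals [a0, a1] and [b0, b1] (negative if disjoint).
const overlap = (a0, a1, b0, b1) => Math.min(a1, b1) - Math.max(a0, b0);

// Assumed edge rule: should block `a` be read before block `b`?
function precedes(a, b, margin = 2) {
  const sameColumn = overlap(a.x0, a.x1, b.x0, b.x1) > margin;
  const sameRow = overlap(a.y0, a.y1, b.y0, b.y1) > margin;
  if (sameColumn) return a.y0 < b.y0; // stacked blocks: upper one first
  if (sameRow) return a.x0 < b.x0; // side-by-side blocks: left one first
  return false; // unrelated blocks get no edge
}

// Kahn's topological sort over the "precedes" graph.
function readingOrder(blocks) {
  const n = blocks.length;
  const successors = blocks.map(() => []);
  const inDegree = new Array(n).fill(0);
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      if (i !== j && precedes(blocks[i], blocks[j])) {
        successors[i].push(j);
        inDegree[j] += 1;
      }
    }
  }

  const ready = [];
  for (let i = 0; i < n; i++) if (inDegree[i] === 0) ready.push(i);

  const order = [];
  while (ready.length > 0) {
    // Among blocks with no unread predecessor, take the leftmost, then topmost.
    ready.sort((a, b) => blocks[a].x0 - blocks[b].x0 || blocks[a].y0 - blocks[b].y0);
    const current = ready.shift();
    order.push(current);
    for (const next of successors[current]) {
      inDegree[next] -= 1;
      if (inDegree[next] === 0) ready.push(next);
    }
  }

  // If the geometry produced a cycle, append the leftovers top-to-bottom.
  if (order.length < n) {
    const placed = new Set(order);
    const rest = blocks.map((_, i) => i).filter((i) => !placed.has(i));
    rest.sort((a, b) => blocks[a].y0 - blocks[b].y0 || blocks[a].x0 - blocks[b].x0);
    order.push(...rest);
  }
  return order.map((i) => blocks[i]);
}

// Blocks of a two-column page, deliberately given out of order.
const shuffledBlocks = [
  { id: "right-2", x0: 110, y0: 65, x1: 200, y1: 85 },
  { id: "left-2", x0: 0, y0: 65, x1: 90, y1: 85 },
  { id: "header", x0: 0, y0: 0, x1: 200, y1: 20 },
  { id: "right-1", x0: 110, y0: 40, x1: 200, y1: 60 },
  { id: "left-1", x0: 0, y0: 40, x1: 90, y1: 60 },
];
console.log(readingOrder(shuffledBlocks).map((b) => b.id).join(" -> "));
// header -> left-1 -> left-2 -> right-1 -> right-2
```

The tie-break is what keeps the left column together: once `left-1` is read, both `left-2` and `right-1` become available, and choosing the smaller `x0` finishes the left column before moving right.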
Mathematical formulas are notoriously difficult to extract because PDF engines often map math symbols to the Private Use Area (PUA) of Unicode.
- LiteDoc analyzes character densities and font registries (e.g., `CMSY`, `MathJax`) line-by-line.
- When an equation block is detected ($Density_{math} > 25\%$), instead of outputting corrupted text, LiteDoc geometrically calculates the strict bounding box of the multi-line formula.
- The region is rendered onto an offscreen web canvas and seamlessly cropped into a high-fidelity image (`[IMAGE_MATH]`), perfectly preserving visual fractions and complex integrals.
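The detection steps above might be sketched like this. The 25% threshold and the `CMSY` / `MathJax` font names come from the description; everything else is an assumption for illustration, including the glyph shape `{ char, fontName }`, the function names, and the choice to count only the Basic Multilingual Plane Private Use Area (U+E000 to U+F8FF). The canvas rendering and cropping step is not shown.

```javascript
// Font names mentioned in the description; a real registry may hold more.
const MATH_FONT_PATTERN = /CMSY|MathJax/i;
const MATH_DENSITY_THRESHOLD = 0.25;

// True if the character sits in the BMP Private Use Area (U+E000 to U+F8FF).
function isPrivateUse(char) {
  const codePoint = char.codePointAt(0);
  return codePoint >= 0xe000 && codePoint <= 0xf8ff;
}

// Fraction of non-whitespace glyphs on a line that look like math:
// either a Private Use Area character or a glyph drawn with a math font.
// Hypothetical glyph shape: { char, fontName }.
function mathDensity(glyphs) {
  const visible = glyphs.filter((g) => g.char.trim() !== "");
  if (visible.length === 0) return 0;
  const mathCount = visible.filter(
    (g) => isPrivateUse(g.char) || MATH_FONT_PATTERN.test(g.fontName)
  ).length;
  return mathCount / visible.length;
}

// A line counts as part of an equation when its density exceeds the threshold.
function isMathLine(glyphs) {
  return mathDensity(glyphs) > MATH_DENSITY_THRESHOLD;
}

// Smallest box enclosing every line box of a multi-line formula.
// Expects at least one box shaped like { x0, y0, x1, y1 }.
function unionBox(boxes) {
  return boxes.reduce((acc, b) => ({
    x0: Math.min(acc.x0, b.x0),
    y0: Math.min(acc.y0, b.y0),
    x1: Math.max(acc.x1, b.x1),
    y1: Math.max(acc.y1, b.y1),
  }));
}

// Example: two ordinary letters followed by two Private Use Area glyphs.
const sampleLine = [
  { char: "a", fontName: "Times" },
  { char: "b", fontName: "Times" },
  { char: "\uE001", fontName: "Times" },
  { char: "\uE002", fontName: "Times" },
];
console.log(mathDensity(sampleLine), isMathLine(sampleLine)); // 0.5 true
```

Once consecutive lines pass `isMathLine`, their boxes can be merged with `unionBox` to get the region that would be rendered and cropped.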
LiteDoc implements a robust Gibberish Scorer to identify heavily corrupted, custom-encoded "subset" fonts. It calculates a statistical
Contributions, issues, and feature requests are highly welcome! Since the goal is to keep the tool accessible and server-free, any PRs should adhere to the "100% client-side" philosophy.
A Note on Future Updates: Up until now, bugs and algorithmic edge-cases have been tracked manually by the maintainer. Because I currently don't have anyone actively opening issues on the repository, future updates will be rolling out at a slower pace. If you find a bug or want a feature, please open an issue! It is the best way to drive the next wave of development.
| Link | |
|---|---|
| Website | litedoc.xyz |
| Twitter | @0xovoo |
| Ko-fi | ko-fi.com/0xovo |
| GitHub | github.com/0xovo/LiteDoc |
| Email | contact@litedoc.xyz |
LiteDoc is, and always will be, 100% free and open-source. I originally built this tool to help broke students stop burning their paid AI tokens just to parse their study materials.
If LiteDoc has saved you time, protected your privacy, or spared your wallet from expensive backend API costs, please consider making a donation! Your support is what keeps this project alive and continuously improving.
Built with ❤️ by 0xovo