A CLI tool for extracting text from various document formats with configurable OCR capabilities.
- Multi-format Support: PDF, Word, HTML, E-books, Images, and text files
- Configurable OCR: llm-caller (Call LLM models) or surya_ocr (local OCR engine)
- Smart Content Strategy: Choose between text-first or image-first processing for PDFs
- Interactive Tool Selection: Auto-detects available tools and prompts for selection
- Automatic Tool Detection: No configuration files needed - tools are detected when required
- Cross-platform: macOS, Linux, Windows with automatic tool detection
- 🚀 Quick Start - Installation and basic usage
- 🔧 Development - Architecture and development guide
- 🏷️ Versioning - Version management and build system
# macOS
brew install ghostscript pandoc calibre && pip install surya-ocr
# Ubuntu/Linux
sudo apt-get install ghostscript pandoc calibre && pip install surya-ocr
# Windows (with Chocolatey)
choco install ghostscript pandoc calibre && pip install surya-ocrDownload pre-built binaries from Releases or build from source with go build.
# Extract with interactive selection (prompts for OCR tool and content type)
doc-to-text document.pdf
# Use specific OCR tool
doc-to-text document.pdf --ocr surya_ocr
doc-to-text document.pdf --ocr llm-caller --llm-template qwen-vl-ocr
# Specify content processing strategy for PDFs
doc-to-text document.pdf --content-type text # Try Calibre first, OCR fallback
doc-to-text document.pdf --content-type image # Direct OCR processing
# Custom output
doc-to-text document.pdf -o output.txt
# Display and set language
doc-to-text language
# Switch language
DOC_TEXT_LANG=zh doc-to-text -V
# Version info
doc-to-text --version # Quick version
doc-to-text version # Detailed build infoThe tool automatically detects required tools when needed:
- OCR Tools:
llm-caller,surya_ocr - Document Processing:
ebook-convert(Calibre),pandoc,gs(Ghostscript) - Detection Strategy: Command lookup → Common paths → Clear error messages
# Temporarily override settings
DOC_TEXT_OCR_STRATEGY=surya_ocr doc-to-text document.pdf
DOC_TEXT_CONTENT_TYPE=text doc-to-text document.pdf
DOC_TEXT_MAX_CONCURRENCY=8 doc-to-text document.pdf| Setting | Description | Default |
|---|---|---|
ocr_strategy |
OCR tool selection | interactive |
content_type |
PDF processing strategy | image |
max_concurrency |
Concurrent processes | 4 |
verbose |
Enable progress output | false |
| Type | Extensions | Method |
|---|---|---|
| PDFs | .pdf |
OCR or Calibre (based on content-type) |
| Images | .jpg, .png, .gif, .bmp, .tiff |
OCR |
| Documents | .doc, .docx, .rtf, .odt, .ppt, .xls |
Pandoc |
| Web | .html, .mhtml |
Built-in parser |
| E-books | .epub, .mobi |
Calibre |
| Text | .txt, .md, .json, .csv, .xml, .py, .js |
Direct reading |
- Local and multilingual (100+ languages)
- Installation:
pip install surya-ocr - Best for: Standard documents, batch processing
llm-caller (Configurable AI)
- AI-powered with template-based approach
- Requires:
--llm-templateparameter - Best for: Complex layouts, handwritten text, specific models
- Default mode: Automatically prompts for tool selection
- Smart detection: Shows only available engines
- Auto-selection: For text content-type, automatically selects best tool without prompts
The --content-type parameter determines PDF processing strategy:
text: Tries Calibre first (fast for text-based PDFs), then OCR if failedimage: Uses OCR directly (default, best for scanned documents)
Text is extracted to organized directories:
- Input:
/path/to/document.pdf - Output:
/path/to/{md5_hash}/text.txt - Pages:
/path/to/{md5_hash}/pages/(for PDFs)
Large document processing can be interrupted and resumed. The tool automatically:
- Detects completed pages and skips them
- Continues from the last processed page
- Maintains processing state in intermediate directories
OCR tool not found: Tools are automatically detected. Ensure they are installed and available in your PATH
Permission errors: Ensure tools are executable and paths are accessible
Poor OCR quality: Try different OCR engines or ensure good source quality (300 DPI recommended)