📄 Extract chapters from PDFs easily using bookmarks or manual entries, outputting clean text for each section. Simplify your document processing.
Updated Nov 9, 2025
Gemma-3 OCR combines computer vision and NLP, using the Gemma-3 Vision model for OCR and for cleaning up the extracted text. Built with Streamlit and Ollama, this self-contained system converts images into readable, Markdown-rendered output while keeping data confidential.
Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.
A lightweight PaddleOCR implementation in Bun/Node.js for text detection and recognition in JavaScript environments.
A very simple news crawler with a funny name
An advanced tool for processing scanned books and converting them into searchable text using OCR
⚡ Pen2PDF Suite – an all-in-one 🚀 productivity platform ✨ with 🤖 AI-powered text extraction (PDF/Images → Markdown 📝), 📅 smart timetable management (CSV/Excel import 📊), ✅ todo lists with subtasks 📈, 🧠 AI-generated notes library 📚 and 💬 Isabella AI assistant (OpenAI/Microsoft/llama/Mistral/LongCat/Gemini models 🔄) for context-aware help 🧩.
HTML to markdown converter
.NET 8 API for document file format identification, text/metadata/attachment/embedded object/sensitive item (PII/PHI)/entity extraction.
A self‑hosted search engine for documents
The fluent, lightweight and powerful .NET lexerless parsing library for language development (DSL) and data scraping.
Official Implementation of Fin-ExBERT: A GNN augmented BERT model finetuned on financial data capable of performing Natural Language Inference and User instructed Text Extraction tasks
A curated list of amazing libraries, services, and resources for working with PDF files
Apache Tika: extract text and metadata from any document format with this pre-built containerised solution. Kubernetes-ready deployment with an intuitive UI, API, and text-to-speech capabilities - perfect for content indexing, analysis, and document processing workflows
A text extraction and manipulation toolset for NISO-JATS coded XML files
A Figma plugin for extracting and importing text layers to enhance design collaboration.
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality
File Extractor Pro is a GUI application to extract and process files based on specified criteria. It allows users to include or exclude files based on extensions, include hidden files, and generate detailed extraction reports.
AI-based Media and Misinformation Content Analysis Tool: Analyze text and images