text-extraction

Here are 384 public repositories matching this topic...

gazza2577 / pdf-chapter-extractor

📄 Extract chapters from PDFs easily using bookmarks or manual entries, outputting clean text for each section. Simplify your document processing.

nodejs javascript python cli automation bookmarks text-extraction poppler pdftotext pdf-tools pypdf ai-assisted

Updated Nov 9, 2025

ricochetservice / Gemma3_OCR_Text_Extractor_LLM

Gemma-3 OCR exemplifies the confluence of abstruse computer vision and arcane NLP, leveraging Gemma-3 Vision’s neural framework for precise OCR and semantically refined text curation. Powered by Streamlit and Ollama, this hermetic system converts visual data into perspicuous, markdown-rendered output, ensuring maximal accuracy and confidentiality.

ocr base64 deep-learning image-processing transformers pillow text-extraction ocr-recognition streamlit text-extraction-from-image llm vision-language-model ollama gemma3

Updated Nov 9, 2025
Python

Goldziher / kreuzberg

Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.

python ocr async mcp pandoc tesseract text-extraction metadata-extraction table-extraction pdfium rag pdf-extraction document-intelligence

Updated Nov 9, 2025
HTML

PT-Perkasa-Pilar-Utama / ppu-paddle-ocr

A lightweight PaddleOCR implementation in Bun/Node.js for text detection and recognition in JavaScript environments.

ocr computer-vision image-processing text-extraction text-recognition node-js text-detection optical-character-recognition bun paddlepaddle javascript-ocr onnx onnxruntime paddleocr image-text-extraction typescript-ocr

Updated Nov 9, 2025
TypeScript

flairNLP / fundus

A very simple news crawler with a funny name

python nlp rss sitemap crawler scraper corpus text-extraction web-scraping image-classification datasets news-crawler corpus-tools commoncrawl web-corpus news-scraping cc-news image-extraction

Updated Nov 9, 2025
Python

ayzem88 / pdf-ocr-processor

An advanced tool for processing scanned books and converting them into searchable text using OCR

python pdf ocr image-processing tesseract text-extraction arabic pdf-processing

Updated Nov 8, 2025
Python

H0NEYP0T-466 / Pen2PDF

⚡ Pen2PDF Suite – an all-in-one 🚀 productivity platform ✨ with 🤖 AI-powered text extraction (PDF/Images → Markdown 📝), 📅 smart timetable management (CSV/Excel import 📊), ✅ todo lists with subtasks 📈, 🧠 AI-generated notes library 📚 and 💬 Isabella AI assistant (OpenAI/Microsoft/llama/Mistral/LongCat/Gemini models 🔄) for context-aware help 🧩.

ocr self-hosted text-extraction openai file-converter mistral mern-stack document-processing pdf-tools handwritten-notes longcat google-gemini ppt-to-pdf pdf-to-markdown llama-models microsoft-phi-4 ai-text-extraction pen2pdf

Updated Nov 8, 2025
JavaScript

Goldziher / html-to-markdown

HTML to markdown converter

text-extraction html-converter text-processing markdown-converter rag

Updated Nov 8, 2025
HTML

dotfurther / OpenDiscoverSDK

.NET 8 API for document file format identification, text/metadata/attachment/embedded object/sensitive item (PII/PHI)/entity extraction.

Updated Nov 7, 2025
C#

ICIJ / datashare

A self‑hosted search engine for documents

docker elasticsearch extract text-extraction named-entity-recognition web-gui datashare investigative-journalism

Updated Nov 7, 2025
Java

RomeCore / RCParsing

The fluent, lightweight and powerful .NET lexerless parsing library for language development (DSL) and data scraping.

parser parsing compiler dsl ast text-extraction parser-combinator

Updated Nov 9, 2025
C#

soumick1 / Fin-ExBERT

Official implementation of Fin-ExBERT: a GNN-augmented BERT model fine-tuned on financial data, capable of performing natural language inference and user-instructed text extraction tasks

text-extraction finance-application graph-neural-networks

Updated Nov 7, 2025
Python

OneOffTech / awesome-pdf

A curated list of amazing libraries, services and resources to work with PDF files

pdf data-science awesome ocr text-extraction pdf-viewer awesome-list datasets pdf-generation

Updated Nov 5, 2025

saidsef / tika-document-to-text

Extract text and metadata from any document format with Apache Tika in this pre-built containerised solution. Kubernetes-ready deployment with an intuitive UI, API, and text-to-speech capabilities; well suited to content indexing, analysis, and document processing workflows.

nodejs python kubernetes text-to-speech docker-container text-extraction extract-text kubernetes-deployment helm-chart document-to-text document-to-text-ui

Updated Nov 3, 2025
JavaScript

ingmarboeschen / JATSdecoder

A text extraction and manipulation toolset for NISO-JATS coded XML files

text-mining r text-extraction xml-files cermine pubmedcentral niso-jats

Updated Nov 3, 2025
R

snowmuffin / MuffinSync

A Figma plugin for extracting and importing text layers to enhance design collaboration.

javascript html typescript collaboration text-extraction content-management user-friendly design-tools workflow-automation figma-plugin

Updated Nov 2, 2025
HTML

docwire / docwire

DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality

Updated Nov 1, 2025
C++

damulhan / hwp_extract

Extract text from hwpx/hwp files

converter text-extraction hwp hwpx

Updated Oct 31, 2025
Kotlin

cortega26 / File-Extractor-Pro

File Extractor Pro is a GUI application to extract and process files based on specified criteria. It allows users to include or exclude files based on extensions, include hidden files, and generate detailed extraction reports.

text-extraction file-to-text file-extraction file-extract

Updated Oct 30, 2025
Python

ssciwr / AMMICO

AI-based Media and Misinformation Content Analysis Tool: Analyze text and images

nlp translation computer-vision text-extraction classification

Updated Nov 3, 2025
Python
