Module for automatic summarization of text documents and HTML pages.
Updated Sep 8, 2025 - Python
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Golang PDF library for creating and processing PDF files (pure go)
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition.
Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.
A very simple news crawler with a funny name
This repository has moved! https://github.com/unidoc/unipdf
Heuristic based boilerplate removal tool
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
Text Extraction, Rendering and Converting of PDF Documents
A self‑hosted search engine for documents
Parse PDFs into markdown using Vision LLMs
PDF text data extraction web app with OCR for scanned documents
A simple library and set of tools for parsing, modifying, and composing SRT files.
AWS Lambda functions to extract text from various binary formats.
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
HTML to markdown converter
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.