Module for automatic summarization of text documents and HTML pages.
Updated Sep 8, 2025 - Python
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Tika-Python is a Python binding to the Apache Tika™ REST services, allowing Tika to be called natively from Python.
A very simple news crawler with a funny name
Heuristic-based boilerplate removal tool
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
Parse PDFs into markdown using Vision LLMs
PDF text data extraction web app with OCR for scanned documents
A simple library and set of tools for parsing, modifying, and composing SRT files.
AWS Lambda functions to extract text from various binary formats.
Benchmarking PDF libraries
Python port of Boilerpipe library
Simple PDF-to-text conversion with Python using PDFtk and PyPDF2
Entity Disambiguation as text extraction (ACL 2022)
hotpdf is a fast PDF parsing library, built on top of pdfminer.six, for extracting text and finding text within PDF documents
Python-based software to unpack KindleGen-generated ebooks
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
Python scripts that convert PDF files to text, split them into chunks, and store their vector representations using GPT4All embeddings in a Chroma DB. A further script queries the Chroma DB for similarity search based on user input.
A Python asyncio wrapper for Tesseract-OCR.