# text-extraction

Here are 188 public repositories matching this topic...

miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.

python nlp pagerank-algorithm text-extraction reduction summarization html-page summary lsa sumy textteaser summarizer html-extraction html-extractor

Updated Sep 8, 2025
Python

adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Updated Sep 12, 2025
Python

chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Updated Apr 14, 2025
Python

flairNLP / fundus

A very simple news crawler with a funny name

python nlp rss sitemap crawler scraper corpus text-extraction web-scraping image-classification datasets news-crawler corpus-tools commoncrawl web-corpus news-scraping cc-news image-extraction

Updated Nov 10, 2025
Python

miso-belica / jusText

Heuristic-based boilerplate removal tool

python text-extraction html-parser html-parsing

Updated Feb 25, 2025
Python

vsymbol / CUTIE

CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)

computer-vision deep-learning text-extraction

Updated Dec 8, 2022
Python

iamarunbrahma / vision-parse

Parse PDFs into markdown using Vision LLMs

text-extraction pdf-parser document-parser pdf-to-markdown

Updated Oct 4, 2025
Python

nainiayoub / pdf-text-data-extractor

PDF text data extraction web app with OCR for scanned documents

python pdf ocr text-extraction pdf-to-text ocr-text-reader ocr-python streamlit streamlit-webapp

Updated Jun 5, 2024
Python

cdown / srt

A simple library and set of tools for parsing, modifying, and composing SRT files.

python library tools command-line text-extraction subtitles subtitle srt subtitles-parsing mit-license command-line-tool subtitle-parser subtitle-fixer

Updated Mar 19, 2024
Python

skylander86 / lambda-text-extractor

AWS Lambda functions to extract text from various binary formats.

pdf ocr aws-lambda lambda-functions tesseract text-extraction searchable-pdfs pdf-ocr-extraction

Updated Feb 7, 2018
Python

py-pdf / benchmarks

Benchmarking PDF libraries

pdf benchmark text-extraction mupdf data-extraction pypdf2 poppler-utils

Updated Jul 2, 2025
Python

jmriebold / BoilerPy3

Python port of Boilerpipe library

text-extraction boilerpipe boilerpy html-text-extraction full-text-extraction

Updated Aug 20, 2024
Python

asepmaulanaismail / pdf-to-txt-python

Simple PDF-to-text conversion with Python using PDFtk and PyPDF2

python pdf python3 text-extraction pdf-to-text pypdf2 pdftk pdf-extractor

Updated Oct 1, 2023
Python

fourdigits / wagtail_textract

Text extraction for Wagtail document search

search django wagtail tesseract text-extraction textract

Updated Oct 25, 2023
Python

SapienzaNLP / extend

Entity Disambiguation as text extraction (ACL 2022)

nlp natural-language-processing acl pytorch text-extraction entity-linking entity-disambiguation entity-disambiguation-models acl2022

Updated Apr 17, 2022
Python

weareprestatech / hotpdf

hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six

python pdf text-extraction text-search

Updated Dec 15, 2024
Python

iscc / mobi

Python-based software to unpack KindleGen-generated ebooks

mobi text-extraction kindle

Updated Sep 14, 2025
Python

iamarunbrahma / pdf-to-markdown

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

python information-retrieval document-conversion pdf-converter text-extraction pdf-parsing document-processing rag pdf-extraction retrieval-augmented-generation pdf-to-markdown

Updated Nov 22, 2024
Python

Govind-S-B / pdf-to-text-chroma-search

Python scripts that convert PDF files to text, split them into chunks, and store their vector representations in a Chroma DB using GPT4All embeddings. A separate script queries the Chroma DB for similarity search based on user input.

text-extraction similarity-search pdf-processing vector-embeddings chromadb

Updated Oct 23, 2023
Python

amenezes / aiopytesseract

A Python asyncio wrapper for Tesseract-OCR.

ocr tesseract text-extraction asyncio tesseract-ocr optical-character-recognition pdftotext pytesseract pytesseract-ocr

Updated Sep 22, 2025
Python
