text-extraction

Here are 384 public repositories matching this topic...

miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.

python nlp pagerank-algorithm text-extraction reduction summarization html-page summary lsa sumy textteaser summarizer html-extraction html-extractor

Updated Sep 8, 2025
Python

adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Updated Sep 12, 2025
Python

unidoc / unipdf

Golang PDF library for creating and processing PDF files (pure go)

golang pdf signing text-extraction pdf-generator pdf-generation pdf-reader pdf-manipulation pdf-library pdf-document-processor pdf-compression pdf-sign pdf-reports

Updated Oct 9, 2025
Go

chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Updated Apr 14, 2025
Python

whitelok / image-text-localization-recognition

A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition

machine-learning awesome ocr deep-learning text-extraction text-recognition deep-learning-algorithms convolutional-neural-networks text-detection scene-texts

Updated Sep 17, 2023

Goldziher / kreuzberg

Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.

python ocr async mcp pandoc tesseract text-extraction metadata-extraction table-extraction pdfium rag pdf-extraction document-intelligence

Updated Nov 10, 2025
HTML

flairNLP / fundus

A very simple news crawler with a funny name

python nlp rss sitemap crawler scraper corpus text-extraction web-scraping image-classification datasets news-crawler corpus-tools commoncrawl web-corpus news-scraping cc-news image-extraction

Updated Nov 9, 2025
Python

unidoc / unidoc

This repository has moved! https://github.com/unidoc/unipdf

golang pdf text-extraction pdf-files pdf-invoice unidoc pdf-library

Updated May 23, 2019
Go

miso-belica / jusText

Heuristic based boilerplate removal tool

python text-extraction html-parser html-parsing

Updated Feb 25, 2025
Python

vsymbol / CUTIE

CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)

computer-vision deep-learning text-extraction

Updated Dec 8, 2022
Python

ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents

r text-extraction rstats pdf-files r-package poppler pdf-format poppler-library pdftools

Updated Sep 9, 2025
C++

ICIJ / datashare

A self‑hosted search engine for documents

docker elasticsearch extract text-extraction named-entity-recognition web-gui datashare investigative-journalism

Updated Nov 7, 2025
Java

iamarunbrahma / vision-parse

Parse PDFs into markdown using Vision LLMs

text-extraction pdf-parser document-parser pdf-to-markdown

Updated Oct 4, 2025
Python

nainiayoub / pdf-text-data-extractor

PDF text data extraction web app with OCR for scanned documents

python pdf ocr text-extraction pdf-to-text ocr-text-reader ocr-python streamlit streamlit-webapp

Updated Jun 5, 2024
Python

cdown / srt

A simple library and set of tools for parsing, modifying, and composing SRT files.

python library tools command-line text-extraction subtitles subtitle srt subtitles-parsing mit-license command-line-tool subtitle-parser subtitle-fixer

Updated Mar 19, 2024
Python

skylander86 / lambda-text-extractor

AWS Lambda functions to extract text from various binary formats.

pdf ocr aws-lambda lambda-functions tesseract text-extraction searchable-pdfs pdf-ocr-extraction

Updated Feb 7, 2018
Python

pd3f / pd3f

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

python pdf machine-learning ocr pipeline text-extraction pdf-to-text language-model extract-text parsr pd3f

Updated Oct 13, 2023
HTML

Goldziher / html-to-markdown

HTML to markdown converter

text-extraction html-converter text-processing markdown-converter rag

Updated Nov 10, 2025
HTML

archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

scala big-data spark apache-spark hadoop analysis python3 text-extraction pyspark digital-humanities dataframe big-data-analytics webarchives network-graphing

Updated Feb 27, 2024
Scala

rajesh-bhat / spark-ai-summit-2020-text-extraction

keras cnn text-extraction lstm text-recognition text-detection summit ctc-loss spark-ai

Updated Dec 7, 2020
Jupyter Notebook
