text-extraction

Here are 21 public repositories matching this topic...

Goldziher / kreuzberg

Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.

python ocr async mcp pandoc tesseract text-extraction metadata-extraction table-extraction pdfium rag pdf-extraction document-intelligence

Updated Nov 5, 2025
HTML

Goldziher / html-to-markdown

HTML to markdown converter

text-extraction html-converter text-processing markdown-converter rag

Updated Nov 5, 2025
HTML

pd3f / pd3f

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

python pdf machine-learning ocr pipeline text-extraction pdf-to-text language-model extract-text parsr pd3f

Updated Oct 13, 2023
HTML

bookieio / breadability

Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

python text-mining text-extraction html-parsing html-extraction html-extractor

Updated May 9, 2024
HTML

victorqribeiro / ocr

Simple app to extract text from pictures using Tesseract

ocr tesseract text-extraction text-recognition image-recognition

Updated Jul 19, 2021
HTML

AndyTheFactory / article-extraction-dataset

Article title, authors, date and body extraction dataset.

text-mining news html-to-markdown scraping corpus news-aggregator text-extraction dataset web-scraping readability datasets scraping-websites html2text news-crawler corpus-builder corpus-tools article-extractor text-cleaning text-preprocessing

Updated Mar 26, 2024
HTML

lihanghang / TecRoom

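Online summary documentation of a technology stack, covering programming languages, data structures and algorithms, machine learning, databases, and more.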

java nlp docker machine-learning text-mining data-mining tools deep-learning design-patterns python3 text-extraction artificial-intelligence data-structures coding summary llm

Updated Oct 28, 2025
HTML

nikolay-malygin / snap-text

A simple web application built with React that lets you upload images containing text, select the language of the text for recognition, and extract the text from the image. As quick as a finger snap: SnapText.

react reactjs web-application text-extraction text-recognition copy-to-clipboard multi-language-support simple-app copy-text-to-clipboard text-extraction-from-image copy-result

Updated Dec 10, 2023
HTML

mazzasaverio / url2md4ai

Lean Python tool for extracting clean, LLM-optimized markdown from web pages. Handles dynamic content with Playwright + Trafilatura for maximum information extraction efficiency.

html-to-markdown text-extraction openai playwright html-to-markdown-converter trafilatura

Updated Jul 6, 2025
HTML

importcjj / go-readability

Go package that cleans an HTML page for better readability.

go html golang text extractor text-extraction readability html2text html-extractor

Updated Aug 1, 2023
HTML

devsteppe9 / tesseract-quick-implementation

Tesseract-OCR quick implementation. Linked with a Stack Overflow question.

tesseract text-extraction tesseract-ocr pyinstaller tesseract-4 tesseract-python

Updated Nov 26, 2019
HTML

sharmaroshan / Text-Classification

A project assignment in which I learned to classify different texts using clustering techniques. It combines natural language processing with clustering, using K-means clustering to implement the solution.

python natural-language-processing text-mining numpy text-analysis pandas text-extraction nltk bag-of-words tf-idf text-processing jupyter-notebooks text-cleaning

Updated Aug 18, 2019
HTML

MaarkNassef / GraduationProject

HR Assistant: Web application for efficient HR recruitment and resume management. Utilizes OCR for text extraction and similarity analysis to rearrange resumes based on job descriptions. Simplifies the hiring process for HR recruiters and enhances candidate selection.

python resume flask text-extraction hr similarity-measures recruitment pyotp ocr-python

Updated Jul 11, 2023
HTML

xsukax / xsukax-ReadClean-PDF

A privacy-focused, client-side web application that extracts clean, readable content from any webpage and converts it to PDF format. Built with pure HTML, CSS, and JavaScript—no backend required, no tracking, complete privacy.

Updated Oct 5, 2025
HTML

zanachka / html-text

Extract text from HTML

text-extraction html-extraction

Updated Oct 21, 2020
HTML

bansal-yash / COP290-Design-Practices

Course projects of COP290: Design Practices, a course at IIT Delhi under Professor Huzur Saran

tiled-map-editor pygame text-extraction trading-strategies trading-simulator

Updated Dec 1, 2024
HTML

Rexaintreal / Lynx

Lynx is a project combining several smaller OpenCV initiatives developed for the Hackberry YSWS event, featuring various image processing functionalities on its website.

Updated Oct 12, 2025
HTML

bektade / NaturalLanguageProcessing

Collection of NLP projects from classwork.

nlp information-retrieval text-extraction

Updated Jun 28, 2023
HTML

RobertSloan22 / projects001

Version 0.1 of Planned Dashboard for Dashboards

google text text-extraction iframe adas google-streetview

Updated May 18, 2023
HTML

sanidhyajadaun / MediLink

MediLink is a web application that revolutionizes health record management by seamlessly integrating NLP techniques for handwritten text extraction on prescriptions and blockchain technology for secure data storage.

nlp blockchain text-extraction health-record-management

Updated Dec 6, 2023
HTML
