Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.
Updated Nov 5, 2025 - HTML
HTML to markdown converter
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Reworked version of the https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the active alternative)
Simple app to extract text from pictures using Tesseract
Article title, authors, date and body extraction dataset.
Online summary documentation for a technology stack, covering programming languages, data structures and algorithms, machine learning, databases, and more.
A simple web application built with React that lets you upload images containing text, select the language to recognize, and extract the text from the image. As quick as a finger snap - SnapText.
Lean Python tool for extracting clean, LLM-optimized markdown from web pages. Handles dynamic content with Playwright + Trafilatura for maximum information extraction efficiency.
Go package that cleans an HTML page for better readability.
Quick Tesseract OCR implementation, linked to a Stack Overflow question.
A project assignment in which I learned to classify different texts using clustering techniques. It combines natural language processing with clustering, using k-means to solve the problem.
HR Assistant: Web application for efficient HR recruitment and resume management. Utilizes OCR for text extraction and similarity analysis to rearrange resumes based on job descriptions. Simplifies the hiring process for HR recruiters and enhances candidate selection.
A privacy-focused, client-side web application that extracts clean, readable content from any webpage and converts it to PDF format. Built with pure HTML, CSS, and JavaScript—no backend required, no tracking, complete privacy.
Course projects for COP290: Design Practices at IIT Delhi, taught by Professor Huzur Saran
Lynx is a project combining several smaller OpenCV initiatives developed for the Hackberry YSWS event, featuring various image processing functionalities on its website.
Collection of NLP projects from classwork.
Version 0.1 of Planned Dashboard for Dashboards
MediLink is a web application that revolutionizes health record management by seamlessly integrating NLP techniques for handwritten text extraction on prescriptions and blockchain technology for secure data storage.