A web-based article extractor and text-to-speech converter. Extract content from any URL and listen to articles with natural voice synthesis. Supports multiple extraction methods.
Updated Aug 9, 2025 - Python
Example project demonstrating how to use the PDFix SDK WebAssembly build in Node.js. Make PDF files accessible, extract data from PDF, convert PDF to HTML, fill in PDF forms, stamp PDFs, and more.
The metadata and text content extractor for almost every file type.
Dust library for HTML processing
Opinionated and Sophisticated Document Region Analyzer.
Simple web crawler written in Go that extracts content via text density
A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.
📋 WebMD is a Chrome extension that transforms web pages into Markdown documents with surgical precision.
🤖 Automated Q&A Dataset Generation Pipeline powered by LLMs. Multi-stage pipeline that searches, filters, extracts and transforms web content into high-quality question-answer datasets for LLM training. Supports multiple LLM providers (Groq, Mistral, Ollama) and search engines.
A web application that scrapes web pages, extracts main content, and uses OpenLLaMA to convert the content into specified formats.
A Python library for scraping technical documentation with ease — supports single-page and domain-wide crawling, multiple output formats (Markdown, JSON, TXT), and proxy-friendly setups. Ideal for AI, research, and content generation workflows.
A reusable Python scraper framework for efficiently crawling documentation websites and preparing content for AI training, with async support and structured output.
Extract and analyze content from various file formats, including PDFs, text files, and images.
Configurable and schedulable web scraping tool. Used to extract raw article content and metadata for aggregated news feeds.
Recommending Relevant Sections from a Webpage About Programming Errors and Exceptions
Article parser for Habr, Proglib, and vc.ru that extracts main content and removes ads and unnecessary elements while preserving proper formatting
n8n community node for extracting main content from webpages using the Defuddle library
Document processing and querying system built with FastAPI and React. Upload documents and interact with their content using natural language queries powered by Gemini API and Unstructured.io