A web-based article extractor and text-to-speech converter. Extract content from any URL and listen to articles with natural voice synthesis. Supports multiple extraction methods.
Updated Aug 9, 2025 - Python
Opinionated and Sophisticated Document Region Analyzer.
The metadata and text content extractor for almost every file type.
A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.
🤖 Automated Q&A Dataset Generation Pipeline powered by LLMs. Multi-stage pipeline that searches, filters, extracts and transforms web content into high-quality question-answer datasets for LLM training. Supports multiple LLM providers (Groq, Mistral, Ollama) and search engines.
A Python library for scraping technical documentation with ease — supports single-page and domain-wide crawling, multiple output formats (Markdown, JSON, TXT), and proxy-friendly setups. Ideal for AI, research, and content generation workflows.
A reusable Python scraper framework for efficiently crawling documentation websites and preparing content for AI training, with async support and structured output.
Extract and analyze content from various file formats, including PDFs, text files, and images.
Article parser for Habr, Proglib, and vc.ru that extracts the main content and removes ads and unnecessary elements while preserving proper formatting.
Transform PowerPoint presentations into LLM-optimized markdown while preserving instructional design narrative. Built for technical trainers.
Extract website content and analyze it with GPT.
Automatic clickbait detection and content extraction in social media.
Multi-process crawler that extracts main content and sustains itself by extracting more links to crawl.
WordPress Content Extractor: XML to Structured Text Converter
🦙 Production-ready MCP server for comprehensive research and analysis with web + academic search, content extraction, and optional React web interface
WebScraperAPI is a powerful web application that transforms any website into structured data using the Firecrawl API. It provides an intuitive interface for extracting specific information from websites and converting it into structured formats like JSON and CSV.
Extract content from Reddit, TikTok, articles, and YouTube.
Dash app to summarize YouTube & Reddit content with Ollama.