Extract website content, use GPT for analysis
Updated Nov 11, 2025 - Python
Automatically scrape metadata from websites
Extract content from Reddit, TikTok, articles, and YouTube
🔄 Extract and convert WordPress export files to Markdown, CSV, and JSON formats with intelligent HTML parsing and code block detection
Lightweight Dash app to summarize YouTube & Reddit content with Ollama.
Article parser for Habr, Proglib, and vc.ru that extracts main content, removes ads and unnecessary elements, preserving proper formatting
A reusable Python scraper framework for efficiently crawling documentation websites and preparing content for AI training, with async support and structured output.
Mobile First Indexing Tool
🕷️ Advanced news scraper for major Bangladeshi newspapers with cross-platform support, comprehensive logging, and intelligent data extraction. Features API-based scraping, date filtering, real-time monitoring, and automated export tools for ProthomAlo, Daily Sun, Daily Ittefaq, BD Pratidin, Bangladesh Today, and The Daily Star.
A web-based article extractor and text-to-speech converter. Extract content from any URL and listen to articles with natural voice synthesis. Supports multiple extraction methods.
Transform PowerPoint presentations into LLM-optimized markdown while preserving instructional design narrative. Built for technical trainers.
Extract and analyze content from various file formats, including PDFs, text files, and images.
🦙 Production-ready MCP server for comprehensive research and analysis with web + academic search, content extraction, and optional React web interface
🤖 Automated Q&A Dataset Generation Pipeline powered by LLMs. Multi-stage pipeline that searches, filters, extracts and transforms web content into high-quality question-answer datasets for LLM training. Supports multiple LLM providers (Groq, Mistral, Ollama) and search engines.
Crawler for MIT OCW - organized scraping with batch control, timers, and content deduplication.
Opinionated and Sophisticated Document Region Analyzer.
A Python library for scraping technical documentation with ease — supports single-page and domain-wide crawling, multiple output formats (Markdown, JSON, TXT), and proxy-friendly setups. Ideal for AI, research, and content generation workflows.