content-extraction

Here are 33 public repositories matching this topic...

LmiLaugh / extended-gpt-scraper

Extract website content, use GPT for analysis

sentiment-analysis web-scraping content-extraction web-automation playwright openai-api markdown-formatting website-scraping gpt-api ai-driven-text-processing content-proofreading

Updated Nov 11, 2025
Python

Tom0985 / metadata-scraper

Automatically scrape metadata from websites

json-data web-scraping web-crawling content-extraction metadata-extraction website-analysis url-pattern-matching content-scraping seo-data pagination-handling

Updated Nov 10, 2025
Python

prompt-stack / content-engine

Extract content from reddit, tiktok, articles, youtube

python api scraping content-extraction llm

Updated Oct 31, 2025
Python

MacphersonDesigns / wp-rest-dumper

Scraper for WordPress sites.

python wordpress scraper backup rest-api content-extraction

Updated Oct 31, 2025
Python

legendmohe / wordpress-content-extractor

🔄 Extract and convert WordPress export files to Markdown, CSV, and JSON formats with intelligent HTML parsing and code block detection

python markdown wordpress xml-parser content-extraction blog-migration

Updated Oct 10, 2025
Python

Noe-AC / url2tldr

Lightweight Dash app to summarize YouTube & Reddit content with Ollama.

python nlp youtube reddit side-project tldr dash web-scraping webapp text-summarization summarization content-extraction llm ollama

Updated Oct 9, 2025
Python

danlikendy / articles_project

Article parser for Habr, Proglib, and vc.ru that extracts main content, removes ads and unnecessary elements, preserving proper formatting

python markdown web-scraping beautifulsoup text-processing content-extraction article-parser cli-tool proglibio habr

Updated Oct 2, 2025
Python

WiseDodge / AIContextScraper

A reusable Python scraper framework for efficiently crawling documentation websites and preparing content for AI training, with async support and structured output.

python machine-learning scraper framework web-crawler web-scraper asyncio dataset-creation data-collection knowledge-base data-processing content-extraction ai-training documentation-scraper llm-training

Updated Sep 17, 2025
Python

zeoagency / mobile-first-indexing-tool

Mobile First Indexing Tool

aws-lambda seo mfi content-extraction lighthouse seo-tool aws-layers

Updated Sep 10, 2025
Python

EhsanulHaqueSiam / BDNewsPaperScraper

🕷️ Advanced news scraper for major Bangladeshi newspapers with cross-platform support, comprehensive logging, and intelligent data extraction. Features API-based scraping, date filtering, real-time monitoring, and automated export tools for ProthomAlo, Daily Sun, Daily Ittefaq, BD Pratidin, Bangladesh Today, and The Daily Star.

Updated Aug 30, 2025
Python

jvcByte / text-to-speech

A web-based article extractor and text-to-speech converter. Extract content from any URL and listen to articles with natural voice synthesis. Supports multiple extraction methods.

audio text-to-speech web-scraping content-extraction voice-synthesis reading-assistant article-extraction text-to-speech-converter

Updated Aug 9, 2025
Python

timothywarner-org / pptx-shredder

Transform PowerPoint presentations into LLM-optimized markdown while preserving instructional design narrative. Built for technical trainers.

python training markdown powerpoint content-extraction education-technology instructional-design technical-training llm

Updated Jul 14, 2025
Python

YanivHaliwa / Url-To-Text

data-mining web-scraping cybersecurity content-extraction linux-tools

Updated Jul 11, 2025
Python

AhmedZeyadTareq / Llama-Parse-Content-Extraction

Extract and analyze content from various file formats, including PDFs, text files, and images.

content-extraction file-processing rag pdf-parser-component document-parsing llama-index llamaparse

Updated Jul 6, 2025
Python

DeepKariaX / Analysis-Alpaca-Researcher

🦙 Production-ready MCP server for comprehensive research and analysis with web + academic search, content extraction, and optional React web interface

react research ai duckduckgo openai content-extraction claude web-search fastapi groq semantic-scholar academic-search llm anthropic mcp-server

Updated Jun 9, 2025
Python

gokhaneraslan / llm-qa-dataset-pipeline

🤖 Automated Q&A Dataset Generation Pipeline powered by LLMs. Multi-stage pipeline that searches, filters, extracts and transforms web content into high-quality question-answer datasets for LLM training. Supports multiple LLM providers (Groq, Mistral, Ollama) and search engines.

nlp machine-learning natural-language-processing web-scraping question-answering dataset-generation content-extraction mistral document-processing qa-dataset groq automated-pipeline llm llama-index trafilatura ollama semantic-chunking crawl4ai ai-training-data

Updated Jun 7, 2025
Python

Ashad001 / MITCrawlerX

Crawler for MIT OCW - organized scraping with batch control, timers, and content deduplication.

python computer-science education automation web-scraping batch-processing content-extraction open-education mit-ocw

Updated Jun 1, 2025
Python

mohammad6706 / export-data-url

python html api python3 requests content-extraction beatifulsoup fastapi beatifulsoup4

Updated May 20, 2025
Python

rithulkamesh / docproc

Opinionated and Sophisticated Document Region Analyzer.

python machine-learning ocr text-classification text-extraction data-extraction region-detection content-extraction document-analysis layout-analysis pdf-processing pdf-text-extraction document-parsing equation-detection mathematical-symbols

Updated Apr 13, 2025
Python

Prakashmaheshwaran / docscraperforai

A Python library for scraping technical documentation, with support for single-page and domain-wide crawling, multiple output formats (Markdown, JSON, TXT), and proxy-friendly setups. Ideal for AI, research, and content generation workflows.

markdown documentation json crawler python-library requests web-scraping beautifulsoup content-extraction cli-tool proxy-support docscraper

Updated Apr 12, 2025
Python
