content-extraction

Here are 33 public repositories matching this topic...

jvcByte / text-to-speech

A web-based article extractor and text-to-speech converter. Extract content from any URL and listen to articles with natural voice synthesis. Supports multiple extraction methods.

audio text-to-speech web-scraping content-extraction voice-synthesis reading-assistant article-extraction text-to-speech-converter

Updated Aug 9, 2025
Python

rithulkamesh / docproc

Opinionated and Sophisticated Document Region Analyzer.

python machine-learning ocr text-classification text-extraction data-extraction region-detection content-extraction document-analysis layout-analysis pdf-processing pdf-text-extraction document-parsing equation-detection mathematical-symbols

Updated Apr 13, 2025
Python

baughmann / tikara

The metadata and text content extractor for almost every file type.

java metadata natural-language-processing text-mining ocr ml language-detection text-extraction docx pdf-to-text image-to-text content-extraction metadata-extraction apache-tika document-processing llm document-parsing retrieval-augmented-generation

Updated Feb 3, 2025
Python

simonpierreboucher / Crawler

A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.

rate-limiting http-requests error-handling html-parsing data-collection text-processing web-crawling content-extraction yaml-configuration data-scraping python-crawler modular-design metadata-storage url-normalization pdf-text-extraction structured-data-storage concurrent-crawling data-extraction-pipeline data-preservation-and-recovery

Updated Nov 18, 2024
Python

gokhaneraslan / llm-qa-dataset-pipeline

🤖 Automated Q&A Dataset Generation Pipeline powered by LLMs. Multi-stage pipeline that searches, filters, extracts and transforms web content into high-quality question-answer datasets for LLM training. Supports multiple LLM providers (Groq, Mistral, Ollama) and search engines.

nlp machine-learning natural-language-processing web-scraping question-answering dataset-generation content-extraction mistral document-processing qa-dataset groq automated-pipeline llm llama-index trafilatura ollama semantic-chunking crawl4ai ai-training-data

Updated Jun 7, 2025
Python

Prakashmaheshwaran / docscraperforai

A Python library for scraping technical documentation with ease — supports single-page and domain-wide crawling, multiple output formats (Markdown, JSON, TXT), and proxy-friendly setups. Ideal for AI, research, and content generation workflows.

markdown documentation json crawler python-library requests web-scraping beautifulsoup content-extraction cli-tool proxy-support docscraper

Updated Apr 12, 2025
Python

WiseDodge / AIContextScraper

A reusable Python scraper framework for efficiently crawling documentation websites and preparing content for AI training, with async support and structured output.

python machine-learning scraper framework web-crawler web-scraper asyncio dataset-creation data-collection knowledge-base data-processing content-extraction ai-training documentation-scraper llm-training

Updated Sep 17, 2025
Python

AhmedZeyadTareq / Llama-Parse-Content-Extraction

Extract and analyze content from various file formats, including PDFs, text files, and images.

content-extraction file-processing rag pdf-parser-component document-parsing llama-index llamaparse

Updated Jul 6, 2025
Python

midstreeeam / peduncle

Content extraction from HTML.

content-extraction

Updated Sep 11, 2023
Python

danlikendy / articles_project

Article parser for Habr, Proglib, and vc.ru that extracts the main content and removes ads and unnecessary elements while preserving proper formatting.

python markdown web-scraping beautifulsoup text-processing content-extraction article-parser cli-tool proglibio habr

Updated Oct 2, 2025
Python

mohammad6706 / export-data-url

python html api python3 requests content-extraction beatifulsoup fastapi beatifulsoup4

Updated May 20, 2025
Python

timothywarner-org / pptx-shredder

Transform PowerPoint presentations into LLM-optimized markdown while preserving instructional design narrative. Built for technical trainers.

python training markdown powerpoint content-extraction education-technology instructional-design technical-training llm

Updated Jul 14, 2025
Python

LmiLaugh / extended-gpt-scraper

Extract website content and use GPT for analysis.

sentiment-analysis web-scraping content-extraction web-automation playwright openai-api markdown-formatting website-scraping gpt-api ai-driven-text-processing content-proofreading

Updated Nov 11, 2025
Python

fuvidani / clickbait-defeater

Automatic clickbait detection and content extraction in social media.

kotlin python chrome-extension docker machine-learning facebook spring-boot docker-compose clickbait microservices-architecture content-extraction spring5-webflux

Updated Oct 30, 2018
Python

rmwkwok / crawler

Multi-process crawler that extracts main content and sustains itself by extracting more links to crawl.

crawler content-extraction multiprocess

Updated Mar 18, 2021
Python

01one / wordpress-content-extractor

WordPress Content Extractor: XML to Structured Text Converter

xml-parser content-extraction python-utility content-migration web-content wordpress-export wordpress-to-text

Updated Apr 6, 2025
Python

DeepKariaX / Analysis-Alpaca-Researcher

🦙 Production-ready MCP server for comprehensive research and analysis with web + academic search, content extraction, and optional React web interface

react research ai duckduckgo openai content-extraction claude web-search fastapi groq semantic-scholar academic-search llm anthropic mcp-server

Updated Jun 9, 2025
Python

Aish-p / WebScraperAPI

WebScraperAPI is a powerful web application that transforms any website into structured data using the Firecrawl API. It provides an intuitive interface for extracting specific information from websites and converting it into structured formats like JSON and CSV.

web-scraping developer-tools data-processing content-extraction

Updated Feb 20, 2025
Python

prompt-stack / content-engine

Extract content from Reddit, TikTok, articles, and YouTube.

python api scraping content-extraction llm

Updated Oct 31, 2025
Python

Noe-AC / url2tldr

Dash app to summarize YouTube & Reddit content with Ollama.

python nlp youtube reddit side-project tldr dash web-scraping webapp text-summarization summarization content-extraction llm ollama

Updated Nov 13, 2025
Python
