content-extraction

Here are 76 public repositories matching this topic...

jvcByte / text-to-speech

A web-based article extractor and text-to-speech converter. Extract content from any URL and listen to articles with natural voice synthesis. Supports multiple extraction methods.

audio text-to-speech web-scraping content-extraction voice-synthesis reading-assistant article-extraction text-to-speech-converter

Updated Aug 9, 2025
Python

Prakashmaheshwaran / docscraperforai

A Python library for scraping technical documentation with ease — supports single-page and domain-wide crawling, multiple output formats (Markdown, JSON, TXT), and proxy-friendly setups. Ideal for AI, research, and content generation workflows.

markdown documentation json crawler python-library requests web-scraping beautifulsoup content-extraction cli-tool proxy-support docscraper

Updated Apr 12, 2025
Python

vaisu-bhut / Theai

artificial-intelligence content-extraction mern-stack-development

Updated Jan 21, 2024
JavaScript

pdfix / pdfix_sdk_example_npm

Example project demonstrating how to use the PDFix SDK WebAssembly build in Node.js. Make PDF files accessible, extract data from PDFs, convert PDF to HTML, fill in PDF forms, stamp PDFs, and more.

nodejs html pdf sdk conversion tagging wasm pdf-converter pdf-forms extract-data autotag pdf-manipulation content-extraction remediation pdf-data-extraction pdf2html webassemply

Updated Feb 20, 2025
JavaScript

gokhaneraslan / llm-qa-dataset-pipeline

🤖 Automated Q&A Dataset Generation Pipeline powered by LLMs. Multi-stage pipeline that searches, filters, extracts and transforms web content into high-quality question-answer datasets for LLM training. Supports multiple LLM providers (Groq, Mistral, Ollama) and search engines.

nlp machine-learning natural-language-processing web-scraping question-answering dataset-generation content-extraction mistral document-processing qa-dataset groq automated-pipeline llm llama-index trafilatura ollama semantic-chunking crawl4ai ai-training-data

Updated Jun 7, 2025
Python

sebischair / LowestCommonAncestorExtractor

A Python content-extraction library for the structured extraction of Terms and Conditions from German and English online shops.

content-extraction

Updated May 9, 2022
Python

simonpierreboucher / Crawler

A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.

rate-limiting http-requests error-handling html-parsing data-collection text-processing web-crawling content-extraction yaml-configuration data-scraping python-crawler modular-design metadata-storage url-normalization pdf-text-extraction structured-data-storage concurrent-crawling data-extraction-pipeline data-preservation-and-recovery

Updated Nov 18, 2024
Python

WiseDodge / AIContextScraper

A reusable Python scraper framework for efficiently crawling documentation websites and preparing content for AI training, with async support and structured output.

python machine-learning scraper framework web-crawler web-scraper asyncio dataset-creation data-collection knowledge-base data-processing content-extraction ai-training documentation-scraper llm-training

Updated Sep 17, 2025
Python

AhmedZeyadTareq / Llama-Parse-Content-Extraction

Extract and analyze content from various file formats, including PDFs, text files, and images.

content-extraction file-processing rag pdf-parser-component document-parsing llama-index llamaparse

Updated Jul 6, 2025
Python

masud-technope / ContentSuggest-Replication-Package-CASCON2015

Recommending Relevant Sections from a Webpage About Programming Errors and Exceptions

dom-manipulation content-extraction replication-package content-suggest

Updated May 22, 2019
Hack

midstreeeam / peduncle

Content extraction from HTML.

content-extraction

Updated Sep 11, 2023
Python

danlikendy / articles_project

Article parser for Habr, Proglib, and vc.ru that extracts the main content and removes ads and unnecessary elements while preserving proper formatting.

python markdown web-scraping beautifulsoup text-processing content-extraction article-parser cli-tool proglibio habr

Updated Oct 2, 2025
Python

MichaelvanLaar / n8n-nodes-defuddle

n8n community node for extracting main content from webpages using the Defuddle library.

web-scraping readability content-extraction n8n n8n-community-node-package defuddle

Updated Oct 29, 2025
TypeScript

mrinshad / ChatPDF

Document processing and querying system built with FastAPI and React. Upload documents and interact with their content using natural language queries powered by the Gemini API and Unstructured.io.

python docker js reactjs content-extraction unstructured-data gemini-api router-react fastapi bootsrtap axios-react llm

Updated Nov 5, 2024
JavaScript

KunlinY / DistributedCrawlSystem

Distributed crawler system

java redis crawler content-extraction

Updated Jun 17, 2017
Java

HarryDulaney / news-feed-scraper

Configurable and schedulable web scraping tool, used to extract raw article content and metadata for aggregated news feeds.

scraper webscraper news-feed content-extraction web-automation news-feed-provider newsscraper scraperapi java-web-scraper

Updated Jan 2, 2023
Java

PixelGrace / smart-article-extractor

Article extraction, content scraping

content-extraction article-scraping puppeteer-automation news-data-mining academic-article-scraper journalism-research-tool fake-news-monitoring web-content-downloader dynamic-page-scraping json-csv-exporter

Updated Nov 10, 2025
JavaScript

chithraxx-0616 / AI_SUMMARIZER

A Chrome extension that summarizes articles using the Gemini API.

javascript css chrome-extension html text-analysis browser-extension web-extension user-friendly content-extraction gemini-api reading-tools generative-ai google-ai-studio summarizer-api

Updated Oct 21, 2025
HTML

cyanheads / jinaai-mcp-server

A Model Context Protocol (MCP) server that provides intelligent web reading capabilities using the Jina AI Reader API. It extracts clean, LLM-ready content from any URL.

agent mcp web-scraping content-extraction jina llm jinaai mcp-server modelcontextprotocol

Updated Sep 4, 2025
TypeScript

LmiLaugh / extended-gpt-scraper

Extracts website content and uses GPT for analysis.

sentiment-analysis web-scraping content-extraction web-automation playwright openai-api markdown-formatting website-scraping gpt-api ai-driven-text-processing content-proofreading

Updated Nov 11, 2025
Python
