content-extraction

Here are 31 public repositories matching this topic...

gdamdam / sumo

Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries, and more

nlp nltk automatic-summarization content-extraction semantic-analysis sentence-extraction entity-recognition

Updated Jan 15, 2019
Python

timoteostewart / benson

Benson turns a list of URLs into mp3s of the contents of each web page - take control over your reading backlog!

productivity web-scraping content-extraction boilerplate-removal

Updated Oct 30, 2024
Python

bencmc / youtube_video_summarizer

This repository houses a Python application for extracting YouTube video transcripts and summarizing their content.

python natural-language-processing youtube-api video-processing openai text-summarization text-processing natural content-extraction streamlit transcript-analysis gpt-35-turbo langchain-python

Updated Sep 29, 2023
Python

zeoagency / mobile-first-indexing-tool

Mobile First Indexing Tool

aws-lambda seo mfi content-extraction lighthouse seo-tool aws-layers

Updated Sep 10, 2025
Python

leroyanders / acrticle-scrapper

This Python-based repository hosts a sophisticated service designed for scraping web articles and converting them into Markdown format. The core functionality of this service includes extracting the main content of articles, such as headlines, key paragraphs, and associated images, and then seamlessly transforming this content into well-structured…

python web-scraping content-extraction metadata-extraction article-parser markdown-conversion image-downloading data-archiving html-to-markdown-converter content-creation-tools

Updated Feb 19, 2024
Python

baughmann / tikara

The metadata and text content extractor for almost every file type.

java metadata natural-language-processing text-mining ocr ml language-detection text-extraction docx pdf-to-text image-to-text content-extraction metadata-extraction apache-tika document-processing llm document-parsing retrieval-augmented-generation

Updated Feb 3, 2025
Python

TypesetIO / jsuite

Tools for parsing and manipulating JATS XML documents.

xml-schema content-extraction

Updated Jul 6, 2022
Python

rmwkwok / crawler

Multi-process crawler that extracts main content and sustains itself by extracting more links to crawl.

crawler content-extraction multiprocess

Updated Mar 18, 2021
Python

rithulkamesh / docproc

Opinionated and Sophisticated Document Region Analyzer.

python machine-learning ocr text-classification text-extraction data-extraction region-detection content-extraction document-analysis layout-analysis pdf-processing pdf-text-extraction document-parsing equation-detection mathematical-symbols

Updated Apr 13, 2025
Python

timothywarner-org / pptx-shredder

Transform PowerPoint presentations into LLM-optimized markdown while preserving instructional design narrative. Built for technical trainers.

python training markdown powerpoint content-extraction education-technology instructional-design technical-training llm

Updated Jul 14, 2025
Python

YanivHaliwa / Url-To-Text

data-mining web-scraping cybersecurity content-extraction linux-tools

Updated Jul 11, 2025
Python

Ashad001 / MITCrawlerX

Crawler for MIT OCW - organized scraping with batch control, timers, and content deduplication.

python computer-science education automation web-scraping batch-processing content-extraction open-education mit-ocw

Updated Jun 1, 2025
Python

mohammad6706 / export-data-url

python html api python3 requests content-extraction beatifulsoup fastapi beatifulsoup4

Updated May 20, 2025
Python

EhsanulHaqueSiam / BDNewsPaperScraper

🕷️ Advanced news scraper for major Bangladeshi newspapers with cross-platform support, comprehensive logging, and intelligent data extraction. Features API-based scraping, date filtering, real-time monitoring, and automated export tools for ProthomAlo, Daily Sun, Daily Ittefaq, BD Pratidin, Bangladesh Today, and The Daily Star.

Updated Aug 30, 2025
Python

Aish-p / WebScraperAPI

WebScraperAPI is a powerful web application that transforms any website into structured data using the Firecrawl API. It provides an intuitive interface for extracting specific information from websites and converting it into structured formats like JSON and CSV.

web-scraping developer-tools data-processing content-extraction

Updated Feb 20, 2025
Python

jvcByte / text-to-speech

A web-based article extractor and text-to-speech converter. Extract content from any URL and listen to articles with natural voice synthesis. Supports multiple extraction methods.

audio text-to-speech web-scraping content-extraction voice-synthesis reading-assistant article-extraction text-to-speech-converter

Updated Aug 9, 2025
Python

Prakashmaheshwaran / docscraperforai

A Python library for scraping technical documentation with ease — supports single-page and domain-wide crawling, multiple output formats (Markdown, JSON, TXT), and proxy-friendly setups. Ideal for AI, research, and content generation workflows.

markdown documentation json crawler python-library requests web-scraping beautifulsoup content-extraction cli-tool proxy-support docscraper

Updated Apr 12, 2025
Python

gokhaneraslan / llm-qa-dataset-pipeline

🤖 Automated Q&A Dataset Generation Pipeline powered by LLMs. Multi-stage pipeline that searches, filters, extracts and transforms web content into high-quality question-answer datasets for LLM training. Supports multiple LLM providers (Groq, Mistral, Ollama) and search engines.

nlp machine-learning natural-language-processing web-scraping question-answering dataset-generation content-extraction mistral document-processing qa-dataset groq automated-pipeline llm llama-index trafilatura ollama semantic-chunking crawl4ai ai-training-data

Updated Jun 7, 2025
Python

sebischair / LowestCommonAncestorExtractor

A Python content extraction library for the structured extraction of Terms and Conditions from German and English online shops

content-extraction

Updated May 9, 2022
Python

simonpierreboucher / Crawler

A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.

rate-limiting http-requests error-handling html-parsing data-collection text-processing web-crawling content-extraction yaml-configuration data-scraping python-crawler modular-design metadata-storage url-normalization pdf-text-extraction structured-data-storage concurrent-crawling data-extraction-pipeline data-preservation-and-recovery

Updated Nov 18, 2024
Python
