Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries, and more
Updated Jan 15, 2019 - Python
This repository houses a Python application for extracting YouTube video transcripts and summarizing their content.
Tools for parsing and manipulating JATS XML documents.
A Python content-extraction library for the structured extraction of Terms and Conditions from German and English online shops
Mobile First Indexing Tool
This Python-based repository hosts a sophisticated service designed for scraping web articles and converting them into Markdown format. The core functionality of this service includes extracting the main content of articles, such as headlines, key paragraphs, and associated images, and then seamlessly transforming this content into well-structured…
Crawler for MIT OCW - organized scraping with batch control, timers, and content deduplication.
Benson turns a list of URLs into mp3s of the contents of each web page - take control over your reading backlog!
🕷️ Advanced news scraper for major Bangladeshi newspapers with cross-platform support, comprehensive logging, and intelligent data extraction. Features API-based scraping, date filtering, real-time monitoring, and automated export tools for ProthomAlo, Daily Sun, Daily Ittefaq, BD Pratidin, Bangladesh Today, and The Daily Star.
A web-based article extractor and text-to-speech converter. Extract content from any URL and listen to articles with natural voice synthesis. Supports multiple extraction methods.
A Python library for scraping technical documentation with ease — supports single-page and domain-wide crawling, multiple output formats (Markdown, JSON, TXT), and proxy-friendly setups. Ideal for AI, research, and content generation workflows.
🤖 Automated Q&A Dataset Generation Pipeline powered by LLMs. Multi-stage pipeline that searches, filters, extracts and transforms web content into high-quality question-answer datasets for LLM training. Supports multiple LLM providers (Groq, Mistral, Ollama) and search engines.
The metadata and text content extractor for almost every file type.
A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.
A reusable Python scraper framework for efficiently crawling documentation websites and preparing content for AI training, with async support and structured output.
Extract and analyze content from various file formats, including PDFs, text files, and images.
Opinionated and Sophisticated Document Region Analyzer.
Article parser for Habr, Proglib, and vc.ru that extracts main content, removes ads and unnecessary elements, preserving proper formatting
Transform PowerPoint presentations into LLM-optimized markdown while preserving instructional design narrative. Built for technical trainers.