# html-extraction

Here are 19 public repositories matching this topic...

shmdoc / unit-parser

Script for extracting units from http://vocab.nerc.ac.uk/collection/P06/current/ to easily add units to the database (this should only be temporary, to demonstrate how units can work)

linked-open-data html-extraction

Updated Jul 27, 2020
HTML

Whomrx666 / Xtract-htmlV2

Xtract-htmlV2 is a tool for fetching the HTML code of a website of your choice; it is the successor to the previous version

linux extract termux kali-linux html-extraction html-extractor termux-tool xtract-htmlv2

Updated Oct 16, 2025
Python

zanachka / extruct

Extract embedded metadata from HTML markup

text-extraction html-extraction

Updated Mar 24, 2025
Python

jhontron6 / wordpress-bs4-theme-elements-scraper

WordPress BS4 theme extractor

python theme wordpress scraper beautifulsoup bs4 elements html-extraction website-structure-analysis cms-theme-replication ui-component-scraping page-template-reverse-engineering

Updated Dec 5, 2025

zanachka / article-extraction-benchmark

Article extraction benchmark: dataset and evaluation scripts

text-extraction html-extraction

Updated Mar 1, 2026
Python

zanachka / python-readability

fast python port of arc90's readability tool, updated to match latest readability.js!

text-extraction html-extraction

Updated Jan 26, 2026
Python

zanachka / html-text

Extract text from HTML

text-extraction html-extraction

Updated Feb 10, 2026
HTML

9dl / HTML-Dumper

extracts and saves HTML, CSS, and JavaScript files from a specified URL.

web-scraping html-extraction

Updated Oct 14, 2024
C#

RayenMalouche / MCP-PDF-Extractor-server

A Java-based server leveraging Apache Tika to extract content and metadata from files (PDF, DOCX, TXT, etc.) in a local files-to-extract directory. Supports HTML (with CSS styling) and text extraction, file listing, and metadata retrieval via MCP-compliant tools and REST APIs. Built with Spring Boot, Jetty, and MCP SDK.

java html pdf parser mcp extractor pdf-extractor html-extraction html-extractor pdf-extraction mcp-server modelcontextprotocol extractor-to-html

Updated Aug 30, 2025
Java

zanachka / price-parser

Extract price amount and currency symbol from a raw text string

text-extraction html-extraction

Updated Mar 19, 2026
Python

zanachka / number-parser

Parse numbers written in natural language

text-extraction html-extraction

Updated Oct 25, 2024
Python

zanachka / dateparser

python parser for human readable dates

text-extraction html-extraction

Updated Mar 25, 2026
Python

romny5 / reasonkit-web

🌐 Build high-performance web sensing and browser automation tools with ReasonKit Web, a Rust-native implementation for efficient solutions.

rust pdf screenshot async chromium tokio web-scraping developer-tools cdp web-automation chrome-devtools-protocol html-extraction headless-browser ai-agent llm-tools agent-tools model-context-protocol

Updated Mar 25, 2026
Rust

zanachka / jusText

Heuristic based boilerplate removal tool

text-extraction html-extraction

Updated Oct 21, 2020
Python

Whomrx666 / Xtract-html

Xtract-html is a tool for extracting the HTML display code of a website, which you can also use for your own website.

linux html termux kali-linux html-extraction html-extractor termux-tool xtract-html

Updated Oct 16, 2025
Python

reasonkit / reasonkit-web

High-performance MCP server for browser automation, web capture, and content extraction. Rust-powered CDP client for AI agents.

Updated Mar 23, 2026
Rust

html-extract / hext

Domain-specific language for extracting structured data from HTML documents

python html node cpp dsl scraping data-extraction html-extraction

Updated Oct 16, 2025
C++

bookieio / breadability

Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

python text-mining text-extraction html-parsing html-extraction html-extractor

Updated May 9, 2024
HTML

miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.

python nlp pagerank-algorithm text-extraction reduction summarization html-page summary lsa sumy textteaser summarizer html-extraction html-extractor

Updated Feb 14, 2026
Python
