Script for extracting units from http://vocab.nerc.ac.uk/collection/P06/current/ to easily add units to the database (this is only a temporary measure to demonstrate how units can work)
Updated Jul 27, 2020 - HTML
Xtract-htmlV2 is a tool for fetching the HTML source of any website you choose; it is the successor to the previous version
Extract embedded metadata from HTML markup
WordPress BS4 theme extractor
Article extraction benchmark: dataset and evaluation scripts
Fast Python port of arc90's readability tool, updated to match the latest readability.js!
Extracts and saves HTML, CSS, and JavaScript files from a specified URL.
A Java-based server leveraging Apache Tika to extract content and metadata from files (PDF, DOCX, TXT, etc.) in a local files-to-extract directory. Supports HTML (with CSS styling) and text extraction, file listing, and metadata retrieval via MCP-compliant tools and REST APIs. Built with Spring Boot, Jetty, and MCP SDK.
Extract price amount and currency symbol from a raw text string
Parse numbers written in natural language
🌐 Build high-performance web sensing and browser automation tools with ReasonKit Web, a Rust-native implementation for efficient solutions.
Heuristic-based boilerplate removal tool
Xtract-html is a tool for extracting HTML display code from a website, which you can also use for your website.
High-performance MCP server for browser automation, web capture, and content extraction. Rust-powered CDP client for AI agents.
Domain-specific language for extracting structured data from HTML documents
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
Module for automatic summarization of text documents and HTML pages.