htmlsense

htmlsense is a Go library and command-line tool that automatically extracts XPath selector mappings from HTML documents. It uses a large language model (LLM) to analyze HTML structure, identify repeated patterns and valuable content elements, and generate the corresponding XPath selectors, which removes much of the manual selector writing in web scraping development.

Background

When developing web scrapers, we often have to identify page elements and write XPath or CSS selectors by hand. This is tedious and error-prone, especially when page structures are complex or change frequently. htmlsense automates selector generation by combining HTML structure analysis with an LLM's recognition capabilities, making web scraping development more efficient.

Features

✨ Intelligent Recognition: Automatically identifies repeated structures and valuable content elements in HTML using LLM
🧹 Auto Cleaning: Automatically removes interfering tags (head, style, script, etc.) and attributes
🎯 Scope Extraction: Supports specifying extraction scope via scopeXPath to focus on specific areas
✅ Auto Validation: Automatically validates generated XPath selectors and filters out invalid ones
📊 Data Extraction: Optionally extracts actual data (table format) in addition to selectors
🛠️ CLI Tool: Provides a complete command-line tool for easy integration into scripts and automation workflows
📦 Easy Integration: Can be used as a Go library, easily integrated into existing projects
🔧 Flexible Configuration: Supports multiple LLM providers and model configurations

Installation

Prerequisites

Go 1.21 or higher
LLM API key (OpenAI or other supported providers)

Install from Source

git clone https://github.com/gotoailab/htmlsense.git
cd htmlsense
make install

After installation, the htmlsense command is available in the $GOPATH/bin directory.

Build with Makefile

# Build binary to build/ directory
make build

# Install to GOPATH/bin
make install

# Run all tests
make test

# Generate test coverage report
make test-coverage

# Format code
make fmt

# Run code checks
make vet

# Run all checks (format, vet, test)
make check

# Build multi-platform release binaries
make release

# View all available commands
make help

Quick Start

Command-Line Tool Usage

Basic Usage

# Extract selectors from file
htmlsense -html page.html -api-key YOUR_API_KEY

# Extract from stdin
cat page.html | htmlsense -html - -api-key YOUR_API_KEY

# Extract with scope (e.g., only within a container)
htmlsense -html page.html -scope "//div[@class='container']" -api-key YOUR_API_KEY

# Output as text format (default is JSON)
htmlsense -html page.html -output text -api-key YOUR_API_KEY

# Save intermediate results to directory
htmlsense -html page.html -save-intermediate ./output -api-key YOUR_API_KEY

# Extract data and save intermediate results
htmlsense -html page.html -extract-data -save-intermediate ./output -api-key YOUR_API_KEY

# Use environment variable for API key (recommended)
export HTMLSENSE_API_KEY=YOUR_API_KEY
htmlsense -html page.html

# Specify different LLM provider and model
htmlsense -html page.html -provider openai -model gpt-4 -api-key YOUR_API_KEY

Command-Line Arguments

Argument	Description	Default	Required
`-html`	HTML file path (use `-` to read from stdin)	-	Yes
`-output`	Output format: `json` or `text`	`json`	No
`-scope`	Optional XPath to limit extraction scope	-	No
`-extract-data`	Extract actual data instead of just selectors	`false`	No
`-save-intermediate`	Directory to save intermediate results (simplified.html, selectors.json, data.json)	-	No
`-api-key`	LLM API key	-	Yes*
`-provider`	LLM provider (e.g., `openai`)	`openai`	No
`-model`	LLM model name	`gpt-4`	No
`-version`	Show version information	-	No
`-help`	Show help information	-	No

*Can be set via HTMLSENSE_API_KEY environment variable, in which case -api-key is not needed
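
The fallback from -api-key to HTMLSENSE_API_KEY can be sketched in Go as follows. This is a minimal illustration, not the tool's actual code: the helper name resolveAPIKey is hypothetical, and the precedence shown (flag first, then environment variable) is an assumption, since the table above only says that either source is accepted.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// resolveAPIKey returns the flag value when it is set and otherwise falls
// back to the environment value. The flag-over-environment precedence is an
// assumption made for this sketch.
func resolveAPIKey(flagValue, envValue string) (string, error) {
	if flagValue != "" {
		return flagValue, nil
	}
	if envValue != "" {
		return envValue, nil
	}
	return "", errors.New("no API key: pass -api-key or set HTMLSENSE_API_KEY")
}

func main() {
	apiKey := flag.String("api-key", "", "LLM API key")
	flag.Parse()

	key, err := resolveAPIKey(*apiKey, os.Getenv("HTMLSENSE_API_KEY"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Never print the key itself; its length is enough to confirm it was found.
	fmt.Printf("using API key of length %d\n", len(key))
}
```

Keeping the key in the environment avoids leaving it in shell history or process listings, which is why the environment variable is the recommended option.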

Output Examples

JSON Format Output:

{
  "title": "//h1[@class='article-title']",
  "content": "//div[@class='article-content']//p",
  "author": "//span[@class='author-name']",
  "date": "//time[@class='publish-date']"
}
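
Because the JSON output is a flat object of name/XPath pairs, a script that consumes it can decode it into a map. The sketch below assumes the output has exactly the shape shown above; parseSelectors and sortedNames are hypothetical helper names, not part of htmlsense.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// parseSelectors decodes JSON-format output into a name -> XPath map.
func parseSelectors(raw []byte) (map[string]string, error) {
	var selectors map[string]string
	if err := json.Unmarshal(raw, &selectors); err != nil {
		return nil, fmt.Errorf("decode selectors: %w", err)
	}
	return selectors, nil
}

// sortedNames returns the selector names in alphabetical order, because Go
// map iteration order is not deterministic.
func sortedNames(selectors map[string]string) []string {
	names := make([]string, 0, len(selectors))
	for name := range selectors {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	raw := []byte(`{
  "title": "//h1[@class='article-title']",
  "content": "//div[@class='article-content']//p",
  "author": "//span[@class='author-name']",
  "date": "//time[@class='publish-date']"
}`)

	selectors, err := parseSelectors(raw)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, name := range sortedNames(selectors) {
		fmt.Printf("%s -> %s\n", name, selectors[name])
	}
}
```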

Text Format Output:

Extracted selector mappings:
  title: //h1[@class='article-title']
  content: //div[@class='article-content']//p
  author: //span[@class='author-name']
  date: //time[@class='publish-date']

Data Extraction Output (with -extract-data flag):

JSON format:

[
  {
    "title": "Article Title 1",
    "content": "Article content 1...",
    "author": "Author Name 1",
    "date": "2024-01-01"
  },
  {
    "title": "Article Title 2",
    "content": "Article content 2...",
    "author": "Author Name 2",
    "date": "2024-01-02"
  }
]
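
Since the extracted data is a list of rows with string fields, it maps naturally onto a table. As an illustration, the following sketch converts rows of that shape into CSV using only the Go standard library. The helpers columns and toRecords are hypothetical, and sorting the column names alphabetically is a choice made here for deterministic output, not something htmlsense does.

```go
package main

import (
	"encoding/csv"
	"encoding/json"
	"fmt"
	"os"
	"sort"
)

// columns collects the union of field names across all rows, sorted.
func columns(rows []map[string]string) []string {
	seen := map[string]bool{}
	var cols []string
	for _, row := range rows {
		for name := range row {
			if !seen[name] {
				seen[name] = true
				cols = append(cols, name)
			}
		}
	}
	sort.Strings(cols)
	return cols
}

// toRecords builds CSV records: a header followed by one record per row.
// A field missing from a row becomes an empty cell.
func toRecords(rows []map[string]string) [][]string {
	cols := columns(rows)
	records := [][]string{cols}
	for _, row := range rows {
		rec := make([]string, len(cols))
		for i, c := range cols {
			rec[i] = row[c]
		}
		records = append(records, rec)
	}
	return records
}

func main() {
	raw := []byte(`[
  {"title": "Article Title 1", "author": "Author Name 1", "date": "2024-01-01"},
  {"title": "Article Title 2", "author": "Author Name 2", "date": "2024-01-02"}
]`)

	var rows []map[string]string
	if err := json.Unmarshal(raw, &rows); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// WriteAll writes every record and flushes the writer.
	w := csv.NewWriter(os.Stdout)
	if err := w.WriteAll(toRecords(rows)); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```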

Text format:

Extracted data:

Row 1:
  title: Article Title 1
  content: Article content 1...
  author: Author Name 1
  date: 2024-01-01

Row 2:
  title: Article Title 2
  content: Article content 2...
  author: Author Name 2
  date: 2024-01-02

Intermediate Results (with -save-intermediate flag):

When using -save-intermediate, the following files are saved to the specified directory:

simplified.html: The cleaned and simplified HTML used for LLM analysis

selectors.json: Contains both raw selectors from LLM and validated selectors:

{
  "raw_selectors": {
    "title": "//h1[@class='article-title']",
    "content": "//div[@class='article-content']//p"
  },
  "valid_selectors": {
    "title": "//h1[@class='article-title']",
    "content": "//div[@class='article-content']//p"
  }
}

data.json: Extracted data in JSON format (only when -extract-data is used)
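
Comparing the two maps in selectors.json shows which selectors the validation step discarded. The sketch below assumes the file layout shown above; the type intermediateSelectors, the helper rejected, and the "price" entry in the sample data are all hypothetical and exist only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// intermediateSelectors mirrors the selectors.json layout shown above.
type intermediateSelectors struct {
	Raw   map[string]string `json:"raw_selectors"`
	Valid map[string]string `json:"valid_selectors"`
}

// rejected returns the names present in the raw selectors but absent from
// the validated ones, sorted alphabetically.
func rejected(s intermediateSelectors) []string {
	var names []string
	for name := range s.Raw {
		if _, ok := s.Valid[name]; !ok {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	// Sample content; the "price" selector is invented to show a rejection.
	raw := []byte(`{
  "raw_selectors": {
    "title": "//h1[@class='article-title']",
    "price": "//span[@class='price']"
  },
  "valid_selectors": {
    "title": "//h1[@class='article-title']"
  }
}`)

	var s intermediateSelectors
	if err := json.Unmarshal(raw, &s); err != nil {
		fmt.Println("decode selectors.json:", err)
		return
	}
	fmt.Println("rejected by validation:", rejected(s))
}
```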

Use as Go Library

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/gotoailab/htmlsense"
)

func main() {
    // Configure htmlsense
    config := htmlsense.Config{
        APIKey:   "your-api-key",
        Provider: "openai",
        Model:    "gpt-4",
    }

    // Create extractor
    extractor, err := htmlsense.NewExtractor(config)
    if err != nil {
        log.Fatal(err)
    }

    // HTML content
    htmlContent := `
    <html>
        <body>
            <div class="container">
                <h1 class="title">Article Title</h1>
                <p class="content">Article content...</p>
                <span class="author">Author Name</span>
            </div>
        </body>
    </html>
    `

    ctx := context.Background()

    // Method 1: Extract from entire HTML
    selectors, err := extractor.ExtractSelectors(ctx, htmlContent)
    if err != nil {
        log.Fatal(err)
    }

    // Method 2: Extract from specified scope (e.g., only within container)
    // Use = rather than := here: selectors and err are already declared above.
    selectors, err = extractor.ExtractSelectors(ctx, htmlContent, "//div[@class='container']")
    if err != nil {
        log.Fatal(err)
    }

    // Use extracted selectors
    for elementName, xpath := range selectors {
        fmt.Printf("%s: %s\n", elementName, xpath)
    }

    // Method 3: Extract actual data (table format)
    data, err := extractor.ExtractData(ctx, htmlContent)
    if err != nil {
        log.Fatal(err)
    }

    // data is []map[string]string, where each map represents a row
    for i, row := range data {
        fmt.Printf("Row %d:\n", i+1)
        for fieldName, value := range row {
            fmt.Printf("  %s: %s\n", fieldName, value)
        }
    }
}

How It Works

htmlsense works through the following process:

HTML Cleaning: Removes unnecessary tags (head, style, script, etc.) and interfering attributes (such as img src), simplifying the HTML structure to make it more suitable for LLM analysis
Scope Extraction (Optional): If a scopeXPath parameter is provided, first extracts the DOM subtree selected by that XPath from the original HTML
Structure Recognition: Uses LLM to analyze the simplified HTML and intelligently identify:
- Repeated structural patterns (such as list items, cards, articles, etc.)
- Valuable content elements (titles, content, authors, dates, prices, etc.)
Selector Generation: Generates accurate XPath selectors for each identified element, ensuring selectors are both specific and flexible
Selector Validation: Automatically validates that generated XPath selectors are valid and can match nodes in the HTML, filtering out invalid selectors
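
The final validation step can be illustrated with a small filter. This is only a sketch of the idea, not htmlsense's implementation: the XPath check is injected as a function (matchFunc, a hypothetical name) so the example stays within the standard library, whereas a real pipeline would back it with an XPath engine. The "price" selector is invented to show a rejection.

```go
package main

import (
	"fmt"
	"sort"
)

// matchFunc reports whether an XPath is well-formed and matches at least one
// node. It is injected here; a real pipeline would use an XPath engine.
type matchFunc func(xpath string) bool

// filterValid keeps only the selectors accepted by matches.
func filterValid(selectors map[string]string, matches matchFunc) map[string]string {
	valid := make(map[string]string, len(selectors))
	for name, xpath := range selectors {
		if matches(xpath) {
			valid[name] = xpath
		}
	}
	return valid
}

func main() {
	// Stand-in for a real XPath check: pretend only these expressions match.
	known := map[string]bool{
		"//h1[@class='title']":    true,
		"//span[@class='author']": true,
	}
	matches := func(xpath string) bool { return known[xpath] }

	raw := map[string]string{
		"title":  "//h1[@class='title']",
		"author": "//span[@class='author']",
		"price":  "//span[@class='price']", // hypothetical: matches no node
	}

	valid := filterValid(raw, matches)
	names := make([]string, 0, len(valid))
	for name := range valid {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s: %s\n", name, valid[name])
	}
}
```

Injecting the check keeps the filtering logic independent of any particular XPath library and makes it easy to test.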

Use Cases

Web Scraping Development: Quickly generate selectors for page elements without manual writing
Data Extraction Automation: Batch process multiple pages, automatically identify and extract structured data
Page Structure Analysis: Analyze webpage structures to understand page layout and content organization
Test Automation: Generate element locators for UI testing
Content Monitoring: Monitor webpage structure changes and automatically update selectors

Project Structure

htmlsense/
├── cmd/
│   └── htmlsense/          # Command-line tool
│       └── main.go
├── html/                   # HTML processing module
│   ├── cleaner.go          # HTML cleaning
│   └── simplifier.go       # HTML simplification
├── selector/               # Selector extraction module
│   ├── extractor.go        # Selector extractor
│   └── validator.go        # XPath validator
├── example/                # Example code
├── build/                  # Build output directory
├── Makefile               # Build script
├── go.mod                 # Go module definition
└── README.md             # Project documentation

Dependencies

github.com/gotoailab/llmhub - LLM client library
github.com/antchfx/htmlquery - HTML parsing and XPath queries
golang.org/x/net/html - HTML parsing

Contributing

Contributions are welcome! Please follow these steps:

Fork the repository
Create a feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For questions or suggestions, please contact us via:

Submit an Issue: GitHub Issues
Email: [Maintainer Email]

Acknowledgments

Thanks to all developers who contributed to this project!
