SmartCrawler

A web crawler that uses WebDriver to extract and parse HTML content from web pages, with intelligent duplicate detection and template pattern recognition.

✨ Features

  • 🌐 Multi-URL Crawling: Crawl multiple URLs in a single session
  • πŸ” Intelligent Duplicate Detection: Automatically identifies and filters duplicate content patterns across domains
  • πŸ“‹ Template Pattern Recognition: Detects variable patterns in content (e.g., "42 comments" β†’ "{count} comments")
  • 🌳 Structured HTML Tree: Provides filtered HTML tree view with duplicate marking
  • ⚑ WebDriver Integration: Uses WebDriver for dynamic content handling
  • πŸ“Š Verbose Output: Detailed HTML tree analysis with filtering information
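
The idea behind template pattern recognition — replacing variable tokens such as numbers with placeholders so that structurally identical content compares as equal — can be sketched with a simple substitution. This is a minimal illustration of the concept, not SmartCrawler's actual implementation:

```shell
# Normalize numeric tokens to a {count} placeholder so strings that differ
# only in their numbers map to the same template
# (illustrative helper; not part of SmartCrawler)
normalize() {
  printf '%s\n' "$1" | sed -E 's/[0-9]+/\{count\}/g'
}

normalize "42 comments"   # → {count} comments
normalize "7 comments"    # → {count} comments
```

Because "42 comments" and "7 comments" normalize to the same string, they can be grouped as instances of a single template rather than treated as distinct content.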

🚀 Quick Start

  1. Install SmartCrawler - Download from releases or build from source
  2. Set up WebDriver - Install Firefox/Chrome and corresponding WebDriver
  3. Start crawling - Run SmartCrawler with your target URLs
```shell
# Basic usage
smart-crawler --link "https://example.com"

# Multiple URLs with verbose output
smart-crawler --link "https://example.com" --link "https://another.com" --verbose

# Template detection mode
smart-crawler --link "https://example.com" --template --verbose
```

📖 Documentation

Getting Started

Choose your operating system for detailed setup instructions:

Usage

  • CLI Options - Complete command-line reference and examples

Development

🔧 System Requirements

  • Operating System: Windows 10+, macOS 10.15+, or Linux
  • Browser: Firefox (recommended) or Chrome
  • WebDriver: GeckoDriver (Firefox) or ChromeDriver (Chrome)
  • Memory: 512MB RAM minimum, 1GB recommended

📋 Quick Reference

Basic Commands

```shell
# Crawl a single URL
smart-crawler --link "https://example.com"

# Crawl with detailed output
smart-crawler --link "https://example.com" --verbose

# Template detection mode
smart-crawler --link "https://example.com" --template --verbose

# Multiple URLs
smart-crawler --link "https://site1.com" --link "https://site2.com"
```

WebDriver Setup

```shell
# Start Firefox WebDriver (GeckoDriver)
geckodriver --port 4444

# Start Chrome WebDriver (ChromeDriver)
chromedriver --port=4444
```
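
Both drivers implement the W3C WebDriver protocol, whose `GET /status` endpoint offers a quick way to confirm the driver is up before starting a crawl. The check below assumes the default port 4444 used above:

```shell
# Query the WebDriver /status endpoint; print a fallback message if
# nothing is listening on port 4444
curl -s http://localhost:4444/status || echo "No WebDriver listening on port 4444"
```

A running driver responds with a small JSON document reporting its readiness; a connection failure means the driver has not been started (or is on a different port).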

πŸ› οΈ Development

For developers interested in contributing to SmartCrawler or building from source:

  • Development Guide - Complete setup, building, testing, and contributing instructions

📄 License

This project is licensed under the GPL-3.0 license - see the LICENSE file for details.

🆘 Support

If you encounter issues:

  1. Check the getting started guides for your operating system
  2. Review the CLI options documentation
  3. Search existing GitHub issues
  4. Create a new issue with detailed error information

Note: SmartCrawler is designed for ethical web scraping and research purposes. Always respect websites' robots.txt files and terms of service.
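
One practical way to follow this advice is to derive a site's robots.txt URL from any page URL and review it before crawling. The `robots_url` helper below is illustrative only and is not a SmartCrawler command:

```shell
# Build the robots.txt URL for the host of a given page URL
# (illustrative helper; not part of SmartCrawler)
robots_url() {
  printf '%s/robots.txt\n' "$(printf '%s\n' "$1" | sed -E 's#^(https?://[^/]+).*#\1#')"
}

robots_url "https://example.com/blog/post-42"
# → https://example.com/robots.txt
```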

About

A smart web crawler built in Rust that uses Claude AI to select the most relevant URLs from website sitemaps based on crawling objectives.
