SmartCrawler

A web crawler that uses WebDriver to extract and parse HTML content from web pages, with intelligent duplicate detection and template pattern recognition.

✨ Features

  • 🌐 Multi-URL Crawling: Crawl multiple URLs in a single session
  • πŸ” Intelligent Duplicate Detection: Automatically identifies and filters duplicate content patterns across domains
  • πŸ“‹ Template Pattern Recognition: Detects variable patterns in content (e.g., "42 comments" β†’ "{count} comments")
  • 🌳 Structured HTML Tree: Provides filtered HTML tree view with duplicate marking
  • ⚑ WebDriver Integration: Uses WebDriver for dynamic content handling
  • πŸ“Š Verbose Output: Detailed HTML tree analysis with filtering information
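
The idea behind template pattern recognition — replacing variable tokens such as numbers with placeholders so that structurally identical content compares as equal — can be sketched with a simple substitution. This is a minimal illustration of the concept, not SmartCrawler's actual implementation:

```shell
# Normalize numeric tokens to a {count} placeholder so strings that differ
# only in their numbers map to the same template
# (illustrative helper; not part of SmartCrawler)
normalize() {
  printf '%s\n' "$1" | sed -E 's/[0-9]+/\{count\}/g'
}

normalize "42 comments"   # → {count} comments
normalize "7 comments"    # → {count} comments
```

Because "42 comments" and "7 comments" normalize to the same string, they can be grouped as instances of a single template rather than treated as distinct content.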

🚀 Quick Start

  1. Install SmartCrawler - Download from releases or build from source
  2. Set up WebDriver - Install Firefox/Chrome and corresponding WebDriver
  3. Start crawling - Run SmartCrawler with your target URLs
```shell
# Basic usage
smart-crawler --link "https://example.com"

# Multiple URLs with verbose output
smart-crawler --link "https://example.com" --link "https://another.com" --verbose

# Template detection mode
smart-crawler --link "https://example.com" --template --verbose
```

📖 Documentation

Getting Started

Choose your operating system for detailed setup instructions:

Usage

  • CLI Options - Complete command-line reference and examples

Development

🔧 System Requirements

  • Operating System: Windows 10+, macOS 10.15+, or Linux
  • Browser: Firefox (recommended) or Chrome
  • WebDriver: GeckoDriver (Firefox) or ChromeDriver (Chrome)
  • Memory: 512MB RAM minimum, 1GB recommended

📋 Quick Reference

Basic Commands

```shell
# Crawl a single URL
smart-crawler --link "https://example.com"

# Crawl with detailed output
smart-crawler --link "https://example.com" --verbose

# Template detection mode
smart-crawler --link "https://example.com" --template --verbose

# Multiple URLs
smart-crawler --link "https://site1.com" --link "https://site2.com"
```

WebDriver Setup

```shell
# Start Firefox WebDriver (GeckoDriver)
geckodriver --port 4444

# Start Chrome WebDriver (ChromeDriver)
chromedriver --port=4444
```
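
Both drivers implement the W3C WebDriver protocol, whose `GET /status` endpoint offers a quick way to confirm the driver is up before starting a crawl. The check below assumes the default port 4444 used above:

```shell
# Query the WebDriver /status endpoint; print a fallback message if
# nothing is listening on port 4444
curl -s http://localhost:4444/status || echo "No WebDriver listening on port 4444"
```

A running driver responds with a small JSON document reporting its readiness; a connection failure means the driver has not been started (or is on a different port).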

πŸ› οΈ Development

For developers interested in contributing to SmartCrawler or building from source:

  • Development Guide - Complete setup, building, testing, and contributing instructions

📄 License

This project is licensed under the GPL-3.0 license - see the LICENSE file for details.

🆘 Support

If you encounter issues:

  1. Check the getting started guides for your operating system
  2. Review the CLI options documentation
  3. Search existing GitHub issues
  4. Create a new issue with detailed error information

Note: SmartCrawler is designed for ethical web scraping and research purposes. Always respect websites' robots.txt files and terms of service.
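
One practical way to follow this advice is to derive a site's robots.txt URL from any page URL and review it before crawling. The `robots_url` helper below is illustrative only and is not a SmartCrawler command:

```shell
# Build the robots.txt URL for the host of a given page URL
# (illustrative helper; not part of SmartCrawler)
robots_url() {
  printf '%s/robots.txt\n' "$(printf '%s\n' "$1" | sed -E 's#^(https?://[^/]+).*#\1#')"
}

robots_url "https://example.com/blog/post-42"
# → https://example.com/robots.txt
```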

About

A smart web crawler built in Rust that uses Claude AI to select the most relevant URLs from website sitemaps based on crawling objectives.
