A web crawler that uses WebDriver to extract and parse HTML content from web pages with intelligent duplicate detection and template pattern recognition.
- 🌐 Multi-URL Crawling: Crawl multiple URLs in a single session
- 🔍 Intelligent Duplicate Detection: Automatically identifies and filters duplicate content patterns across domains
- 📋 Template Pattern Recognition: Detects variable patterns in content (e.g., "42 comments" → "{count} comments")
- 🌳 Structured HTML Tree: Provides filtered HTML tree view with duplicate marking
- ⚡ WebDriver Integration: Uses WebDriver for dynamic content handling
- 📊 Verbose Output: Detailed HTML tree analysis with filtering information
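The template-detection idea above can be illustrated with a simple substitution. This is a rough sketch of the general technique, not SmartCrawler's actual algorithm:

```shell
# Replace a numeric run with a placeholder, turning concrete text
# like "42 comments" into the template "{count} comments"
echo "42 comments" | sed -E 's/[0-9]+/{count}/'
# → {count} comments
```

Content that collapses to the same template across pages can then be treated as a duplicate pattern rather than unique text.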
- Install SmartCrawler - Download from releases or build from source
- Set up WebDriver - Install Firefox or Chrome and the corresponding WebDriver (GeckoDriver or ChromeDriver)
- Start crawling - Run SmartCrawler with your target URLs
# Basic usage
smart-crawler --link "https://example.com"
# Multiple URLs with verbose output
smart-crawler --link "https://example.com" --link "https://another.com" --verbose
# Template detection mode
smart-crawler --link "https://example.com" --template --verbose
Choose your operating system for detailed setup instructions:
- Windows Setup - Complete Windows installation guide
- macOS Setup - macOS installation and setup
- Linux Setup - Linux installation for various distributions
- CLI Options - Complete command-line reference and examples
- Development Guide - Setup, building, testing, and contributing instructions
- Operating System: Windows 10+, macOS 10.15+, or Linux
- Browser: Firefox (recommended) or Chrome
- WebDriver: GeckoDriver (Firefox) or ChromeDriver (Chrome)
- Memory: 512MB RAM minimum, 1GB recommended
# Crawl a single URL
smart-crawler --link "https://example.com"
# Crawl with detailed output
smart-crawler --link "https://example.com" --verbose
# Template detection mode
smart-crawler --link "https://example.com" --template --verbose
# Multiple URLs
smart-crawler --link "https://site1.com" --link "https://site2.com"
# Start Firefox WebDriver
geckodriver --port 4444
# Start Chrome WebDriver
chromedriver --port=4444
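Assuming the driver listens on port 4444 as started above, you can sanity-check it via the WebDriver `/status` endpoint, which both GeckoDriver and ChromeDriver expose:

```shell
# Query the driver's status endpoint; print a hint if nothing is listening
curl -s http://localhost:4444/status || echo "WebDriver is not running on port 4444"
```

A ready driver returns a small JSON document with a `ready` field.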
For developers interested in contributing to SmartCrawler or building from source:
- Development Guide - Complete setup, building, testing, and contributing instructions
This project is licensed under GPL-3.0 - see the LICENSE file for details.
If you encounter issues:
- Check the getting started guides for your operating system
- Review the CLI options documentation
- Search existing GitHub issues
- Create a new issue with detailed error information
Note: SmartCrawler is designed for ethical web scraping and research purposes. Always respect websites' robots.txt files and terms of service.
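As a lightweight precaution, a target path can be checked against a site's Disallow rules before crawling it. The robots.txt content below is a hypothetical inline example, not fetched from a real site:

```shell
# Sketch: decide whether a path is blocked by Disallow rules.
# The robots.txt body here is a made-up example for illustration.
robots="User-agent: *
Disallow: /private/
Disallow: /tmp/"

path="/private/page.html"
blocked=no
# Extract each Disallow prefix and test whether the path starts with it
for prefix in $(printf '%s\n' "$robots" | sed -n 's/^Disallow: //p'); do
  case "$path" in
    "$prefix"*) blocked=yes ;;
  esac
done
echo "blocked=$blocked"
# → blocked=yes
```

Real robots.txt handling (wildcards, Allow overrides, per-agent groups) is more involved; this only captures the basic prefix-match idea.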