SmartCrawler

A web crawler built in Rust that uses WebDriver to extract and parse HTML content from web pages, with intelligent duplicate detection and template pattern recognition. It uses Claude AI to select the most relevant URLs from website sitemaps based on crawling objectives.

✨ Features

  • 🌐 Multi-URL Crawling: Crawl multiple URLs in a single session
  • 🔍 Intelligent Duplicate Detection: Automatically identifies and filters duplicate content patterns across domains
  • 📋 Template Pattern Recognition: Detects variable patterns in content (e.g., "42 comments" → "{count} comments"); see the sketch after this list
  • 🌳 Structured HTML Tree: Provides filtered HTML tree view with duplicate marking
  • ⚡ WebDriver Integration: Uses WebDriver for dynamic content handling
  • 📊 Verbose Output: Detailed HTML tree analysis with filtering information
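
To make the duplicate-detection and template ideas concrete, here is a minimal sketch of how text nodes could be normalized so that strings differing only in numbers collapse to one template and can then be flagged as repeated boilerplate rather than unique content. It uses the regex crate and a hypothetical normalize helper for illustration; this is not SmartCrawler's actual implementation.

use regex::Regex;
use std::collections::HashMap;

// Replace runs of digits with a "{count}" placeholder so that
// "42 comments" and "7 comments" map to the same template.
// Illustrative only; SmartCrawler's real normalization may differ.
fn normalize(text: &str) -> String {
    let digits = Regex::new(r"\d+").unwrap();
    digits.replace_all(text, "{count}").into_owned()
}

fn main() {
    let nodes = ["42 comments", "7 comments", "posted 3 days ago"];
    let mut templates: HashMap<String, usize> = HashMap::new();
    for node in nodes {
        *templates.entry(normalize(node)).or_insert(0) += 1;
    }
    // "{count} comments" occurs twice, so it is likely a template
    // pattern rather than unique page content.
    for (template, count) in &templates {
        println!("{template}: {count}");
    }
}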

🚀 Quick Start

  1. Install SmartCrawler - Download from releases or build from source
  2. Set up WebDriver - Install Firefox/Chrome and corresponding WebDriver
  3. Start crawling - Run SmartCrawler with your target URLs

# Basic usage
smart-crawler --link "https://example.com"

# Multiple URLs with verbose output
smart-crawler --link "https://example.com" --link "https://another.com" --verbose

# Template detection mode
smart-crawler --link "https://example.com" --template --verbose

📖 Documentation

Getting Started

Choose your operating system and follow the detailed setup instructions in the getting started guides.

Usage

  • CLI Options - Complete command-line reference and examples


🔧 System Requirements

  • Operating System: Windows 10+, macOS 10.15+, or Linux
  • Browser: Firefox (recommended) or Chrome
  • WebDriver: GeckoDriver (Firefox) or ChromeDriver (Chrome)
  • Memory: 512MB RAM minimum, 1GB recommended

📋 Quick Reference

Basic Commands

# Crawl a single URL
smart-crawler --link "https://example.com"

# Crawl with detailed output
smart-crawler --link "https://example.com" --verbose

# Template detection mode
smart-crawler --link "https://example.com" --template --verbose

# Multiple URLs
smart-crawler --link "https://site1.com" --link "https://site2.com"

WebDriver Setup

# Start Firefox WebDriver
geckodriver --port 4444

# Start Chrome WebDriver
chromedriver --port=4444
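
Once a driver is listening, a crawler speaks the WebDriver protocol to it over HTTP on that port. As a rough sketch of that interaction in Rust using the fantoccini crate (an assumption for illustration; SmartCrawler's own WebDriver client may be implemented differently):

use fantoccini::ClientBuilder;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to a WebDriver server started as shown above.
    let client = ClientBuilder::native()
        .connect("http://localhost:4444")
        .await?;

    // Navigate and fetch the rendered page source, including
    // content produced by JavaScript after page load.
    client.goto("https://example.com").await?;
    let html = client.source().await?;
    println!("fetched {} bytes of HTML", html.len());

    client.close().await?;
    Ok(())
}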

🛠️ Development

For developers interested in contributing to SmartCrawler or building from source:

  • Development Guide - Complete setup, building, testing, and contributing instructions

📄 License

This project is licensed under GPL-3.0 - see the LICENSE file for details.


🆘 Support

If you encounter issues:

  1. Check the getting started guides for your operating system
  2. Review the CLI options documentation
  3. Search existing GitHub issues
  4. Create a new issue with detailed error information

Note: SmartCrawler is designed for ethical web scraping and research purposes. Always respect websites' robots.txt files and terms of service.
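
Checking robots.txt before fetching a page can be automated. The sketch below implements a deliberately simplified check (the allowed helper is hypothetical, and real robots.txt parsing also handles Allow rules, wildcards, and per-agent sections):

use reqwest::blocking::get;

// Very simplified robots.txt check: returns false if any
// "Disallow:" prefix under "User-agent: *" matches the path.
// Illustration only; use a full parser for production crawling.
fn allowed(origin: &str, path: &str) -> Result<bool, reqwest::Error> {
    let body = get(format!("{origin}/robots.txt"))?.text()?;
    let mut applies = false;
    for line in body.lines() {
        let line = line.trim();
        if let Some(agent) = line.strip_prefix("User-agent:") {
            applies = agent.trim() == "*";
        } else if applies {
            if let Some(rule) = line.strip_prefix("Disallow:") {
                let rule = rule.trim();
                if !rule.is_empty() && path.starts_with(rule) {
                    return Ok(false);
                }
            }
        }
    }
    Ok(true)
}

fn main() -> Result<(), reqwest::Error> {
    println!("{}", allowed("https://example.com", "/private/page")?);
    Ok(())
}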
