This project is an educational tool that demonstrates how to use Python libraries such as Selenium and Schedule, together with MongoDB, for web scraping. The scraper collects publicly available job data from a specified website and stores it in MongoDB. A minimal sketch of how these pieces fit together appears after the feature list below.
- Uses Selenium for browser automation and web interaction.
- Includes proxy support for enhanced anonymity.
- Saves scraped data in MongoDB.
- Allows scheduling of scraping tasks using the schedule library.
- Checks the `robots.txt` file of the target website for scraping permissions.
- Logs all activities and errors in `webscraper.log`.
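As a rough orientation, here is a minimal sketch of how these pieces could fit together. The target URL, CSS selector, database and collection names, proxy setting, and schedule time are illustrative assumptions, not the project's actual configuration.

```python
import logging
import time

import schedule
from pymongo import MongoClient
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

logging.basicConfig(filename="webscraper.log", level=logging.INFO)

# Hypothetical settings -- replace with your own target and connection details.
TARGET_URL = "https://example.com/jobs"
MONGO_URI = "mongodb://localhost:27017"
PROXY = None  # e.g. "host:port" to route traffic through a proxy


def scrape_jobs():
    options = Options()
    options.add_argument("--headless")
    if PROXY:
        options.add_argument(f"--proxy-server={PROXY}")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(TARGET_URL)
        # Hypothetical selector: adjust to the structure of the target site.
        titles = [el.text for el in driver.find_elements(By.CSS_SELECTOR, ".job-title")]
        logging.info("Scraped %d job titles", len(titles))
        if titles:
            collection = MongoClient(MONGO_URI)["jobs_db"]["jobs"]
            collection.insert_many([{"title": t} for t in titles])
    except Exception:
        logging.exception("Scraping run failed")
    finally:
        driver.quit()


# Run the scraper once a day; run_pending() must be polled in a loop.
schedule.every().day.at("09:00").do(scrape_jobs)

if __name__ == "__main__":
    while True:
        schedule.run_pending()
        time.sleep(60)
```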
- Respect `robots.txt`: This scraper checks the `robots.txt` file of the target website before scraping. If scraping is disallowed, the scraper logs a warning and terminates (see the sketch after this list).
- Terms of Service: Always adhere to the terms of service of the target website. Unauthorized scraping may violate these terms.
- No Personal Data: This scraper does not target or process personal data.
- Commercial Use: This project is for educational purposes only. Do not use it for commercial purposes without explicit consent from the website owner.
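As a sketch of the `robots.txt` check described above, Python's built-in `urllib.robotparser` can be used. The base URL and path here are placeholders, and the warn-and-exit behavior mirrors the rule stated in the list rather than the project's exact code.

```python
import logging
import sys
from urllib import robotparser

logging.basicConfig(filename="webscraper.log", level=logging.INFO)


def allowed_by_robots(base_url: str, path: str, user_agent: str = "*") -> bool:
    """Return True if the site's robots.txt permits fetching the given path."""
    parser = robotparser.RobotFileParser()
    parser.set_url(f"{base_url}/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, f"{base_url}{path}")


# Hypothetical target -- substitute the site you intend to scrape.
if not allowed_by_robots("https://example.com", "/jobs"):
    logging.warning("Scraping disallowed by robots.txt; exiting.")
    sys.exit(1)
```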
`robots.txt` is a file used by websites to communicate with web crawlers. It specifies which parts of the website are allowed or disallowed for automated access. Learn more at robotstxt.org.
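As an illustration (not taken from any particular site), a `robots.txt` that blocks automated access to a `/private/` section while allowing everything else might look like this:

```
User-agent: *
Disallow: /private/
```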
- Python: Version 3.8 or higher
- Browser: Google Chrome
- Database: MongoDB (local or remote instance)
All required Python libraries are listed in `requirements.txt`. Install them using:

```bash
pip install -r requirements.txt
```