Educational Web Scraper

This project is an educational tool that demonstrates how to use Python tools such as Selenium, MongoDB, and the schedule library for web scraping. The scraper collects publicly available job data from a specified website and stores it in MongoDB.


Key Features

  • Uses Selenium for browser automation and web interaction.
  • Includes proxy support for enhanced anonymity (see the setup sketch after this list).
  • Saves scraped data in MongoDB.
  • Allows scheduling of scraping tasks using the schedule library.
  • Checks the robots.txt file of the target website for scraping permissions.
  • Logs all activity and errors to webscraper.log.
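
As a sketch of the first two features, the browser setup might look roughly like this, assuming Selenium 4 (where webdriver.Chrome locates the driver automatically); the proxy address is a placeholder:

from typing import Optional

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def build_driver(proxy: Optional[str] = None) -> webdriver.Chrome:
    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a visible window
    if proxy:
        # Route all browser traffic through the proxy, e.g. "http://127.0.0.1:8080"
        options.add_argument("--proxy-server=" + proxy)
    return webdriver.Chrome(options=options)

driver = build_driver()  # or build_driver(proxy="http://127.0.0.1:8080")
driver.get("https://example.com")
print(driver.title)
driver.quit()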

Important Notes

Data Privacy and Legal Compliance

  1. Respect robots.txt: This scraper checks the robots.txt file of the target website before scraping. If scraping is disallowed, the scraper will log a warning and terminate.
  2. Terms of Service: Always adhere to the terms of service of the target website. Unauthorized scraping may violate these terms.
  3. No Personal Data: This scraper does not target or process personal data.
  4. Commercial Use: This project is for educational purposes only. Do not use it for commercial purposes without explicit consent from the website owner.

What is robots.txt?

robots.txt is a file used by websites to communicate with web crawlers. It specifies which parts of the website are allowed or disallowed for automated access. Learn more at robotstxt.org.
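
A minimal sketch of such a check, using the standard library's urllib.robotparser together with the webscraper.log logging mentioned above; the URL and user agent are placeholders:

import logging
import sys
from urllib.robotparser import RobotFileParser

logging.basicConfig(filename="webscraper.log", level=logging.INFO)

TARGET_URL = "https://example.com/jobs"
USER_AGENT = "EducationalWebScraper"

parser = RobotFileParser("https://example.com/robots.txt")
parser.read()  # fetch and parse the robots.txt file

if not parser.can_fetch(USER_AGENT, TARGET_URL):
    # Scraping is disallowed: log a warning and terminate.
    logging.warning("robots.txt disallows scraping %s; exiting.", TARGET_URL)
    sys.exit(1)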


Requirements

  • Python: Version 3.8 or higher
  • Browser: Google Chrome
  • Database: MongoDB (local or remote instance)
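
The scraper writes its results to the MongoDB instance listed above. A minimal sketch of the storage step with PyMongo; the database name, collection name, and document fields are illustrative:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # local default instance
jobs = client["webscraper"]["jobs"]

# Each scraped job listing becomes one document in the collection.
jobs.insert_one({
    "title": "Data Engineer",
    "company": "Example GmbH",
    "url": "https://example.com/jobs/123",
})
print(jobs.count_documents({}), "documents stored")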

Dependencies

All required Python libraries are listed in requirements.txt. Install them using:

pip install -r requirements.txt
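
Once the dependencies are installed, recurring runs can be set up with the schedule library, as mentioned in the feature list. A minimal sketch, assuming a hypothetical scrape_jobs() entry point:

import time
import schedule

def scrape_jobs():
    print("running scraper...")  # placeholder for the actual scraping logic

schedule.every().day.at("06:00").do(scrape_jobs)  # run once per day

while True:
    schedule.run_pending()
    time.sleep(60)  # check once a minute for due jobs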
