[Code Addition Request]: Automate Workflows through Web Scraping (#738)
Fixes #736

## Pull Request for PyVerse 💡

### Requesting to submit a pull request to the PyVerse repository.

---

#### Issue Title
*Add Web Scraping Workflow Automation*

- [YES] I have provided the issue title.

---

#### Name 
*Sanchit Chauhan*

- [YES] I have provided my name.

---

#### GitHub ID 
*sanchitc05*

- [YES] I have provided my GitHub ID.

---

#### Email ID
*[email protected]*

- [YES] I have provided my email ID.

---

#### Identify Yourself
**Mention in which program you are contributing (e.g., WoB, GSSOC, SSOC,
SWOC).**
*GSSOC, HACKTOBERFEST*

- [YES] I have mentioned my participant role.

---

#### Closes  
*Closes: #736*

- [YES] I have provided the issue number.

---

#### Describe the Add-ons or Changes You've Made
### **Description**
This PR introduces an automated web scraping workflow to extract data
from static and dynamic web pages. The solution uses `requests` and
`BeautifulSoup` for static pages, and `Selenium` for dynamic content.
Scraping activity and errors are logged to `scraper.log` for easy tracking. The
workflow streamlines repetitive data-collection tasks and can be run on a
schedule for recurring scraping (see the sketch below).
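
Nothing in this PR wires up the scheduling itself. A minimal sketch using the third-party `schedule` package (an assumption here, not one of the project's dependencies) could look like this:

```python
# run_scheduler.py -- a hedged sketch of periodic runs; the file name, the
# `schedule` dependency (pip install schedule), and the six-hour interval
# are illustrative assumptions, not part of this PR.
import time

import schedule

from scraper import scrape_static_page, scrape_dynamic_page


def run_scrapers():
    # These URLs mirror the placeholders in scraper.py and are illustrative only.
    scrape_static_page('https://example.com/static')
    scrape_dynamic_page('https://example.com/dynamic')


schedule.every(6).hours.do(run_scrapers)  # re-scrape every six hours

while True:
    schedule.run_pending()
    time.sleep(60)  # poll the schedule once a minute
```

A cron entry invoking `python scraper.py` would serve the same purpose; the point is only that the scraping functions are importable and do nothing until called.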

### **Technical Implementation**  
- **Libraries Used**:  
  - `requests`: Fetch web pages for static content.  
  - `BeautifulSoup`: Parse and extract relevant data from HTML.  
  - `Selenium`: Automate browser interaction for dynamic content.  
  - `logging` (standard library): Tracks activities and errors in `scraper.log`.

- **Project Structure**:  
  - `scraper.py`: Main script containing scraping logic.
  - `requirements.txt`: Dependency list for easy setup.

### **Usage**  
1. Clone the repository and install dependencies:
   ```bash
   git clone https://github.com/yourusername/web_scraper.git
   cd web_scraper
   pip install -r requirements.txt
   ```
2. Update `static_url` and `dynamic_url` variables in `scraper.py`.
3. Run the scraper:
   ```bash
   python scraper.py
   ```
4. Check logs in `scraper.log` for activity status.

### **Benefits**  
- **Automates data collection**, saving time and effort.
- **Handles dynamic content**, making it adaptable to complex websites.
- **Error tracking** ensures smooth, continuous scraping.

### **Testing**  
- Successfully tested scraping against both static and dynamic pages.
- Verified proper logging of activities and error handling (a reproducible unit-test sketch follows below).
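
As a reproducible complement to the manual checks above, here is a sketch of a unit test for `scrape_static_page`. The test file name, the sample HTML, and the mocked `requests.get` call are illustrative assumptions, not anything shipped in this PR:

```python
# test_scraper.py -- a hedged sketch; the sample HTML and mocking strategy are assumptions.
from unittest.mock import MagicMock, patch

from scraper import scrape_static_page

SAMPLE_HTML = """
<div class="product"><h2>Widget</h2><span class="price">$9.99</span></div>
<div class="product"><h2>Gadget</h2></div>
"""


def test_scrape_static_page_parses_products():
    fake_response = MagicMock()
    fake_response.text = SAMPLE_HTML
    fake_response.raise_for_status.return_value = None

    # Patch the requests.get used inside scraper.py so no network call is made.
    with patch("scraper.requests.get", return_value=fake_response):
        products = scrape_static_page("https://example.com/static")

    # The second product has no price tag, so it should fall back to 'N/A'.
    assert products == [("Widget", "$9.99"), ("Gadget", "N/A")]
```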

- [YES] I have described my changes.

---

#### Type of Change
**Select the type of change:**  
- [YES] Bug fix (non-breaking change which fixes an issue)
- [YES] New feature (non-breaking change which adds functionality)
- [YES] Code style update (formatting, local variables)
- [YES] Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [YES] This change requires a documentation update

---

#### How Has This Been Tested?
**Describe how your changes have been tested.**  
*Ran `scraper.py` against a static and a dynamic sample page and verified that the extracted data, activities, and errors were logged correctly; see the Testing notes above.*

- [YES] I have described my testing process.

---

#### Checklist
**Please confirm the following:**  
- [YES] My code follows the guidelines of this project.
- [YES] I have performed a self-review of my code.
- [YES] I have commented on my code, particularly wherever it was hard
to understand.
- [YES] I have made corresponding changes to the documentation.
- [YES] My changes generate no new warnings.
- [YES] I have added things that prove my fix is effective or that my
feature works.
- [NO] Any dependent changes have been merged and published in
downstream modules.
UTSAVS26 authored Oct 22, 2024
2 parents 4e53c5c + e70c694 commit f290e13
Showing 3 changed files with 132 additions and 0 deletions.
61 changes: 61 additions & 0 deletions web_scraper/readme.md
@@ -0,0 +1,61 @@
# Web Scraper

## Overview
This project automates the workflow of data extraction from both static and dynamic web pages using Python. It leverages libraries like `requests`, `BeautifulSoup`, and `Selenium` to scrape data efficiently.

## Features
- Scrapes static and dynamic web pages.
- Logs data extraction activities.
- Handles errors gracefully.
- Outputs data for easy analysis.

## Tech Stack
- **Languages**: Python
- **Libraries**:
  - `requests`: For making HTTP requests.
  - `beautifulsoup4`: For parsing HTML.
  - `selenium`: For scraping dynamic content.
- **Logging**: Built-in logging module for tracking scraping activities.

## Installation
1. Clone this repository:
   ```bash
   git clone https://github.com/yourusername/web_scraper.git
   cd web_scraper
   ```
2. Install the required libraries:
   ```bash
   pip install -r requirements.txt
   ```

## Usage
1. Open `scraper.py` and update the `static_url` and `dynamic_url` variables with the target URLs you want to scrape.
2. Run the script:
   ```bash
   python scraper.py
   ```
3. Check `scraper.log` for the scraping activity logs.

## Benefits
- **Time-Saving**: Automates the tedious process of manual data collection.
- **Efficiency**: Quickly extracts large volumes of data.
- **Error Monitoring**: Keeps track of errors and successes through logging.

## Contributing
Feel free to submit issues or pull requests. For major changes, please open an issue first to discuss what you would like to change.

## License
This project is licensed under the MIT License.
3 changes: 3 additions & 0 deletions web_scraper/requirements.txt
@@ -0,0 +1,3 @@
requests
beautifulsoup4
selenium
68 changes: 68 additions & 0 deletions web_scraper/scraper.py
@@ -0,0 +1,68 @@
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import time
import logging

# Configure logging
logging.basicConfig(filename='scraper.log', level=logging.INFO)

def scrape_static_page(url):
    """Scrape data from a static web page using requests and BeautifulSoup."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise an error for bad responses
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract data (example: product names and prices)
        products = []
        for product in soup.find_all('div', class_='product'):
            name_tag = product.find('h2')
            price_tag = product.find('span', class_='price')
            name = name_tag.text if name_tag else 'N/A'
            price = price_tag.text if price_tag else 'N/A'
            products.append((name, price))
            logging.info(f"Scraped {name}: {price}")

        return products

    except Exception as e:
        logging.error(f"Error scraping static page: {e}")
        return []


def scrape_dynamic_page(url):
    """Scrape data from a dynamic web page using Selenium."""
    driver = None
    try:
        options = Options()
        options.add_argument('--headless')  # Run in headless mode
        service = Service('/usr/local/bin/chromedriver')  # Path for GitHub Codespaces
        driver = webdriver.Chrome(service=service, options=options)

        driver.get(url)
        time.sleep(2)  # Wait for the page to load

        # Extract data (example: product names and prices).
        # find_elements returns an empty list when nothing matches, so missing
        # fields fall back to 'N/A' instead of raising NoSuchElementException.
        products = []
        for product in driver.find_elements(By.CLASS_NAME, 'product'):
            name_elements = product.find_elements(By.TAG_NAME, 'h2')
            price_elements = product.find_elements(By.CLASS_NAME, 'price')
            name = name_elements[0].text if name_elements else 'N/A'
            price = price_elements[0].text if price_elements else 'N/A'
            products.append((name, price))
            logging.info(f"Scraped {name}: {price}")

        return products

    except Exception as e:
        logging.error(f"Error scraping dynamic page: {e}")
        return []

    finally:
        if driver is not None:
            driver.quit()  # Always close the browser, even after an error


if __name__ == "__main__":
    static_url = 'https://example.com/static'
    dynamic_url = 'https://example.com/dynamic'

    static_data = scrape_static_page(static_url)
    dynamic_data = scrape_dynamic_page(dynamic_url)

    print("Static Data:", static_data)
    print("Dynamic Data:", dynamic_data)
