[Code Addition Request]: Automate Workflows through Web Scraping (#738)
Fixes #736

## Pull Request for PyVerse 💡

### Requesting to submit a pull request to the PyVerse repository.

---

#### Issue Title
*Add Web Scraping Workflow Automation*

- [YES] I have provided the issue title.

---

#### Name 
*Sanchit Chauhan*

- [YES] I have provided my name.

---

#### GitHub ID 
*sanchitc05*

- [YES] I have provided my GitHub ID.

---

#### Email ID
*[email protected]*

- [YES] I have provided my email ID.

---

#### Identify Yourself
**Mention in which program you are contributing (e.g., WoB, GSSOC, SSOC,
SWOC).**
*GSSOC, HACKTOBERFEST*

- [YES] I have mentioned my participant role.

---

#### Closes  
*Closes: #736*

- [YES] I have provided the issue number.

---

#### Describe the Add-ons or Changes You've Made
### **Description**
This PR introduces an automated web scraping workflow to extract data
from static and dynamic web pages. The solution uses `requests` and
`BeautifulSoup` for static pages, and `Selenium` for dynamic content.
Scraping activity and errors are logged to `scraper.log` for easy tracking. The
workflow streamlines repetitive data-collection tasks and can be run on a
schedule for recurring scraping (see the sketch below).
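
Nothing in this PR wires up the scheduling itself. A minimal sketch using the third-party `schedule` package (an assumption here, not one of the project's dependencies) could look like this:

```python
# run_scheduler.py -- a hedged sketch of periodic runs; the file name, the
# `schedule` dependency (pip install schedule), and the six-hour interval
# are illustrative assumptions, not part of this PR.
import time

import schedule

from scraper import scrape_static_page, scrape_dynamic_page


def run_scrapers():
    # These URLs mirror the placeholders in scraper.py and are illustrative only.
    scrape_static_page('https://example.com/static')
    scrape_dynamic_page('https://example.com/dynamic')


schedule.every(6).hours.do(run_scrapers)  # re-scrape every six hours

while True:
    schedule.run_pending()
    time.sleep(60)  # poll the schedule once a minute
```

A cron entry invoking `python scraper.py` would serve the same purpose; the point is only that the scraping functions are importable and do nothing until called.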

### **Technical Implementation**  
- **Libraries Used**:  
  - `requests`: Fetch web pages for static content.  
  - `BeautifulSoup`: Parse and extract relevant data from HTML.  
  - `Selenium`: Automate browser interaction for dynamic content.  
  - `logging` (standard library): Tracks activities and errors in `scraper.log`.

- **Project Structure**:  
  - `scraper.py`: Main script containing scraping logic.
  - `requirements.txt`: Dependency list for easy setup.

### **Usage**  
1. Clone the repository and install dependencies:
   ```bash
   git clone https://github.com/yourusername/web_scraper.git
   cd web_scraper
   pip install -r requirements.txt
   ```
2. Update `static_url` and `dynamic_url` variables in `scraper.py`.
3. Run the scraper:
   ```bash
   python scraper.py
   ```
4. Check logs in `scraper.log` for activity status.

### **Benefits**  
- **Automates data collection**, saving time and effort.
- **Handles dynamic content**, making it adaptable to complex websites.
- **Error tracking** ensures smooth, continuous scraping.

### **Testing**  
- Successfully tested scraping against both static and dynamic pages.
- Verified proper logging of activities and error handling (a reproducible unit-test sketch follows below).
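
As a reproducible complement to the manual checks above, here is a sketch of a unit test for `scrape_static_page`. The test file name, the sample HTML, and the mocked `requests.get` call are illustrative assumptions, not anything shipped in this PR:

```python
# test_scraper.py -- a hedged sketch; the sample HTML and mocking strategy are assumptions.
from unittest.mock import MagicMock, patch

from scraper import scrape_static_page

SAMPLE_HTML = """
<div class="product"><h2>Widget</h2><span class="price">$9.99</span></div>
<div class="product"><h2>Gadget</h2></div>
"""


def test_scrape_static_page_parses_products():
    fake_response = MagicMock()
    fake_response.text = SAMPLE_HTML
    fake_response.raise_for_status.return_value = None

    # Patch the requests.get used inside scraper.py so no network call is made.
    with patch("scraper.requests.get", return_value=fake_response):
        products = scrape_static_page("https://example.com/static")

    # The second product has no price tag, so it should fall back to 'N/A'.
    assert products == [("Widget", "$9.99"), ("Gadget", "N/A")]
```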

- [YES] I have described my changes.

---

#### Type of Change
**Select the type of change:**  
- [YES] Bug fix (non-breaking change which fixes an issue)
- [YES] New feature (non-breaking change which adds functionality)
- [YES] Code style update (formatting, local variables)
- [YES] Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [YES] This change requires a documentation update

---

#### How Has This Been Tested?
**Describe how your changes have been tested.**  
*Ran `scraper.py` against a static and a dynamic sample page and verified that the extracted data, activities, and errors were logged correctly; see the Testing notes above.*

- [YES] I have described my testing process.

---

#### Checklist
**Please confirm the following:**  
- [YES] My code follows the guidelines of this project.
- [YES] I have performed a self-review of my code.
- [YES] I have commented on my code, particularly wherever it was hard
to understand.
- [YES] I have made corresponding changes to the documentation.
- [YES] My changes generate no new warnings.
- [YES] I have added things that prove my fix is effective or that my
feature works.
- [NO] Any dependent changes have been merged and published in
downstream modules.
UTSAVS26 authored Oct 22, 2024
2 parents 4e53c5c + e70c694 commit f290e13
Showing 3 changed files with 132 additions and 0 deletions.
61 changes: 61 additions & 0 deletions web_scraper/readme.md
@@ -0,0 +1,61 @@
# Web Scraper

## Overview
This project automates the workflow of data extraction from both static and dynamic web pages using Python. It leverages libraries like `requests`, `BeautifulSoup`, and `Selenium` to scrape data efficiently.

## Features
- Scrapes static and dynamic web pages.
- Logs data extraction activities.
- Handles errors gracefully.
- Outputs data for easy analysis.

## Tech Stack
- **Languages**: Python
- **Libraries**:
  - `requests`: For making HTTP requests.
  - `beautifulsoup4`: For parsing HTML.
  - `selenium`: For scraping dynamic content.
- **Logging**: Built-in logging module for tracking scraping activities.

## Installation
1. Clone this repository:
   ```bash
   git clone https://github.com/yourusername/web_scraper.git
   cd web_scraper
   ```
2. Install the required libraries:
   ```bash
   pip install -r requirements.txt
   ```

## Usage
1. Open `scraper.py` and update the `static_url` and `dynamic_url` variables with the target URLs you want to scrape.
2. Run the script:
   ```bash
   python scraper.py
   ```
3. Check `scraper.log` for the scraping activity logs.

## Benefits
- **Time-Saving**: Automates the tedious process of manual data collection.
- **Efficiency**: Quickly extracts large volumes of data.
- **Error Monitoring**: Keeps track of errors and successes through logging.

## Contributing
Feel free to submit issues or pull requests. For major changes, please open an issue first to discuss what you would like to change.

## License
This project is licensed under the MIT License.
3 changes: 3 additions & 0 deletions web_scraper/requirements.txt
@@ -0,0 +1,3 @@
requests
beautifulsoup4
selenium
68 changes: 68 additions & 0 deletions web_scraper/scraper.py
@@ -0,0 +1,68 @@
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import time
import logging

# Configure logging
logging.basicConfig(filename='scraper.log', level=logging.INFO)

def scrape_static_page(url):
    """Scrape data from a static web page using requests and BeautifulSoup."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise an error for bad responses
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract data (example: product names and prices)
        products = []
        for product in soup.find_all('div', class_='product'):
            name_tag = product.find('h2')
            price_tag = product.find('span', class_='price')
            name = name_tag.text if name_tag else 'N/A'
            price = price_tag.text if price_tag else 'N/A'
            products.append((name, price))
            logging.info(f"Scraped {name}: {price}")

        return products

    except Exception as e:
        logging.error(f"Error scraping static page: {e}")
        return []


def scrape_dynamic_page(url):
    """Scrape data from a dynamic web page using Selenium."""
    driver = None
    try:
        options = Options()
        options.add_argument('--headless')  # Run in headless mode
        service = Service('/usr/local/bin/chromedriver')  # Path for GitHub Codespaces
        driver = webdriver.Chrome(service=service, options=options)

        driver.get(url)
        time.sleep(2)  # Wait for the page to load

        # Extract data (example: product names and prices).
        # find_elements returns an empty list when nothing matches, so missing
        # fields fall back to 'N/A' instead of raising NoSuchElementException.
        products = []
        for product in driver.find_elements(By.CLASS_NAME, 'product'):
            name_elements = product.find_elements(By.TAG_NAME, 'h2')
            price_elements = product.find_elements(By.CLASS_NAME, 'price')
            name = name_elements[0].text if name_elements else 'N/A'
            price = price_elements[0].text if price_elements else 'N/A'
            products.append((name, price))
            logging.info(f"Scraped {name}: {price}")

        return products

    except Exception as e:
        logging.error(f"Error scraping dynamic page: {e}")
        return []

    finally:
        if driver is not None:
            driver.quit()  # Always close the browser, even after an error


if __name__ == "__main__":
    static_url = 'https://example.com/static'
    dynamic_url = 'https://example.com/dynamic'

    static_data = scrape_static_page(static_url)
    dynamic_data = scrape_dynamic_page(dynamic_url)

    print("Static Data:", static_data)
    print("Dynamic Data:", dynamic_data)
