feat(crawl_terraform) add readme
rfa447eh committed Nov 22, 2024
1 parent e3db9fd commit 2934a13
Showing 1 changed file with 39 additions and 0 deletions.
39 changes: 39 additions & 0 deletions crawl/readme.md
@@ -0,0 +1,39 @@
# Documentation for Web Content Scraper

## Overview
This script scrapes data from a list of URLs provided in a CSV file. It fetches each page, extracts specific product information, and logs the operations performed. Optionally, the extracted content can also be saved. The script relies on `requests`, `BeautifulSoup`, and `argparse`.
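
As an illustration of the fetch-and-extract flow described above, here is a minimal sketch using `requests` and `BeautifulSoup`. The function name, selector, and extracted fields are placeholders and do not come from `main.py`.

```python
# Minimal sketch of the fetch-and-parse step; names and selectors are illustrative.
import logging

import requests
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO)

def scrape_url(url: str) -> dict:
    """Fetch one page and extract illustrative product fields."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Placeholder selector; the actual script targets whatever markup
    # holds its product information.
    title = soup.find("h1")
    data = {"url": url, "title": title.get_text(strip=True) if title else None}
    logging.info("Scraped %s", url)
    return data
```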

## Prerequisites
Make sure the following Python packages are installed:
- `requests`
- `beautifulsoup4`
- `urllib3`

To install the dependencies, run the following command:
```sh
pip install requests beautifulsoup4 urllib3
```
## How to Use

### Arguments
The script accepts command-line arguments that let you customize its behavior (a parsing sketch follows the list):
- `--csv_path`: Path to the CSV file containing the URLs to scrape. Defaults to `./urls.csv`.
- `--save_result`: Boolean flag indicating whether to save the scraped results. Defaults to `False`.
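
A minimal sketch of how these two options could be wired up with `argparse`; the `str2bool` helper is an assumption for reading a boolean from the command line and may differ from what `main.py` actually does.

```python
# Hypothetical argument parsing mirroring the documented flags.
import argparse

def str2bool(value: str) -> bool:
    """Interpret common textual spellings of True/False."""
    return str(value).strip().lower() in ("true", "1", "yes")

parser = argparse.ArgumentParser(description="Web content scraper")
parser.add_argument("--csv_path", default="./urls.csv",
                    help="Path to the CSV file containing URLs to scrape")
parser.add_argument("--save_result", type=str2bool, default=False,
                    help="Whether to save the scraped results")
args = parser.parse_args()
```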
## Running the Script
You can run the script by using the following command:

```sh
python main.py --csv_path <path_to_csv> --save_result <True/False>
```
For example:
```sh
python main.py --csv_path ./urls.csv --save_result True
```
## CSV File Format
The CSV file should contain a list of URLs, with each URL on a new line. Here is an example:
```
https://example.com/page1
https://example.com/page2
```
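
For reference, a CSV in this format can be read with the standard library alone. This is a sketch of the expected input handling, not the exact code in `main.py`; the `load_urls` name is hypothetical.

```python
# Hypothetical reader for the one-URL-per-line CSV described above.
import csv

def load_urls(csv_path: str = "./urls.csv") -> list[str]:
    """Return all non-empty URLs from the CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [row[0].strip() for row in csv.reader(handle) if row and row[0].strip()]

urls = load_urls()
print(f"Loaded {len(urls)} URLs")
```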
