Web Crawler Project

This project contains a Python script for a simple web crawler that extracts links from a given website and follows each discovered link in turn. It is built with the third-party requests and BeautifulSoup libraries.

Features

  • Fetches HTML content from a specified URL.
  • Parses the HTML to extract all links.
  • Visits each extracted link in turn to continue the crawl.
  • Tracks visited URLs to prevent re-crawling the same page.
  • Uses a queue to manage URLs to be crawled.

Workflow Diagram

graph TD;
    A[Start Crawler] --> B{Retrieve HTML}
    B -- Success --> C[Parse HTML for Links]
    C --> D{Any New Links?}
    D -- Yes --> E[Add to Queue]
    D -- No --> F{Queue Empty?}
    E --> F
    F -- Yes --> G[Stop]
    F -- No --> B
    G --> H[End Crawler]
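The loop in the diagram maps onto a short breadth-first crawl. The sketch below is illustrative rather than the repository's exact script: the crawl function name and the max_pages bound are assumptions added here to keep the example self-contained and finite.

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(initial_url, max_pages=50):
    """Breadth-first crawl: fetch a page, queue its links, skip visited URLs.

    max_pages is an illustrative safety bound, not part of the original script.
    """
    visited = set()
    queue = deque([initial_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            continue  # skip pages that fail to load
        visited.add(url)
        print(url)
        soup = BeautifulSoup(response.text, "html.parser")
        # Only 'a' tags are processed, matching the feature list above.
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])  # resolve relative links
            if urlparse(link).scheme in ("http", "https"):
                queue.append(link)
    return visited

if __name__ == "__main__":
    crawl("https://example.com")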

Installation

To run this web crawler, you need Python installed on your system along with the following Python packages:

pip install requests beautifulsoup4

Usage

Modify the initial_url variable in the script to point to the website you want to start the crawl from (see the example below). To run the script, execute:

python wcrawler.py

The script will print out the list of URLs visited during the crawl.
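For example, the starting point might be set like this (the URL is a placeholder):

initial_url = "https://example.com"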

Limitations

  • The crawler does not execute JavaScript, so links generated client-side are not discovered.
  • It only extracts URLs from 'a' tags; links in other elements are ignored.
  • Performance may degrade with a large number of URLs because the script is single-threaded.

For contributions or issues, please open a pull request or an issue in this repository.
