Web Crawler with Selenium and BeautifulSoup

This Python script utilizes Selenium and BeautifulSoup to crawl a website, save the HTML content of each page, and collect all page links for further crawling.

Key Features

  • Dynamic Content Handling: Seamlessly interacts with web pages that require JavaScript to load, ensuring comprehensive crawling of modern web applications.
  • Automated browser interactions using Selenium.
  • Extraction and preservation of HTML content.
  • Collection of hyperlinks from web pages for recursive crawling.
  • Utilization of BeautifulSoup for advanced HTML parsing.
  • Smart crawl management: visited URLs are tracked so no page is fetched twice (a minimal sketch of the overall flow follows this list).
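
The crawler's source isn't reproduced in this README, but the flow the features describe can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script's actual code: names such as crawl, visited, START_URL, and OUTPUT_DIR are hypothetical.

# Illustrative sketch only -- crawl(), START_URL, and OUTPUT_DIR are hypothetical names.
from urllib.parse import urljoin, urlparse
from pathlib import Path

from bs4 import BeautifulSoup
from selenium import webdriver

START_URL = "https://example.com"      # assumed entry point
OUTPUT_DIR = Path("pages")             # assumed output location
OUTPUT_DIR.mkdir(exist_ok=True)

driver = webdriver.Chrome()            # Selenium drives a real browser,
visited = set()                        # so JavaScript-rendered content loads

def crawl(url):
    if url in visited:
        return                         # skip pages that were already crawled
    visited.add(url)

    driver.get(url)                    # the browser executes the page's JavaScript
    html = driver.page_source          # fully rendered HTML

    # Save the rendered HTML to disk
    name = urlparse(url).path.strip("/").replace("/", "_") or "index"
    (OUTPUT_DIR / f"{name}.html").write_text(html, encoding="utf-8")

    # Parse with BeautifulSoup and collect same-site links for recursive crawling
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if urlparse(link).netloc == urlparse(START_URL).netloc:
            crawl(link)

crawl(START_URL)
driver.quit()

Reading driver.page_source after driver.get is what makes JavaScript-rendered content visible to BeautifulSoup, which on its own only sees static HTML.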

Prerequisites

Make sure you have the following installed on your system:

  • Python 3.x
  • Pipenv (install it with pip install pipenv if not already installed)

Installation

  1. Clone the repository: git clone https://github.com/braisdev/dynamic-full-web-crawler.git
  2. Navigate to the cloned project's directory: cd dynamic-full-web-crawler
  3. Use Pipenv to install the dependencies and create a virtual environment:
pipenv install
  4. Activate the Pipenv shell:
pipenv shell
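
Note: the two runtime dependencies are the selenium and beautifulsoup4 packages from PyPI. If the environment ever needs to be rebuilt from scratch, they can also be installed explicitly (this is a generic Pipenv invocation, not a command taken from this repository):

pipenv install selenium beautifulsoup4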

Usage

Run the crawler using the following command within the Pipenv shell:

python crawler.py
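
If the crawl fails as soon as it starts, the most common cause is that Selenium cannot launch a browser. The snippet below is a quick, self-contained sanity check; it assumes Chrome is installed (recent Selenium releases download a matching driver automatically) and is not part of crawler.py itself.

# Sanity check: can Selenium start a headless Chrome session on this machine?
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")   # run without opening a browser window
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(driver.title)                      # expected output: Example Domain
driver.quit()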
