This project involves web scraping Nike's product pages to extract product names, prices, and links. The project showcases three different implementations of the web crawler using Selenium and BeautifulSoup. It also includes visualisation of the scraped data using Matplotlib and Seaborn.
- Project Overview
- Notable Changes Between the Versions
- Detailed Explanation of `Nike_Web_Crawler_bs4.py`
- Detailed Explanation of `Nike_Web_Crawler_sel.py`
- How to Run the Project Locally
- Requirements
- References
The project comprises three different versions of a Nike web crawler, each designed to scrape product details from Nike's website:
- `Nike_Web_Crawler_Original.py`: The first version, developed between September 2023 and November 2023. It uses Selenium to scrape specific Nike product pages.
- `Nike_Web_Crawler_bs4.py`: This version is rewritten using BeautifulSoup for more efficient page scraping and parsing.
- `Nike_Web_Crawler_sel.py`: A revamped Selenium-based scraper that crawls search result pages on Nike, scrolling dynamically and fetching multiple products.
- The first and third versions use Selenium, while the second version (`Nike_Web_Crawler_bs4.py`) uses BeautifulSoup.
- BeautifulSoup is faster when dealing with static HTML content, but Selenium is required for interacting with dynamic web elements like scrolling and waiting for JavaScript-loaded content.
- In the original script, specific product pages were targeted, meaning that the product URLs were hardcoded into the script:

  ```python
  websites = [
      'https://www.nike.com/gb/t/dunk-low-retro-shoe-Kd1wZr/DD1391-103',
      'https://www.nike.com/gb/t/dunk-low-retro-shoe-QgD9Gv/DD1391-100',
      ...
  ]
  ```

  However, both of the newer scripts scrape search result pages based on user input:

  ```python
  product_name = input("Enter the product name to search on Nike: ")
  ```
- The Selenium version (`Nike_Web_Crawler_sel.py`) can be run in headless mode (without a graphical browser) for faster execution and ease of automation:

  ```python
  options.add_argument("--headless")  # Run in headless mode (no GUI)
  ```
- In `Nike_Web_Crawler_sel.py`, the scraper scrolls down the page dynamically to load more products, simulating user behaviour:

  ```python
  driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)
  time.sleep(2)  # Wait for more products to load
  ```
- Both newer scripts include a data visualisation feature using Matplotlib and Seaborn:

  ```python
  sns.barplot(x='Name', y='Price_Clean', data=df)
  ```
One notable difference between the two newer scripts, and an area that still needs improvement, is the amount of data available for visualisation. The BeautifulSoup-based script yields a richer dataset than the Selenium-driven version, because it parses static HTML, which is typically straightforward and complete to extract from.

Selenium, by contrast, works against JavaScript-rendered pages where content is loaded dynamically, so certain data elements may not be available until extra steps are taken, such as waiting for elements to load or executing JavaScript. Dynamically rendered content can also introduce inconsistencies or missing values, which affects the quality and completeness of the visualisations. Selenium remains the right tool for complex, interactive pages, but it needs more careful handling and additional processing to produce data suitable for accurate, insightful visual representations.
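One common mitigation, sketched below, is to use Selenium's explicit waits so parsing only begins once the product cards exist in the DOM. This is an illustrative snippet rather than code from the project: `driver` is assumed to be an already-configured Chrome WebDriver, and the 15-second timeout is an arbitrary choice.

```python
# Hedged sketch: wait explicitly for JavaScript-rendered product cards before
# parsing, so dynamically loaded data is actually present in the page.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 15)  # timeout is an assumption, tune as needed
cards = wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.product-card__body'))
)
print(f"{len(cards)} product cards rendered and ready to parse")
```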
The `Nike_Web_Crawler_bs4.py` script uses the BeautifulSoup library to scrape search result pages on Nike's website.
- **Generating the Search URL:**

  ```python
  def generate_nike_url(product_name, page_number=1):
      return f'https://www.nike.com/gb/w?q={product_name.replace(" ", "+")}&page={page_number}'
  ```

  This generates a URL based on the product search query entered by the user.
- **Parsing Product Data:**

  ```python
  results = soup.find_all('div', {'class': 'product-card__body'})
  for item in results:
      title = item.find('div', {'class': 'product-card__title'}).text.strip()
      price = item.find('div', {'class': 'product-price'}).text.strip()
  ```

  This code searches the HTML content for product details like name, price, and links.
- **Saving Data:** The scraped data is saved into a CSV file using pandas:

  ```python
  productsdf.to_csv("Price_BS4.csv", index=False)
  ```
- **Data Visualisation:** After saving the data, it is visualised using Seaborn and Matplotlib (an end-to-end sketch of all of these steps follows this list):

  ```python
  sns.barplot(x='Name', y='Price_Clean', data=df)
  ```
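To show how the steps above fit together, here is a minimal, self-contained sketch of the BeautifulSoup flow. It is not the exact code in `Nike_Web_Crawler_bs4.py`: the `requests` fetch, the `User-Agent` header, the `Price_Clean` derivation, and the column names are assumptions made for illustration.

```python
# Hedged end-to-end sketch: fetch -> parse -> clean -> save -> plot.
import requests
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

def generate_nike_url(product_name, page_number=1):
    return f'https://www.nike.com/gb/w?q={product_name.replace(" ", "+")}&page={page_number}'

product_name = input("Enter the product name to search on Nike: ")
url = generate_nike_url(product_name)
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})  # browser-like header
soup = BeautifulSoup(response.text, 'html.parser')

products = []
for item in soup.find_all('div', {'class': 'product-card__body'}):
    title = item.find('div', {'class': 'product-card__title'})
    price = item.find('div', {'class': 'product-price'})
    link = item.find('a', href=True)
    if title and price:  # skip cards that are missing either field
        products.append({
            'Name': title.text.strip(),
            'Price': price.text.strip(),
            'Link': link['href'] if link else None,
        })

if not products:
    raise SystemExit("No product cards found - the page may be JavaScript-rendered or blocked.")

df = pd.DataFrame(products)
# Strip the currency symbol and thousands separators so prices plot numerically.
df['Price_Clean'] = (
    df['Price'].str.replace('£', '', regex=False)
               .str.replace(',', '', regex=False)
               .astype(float)
)
df.to_csv("Price_BS4.csv", index=False)

sns.barplot(x='Name', y='Price_Clean', data=df)
plt.xticks(rotation=45, ha='right')  # long product names overlap otherwise
plt.tight_layout()
plt.show()
```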
The `Nike_Web_Crawler_sel.py` script utilises Selenium to handle JavaScript-loaded pages and scroll to load all products on the search result pages.
- **Setting Up Selenium WebDriver:** The script sets up Selenium WebDriver with headless mode enabled for fast, non-GUI execution:

  ```python
  options = Options()
  options.add_argument("--headless")
  driver = webdriver.Chrome(service=service, options=options)
  ```
- **Dynamic Scrolling:** The script scrolls to the bottom of the page to load more products dynamically:

  ```python
  while True:
      driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)
      time.sleep(2)  # Wait for more products to load
  ```
- **Parsing Product Data:** The product details are parsed using Selenium's `find_elements` method:

  ```python
  items = driver.find_elements(By.CSS_SELECTOR, 'div.product-card__body')
  for item in items:
      title = item.find_element(By.CSS_SELECTOR, 'div.product-card__title').text.strip()
  ```
- **Saving and Visualising Data:** Like the BeautifulSoup version, the scraped data is saved to a CSV file and visualised using Seaborn (a fuller sketch of the scroll-and-parse flow follows this list):

  ```python
  df = pd.read_csv("Price_Sel_Drive.csv")
  sns.barplot(x='Name', y='Price_Clean', data=df)
  ```
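The sketch below shows one way the Selenium flow described above can be assembled, with the scroll loop bounded so it stops once the page height stops growing. It is an illustrative assumption, not the project's exact code: the README's own snippet uses an unconditional `while True` loop, and the example search URL, price selector, and `Price_Clean` derivation are placeholders.

```python
# Hedged sketch: scroll until no new content loads, then parse, clean, save, plot.
import time
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")            # no GUI for faster automated runs
service = Service("path_to_chromedriver")     # replace with your ChromeDriver path
driver = webdriver.Chrome(service=service, options=options)
driver.get("https://www.nike.com/gb/w?q=dunk+low")  # example search URL (assumption)

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)
    time.sleep(2)  # wait for more products to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:  # nothing new loaded, so stop scrolling
        break
    last_height = new_height

products = []
for item in driver.find_elements(By.CSS_SELECTOR, 'div.product-card__body'):
    title = item.find_element(By.CSS_SELECTOR, 'div.product-card__title').text.strip()
    price = item.find_element(By.CSS_SELECTOR, 'div.product-price').text.strip()
    products.append({'Name': title, 'Price': price})
driver.quit()

df = pd.DataFrame(products)
df['Price_Clean'] = df['Price'].str.replace('£', '', regex=False).astype(float)
df.to_csv("Price_Sel_Drive.csv", index=False)

sns.barplot(x='Name', y='Price_Clean', data=df)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
```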
To run the project locally, first clone the repository:

```bash
git clone https://github.com/gappeah/nike-web-crawler.git
cd nike-web-crawler
```
It's recommended to set up a virtual environment to manage dependencies.
```bash
python -m venv venv
source venv/bin/activate  # On Windows, use venv\Scripts\activate
```
- Download ChromeDriver from the official ChromeDriver downloads page and place it in a suitable folder.
- If you encounter a `NoSuchDriverException`, it indicates that Selenium could not locate or use the ChromeDriver, which is required to control Chrome for web scraping. Here are a few steps to resolve the issue:
- **Ensure ChromeDriver is installed and accessible:**
  - Download the appropriate version of ChromeDriver for your version of Google Chrome from the ChromeDriver official website.
  - Make sure the version of ChromeDriver matches your Chrome browser version. You can check your Chrome version by navigating to `chrome://settings/help` in Chrome.
- **Set the correct path to ChromeDriver:**
  - In your script, replace `"path_to_chromedriver"` in this line:

    ```python
    service = Service("path_to_chromedriver")  # Replace with your ChromeDriver path
    ```

    with the actual path to your `chromedriver.exe` file. For example:

    ```python
    service = Service("C:/path/to/chromedriver.exe")
    ```

    In my case, it is:

    ```python
    service = Service("C:/chromedriver.exe")
    ```
- **Add ChromeDriver to your system PATH:**
  - Ensure that the directory containing `chromedriver.exe` is added to your system PATH so that it can be found globally. You can follow these steps:
    - Right-click on `This PC` or `My Computer`, select `Properties`, then go to `Advanced system settings`.
    - Click on `Environment Variables`, find the `Path` variable under `System variables`, and click `Edit`.
    - Add the directory where `chromedriver.exe` is located.
- **Check if the ChromeDriver version matches Chrome:**
  - Chrome updates frequently, and your ChromeDriver must match the installed version of Chrome. If you recently updated Chrome, make sure you update ChromeDriver to the compatible version.
- **Verify that headless mode is working:**
  - If you're using headless mode (`options.add_argument("--headless")`), try running in normal mode by commenting out the headless argument. This lets you see whether the browser is being opened and controlled correctly (a minimal smoke-test script follows this list):

    ```python
    # options.add_argument("--headless")
    ```
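As a quick sanity check after any of the fixes above, you can run a minimal script that only starts Chrome through your ChromeDriver path and prints the page title. This is an illustrative sketch; the driver path is a placeholder you must replace.

```python
# Hedged sketch: smoke test for the ChromeDriver setup. If this prints the page
# title without raising NoSuchDriverException, the driver path and version are fine.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

options = Options()
# options.add_argument("--headless")  # leave commented out to watch the browser open
service = Service("path_to_chromedriver")  # replace with your ChromeDriver path
driver = webdriver.Chrome(service=service, options=options)
driver.get("https://www.nike.com/gb")
print(driver.title)
driver.quit()
```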
You can run any of the scripts by using:
```bash
python Nike_Web_Crawler_bs4.py  # For the BeautifulSoup version
python Nike_Web_Crawler_sel.py  # For the Selenium version
```
Follow the prompts for the product name and the number of pages to scrape.
- The scraped data will be saved in a CSV file (`Price.csv` or `Price_Sel_Drive.csv`).
- The visualisations will pop up automatically after the scraping is completed.
- Python 3.x
- ChromeDriver (for Selenium versions)
- Python libraries: `selenium`, `beautifulsoup4`, `pandas`, `matplotlib`, `seaborn`

Install the libraries with:

```bash
pip install selenium beautifulsoup4 pandas matplotlib seaborn
```
Useful references for this project:
- **Selenium Documentation:** Official documentation for Selenium WebDriver, covering usage, troubleshooting, and best practices.
- **ChromeDriver Downloads:** Download page for the latest ChromeDriver compatible with your Chrome browser version.
- **BeautifulSoup Documentation:** Detailed guide on using BeautifulSoup for parsing HTML and XML documents in Python.
- **Pandas Documentation:** Guide to using pandas for data manipulation and analysis, including CSV file handling.
- **Matplotlib Documentation:** Comprehensive documentation on using Matplotlib for creating static, animated, and interactive visualisations in Python.
- **Seaborn Documentation:** Official documentation for Seaborn, a Python data visualisation library built on top of Matplotlib.
- **Headless Chrome:** Information on running Chrome in headless mode, which allows it to run in the background without a GUI.
- **Web Scraping Best Practices:** Best practices and ethical considerations when scraping websites, including avoiding overloading servers and respecting terms of service.