Quick and dirty example of using Prefect Core to scrape a website

prefect-webscraper-example

This repository is a complete tutorial showing how to use Prefect to scrape a website, and how to deploy that scraper to Prefect Cloud for scheduled orchestration.

It mostly follows the tutorial documented here, but is written to run on Prefect Cloud on a schedule:

Following that example, the flow writes data to a local SQLite table. That doesn't make much sense when our images are ephemeral, but it illustrates the pipeline execution. In practice, when orchestrating through Prefect Cloud, we'd want to persist the data to a database or repository that resides on a separate, dedicated system.
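As a sketch of that persistence step (the table name and columns here are illustrative, not the repository's actual schema), writing scraped rows to a local SQLite table with the standard library looks like:

```python
import sqlite3


def save_rows(db_path, rows):
    """Persist scraped (title, url) pairs to a local SQLite table."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS scraped (title TEXT, url TEXT)")
        conn.executemany("INSERT INTO scraped VALUES (?, ?)", rows)
        conn.commit()
    finally:
        conn.close()


# Example usage with an in-memory database:
save_rows(":memory:", [("Example", "https://example.com")])
```

Pointing `db_path` at a file on a dedicated database host (or swapping `sqlite3` for a client to a real database) is what makes this viable once the flow runs on ephemeral Cloud infrastructure.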

Installation

You'll need a Python environment with the following packages installed.

It's best practice to set up a dedicated environment for each project. You can do this through Anaconda or pure Python:

Python Virtual Environment

python -m venv prefect-webscraper-example
source prefect-webscraper-example/bin/activate

Conda Virtual Environment

conda create -n prefect-webscraper-example python=3.7
conda activate prefect-webscraper-example

Package installation

To install the packages, you'll need to use pip, as not all of the packages are available on conda channels:

pip install -r requirements.txt

Visualization Note

If you want to visualize the DAG, you'll need graphviz installed. This can be done with one command if you're using conda:

conda install graphviz

If you want to use the pure Python approach, refer to the official documentation here:

Examples

BeautifulSoup

The example on Prefect's site leverages the requests library, along with beautifulsoup4. This pattern works for basic websites that don't involve a lot of JavaScript manipulation of the DOM.

A working example of using BeautifulSoup to parse a website on a schedule in Prefect Cloud can be found in example-bs4.py.
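The core of that pattern looks roughly like the following, sketched on a hard-coded HTML snippet rather than a live request so it's self-contained (the markup and selector are illustrative, not what example-bs4.py actually scrapes):

```python
from bs4 import BeautifulSoup

# In the real flow this HTML would come from requests.get(url).text;
# a hard-coded snippet keeps the sketch self-contained.
html = """
<ul class="episodes">
  <li><a href="/ep/1">Episode 1</a></li>
  <li><a href="/ep/2">Episode 2</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
links = {a.text: a["href"] for a in soup.select("ul.episodes a")}
print(links)  # {'Episode 1': '/ep/1', 'Episode 2': '/ep/2'}
```

Because requests returns the raw HTML the server sent, this works only when the content you want is present in that initial response, which is what motivates the Selenium approach below.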

Selenium

For more modern websites that use a lot of AJAX and JavaScript DOM manipulation, you'll need to actually execute the JavaScript and parse the page as it would load in a traditional browser. For this, there are headless versions of popular web browsers that let you query the rendered page with the same CSS or XPath syntax.

A working example of using Selenium to parse a website on a schedule in Prefect Cloud can be found in example-selenium.py.

Selenium Drivers

To leverage Selenium on your local machine, you'll need to download the appropriate driver from their website:

In this example, we're using the chromedriver located in the same directory as this code.

When deploying to Prefect Cloud, the reference code uses the official Selenium Chrome image as a base, then adds the Prefect Flow code to build the final image that gets orchestrated.

This can be viewed in the Dockerfile.
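A minimal sketch of that pattern (the image tag and paths are illustrative, not necessarily what this repository's Dockerfile uses):

```dockerfile
# Start from the official Selenium standalone Chrome image,
# which bundles Chrome and a matching chromedriver.
FROM selenium/standalone-chrome

# Add the Prefect Flow code on top of the browser base image.
WORKDIR /app
COPY requirements.txt example-selenium.py ./
RUN pip install -r requirements.txt
```

Building on the Selenium image means the browser and driver versions stay matched, and only the Flow code changes between builds.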

Project Layout

| Type | Object | Description |
|------|--------|-------------|
| 📁 | docker | Non-source-code files used by the Dockerfile during the build process |
| 📄 | build_docker_base_image.sh | Shell script to build the base image with the Selenium Chrome driver |
| 📄 | Dockerfile | Dockerfile to build the final Flow image on top of the Selenium Chrome base image |
| 📄 | example-bs4.py | Example website-scraper Prefect Flow, ready for Prefect Cloud, using BeautifulSoup |
| 📄 | example-selenium.py | Example website-scraper Prefect Flow, ready for Prefect Cloud, using Selenium |
| 📄 | README.md | The file you're reading now |
| 📄 | requirements.txt | Python packages required for local development of the Prefect Flows in this repository |
