If you'd like to run Spark on your own laptop, there are three options: a local installation, a VirtualBox virtual machine, or a Docker image.
On Linux-based systems (including macOS), installation is straightforward. On Windows, especially if you do not already have Java and Python installed, it is easier to use the provided VM or Docker image.

For a local installation you will need:
- Java 8+
- Python 3
- Recommended: Anaconda
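You can quickly check that suitable versions are on your PATH (exact version strings will vary):

java -version
python3 --version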
- Get the latest release from the Apache Spark downloads page and unpack it.
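For example, for the 2.4.4 build used throughout these instructions (the archive URL below is one option; the downloads page will suggest a mirror):

wget https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
tar xzf spark-2.4.4-bin-hadoop2.7.tgz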
That's it - check that it works:
cd spark-2.4.4-bin-hadoop2.7/
bin/pyspark
You should see output like this:
$ bin/pyspark
Python 3.7.4 (default, Aug 13 2019, 15:17:50)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
19/11/01 09:11:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Python version 3.7.4 (default, Aug 13 2019 15:17:50)
SparkSession available as 'spark'.
>>>
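As an optional sanity check, you can run a small job straight from the prompt. Here spark is the SparkSession the shell creates for you; counting the even numbers in 0-99 should return 50:

>>> spark.range(100).filter("id % 2 == 0").count()
50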
- Clone the git repository on your laptop:
git clone https://github.com/EPCCed/prace-spark-for-data-scientists.git
- Set an environment variable pointing at the Spark installation directory (replace [INSTALLATION_PATH] below with the path of your installation):
export SPARK_HOME=[INSTALLATION_PATH]/spark-2.4.4-bin-hadoop2.7/
You can also add PySpark (and the other Spark binaries) to your PATH if you prefer:
export PATH=$PATH:$SPARK_HOME/bin
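These exports only apply to the current shell session; add them to your shell profile (e.g. ~/.bashrc) to make them permanent. To confirm the variable points at a working installation:

$SPARK_HOME/bin/spark-submit --version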
- Configure the environment for PySpark to use Jupyter notebooks:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
- Start PySpark with a Jupyter notebook server:
cd prace-spark-for-data-scientists
$SPARK_HOME/bin/pyspark
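Once the notebook server opens in your browser, create a new notebook and try a first cell. A minimal sketch (the DataFrame contents are just example data):

# 'spark' is the SparkSession that PySpark creates for the driver
print(spark.version)

# Build a tiny DataFrame to confirm the session works end to end
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()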
For the second option, a virtual machine, you will need:
- Oracle VirtualBox (https://www.virtualbox.org)
- Download the VirtualBox VM image from here.
- Start VirtualBox and click File -> Import Appliance... to import the image
- Start the VM
- The password for the SparkUser account is "sparkuser"
A copy of the git repository is available in the VM.
Start a PySpark session (the inline variables are equivalent to the exports described above):
cd prace-spark-for-data-scientists
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook pyspark
This opens a PySpark Jupyter session in a Firefox browser.
For the third option you will need:
- Docker
- Clone the git repository on your laptop:
git clone https://github.com/EPCCed/prace-spark-for-data-scientists.git
- Change into the docker directory within the cloned repository, then build the image and start the container:
cd prace-spark-for-data-scientists/docker
docker build -t prace_spark_course .
./start_docker.sh
A copy of the git repository is available in the Docker container.
The last command starts a bash terminal session in the Docker container. You can now start Jupyter:
cd prace-spark-for-data-scientists/docker
./pyspark_jupyter.sh
and access it from a browser on the host machine (i.e. your laptop) at http://localhost:8000/
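If the page does not load, check from another terminal on the host that the container is running and that port 8000 is mapped; the exact container name and port mapping come from start_docker.sh, so yours may differ:

docker ps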