This is a collection of IPython notebook/Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, by using the Python language.
If Python is not your language, and it is R, you may want to have a look at our R on Apache Spark (SparkR) notebooks instead. Additionally, if you are interested in an introduction to basic Data Science Engineering, you might find this series of tutorials interesting, where we explain different concepts and applications using Python and R.
A good way of using these notebooks is by first cloning the repo and then starting your own IPython notebook/Jupyter server in pySpark mode. For example, if we have a standalone Spark installation running on our localhost, with a maximum of 6GB per node assigned to IPython:
MASTER="spark://127.0.0.1:7077" SPARK_EXECUTOR_MEMORY="6G" IPYTHON_OPTS="notebook --pylab inline" ~/spark-1.5.0-bin-hadoop2.6/bin/pyspark
Notice that the path to the pyspark command will depend on your specific installation. So, as a requirement, you need to have Spark installed on the same machine where you are going to start the IPython notebook server.
For more Spark options see here. In general, the rule is that options described in the form spark.executor.memory are passed as SPARK_EXECUTOR_MEMORY when calling IPython/pySpark.
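As a rough sketch of the same idea expressed in code (this is not part of the notebooks themselves, and the master URL and application name below are placeholders), such options can also be set programmatically on a SparkConf object:

```python
from pyspark import SparkConf, SparkContext

# Equivalent of passing SPARK_EXECUTOR_MEMORY on the command line:
# the option documented as spark.executor.memory is set directly on the conf.
conf = (SparkConf()
        .setMaster("spark://127.0.0.1:7077")   # placeholder standalone master
        .setAppName("spark-py-notebooks")      # placeholder application name
        .set("spark.executor.memory", "6g"))

sc = SparkContext(conf=conf)
```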
We will be using datasets from the KDD Cup 1999. The results of this competition can be found here.
The reference book for these and other Spark related topics is:
- Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia.
The following notebooks can be examined individually, although there is a more or less linear 'story' when they are followed in sequence. All of them use the same dataset to solve a related set of tasks.
About reading files and parallelize.
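A minimal sketch of both creation methods (assuming `sc` is the SparkContext provided by the pySpark shell, and that the file name refers to a local copy of the KDD Cup data):

```python
# RDD from a (possibly compressed) text file; the file name is an assumption
# about where the KDD Cup 1999 10% sample has been downloaded locally.
raw_data = sc.textFile("./kddcup.data_10_percent.gz")
print(raw_data.count())

# RDD from a local Python collection.
a_range = sc.parallelize(range(100))
print(a_range.take(5))
```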
A look at map, filter, and collect.
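For example, a short sketch of the three of them on the KDD Cup data (assuming `raw_data` is the RDD created above, and that the last CSV field is the interaction tag):

```python
csv_data = raw_data.map(lambda line: line.split(","))              # transformation: parse CSV
normal_data = csv_data.filter(lambda cols: cols[-1] == "normal.")  # transformation: keep normal traffic
normal_interactions = normal_data.collect()                        # action: bring results to the driver
print(len(normal_interactions))
```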
RDD sampling methods explained.
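Roughly, the two flavours look like this (fractions, sizes, and seeds are arbitrary):

```python
# sample is a transformation and returns a new (smaller) RDD.
sample_rdd = raw_data.sample(False, 0.1, 1234)   # no replacement, ~10% of the data, fixed seed
print(sample_rdd.count())

# takeSample is an action and returns a local Python list.
local_sample = raw_data.takeSample(False, 1000, 1234)
print(len(local_sample))
```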
Brief introduction to some of the RDD pseudo-set operations.
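A small sketch of some of them (they are 'pseudo' set operations because uniqueness of elements is not enforced):

```python
normal_raw = raw_data.filter(lambda line: "normal." in line)
attack_raw = raw_data.subtract(normal_raw)      # everything that is not a normal interaction
all_raw = normal_raw.union(attack_raw)          # put the two subsets back together
tags = raw_data.map(lambda line: line.split(",")[-1]).distinct()  # unique interaction tags
print(tags.collect())
```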
RDD actions reduce, fold, and aggregate.
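For instance, applied to the duration field of each interaction (assumed here to be the first CSV column):

```python
durations = raw_data.map(lambda line: float(line.split(",")[0]))

total = durations.reduce(lambda a, b: a + b)            # no zero value
total_fold = durations.fold(0.0, lambda a, b: a + b)    # same result, with an explicit zero value
# aggregate allows the result type to differ from the element type: here (sum, count).
sum_count = durations.aggregate(
    (0.0, 0),
    lambda acc, d: (acc[0] + d, acc[1] + 1),            # add one element to the accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]))            # merge two partition accumulators
print(total, total_fold, sum_count[0] / sum_count[1])   # the last value is the mean duration
```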
How to deal with key/value pairs in order to aggregate and explore data.
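For example, a sketch that aggregates durations per interaction tag (column positions are an assumption about the KDD format):

```python
key_value = raw_data.map(lambda line: line.split(",")).map(
    lambda cols: (cols[-1], float(cols[0])))                  # (tag, duration) pairs
durations_by_tag = key_value.reduceByKey(lambda a, b: a + b)  # total duration per tag
counts_by_tag = key_value.countByKey()                        # action: dict {tag: count}
print(durations_by_tag.collect())
print(counts_by_tag)
```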
A notebook introducing Local Vector types, basic statistics in MLlib for Exploratory Data Analysis and model selection.
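Roughly along these lines (which numeric columns to keep is an assumption made for the sake of the example):

```python
import numpy as np
from pyspark.mllib.stat import Statistics

def parse_numeric(line):
    cols = line.split(",")
    # duration, src_bytes, dst_bytes -- column positions assumed from the KDD format
    return np.array([float(cols[0]), float(cols[4]), float(cols[5])])

numeric_data = raw_data.map(parse_numeric)
summary = Statistics.colStats(numeric_data)     # column-wise summary statistics
print(summary.mean(), summary.variance(), summary.numNonzeros())
```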
Labeled points and Logistic Regression classification of network attacks in MLlib. Application of model selection techniques using a correlation matrix and hypothesis testing.
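A bare-bones sketch of the classification part (the selected feature columns and the binary labelling are simplifying assumptions):

```python
import numpy as np
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

def parse_interaction(line):
    cols = line.split(",")
    label = 0.0 if cols[-1] == "normal." else 1.0         # normal vs. any kind of attack
    features = [float(x) for x in cols[0:1] + cols[4:8]]  # a few numeric columns only
    return LabeledPoint(label, np.array(features))

training_data = raw_data.map(parse_interaction)
model = LogisticRegressionWithLBFGS.train(training_data)
labels_and_preds = training_data.map(lambda p: (p.label, model.predict(p.features)))
accuracy = labels_and_preds.filter(lambda lp: lp[0] == lp[1]).count() / float(training_data.count())
print(accuracy)
```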
Use of tree-based methods and how they help explain models and perform feature selection.
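As a sketch, reusing the labeled points from the previous example:

```python
from pyspark.mllib.tree import DecisionTree

tree_model = DecisionTree.trainClassifier(
    training_data, numClasses=2, categoricalFeaturesInfo={},
    impurity="gini", maxDepth=4, maxBins=32)

# The learned splits are human readable, which is what makes trees useful
# for interpreting the model and selecting features.
print(tree_model.toDebugString())
```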
In this notebook a schema is inferred for our network interactions dataset. Based on that, we use Spark's SQL DataFrame abstraction to perform a more structured exploratory data analysis.
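A condensed sketch of the approach (only a handful of fields are mapped to Row attributes here):

```python
from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)

row_data = raw_data.map(lambda line: line.split(",")).map(
    lambda cols: Row(duration=int(cols[0]), protocol_type=cols[1],
                     service=cols[2], tag=cols[-1]))

interactions_df = sqlContext.createDataFrame(row_data)   # schema inferred from the Row objects
interactions_df.registerTempTable("interactions")
sqlContext.sql("SELECT protocol_type, COUNT(*) AS total "
               "FROM interactions GROUP BY protocol_type").show()
```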
Beyond the basics: close-to-real-world applications using Spark and other technologies.
Same tech stack, this time with an AngularJS client app.
This tutorial can be used independently to build a movie recommender model based on the MovieLens dataset. Most of the code in the first part, about how to use ALS with the public MovieLens dataset, comes from my solution to one of the exercises proposed in CS100.1x Introduction to Big Data with Apache Spark by Anthony D. Joseph on edX, which has also been publicly available since 2014 at Spark Summit.
There I have introduced minor modifications to use a larger dataset, and also added code on how to store and reload the model for later use. On top of that, we build a Flask web service so the recommender can be used to provide movie recommendations on-line.
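The core of the model-building and persistence part looks roughly like this (file paths and ALS parameters are placeholders, and the real tutorial parses the MovieLens files more carefully):

```python
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel

# (userID, movieID, rating) tuples; assumes a headerless CSV -- adapt parsing as needed.
ratings = sc.textFile("ratings.csv").map(lambda line: line.split(",")).map(
    lambda cols: (int(cols[0]), int(cols[1]), float(cols[2])))

model = ALS.train(ratings, rank=8, iterations=10, lambda_=0.1)
model.save(sc, "models/movie_lens_als")                                  # store for later use
same_model = MatrixFactorizationModel.load(sc, "models/movie_lens_als")  # reload inside the web service
print(same_model.predict(1, 10))   # predicted rating of movie 10 for user 1
```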
My attempt at using Spark with this classic dataset and Knowledge Discovery competition.