
# PySpark ML Crashcourse

This repository contains exercises and solutions for a one-day crash course on PySpark and Spark ML. It consists exclusively of Jupyter notebooks, which assume a working PySpark kernel with Python 3.5 and Spark 2.1.

## Author

All notebooks have been created by Kaya Kupferschmidt @ dimajix. If you have any questions, feel free to contact me at [email protected].

## 01 - PySpark DataFrame Introduction

This notebook contains some simple snippets to get a basic understanding of how to interact with Spark DataFrames in Python.
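
For orientation, a minimal sketch of the kind of DataFrame interaction covered here (the column names and data are made up for illustration; in the notebooks a `spark` session is typically already provided by the kernel):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName("dataframe-intro").getOrCreate()

# Build a small DataFrame from local data
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"]
)

# Typical DataFrame operations: filter, select, aggregate
df.filter(F.col("age") > 30).select("name", "age").show()
df.agg(F.avg("age").alias("avg_age")).show()
```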

## 02 - PySpark Word Count (exercise + solution)

These notebooks contain the classic word count, implemented with DataFrames.
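
A possible DataFrame-based word count looks roughly like the sketch below (the input path is just a placeholder, not the file used in the exercise):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Read a text file as a DataFrame with a single `value` column
lines = spark.read.text("data/some_text_file.txt")

# Split each line into words, explode into one row per word, then count
words = lines.select(F.explode(F.split(F.col("value"), "\\s+")).alias("word"))
counts = words.filter(F.col("word") != "") \
    .groupBy("word") \
    .count() \
    .orderBy(F.col("count").desc())

counts.show(20)
```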

## 03 - Linear Regression (skeleton + solution)

These notebooks contain a simple linear regression exercise as an introduction to machine learning with Spark.
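
The general shape of a Spark ML regression workflow is sketched below on a tiny toy dataset (the data and column names are invented for illustration; the notebooks use their own dataset):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("linear-regression").getOrCreate()

# A tiny toy dataset: one numeric feature `x` and a numeric target `label`
data = spark.createDataFrame(
    [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)],
    ["x", "label"]
)

# Spark ML expects all features combined into a single vector column
assembler = VectorAssembler(inputCols=["x"], outputCol="features")
features = assembler.transform(data)

# Fit the model and inspect the learned parameters
lr = LinearRegression(featuresCol="features", labelCol="label")
model = lr.fit(features)
print("coefficients:", model.coefficients, "intercept:", model.intercept)

# Predictions on the training data (a real exercise would use a held-out test set)
model.transform(features).select("x", "label", "prediction").show()
```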

## 04 - Text Classification (exercise + solution)

Building on the linear regression example, these notebooks contain an exercise for performing a simple statistical text classification.
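
A common way to set this up in Spark ML is a bag-of-words pipeline; the sketch below uses invented placeholder data and is only meant to show the overall structure, not the exact pipeline from the notebooks:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("text-classification").getOrCreate()

# A minimal labelled text dataset (placeholder data)
data = spark.createDataFrame([
    ("spark is great for big data", 1.0),
    ("I love functional programming", 1.0),
    ("the weather is terrible today", 0.0),
    ("my cat sleeps all day", 0.0),
], ["text", "label"])

# Classic pipeline: tokenize, hash into term frequencies, weight with IDF, classify
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1024)
idf = IDF(inputCol="tf", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[tokenizer, tf, idf, lr])
model = pipeline.fit(data)

model.transform(data).select("text", "prediction").show(truncate=False)
```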

## 05 - Hyper Parameter Tuning (exercise + solution)

As with many complex algorithms and ML pipelines, the text classification pipeline has many hyperparameters. These notebooks show how to perform hyperparameter tuning with PySpark.
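
One standard approach is grid search with cross-validation; the sketch below reuses the `tf`, `lr`, `pipeline` and `data` names from the text classification sketch above and assumes a reasonably sized labelled dataset (the placeholder data there is too small for 3-fold cross-validation to be meaningful):

```python
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Grid over hyperparameters of the HashingTF and LogisticRegression stages
grid = ParamGridBuilder() \
    .addGrid(tf.numFeatures, [1024, 4096]) \
    .addGrid(lr.regParam, [0.01, 0.1]) \
    .build()

evaluator = BinaryClassificationEvaluator(labelCol="label")

# Cross-validation fits the pipeline once per fold for every parameter combination
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=evaluator,
                    numFolds=3)

cv_model = cv.fit(data)
print("best AUC:", max(cv_model.avgMetrics))
```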