Week 5: Batch Processing

5.1 Introduction

🎥 Introduction to Batch Processing

🎥 Introduction to Spark

5.2 Installation

Follow these instructions to install Spark:

Then follow this to run PySpark in Jupyter.

🎥 Installing Spark (Linux)

5.3 Spark SQL and DataFrames

🎥 First Look at Spark/PySpark

🎥 Spark Dataframes

🎥 (Optional) Preparing Yellow and Green Taxi Data

Script to prepare the dataset: download_data.sh

Note: Besides using pandas, another way to infer the schema of the CSV files is to set the inferSchema option to true when reading them in Spark.

🎥 SQL with Spark

5.4 Spark Internals

🎥 Anatomy of a Spark Cluster

🎥 GroupBy in Spark

🎥 Joins in Spark

5.5 RDDs

Coming soon

Homework

See here for more details

Community notes

Did you take notes? You can share them here.