pyspark

Pyspark and Hadoop repository for learning

This repository is part of Block 11 of the pre-2023 Data Science Toolbox, which discusses in detail why distributed data processing is structured in this way. It is an optional block (renamed to Block 12) in the currently active Data Science Toolbox Coursebook.

This content has the following sections, which work through the provided material:

11.2.0 Installation Notes, which explains installation on your personal (Windows or Mac) machine.
11.2.1 on Hadoop, which must be run on BC4, unless you want to go through the bother of installing Hadoop manually (not recommended).
11.2.2 on Pyspark in Jupyter, which is the main component of the learning.
11.2.3 on Pyspark on BC4, which replicates all content from the Jupyter section, but in a format that is appropriate for running on the cluster. You can follow the Jupyter notebook whilst running all code from this section on BC4. However, this is not recommended.
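
For orientation before starting the Hadoop and Pyspark sections, here is a minimal sketch of the map, shuffle and reduce pattern applied to a word count, written in plain Python so it runs without Hadoop or Spark. It is not the code from this repository's scripts: the function names `mapper`, `reducer` and `word_count` and the sample text are hypothetical, chosen only for illustration.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map step: emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(pairs):
    """Reduce step: sum the counts for each word.

    Assumes the pairs arrive sorted by word, as they would after a shuffle.
    """
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, then a local sort standing in for the shuffle, then reduce."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


if __name__ == "__main__":
    # Hypothetical sample input, not data from the repository.
    sample = ["the quick brown fox", "the lazy dog", "The end"]
    print(word_count(sample))
```

On a cluster the sort between the two steps is performed by the framework across machines. Here a local `sorted` call stands in for it, which is why the reducer can assume that all pairs for a given word are adjacent.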

Repository contents:

Supplement
data
.gitignore
1.1-map.py
1.1-reduce.py
1.2-map_wc.py
1.2-reduce_wc.py
11.2.0 Install notes.md
11.2.1 Hadoop On BC4.sh
11.2.2 Pyspark from Jupyter.ipynb
11.2.3 Pyspark on BC4.sh
2.1-SparkInputOutput.py
2.2-SparkCount.py
2.3-SparkStorageLevel.py
2.4-ReadAndFilter.py
2.5-MLlibRecommender.py
2.6-MapReduceWordcount.py
2.7-MLpipeline.py
LICENSE
README.md
