dsbristol/pyspark

pyspark

Pyspark and Hadoop repository for learning

This repository is part of Block 11 of the pre-2023 Data Science Toolbox, which discusses in detail why distributed data processing is structured in this way. It is an optional block (renamed Block 12) in the currently active Data Science Toolbox Coursebook.

This content has the following sections, which work through the provided material:

  • 11.2.0 Installation Notes, which explains installation on your personal (Windows or Mac) machine.
  • 11.2.1 on Hadoop, which must be run on BC4 unless you want to go to the trouble of installing Hadoop manually (not recommended).
  • 11.2.2 on Pyspark in Jupyter, which is the main component of the learning.
  • 11.2.3 on Pyspark on BC4, which replicates all content from the Jupyter section in a format appropriate for running on the cluster. You can follow the Jupyter notebook whilst running all of the code from this section on BC4, although this is not recommended.
