Clone this repository to your local machine.
Copy the Spark installation (the spark-3.2.3-bin-hadoop3.2 folder) into the cloned repository.
Create a virtual environment for this training using Python 3.9.13, then activate it. For example, with the built-in venv module (the environment name .venv below is just a suggestion):
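python -m venv .venv
source .venv/bin/activate
On Windows, activate with .venv\Scripts\activate instead.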
Install the dependencies from requirements.txt:
pip install -r requirements.txt
Run the following command from inside the project directory to verify that everything works:
./spark-3.2.3-bin-hadoop3.2/bin/spark-submit src/test_spark.py
You should see some logging output from Spark, ending with the text "SUCCESS!".
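For reference, a validation script like src/test_spark.py typically does little more than start a SparkSession, run a trivial job, and print a success message. The following is a minimal sketch of that idea, not the actual contents of the file:

from pyspark.sql import SparkSession

# Start a local Spark session.
spark = SparkSession.builder.appName("test_spark").getOrCreate()

# Run a trivial job to confirm Spark can execute work.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
assert df.count() == 2

spark.stop()
print("SUCCESS!")

If the command fails, the most common causes are a missing or misplaced spark-3.2.3-bin-hadoop3.2 folder and a Java installation that Spark cannot find.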