This guide provides step-by-step instructions to set up Apache Spark on AWS EC2 with one master node and one worker node.
Set up the master node as follows:
Update the system and install Java:
sudo yum update -y
sudo yum install -y java-11-amazon-corretto
Download the Spark 3.5.2 binary archive (spark-3.5.2-bin-hadoop3.tgz):
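A minimal sketch of this step, assuming the archive is fetched with wget from the Apache release archive. The URL and the helper name download_spark are assumptions, not taken from this guide; only the file name comes from the extraction step that follows.

```shell
# Assumed source: archive.apache.org, which retains older Spark releases.
SPARK_VERSION="3.5.2"
SPARK_ARCHIVE="spark-${SPARK_VERSION}-bin-hadoop3.tgz"
SPARK_URL="https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_ARCHIVE}"

# Hypothetical helper: fetch the archive into the current directory,
# skipping the download when the file is already there.
download_spark() {
  if [ -f "$SPARK_ARCHIVE" ]; then
    echo "$SPARK_ARCHIVE already present, skipping download"
  else
    wget "$SPARK_URL"
  fi
}
```

Calling download_spark from the directory where you plan to extract Spark leaves the archive ready for the next step.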
Extract the downloaded file:
tar xvf spark-3.5.2-bin-hadoop3.tgz
Move Spark to the /opt directory:
sudo mv spark-3.5.2-bin-hadoop3 /opt/spark
Set the necessary environment variables:
echo "export SPARK_HOME=/opt/spark" >> ~/.bashrc
echo "export PATH=\$PATH:\$SPARK_HOME/bin" >> ~/.bashrc
echo "export JAVA_HOME=/usr/lib/jvm/java-11-amazon-corretto.x86_64" >> ~/.bashrc
source ~/.bashrc
Navigate to the Spark configuration directory:
cd $SPARK_HOME/conf
Copy the Spark environment template, spark-env.sh.template, to a new file named spark-env.sh.
Open spark-env.sh in a text editor and add the following line:
export SPARK_MASTER_HOST='<Master-Node-Private-IP>'
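The copy-and-edit steps above can also be done without opening an editor. This is a sketch only: the helper name set_master_host and its two arguments are hypothetical, while spark-env.sh.template is the template file shipped in Spark's conf directory.

```shell
# Hypothetical helper: create spark-env.sh from its template when it does not
# exist yet, then append the master host setting.
#   $1 = Spark conf directory
#   $2 = private IP address of the master node
set_master_host() {
  conf_dir="$1"
  master_ip="$2"

  # Keep an existing spark-env.sh; otherwise start from the shipped template.
  if [ ! -f "$conf_dir/spark-env.sh" ]; then
    cp "$conf_dir/spark-env.sh.template" "$conf_dir/spark-env.sh" || return 1
  fi

  # Append the same line the guide asks you to add by hand.
  echo "export SPARK_MASTER_HOST='$master_ip'" >> "$conf_dir/spark-env.sh"
}

# Usage (replace the placeholder with the real address):
#   set_master_host "$SPARK_HOME/conf" '<Master-Node-Private-IP>'
```

Appending rather than overwriting keeps any settings already present in spark-env.sh, though running the helper twice would add the line twice.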
Start the Spark master node by running the start-master.sh script in $SPARK_HOME/sbin.
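A minimal sketch of this step, assuming Spark is installed in /opt/spark as described earlier. start-master.sh is the launch script in Spark's sbin directory; the existence check around it is an added precaution, not part of the original instructions.

```shell
# Fall back to this guide's install location when SPARK_HOME is not set.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"
MASTER_SCRIPT="$SPARK_HOME/sbin/start-master.sh"

# Launch the standalone master only if the script is actually there.
if [ -x "$MASTER_SCRIPT" ]; then
  "$MASTER_SCRIPT"
else
  echo "start-master.sh not found under $SPARK_HOME/sbin" >&2
fi
```

By default the master then accepts workers on port 7077 and serves its Web UI on port 8080.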
Install the AWS CLI to manage datasets:
sudo yum install awscli -y
Create a directory to store your data:
mkdir -p /home/ec2-user/data
Download the dataset from Amazon S3, substituting your own bucket and object path:
aws s3 cp s3://<your-bucket>/<path-to-your-dataset> /home/ec2-user/data/
For example:
aws s3 cp s3://amazon-bucket23/amazon_reviews.csv /home/ec2-user/data/
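Start the interactive Spark shell, passing the master URL spark://<Master-Node-Private-IP>:7077 through the --master option.
Ensure that you replace <Master-Node-Private-IP> with the private IP address of your master EC2 instance.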
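A sketch of the launch command, assuming the shell should attach to the standalone master on its default port 7077. The variable names MASTER_IP and MASTER_URL are illustrative, and the PATH check is an added precaution.

```shell
# Placeholder: substitute the private IP address of the master EC2 instance.
MASTER_IP="<Master-Node-Private-IP>"
# 7077 is the default port of the standalone master.
MASTER_URL="spark://${MASTER_IP}:7077"

# spark-shell is on PATH once $SPARK_HOME/bin has been added to it.
if command -v spark-shell >/dev/null 2>&1; then
  spark-shell --master "$MASTER_URL"
else
  echo "spark-shell is not on PATH; check SPARK_HOME and PATH" >&2
fi
```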
Use the Spark Web UI, which the master serves on port 8080 by default, to monitor your cluster.
To complete the Spark cluster, set up the worker node as follows:
Update the system and install Java:
sudo yum update -y
sudo yum install -y java-11-amazon-corretto
Download the Spark 3.5.2 binary archive, spark-3.5.2-bin-hadoop3.tgz, as on the master node.
Extract the downloaded file:
tar xvf spark-3.5.2-bin-hadoop3.tgz
Move Spark to the /opt directory:
sudo mv spark-3.5.2-bin-hadoop3 /opt/spark
Configure the environment variables for Spark and Java:
echo "export SPARK_HOME=/opt/spark" >> ~/.bashrc
echo "export PATH=\$PATH:\$SPARK_HOME/bin" >> ~/.bashrc
echo "export JAVA_HOME=/usr/lib/jvm/java-11-amazon-corretto.x86_64" >> ~/.bashrc
source ~/.bashrc
Start the worker node:
$SPARK_HOME/sbin/start-worker.sh spark://<Master-Node-Private-IP>:7077
Replace <Master-Node-Private-IP> with the private IP address of the Spark master node.
Verify that the worker node is successfully connected to the master by opening the master's Spark Web UI (port 8080 by default) in a browser.
The worker node's status should appear as ALIVE on the Spark Web UI. This indicates that the master and worker nodes are successfully communicating.
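The same check can be scripted. The sketch below assumes the master's Web UI also serves a JSON status page at /json/ on port 8080 and that each worker entry carries a "state" field on its own line. The helper names and the exact output layout are assumptions to verify against your Spark version.

```shell
# Hypothetical helper: read the master's JSON status on stdin and print
# how many lines report a worker state of ALIVE.
count_alive_workers() {
  grep -c '"state" *: *"ALIVE"'
}

# Hypothetical helper: fetch the status page from the given master host.
#   $1 = host name or IP address of the master node
fetch_master_status() {
  curl -s "http://$1:8080/json/"
}

# Usage (replace the placeholder with a reachable master address):
#   fetch_master_status '<Master-Node-Private-IP>' | count_alive_workers
```

With one worker connected, the count should match the single ALIVE entry shown in the Web UI.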