This guide provides step-by-step instructions to set up Apache Spark on AWS EC2 with one master node and one worker node.
Update the system and install Java:
sudo yum update -y
sudo yum install -y java-11-amazon-corretto
Download Spark:
wget https://archive.apache.org/dist/spark/spark-3.5.2/spark-3.5.2-bin-hadoop3.tgz
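Before extracting, it is worth verifying the archive's integrity; Apache publishes a SHA-512 digest alongside every release artifact. A minimal sketch (the digest is compared by eye, since the published file's exact format varies between releases):

```shell
# Compute the SHA-512 digest of the downloaded archive, if it is present.
if [ -f spark-3.5.2-bin-hadoop3.tgz ]; then
  sha512sum spark-3.5.2-bin-hadoop3.tgz
fi
# Compare the printed digest against the published value at:
# https://archive.apache.org/dist/spark/spark-3.5.2/spark-3.5.2-bin-hadoop3.tgz.sha512
```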
Extract the downloaded file:
tar xvf spark-3.5.2-bin-hadoop3.tgz
Move Spark to the /opt directory:
sudo mv spark-3.5.2-bin-hadoop3 /opt/spark
Set the necessary environment variables:
echo "export SPARK_HOME=/opt/spark" >> ~/.bashrc
echo "export PATH=\$PATH:\$SPARK_HOME/bin" >> ~/.bashrc
echo "export JAVA_HOME=/usr/lib/jvm/java-11-amazon-corretto.x86_64" >> ~/.bashrc
source ~/.bashrc
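To confirm the variables took effect in the current shell, a small helper like the following can be pasted into the session (the check_spark_env function is hypothetical, not part of Spark; it simply reports which required variables, if any, are still unset):

```shell
# Print "ok" when SPARK_HOME and JAVA_HOME are both set; otherwise list
# the missing variable names and return a non-zero status.
check_spark_env() {
  missing=""
  for var in SPARK_HOME JAVA_HOME; do
    eval "val=\$$var"
    if [ -z "$val" ]; then
      missing="$missing $var"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "ok"
}

# Usage: run `check_spark_env` after `source ~/.bashrc`; expect "ok".
```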
Navigate to Spark configuration directory:
cd $SPARK_HOME/conf
Copy the template for the Spark environment file:
cp spark-env.sh.template spark-env.sh
Edit the spark-env.sh file:
nano spark-env.sh
Add the following line:
export SPARK_MASTER_HOST='<Master-Node-Private-IP>'
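Rather than typing the address by hand, the instance can look up its own private IP from the EC2 instance metadata service and append the line itself. A sketch assuming IMDSv2, which is the default on current Amazon Linux AMIs:

```shell
# Request a short-lived IMDSv2 session token.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
# Read this instance's private IPv4 address from the metadata service.
PRIVATE_IP=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/local-ipv4")
# Append the master host setting to spark-env.sh.
echo "export SPARK_MASTER_HOST='$PRIVATE_IP'" >> "$SPARK_HOME/conf/spark-env.sh"
```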
Start the Spark master node:
$SPARK_HOME/sbin/start-master.sh
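To confirm the master actually came up, check that its two ports are listening: 7077 (the RPC port workers connect to) and 8080 (the web UI). A small sketch using bash's built-in /dev/tcp redirection (the port_open helper is hypothetical, not part of Spark):

```shell
# Return success if a TCP connection to $1:$2 can be opened, failure otherwise.
port_open() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

port_open localhost 7077 && echo "master RPC port (7077) is up" \
  || echo "master RPC port (7077) is not reachable"
port_open localhost 8080 && echo "web UI (8080) is up" \
  || echo "web UI (8080) is not reachable"
```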
Install the AWS CLI to manage datasets:
sudo yum install awscli -y
Create a directory to store your data:
mkdir -p /home/ec2-user/data
Download the dataset from Amazon S3:
aws s3 cp s3://<your-bucket>/<your-dataset> /home/ec2-user/data
For example: aws s3 cp s3://amazon-bucket23/amazon_reviews.csv /home/ec2-user/data
Start the interactive Spark shell, connecting it to the master (without the --master flag, spark-shell runs in local mode and the cluster is not used):
$SPARK_HOME/bin/spark-shell --master spark://<Master-Node-Private-IP>:7077
Replace <Master-Node-Private-IP> with the private IP address of your master EC2 instance.
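Once the shell is up, a quick smoke test is to read the dataset and count its rows. The snippet below pipes a short Scala fragment into spark-shell non-interactively (the CSV path is the example used above; substitute your own dataset, and replace the master placeholder with the real private IP):

```shell
$SPARK_HOME/bin/spark-shell --master spark://<Master-Node-Private-IP>:7077 <<'EOF'
// Read the example CSV (header row assumed) and print its row count.
val df = spark.read.option("header", "true").csv("/home/ec2-user/data/amazon_reviews.csv")
println(s"row count: ${df.count()}")
EOF
```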
Use the Spark Web UI to monitor your cluster.
http://<Master-Node-IP>:8080
To complete the Spark cluster, set up the worker node as follows:
Update the system and install Java:
sudo yum update -y
sudo yum install -y java-11-amazon-corretto
Download Spark:
wget https://archive.apache.org/dist/spark/spark-3.5.2/spark-3.5.2-bin-hadoop3.tgz
Extract the downloaded file:
tar xvf spark-3.5.2-bin-hadoop3.tgz
Move Spark to the /opt directory:
sudo mv spark-3.5.2-bin-hadoop3 /opt/spark
Configure the environment variables for Spark and Java:
echo "export SPARK_HOME=/opt/spark" >> ~/.bashrc
echo "export PATH=\$PATH:\$SPARK_HOME/bin" >> ~/.bashrc
echo "export JAVA_HOME=/usr/lib/jvm/java-11-amazon-corretto.x86_64" >> ~/.bashrc
source ~/.bashrc
Start the worker node:
$SPARK_HOME/sbin/start-worker.sh spark://<Master-Node-Private-IP>:7077
Replace <Master-Node-Private-IP> with the private IP address of the Spark master node. (The script is named start-worker.sh in Spark 3.x; older releases called it start-slave.sh.)
Verify that the worker node is successfully connected to the master by accessing the Spark Web UI at:
http://<Master-Node-IP>:8080
The worker node's status should appear as ALIVE on the Spark Web UI. This indicates that the master and worker nodes are successfully communicating.
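This check can also be scripted: the standalone master's web UI serves a JSON status document at /json that lists each registered worker and its state. A sketch (the alive_workers helper is hypothetical, and the pattern is deliberately loose because the JSON spacing can vary between Spark versions):

```shell
# Count lines of the master's JSON status that report an ALIVE state.
alive_workers() {
  grep -c '"state"[[:space:]]*:[[:space:]]*"ALIVE"'
}

# Usage, from any machine that can reach the master's web UI:
#   curl -s "http://<Master-Node-IP>:8080/json" | alive_workers
```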