Spark Setup Guide on AWS EC2

This guide provides step-by-step instructions to set up Apache Spark on AWS EC2 with one master node and one worker node.

Master Node Setup

1. Install Java

Update the system and install Java:

sudo yum update -y
sudo yum install -y java-11-amazon-corretto 
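If the installation succeeded, the JVM should report version 11; a quick check:

java -version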

2. Download and Install Spark

Download Spark:

wget https://archive.apache.org/dist/spark/spark-3.5.2/spark-3.5.2-bin-hadoop3.tgz

Extract the downloaded file:

tar xvf spark-3.5.2-bin-hadoop3.tgz

Move Spark to the /opt directory:

sudo mv spark-3.5.2-bin-hadoop3 /opt/spark
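One caveat: because the move runs under sudo, /opt/spark ends up owned by root. If you start Spark as the default ec2-user (assumed here), handing the directory to that user avoids permission errors when the daemons write logs and work directories:

# Assumes the default Amazon Linux user; adjust if you log in as someone else
sudo chown -R ec2-user:ec2-user /opt/spark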

3. Set Environment Variables

Set the necessary environment variables:

echo "export SPARK_HOME=/opt/spark" >> ~/.bashrc
echo "export PATH=\$PATH:\$SPARK_HOME/bin" >> ~/.bashrc
echo "export JAVA_HOME=/usr/lib/jvm/java-11-amazon-corretto.x86_64" >> ~/.bashrc
source ~/.bashrc
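To confirm the variables took effect, you can print SPARK_HOME and ask Spark for its version:

echo $SPARK_HOME   # should print /opt/spark
$SPARK_HOME/bin/spark-submit --version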

4. Configure Spark (Master Node)

Navigate to Spark configuration directory:

cd $SPARK_HOME/conf

Copy the template for the Spark environment file:

cp spark-env.sh.template spark-env.sh

Edit the spark-env.sh file:

nano spark-env.sh

Add the following line:

export SPARK_MASTER_HOST='<Master-Node-Private-IP>'
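For example, if the master's private IP were 172.31.5.10 (a made-up address; substitute your own), the line would read:

export SPARK_MASTER_HOST='172.31.5.10'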

5. Start the Spark Master Node

Start the Spark master node:

$SPARK_HOME/sbin/start-master.sh
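To verify the master came up, you can list running JVMs with jps (bundled with the JDK) or tail the master log; the exact log file name includes your username and hostname, so a glob is used here:

jps    # should list a process named "Master"
tail $SPARK_HOME/logs/spark-*-org.apache.spark.deploy.master.Master-*.out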

6. Install AWS CLI

Install the AWS CLI to manage datasets:

sudo yum install awscli -y
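Note that aws s3 cp needs credentials unless the bucket is public: either attach an IAM role with S3 read access to the instance, or configure access keys manually. A quick way to confirm credentials resolve:

# Only needed if the instance has no IAM role with S3 access
aws configure
aws sts get-caller-identity    # verifies that credentials are being picked up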

7. Create a Directory for Data

Create a directory to store your data:

mkdir -p /home/ec2-user/data

8. Download Data from S3

Download the dataset from Amazon S3:

aws s3 cp s3://<your-bucket>/<your-dataset-path> /home/ec2-user/data/

For example:

aws s3 cp s3://amazon-bucket23/amazon_reviews.csv /home/ec2-user/data
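You can confirm the file landed:

ls -lh /home/ec2-user/data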

9. Start Spark Shell

Start the interactive Spark shell:

$SPARK_HOME/bin/spark-shell
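Launched with no arguments, spark-shell runs in local mode and will not use the cluster. To attach the shell to the standalone master started above, pass the master URL, substituting your master's private IP:

$SPARK_HOME/bin/spark-shell --master spark://<Master-Node-Private-IP>:7077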

Notes

Ensure that you replace <Master-Node-Private-IP> with the private IP address of your master EC2 instance. The EC2 security group must also allow inbound traffic on port 7077 (worker-to-master communication) and port 8080 (the Web UI) for the cluster to function.

Use the Spark Web UI to monitor your cluster.

http://<Master-Node-IP>:8080

Worker Node Setup

To complete the Spark cluster, set up the worker node as follows:

1. Install Java

Update the system and install Java:

sudo yum update -y
sudo yum install -y java-11-amazon-corretto

2. Download and Install Spark

Download Spark:

wget https://archive.apache.org/dist/spark/spark-3.5.2/spark-3.5.2-bin-hadoop3.tgz

Extract the downloaded file:

tar xvf spark-3.5.2-bin-hadoop3.tgz

Move Spark to the /opt directory:

sudo mv spark-3.5.2-bin-hadoop3 /opt/spark

3. Set Environment Variables

Configure the environment variables for Spark and Java:

echo "export SPARK_HOME=/opt/spark" >> ~/.bashrc
echo "export PATH=\$PATH:\$SPARK_HOME/bin" >> ~/.bashrc
echo "export JAVA_HOME=/usr/lib/jvm/java-11-amazon-corretto.x86_64" >> ~/.bashrc
source ~/.bashrc

4. Start the Worker Node

Start the worker node:

$SPARK_HOME/sbin/start-worker.sh spark://<Master-Node-Private-IP>:7077

Replace <Master-Node-Private-IP> with the private IP address of the Spark master node. (In Spark 3.x the script is named start-worker.sh; the older start-slave.sh name is deprecated.)
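You can also check from the worker itself that the daemon started and registered with the master; as with the master log, the file name includes your username and hostname:

tail $SPARK_HOME/logs/spark-*-org.apache.spark.deploy.worker.Worker-*.out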

Verify that the worker node is successfully connected to the master by accessing the Spark Web UI at:

http://<Master-Node-IP>:8080

Verification

The worker node's status should appear as ALIVE on the Spark Web UI. This indicates that the master and worker nodes are successfully communicating.
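When you are done, the cluster can be shut down with the matching stop scripts (worker first, then master):

# On the worker node
$SPARK_HOME/sbin/stop-worker.sh

# On the master node
$SPARK_HOME/sbin/stop-master.sh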
