This guide provides detailed steps to set up a multi-node Hadoop cluster on Ubuntu. It configures a master node (NameNode) and multiple worker nodes (DataNodes).
- At least 2 Ubuntu machines (or VMs): One as the Master (NameNode) and others as Workers (DataNodes).
- Passwordless SSH access between all nodes.
- Java installed on all nodes (Hadoop requires Java).
- Hadoop binaries downloaded and configured on all nodes.
On every node, update the package index and create a dedicated Hadoop user:
sudo apt update -y
sudo adduser hduser
Give the new user passwordless sudo:
sudo visudo
Add this line below the root entry:
hduser ALL=(ALL) NOPASSWD: ALL
Install Java (Hadoop requires it):
sudo apt install openjdk-8-jdk -y
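Hadoop will need the JDK path for JAVA_HOME in the next steps; a quick way to confirm the install and its location (assuming the default Ubuntu package layout) is:
java -version
readlink -f $(which java)
# typically resolves to /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java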
- Download Hadoop binaries and extract them:
wget -c -O hadoop.tar.gz https://dlcdn.apache.org/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz
tar xvzf hadoop.tar.gz
sudo mkdir /usr/local/hadoop
sudo mv hadoop-3.2.4/* /usr/local/hadoop
- Set environment variables in the ~/.bashrc file:
#JAVA
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export JRE_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
#Hadoop Environment Variables
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
export HADOOP_MAPRED_HOME=$HADOOP_HOME
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export PDSH_RCMD_TYPE=ssh
- Reload the .bashrc file:
source ~/.bashrc
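A quick check that the new variables are active in the current shell:
echo $HADOOP_HOME
# should print /usr/local/hadoop
hadoop version
# should report Hadoop 3.2.4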
- Change ownership and permissions:
sudo chown -R hduser:hduser /usr/local/hadoop
sudo chmod -R 755 /usr/local/hadoop
- Edit hadoop-env.sh to set the Java home path and the users that run the Hadoop daemons (this file, like the other configuration files edited below, lives in /usr/local/hadoop/etc/hadoop):
nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Add this to the file:
#JAVA
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export JRE_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
#Hadoop Environment Variables
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
export HDFS_NAMENODE_USER=hduser
export HDFS_DATANODE_USER=hduser
export HDFS_SECONDARYNAMENODE_USER=hduser
export YARN_RESOURCEMANAGER_USER=hduser
export YARN_NODEMANAGER_USER=hduser
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
- Install SSH and start the service:
sudo apt install ssh -y
sudo systemctl start ssh
sudo systemctl enable ssh
- Install pdsh for parallel shell access:
sudo apt install pdsh -y
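Once passwordless SSH between the nodes is set up later in this guide, you can sanity-check pdsh (the Hadoop start scripts can use it to reach the workers; PDSH_RCMD_TYPE=ssh set above tells it to go over SSH) with, for example:
pdsh -w node1,node2 hostname
# each worker should echo its own hostname back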
- Set the hostname to master:
sudo hostnamectl set-hostname master
- Update the /etc/hosts file to include the IP addresses of the master and worker nodes:
sudo nano /etc/hosts
Example entries:
192.168.56.103 master
192.168.56.104 node1
192.168.56.105 node2
(and so on for any additional worker nodes)
- Edit the following Hadoop configuration files:
sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
Add this to the file:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Add this to the file:
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/usr/local/hadoop/hd-data/nn</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
Add this to the file:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
Add this to the file:
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
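Once the files are saved, asking Hadoop to read a value back is a quick sanity check that the XML parses and is being picked up:
hdfs getconf -confKey fs.defaultFS
# should print hdfs://master:9000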
- Edit the workers file:
nano /usr/local/hadoop/etc/hadoop/workers
Add the hostname (or IP address) of every DataNode, one per line:
node1
node2
- Create directories for Hadoop's NameNode data:
mkdir -p /usr/local/hadoop/hd-data/nn
- Generate an SSH key for passwordless access and copy it to every node:
ssh-keygen -t rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@master
ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@node1
ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@node2
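The start scripts rely on this working without any prompt, so it is worth verifying that each of the following returns a hostname without asking for a password:
ssh master hostname
ssh node1 hostname
ssh node2 hostname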
- Format the NameNode (run once, as hduser):
hdfs namenode -format
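If the format succeeded, the NameNode metadata directory created earlier should now be populated:
ls /usr/local/hadoop/hd-data/nn/current
# expect a VERSION file and an initial fsimage_* file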
- Start the HDFS services and list the running Java processes:
start-dfs.sh
jps
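If the daemons came up cleanly, the jps output on the master typically looks like the following (process IDs will differ):
<pid> NameNode
<pid> SecondaryNameNode
<pid> Jps
Once the worker nodes below are set up and the services restarted, jps on each worker should additionally show a DataNode process.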
Now set up the first worker node (node1).
- Download Hadoop binaries and extract them:
wget -c -O hadoop.tar.gz https://dlcdn.apache.org/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz
tar xvzf hadoop.tar.gz
sudo mkdir /usr/local/hadoop
sudo mv hadoop-3.2.4/* /usr/local/hadoop
- Change ownership and permissions:
sudo chown -R hduser:hduser /usr/local/hadoop
sudo chmod -R 755 /usr/local/hadoop
- Set the hostname to node1 and update /etc/hosts with the same mapping used on the master:
sudo hostnamectl set-hostname node1
sudo nano /etc/hosts
Example entries:
192.168.56.103 master
192.168.56.104 node1
192.168.56.105 node2
(and so on for any additional worker nodes)
- Edit the necessary Hadoop configuration files:
sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
Add this to the file:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Add this to the file:
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/hd-data/dn</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
Add this to the file:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
Add this to the file:
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/usr/local/hadoop/hd-data/yarn/data</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/usr/local/hadoop/hd-data/yarn/logs</value>
  </property>
  <property>
    <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
    <value>99.9</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>
- Edit the workers file on this node:
nano /usr/local/hadoop/etc/hadoop/workers
Add this line if it is not already present:
localhost
- Create directories for DataNode storage and YARN logs:
mkdir -p /usr/local/hadoop/hd-data/dn
mkdir -p /usr/local/hadoop/hd-data/yarn/logs
mkdir -p /usr/local/hadoop/hd-data/yarn/data
Now set up the second worker node (node2).
- Download Hadoop binaries and extract them:
wget -c -O hadoop.tar.gz https://dlcdn.apache.org/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz
tar xvzf hadoop.tar.gz
sudo mkdir /usr/local/hadoop
sudo mv hadoop-3.2.4/* /usr/local/hadoop
- Change ownership and permissions:
sudo chown -R hduser:hduser /usr/local/hadoop
sudo chmod -R 755 /usr/local/hadoop
- Set the hostname to node2 and update /etc/hosts with the same mapping used on the other nodes:
sudo hostnamectl set-hostname node2
sudo nano /etc/hosts
Example entries:
192.168.56.103 master
192.168.56.104 node1
192.168.56.105 node2
(and so on for any additional worker nodes)
- Copy the configuration files from node1 (all four .xml files are the same on every DataNode, so they can simply be copied over):
scp node1:/usr/local/hadoop/etc/hadoop/*.xml /usr/local/hadoop/etc/hadoop/
- Create directories for DataNode storage and YARN logs:
mkdir -p /usr/local/hadoop/hd-data/dn
mkdir -p /usr/local/hadoop/hd-data/yarn/logs
mkdir -p /usr/local/hadoop/hd-data/yarn/data
- Back on the master, start the HDFS and YARN services and confirm the daemons are running:
start-dfs.sh
start-yarn.sh
jps
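To confirm that both workers actually joined the cluster, two standard checks are useful (the web UI ports below are the Hadoop 3.x defaults):
hdfs dfsadmin -report
# should list both DataNodes as live
yarn node -list
# should list both NodeManagers as RUNNING
The NameNode web UI is also available at http://master:9870 and the ResourceManager UI at http://master:8088.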
After following these steps, your multi-node Hadoop cluster should be successfully set up with one master node and multiple worker nodes.
Crafted by: Suraj Kumar Choudhary | Feel free to DM for any help: [email protected]