Multi-Node Hadoop Cluster Setup on Ubuntu

This guide provides detailed steps to set up a multi-node Hadoop cluster on Ubuntu. It configures a master node (NameNode) and multiple worker nodes (DataNodes).

Prerequisites

  1. At least 2 Ubuntu machines (or VMs): One as the Master (NameNode) and others as Workers (DataNodes).
  2. Passwordless SSH access between all nodes.
  3. Java installed on all nodes (Hadoop requires Java).
  4. Hadoop binaries downloaded and configured on all nodes.
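For illustration, the commands in this guide assume a small cluster of three nodes; the hostnames and IP addresses below are example values taken from the /etc/hosts entries shown later and should be replaced with your own:

    192.168.56.103  master   # NameNode / ResourceManager
    192.168.56.104  dn1      # DataNode / NodeManager
    192.168.56.105  dn2      # DataNode / NodeManager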

Step-by-Step Setup for a Multi-node Hadoop Cluster

1. Update System Packages:

 sudo apt update -y

2. Create a User and Grant sudo Privileges:

sudo adduser hduser
sudo visudo

Add the following line below the root entry:
hduser ALL=(ALL) NOPASSWD: ALL
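The remaining steps assume they are run as hduser on each node; a minimal sketch of switching to the new user and confirming the sudo rule:

    su - hduser     # switch to the new user
    sudo -l         # should list (ALL) NOPASSWD: ALL for hduser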

3. Install Java (OpenJDK 8):

 sudo apt install openjdk-8-jdk -y
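A quick check that Java is available and where the JDK lives (this path is what JAVA_HOME will point to below; the location assumes the default Ubuntu OpenJDK 8 package layout):

    java -version                   # should report openjdk version "1.8.0_..."
    readlink -f "$(which java)"     # typically /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java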

4. Master Node (NameNode) Setup

Download and Configure Hadoop

  • Download Hadoop binaries and extract them:

    wget -c -O hadoop.tar.gz https://dlcdn.apache.org/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz
    tar xvzf hadoop.tar.gz
    sudo mkdir /usr/local/hadoop
    sudo mv hadoop-3.2.4/* /usr/local/hadoop
  • Set the following environment variables in the ~/.bashrc file:

        #JAVA
       export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
       export JRE_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
       #Hadoop Environment Variables
       export HADOOP_HOME=/usr/local/hadoop
       export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
       export HADOOP_LOG_DIR=$HADOOP_HOME/logs
       export HADOOP_MAPRED_HOME=$HADOOP_HOME
       # Add Hadoop bin/ directory to PATH
       export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
       export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
       export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
       export PDSH_RCMD_TYPE=ssh
    
  • Reload the .bashrc file using

   source ~/.bashrc 
  • Change ownership and permissions:

    sudo chown -R hduser:hduser /usr/local/hadoop
    sudo chmod -R 755 /usr/local/hadoop
  • Edit hadoop-env.sh to set the Java home path and Hadoop environment variables (this file and all of the configuration files edited below live under /usr/local/hadoop/etc/hadoop/):

    nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh            
    
     Add the following to the file:
    
     #JAVA
     export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
     export JRE_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
     #Hadoop Environment Variables
     export HADOOP_HOME=/usr/local/hadoop
     export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
     export HADOOP_LOG_DIR=$HADOOP_HOME/logs
     export HDFS_NAMENODE_USER=hduser
     export HDFS_DATANODE_USER=hduser
     export HDFS_SECONDARYNAMENODE_USER=hduser
     export YARN_RESOURCEMANAGER_USER=hduser
     export YARN_NODEMANAGER_USER=hduser
     # Add Hadoop bin/ directory to PATH
     export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
     export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
     export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
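Before moving on, it is worth confirming that the shell now finds the Hadoop binaries and variables set above; a minimal check:

     echo $HADOOP_HOME      # should print /usr/local/hadoop
     hadoop version         # should report Hadoop 3.2.4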

Install SSH and Other Utilities

  • Install SSH and start the service:

    sudo apt install ssh -y
    sudo systemctl start ssh
    sudo systemctl enable ssh
  • Install pdsh for parallel shell access:

    sudo apt install pdsh -y
    

Configure Hostname and Hosts File

  • Set the hostname to master:

    sudo hostnamectl set-hostname master
    sudo nano /etc/hosts
    
    Update the /etc/hosts file to include the IP addresses of the master and all worker nodes, e.g.:

    192.168.56.103  master
    192.168.56.104  dn1
    192.168.56.105  dn2
    ... and so on for additional nodes.
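Every node must be able to resolve every other node by hostname; a quick sanity check from the master (using the example hostnames above):

    ping -c 1 dn1
    ping -c 1 dn2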

Edit Hadoop Configuration Files

  • Edit the following Hadoop configuration files:

    sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
    
    Add this to the file:

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
      </property>
    </configuration>
    
    
    
    
    sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
    
     Add this to the file:

     <configuration>
       <property>
         <name>dfs.name.dir</name>
         <value>/usr/local/hadoop/hd-data/nn</value>
       </property>
       <property>
         <name>dfs.replication</name>
         <value>2</value>
       </property>
     </configuration>
    
     
    sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
    
     Add this to the file:

     <configuration>
       <property>
         <name>mapreduce.framework.name</name>
         <value>yarn</value>
       </property>
       <property>
         <name>mapreduce.application.classpath</name>
         <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
       </property>
     </configuration>
    
    
    sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
    
     Add this to the file (the ResourceManager runs on the master node):

     <configuration>
       <property>
         <name>yarn.resourcemanager.hostname</name>
         <value>master</value>
       </property>
       <property>
         <name>yarn.nodemanager.aux-services</name>
         <value>mapreduce_shuffle</value>
       </property>
     </configuration>
    
  • Edit the workers file:

    sudo nano /usr/local/hadoop/etc/hadoop/workers

    Add the hostnames (or IP addresses) of all DataNodes to the file:

    dn1
    dn2

Set up Directories for Data

  • Create directories for Hadoop's NameNode data:
    mkdir -p /usr/local/hadoop/hd-data/nn
    

Generate and Configure SSH Keys

  • Generate an SSH key for passwordless access:
    ssh-keygen -t rsa
    ssh-copy-id -i .ssh/id_rsa.pub hduser@master
    ssh-copy-id -i .ssh/id_rsa.pub hduser@dn1
    ssh-copy-id -i .ssh/id_rsa.pub hduser@dn2
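Passwordless login can be verified before formatting HDFS; each command should print the remote hostname without prompting for a password:

    ssh master hostname
    ssh dn1 hostname
    ssh dn2 hostname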

Format and Start Hadoop

  • Format the NameNode:

    hdfs namenode -format
    
  • Start the HDFS services:

    start-dfs.sh
    jps
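On the master, jps should list roughly the daemons below (process IDs will differ); once the worker nodes from the following sections are configured and DFS is restarted, each worker should additionally show a DataNode. The NameNode web UI is normally served on port 9870 in Hadoop 3.x:

    # expected on the master after start-dfs.sh
    NameNode
    SecondaryNameNode
    Jps

    # NameNode web UI (default Hadoop 3.x port)
    http://master:9870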
    

5. Worker Node 1 (DataNode 1) Setup

Download and Configure Hadoop

  • Download Hadoop binaries and extract them:

    wget -c -O hadoop.tar.gz https://dlcdn.apache.org/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz
    tar xvzf hadoop.tar.gz
    sudo mkdir /usr/local/hadoop
    sudo mv hadoop-3.2.4/* /usr/local/hadoop
  • Change ownership and permissions:

    sudo chown -R hduser:hduser /usr/local/hadoop
    sudo chmod -R 755 /usr/local/hadoop
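Note that each worker also needs Java installed and the same environment variables as the master (in particular JAVA_HOME in hadoop-env.sh), otherwise the daemons launched from the master will fail to start; a condensed sketch of repeating those earlier steps on the worker:

    sudo apt install openjdk-8-jdk -y
    # repeat the ~/.bashrc and hadoop-env.sh edits from the master section, then:
    source ~/.bashrc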

Configure Hostname and Hosts File

  • Set the hostname to dn1:
    sudo hostnamectl set-hostname dn1
    sudo nano /etc/hosts
    
    
     Update /etc/hosts to include the IP addresses of the master and all worker nodes, e.g.:

     192.168.56.103  master
     192.168.56.104  dn1
     192.168.56.105  dn2

Edit Hadoop Configuration Files

  • Edit the necessary Hadoop configuration files:
    sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
    
     Add this to the file:

     <configuration>
       <property>
         <name>fs.defaultFS</name>
         <value>hdfs://master:9000</value>
       </property>
     </configuration>
    
    
    sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
    
     Add this to the file:

     <configuration>
       <property>
         <name>dfs.data.dir</name>
         <value>/usr/local/hadoop/hd-data/dn</value>
       </property>
       <property>
         <name>dfs.replication</name>
         <value>2</value>
       </property>
     </configuration>
    
    sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
    
     Add this to the file:

     <configuration>
       <property>
         <name>mapreduce.framework.name</name>
         <value>yarn</value>
       </property>
       <property>
         <name>mapreduce.application.classpath</name>
         <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
       </property>
     </configuration>
      
    sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
    
     Add this to the file:

     <configuration>
       <property>
         <name>yarn.resourcemanager.hostname</name>
         <value>master</value>
       </property>
       <property>
         <name>yarn.nodemanager.aux-services</name>
         <value>mapreduce_shuffle</value>
       </property>
       <property>
         <name>yarn.nodemanager.local-dirs</name>
         <value>/usr/local/hadoop/hd-data/yarn/data</value>
       </property>
       <property>
         <name>yarn.nodemanager.log-dirs</name>
         <value>/usr/local/hadoop/hd-data/yarn/logs</value>
       </property>
       <property>
         <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
         <value>99.9</value>
       </property>
       <property>
         <name>yarn.nodemanager.vmem-check-enabled</name>
         <value>false</value>
       </property>
       <property>
         <name>yarn.nodemanager.env-whitelist</name>
         <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
       </property>
     </configuration>
    

  • Edit the workers file:

    sudo nano /usr/local/hadoop/etc/hadoop/workers

    Add this to the file (if not already present):

    localhost

Set up Directories for Data

  • Create directories for DataNode storage and YARN data/logs (the paths must match hdfs-site.xml and yarn-site.xml above):
    mkdir -p /usr/local/hadoop/hd-data/dn
    mkdir -p /usr/local/hadoop/hd-data/yarn/logs
    mkdir -p /usr/local/hadoop/hd-data/yarn/data

6. Worker Node 2 (DataNode 2) Setup

Download and Configure Hadoop

  • Download Hadoop binaries and extract them:

    wget -c -O hadoop.tar.gz https://dlcdn.apache.org/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz
    tar xvzf hadoop.tar.gz
    sudo mkdir /usr/local/hadoop
    sudo mv hadoop-3.2.4/* /usr/local/hadoop
  • Change ownership and permissions:

    sudo chown -R hduser:hduser /usr/local/hadoop
    sudo chmod -R 755 /usr/local/hadoop
    

Configure Hostname and Hosts File

  • Set the hostname to dn2:
    sudo hostnamectl set-hostname dn2
    sudo nano /etc/hosts
    
     Update /etc/hosts to include the IP addresses of the master and all worker nodes, e.g.:

     192.168.56.103  master
     192.168.56.104  dn1
     192.168.56.105  dn2

Copy Hadoop Configuration Files from Worker 1

  • Copy the configuration files from dn1 (all four .xml configuration files are identical across DataNodes, so they can simply be copied from dn1):
    scp dn1:/usr/local/hadoop/etc/hadoop/*.xml /usr/local/hadoop/etc/hadoop/
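    This copy assumes hduser on dn2 can SSH to dn1; if keys were only distributed from the master, an equivalent approach is to push the files from dn1 instead (a hypothetical one-liner run on dn1):

    scp /usr/local/hadoop/etc/hadoop/*.xml hduser@dn2:/usr/local/hadoop/etc/hadoop/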

Set up Directories for Data

  • Create directories for DataNode storage and YARN data/logs (the paths must match hdfs-site.xml and yarn-site.xml):
    mkdir -p /usr/local/hadoop/hd-data/dn
    mkdir -p /usr/local/hadoop/hd-data/yarn/logs
    mkdir -p /usr/local/hadoop/hd-data/yarn/data

Start the Hadoop Daemons from the Master Node (NameNode)

  • Run the following commands on the master to start HDFS and YARN, then check the running daemons on each node:
    start-dfs.sh
    start-yarn.sh
    jps
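To confirm the cluster is healthy, the DataNodes and NodeManagers should show up in the cluster reports, and a small example job can be run from the master (a minimal sketch; the examples jar path assumes the Hadoop 3.2.4 layout under $HADOOP_HOME):

    hdfs dfsadmin -report      # should list 2 live DataNodes
    yarn node -list            # should list the worker NodeManagers
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.4.jar pi 2 5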

After following these steps, your multi-node Hadoop cluster should be successfully set up with one master node and multiple worker nodes.





👨‍💻 Crafted by: Suraj Kumar Choudhary | 📩 Feel free to DM for any help: [email protected]

