Practical: 1
Aim: Configure Hadoop cluster in pseudo distributed mode and run basic Hadoop
commands.
Installation of Hadoop 3.3.2 on Ubuntu 18.04 LTS
1. Installing Java
$ sudo apt update
$ sudo apt install openjdk-8-jdk openjdk-8-jre
$ java -version
Set JAVA_HOME in .bashrc
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:/usr/lib/jvm/java-8-openjdk-amd64/bin
Apply the .bashrc changes to the current Ubuntu session either by rebooting the system or by running:
$ source ~/.bashrc
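To confirm the variables took effect, a quick check in the shell (using the paths configured above) should echo the JDK location back:

```shell
# Re-export the variables exactly as written in .bashrc,
# then confirm the shell sees them.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:/usr/lib/jvm/java-8-openjdk-amd64/bin
echo "$JAVA_HOME"
```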
2. Adding a dedicated Hadoop user
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
3. Adding hduser to the sudoers file
$ sudo visudo
Add the following line (visudo opens the sudoers file safely as /etc/sudoers.tmp):
hduser ALL=(ALL:ALL) ALL
4. Now switch to hduser
$ su - hduser
5. Setting up SSH
Hadoop services such as the ResourceManager and NodeManager use SSH to exchange
node status between slave and master nodes (and master to master).
$ sudo apt-get install openssh-server openssh-client
After installing ssh, generate ssh keys and copy them in ~/.ssh/authorized_keys.
Generate Keys for secure communication:
$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
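Passwordless login to localhost is worth verifying before continuing; a quick check, assuming the ssh service is already running:

```shell
# sshd rejects keys if authorized_keys is group- or world-writable.
chmod 0600 "$HOME/.ssh/authorized_keys"
# This should print the hostname without prompting for a password.
ssh localhost hostname
```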
6. Download the Hadoop 3.3.2 tar file and extract it into the /usr/local/hadoop folder.
$ sudo tar xvzf hadoop-3.3.2.tar.gz
$ sudo mv hadoop-3.3.2 /usr/local/hadoop
7. Change ownership to hduser and the hadoop group, and give them full permissions.
$ sudo chown -R hduser:hadoop /usr/local/hadoop
$ sudo chmod -R 777 /usr/local/hadoop
8. Hadoop Setup
This setup, also called pseudo-distributed mode, runs each Hadoop daemon in a
separate Java process on a single machine. The Hadoop environment is configured
by editing a set of configuration files:
bashrc hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml
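Note that in each of the *-site.xml files, the `<property>` blocks shown in the steps below must be placed inside the file's single `<configuration>` element; as a sketch of the overall file shape (names and values here are placeholders):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- property blocks from the steps below go here -->
  <property>
    <name>some.property.name</name>
    <value>some-value</value>
  </property>
</configuration>
```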
8.1 bashrc
$ sudo gedit ~/.bashrc
Add following lines at the end:
#Hadoop Related Options
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
$ source ~/.bashrc
8.2 hadoop-env.sh
Let's change the working directory to the Hadoop configuration location:
$ cd /usr/local/hadoop/etc/hadoop/
$ sudo gedit hadoop-env.sh
Add this line:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
8.3 yarn-site.xml
$ sudo gedit yarn-site.xml
Add following lines:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
8.4 hdfs-site.xml
$ sudo gedit hdfs-site.xml
Add following lines:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/yarn_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop/yarn_data/hdfs/datanode</value>
</property>
8.5 core-site.xml
$ sudo gedit core-site.xml
Add following lines:
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/hadoop/tmp</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
(fs.defaultFS replaces the deprecated fs.default.name in Hadoop 2.x and later.)
8.6 mapred-site.xml
$ sudo gedit mapred-site.xml
Add following lines:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>localhost:10020</value>
</property>
(mapreduce.framework.name replaces the deprecated mapred.framework.name.)
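With Hadoop 3.x, MapReduce jobs submitted to YARN also need the MapReduce classes on the container classpath; the official single-node setup guide adds a property along these lines to mapred-site.xml (adjust the paths if your layout differs):

```xml
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
```

The same guide whitelists environment variables for containers via yarn.nodemanager.env-whitelist in yarn-site.xml; without these, example jobs may fail to find the MapReduce classes.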
9. Create the temp directory and the directories for the NameNode and DataNode
$ sudo mkdir -p /home/hduser/hadoop/tmp
$ sudo chown -R hduser:hadoop /home/hduser/hadoop/tmp
$ sudo chmod -R 777 /home/hduser/hadoop/tmp
$ sudo mkdir -p /usr/local/hadoop/yarn_data/hdfs/namenode
$ sudo mkdir -p /usr/local/hadoop/yarn_data/hdfs/datanode
$ sudo chmod -R 777 /usr/local/hadoop/yarn_data/hdfs/namenode
$ sudo chmod -R 777 /usr/local/hadoop/yarn_data/hdfs/datanode
$ sudo chown -R hduser:hadoop /usr/local/hadoop/yarn_data/hdfs/namenode
$ sudo chown -R hduser:hadoop /usr/local/hadoop/yarn_data/hdfs/datanode
10. Format the Hadoop NameNode for a fresh start
$ hdfs namenode -format
Start all Hadoop services by executing the commands one by one:
$ start-dfs.sh
$ start-yarn.sh
or
$ start-all.sh
Type this simple command to check if all the daemons are active and running as Java
processes:
$ jps
Following output is expected if all went well:
6632 NameNode
6766 DataNode
6960 SecondaryNameNode
7244 ResourceManager
7380 NodeManager
11066 Jps
Access Hadoop UI from Browser
The default port number 9870 gives you access to the Hadoop NameNode UI:
http://localhost:9870
The NameNode user interface provides a comprehensive overview of the entire cluster.
The default port 9864 is used to access individual DataNodes directly from your
browser:
http://localhost:9864
The YARN Resource Manager is accessible on port 8088: http://localhost:8088
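With the daemons up, the "basic Hadoop commands" from the aim can be exercised against HDFS. A short sample session (run as hduser on the running cluster; sample.txt and the directory names are illustrative):

```shell
# Create a home directory and a test directory in HDFS.
hdfs dfs -mkdir -p /user/hduser/input
# Copy a local file into HDFS.
echo "hello hadoop" > sample.txt
hdfs dfs -put sample.txt /user/hduser/input/
# List the directory and read the file back from HDFS.
hdfs dfs -ls /user/hduser/input
hdfs dfs -cat /user/hduser/input/sample.txt
# Remove the test directory when done.
hdfs dfs -rm -r /user/hduser/input
```

hdfs dfsadmin -report is also useful here: it summarizes live DataNodes and capacity, confirming the pseudo-distributed cluster is healthy.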