1. Prerequisites for Installing Hadoop
Before installing Hadoop, make sure the following software and system requirements are
met:
System Requirements:
Operating System: Linux (Ubuntu or CentOS) is the most commonly used for
Hadoop installations. It can also be installed on Windows using Cygwin, but Linux is
preferred for production environments.
Memory: At least 4GB of RAM.
Disk Space: At least 10GB of free disk space.
Java: Hadoop 3.3.x runs on Java 8 or Java 11 (this guide installs OpenJDK 8). Ensure Java is installed on your system.
SSH: Hadoop requires SSH for communication between the master and slave nodes
(even in a single-node setup).
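The requirements above can be checked with a short pre-flight script. This is only a sketch: the helper names (have_cmd, at_least, preflight) are made up for illustration, the thresholds are the 4GB and 10GB figures from the list, and the memory check assumes a Linux system that exposes /proc/meminfo.

```shell
# Pre-flight sketch. Thresholds mirror the list above: 4 GB RAM, 10 GB free disk.

# have_cmd NAME: succeeds if NAME is on the PATH.
have_cmd() { command -v "$1" >/dev/null 2>&1; }

# at_least ACTUAL REQUIRED: prints "ok" or "low" (integers, same unit).
at_least() {
  if [ "$1" -ge "$2" ]; then echo ok; else echo low; fi
}

preflight() {
  # MemTotal is reported in kB and is usually a little below the installed RAM.
  mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null || echo 0)
  # Column 4 of `df -Pk` is the available space in 1 kB blocks.
  disk_kb=$(df -Pk "${HOME:-/}" 2>/dev/null | awk 'NR==2 {print $4}')
  echo "memory: $(at_least "${mem_kb:-0}" $((4 * 1024 * 1024)))"
  echo "disk:   $(at_least "${disk_kb:-0}" $((10 * 1024 * 1024)))"
  for tool in java ssh; do
    if have_cmd "$tool"; then echo "$tool: found"; else echo "$tool: missing"; fi
  done
}

preflight
```

A "low" memory result on a machine with exactly 4GB installed is expected, because the kernel reserves part of the RAM; treat the output as a hint rather than a hard gate.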
2. Step-by-Step Guide to Install Hadoop
Step 1: Install Java
Hadoop requires Java to be installed on the system. You can check whether Java is installed
by typing:
java -version
If Java is not installed, you can install it as follows:
For Ubuntu:
sudo apt update
sudo apt install openjdk-8-jdk
After installation, set the JAVA_HOME environment variable. For Ubuntu, you can do this by editing
the .bashrc file:
nano ~/.bashrc
Add the following lines:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
Then, source the .bashrc file to apply the changes:
source ~/.bashrc
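The JAVA_HOME path used above is specific to OpenJDK 8 on 64-bit Ubuntu. On other systems the directory name may differ, so one option is to derive it from the java binary itself. The sketch below assumes GNU readlink is available; java_home_from_bin is a hypothetical helper, not something shipped with Java or Hadoop.

```shell
# java_home_from_bin PATH: derive a JAVA_HOME candidate from the resolved
# path of the java binary (hypothetical helper, not part of Hadoop or the JDK).
java_home_from_bin() {
  p="${1%/bin/java}"   # strip the trailing /bin/java
  echo "${p%/jre}"     # Java 8 packages often nest the binary under jre/
}

# Resolve the symlink chain behind `java` and print the candidate.
if command -v java >/dev/null 2>&1; then
  java_home_from_bin "$(readlink -f "$(command -v java)")"
fi
```

Compare the printed directory with the value exported in .bashrc before relying on it.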
Step 2: Download Hadoop
Go to the official Apache Hadoop website (https://hadoop.apache.org/) and download the
latest stable version of Hadoop. Alternatively, you can download Hadoop using wget:
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
Once downloaded, extract the tarball:
tar -xvzf hadoop-3.3.1.tar.gz
Move the extracted files to a directory of your choice:
sudo mv hadoop-3.3.1 /usr/local/hadoop
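Before extracting a downloaded tarball it is worth comparing its checksum with the one published for the release. The sketch below assumes the coreutils sha512sum tool is installed and that you have copied the published SHA-512 digest by hand; verify_sha512 is a hypothetical helper name.

```shell
# verify_sha512 FILE EXPECTED_HEX: succeeds only if the SHA-512 digest of FILE
# equals EXPECTED_HEX (hypothetical helper; relies on coreutils sha512sum).
verify_sha512() {
  actual=$(sha512sum "$1" | awk '{print $1}')
  # Published digests are sometimes upper case; compare in lower case.
  expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
  [ "$actual" = "$expected" ]
}

# Usage (not executed here): paste the digest published for the release, then
#   verify_sha512 hadoop-3.3.1.tar.gz "<published digest>" && echo "checksum ok"
```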
Step 3: Set Up Environment Variables
Add the Hadoop environment variables in the .bashrc file:
nano ~/.bashrc
Add the following lines at the end of the file:
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Then, apply the changes:
source ~/.bashrc
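If you repeat this step, the same export lines can pile up in .bashrc. A minimal sketch of an idempotent alternative to editing the file by hand is shown below; append_once is a hypothetical helper, and the usage lines are left as comments so nothing is written until you run them yourself.

```shell
# append_once LINE FILE: append LINE to FILE unless an identical line exists.
append_once() {
  grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}

# Usage (not executed here):
#   append_once 'export HADOOP_HOME=/usr/local/hadoop' ~/.bashrc
#   append_once 'export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop' ~/.bashrc
#   append_once 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' ~/.bashrc
# Afterwards, `hadoop version` should print the installed release.
```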
Step 4: Configure Hadoop
Before starting Hadoop, several configuration files need to be modified. These files are
located in the $HADOOP_HOME/etc/hadoop directory.
1. hadoop-env.sh
Edit the hadoop-env.sh file to specify the JAVA_HOME path:
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Find the line with # export JAVA_HOME and update it as follows:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
2. core-site.xml
The core-site.xml file contains the configuration for Hadoop's core settings, including the
file system URI. Edit the file as follows:
nano $HADOOP_HOME/etc/hadoop/core-site.xml
Add the following configuration inside the <configuration> tags:
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
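To make the placement of the property inside the <configuration> tags concrete, the sketch below writes a complete, minimal core-site.xml. The helper name write_core_site is hypothetical, and the function overwrites any existing file in the target directory, so try it on a scratch directory first.

```shell
# write_core_site DIR: write a minimal core-site.xml into DIR.
# The property is the one shown above; the helper name is hypothetical.
write_core_site() {
  cat > "$1/core-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
}

# Usage (not executed here; overwrites the existing file):
#   write_core_site "$HADOOP_HOME/etc/hadoop"
```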
3. hdfs-site.xml
The hdfs-site.xml file contains the configuration for Hadoop's HDFS. Edit the file to
configure the directories for the NameNode and DataNode:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add the following configuration:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/hdfs/datanode</value>
</property>
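The NameNode and DataNode directories named in this configuration must be writable by the user who starts the daemons. A small sketch for creating and checking them up front follows; make_hdfs_dirs is a hypothetical helper, and the base path in the usage comment is the one used in this guide.

```shell
# make_hdfs_dirs BASE: create BASE/namenode and BASE/datanode and confirm
# that the current user can write to them.
make_hdfs_dirs() {
  for sub in namenode datanode; do
    mkdir -p "$1/$sub" || return 1
    [ -w "$1/$sub" ] || { echo "not writable: $1/$sub" >&2; return 1; }
  done
}

# Usage (not executed here): make_hdfs_dirs /usr/local/hadoop/hdfs
```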
4. yarn-site.xml
The yarn-site.xml file configures Hadoop YARN (Yet Another Resource Negotiator). Edit
it to configure the ResourceManager and NodeManager settings:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Add the following configuration:
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8032</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
3. Format the Hadoop Filesystem (HDFS)
Before you start HDFS, you must format the filesystem. Run the following command:
hdfs namenode -format
This will initialize the HDFS file system.
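Formatting is meant to be done once: running it again on a NameNode directory that is already in use discards the existing file system metadata. In a setup script, a guard like the sketch below can help. It relies on the fact that a formatted NameNode directory contains a current/VERSION file; is_formatted and safe_format are hypothetical helper names.

```shell
# is_formatted DIR: succeeds if DIR already holds NameNode metadata.
# A formatted NameNode directory contains current/VERSION.
is_formatted() { [ -f "$1/current/VERSION" ]; }

# safe_format DIR: run the format command only when DIR is not formatted yet.
safe_format() {
  if is_formatted "$1"; then
    echo "skip: $1 is already formatted"
  else
    hdfs namenode -format
  fi
}

# Usage (not executed here): safe_format /usr/local/hadoop/hdfs/namenode
```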
4. Start Hadoop Daemons
After configuring Hadoop, you can start the necessary daemons to launch Hadoop:
1. Start HDFS:
start-dfs.sh
This will start the NameNode and DataNode services (and, by default, a SecondaryNameNode).
2. Start YARN:
start-yarn.sh
This will start the ResourceManager and NodeManager services.
3. Check the Status:
To check whether the Hadoop daemons are running, use the following command:
jps
This will list the running Java processes. Look for the following processes to ensure that
Hadoop is running:
NameNode
DataNode
ResourceManager
NodeManager
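Scanning the jps output by eye is easy to get wrong, for example because SecondaryNameNode also contains the text NameNode. The sketch below compares the process names exactly and prints any of the four daemons listed above that are missing; missing_daemons is a hypothetical helper that reads jps-style output ("PID Name" per line) from standard input.

```shell
# missing_daemons: read `jps` output on stdin and print each expected
# daemon that does not appear as an exact process name.
missing_daemons() {
  running=$(awk '{print $2}')
  for d in NameNode DataNode ResourceManager NodeManager; do
    printf '%s\n' "$running" | grep -qx "$d" || echo "$d"
  done
}

# Only query the JVMs when jps is actually installed.
if command -v jps >/dev/null 2>&1; then
  jps | missing_daemons
fi
```

No output means all four daemons were found.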
5. Access the Hadoop Web Interfaces
Hadoop provides web interfaces for monitoring and managing HDFS and YARN:
HDFS NameNode UI: http://localhost:9870
YARN ResourceManager UI: http://localhost:8088
You can open these URLs in your web browser to check the status and health of your Hadoop
services.
6. Stop Hadoop Services
Once you are done with your work, you can stop the Hadoop daemons with the following
commands:
1. Stop HDFS:
stop-dfs.sh
2. Stop YARN:
stop-yarn.sh
YARN Configuration in Hadoop
YARN (Yet Another Resource Negotiator) is a key component of the Hadoop ecosystem that
manages resources and schedules jobs across the cluster. It separates the resource
management and job scheduling functionalities in Hadoop, which were previously handled by
MapReduce. YARN allows multiple applications to share resources in the Hadoop cluster
efficiently.
This lecture will cover the key configuration steps involved in setting up YARN on a Hadoop
cluster.
1. Understanding YARN Architecture
YARN consists of the following components:
ResourceManager (RM): Manages resources in the cluster and schedules
applications.
NodeManager (NM): Manages the resources on individual nodes and reports to the
ResourceManager.
ApplicationMaster (AM): Manages the lifecycle of an application, including job
scheduling, monitoring, and resource negotiation.
Container: A resource allocation for running an application on a node.
2. Key Configuration Files for YARN
The configuration of YARN is primarily done through XML configuration files located in the
etc/hadoop directory. The main configuration files for YARN include:
yarn-site.xml: The primary configuration file for YARN.
mapred-site.xml: Contains the configuration for MapReduce-related tasks in YARN.
3. Configuration for YARN in yarn-site.xml
The yarn-site.xml file contains important configuration parameters for YARN. Below are
the key configurations to set up YARN:
Edit yarn-site.xml
1. ResourceManager Address: Configure the ResourceManager’s address. This is the
central point that clients and NodeManagers will connect to.
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8032</value>
</property>
2. ResourceManager Web UI: This configuration specifies the web UI of the
ResourceManager, which allows you to monitor YARN resource usage.
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>localhost:8088</value>
</property>
3. NodeManager Local Directory: Defines the directory where the NodeManager
stores temporary data.
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/tmp/nm-local-dir</value>
</property>
4. NodeManager Log Directory: Defines the directory where the NodeManager stores
log files.
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/tmp/logs</value>
</property>
5. NodeManager Resource Memory: Configures the amount of memory the
NodeManager can allocate for containers on each node.
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
</property>
6. NodeManager Virtual Cores: This parameter controls the number of virtual cores
(CPU) available for containers on the NodeManager.
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
</property>
7. YARN ResourceManager Scheduler: You can configure the scheduler class used by the
ResourceManager (the default is CapacityScheduler).
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
8. ResourceManager Admin Address: Sets the address of the ResourceManager's admin
interface, which administrative commands such as yarn rmadmin connect to.
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>localhost:8050</value>
</property>
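The memory and virtual-core settings in items 5 and 6 together bound how many containers one NodeManager can host. The arithmetic is sketched below for the values used above (8192 MB and 4 vcores). The container size of 1024 MB and 1 vcore is an assumption for illustration, and whether vcores are actually enforced depends on the scheduler's resource calculator, so treat the result as a rough upper bound.

```shell
# max_containers NODE_MB NODE_VCORES CONTAINER_MB CONTAINER_VCORES
# Prints how many equally sized containers fit on one node when both
# memory and vcores are counted (integer division, smaller limit wins).
max_containers() {
  by_mem=$(( $1 / $3 ))
  by_cpu=$(( $2 / $4 ))
  if [ "$by_mem" -lt "$by_cpu" ]; then echo "$by_mem"; else echo "$by_cpu"; fi
}

max_containers 8192 4 1024 1   # memory allows 8, vcores allow 4: prints 4
max_containers 8192 4 4096 1   # memory allows 2, vcores allow 4: prints 2
```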
4. Configuration for MapReduce in mapred-site.xml
The mapred-site.xml file is used for configuring MapReduce, but since YARN runs
MapReduce tasks, certain configurations here are also necessary.
Edit mapred-site.xml
1. MapReduce Framework: In YARN, MapReduce tasks run in containers, so you
need to set the framework to YARN.
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
2. JobHistory Server: Configure the JobHistory server to keep track of job history in
the YARN environment.
<property>
<name>mapreduce.jobhistory.address</name>
<value>localhost:10020</value>
</property>
3. JobHistory Web UI: Configure the JobHistory web UI so that you can monitor jobs.
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>localhost:19888</value>
</property>
4. ResourceManager for MapReduce: No separate property is needed in mapred-site.xml.
MapReduce clients locate the ResourceManager through the yarn.resourcemanager.address
setting in yarn-site.xml.
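When several properties are pasted into one file, it is easy to define the same property twice without noticing. A quick check can be sketched with standard text tools. It assumes each <name>...</name> element sits on its own line, as in the snippets in this guide; list_props and dup_props are hypothetical helpers, not Hadoop commands.

```shell
# list_props FILE: print every <name> value found in a Hadoop *-site.xml.
# Assumes each <name>...</name> sits on its own line.
list_props() {
  sed -n 's:.*<name>\(.*\)</name>.*:\1:p' "$1"
}

# dup_props FILE: print property names that occur more than once.
dup_props() {
  list_props "$1" | sort | uniq -d
}

# Usage (not executed here): dup_props "$HADOOP_HOME/etc/hadoop/mapred-site.xml"
```

No output means every property name in the file is unique.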
5. Starting YARN Daemons
Once the configuration files are properly set, start the necessary YARN daemons:
1. Start ResourceManager:
yarn --daemon start resourcemanager
2. Start NodeManager:
yarn --daemon start nodemanager
3. Alternatively, start both daemons with a single script (HDFS services must be running):
start-yarn.sh
4. Check YARN Status:
Use the jps command to check if YARN processes are running, such as ResourceManager,
NodeManager, etc.
jps
6. Monitoring YARN
YARN provides web interfaces for monitoring resources and running applications:
ResourceManager Web UI: http://localhost:8088/
NodeManager Web UI: http://localhost:8042/
These interfaces allow you to view job statuses, resource allocation, and overall cluster
health.
7. YARN Logs
To access logs for a specific YARN application, use the following command:
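yarn logs -applicationId application_123456789_0001
Replace application_123456789_0001 with the ID of your own application, without angle brackets.
The command prints the container logs collected for that application; for finished applications,
log aggregation needs to be enabled.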
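Application IDs follow the pattern application_<timestamp>_<sequence>, so a wrapper script can reject anything else before calling YARN. The sketch below shows one way to do that; is_app_id and fetch_logs are hypothetical helper names, and only the yarn logs command itself comes from Hadoop.

```shell
# is_app_id ID: succeeds if ID looks like application_<timestamp>_<sequence>.
is_app_id() {
  printf '%s\n' "$1" | grep -Eq '^application_[0-9]+_[0-9]+$'
}

# fetch_logs ID: validate the ID, then ask YARN for the logs.
fetch_logs() {
  if ! is_app_id "$1"; then
    echo "not an application ID: $1" >&2
    return 1
  fi
  yarn logs -applicationId "$1"
}

# Usage (not executed here): fetch_logs application_123456789_0001
```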
YARN is a powerful resource management system that allows multiple applications to run
concurrently and efficiently share resources across the Hadoop cluster. Proper configuration
of YARN, including ResourceManager, NodeManager, and MapReduce settings, is crucial to
optimize resource usage and improve the overall performance of the Hadoop cluster. By
understanding and configuring YARN correctly, organizations can handle large-scale data
processing with ease, supporting diverse applications and workloads.
Sample Map Reduce program Application
Sample input
output