Unit 3 PART 2

This document provides a comprehensive guide for installing Hadoop, detailing system requirements, installation steps, and configuration settings for both Hadoop and YARN. It covers prerequisites like Java installation, environment variable setup, and configuration of essential XML files for Hadoop's operation. Additionally, it outlines how to start and monitor Hadoop services, emphasizing the importance of YARN in managing resources and scheduling jobs in a Hadoop cluster.

Uploaded by

Abdul Samad
Copyright © All Rights Reserved

1. Pre-requisites for Installing Hadoop

Before installing Hadoop, make sure the following software and system requirements are
met:

System Requirements:

• Operating System: Linux (Ubuntu or CentOS) is the most commonly used platform for
Hadoop installations. Hadoop can also be installed on Windows using Cygwin, but Linux is
preferred for production environments.
• Memory: At least 4 GB of RAM.
• Disk Space: At least 10 GB of free disk space.
• Java: Hadoop requires Java 8 or later. Ensure Java is installed on your system.
• SSH: Hadoop's start and stop scripts use SSH to reach the master and slave nodes
(even in a single-node setup, where they connect to localhost).
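
The requirements above can be checked before anything is installed. The following is a minimal sketch rather than part of the original guide: the function names are hypothetical, the 4 GB and 10 GB thresholds are the ones listed above, and it assumes a Linux system that provides /proc/meminfo and a POSIX df.

```shell
# Preflight sketch for the prerequisites listed above (hypothetical helper names).

MIN_MEM_KB=$((4 * 1024 * 1024))    # 4 GB of RAM, in kB
MIN_DISK_KB=$((10 * 1024 * 1024))  # 10 GB of free disk, in kB

# Total RAM in kB as reported by the kernel (Linux only).
total_mem_kb() { awk '/^MemTotal:/ {print $2}' /proc/meminfo; }

# Free space in kB on the filesystem that holds the given path.
free_disk_kb() { df -Pk "$1" | awk 'NR==2 {print $4}'; }

# check_min <label> <actual> <minimum>: prints "ok: label" or "low: label".
check_min() {
  if [ "$2" -ge "$3" ]; then echo "ok: $1"; else echo "low: $1"; fi
}

# check_tool <name>: reports whether a command is on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then echo "ok: $1"; else echo "missing: $1"; fi
}

# Runs every check; the optional argument is the planned install location.
preflight() {
  check_min memory "$(total_mem_kb)" "$MIN_MEM_KB"
  check_min disk "$(free_disk_kb "${1:-/}")" "$MIN_DISK_KB"
  check_tool java
  check_tool ssh
}
```

Calling preflight /usr/local prints one line per check. A machine with 4 GB installed usually reports slightly less in MemTotal because the kernel reserves some memory, so a "low: memory" result close to the threshold is not necessarily a problem.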

2. Step-by-Step Guide to Install Hadoop

Step 1: Install Java

Hadoop requires Java to be installed on the system. You can check whether Java is installed
by typing:

```bash
java -version
```

If Java is not installed, you can install it as follows:

• For Ubuntu:

```bash
sudo apt update
sudo apt install openjdk-8-jdk
```

After installation, set the JAVA_HOME environment variable. For Ubuntu, you can do this by
editing the .bashrc file:

```bash
nano ~/.bashrc
```

Add the following lines:

```bash
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
```

Then, source the .bashrc file to apply the changes:

```bash
source ~/.bashrc
```
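
The JAVA_HOME path used above is the usual location of the openjdk-8-jdk package on 64-bit x86 Ubuntu, and it can differ on other architectures or distributions. As a hedged sketch, the path can instead be derived from the java binary itself. The helper names below are hypothetical, and readlink -f is assumed to be available (it is on typical Linux systems).

```shell
# Sketch: derive JAVA_HOME from the java binary (hypothetical helper names).

# Strips a trailing /bin/java, then a trailing /jre (present in JDK 8 layouts).
java_home_from_binary() {
  home="${1%/bin/java}"
  home="${home%/jre}"
  printf '%s\n' "$home"
}

# Follows symlinks from the java found on PATH to the real installation.
detect_java_home() {
  java_bin="$(command -v java)" || return 1
  java_home_from_binary "$(readlink -f "$java_bin")"
}

# Example use in ~/.bashrc:
#   export JAVA_HOME="$(detect_java_home)"
```

Resolving the symlinks first matters because /usr/bin/java is normally a link into the alternatives system, not the JDK directory itself.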

Step 2: Download Hadoop

Go to the official Apache Hadoop website (https://hadoop.apache.org/) and download the
latest stable version of Hadoop. Alternatively, you can download Hadoop using wget:

```bash
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
```

Once downloaded, extract the tarball:

```bash
tar -xvzf hadoop-3.3.1.tar.gz
```

Move the extracted files to a directory of your choice:

```bash
sudo mv hadoop-3.3.1 /usr/local/hadoop
```

Step 3: Set Up Environment Variables

Add the Hadoop environment variables in the .bashrc file:

```bash
nano ~/.bashrc
```

Add the following lines at the end of the file:

```bash
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```

Then, apply the changes:

```bash
source ~/.bashrc
```
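
Because this guide edits .bashrc more than once, repeating a step can leave duplicate export lines behind. The sketch below shows one way to add a line only when it is missing. The append_once helper is hypothetical and not part of Hadoop.

```shell
# Sketch: add a line to a file only if that exact line is not already present.
# append_once is a hypothetical helper, not a Hadoop command.
append_once() {  # append_once <line> <file>
  touch "$2"
  grep -qxF -- "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# Example use (single quotes keep $HADOOP_HOME unexpanded in the file):
#   append_once 'export HADOOP_HOME=/usr/local/hadoop' ~/.bashrc
#   append_once 'export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop' ~/.bashrc
```

The -x and -F options make grep match the whole line as a literal string, so characters such as $ and / in the export line are not treated as a pattern.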

Step 4: Configure Hadoop

Before starting Hadoop, several configuration files need to be modified. These files are
located in the $HADOOP_HOME/etc/hadoop directory.
1. hadoop-env.sh

Edit the hadoop-env.sh file to specify the JAVA_HOME path:

```bash
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
```

Find the line with # export JAVA_HOME and update it as follows:

```bash
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```

2. core-site.xml

The core-site.xml file contains the configuration for Hadoop's core settings, including the
file system URI. Edit the file as follows:

```bash
nano $HADOOP_HOME/etc/hadoop/core-site.xml
```

Add the following configuration inside the <configuration> tags:

```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
```

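
For reference, the property above sits inside a complete file. The sketch below writes a whole core-site.xml for a single-node setup. The write_core_site helper is hypothetical, it overwrites any existing file in the target directory, and the default URI is the one used in this guide.

```shell
# Sketch: write a complete core-site.xml for a single-node setup.
# write_core_site is a hypothetical helper; it overwrites any existing file.
write_core_site() {  # write_core_site <conf_dir> [<fs_uri>]
  cat > "$1/core-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>${2:-hdfs://localhost:9000}</value>
  </property>
</configuration>
EOF
}

# Example use, assuming HADOOP_CONF_DIR is set as in Step 3:
#   write_core_site "$HADOOP_CONF_DIR"
```
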
3. hdfs-site.xml

The hdfs-site.xml file contains the configuration for Hadoop's HDFS. Edit the file to
configure the directories for the NameNode and DataNode:

```bash
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
```

Add the following configuration:

```xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/usr/local/hadoop/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/usr/local/hadoop/hdfs/datanode</value>
</property>
```

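
The NameNode and DataNode directories configured above can be created up front, which makes permission problems visible before the daemons start. This is a minimal sketch: make_hdfs_dirs is a hypothetical helper, and the layout under the base directory mirrors the paths in the configuration above.

```shell
# Sketch: create the NameNode and DataNode directories used in hdfs-site.xml.
# make_hdfs_dirs is a hypothetical helper; <base> is the Hadoop install directory.
make_hdfs_dirs() {  # make_hdfs_dirs <base>
  mkdir -p "$1/hdfs/namenode" "$1/hdfs/datanode"
}

# Example use for the paths in this guide:
#   make_hdfs_dirs /usr/local/hadoop
```

The user that runs the Hadoop daemons needs write access to both directories. If /usr/local/hadoop is owned by root, the command has to be run with sufficient privileges and the ownership adjusted afterwards.
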
4. yarn-site.xml

The yarn-site.xml file configures Hadoop YARN (Yet Another Resource Negotiator). Edit
it to configure the ResourceManager and NodeManager settings:

```bash
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
```

Add the following configuration:

```xml
<property>
  <name>yarn.resourcemanager.address</name>
  <value>localhost:8032</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
```

3. Format the Hadoop Filesystem (HDFS)

Before you start HDFS, you must format the filesystem. Run the following command:

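```bash
hdfs namenode -format
```

This initializes the NameNode's metadata directory. Run it only once, on a new installation:
formatting again erases the existing HDFS metadata.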
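
Since formatting is meant to happen only once, a small guard can prevent an accidental second run. The sketch below relies on the fact that a formatted NameNode directory contains a current/VERSION file. The helper names are hypothetical, and the default path is the NameNode directory configured in hdfs-site.xml above.

```shell
# Sketch: format the NameNode only when its directory has not been formatted yet.
# Helper names are hypothetical; the default path is the one from hdfs-site.xml.

# A formatted NameNode directory contains current/VERSION.
namenode_is_formatted() { [ -f "$1/current/VERSION" ]; }

format_once() {  # format_once [<namenode_dir>]
  dir="${1:-/usr/local/hadoop/hdfs/namenode}"
  if namenode_is_formatted "$dir"; then
    echo "already formatted: $dir"
  else
    hdfs namenode -format
  fi
}
```
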

4. Start Hadoop Daemons

After configuring Hadoop, you can start the necessary daemons to launch Hadoop:

1. Start HDFS:

```bash
start-dfs.sh
```

This will start the NameNode and DataNode services.

2. Start YARN:

```bash
start-yarn.sh
```

This will start the ResourceManager and NodeManager services.

3. Check the Status:

To check if the Hadoop daemons are running, use the following command:

```bash
jps
```

This will list the running Java processes. Look for the following processes to ensure that
Hadoop is running:

• NameNode
• DataNode
• ResourceManager
• NodeManager
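
The check against the list above can be scripted. The following sketch reads jps output on standard input and prints the name of each expected daemon that is missing. The missing_daemons helper is hypothetical, and it assumes the usual jps output format of a process ID followed by the class name.

```shell
# Sketch: report which of the four expected daemons are absent from jps output.
# missing_daemons is a hypothetical helper; it reads "PID Name" lines on stdin.
missing_daemons() {
  jps_output="$(cat)"
  for daemon in NameNode DataNode ResourceManager NodeManager; do
    # Match the whole name so that SecondaryNameNode does not count as NameNode.
    if ! printf '%s\n' "$jps_output" | awk '{print $2}' | grep -qx "$daemon"; then
      echo "$daemon"
    fi
  done
}

# Example use on a running system:
#   jps | missing_daemons
```

No output means all four daemons were found.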

5. Access the Hadoop Web Interfaces

Hadoop provides web interfaces for monitoring and managing HDFS and YARN:

• HDFS NameNode UI: http://localhost:9870
• YARN ResourceManager UI: http://localhost:8088

You can open these URLs in your web browser to check the status and health of your Hadoop
services.

6. Stop Hadoop Services

Once you are done with your work, you can stop the Hadoop daemons with the following
commands:

1. Stop HDFS:

```bash
stop-dfs.sh
```

2. Stop YARN:

```bash
stop-yarn.sh
```

YARN Configuration in Hadoop

YARN (Yet Another Resource Negotiator) is a key component of the Hadoop ecosystem that
manages resources and schedules jobs across the cluster. It separates resource management
from per-application job scheduling and monitoring, which in Hadoop 1 were both handled by
the MapReduce JobTracker. YARN allows multiple applications to share resources in the
Hadoop cluster efficiently.

This lecture will cover the key configuration steps involved in setting up YARN on a Hadoop
cluster.

1. Understanding YARN Architecture

YARN consists of the following components:

• ResourceManager (RM): Manages resources in the cluster and schedules applications.
• NodeManager (NM): Manages the resources on individual nodes and reports to the
ResourceManager.
• ApplicationMaster (AM): Manages the lifecycle of an application, including job
scheduling, monitoring, and resource negotiation.
• Container: A resource allocation for running an application on a node.

2. Key Configuration Files for YARN

The configuration of YARN is primarily done through XML configuration files located in the
etc/hadoop directory. The main configuration files for YARN include:

• yarn-site.xml: The primary configuration file for YARN.
• mapred-site.xml: Contains the configuration for MapReduce-related tasks in YARN.

3. Configuration for YARN in yarn-site.xml

The yarn-site.xml file contains important configuration parameters for YARN. Below are
the key configurations to set up YARN:

Edit yarn-site.xml

1. ResourceManager Address: Configure the ResourceManager's address. This is the address
clients use to submit applications. NodeManagers reach the ResourceManager through a
separate setting, yarn.resourcemanager.resource-tracker.address (default port 8031).

```xml
<property>
  <name>yarn.resourcemanager.address</name>
  <value>localhost:8032</value>
</property>
```

2. ResourceManager Web UI: This configuration specifies the web UI of the
ResourceManager, which allows you to monitor YARN resource usage.

```xml
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>localhost:8088</value>
</property>
```

3. NodeManager Local Directory: Defines the directory where the NodeManager
stores temporary data.

```xml
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/tmp/nm-local-dir</value>
</property>
```

4. NodeManager Log Directory: Defines the directory where the NodeManager stores
log files.

```xml
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/tmp/logs</value>
</property>
```

5. NodeManager Resource Memory: Configures the amount of memory, in MB, that the
NodeManager can allocate for containers on each node.

```xml
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>
```

6. NodeManager Virtual Cores: This parameter controls the number of virtual cores
(CPU) available for containers on the NodeManager.

```xml
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
</property>
```

7. YARN ResourceManager Scheduler: You can configure the scheduler class used by the
ResourceManager (the default is CapacityScheduler).

```xml
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
```

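8. ResourceManager Admin Address: Sets the address of the ResourceManager's admin
interface, which administrative commands such as yarn rmadmin connect to (the default port
is 8033). The list of users allowed to administer the cluster is a separate setting,
yarn.admin.acl.

```xml
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>localhost:8050</value>
</property>
```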
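
The memory and vcore settings above (8192 MB and 4 vcores) together bound how many containers one node can run. The sketch below works through that arithmetic. The max_containers helper and the container sizes (1024 MB or 4096 MB, 1 vcore each) are assumptions chosen for illustration, and the result is an upper bound that assumes the scheduler enforces both memory and vcores.

```shell
# Sketch: upper bound on containers per node when memory and vcores both limit placement.
# max_containers is a hypothetical helper; container sizes are illustrative assumptions.
max_containers() {  # max_containers <node_mb> <node_vcores> <container_mb> <container_vcores>
  by_mem=$(( $1 / $3 ))
  by_cpu=$(( $2 / $4 ))
  if [ "$by_mem" -lt "$by_cpu" ]; then echo "$by_mem"; else echo "$by_cpu"; fi
}

max_containers 8192 4 1024 1   # prints 4: eight fit by memory, but only four by vcores
max_containers 8192 4 4096 1   # prints 2: memory is the limiting resource here
```

A scheduler can be configured to consider memory only, in which case the vcore limit does not apply, so the figure is an estimate and not a guarantee.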

4. Configuration for MapReduce in mapred-site.xml

The mapred-site.xml file is used for configuring MapReduce, but since YARN runs
MapReduce tasks, certain configurations here are also necessary.

Edit mapred-site.xml

1. MapReduce Framework: In YARN, MapReduce tasks run in containers, so you
need to set the framework to YARN.

```xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```

2. JobHistory Server: Configure the JobHistory server to keep track of job history in
the YARN environment.

```xml
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>localhost:10020</value>
</property>
```

3. JobHistory Web UI: Configure the JobHistory web UI so that you can monitor jobs.

```xml
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>localhost:19888</value>
</property>
```

4. ResourceManager for MapReduce: mapred-site.xml needs no separate ResourceManager
setting. With mapreduce.framework.name set to yarn, MapReduce clients find the
ResourceManager through yarn.resourcemanager.address in yarn-site.xml.

5. Starting YARN Daemons

Once the configuration files are properly set, start the necessary YARN daemons:

1. Start ResourceManager:

```bash
yarn --daemon start resourcemanager
```

2. Start NodeManager:

```bash
yarn --daemon start nodemanager
```

3. Alternatively, start both YARN daemons with a single script instead of steps 1 and 2
(the HDFS services must be running):

```bash
start-yarn.sh
```

4. Check YARN Status:

Use the jps command to check if YARN processes are running, such as ResourceManager,
NodeManager, etc.

```bash
jps
```

6. Monitoring YARN

YARN provides web interfaces for monitoring resources and running applications:

• ResourceManager Web UI: http://localhost:8088/
• NodeManager Web UI: http://localhost:8042/

These interfaces allow you to view job statuses, resource allocation, and overall cluster
health.

7. YARN Logs

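To access logs for a specific YARN application, pass its application ID to the following
command. The ID shown here is a placeholder; real IDs are listed in the ResourceManager
web UI.

```bash
yarn logs -applicationId application_123456789_0001
```

This command prints the container logs for the given application. For finished
applications it relies on log aggregation being enabled (yarn.log-aggregation-enable set to
true in yarn-site.xml).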
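
A mistyped application ID is a common reason for the command above to fail. The sketch below validates the ID before calling yarn logs. It assumes IDs of the form application_<cluster timestamp>_<sequence number>, with the sequence number padded to at least four digits, and both helper names are hypothetical.

```shell
# Sketch: check the shape of an application ID before asking YARN for its logs.
# is_app_id and fetch_logs are hypothetical helpers.
is_app_id() {
  printf '%s\n' "$1" | grep -Eq '^application_[0-9]+_[0-9]{4,}$'
}

fetch_logs() {  # fetch_logs <application_id>
  if is_app_id "$1"; then
    yarn logs -applicationId "$1"
  else
    echo "not an application ID: $1" >&2
    return 1
  fi
}

# Example use:
#   fetch_logs application_123456789_0001
```

Passing the ID as a quoted argument also avoids a shell pitfall: unquoted angle brackets around the ID would be interpreted as input and output redirection.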

YARN is a powerful resource management system that allows multiple applications to run
concurrently and efficiently share resources across the Hadoop cluster. Proper configuration
of YARN, including ResourceManager, NodeManager, and MapReduce settings, is crucial to
optimize resource usage and improve the overall performance of the Hadoop cluster. By
understanding and configuring YARN correctly, organizations can handle large-scale data
processing with ease, supporting diverse applications and workloads.

Sample Map Reduce program Application


Sample input

output
