
Hadoop YARN

• YARN stands for "Yet Another Resource Negotiator". It was
introduced in Hadoop 2.0 to remove the bottleneck of the single
JobTracker present in Hadoop 1.0. YARN was described as a
"redesigned resource manager" at the time of its launch, but it
has since evolved into a large-scale distributed operating
system used for Big Data processing.
The main components of YARN architecture include:

• Client
• Resource Manager
• Scheduler
• Application Manager
• Node Manager
• Application Master
• Container
Resource Manager
It is the master daemon of YARN and is responsible for resource assignment
and management among all the applications.
Whenever it receives a processing request, it forwards it to the corresponding
node manager and allocates resources for the completion of the request
accordingly.
It has two major components:
1. Scheduler:
• It allocates resources to applications based on their requirements and the available cluster capacity.
• It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking,
and it does not guarantee a restart if a task fails.
• The YARN scheduler supports pluggable policies such as the Capacity Scheduler and the Fair Scheduler
to partition the cluster resources (see the configuration sketch after this list).
2. Application Manager:
• It is responsible for accepting application submissions and negotiating the first container from the
Resource Manager.
• It also restarts the Application Master container if a task fails.
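The scheduler plugin mentioned above is selected in yarn-site.xml. A minimal sketch, using the scheduler class names shipped with YARN (swap in the ...fair.FairScheduler class to use the Fair Scheduler instead):

  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  </property>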
Node Manager
It takes care of an individual node in the Hadoop cluster and manages the
applications and workflow on that particular node.
Its primary job is to keep up to date with the Resource Manager, with which
it registers and to which it sends heartbeats with the node's health status.
It monitors resource usage, performs log management, and kills containers
based on directions from the Resource Manager.
It is also responsible for creating container processes and starting them at
the request of the Application Master.
Application Master
An application is a single job submitted to the framework. The Application
Master is responsible for negotiating resources with the Resource Manager
and for tracking the status and monitoring the progress of a single
application.
The Application Master asks the Node Manager to launch containers by
sending it a Container Launch Context (CLC), which includes everything
the application needs to run.
Once the application is started, it sends health reports to the
Resource Manager from time to time.
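As an illustration, here is a minimal sketch of this negotiation using YARN's AMRMClient API. The resource size and priority are illustrative, and error handling and unregistration are omitted:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
  import org.apache.hadoop.yarn.api.records.Priority;
  import org.apache.hadoop.yarn.api.records.Resource;
  import org.apache.hadoop.yarn.client.api.AMRMClient;
  import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

  public class AmSketch {
      public static void main(String[] args) throws Exception {
          // Register this Application Master with the Resource Manager.
          AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
          rmClient.init(new Configuration());
          rmClient.start();
          rmClient.registerApplicationMaster("", 0, "");

          // Ask for one container with 1 GB of memory and 1 virtual core.
          Resource capability = Resource.newInstance(1024, 1);
          rmClient.addContainerRequest(
                  new ContainerRequest(capability, null, null, Priority.newInstance(0)));

          // The periodic allocate() call doubles as the heartbeat/health report
          // mentioned above; its response carries any newly allocated containers.
          AllocateResponse response = rmClient.allocate(0.1f);
          System.out.println("Allocated: " + response.getAllocatedContainers());
      }
  }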
Container
It is a collection of physical resources such as RAM, CPU cores, and disk
on a single node.
Containers are launched via the Container Launch Context (CLC), a record
that contains information such as environment variables, security tokens,
dependencies, etc.
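And a minimal sketch of launching an allocated container by handing a CLC to the Node Manager through YARN's NMClient API; the command line and the MyTask class are hypothetical, and the container is assumed to come from the Application Master's allocation response (see the previous sketch):

  import java.util.Collections;
  import org.apache.hadoop.yarn.api.records.Container;
  import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
  import org.apache.hadoop.yarn.client.api.NMClient;
  import org.apache.hadoop.yarn.util.Records;

  public class LaunchSketch {
      static void launch(NMClient nmClient, Container container) throws Exception {
          ContainerLaunchContext clc = Records.newRecord(ContainerLaunchContext.class);
          // The command the Node Manager will run inside the container (illustrative).
          clc.setCommands(Collections.singletonList(
                  "java -Xmx512m MyTask 1>/tmp/stdout 2>/tmp/stderr"));
          // Environment variables for the launched process.
          clc.setEnvironment(Collections.singletonMap("CLASSPATH", "./*"));
          // Local resources (jars, files) and security tokens would also be set here.
          nmClient.startContainer(container, clc);
      }
  }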
How Hadoop Runs a MapReduce Application
The benefits of using YARN

• Scalability – YARN can run on larger clusters than MapReduce 1.
MapReduce 1 hits scalability bottlenecks in the region of 4,000 nodes
and 40,000 tasks, stemming from the fact that the jobtracker has to
manage both jobs and tasks. YARN overcomes these limitations by
virtue of its split resource manager/application master architecture:
it is designed to scale up to 10,000 nodes and 100,000 tasks. In
contrast to the jobtracker, each instance of an application (here, a
MapReduce job) has a dedicated application master, which runs for
the duration of the application.
Availability
• Resource Manager high availability follows the same active-standby
failover pattern as HA in HDFS.
Multitenancy
• In some ways, the biggest benefit of YARN is that it opens up Hadoop
to other types of distributed application beyond MapReduce.
MapReduce is just one YARN application among many.
Scheduling in YARN
• FIFO Scheduler
• YARN places applications in a queue and runs them in the order of
submission (first in, first out).
• Requests for the first application in the queue are allocated first;
once its requests have been satisfied, the next application in the
queue is served, and so on.
• This scheduler is simple to understand and needs no configuration,
but it is not suitable for shared clusters: large applications will
use all the resources in the cluster, so every other application has
to wait its turn.
Capacity Scheduler
• With the Capacity Scheduler, a separate dedicated queue allows a small
job to start as soon as it is submitted, although at the cost of overall
cluster utilization, since the queue capacity is reserved for jobs in
that queue. This means a large job finishes later than it would under
the FIFO Scheduler.
• The Capacity Scheduler allows sharing of a Hadoop cluster along
organizational lines, whereby each organization is allocated a certain
capacity of the overall cluster.
• Each organization is set up with a dedicated queue that is configured
to use a given fraction of the cluster capacity.
• Queues may be further divided in hierarchical fashion, allowing each
organization to share its cluster allowance between different groups of
users within the organization.
• Within a queue, applications are scheduled using FIFO scheduling. If
there are idle resources available, then the Capacity Scheduler may
allocate the spare resources to jobs in the queue, even if that causes
the queue’s capacity to be exceeded. This behavior is known as queue
elasticity.
Example
• Example: A company might allocate 50% of cluster resources to the
data analysis team, 30% to the research team, and 20% to the IT
team. The Capacity Scheduler ensures that each team has access to
its allocated resources.
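A sketch of this 50/30/20 split as it could be expressed in capacity-scheduler.xml. The property names are YARN's real ones; the queue names are illustrative, and the last property shows queue elasticity (the analysis queue may grow into idle capacity, up to 80% of the cluster):

  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>analysis,research,it</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.analysis.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.research.capacity</name>
    <value>30</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.it.capacity</name>
    <value>20</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.analysis.maximum-capacity</name>
    <value>80</value>
  </property>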
Fair Scheduler
• With the Fair Scheduler there is no need to reserve a set amount of
capacity, since it dynamically balances resources between all running
jobs. Just after the first (large) job starts, it is the only job
running, so it gets all the resources in the cluster.
• When the second (small) job starts, it is allocated half of the cluster
resources so that each job is using its fair share of resources.
• Here Scheduler attempts to allocate resources so that all running
applications get the same share of resources.
• To understand how resources are shared between queues, imagine
two users A and B, each with their own queue.
• A starts a job, and it is allocated all the resources available since there
is no demand from B. Then B starts a job while A's job is still running,
and after a while each job is using half of the resources.
• Now if B starts a second job while the other jobs are still running, it
will share the resources of B's queue with B's first job, so each of B's
jobs will have one-fourth of the cluster resources, while A's job will
continue to have half.
• The result is that resources are shared fairly between users.
Example
• Example: If there are three jobs (A, B, and C), each job will get roughly
one-third of the cluster resources, regardless of the order in which
they were submitted. If job A finishes, the remaining two jobs will
split the resources evenly.
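The two-user scenario above, sketched as a Fair Scheduler allocation file (fair-scheduler.xml). The queue names are illustrative; the elements and placement rules are part of the real allocation-file format, and equal weights give A and B equal fair shares:

  <?xml version="1.0"?>
  <allocations>
    <queue name="userA">
      <weight>1.0</weight>
    </queue>
    <queue name="userB">
      <weight>1.0</weight>
    </queue>
    <!-- put each application in a queue named after the submitting user -->
    <queuePlacementPolicy>
      <rule name="specified" />
      <rule name="user" />
    </queuePlacementPolicy>
  </allocations>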
When to Use Which Scheduler?
• FIFO Scheduler: Use it for simple environments where job execution order is
important and there aren't too many users or long jobs.

• Capacity Scheduler: Ideal for multi-tenant environments where multiple users
or departments share the same cluster and each group has different resource
requirements.

• Fair Scheduler: Suitable for environments with a mix of short and long jobs,
where fairness and efficient resource sharing among all jobs are crucial.
Practically…
• Details of the logs:
http://geekdirt.com/blog/introduction-and-working-of-yarn/
K-Means Clustering
• In the map step:
• Read the cluster centers into memory from a SequenceFile.
• Iterate over each cluster center for each input key/value pair.
• Measure the distances and keep the nearest center, i.e. the one with
the lowest distance to the vector.
• Write the cluster center with its vector to the filesystem.
• In the reduce step (we get the associated vectors for each center):
• Iterate over each value vector and calculate the average vector (sum
the vectors and divide each component by the number of vectors received).
• This is the new center; save it into a SequenceFile.
• Check the convergence between the cluster center that is stored in the
key object and the new center.
• If they are not equal, increment an update counter.
• Rerun the whole job until nothing is updated anymore.
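A minimal single-process sketch of these two steps in Java. In a real Hadoop job the two methods would live in a Mapper and a Reducer reading and writing SequenceFiles, and the driver would rerun the job until the update counter stays at zero; here plain arrays stand in for the Writable types, and the sample data is illustrative:

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class KMeansSketch {

      // "Map" step: assign every input vector to its nearest cluster center.
      static Map<Integer, List<double[]>> mapStep(double[][] centers, List<double[]> vectors) {
          Map<Integer, List<double[]>> assigned = new HashMap<>();
          for (double[] v : vectors) {
              int nearest = 0;
              double best = Double.MAX_VALUE;
              for (int c = 0; c < centers.length; c++) {
                  double d = squaredDistance(centers[c], v);
                  if (d < best) { best = d; nearest = c; }
              }
              // In Hadoop this would be context.write(center, vector).
              assigned.computeIfAbsent(nearest, k -> new ArrayList<>()).add(v);
          }
          return assigned;
      }

      // "Reduce" step: the new center is the average of one group's vectors.
      static double[] reduceStep(List<double[]> vectors) {
          double[] sum = new double[vectors.get(0).length];
          for (double[] v : vectors)
              for (int i = 0; i < sum.length; i++) sum[i] += v[i];
          for (int i = 0; i < sum.length; i++) sum[i] /= vectors.size();
          return sum;
      }

      static double squaredDistance(double[] a, double[] b) {
          double d = 0;
          for (int i = 0; i < a.length; i++) d += (a[i] - b[i]) * (a[i] - b[i]);
          return d;
      }

      public static void main(String[] args) {
          List<double[]> data = Arrays.asList(
                  new double[]{1, 1}, new double[]{1.5, 2},
                  new double[]{8, 8}, new double[]{9, 9});
          double[][] centers = {{0, 0}, {10, 10}};

          int updated = 1;
          while (updated > 0) {        // driver loop: rerun until no center moves
              updated = 0;             // plays the role of the update counter
              for (Map.Entry<Integer, List<double[]>> e : mapStep(centers, data).entrySet()) {
                  double[] next = reduceStep(e.getValue());
                  if (!Arrays.equals(next, centers[e.getKey()])) {
                      centers[e.getKey()] = next;
                      updated++;
                  }
              }
          }
          System.out.println(Arrays.deepToString(centers));
      }
  }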
