Big Data Analytics 21CS71
MODULE 2
                                    Introduction to Hadoop (T1)
Introduction to Hadoop
Hadoop is an Apache open-source framework written in Java that enables the distributed processing
of large datasets across clusters of computers using simple programming models. It allows
applications to work in an environment that supports distributed storage and computation. Hadoop is
scalable, meaning it can grow from a single server to thousands of machines, each providing local
computation and storage. It is designed to handle Big Data and enable efficient processing of massive
datasets.
Big Data Store Model
The Big Data store model in Hadoop is based on a distributed file system. Data is stored in blocks,
which are physical divisions of data spread across multiple nodes. The architecture is organized in
clusters and racks:
       Data Nodes: Store data in blocks.
       Racks: Collections of DataNodes housed together; the cluster is scaled out by adding racks.
       Clusters: Racks are grouped into clusters to form the overall storage and processing system.
Hadoop ensures reliability by replicating data blocks across nodes. If a data link or node fails, the
system can still access the replicated data from other nodes.
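The block and replication layout described above can be inspected programmatically. Below is a minimal Java sketch (the file path and cluster settings are assumptions) that uses the HDFS FileSystem API to show which DataNodes hold each block of a file, along with its replication factor:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // connects to the file system in fs.defaultFS
        Path file = new Path("/data/example.txt");     // hypothetical HDFS path

        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block; getHosts() lists the DataNodes holding its replicas
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset() + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }
        System.out.println("replication factor: " + status.getReplication());
        fs.close();
    }
}

Running this against a multi-node cluster shows each block reported on several hosts, which is exactly the replication that keeps data reachable when a node or link fails.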
Big Data Programming Model
In Hadoop's Big Data programming model, jobs and tasks are scheduled to run on the same servers
where the data is stored, minimizing data transfer time. This programming model is enabled by
MapReduce, a powerful tool that divides processing tasks into smaller subtasks that can be executed
in parallel across the cluster.
Example of Jobs in Hadoop
       Query Processing: A job that processes queries on datasets and returns results to an
        application.
       Sorting Data: Sorting performance data from an examination or another large dataset.
Hadoop and Its Ecosystem
The Hadoop framework was developed as part of an Apache project for Big Data storage and
processing, initiated by Doug Cutting and Mike Cafarella. The name Hadoop came from Cutting’s
son, who named his stuffed toy elephant "Hadoop."
Hadoop has two main components:
    1. Hadoop Distributed File System (HDFS): A system for storing data in blocks across
       clusters.
    2. MapReduce: A computational framework that processes data in parallel across the clusters.
Hadoop is written primarily in Java, with some native code in C, and the utilities are managed using
shell scripts. The framework runs on clusters of commodity hardware or on cloud-based infrastructure,
making it a cost-effective solution for managing and processing terabytes of data in minutes.
Characteristics of Hadoop
Hadoop offers several key advantages for managing Big Data:
      Scalable: Easily scales from a few machines to thousands.
      Self-manageable: Requires minimal manual intervention for management.
      Self-healing: Automatically manages node failures by replicating data.
      Distributed File System: Ensures reliable storage and quick access to large datasets.
Hadoop Core Components
The Apache Hadoop framework is made up of several core components, which work together to store
and process large datasets in a distributed computing environment. The core components of Hadoop
are as follows:
   1. Hadoop Common:
           o   Description: This is the foundational module that contains the libraries and utilities
               required by other Hadoop components. It provides various common services like file
               system and input/output operations, serialization, and Remote Procedure Calls
               (RPCs).
           o   Features:
                      Common utilities shared across the Hadoop modules.
                      File-based data structures.
                      Essential interfaces for interacting with the distributed file system.
   2. Hadoop Distributed File System (HDFS):
           o   Description: HDFS is a Java-based distributed file system designed to run on
               commodity hardware. It allows Hadoop to store large datasets by distributing data
               blocks across multiple machines (nodes) in the cluster.
           o   Features:
                      Data is stored in blocks and replicated for fault tolerance.
                      Highly scalable and reliable.
                      Optimized for batch processing and provides high throughput for data access.
    3. MapReduce v1:
            o   Description: MapReduce v1 is a programming model that allows for the processing
                of large datasets in parallel across multiple nodes. The model divides a job into
                smaller sub-tasks, which are then executed across the cluster.
            o   Features:
                       Jobs are divided into Map tasks and Reduce tasks.
                       Suitable for batch processing large sets of data.
    4. YARN (Yet Another Resource Negotiator):
            o   Description: YARN is responsible for managing computing resources in Hadoop. It
                schedules and manages jobs and sub-tasks by allocating resources to applications and
                ensuring they run efficiently in a distributed environment.
            o   Features:
                       Resource management for Hadoop clusters.
                       Parallel execution of tasks across clusters.
                       Supports dynamic allocation of resources to applications.
    5. MapReduce v2:
            o   Description: An updated version of MapReduce that operates under the YARN
                architecture. It improves resource management and scalability compared to
                MapReduce v1.
            o   Features:
                       YARN-based system for distributed parallel processing.
                       Allows better resource allocation for running large applications.
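As a small illustration of how these components appear to a programmer, the sketch below uses Hadoop Common's Configuration class, which loads core-default.xml and core-site.xml and is shared by HDFS, MapReduce, and YARN. The printed values depend on the cluster's site files, so treat this as a non-authoritative example:

import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
    public static void main(String[] args) {
        // Hadoop Common's Configuration loads core-default.xml and core-site.xml automatically
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");      // optionally pull in HDFS settings as well

        System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS"));
        System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
    }
}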
Features of Hadoop
Hadoop has several features that make it an essential tool for handling Big Data:
    1. Scalability and Modularity:
            o   Hadoop is highly scalable, meaning you can add more nodes to the cluster as data
                grows.
            o   Its modular design allows components to be easily added or replaced.
    2. Robust HDFS:
            o   The Hadoop Distributed File System (HDFS) is designed to handle large-scale data
                reliably.
            o   Data is replicated (default: three copies), ensuring backup and recovery in case of
                node failures.
    3. Big Data Processing:
           o   Hadoop processes Big Data characterized by the 3Vs: Volume, Variety, and
               Velocity.
   4. Distributed Cluster Computing with Data Locality:
           o   Hadoop optimizes processing by running tasks on the same nodes where the data is
               stored, enhancing efficiency.
           o   High-speed processing is achieved by distributing tasks across multiple nodes in a
               cluster.
   5. Fault Tolerance:
           o   Hadoop automatically handles hardware failures. If a node fails, the system recovers
               by using data replicated across other nodes.
   6. Open-Source Framework:
           o   Hadoop is open-source, making it cost-effective for handling large data workloads. It
               can run on inexpensive hardware and cloud infrastructure.
   7. Java and Linux Based:
           o   Hadoop is built in Java and runs primarily on Linux. It also includes its own set of
               shell commands for easy management.
Hadoop Ecosystem Components
Hadoop's ecosystem consists of multiple layers, each responsible for different aspects of storage,
resource management, processing, and application support. The key components are:
   1. Distributed Storage Layer:
           o   HDFS: Manages the distributed storage of large datasets.
   2. Resource Manager Layer:
           o   YARN: Manages and schedules the distribution of resources for jobs and sub-tasks in
               the cluster.
   3. Processing Framework Layer:
           o   MapReduce: Processes data in parallel by dividing jobs into Mapper and Reducer
               tasks.
   4. APIs at the Application Support Layer:
           o   Provides application interfaces for interacting with the Hadoop ecosystem.
This layered architecture enables Hadoop to efficiently store, manage, and process vast amounts of
data, making it an essential tool for organizations working with Big Data.
HDFS Data Storage
   1. Data Distribution in Clusters:
           o   Hadoop's storage concept involves distributing data across a cluster. A cluster
               consists of multiple racks, and each rack contains several DataNodes.
           o   DataNodes are responsible for storing the actual data blocks, while the NameNode
               manages the file system metadata and keeps track of where the data is stored.
   2. Data Blocks:
           o   HDFS breaks down large files into data blocks. Each block is stored independently
               across various DataNodes.
           o   By default, HDFS stores replicas of each data block on multiple DataNodes to
               ensure data availability even if some nodes fail.
            o   Default block size: 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later; the
                block size is configurable (for example, 256 MB).
   3. Rack Awareness:
           o   HDFS is aware of the physical distribution of nodes across racks.
            o   When replicating blocks, Hadoop places replicas on more than one rack to improve
                fault tolerance, while keeping some replicas on the same rack to limit cross-rack
                network traffic.
   4. Fault Tolerance:
           o   The replication of blocks ensures that data is not lost if a node goes down. The
               default replication factor is 3, meaning that each block is replicated across three
               different nodes.
           o   In the event of a DataNode failure, the NameNode automatically re-replicates the
               missing blocks on other DataNodes.
   5. Processing and Storage:
           o   DataNodes not only store data but also have the capability to process the data stored
               in them. This enables distributed processing and allows Hadoop to process large
               datasets efficiently across clusters.
   6. Data Block Management:
           o   When a file is uploaded to HDFS, it is split into blocks. Each block is distributed
               across different nodes to optimize read and write performance.
            o   Blocks are immutable: once written, they cannot be modified. Data can only be
                appended to the end of a file; existing content cannot be altered in place.
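A minimal Java sketch of these storage rules follows; the file path is hypothetical and it assumes append support is enabled on the cluster (the default in current HDFS releases). It creates a file with an explicit replication factor and block size and then appends to it, since existing blocks cannot be rewritten in place:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteThenAppend {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/data/log.txt");                          // hypothetical path

        // create(path, overwrite, bufferSize, replication, blockSize):
        // ask for 3 replicas and 128 MB blocks for this particular file
        FSDataOutputStream out = fs.create(p, true, 4096, (short) 3, 128L * 1024 * 1024);
        out.writeBytes("first record\n");
        out.close();

        // Blocks are immutable: existing bytes cannot be rewritten,
        // but new data may be appended at the end of the file.
        FSDataOutputStream app = fs.append(p);
        app.writeBytes("appended record\n");
        app.close();
        fs.close();
    }
}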
Hadoop Physical Organization
In a Hadoop cluster, nodes are divided into MasterNodes and SlaveNodes.
MasterNodes:
      MasterNodes (or simply Masters) are responsible for coordinating the operations within the
       cluster. These nodes handle the overall management of the Hadoop environment, including
       storage and task distribution.
      Key MasterNodes:
           1. NameNode: The central node that manages the file system's metadata, such as file
              block locations, permissions, and access times. It plays a crucial role in managing
              data within HDFS.
            2. Secondary NameNode: Periodically merges the NameNode's edit log with the file
               system image to produce updated metadata checkpoints. Despite its name, it is not a
               failover mechanism and is not a complete replacement for the NameNode.
           3. JobTracker: Oversees the allocation of MapReduce tasks to various nodes and
              ensures job completion by managing job execution across the cluster.
SlaveNodes:
       SlaveNodes (or DataNodes and TaskTrackers) store the actual data blocks and execute
       computational tasks. Each node has a significant amount of disk space and is responsible for
       both data storage and processing.
           o   DataNodes handle the storage and management of data blocks.
           o   TaskTrackers execute the processing tasks sent by the MasterNode and return the
               results.
Physical Distribution of Nodes:
      A typical Hadoop cluster consists of many DataNodes that store data, while MasterNodes
       handle administrative tasks. In a large cluster, multiple MasterNodes are used to balance the
       load and ensure redundancy.
Client-Server Interaction:
      Clients interact with the Hadoop system by submitting queries or applications through various
       Hadoop ecosystem projects, such as Hive, Pig, or Mahout.
      The MasterNode coordinates with the DataNodes to store data and process tasks. For
       example, it organizes how files are distributed across the cluster, assigns jobs to the nodes,
       and monitors the health of the system.
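To make the master/slave split concrete, the hedged sketch below asks the NameNode, through the HDFS client class DistributedFileSystem, for the list of live DataNodes it is currently tracking. It assumes the default file system configured for the client is HDFS:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ListDataNodes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // The NameNode keeps the cluster membership; this call returns
            // one entry per live DataNode (slave node) it knows about.
            for (DatanodeInfo dn : dfs.getDataNodeStats()) {
                System.out.println(dn.getHostName() + " capacity=" + dn.getCapacity()
                        + " used=" + dn.getDfsUsed());
            }
        }
        fs.close();
    }
}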
Hadoop MapReduce Framework and Programming Model
MapReduce is the primary programming model used for processing large datasets in Hadoop. The
framework is divided into two main functions:
   1. Map Function:
           o   The Map function organizes the data into key/value pairs.
           o   Each mapper works on a subset of the data blocks and produces intermediate results
               that are used by the Reduce function.
           o   Mapping distributes the task across different nodes in the cluster, where each node
               processes its portion of the data.
   2. Reduce Function:
           o   The Reduce function takes the intermediate key/value pairs generated by the Map
               function and processes them to produce a final aggregated result.
           o   It applies aggregation, queries, or other functions to the mapped data, reducing it
               into a smaller, cohesive set of results.
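The classic concrete example of these two functions is word counting. The sketch below is a minimal version under simple assumptions (plain-text input, whitespace tokenization): the Mapper emits (word, 1) pairs and the Reducer sums the counts for each word:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: one input line -> (word, 1) pairs
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                ctx.write(word, ONE);              // intermediate key/value pair
            }
        }
    }
}

// Reduce: (word, [1, 1, ...]) -> (word, total count)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        ctx.write(word, new IntWritable(sum));     // final aggregated result
    }
}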
Hadoop MapReduce Execution Process
The MapReduce job execution involves several steps:
   1. Job Submission:
           o   A client submits a request to the JobTracker, which estimates the required resources
               and prepares the cluster for execution.
   2. Task Assignment:
           o   The JobTracker assigns Map tasks to nodes that store the relevant data. This is
               called data locality, which reduces network overhead.
   3. Monitoring:
           o   The progress of each task is monitored, and if any task fails, it is restarted on a
               different node with available resources.
   4. Final Output:
            o   After the Map and Reduce tasks are completed, the results are serialized and
                transferred back to the client, typically using serialization formats such as Apache
                Avro.
MapReduce Programming Model
MapReduce programs can be written in various languages, including Java, C++, and Python. The
basic structure of a MapReduce program includes:
   1. Input Data:
           o   Data is typically stored in HDFS in files or directories, either structured or
               unstructured.
   2. Map Phase:
           o   The map function processes the input data by breaking it down into key/value pairs.
               Each key/value pair is passed to the reduce phase after mapping.
   3. Reduce Phase:
           o   The reduce function collects the output of the map phase and reduces the data by
               aggregating, sorting, or applying user-defined functions.
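A small driver class, sketched below, wires these three parts together. It reuses the hypothetical WordCountMapper and WordCountReducer classes from the earlier sketch and takes the HDFS input and output paths as command-line arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");        // submitted to the cluster
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);            // Map phase
        job.setReducerClass(WordCountReducer.class);          // Reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0])); // input data in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);     // block until the job finishes
    }
}

The driver is packaged into a JAR and submitted to the cluster, after which the framework splits the input, schedules the map and reduce tasks, and writes the results to the output directory.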
Hadoop YARN: Resource Management and Execution Model
YARN (Yet Another Resource Negotiator) is a resource management framework used in Hadoop for
managing and scheduling compute resources in a distributed environment. YARN separates the job
processing function from resource management, improving scalability and efficiency.
Components in YARN:
   1. Resource Manager (RM):
           o   The Resource Manager is the master node in the YARN architecture. There is one
               RM per cluster, and it is responsible for:
                     Managing the overall resources of the cluster.
                     Handling job submissions from clients.
                     Monitoring the availability of node resources (Node Managers).
                     Allocating resources to the applications.
   2. Node Manager (NM):
           o   The Node Manager is a slave component running on each cluster node. It manages
               the individual node's resources and keeps the RM informed of its status.
               Responsibilities include:
                     Monitoring the resource usage (CPU, memory) of containers running on the
                      node.
                     Starting and stopping containers (which run the actual tasks).
                     Sending periodic heartbeat signals to the RM to indicate its availability.
   3. Application Master (AM):
           o   The Application Master is created for each job submitted to YARN. It handles the
               life cycle of an individual application. Its tasks include:
                     Requesting resources (containers) from the RM.
                     Coordinating the execution of tasks across containers.
                     Monitoring task completion and handling failures.
   4. Containers:
           o   Containers are the basic unit of resource allocation in YARN. Each container is a
               collection of resources (memory, CPU) on a single node, assigned by the RM to the
               Application Master for executing tasks.
           o   Containers run the actual tasks of the application in parallel, distributed across
               multiple nodes.
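To make the RM/NM relationship concrete, the hedged Java sketch below uses the YarnClient API to ask the Resource Manager for a report of the Node Managers that are currently sending heartbeats, together with their container counts and resource capabilities. The RM address is taken from yarn-site.xml:

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListNodeManagers {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(conf);                 // contacts the Resource Manager named in yarn-site.xml
        yarn.start();

        // One NodeReport per Node Manager currently heartbeating to the RM
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport n : nodes) {
            System.out.println(n.getNodeId() + " containers=" + n.getNumContainers()
                    + " capability=" + n.getCapability());
        }
        yarn.stop();
    }
}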
YARN-Based Execution Model
The YARN execution model consists of several steps involving the interaction between different
components. Below is a breakdown of the actions in the YARN resource allocation and scheduling
process:
   1. Client Submission:
           o   The Client Node submits a request for an application or job to the Resource
               Manager (RM). The RM then takes responsibility for managing and executing the
               job.
   2. Job History Server:
           o   The Job History Server keeps track of all the jobs that have been completed in the
               cluster. This helps in maintaining job execution history for future analysis or
               debugging.
    3. Node Manager Startup:
            o   In a YARN cluster, multiple Node Managers (NMs) run, one per node, and each
                registers with the RM. For every submitted application, the RM asks one NM to
                launch a container that hosts the Application Master (AM). The AM is responsible
                for managing the lifecycle of that application and requesting resources.
    4. Application Master Initialization:
            o   Once the AM instance is running, it registers itself with the RM. The AM then
                evaluates the resource requirements for the submitted job and requests the necessary
                containers.
   5. Resource Allocation:
           o   The RM analyzes the resource availability in the cluster by tracking heartbeat signals
               from active NMs. The RM allocates the required containers across different nodes
               based on the resource requests from the Application Master.
    6. Container Assignment:
            o   The RM grants the requested containers to the AM, and the corresponding NMs
                launch them. Containers may be located on the same NM or spread across different
                NMs, depending on resource availability. The Application Master uses the assigned
                containers to execute the sub-tasks of the application.
   7. Execution of Application Sub-Tasks:
           o   Once the containers are assigned, the Application Master coordinates the execution
               of sub-tasks across the allocated containers. The job's tasks run in parallel on different
               containers, utilizing the distributed nature of the Hadoop cluster.
   8. Resource Monitoring:
            o   During job execution, each NM monitors the resource utilization of its containers,
                while the AM tracks task progress. If a task fails, the AM requests a replacement
                container from the RM and the failed task is re-run.
Hadoop Ecosystem Tools
1. Zookeeper:
Zookeeper is a centralized coordination service for distributed applications. It provides a reliable,
efficient way to manage configuration, synchronization, and name services across distributed systems.
Zookeeper maintains its data in a hierarchy of small nodes called znodes, ensuring that distributed
systems function cohesively. Its main coordination services include:
       Name Service: Similar to DNS, it maps names to information, tracking servers or services
        and checking their statuses.
       Concurrency Control: Manages concurrent access to shared resources, preventing
        inconsistencies and ensuring that distributed processes run smoothly.
       Configuration Management: A centralized configuration manager that updates nodes with
        the current system configuration when they join the system.
       Failure Management: Automatically recovers from node failures by selecting alternative
        nodes to take over processing tasks.
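A minimal Java sketch of these coordination services is shown below; the ensemble address and the znode path are assumptions. It stores a small configuration value in a znode and reads it back, which is the same pattern used for the name service and configuration management described above:

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect to a ZooKeeper ensemble (host:port is an assumption)
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();                                     // wait until the session is established

        String path = "/demo-config";                          // hypothetical znode
        byte[] value = "jdbc:mysql://db-host/sales".getBytes(StandardCharsets.UTF_8);

        // Znodes act like a small replicated name/configuration service:
        if (zk.exists(path, false) == null) {
            zk.create(path, value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        System.out.println(new String(zk.getData(path, false, null), StandardCharsets.UTF_8));
        zk.close();
    }
}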
2. Oozie:
Apache Oozie is a workflow scheduler for Hadoop that manages and coordinates complex jobs and
tasks in big data processing. Oozie allows you to create, schedule, and manage multiple workflows. It
organizes jobs into Directed Acyclic Graphs (DAGs) and supports:
       Integration of Multiple Jobs: Oozie integrates MapReduce, Hive, Pig, and Sqoop jobs in a
        sequential workflow.
       Time and Data Triggers: Automatically runs workflows based on time or specific data
        availability.
       Batch Management: Manages the timely execution of thousands of jobs in a Hadoop cluster.
Oozie is efficient for automating and scheduling repetitive jobs, simplifying the management of
multiple workflows.
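Workflows are usually defined in a workflow.xml (the DAG definition) stored in HDFS and submitted to the Oozie server. The hedged Java sketch below uses the OozieClient API to submit and start such a workflow; the server URL, HDFS paths, and the nameNode/jobTracker property names are assumptions that must match the workflow definition:

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        // URL of the Oozie server is an assumption
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties props = oozie.createConfiguration();
        // Path to a workflow.xml (the DAG definition) already stored in HDFS
        props.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/wf");
        props.setProperty("nameNode", "hdfs://namenode:8020");
        props.setProperty("jobTracker", "resourcemanager:8032");

        String jobId = oozie.run(props);                       // submit and start the workflow
        WorkflowJob info = oozie.getJobInfo(jobId);
        System.out.println(jobId + " -> " + info.getStatus());
    }
}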
3. Sqoop:
Apache Sqoop is a tool used for efficiently importing and exporting large amounts of data between
Hadoop and relational databases. It uses the MapReduce framework to parallelize data transfer
tasks. The workflow of Sqoop includes:
       Command-Line Parsing: Sqoop processes the arguments passed through the command line
        and prepares map tasks.
       Data Import and Export: Data from external databases is distributed across multiple
        mappers. Each mapper connects to the database using JDBC to fetch and import the data into
        Hadoop, HDFS, Hive, or HBase.
       Parallel Processing: Sqoop leverages Hadoop's parallel processing to transfer data quickly
        and efficiently. It also provides fault tolerance and schema definition for data import.
Sqoop's ability to handle structured data makes it an essential tool for integrating relational databases
with the Hadoop ecosystem.
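Sqoop is normally driven from the command line, but the same tool can also be invoked from Java. The sketch below is a hedged example of an import (the connection string, credentials, and table name are assumptions) that asks for four parallel map tasks, each fetching part of the table over JDBC:

import org.apache.sqoop.Sqoop;

public class ImportOrders {
    public static void main(String[] args) {
        // Same arguments as the sqoop command-line tool; connection details are assumptions
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://db-host/sales",
            "--username", "etl", "--password-file", "/user/etl/.pw",
            "--table", "orders",
            "--target-dir", "/data/orders",
            "--num-mappers", "4"             // four parallel map tasks read via JDBC
        };
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}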
4. Flume:
Apache Flume is a service designed for efficiently collecting, aggregating, and transferring large
volumes of streaming data into Hadoop, particularly into HDFS. It's highly useful for applications
involving continuous data streams, such as logs, social media feeds, or sensor data. Key components
of Flume include:
       Sources: These collect data from servers or applications.
       Sinks: These store the collected data into HDFS or another destination.
       Channels: These act as a buffer, holding event data (typically 4 KB in size) between sources
        and sinks.
       Agents: JVM processes that host sources, channels, and sinks. Interceptors can filter or
        modify events before they are written to the target.
Flume is reliable and fault-tolerant, providing a robust solution for handling massive, continuous data
streams.
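Applications typically push events into Flume through an Avro source. The hedged Java sketch below uses Flume's RpcClient to send one log event to an agent; the agent host and port are assumptions and must match the port on which the agent's source is listening:

import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class SendLogEvent {
    public static void main(String[] args) throws Exception {
        // Connect to a Flume agent whose Avro source listens on this host/port (assumption)
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        try {
            Event event = EventBuilder.withBody(
                    "2024-01-01 12:00:00 INFO user login", StandardCharsets.UTF_8);
            client.append(event);            // the source hands the event to a channel,
                                             // and a sink later writes it to HDFS
        } finally {
            client.close();
        }
    }
}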
----------------------------------------END OF MODULE 2-------------------------------------------------