Unit-2      INTRODUCTION TO HADOOP AND HADOOP ARCHITECTURE
Big Data – Apache Hadoop & Hadoop Ecosystem
Hadoop is an open-source framework provided by Apache to process and
analyze very large volumes of data. It is written in Java and is used by
companies such as Facebook, LinkedIn, Yahoo, and Twitter.
This unit covers the main topics of Big Data Hadoop, including HDFS,
MapReduce, YARN, Hive, HBase, Pig, Sqoop, etc.
Apache Hadoop is an open-source software framework used for distributed
storage and processing of large datasets across clusters of computers using
simple programming models. It's designed to scale up from single servers
to thousands of machines, each offering local computation and storage.
Hadoop consists of four main modules:
1. Hadoop Common: A set of common utilities and libraries that support
other Hadoop modules.
2. Hadoop Distributed File System (HDFS): A distributed file system that
stores data across multiple machines. It provides high-throughput access to
application data and is designed to be fault-tolerant.
3. Hadoop YARN (Yet Another Resource Negotiator): A resource
management layer responsible for managing resources and scheduling
applications on the Hadoop cluster.
4. Hadoop MapReduce: A programming model and processing engine for
large-scale data processing. It allows users to write applications that process
large amounts of data in parallel across a distributed cluster.
Hadoop is widely used in industries such as finance, healthcare, advertising,
and social media for tasks like log processing, data warehousing, machine
learning, and more. It's known for its scalability, fault tolerance, and ability
to handle diverse types of data.
Hadoop Ecosystem
The Hadoop Ecosystem is a group of software tools and frameworks built
around the core components of Apache Hadoop. It provides the
infrastructure needed to store, process, and analyze large amounts of data
by distributing data and processing tasks across clusters of computers.
  Hadoop Ecosystem Components
The Hadoop Ecosystem is composed of several components that work
together to enable the storage and analysis of data.
The table below lists the components that collectively form the Hadoop
Ecosystem.

Components              Description
HDFS                    Hadoop Distributed File System
YARN                    Yet Another Resource Negotiator
MapReduce               Programming-based data processing
Spark                   In-memory data processing
Pig, Hive               Query-based processing of data services
HBase                   NoSQL database
Mahout, Spark MLlib     Machine learning algorithm libraries
Zookeeper               Managing the cluster
Oozie                   Job scheduling
 Now we will learn about each of the components in detail.
 Hadoop Distributed File System
       • HDFS is the primary storage system in the Hadoop Ecosystem.
         •   It is a distributed file system that provides reliable and scalable
             storage of large datasets across multiple computers.
         •   HDFS divides data into blocks and distributes them across the
             cluster for fault tolerance and high availability.
         •   It consists of two basic components:
                     o NameNode
                     o DataNode
         •   The NameNode is the primary (master) node. It stores only
             metadata, so it requires comparatively fewer resources than the
             DataNodes, which store the actual data.
         •   It maintains all the coordination between the cluster and the hardware.
HDFS Architecture
The main purpose of HDFS is to ensure that data is preserved even in the
event of failures such as NameNode failures, DataNode failures, and
network partitions.
HDFS uses a master/slave architecture, where one device (master) controls
one or more other devices (slaves).
Important points about HDFS architecture:
       1. Files are split into fixed-size chunks and replicated across
          multiple DataNodes.
       2. The NameNode contains file system metadata and coordinates
          data access.
       3. Clients interact with HDFS through APIs to read, write, and
          delete files.
       4. DataNodes send heartbeats to the NameNode to report status
          and block information.
       5. HDFS is rack-aware and places replicas on different racks for
          fault tolerance. Checksums are used for data integrity to ensure
          the accuracy of stored data.
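Clients normally reach HDFS through the Java FileSystem API. Below is a minimal sketch of writing and reading a file; the NameNode address hdfs://namenode:9000 and the path /user/demo/hello.txt are placeholders for illustration only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in practice this comes from core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits it into blocks and replicates them
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");
        }

        // Read the file back through the NameNode/DataNode pipeline
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}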
YARN
       •   YARN (Yet Another Resource Negotiator) helps manage resources
           across the cluster.
       •   It has three main components:
                  o   Resource Manager
                  o   Node Manager
                  o   Application Master
       •   The Resource Manager allocates resources to the applications
           running in the system.
       •   The Node Manager manages resources such as CPU, memory, and
           bandwidth on each machine and reports this back to the
           Resource Manager.
       •   The Application Master negotiates resources with the Resource
           Manager and works with the Node Managers according to the
           application's requirements.
YARN Architecture
Key points about the YARN architecture:
       •   The Resource Manager has the authority to allocate resources to
           the applications in the system.
       •   The Node Managers allocate resources such as CPU, memory, and
           bandwidth on each machine and later report back to the
           Resource Manager.
       •   The Application Master acts as an interface between the
           Resource Manager and the Node Managers, negotiating resources
           according to the application's needs.
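For illustration, the YARN client API can be used to list the Node Managers and the resources the Resource Manager sees on each of them. This is a minimal sketch, assuming a configured Hadoop client (i.e. yarn-site.xml on the classpath); the output is only informational.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ClusterNodes {
    public static void main(String[] args) throws Exception {
        // YarnClient talks to the Resource Manager defined in yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // List the running Node Managers and the resources they offer
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " -> " + node.getCapability());
        }
        yarnClient.stop();
    }
}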
MapReduce
     • MapReduce is a programming model and processing framework
        that enables parallel processing of large data sets.
       •   MapReduce works with big data by splitting a job into two
           phases, Map and Reduce, whose tasks can run in parallel.
       •   Map tasks process the data and produce intermediate results.
       •   The intermediate results are then combined by the reduce tasks
           to produce the final output.
         •   MapReduce makes use of two functions, Map() and Reduce():
                    o   Map() sorts and filters the data, organizing it into
                        groups.
                    o   Map() produces results in the form of key-value pairs,
                        which are later processed by the Reduce() method.
                    o   Reduce() performs summarization by aggregating the
                        related data.
         •   Simply put, Reduce() takes the output produced by Map() as
             input and combines those tuples into a smaller set of tuples.
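For example, the classic word-count job implements Map() and Reduce() exactly as described above. Below is a minimal sketch in Java (the class names are illustrative); a driver that configures and submits such a job is sketched later in this unit.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map(): split each input line into words and emit (word, 1) pairs
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // intermediate key-value pair
                }
            }
        }
    }

    // Reduce(): sum the counts for each word and emit (word, total)
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));   // final key-value pair
        }
    }
}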
Apache Hive
      • Hive provides a data warehousing infrastructure based on
         Hadoop.
        •   It provides a SQL-like query language called HiveQL. It allows
            users to query, analyze and manage large datasets stored in
            Hadoop.
         •   Hive translates queries into MapReduce jobs (or jobs for other
             execution engines), enabling data summarization, ad-hoc
             queries, and data analysis.
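HiveQL can be submitted from Java through the Hive JDBC driver. Below is a minimal sketch, assuming HiveServer2 is reachable at localhost:10000 and a table named logs already exists; the host, port, credentials, and table are placeholders for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // The Hive JDBC driver must be on the classpath
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Connect to HiveServer2 (host/port/user are placeholders)
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = con.createStatement();

        // HiveQL looks like SQL; Hive compiles it into MapReduce (or Tez/Spark) jobs
        ResultSet rs = stmt.executeQuery(
                "SELECT level, COUNT(*) FROM logs GROUP BY level");
        while (rs.next()) {
            System.out.println(rs.getString(1) + " : " + rs.getLong(2));
        }
        con.close();
    }
}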
Apache Pig
      • Pig is a high-level scripting language and platform for simplifying
         data processing tasks in Hadoop.
        •   It provides a language called Pig Latin. It allows users to express
            data transformations and analytical operations.
        •   Pig optimizes these operations and transforms them into
            MapReduce jobs for execution.
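Pig Latin can also be run from Java through the PigServer API. Below is a minimal sketch in local mode; the input file students.txt and its tab-separated (name, marks) schema are assumptions made only for this example.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; ExecType.MAPREDUCE runs on the cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Pig Latin: load, filter, and store; Pig compiles this into MapReduce jobs
        pig.registerQuery("students = LOAD 'students.txt' AS (name:chararray, marks:int);");
        pig.registerQuery("toppers = FILTER students BY marks >= 60;");
        pig.store("toppers", "toppers_out");

        pig.shutdown();
    }
}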
HBase
        •   HBase is a distributed columnar NoSQL database that runs on
            Hadoop.
        •   It provides real-time random read/write access to large datasets.
        •   HBase is suitable for applications that require low-latency access
            to data.
       • For example: real-time analytics, time-series data, and Online
         Transaction Processing Systems (OLTP).
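The HBase Java client provides this row-level random access. Below is a minimal sketch, assuming an HBase table named events with a column family info already exists; the table, column, and row names are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("events"))) {

            // Write one cell: row key "row1", column info:status
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("status"), Bytes.toBytes("OK"));
            table.put(put);

            // Random, low-latency read of the same row
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("status"))));
        }
    }
}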
Apache Spark
      • Spark is a fast and versatile cluster computing system that
         extends the capabilities of Hadoop.
       •   It offers in-memory processing, enabling faster data processing
           and iterative analysis.
       • Spark supports batch processing, real-time stream processing,
         and interactive data analysis. Thus, making it a versatile tool in
         the Hadoop Ecosystem.
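Spark jobs can be written in Java as well. Below is a minimal sketch of an in-memory word count in local mode; the input file path is taken from the command line and is an assumption of this example.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile(args[0]);

            // Transformations are chained and executed in memory
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.take(10).forEach(t -> System.out.println(t._1() + " : " + t._2()));
        }
    }
}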
Apache Kafka
      • Kafka is a distributed streaming platform that enables the
         ingestion and processing of real-time data streams.
       •   It provides a publish/subscribe model for streaming data,
           allowing applications to consume and process data as it is
           generated.
       • Kafka is commonly used to build real-time data pipelines, event-
         driven architectures, and streaming analytics applications.
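A Kafka producer publishes records to a topic that downstream consumers subscribe to. Below is a minimal sketch, assuming a broker at localhost:9092 and a topic named clicks; both are placeholders.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one event; consumers subscribed to "clicks" receive it
            producer.send(new ProducerRecord<>("clicks", "user42", "page=/home"));
        }
    }
}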
Apache Sqoop
      • Sqoop is a Hadoop tool that makes it easy to move data
         between Hadoop and structured databases.
       • This tool helps connect traditional databases with the Hadoop
         Ecosystem.
Apache Flume
       •   Flume makes it easier to collect large amounts of live
           (streaming) data and send it to Hadoop.
       •   It helps ingest data from different sources such as log files,
           social media, and sensors.
Hadoop Performance Optimization
Here are some methods for optimizing the Hadoop Ecosystem for Big Data workloads:
Take Advantage of HDFS Block Size Optimization
       • Configure the HDFS block size based on the typical size of your
          data files.
       •  Larger block sizes improve performance for reading and writing
         large files, while smaller block sizes are beneficial for smaller
         files.
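The cluster-wide default block size is set in hdfs-site.xml (property dfs.blocksize), but it can also be overridden by the client for the files it writes. Below is a minimal sketch; the 256 MB value and the file path are only illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Override the block size for files written by this client (256 MB)
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/data/big-file.dat"))) {
            out.writeBytes("large file contents would be streamed here");
        }
        fs.close();
    }
}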
Optimize Data Replication Factor
       •     Adjust the replication factor based on the required fault
             tolerance and cluster storage capacity.
       •  A lower replication factor reduces storage overhead and
         improves performance but at the cost of lower fault tolerance.
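The replication factor is configured cluster-wide with dfs.replication and can also be changed for existing files. Below is a minimal sketch; the path and the factor of 2 are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdjustReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Reduce the replication factor of a rarely accessed data set from 3 to 2
        boolean changed = fs.setReplication(new Path("/archive/2023"), (short) 2);
        System.out.println("Replication changed: " + changed);
        fs.close();
    }
}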
Optimize Your Network Settings
      • Configure network settings such as network buffers and TCP
         settings. It helps to maximize data transfer speeds between
         nodes in your cluster.
       •  Hadoop performance improves when network bandwidth
          increases and latency decreases.
Increase Concurrency
       • Split large computing tasks into smaller, parallelizable tasks to
          make optimal use of your cluster's compute resources.
       • This can be achieved by adjusting the number of mappers and
         reducers in your MapReduce job.
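The number of reduce tasks is set directly on the job, while the number of map tasks follows from the input split size. Below is a minimal sketch of both settings; the values shown are placeholders, not recommendations, and the sketch only configures the job without submitting it.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ConcurrencyTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Smaller maximum split size -> more input splits -> more parallel mappers
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024);

        Job job = Job.getInstance(conf, "concurrency-tuning-demo");
        // More reducers spread the shuffle and reduce work across the cluster
        job.setNumReduceTasks(8);
    }
}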
Optimize Task Scheduling
       •   Configure the Hadoop scheduler (e.g., the Fair Scheduler or
           Capacity Scheduler) for efficient resource allocation.
       •     Fine-tuning the scheduling parameters ensures fair resource
             allocation and maximizes cluster utilization.
Moving Data in and out of Hadoop –
Understanding inputs and outputs of MapReduce
MapReduce is a programming model and associated implementation for
processing and generating large datasets that can be parallelized across a
distributed cluster of computers. It consists of two main phases: the Map
phase and the Reduce phase. The inputs and outputs of each phase are as
follows:
Map Phase
Input:
       •   Key-value pairs: The input data is divided into chunks, and each
           chunk is represented as a key-value pair. Typically, the key is
           used to identify the data record, and the value contains the
           actual data.
Processing:
       •   Mapper function: A user-defined function called the "mapper" is
           applied to each key-value pair independently. The mapper
           function takes the input key-value pair and emits intermediate
           key-value pairs based on the processing logic. It can filter,
           transform, or extract information from the input data.
Output:
       •   Intermediate key-value pairs: The mapper function generates
           intermediate key-value pairs as its output. These key-value
           pairs are usually different from the input key-value pairs and
           are emitted based on the logic defined in the mapper function.
           The intermediate key-value pairs are grouped by key and shuffled
           across the cluster to prepare for the next phase.
Reduce Phase
Input:
       •   Grouped key-value pairs: The intermediate key-value pairs
           generated by the map phase are shuffled and grouped based on
           their keys. All intermediate values associated with the same key
           are collected together and passed to the reducer function.
Processing:
       •   Reducer function: A user-defined function called the "reducer"
           is applied to each group of intermediate values sharing the same
           key. The reducer function aggregates, summarizes, or processes
           these values to produce the final output.
Output:
       •   Final output key-value pairs: The reducer function generates the
           final output key-value pairs based on the processing logic.
           These key-value pairs constitute the result of the MapReduce job
           and typically represent the desired computation or analysis
           performed on the input data.
In summary, the inputs to MapReduce are the initial dataset represented as
key-value pairs, and the outputs are the final processed results also
represented as key-value pairs, with intermediate processing stages in
between.
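The key-value types of both phases are declared in the job driver. Below is a minimal driver sketch for the hypothetical WordCount mapper and reducer shown earlier in this unit; the input and output paths are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Map phase: (offset, line) in, (word, 1) out
        job.setMapperClass(WordCount.TokenizerMapper.class);
        // Optional combiner reduces intermediate data before the shuffle
        job.setCombinerClass(WordCount.SumReducer.class);
        // Reduce phase: (word, [counts]) in, (word, total) out
        job.setReducerClass(WordCount.SumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}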
Data Serialization
What is Serialization?
Serialization is the process of converting a data object—a
combination of code and data represented within a region of data
storage—into a series of bytes that saves the state of the object in
an easily transmittable form. In this serialized form, the data can be
delivered to another data store (such as an in-memory computing
platform), application, or some other destination.
Data serialization is the process of converting an object into a
stream of bytes to more easily save or transmit it.
The reverse process—constructing a data structure or object from a
series of bytes—is deserialization. The deserialization process
recreates the object, thus making the data easier to read and modify
as a native structure in a programming language.
Serialization and deserialization work together to transform/recreate
data objects to/from a portable format.
Serialization enables us to save the state of an object and recreate
the object in a new location. Serialization encompasses both the
storage of the object and exchange of data. Since objects are
composed of several components, saving or delivering all the parts
typically requires significant coding effort, so serialization is a
standard way to capture the object into a sharable format.
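In Java, for example, an object whose class implements java.io.Serializable can be converted to bytes and recreated with the standard object streams. Below is a minimal sketch; the Customer class and its fields are made up for illustration.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationDemo {
    // The state of this object can be saved and recreated elsewhere
    static class Customer implements Serializable {
        String name;
        int orders;
        Customer(String name, int orders) { this.name = name; this.orders = orders; }
    }

    public static void main(String[] args) throws Exception {
        // Serialization: object -> bytes
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Customer("Asha", 3));
        }

        // Deserialization: bytes -> object, recreated in a "new location"
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            Customer copy = (Customer) in.readObject();
            System.out.println(copy.name + " has " + copy.orders + " orders");
        }
    }
}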
With serialization, we can transfer objects:
  •   Over the wire for messaging use cases
  •   From application to application via web services such as REST
      APIs
  •   Through firewalls (as JSON or XML strings)
  •   Across domains
  •   To other data stores
  •   To identify changes in data over time
  •   While honoring security and user-specific details across
      applications
Why Is Data Serialization Important for Distributed Systems?
In some distributed systems, data and its replicas are stored in
different partitions on multiple cluster members. If data is not
present on the local member, the system will retrieve that data from
another member. This requires serialization for use cases such as:
  •   Adding key/value objects to a map
  •   Putting items into a queue, set, or list
  •   Sending a lambda function to another server
  •   Processing an entry within a map
  •   Locking an object
  •   Sending a message to a topic
What Are Common Languages for Data Serialization?
A number of popular object-oriented programming languages
provide either native support for serialization or have libraries that
add non-native capabilities for serialization to their feature set. Java,
.NET, C++, Node.js, Python, and Go, for example, all either have
native serialization support or integrate with serializer libraries.
Data formats such as JSON and XML are often used as the format for
storing serialized data. Custom binary formats are also used, which
tend to be more space-efficient due to less markup/tagging in the
serialization.
What Is Data Serialization in Big Data?
Big data systems often include technologies and data that are described
as “schemaless.” This means that the managed data in these systems is
not structured in a strict format, as defined by a schema. Serialization
provides several benefits in this type of environment:
  •   Structure. By inserting some schema or criteria for a data
      structure through serialization on read, we can avoid reading
      data that misses mandatory fields, is incorrectly classified, or
      lacks some other quality control requirement.
  •   Portability. Big data comes from a variety of systems and may
      be written in a variety of languages. Serialization can provide
      the necessary uniformity to transfer such data to other
      enterprise systems or applications.
  •   Versioning. Big data is constantly changing. Serialization allows
      us to apply version numbers to objects for lifecycle
      management.
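Hadoop itself uses the Writable interface for compact, portable serialization of the keys and values passed between map and reduce tasks. Below is a minimal sketch of a custom Writable; the PageView class and its fields are illustrative.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A custom value type that Hadoop can serialize between map and reduce tasks
public class PageView implements Writable {
    private String url;
    private long hits;

    public PageView() { }                      // no-arg constructor required by Hadoop

    public PageView(String url, long hits) {
        this.url = url;
        this.hits = hits;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialization: write the fields as a stream of bytes
        out.writeUTF(url);
        out.writeLong(hits);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialization: read the fields back in the same order
        url = in.readUTF();
        hits = in.readLong();
    }

    @Override
    public String toString() {
        return url + "\t" + hits;
    }
}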