Apache Hadoop
Open-source software framework designed for storage and processing of large-scale data on clusters of commodity hardware
Created by Doug Cutting and Mike Cafarella in 2005.
Cutting named the program after his son’s
toy elephant.
Uses for Hadoop
Data-intensive text processing
Assembly of large genomes
Graph mining
Machine learning and data mining
Large scale social network analysis
Who Uses Hadoop?
How much data?
Facebook
◦ 500 TB per day
Yahoo
◦ Over 170 PB
eBay
◦ Over 6 PB
Getting the data to the processors becomes
the bottleneck
Requirements for Hadoop
Must support partial failure
Must be scalable
Partial Failures
Failure of a single component must not cause the failure of the entire system, only a degradation of application performance
Failure should not result in the loss of
any data
Component Recovery
If a component fails, it should be able to
recover without restarting the entire system
Component failure or recovery during a job
must not affect the final output
Scalability
Increasing resources should increase load
capacity
Increasing the load on the system should result in a graceful decline in performance for all jobs, not in system failure
The four key characteristics of Hadoop are:
Economical: Its systems are highly
economical as ordinary computers can be
used for data processing.
Reliable: It is reliable as it stores copies of
the data on different machines and is
resistant to hardware failure.
Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes help in scaling up the framework.
Flexible: It is flexible and can store large amounts of both structured and unstructured data.
Difference between Traditional Database
System and Hadoop
The comparison below will help you distinguish between a Traditional Database System and Hadoop.
Traditional Database System: Data is stored in a central location and sent to the processor at runtime.
Hadoop: The program goes to the data. Hadoop initially distributes the data to multiple systems and later runs the computation wherever the data is located.

Traditional Database System: Traditional Database Systems cannot be used to process and store a significant amount of data (big data).
Hadoop: Hadoop works better when the data size is big. It can process and store a large amount of data efficiently and effectively.

Traditional Database System: Traditional RDBMS is used to manage only structured and semi-structured data. It cannot be used to control unstructured data.
Hadoop: Hadoop can process and store a variety of data, whether it is structured or unstructured.
The Hadoop Ecosystem
Hadoop Common: Contains libraries and other modules
HDFS: Hadoop Distributed File System
Hadoop YARN: Yet Another Resource Negotiator
Hadoop MapReduce: A programming model for large-scale data processing
The Hadoop ecosystem is continuously growing to meet the needs of Big Data. It comprises the following twelve components:
HDFS(Hadoop Distributed file system)
HBase
Sqoop
Flume
Spark
Hadoop MapReduce
Pig
Impala
Hive
Cloudera Search
Oozie
Hue
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming based Data
Processing
Spark: In-Memory data processing
PIG, HIVE: Query based processing of data
services
HBase: NoSQL Database
Mahout, Spark MLlib: Machine learning algorithm libraries
Solr, Lucene: Searching and indexing
Zookeeper: Cluster management
Oozie: Job Scheduling
HDFS (Hadoop Distributed File System):
HDFS is the primary or major component of
Hadoop ecosystem and is responsible for
storing large data sets of structured or
unstructured data across various nodes and
thereby maintaining the metadata in the form
of log files.
The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, fault-tolerant, reliable, and cost-efficient data storage for Big Data.
HDFS is a distributed file system that runs on commodity hardware.
Hadoop provides shell-like commands to interact directly with HDFS, as sketched below.
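For example, a minimal sketch of driving these HDFS shell commands from Python (assuming a configured Hadoop client with the hdfs binary on the PATH; the file and directory names are illustrative):

import subprocess

# Minimal sketch: run HDFS shell-like commands from Python.
# Assumes a configured Hadoop client with the hdfs binary on the PATH;
# sample.txt and /user/demo are illustrative names.
def hdfs(*args):
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

print(hdfs("-mkdir", "-p", "/user/demo"))        # create a directory in HDFS
print(hdfs("-put", "sample.txt", "/user/demo"))  # copy a local file into HDFS
print(hdfs("-ls", "/user/demo"))                 # list the directory
print(hdfs("-cat", "/user/demo/sample.txt"))     # print the file contents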
HDFS consists of two core components:
1. NameNode
2. DataNode
HDFS uses a master/slave architecture in which
one device (master) termed as NameNode controls
one or more other devices (slaves) termed as
DataNode.
◦ It breaks data/files into small blocks (typically 64 MB or 128 MB each), stores them on DataNodes, and replicates each block on other nodes to achieve fault tolerance.
The NameNode keeps track of the blocks written to the DataNodes.
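To make the block arithmetic concrete, a minimal sketch (the block size and replication factor below are assumed defaults, not values read from any cluster):

import math

# Minimal sketch of how HDFS splits a file into fixed-size blocks and
# replicates each block. The block size and replication factor are
# assumptions (HDFS commonly defaults to 128 MB blocks, replication 3).
BLOCK_SIZE_MB = 128
REPLICATION_FACTOR = 3

def block_layout(file_size_mb):
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    total_replicas = num_blocks * REPLICATION_FACTOR
    return num_blocks, total_replicas

blocks, replicas = block_layout(500)  # a hypothetical 500 MB file
print(f"500 MB file -> {blocks} blocks, {replicas} block replicas stored")
# 500 MB file -> 4 blocks, 12 block replicas stored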
i. NameNode
It is also known as the Master node. The NameNode does not store the actual data or dataset. It stores metadata, i.e., the number of blocks, their locations, the rack and DataNode on which the data is stored, and other details. It consists of files and directories.
Tasks of HDFS NameNode
Manages the file system namespace.
Regulates clients' access to files.
Executes file system operations such as naming, closing, and opening files and directories.
ii. DataNode
It is also known as the Slave node.
The HDFS DataNode is responsible for storing the actual data in HDFS.
The DataNode performs read and write operations as per the requests of clients.
Each block replica on a DataNode consists of two files on the local file system: the first file holds the data, and the second records the block's metadata, which includes checksums for the data.
At startup, each DataNode connects to its corresponding NameNode and performs a handshake. The handshake verifies the namespace ID and the software version of the DataNode. If a mismatch is found, the DataNode shuts down automatically.
Tasks of HDFS DataNode
The DataNode performs operations like block replica creation, deletion, and replication according to the instructions of the NameNode.
The DataNode manages the data storage of the system.
YARN(Yet Another Resource Negotiator):
As the name implies, YARN helps to manage the resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager
The Resource Manager has the privilege of allocating resources for the applications in the system, whereas Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine and later acknowledge the Resource Manager.
The Application Manager works as an interface between the Resource Manager and the Node Manager and performs negotiations as per the requirements of the two.
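As a minimal sketch, the Resource Manager's view of registered Node Managers and running applications can be inspected with the standard yarn command-line tool (assuming a running cluster with the yarn binary on the PATH):

import subprocess

# Minimal sketch: inspect YARN from the command line.
# Assumes a running cluster with the yarn binary on the PATH.
def yarn(*args):
    result = subprocess.run(["yarn", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

print(yarn("node", "-list"))         # Node Managers registered with the Resource Manager
print(yarn("application", "-list"))  # applications currently tracked by the Resource Manager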
MapReduce
Hadoop MapReduce is the core Hadoop ecosystem
component which provides data processing.
MapReduce is a software framework for easily
writing applications that process the vast amount of
structured and unstructured data stored in the
Hadoop Distributed File system.
MapReduce programs are parallel in nature and are therefore very useful for performing large-scale data analysis using multiple machines in the cluster. This parallel processing improves the speed and reliability of the cluster.
Working of MapReduce
Hadoop Ecosystem component ‘MapReduce’ works
by breaking the processing into two phases:
Map phase
Reduce phase
Each phase has key-value pairs as input and output.
In addition, the programmer also specifies two functions: a map function and a reduce function.
1. Map() performs sorting and filtering of the data, thereby organizing it into groups. Map generates key-value pairs as its result, which are later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
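A minimal in-memory sketch of this flow (plain Python rather than the Hadoop API, using a hypothetical maximum-temperature-per-city example):

from collections import defaultdict

# Minimal sketch of the two MapReduce phases in plain Python.
# The input records, map_fn, and reduce_fn are illustrative assumptions.

def map_fn(record):
    """Map phase: emit (key, value) pairs for one input record."""
    city, temperature = record
    yield city, temperature

def reduce_fn(key, values):
    """Reduce phase: aggregate all values seen for one key."""
    return key, max(values)

records = [("Pune", 31), ("Delhi", 39), ("Pune", 35), ("Delhi", 41)]

# Map: apply map_fn to every record.
mapped = [pair for record in records for pair in map_fn(record)]

# Shuffle and sort: group all values by key.
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

# Reduce: call reduce_fn once per unique key.
results = [reduce_fn(key, values) for key, values in groups.items()]
print(results)  # [('Delhi', 41), ('Pune', 35)]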
The Mapper
Reads data as key/value pairs
◦ The key is often discarded
Outputs zero or more key/value pairs
Shuffle and Sort
Output from the mapper is sorted by key
All values with the same key are guaranteed to go to
the same machine
The Reducer
Called once for each unique key
Gets a list of all values associated with a key as input
The reducer outputs zero or more final key/value
pairs
◦ Usually just one output per input key
MapReduce: Word Count
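As a sketch, the classic word count can be written as a Hadoop Streaming-style mapper and reducer in Python (the file names mapper.py and reducer.py are illustrative; Streaming delivers the mapper output to the reducer sorted by key):

#!/usr/bin/env python3
# mapper.py -- word count mapper sketch in Hadoop Streaming style.
# Reads lines from stdin and emits one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- word count reducer sketch in Hadoop Streaming style.
# Input arrives sorted by key, so counts for one word are contiguous.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")

A quick local test without a cluster: cat input.txt | python3 mapper.py | sort | python3 reducer.py. On a real cluster the same two scripts would typically be submitted through the Hadoop Streaming jar (the exact jar path depends on the installation).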
Features of MapReduce
Simplicity – MapReduce jobs are easy to run. Applications can be written in any language, such as Java, C++, and Python.
Scalability – MapReduce can process petabytes
of data.
Speed – By means of parallel processing, problems that take days to solve can be solved in hours or minutes by MapReduce.
Fault Tolerance – MapReduce takes care of failures. If one copy of the data is unavailable, another machine has a copy of the same key-value pairs, which can be used to solve the same subtask.
Other Tools
Hive
◦ Hadoop processing with SQL
Pig
◦ Hadoop processing with scripting
Cascading
◦ Pipe and Filter processing model
HBase
◦ Database model built on top of Hadoop
Flume
◦ Designed for large scale data movement
Matrix Multiplication
https://www.youtube.com/watch?v=RIMA4rvNpI8
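The linked video walks through matrix multiplication with MapReduce; a minimal in-memory sketch of the one-pass algorithm (plain Python rather than the Hadoop API, with small illustrative matrices):

from collections import defaultdict

# Sketch of one-pass MapReduce matrix multiplication in plain Python.
# A is m x n, B is n x p; the small matrices below are illustrative.
# The map phase emits, for every output cell (i, k), the elements of A
# and B that contribute to it; the reduce phase multiplies matching
# pairs and sums them.
A = [[1, 2],
     [3, 4]]          # m x n
B = [[5, 6],
     [7, 8]]          # n x p
m, n, p = 2, 2, 2

def map_phase():
    for i in range(m):
        for j in range(n):
            for k in range(p):
                yield (i, k), ("A", j, A[i][j])
    for j in range(n):
        for k in range(p):
            for i in range(m):
                yield (i, k), ("B", j, B[j][k])

# Shuffle and sort: group all values by output cell (i, k).
groups = defaultdict(list)
for key, value in map_phase():
    groups[key].append(value)

def reduce_phase(key, values):
    a = {j: v for tag, j, v in values if tag == "A"}
    b = {j: v for tag, j, v in values if tag == "B"}
    return key, sum(a[j] * b[j] for j in a)

C = dict(reduce_phase(k, v) for k, v in groups.items())
print([[C[(i, k)] for k in range(p)] for i in range(m)])
# [[19, 22], [43, 50]]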