Introduction to Hadoop
Certified Big Data & Hadoop Training – DataFlair
Topics
Introduction to Hadoop
Hadoop nodes & daemons
Hadoop Architecture
Hadoop Characteristics
What is Hadoop?
An open source framework that allows distributed processing of large data sets across a cluster of commodity hardware
Open Source
• Source code is freely available
• It may be redistributed and modified
Distributed Processing
• Data is processed in a distributed manner on multiple nodes / servers
• Multiple machines process the data independently
Cluster
• Multiple machines connected together
• Nodes are connected via LAN
Commodity Hardware
• Economical / affordable machines
• Typically low performance hardware
What is Hadoop?
• Open source framework written in Java
• Inspired by Google's MapReduce programming model
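To make the MapReduce programming model concrete, here is a minimal single-machine sketch of a word count in plain Python. It does not use the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, and a real job would run the map and reduce steps on many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

Each map call looks at one line only and each reduce call at one key only, which is what allows a framework to spread these calls across machines.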
Hadoop History
2002: Doug Cutting started working on Nutch
2003-2004: Google published GFS & MapReduce papers
2005: Doug Cutting added DFS & MapReduce in Nutch
2006: Development of Hadoop started as Lucene sub-project
2007: The New York Times converted 4TB of image archives over 100 EC2 instances
2008: Hadoop defeated a supercomputer
2008: Hadoop became top-level project
2008: Facebook launched Hive, SQL support for Hadoop
2009: Doug Cutting joined Cloudera
Hadoop Components
Hadoop consists of three key parts: HDFS (storage), MapReduce (processing) and YARN (resource management)
Hadoop Nodes
• Master Node
• Slave Node
Hadoop Daemons
• Master Node: Resource Manager, NameNode
• Slave Node: Node Manager, DataNode
Basic Hadoop Architecture
[Diagram: Work is split into many Sub Work units]
Hadoop Characteristics
Open Source
• Source code is freely available
• Can be redistributed
• Can be modified
[Diagram: Open Source at the center, surrounded by Free, Transparent, Interoperable, Affordable, Community, No vendor lock-in]
Distributed Processing
• Data is processed in a distributed manner on the cluster
• Multiple nodes in the cluster process data independently
[Diagram: Centralized Processing vs. Distributed Processing]
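The idea of nodes processing data independently can be sketched on one machine, with worker threads standing in for cluster nodes. This is only an illustration under that assumption: the names split_into_chunks, process_chunk and distributed_sum are hypothetical, and Hadoop itself schedules work across separate machines rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, num_nodes):
    """Divide the data set into one chunk per node (round-robin split)."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def process_chunk(chunk):
    """Work done independently by one node: here, sum its own records."""
    return sum(chunk)


def distributed_sum(records, num_nodes=4):
    """Process each chunk on its own worker, then combine the results."""
    chunks = split_into_chunks(records, num_nodes)
    # Each worker thread stands in for one node of the cluster.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # Combine the partial results into the final answer.
    return sum(partial_results)


if __name__ == "__main__":
    print(distributed_sum(list(range(1, 101))))
```

No worker needs to see another worker's chunk, so adding workers (or nodes) does not change the logic, only how the data is divided.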
Fault Tolerance
• Node failures are recovered automatically
• The framework takes care of failures of hardware as well as tasks
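One way a framework can hide task failures from the user is to re-run a failed task on a different node. The sketch below illustrates that idea only; run_with_retries and sample_task are hypothetical names, the node list is made up, and this is not how Hadoop's scheduler is actually implemented.

```python
def run_with_retries(task, nodes, max_attempts=3):
    """Try the task on successive nodes until one attempt succeeds."""
    errors = []
    for attempt in range(max_attempts):
        # Pick the next node in the list for each new attempt.
        node = nodes[attempt % len(nodes)]
        try:
            return task(node)
        except RuntimeError as exc:
            errors.append(f"{node}: {exc}")
    raise RuntimeError("task failed on all attempts: " + "; ".join(errors))


def sample_task(node):
    """Hypothetical task that pretends node1 has a hardware fault."""
    if node == "node1":
        raise RuntimeError("node unavailable")
    return f"done on {node}"


if __name__ == "__main__":
    print(run_with_retries(sample_task, ["node1", "node2", "node3"]))
```

The caller only sees the final result; the failed attempt on node1 is handled inside run_with_retries.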
Reliability
• Data is reliably stored on the
cluster of machines despite
machine failures
• Failure of nodes doesn’t
cause data loss
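Data survives machine failures because every block is kept on more than one node. The following sketch assumes a simple round-robin placement with a replication factor of 3 (the HDFS default); the function names and the placement rule are illustrative assumptions, and real HDFS placement is more involved.

```python
def place_replicas(blocks, nodes, replication=3):
    """Store each block on `replication` distinct nodes (round-robin)."""
    # A block cannot have more distinct copies than there are nodes.
    replication = min(replication, len(nodes))
    storage = {node: set() for node in nodes}
    for index, block in enumerate(blocks):
        for offset in range(replication):
            node = nodes[(index + offset) % len(nodes)]
            storage[node].add(block)
    return storage


def readable_blocks(storage, failed_nodes):
    """Return the blocks still readable after the given nodes fail."""
    alive = [blks for node, blks in storage.items() if node not in failed_nodes]
    return set().union(*alive) if alive else set()


if __name__ == "__main__":
    storage = place_replicas(["b1", "b2", "b3", "b4"], ["n1", "n2", "n3", "n4"])
    print(sorted(readable_blocks(storage, {"n1", "n2"})))
```

With three copies per block, any two nodes can fail in this example without losing a block; losing three nodes at once can still lose data.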
High Availability
• Data is highly available and
accessible despite hardware
failure
• There will be no downtime for
end user application due to
data
Scalability
• Vertical Scalability – New
hardware can be added to the
nodes
• Horizontal Scalability – New
nodes can be added on the fly
Economic
• No need to purchase costly licenses
• No need to purchase costly hardware
Open Source + Commodity Hardware = Economic
Easy to Use
• Distributed computing challenges are handled by the framework
• Clients just need to concentrate on business logic
Data Locality
• Move computation to data instead of data to computation
• Data is processed on the nodes where it is stored
[Diagram: Storage Servers sending Data to App Servers that run the Algorithm, contrasted with Servers that each hold Data and run the Algo locally]
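Moving computation to the data amounts to a scheduling preference: run each task on a node that already stores the block it needs, and fall back to another node only when no such node is free. The sketch below shows that preference under simplified assumptions; assign_tasks, the block and node names, and the slot counts are all hypothetical and do not reflect Hadoop's actual scheduler.

```python
def assign_tasks(block_locations, free_slots):
    """Assign each block's task to a node, preferring nodes that store it.

    block_locations: dict mapping block id -> list of nodes holding a copy
    free_slots: dict mapping node -> number of tasks it can still accept
    Returns a dict mapping block id -> (chosen node, True if data-local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block, holders in block_locations.items():
        # First choice: a node that holds the block and has a free slot.
        local = [n for n in holders if slots.get(n, 0) > 0]
        # Fallback: any node with a free slot (data must travel to it).
        remote = [n for n, free in slots.items() if free > 0]
        candidates = local or remote
        if not candidates:
            raise RuntimeError(f"no free slot for {block}")
        node = candidates[0]
        slots[node] -= 1
        assignment[block] = (node, node in holders)
    return assignment


if __name__ == "__main__":
    locations = {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n1"]}
    print(assign_tasks(locations, {"n1": 2, "n2": 1, "n3": 1}))
```

In the usage example, the tasks for b1 and b2 run where their data is stored, while b3 has to run remotely because n1 has no free slot left.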
Summary
• Every day we generate 2.3 trillion GBs of data
• Hadoop handles huge volumes of data efficiently
• Hadoop uses the power of distributed computing
• HDFS & YARN are two main components of Hadoop
• It is highly fault tolerant, reliable & available