Apache Hadoop
Open-source software framework designed for storage and processing of large-scale data on clusters of commodity hardware
Created by Doug Cutting and Mike Cafarella in 2005.
Cutting named the program after his son’s
toy elephant.
Uses for Hadoop
Data-intensive text processing
Assembly of large genomes
Graph mining
Machine learning and data mining
Large scale social network analysis
Who Uses Hadoop?
How much data?
Facebook
◦ 500 TB per day
Yahoo
◦ Over 170 PB
eBay
◦ Over 6 PB
Getting the data to the processors becomes
the bottleneck
Requirements for Hadoop
Must support partial failure
Must be scalable
Partial Failures
Failure of a single component must not cause the failure of the entire system, only a degradation of application performance
Failure should not result in the loss of
any data
Component Recovery
If a component fails, it should be able to
recover without restarting the entire system
Component failure or recovery during a job
must not affect the final output
Scalability
Increasing resources should increase load
capacity
Increasing the load on the system should result in a graceful decline in performance for all jobs, not in system failure
The four key characteristics of Hadoop are:
Economical: Its systems are highly
economical as ordinary computers can be
used for data processing.
Reliable: It is reliable as it stores copies of
the data on different machines and is
resistant to hardware failure.
Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes help in scaling up the framework.
Flexible: It is flexible and can store large amounts of both structured and unstructured data.
Difference between Traditional Database
System and Hadoop
The comparison below will help you distinguish between a Traditional Database System and Hadoop.
Traditional Database System: Data is stored in a central location and sent to the processor at runtime.
Hadoop: The program goes to the data. Hadoop initially distributes the data to multiple systems and later runs the computation wherever the data is located.

Traditional Database System: Traditional Database Systems cannot be used to process and store a significant amount of data (big data).
Hadoop: Hadoop works better when the data size is big. It can process and store a large amount of data efficiently and effectively.

Traditional Database System: Traditional RDBMS is used to manage only structured and semi-structured data. It cannot be used to control unstructured data.
Hadoop: Hadoop can process and store a variety of data, whether it is structured or unstructured.
The Hadoop Ecosystem
Hadoop Common: Contains libraries and other modules
HDFS: Hadoop Distributed File System
Hadoop YARN: Yet Another Resource Negotiator
Hadoop MapReduce: A programming model for large-scale data processing
The Hadoop ecosystem is continuously growing to meet the needs of Big Data. It comprises the following twelve components:
HDFS(Hadoop Distributed file system)
HBase
Sqoop
Flume
Spark
Hadoop MapReduce
Pig
Impala
Hive
Cloudera Search
Oozie
Hue
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming based Data
Processing
Spark: In-Memory data processing
PIG, HIVE: Query based processing of data
services
HBase: NoSQL Database
Mahout, Spark MLlib: Machine learning algorithm libraries
Solr, Lucene: Searching and indexing
Zookeeper: Cluster management
Oozie: Job Scheduling
HDFS (Hadoop Distributed File System):
HDFS is the primary or major component of
Hadoop ecosystem and is responsible for
storing large data sets of structured or
unstructured data across various nodes and
thereby maintaining the metadata in the form
of log files.
The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, fault-tolerant, reliable, and cost-efficient data storage for Big Data.
HDFS is a distributed file system that runs on commodity hardware.
Hadoop provides shell-like commands to interact directly with HDFS, as sketched below.
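For example, a minimal sketch of driving these HDFS shell commands from Python (assuming a configured Hadoop client with the hdfs binary on the PATH; the file and directory names are illustrative):

import subprocess

# Minimal sketch: run HDFS shell-like commands from Python.
# Assumes a configured Hadoop client with the hdfs binary on the PATH;
# sample.txt and /user/demo are illustrative names.
def hdfs(*args):
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

print(hdfs("-mkdir", "-p", "/user/demo"))        # create a directory in HDFS
print(hdfs("-put", "sample.txt", "/user/demo"))  # copy a local file into HDFS
print(hdfs("-ls", "/user/demo"))                 # list the directory
print(hdfs("-cat", "/user/demo/sample.txt"))     # print the file contents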
HDFS consists of two core components:
1. NameNode
2. DataNode
HDFS uses a master/slave architecture in which
one device (master) termed as NameNode controls
one or more other devices (slaves) termed as
DataNode.
◦ It breaks data/files into small blocks (typically 64 MB or 128 MB each), stores them on DataNodes, and replicates each block on other nodes to achieve fault tolerance.
The NameNode keeps track of the blocks written to the DataNodes.
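To make the block arithmetic concrete, a minimal sketch (the block size and replication factor below are assumed defaults, not values read from any cluster):

import math

# Minimal sketch of how HDFS splits a file into fixed-size blocks and
# replicates each block. The block size and replication factor are
# assumptions (HDFS commonly defaults to 128 MB blocks, replication 3).
BLOCK_SIZE_MB = 128
REPLICATION_FACTOR = 3

def block_layout(file_size_mb):
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    total_replicas = num_blocks * REPLICATION_FACTOR
    return num_blocks, total_replicas

blocks, replicas = block_layout(500)  # a hypothetical 500 MB file
print(f"500 MB file -> {blocks} blocks, {replicas} block replicas stored")
# 500 MB file -> 4 blocks, 12 block replicas stored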
i. NameNode
It is also known as the Master node. The NameNode does not store the actual data or dataset. It stores metadata, i.e., the number of blocks, their locations, the rack and DataNode on which the data is stored, and other details. It consists of files and directories.
Tasks of HDFS NameNode
Manages the file system namespace.
Regulates clients' access to files.
Executes file system operations such as naming, closing, and opening files and directories.
ii. DataNode
It is also known as the Slave node.
The HDFS DataNode is responsible for storing the actual data in HDFS.
The DataNode performs read and write operations as per the requests of clients.
Each block replica on a DataNode consists of two files on the local file system: the first file holds the data, and the second records the block's metadata, which includes checksums for the data.
At startup, each DataNode connects to its corresponding NameNode and performs a handshake. The handshake verifies the namespace ID and the software version of the DataNode. If a mismatch is found, the DataNode shuts down automatically.
Tasks of HDFS DataNode
The DataNode performs operations like block replica creation, deletion, and replication according to the instructions of the NameNode.
The DataNode manages the data storage of the system.
YARN(Yet Another Resource Negotiator):
As the name implies, YARN helps to manage the resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager
The Resource Manager has the privilege of allocating resources for the applications in the system, whereas Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine and later acknowledge the Resource Manager.
The Application Manager works as an interface between the Resource Manager and the Node Manager and performs negotiations as per the requirements of the two.
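As a minimal sketch, the Resource Manager's view of registered Node Managers and running applications can be inspected with the standard yarn command-line tool (assuming a running cluster with the yarn binary on the PATH):

import subprocess

# Minimal sketch: inspect YARN from the command line.
# Assumes a running cluster with the yarn binary on the PATH.
def yarn(*args):
    result = subprocess.run(["yarn", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

print(yarn("node", "-list"))         # Node Managers registered with the Resource Manager
print(yarn("application", "-list"))  # applications currently tracked by the Resource Manager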
MapReduce
Hadoop MapReduce is the core Hadoop ecosystem
component which provides data processing.
MapReduce is a software framework for easily
writing applications that process the vast amount of
structured and unstructured data stored in the
Hadoop Distributed File system.
MapReduce programs are parallel in nature and are therefore very useful for performing large-scale data analysis using multiple machines in the cluster. This parallel processing improves the speed and reliability of the cluster.
Working of MapReduce
Hadoop Ecosystem component ‘MapReduce’ works
by breaking the processing into two phases:
Map phase
Reduce phase
Each phase has key-value pairs as input and output.
In addition, the programmer also specifies two functions: a map function and a reduce function.
1. Map() performs sorting and filtering of the data, thereby organizing it into groups. Map generates key-value pairs as its result, which are later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
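A minimal in-memory sketch of this flow (plain Python rather than the Hadoop API, using a hypothetical maximum-temperature-per-city example):

from collections import defaultdict

# Minimal sketch of the two MapReduce phases in plain Python.
# The input records, map_fn, and reduce_fn are illustrative assumptions.

def map_fn(record):
    """Map phase: emit (key, value) pairs for one input record."""
    city, temperature = record
    yield city, temperature

def reduce_fn(key, values):
    """Reduce phase: aggregate all values seen for one key."""
    return key, max(values)

records = [("Pune", 31), ("Delhi", 39), ("Pune", 35), ("Delhi", 41)]

# Map: apply map_fn to every record.
mapped = [pair for record in records for pair in map_fn(record)]

# Shuffle and sort: group all values by key.
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

# Reduce: call reduce_fn once per unique key.
results = [reduce_fn(key, values) for key, values in groups.items()]
print(results)  # [('Delhi', 41), ('Pune', 35)]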
The Mapper
Reads data as key/value pairs
◦ The key is often discarded
Outputs zero or more key/value pairs
Shuffle and Sort
Output from the mapper is sorted by key
All values with the same key are guaranteed to go to
the same machine
The Reducer
Called once for each unique key
Gets a list of all values associated with a key as input
The reducer outputs zero or more final key/value
pairs
◦ Usually just one output per input key
MapReduce: Word Count
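As a sketch, the classic word count can be written as a Hadoop Streaming-style mapper and reducer in Python (the file names mapper.py and reducer.py are illustrative; Streaming delivers the mapper output to the reducer sorted by key):

#!/usr/bin/env python3
# mapper.py -- word count mapper sketch in Hadoop Streaming style.
# Reads lines from stdin and emits one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- word count reducer sketch in Hadoop Streaming style.
# Input arrives sorted by key, so counts for one word are contiguous.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")

A quick local test without a cluster: cat input.txt | python3 mapper.py | sort | python3 reducer.py. On a real cluster the same two scripts would typically be submitted through the Hadoop Streaming jar (the exact jar path depends on the installation).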
Features of MapReduce
Simplicity – MapReduce jobs are easy to run. Applications can be written in any language, such as Java, C++, and Python.
Scalability – MapReduce can process petabytes
of data.
Speed – By means of parallel processing, problems that take days to solve can be solved in hours or minutes by MapReduce.
Fault Tolerance – MapReduce takes care of failures. If one copy of the data is unavailable, another machine has a copy of the same key-value pairs, which can be used to solve the same subtask.
Other Tools
Hive
◦ Hadoop processing with SQL
Pig
◦ Hadoop processing with scripting
Cascading
◦ Pipe and Filter processing model
HBase
◦ Database model built on top of Hadoop
Flume
◦ Designed for large scale data movement
Matrix Multiplication
https://www.youtube.com/watch?v=RIMA4rvNpI8
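The linked video walks through matrix multiplication with MapReduce; a minimal in-memory sketch of the one-pass algorithm (plain Python rather than the Hadoop API, with small illustrative matrices):

from collections import defaultdict

# Sketch of one-pass MapReduce matrix multiplication in plain Python.
# A is m x n, B is n x p; the small matrices below are illustrative.
# The map phase emits, for every output cell (i, k), the elements of A
# and B that contribute to it; the reduce phase multiplies matching
# pairs and sums them.
A = [[1, 2],
     [3, 4]]          # m x n
B = [[5, 6],
     [7, 8]]          # n x p
m, n, p = 2, 2, 2

def map_phase():
    for i in range(m):
        for j in range(n):
            for k in range(p):
                yield (i, k), ("A", j, A[i][j])
    for j in range(n):
        for k in range(p):
            for i in range(m):
                yield (i, k), ("B", j, B[j][k])

# Shuffle and sort: group all values by output cell (i, k).
groups = defaultdict(list)
for key, value in map_phase():
    groups[key].append(value)

def reduce_phase(key, values):
    a = {j: v for tag, j, v in values if tag == "A"}
    b = {j: v for tag, j, v in values if tag == "B"}
    return key, sum(a[j] * b[j] for j in a)

C = dict(reduce_phase(k, v) for k, v in groups.items())
print([[C[(i, k)] for k in range(p)] for i in range(m)])
# [[19, 22], [43, 50]]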