
International Journal of Scientific Engineering and Technology (ISSN: 2277-1581)
Volume No. 3, Issue No. 1, pp. 40-42, 1 Jan 2014

Big Data Problems: Understanding Hadoop Framework


G S Aditya Rao¹, Palak Pandey²
¹B.Tech in Computer Sciences Engineering, ²B.Tech in Geo-Informatics Engineering
University of Petroleum and Energy Studies
adyraj@rediff.com

ABSTRACT: The IT industry has seen a revolution, migrating from standardization to integration to virtualization to automation and now to the cloud. The industry is now set to revolve around the commercialization of data analytics and business intelligence. Data is being generated in every field and every industry sector, so the volume, variety and velocity of data have become extremely high. Handling such enormous data is not possible with traditional databases: the problems of storage, computation, low network bandwidth and poor fault tolerance led to the emergence of big data technologies. In this paper we focus on the backend architecture and working of the parts of the Hadoop framework: MapReduce for the computation and analytics section, and the Hadoop Distributed File System (HDFS) for the storage section.

Keywords: Hadoop, HDFS, MapReduce, Jobtracker, Tasktracker, Namenode, Datanode.

I. Introduction
With the industrial revolution of data, a tremendous amount of data is generated. With the emergence of new companies, data that was once confined to a few gigabytes has now grown past petabytes into zettabytes. Technology is so pervasive that we are able to reason about human behaviour through analysis and prediction on the data generated. Data is generated by machine sensors, GPS, billing and transactions. New data sources have emerged so quickly that storage capabilities have fallen short. Traditional data warehouses are limited to the RDBMS concept, which handles mostly structured data; now that data is being generated in all directions, flexible unstructured data stores such as NoSQL databases have become the industry's preference. The amount of unstructured data generated can be judged from the fact that every month about 1 lakh new users register on Facebook, 5 billion mobile phones were in use in 2010, and 30 billion new pieces of content are created or shared on Facebook. "Big data" refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyse. The industry is now focused on making sense of these figures by analysis and prediction of different parameters. Data warehouses also remain an important part of analytics. Big data can be applied to both structured and unstructured data, that is, to both analytical DBMSs and NoSQL databases. Big data has also proved to be an asset when analysing data in motion, or stream processing. Most large firms are generating huge amounts of data, and with the coming of cloud models that incorporate sound data storage, companies are processing enormous data. This huge generated data is not only a hardware data storage problem but also a problem of file system design and implementation, I/O processing and scalability. To meet the needs of the data being generated, storage capacity has improved significantly, but HDD data access speed has not improved as much. Thus the main problems with this emergence of data are, first, where to store this enormous data (the storage capacity problem) and, second, how to make business sense out of it through analytics (the computation problem). Other important factors are the network bandwidth and reliability. Reliability refers to the response when an unfavourable condition materializes that could lead to the loss of important data and, in turn, to flaws in the analysis; a backup of the stored data should therefore always be present to cope with the risk of data loss. Another major concern is network bandwidth.
Thus storage, computation, reliability and bandwidth are some of the big data problems the modern IT industry is facing. The Hadoop framework can provide these features, and additional ones, that can turn out to be an asset for the industry. In this paper we discuss in detail the methodology by which the Hadoop framework can help us address the above challenges.

ARCHITECTURE AND FUNCTIONING

MapReduce: The analysis part of the Hadoop framework is managed by the MRv1 framework. It is a programming model developed by Google and works on the principle of divide, sort, merge and join. It was built with batch processing and parallel processing in mind, and it is a natural fit for ad-hoc queries, web search indexing and log processing. From a business aspect, the main objective of MapReduce is deep data analytics, on which prediction is based by observing patterns. To analyse large unstructured datasets it comprises two functions, the "Mappers" and the "Reducers", both of which are user defined. The model is based on parallel programming, and the datasets are processed in parallel on the different nodes of the cluster; map and reduce functions are also available in languages such as LISP. Apart from the map and reduce functions, the framework also comprises the partitioner and the combiner functions. Users of MapReduce are allowed to specify the number of reducer tasks they desire, and the data is partitioned among these tasks by the partitioning function. The combiner function is executed on every node that performs a map function; it merges the local map output on disk before moving it over the network. A minimal word-count style mapper and reducer, illustrating these user-defined functions, are sketched below.
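As an illustration (not taken from the paper) of user-defined map and reduce functions, the following sketch follows the standard Hadoop word-count pattern; the class and field names (TokenizerMapper, IntSumReducer) are illustrative. The mapper emits a {word, 1} pair for each token, and the reducer sums the counts that arrive grouped by key.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// User-defined map function: emits a {word, 1} pair for every token in a line of input.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);        // intermediate {key, value} pair
        }
    }
}

// User-defined reduce function: all values for the same key arrive together and are summed.
// (Shown in the same listing for brevity; in practice each class sits in its own file.)
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        result.set(sum);
        context.write(key, result);          // aggregated {word, total} pair
    }
}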


The mechanism of MapReduce is mainly divide and conquer. The main program is initiated and a dataset is taken as input; according to the job configuration, the master program initiates the various nodes for map and reduce purposes. After that, the input reader is initialized to stream the data from the dataset: it breaks the file into many smaller blocks and maps these blocks to the nodes assigned as mapper nodes. As noted above, the map and reduce functions are user defined, so on the mapper nodes the user's map function is executed and {key, value} pairs are generated from it. The results generated by the mappers are not simply written to disk; some sorting is done for efficiency reasons. Each map task has a circular memory buffer in which it stores its output; by default its capacity is 100 MB, and this size can be changed. When the buffer reaches its threshold of 80% full, a background thread starts to spill its contents to disk, and the map blocks until the spill is complete. Before writing to disk, the generated pairs are sorted, and then the already initiated "Reducer" nodes come into action. The sorted data is routed to the reducer nodes by the partitioner function; each reducer collects the items with the same key, applies the user-given reduce function and aggregates the result as a collective entity. The partitioner and combiner functions are applied to the sorted output so that there is less data to be written to disk and sent over the network. The produced result is collected by the output writer, and thus the parallel processing terminates. A sketch of the job configuration that wires these pieces together is given below.
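A job driver ties the mapper, combiner, partitioner and reducers together. The sketch below assumes the TokenizerMapper and IntSumReducer classes from the previous listing; the sort-buffer and spill-threshold property names shown (mapreduce.task.io.sort.mb, mapreduce.map.sort.spill.percent) are the Hadoop 2.x names for the 100 MB buffer and 80% threshold described above, and older MRv1 releases use different names (io.sort.mb, io.sort.spill.percent).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.task.io.sort.mb", "100");          // circular map-output buffer size (MB)
        conf.set("mapreduce.map.sort.spill.percent", "0.80");  // spill to disk when the buffer is 80% full

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);             // combiner merges map output locally before it crosses the network
        job.setPartitionerClass(HashPartitioner.class);        // partitioner routes each key to one of the reduce tasks
        job.setReducerClass(IntSumReducer.class);
        job.setNumReduceTasks(2);                              // user-specified number of reduce tasks
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dataset in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // results are written back to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}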
The architecture of MapReduce consists of a Jobtracker and multiple Tasktrackers. The Jobtracker acts as the master and the Tasktrackers act as the slaves. The Jobtracker sits on the Namenode and a Tasktracker sits on each corresponding Datanode. When a task is submitted to the Namenode, the Jobtracker is informed about the input; via the heartbeat protocol it checks for free slots in the Tasktrackers and assigns map tasks to the free ones. The map tasks read data from the splits using the record reader and input format and invoke the map function, and accordingly key-value pairs are generated in the memory buffer. Once a Tasktracker is done with its map task, the memory buffer is flushed to the local disk within the map node along with an index of the key-value pairs; the map nodes report to the Jobtracker, and the Jobtracker starts notifying the reduce-task nodes of the cluster for the next step, which is the reduce task. The concerned reduce nodes download the files (index and key-value pairs) from the respective map nodes, read the downloaded files, invoke the user-defined reduce function, and produce the aggregated key-value pairs. Each reduce task is single threaded. The output of each reduce task is written to a temporary file in HDFS; when all reduce tasks are finished, the temporary file is automatically renamed to the final file name.
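For context, in classic MRv1 the Jobtracker address and the number of map and reduce slots offered by each Tasktracker are part of the cluster configuration. The snippet below is a hedged illustration using the historical MRv1 property names (mapred.job.tracker, mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum); the host name and slot counts are placeholders, and YARN-based deployments configure this differently.

import org.apache.hadoop.conf.Configuration;

// Hedged illustration of classic MRv1 daemon settings. In a real cluster these
// properties live in mapred-site.xml on the relevant nodes rather than in client code;
// the host name and slot counts below are placeholders.
public class Mrv1SettingsSketch {
    public static Configuration clusterSettings() {
        Configuration conf = new Configuration();
        conf.set("mapred.job.tracker", "jobtracker.example.com:8021");  // where clients find the Jobtracker (master)
        conf.set("mapred.tasktracker.map.tasks.maximum", "4");          // map slots offered by each Tasktracker (slave-side setting)
        conf.set("mapred.tasktracker.reduce.tasks.maximum", "2");       // reduce slots offered by each Tasktracker (slave-side setting)
        return conf;
    }
}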
secondary Namenodes can be vitial.The working of HDFS is
HDFS: In traditional disk blocks, the maximum data that could be stored or read at a time was 512 bytes; later came file system blocks, which could accommodate a few kilobytes. With the current volume of data it is next to impossible to store or analyse terabytes or zettabytes of data over a distributed network using such traditional systems. The Hadoop Distributed File System is Hadoop's data storage framework, implemented on commodity hardware. HDFS blocks can accommodate 64-128 MB each, and the block abstraction keeps HDFS simple: for example, replication is done at the block level rather than the file level. HDFS is created keeping MapReduce in mind. It is a distributed file system designed to store enormously large datasets while providing high-throughput access to them. An HDFS cluster spans many racks mounting thousands of servers (nodes), so the probability of hardware failure somewhere in the cluster is very high; the Hadoop design should therefore be resistant to faults and provide high throughput for data streaming (a short sketch of writing a file with an explicit block size follows the list below).
Characteristics or goals of HDFS:
1. High fault tolerance.
2. Moving computation is better than moving data.
3. Able to handle large datasets.
4. Cross-platform compatibility.
5. High throughput for streaming data.
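As a minimal sketch (not from the paper) of how a client writes a file into HDFS with an explicit block size and replication factor, the snippet below uses the public FileSystem API; the 128 MB block size, replication factor of 3, file path and fs.defaultFS address are illustrative assumptions.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:9000"); // illustrative Namenode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/events.log");                     // illustrative path

        // Create the file with a 4 KB client buffer, replication factor 3 and a
        // 128 MB block size; HDFS splits the file into such blocks and places
        // them on different Datanodes.
        long blockSize = 128L * 1024 * 1024;
        try (FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, blockSize)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}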


Architecture: HDFS runs on the GNU/Linux operating system and is built in Java. HDFS works on the principle of a master/slave architecture. It consists of a Namenode, which is unique for the whole cluster, and a secondary Namenode, which acts as a checkpoint; all the remaining nodes of the cluster are Datanodes, which act as the slaves. The Namenode acts as the master, instructing the Datanodes to perform operations. When a large dataset is to be stored in HDFS, the large file is split into numerous blocks; blocks of the same file are stored on different nodes of the cluster, and each block is stored as a file on the local file system of its Datanode. HDFS maintains a single namespace for the distributed file system, and this namespace is maintained in the Namenode. Since the blocks are distributed over the cluster and the Datanodes store them in their local file systems, the file system tree and the metadata of the directories and files in that tree are also maintained in the Namenode. This information is dynamic in nature. The Namenode keeps two files for storing all of this data: the FsImage and the edit log. The FsImage stores the block-to-file mapping and the file system properties, whereas the edit log records all changes made to the file system; every modification to a block is recorded in the edit log. For the proper working of HDFS the most important component is the master, the Namenode: if the Namenode fails, HDFS becomes unusable, so it should remain functional throughout; if it fails there could be huge data loss, since HDFS is used for massive datasets. Though we cannot fully prevent Namenode failure, we can minimize its effect by having checkpoints. The secondary Namenode serves this purpose by periodically merging the FsImage and the edit log; the metadata stored on the Namenode's local disk is also mirrored onto N mount points as a backup, and these CPU-intensive merge activities run on a separate system. If at any moment the Namenode fails, the FsImage from the mounted sites is picked up and the secondary runs as the primary Namenode; this is how the secondary Namenode can be vital. The working of HDFS is kept very simple and dynamic: when the system starts, it is in a neutral state, waiting for the Datanodes to send information about their vacant blocks so that the Namenode can assign blocks to Datanodes; via the heartbeat protocol and block reports the Namenode receives these messages, based on which it allots the different data chunks to the different Datanodes. If the Namenode fails, the secondary node acts as the Namenode, as discussed earlier. The Namenode then works on block replication: if any block is replicated fewer times than the replication factor, it works on it until the factor is fulfilled. As the Namenode boots, the FsImage and the edit log are read from the local disk, all the edit-log transactions are applied to the existing FsImage to create a new FsImage file, and the old edit log is flushed; that is how it stays current. The block-to-Datanode mapping kept by the Namenode can be queried from a client, as sketched below.
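The following small sketch (the file path and cluster configuration are assumptions, not from the paper) asks the Namenode, through the FileSystem API, where the blocks of a file are located, showing that different blocks of the same file live on different Datanodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/events.log");              // illustrative path
        FileStatus status = fs.getFileStatus(file);

        // The Namenode answers this metadata query from its in-memory namespace;
        // each BlockLocation lists the Datanodes holding one replicated block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}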

Reliability: An exceptional quality of the Hadoop framework is that when an input file is stored in HDFS it goes through splitting of the large dataset into smaller chunks, and these blocks of data are replicated over different nodes of the cluster. Replication is done at the Datanode level. A replication factor is introduced, which is the number of replicas kept of the same block. This provides fault tolerance: for example, if a rack fails then all the nodes in that rack fail with it, but because of replication the same data block exists on other nodes and the required block can still be accessed, which increases reliability. Replication onto the same node is avoided, because a backup on the same node is of no use: if the node fails, its backup is gone as well, so Hadoop replicates across different nodes of the cluster. Also, the secondary Namenode, which is the backup of the primary Namenode as discussed earlier, carries a backup of the FsImage and edit log and can act as the primary Namenode if the main Namenode fails. This ensures the reliability of the Hadoop framework.
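As a brief sketch of how the replication factor is controlled (the default value, the file path and the per-file value of 5 below are illustrative assumptions, not taken from the paper):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");               // cluster-wide default replication factor for new files

        FileSystem fs = FileSystem.get(conf);
        // Raise the replication factor of one critical file to 5 replicas;
        // the Namenode schedules the extra copies onto other Datanodes.
        boolean accepted = fs.setReplication(new Path("/data/events.log"), (short) 5);
        System.out.println("replication change accepted: " + accepted);
    }
}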
High performance: Another concern in a distributed network is the network bandwidth, and Hadoop addresses this constraint too, since Hadoop makes heavy use of local data. This can be understood from the following example: with a replication factor of 3 (the most common case), Hadoop saves three copies of each block on the nodes. If it stored every replicated copy on a node in a different rack, that would enhance data reliability and availability, but what about the network bandwidth consumed when we fetch the blocks for a read, since each part would then have to be fetched from a different rack? For this reason another strategy is used: two thirds of the replicas are placed on the same rack and the remaining third on a node in a different rack. This improves performance and availability, speeds up access, and thus minimizes bandwidth consumption. Such rack-based placement relies on the cluster being configured to be rack-aware, as sketched below.
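Replica placement can only take racks into account if the cluster is made rack-aware. As a hedged sketch (the property name shown, net.topology.script.file.name, is the Hadoop 2.x name; Hadoop 1.x used topology.script.file.name, and the script path is a placeholder), rack awareness is enabled by pointing Hadoop at a script that maps a Datanode address to its rack identifier:

import org.apache.hadoop.conf.Configuration;

// Hedged illustration: in a real cluster this property is set in core-site.xml on the
// Namenode rather than in client code, and the script itself (a small executable that
// prints a rack id such as /dc1/rack7 for each host passed to it) must be provided by
// the administrator.
public class RackAwarenessSketch {
    public static Configuration rackAwareSettings() {
        Configuration conf = new Configuration();
        conf.set("net.topology.script.file.name", "/etc/hadoop/conf/topology.sh");
        return conf;
    }
}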
Diagrams/result
The Hadoop framework consists of two main sub-frameworks, the MapReduce framework and the Hadoop Distributed File System, which are interlinked. MapReduce is mainly for the compute or analysis part, which is the heart of big data analysis, whereas HDFS is mainly for the storage part; both of these frameworks are highly dependent on each other. A master/slave architecture exists throughout Hadoop. The input file is divided into multiple blocks and saved on different nodes (Datanodes), and these blocks are replicated (to increase reliability in case of any accident) on the same or different racks, keeping minimization of network bandwidth usage in mind. The Jobtracker sits over the Namenode: the input file is sent to the Namenode, which divides it, and the file blocks are saved on the Datanodes; this is the storage section. In case of any read or write operation, the Jobtracker (master) on the Namenode asks the Tasktrackers to perform the mapper and reducer tasks respectively; this is the computation part of Hadoop.

CONCLUSION
The maximum amount of industry-generated data is unstructured, and even when it is structured it is so huge that traditional RDBMSs fail to store the enormous variety, volume and velocity of the data. The Hadoop framework is an asset because it helps in achieving the main goals of the industry, such as storage, compute and analysis, reliability and fault tolerance, and, last but not least, efficient use of network bandwidth. Thus, using Hadoop we can store the data in a distributed manner using HDFS and compute over it according to user-defined functions in MapReduce.

Acknowledgement
We would like to thank our families, friends and faculty for motivating us and helping us to stay focused on our goal.

