UNIT - II

Understanding MapReduce Fundamentals and HBase: The MapReduce Framework, Techniques to Optimize MapReduce Jobs, Role of HBase in Big Data Processing, Exploring the Big Data Stack, Virtualization and Big Data, Virtualization Approaches. Storing Data in Databases and Data Warehouses: RDBMS and Big Data, Non-Relational Database, Integrating Big Data with Traditional Data Warehouses, Big Data Analysis and Data Warehouse, Changing Deployment Models in Big Data Era. Processing Your Data with MapReduce: Developing Simple MapReduce Application, Points to Consider while Designing MapReduce. Customizing MapReduce Execution: Controlling MapReduce Execution with InputFormat, Reading Data with Custom RecordReader, Organizing Output Data with OutputFormats, Customizing Data with RecordWriter, Optimizing MapReduce Execution with Combiner, Implementing a MapReduce Program for Sorting Text Data.
2.1 Understanding MapReduce Fundamentals and HBase

2.1.1 The MapReduce Framework
Q1. What is MapReduce? Explain its components and features.

Ans :
MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. The map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map task.
The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model.
The Algorithm

»   Generally, the MapReduce paradigm is based on sending the computation to where the data resides.

»   A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
    •   Map stage : The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.

    •   Reduce stage : This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in HDFS.
»   During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.

»   The framework manages all the details of data-passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
»   Most of the computing takes place on nodes with data on local disks, which reduces the network traffic.
»   After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
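To make the map and reduce stages concrete, the following is a minimal sketch of the two primitives using the newer org.apache.hadoop.mapreduce API (the WordCount program developed later in this unit uses the older mapred API). The class names TokenMapper and SumReducer are illustrative assumptions, not part of Hadoop.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: each input record (key = byte offset, value = one line of the file)
// is broken down into (word, 1) tuples.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);   // emit an intermediate key/value tuple
        }
    }
}

// Reduce stage: after the shuffle, each word arrives with the list of its 1s,
// which are combined into a smaller set of (word, count) tuples stored in HDFS.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}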
1.  Scheduling

    »   The scheduler assigns the map and reduce tasks; as of today, there are various schedulers available for Hadoop.
    »   Dealing with stragglers : job execution time depends on the slowest map and reduce tasks. Speculative execution can help with slow machines, but data locality may be at stake.
    »   Dealing with skew in the distribution of values : for example, temperature readings from sensors. In this case, scheduling cannot help; it is possible to work on customized partitioning and sampling to solve such issues (a custom Partitioner sketch follows this list).
2.  Data/code co-location

    »   The effectiveness of the data processing mechanism depends on the location of the code and the data required for the code to execute.
    »   The best result is obtained when both the code and the data reside on the same machine.
3.  Synchronization

    »   In MapReduce, the execution of several processes requires synchronization.
    »   Synchronization is achieved by the "shuffle and sort" mechanism.
    »   The framework tracks all the tasks along with their timings and starts the reduction process after the completion of mapping.
    »   The shuffle and sort mechanism is used for collecting the map data and preparing it for reduction.
4.  Handling errors and faults

    »   MapReduce generally provides a high level of fault tolerance and robustness in handling errors.
    »   Using quite simple mechanisms, the MapReduce framework deals with:
        •   Hardware failures
            -   Individual machines: disks, RAM
            -   Networking equipment
            -   Power / cooling
        •   Software failures
            -   Exceptions, bugs
        •   Corrupt and/or invalid input data
    »   Moreover, the MapReduce engine is designed with the ability to find out the tasks that are incomplete and eventually assign them to different nodes.
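As referenced under the scheduling point above, skew can be attacked with customized partitioning. The following is a small, hedged sketch of a custom Partitioner, assuming Text keys that hold numeric sensor readings; the class name and the range boundaries are illustrative assumptions, not something prescribed by the text.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes records to reducers by value range instead of by hashCode(), so that a
// skewed key space (e.g., most readings clustered around one temperature) can be
// spread more evenly across the configured number of reduce tasks.
public class TemperatureRangePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        double reading = Double.parseDouble(key.toString());   // assumed numeric key
        int bucket;
        if (reading < 0)        bucket = 0;
        else if (reading < 15)  bucket = 1;
        else if (reading < 30)  bucket = 2;
        else                    bucket = 3;
        return bucket % numPartitions;   // always within [0, numPartitions)
    }
}

Such a class would be registered on the job with job.setPartitionerClass(TemperatureRangePartitioner.class); sampling the data first (for example with the InputSampler and TotalOrderPartitioner classes that ship with Hadoop) is the more systematic way to choose the boundaries.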
Q2. How does MapReduce work? Explain how MapReduce organizes work.

Ans :

Working of MapReduce

»   One map task is created for each split, which then executes the map function for each record in the split.
»   It is always beneficial to have multiple splits, because the time taken to process a split is small compared to the time taken for processing the whole input. When the splits are smaller, the processing is better load-balanced, since we are processing the splits in parallel.

»   However, it is also not desirable to have splits too small in size. When splits are too small, the overhead of managing the splits and map task creation begins to dominate the total job execution time.

»   For most jobs, it is better to make the split size equal to the size of an HDFS block (which is 64 MB by default).
»   Execution of map tasks results in writing output to a local disk on the respective node and not to HDFS.

»   The reason for choosing the local disk over HDFS is to avoid the replication which takes place in the case of an HDFS store operation.

»   Map output is intermediate output, which is processed by reduce tasks to produce the final output.

»   Once the job is complete, the map output can be thrown away. So, storing it in HDFS with replication becomes overkill.

»   In the event of node failure before the map output is consumed by the reduce task, Hadoop reruns the map task on another node and re-creates the map output.

»   Reduce tasks don't work on the concept of data locality. The output of every map task is fed to the reduce task; map output is transferred to the machine where the reduce task is running.

»   On this machine the output is merged and then passed to the user-defined reduce function.

»   Unlike the map output, reduce output is stored in HDFS (the first replica is stored on the local node and other replicas are stored on off-rack nodes). So, writing the reduce output does consume network bandwidth.
How MapReduce Organizes Work

Hadoop divides the job into tasks. There are two types of tasks:

1.  Map tasks (Splits and Mapping)
2.  Reduce tasks (Shuffling, Reducing)
The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities:

1.  Jobtracker : acts like a master (responsible for complete execution of the submitted job).
2.  Multiple Task Trackers : act like slaves, each of them performing a part of the job.

For every job submitted for execution in the system, there is one Jobtracker that resides on the Namenode, and there are multiple Tasktrackers which reside on Datanodes.
Fig. : One Job Tracker on the Namenode and Task Trackers on the Datanodes
»   In addition, each Tasktracker periodically sends a 'heartbeat' signal to the Jobtracker so as to notify it of the current state of the system.

»   Thus the Jobtracker keeps track of the overall progress of each job. In the event of a task failure, the Jobtracker can reschedule it on a different Tasktracker.
Q3. Explore the Map and Reduce functions.

Ans :

Exploring Map and Reduce Functions

The map function has been a part of many functional programming languages for years. To further your understanding of why the map function is a good choice for big data (and the reduce function is as well), it is important to understand a little bit about functional programming.
Operators in functional languages do not modify the structure of the data; they always create new data structures as their output. More importantly, the original data itself is unmodified as well.
The following is a way to represent how to construct a solution to the problem. One way to accomplish the solution is to identify the input data and create a list:

mylist = ("all counties in the US that participated in the most recent general election")

Create the function howManyPeople using the map function. This selects only the counties with more than 50,000 people:

map howManyPeople (mylist) = [ howManyPeople "county 1"; howManyPeople "county 2"; howManyPeople "county 3"; howManyPeople "county 4"; . . . ]

Now produce a new output list of all the counties with populations greater than 50,000:

(no, county 1; yes, county 2; no, county 3; yes, county 4; ?, county nnn)
Adding the reduce Function

Like the map function, reduce has been a feature of functional programming languages for many years.

The reduce function takes the output of a map function and "reduces" the list in whatever fashion the programmer desires. The first step that the reduce function requires is to place a value in something called an accumulator, which holds an initial value.

Revisit the map function example now to see what the reduce function is capable of doing. Suppose that you need to identify the counties where the majority of the votes were for the Democratic candidate. Remember that your howManyPeople map function looked at each element of the input list and created an output list of the counties with more than 50,000 people (yes) and the counties with less than 50,000 people (no).

After invoking the howManyPeople map function, you are left with the following output list:

(no, county 1; yes, county 2; no, county 3; yes, county 4; ?, county nnn)

This is now the input for your reduce function. Here is what it looks like:

countylist = (no, county 1; yes, county 2; no, county 3; yes, county 4; ?, county nnn)

reduce isDemocrat (countylist)
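The same idea can be sketched in plain Java with the streams API. This is only an illustrative analogue of the howManyPeople / isDemocrat pseudo-functions above; the class name, sample counties and populations are made-up values for the example.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CountyMapReduceSketch {

    // Analogue of the howManyPeople map function: builds a new value and
    // leaves the original input untouched.
    static String howManyPeople(String county, long population) {
        return (population > 50_000 ? "yes, " : "no, ") + county;
    }

    public static void main(String[] args) {
        // mylist: made-up (county, population) pairs
        List<String[]> myList = Arrays.asList(
                new String[]{"county 1", "30000"},
                new String[]{"county 2", "120000"},
                new String[]{"county 3", "45000"},
                new String[]{"county 4", "80000"});

        // map step: produce a brand-new output list
        List<String> countyList = myList.stream()
                .map(c -> howManyPeople(c[0], Long.parseLong(c[1])))
                .collect(Collectors.toList());
        System.out.println(countyList);   // [no, county 1, yes, county 2, no, county 3, yes, county 4]

        // reduce step: fold the list into a single value held in an accumulator
        long largeCounties = countyList.stream()
                .filter(entry -> entry.startsWith("yes"))
                .count();
        System.out.println(largeCounties);   // 2
    }
}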
2.1.2 Techniques to Optimize MapReduce Jobs

Q4. Explain various techniques to optimize MapReduce jobs.

Ans :

An analysis of MapReduce program execution shows that it involves a series of steps, in which every step has its own set of resource requirements. In the case of short jobs, the ones for which the user needs a quick response to the queries, the response time becomes more important. Hence we need the MapReduce job to be optimized. Encountering a deadlock even for a single resource will slow down the execution process.

We can organize the MapReduce optimization techniques in the following categories:

»   Hardware or network topologies
»   Synchronization
»   File system
1.  Hardware/network topology

A distinct advantage of MapReduce is the capability to run on inexpensive clusters of commodity hardware and standard networks. If you don't pay attention to where your servers are physically organized, you won't get the best performance and the high degree of fault tolerance necessary to support big data tasks.

Commodity hardware is often stored in racks in the data center. The proximity of the hardware within the rack offers a performance advantage compared with moving data and code between racks.

File system

The distributed file systems that support MapReduce implementations follow a master-slave style of distribution, where the master node stores all the metadata, access rights, mapping and location of files and blocks, and so on. The slaves are nodes where the actual data is stored. All the requests go to the master and then are handled by the appropriate slave node. As you contemplate the design of the file system you need to support the MapReduce implementation, you should consider the following:
»   Keep it warm : As you might expect, the master node could get overworked because everything begins there. Additionally, if the master node fails, the entire file system is inaccessible until the master is restored. A very important optimization is to create a "warm standby" master node that can jump into service if a problem occurs with the online master.
»   The bigger the better : File size is also an important consideration. Lots of small files (less than 100 MB) should be avoided. Distributed file systems supporting MapReduce engines work best when they are populated with a modest number of large files.
2.1.3 Role of HBase in Big Data Processing

Q5. What is HBase? Explain its features and applications.

Ans :

HBase is a distributed database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable.

HBase is a data model that is similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop File System (HDFS).

It is a part of the Hadoop ecosystem that provides random, real-time read/write access to data in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumers read/access the data in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and provides read and write access.
Features of HBase

»   HBase is linearly scalable.
»   It has automatic failure support.
»   It provides consistent reads and writes.
»   It integrates with Hadoop, both as a source and a destination.
»   It has an easy Java API for clients.
»   It provides data replication across clusters.
Where to Use HBase

»   Apache HBase is used to have random, real-time read/write access to Big Data.
»   It hosts very large tables on top of clusters of commodity hardware.
»   Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable acts upon the Google File System, Apache HBase works on top of Hadoop and HDFS.

Applications of HBase

»   It is used whenever there is a need to write heavy applications.
»   HBase is used whenever we need to provide fast random access to available data.
»   Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
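As a quick illustration of this random read/write access, here is a hedged sketch using the HBase Java client API; the table name test, column family cf, and row key row1 are assumptions made only for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();          // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(config);
             Table table = connection.getTable(TableName.valueOf("test"))) {

            // random write: put one cell into row "row1"
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes("value1"));
            table.put(put);

            // random read: fetch the same cell back by row key
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("a"));
            System.out.println(Bytes.toString(value));                // prints value1
        }
    }
}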
Q6. How to install HBase on various platforms? Discuss.

Ans :

HBase - Installation

HBase can be installed in three modes. The features of these modes are mentioned below.

1.  Standalone mode installation (no dependency on the Hadoop system)

»   This is the default mode of HBase.
»   It runs against the local file system.
»   It doesn't use Hadoop HDFS.
»   Only the HMaster daemon can run.
»   Not recommended for production environments.
»   Runs in a single JVM.
2.  Pseudo-distributed mode installation (single-node Hadoop system + HBase installation)

»   All daemons run on a single node.
»   Recommended for production environments.
3.  Fully distributed mode installation (multi-node Hadoop environment + HBase installation)

»   It runs on Hadoop HDFS.
»   All daemons are going to run across all nodes present in the cluster.
»   Highly recommended for production environments.
HBase - Standalone mode installation

Installation is performed on Ubuntu with Hadoop already installed.
Step 1) Place hbase-1.1.1-bin.tar.gz in /home/hduser.

Step 2) Unzip it; this creates the directory hbase-1.1.1 in the location /home/hduser.

Step 3) Open hbase-env.sh from the hbase-1.1.1/conf directory and mention the JAVA_HOME path:

$ cd hbase-1.1.1/conf
$ gedit hbase-env.sh

# The java implementation to use. Java 1.7+ required.
# export JAVA_HOME=/usr/java/jdk1.6.0/
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre

# Extra Java CLASSPATH elements. Optional.
# export HBASE_CLASSPATH=
Step 4) Open the ~/.bashrc file and mention the HBASE_HOME path as shown below:

export HBASE_HOME=/home/hduser/hbase-1.1.1
export PATH=$PATH:$HBASE_HOME/bin
Step 5) Open hbase-site.xml ($ gedit hbase-site.xml) and place the following properties inside the file:

<property>
   <name>hbase.rootdir</name>
   <value>file:///home/hduser/HBASE/hbase</value>
</property>

<property>
   <name>hbase.zookeeper.property.dataDir</name>
   <value>/home/hduser/HBASE/zookeeper</value>
</property>
Step 6) Open the /etc/hosts file and mention the IPs as shown below:

127.0.0.1    localhost
127.0.0.1    ubuntu

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
Step 7) Now run start-hbase.sh from the hbase-1.1.1/bin location. We can check with the jps command whether HMaster is running or not.

Step 8) The HBase shell can be started by using the "hbase shell" command, and it will enter into interactive shell mode. Once it enters shell mode, we can perform all types of commands.

The standalone mode does not require Hadoop daemons to start; HBase can run independently.
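Tables can also be created programmatically instead of from the HBase shell. Below is a small, hedged sketch using the HBase 1.x Java admin API; the table name test and the column family cf are again just example values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateTableSketch {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(config);
             Admin admin = connection.getAdmin()) {

            // Describe a table named "test" with one column family "cf"
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("test"));
            table.addFamily(new HColumnDescriptor("cf"));

            if (!admin.tableExists(table.getTableName())) {
                admin.createTable(table);   // equivalent of: create 'test', 'cf' in the shell
            }
        }
    }
}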
HBase - Pseudo-distributed mode installation

This is another method for HBase installation, known as the Pseudo-distributed mode of installation. Below are the steps to install HBase through this method.

Step 1) Place hbase-1.1.2-bin.tar.gz in /home/hduser.

Step 2) Unzip it by executing the command $tar -xvf hbase-1.1.2-bin.tar.gz. It will unzip the contents, and it will create hbase-1.1.2 in the location /home/hduser.

Step 3) Open hbase-env.sh as below and mention the JAVA_HOME path and the region servers' path, and export the commands as shown:
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre

# Region servers' path and ZooKeeper management
export HBASE_REGIONSERVERS=${HBASE_HOME}/conf/regionservers
export HBASE_MANAGES_ZK=true

Step 4) Open the ~/.bashrc file and mention the HBASE_HOME path as shown below:

export HBASE_HOME=/home/hduser/hbase-1.1.2
export PATH=$PATH:$HBASE_HOME/bin
Step 5) Open hbase-site.xml and mention the below properties in the file:

<property>
   <name>hbase.rootdir</name>
   <value>hdfs://localhost:9000/hbase</value>
</property>

<property>
   <name>hbase.cluster.distributed</name>
   <value>true</value>
</property>

<property>
   <name>hbase.zookeeper.quorum</name>
   <value>localhost</value>
</property>

<property>
   <name>dfs.replication</name>
   <value>1</value>
</property>

<property>
   <name>hbase.zookeeper.property.clientPort</name>
   <value>2181</value>
</property>

<property>
   <name>hbase.zookeeper.property.dataDir</name>
   <value>/home/hduser/hbase/zookeeper</value>
</property>

1.  hbase.rootdir : the HBase root directory is set up in this property.
2.  hbase.cluster.distributed : for a distributed set up, we have to set this property to true.
3.  hbase.zookeeper.quorum : the ZooKeeper quorum property should be set up here.
4.  dfs.replication : replication is set up in this property. By default we are placing replication as 1. In the fully distributed mode, multiple data nodes are present, so we can increase replication by placing a value greater than 1 in this property.
5.  hbase.zookeeper.property.clientPort : the client port should be mentioned in this property.
6.  hbase.zookeeper.property.dataDir : the ZooKeeper data directory can be mentioned in this property.
Step 6) Start the Hadoop daemons first and after that start the HBase daemons. Here, first you have to start the Hadoop daemons by using the "./start-all.sh" command.
HBase - Fully distributed mode installation

»   Installation is the same as the pseudo-distributed installation, but it is performed across multiple nodes.
»   The configuration files mentioned in hbase-site.xml and hbase-env.sh are the same as mentioned in pseudo mode.
2.1.4 Exploring the Big Data Stack

Q7. Explain the big data architecture with a neat diagram.

Ans :

Big data analysis also needs the creation of a model. The architecture of a big data environment must fulfill all the foundational requirements and must be able to perform the following functions. Data is first properly validated and cleaned; after that, the ingestion layer takes over.
Fig. : Big Data System's Architecture (ingestion layer, Hadoop storage layer, Hadoop administration and platform management, security layer, visualization layer, and data warehouses)
Ingestion Layer

The responsibility of this layer is to separate the noise from the relevant information. The ingestion layer should be able to validate, cleanse, transform, reduce, and integrate the data into the big data tech stack for further processing.

Fig. : Functions of the Data Ingestion Layer (Identification, Filtration, Validation, Noise Reduction, Integration, Compression, feeding the Hadoop Storage Layer)
Distributed (Hadoop) Storage Layer

The data is stored in massively distributed storage systems such as the Hadoop Distributed File System (HDFS). Hadoop is an open source framework that allows us to store huge volumes of data in a distributed fashion across low-cost machines. Hadoop can support petabytes of data and has a massively scalable MapReduce engine that computes results in batches. HDFS requires complex file read/write programs to be written by skilled developers. These programs are called NoSQL databases.
MapReduce was adopted by Google for efficiently executing a set of functions against a large amount of data in batch mode. The map component distributes the problem or tasks across a large number of systems and handles the placement of the tasks in a way that balances the load. After the distributed computation is completed, another function called reduce combines all the elements to provide a result. MapReduce simplifies the creation of processes that analyze large amounts of unstructured and structured data in parallel. Underlying hardware failures are handled transparently for user applications, providing a reliable and fault-tolerant capability.

»   Security Layer
As big data analysis becomes a mainstream functionality for companies, the security of that data becomes a prime concern. The security requirements have to be part of the big data fabric from the beginning and not an afterthought.
»   Monitoring Layer

Monitoring systems have to be aware of large distributed clusters that are deployed in a federated mode. The system should also provide tools for data storage and visualization. Performance is a key parameter to monitor, so that there is very low overhead and high parallelism. Open source tools like Ganglia and Nagios are widely used for monitoring big data tech stacks.
»   Visualization Layer

A huge volume of big data can lead to information overload. However, if visualization is incorporated early on as an integral part of the big data tech stack, it will be useful for data analysts and scientists to gain insights faster and increase their ability to look at different aspects of the data in various visual modes. Once the aggregated output of big data Hadoop processing is scooped into the traditional data warehouse and data marts for further analysis along with the transaction data, the visualization layers can work on top of this consolidated aggregated data. Additionally, if real-time insights are required, real-time engines powered by complex event processing (CEP) engines and event-driven architectures (EDAs) can be utilized.
Fig. : Visualisation Conceptual Architecture (traditional BI tools and big data visualization tools working over data warehouses, data lakes, and structured data)
2.1.5 Virtualization and Big Data

Q8. What is the need for virtualization in big data?

Ans :

The essential idea of virtualization is that heterogeneous or distributed systems are represented as complex systems through specific interfaces that replace physical hardware or data storage designations with virtual components. These are called virtual machines. Each virtual machine contains a separate copy of the operating system with its own hardware resources. Although virtualization is not strictly required for big data analytics, MapReduce works very efficiently in a virtualized environment.
Fig. : Virtual Machine Monitor
The following are the basic features of virtualization:

»   Partitioning : Multiple applications and operating systems are supported on a single physical system by partitioning the available resources.

»   Isolation : Each virtual machine runs in an isolated manner from its host physical system and other virtual machines. The benefit of this is that, if one virtual machine crashes, the other virtual machines and the host system are not affected.

These features form the basis of using virtualization for optimizing the computing environment.
2.1.6 Virtualization Approaches

Q9. Explain the various virtualization approaches in big data.

Ans :

In a big data environment you can virtualize almost every element, like servers, storage, applications, data, network, etc. The following are some forms of virtualization.
»   Big data server virtualization

In server virtualization, one physical server is partitioned into multiple virtual servers. The hardware and resources of a machine, including the random access memory (RAM), CPU, hard drive, and network controller, can be virtualized into a series of virtual machines that each run their own applications and operating system.

A virtual machine (VM) is a software representation of a physical machine that can execute or perform the same functions as the physical machine. A thin layer of software is inserted into the hardware that contains a virtual machine monitor, or hypervisor.
»   Big data application virtualization

Application infrastructure virtualization provides an efficient way to manage applications in context with customer demand. The application is encapsulated in a way that removes its dependencies from the underlying physical computer system. This helps to improve the overall manageability and portability of the application.

In addition, the application infrastructure virtualization software typically allows for codifying business and technical usage policies to make sure that each of your applications leverages virtual and physical resources in a predictable way. Efficiencies are gained because you can more easily distribute IT resources according to the relative business value of your applications.

Application infrastructure virtualization used in combination with server virtualization can help to ensure that business service-level agreements are met. Server virtualization monitors CPU and memory usage, but does not account for variations in business priority when allocating resources.
»   Big data network virtualization

Network virtualization provides an efficient way to use networking as a pool of connection resources. Instead of relying on the physical network for managing traffic, you can create multiple virtual networks, all utilizing the same physical implementation. This can be useful if you need to define a network for data gathering with a certain set of performance characteristics and capacity, and another network for applications with different performance and capacity requirements.

Virtualizing the network helps reduce these bottlenecks and improves the capability to manage the large distributed data required for big data analysis.
»   Big data processor and memory virtualization

Processor virtualization helps to optimize the processor and maximize performance. Memory virtualization decouples memory from the servers.

In big data analysis, you may have repeated queries of large data sets and the creation of advanced analytic algorithms, all designed to look for patterns and trends that are not yet understood. These advanced analytics can require lots of processing power (CPU) and memory (RAM). For some of these computations, it can take a long time without sufficient CPU and memory resources.
»   Big data and storage virtualization

Data virtualization can be used to create a platform for dynamic linked data services. This allows data to be easily searched and linked through a unified reference source. As a result, data virtualization provides an abstract service that delivers data in a consistent form regardless of the underlying physical database. In addition, data virtualization exposes cached data to all applications to improve performance.

Storage virtualization combines physical storage resources so that they are more effectively shared. This reduces the cost of storage and makes it easier to manage the data stores required for big data analysis.
2.2 Storing Data in Databases and Data Warehouses

2.2.1 RDBMS and Big Data

Q10. Explain the relationship between RDBMS and Big Data.

Ans :
Big data is becoming an important element in the way organizations are leveraging high-volume data at the right speed to solve specific data problems. Relational Database Management Systems are important for this high volume. Big data does not live in isolation. To be effective, companies often need to be able to combine the results of big data analysis with the data that exists within the business.
»   BIG DATA BASICS : RDBMS AND PERSISTENT DATA

One of the most important services provided by operational databases (also called data stores) is persistence. Persistence guarantees that the data stored in a database won't be changed without permissions and that it will be available as long as it is important to the business.

Given this most important requirement, you must then think about what kind of data you want to persist, how you can access and update it, and how you can use it to make business decisions. At this most fundamental level, the choice of your database engine is critical to your overall success with your big data implementation.
Even though the underlying technology has been around for quite some time, many of these systems are in operation today because the businesses they support are highly dependent on the data. To replace them would be akin to changing the engines of an airplane during a transoceanic flight.
»   BIG DATA BASICS : RDBMS AND TABLES

Relational databases are built on one or more relations and are represented by tables. In companies both small and large, most of their important operational information is probably stored in RDBMSs. Many companies have different RDBMSs for different areas of their business. Transactional data might be stored in one vendor's database, while customer information could be stored in another.

»   POSTGRESQL, AN OPEN SOURCE RELATIONAL DATABASE

During your big data implementation, you'll likely come across PostgreSQL, a widely used, open source relational database. Several factors contribute to the popularity of PostgreSQL. As an RDBMS with support for the SQL standard, it does all the things expected in a database product, plus its longevity and wide usage have made it "battle tested." It is also available on just about every variety of operating system, from PCs to mainframes.

Providing the basics and doing so reliably are only part of the story. PostgreSQL also supports many features found only in expensive proprietary RDBMSs, including the following:

»   Indexing methods
»   Procedural languages

This high level of customization makes PostgreSQL desirable when rigid, proprietary products won't get the job done. It is infinitely extensible.

Finally, the PostgreSQL license permits modification and distribution in any form, open or closed source. Any modifications can be kept private or shared with the community as you wish.
2.2.2 Non-Relational Databases

Q11. What are called Non-Relational databases?

Ans :

Non-relational databases do not rely on the table/key model endemic to RDBMSs. In short, specialty data in the big data world requires specialty persistence and data manipulation techniques.

One emerging, popular class of non-relational database is called not only SQL (NoSQL). Originally the originators envisioned databases that did not require the relational model and SQL. As these products were introduced into the market, the definition softened a bit and now they are thought of as "not only SQL," again bowing to the ubiquity of SQL.

The other class is databases that do not support the relational model, but rely on SQL as a primary means of manipulating the data within. Even though relational and non-relational databases have similar fundamentals, how the fundamentals are accomplished creates the differentiation.
Non-relational database technologies have the following characteristics in common:

»   Scalability : In this instance, this refers to the capability to write data across multiple data stores simultaneously without regard to the physical limitations of the underlying infrastructure. Another important dimension is seamlessness. The databases must be able to expand and contract in response to data flows and do so invisibly to the end users.

»   Data and query model : Instead of the row, column, key structure, non-relational databases use specialty frameworks to store data, with a requisite set of specialty query APIs to intelligently access the data.

»   Persistence design : Persistence is still a critical element in non-relational databases. Due to the high velocity, variety, and volume of big data, these databases use different mechanisms for persisting the data. The highest performance option is "in memory," where the entire database is kept in the very fast memory system of your servers.
»   Interface diversity : Although most of these technologies support RESTful APIs as their "go to" interface, they also offer a wide variety of connection mechanisms for programmers and database managers, including analysis tools and reporting/visualization.

»   Eventual consistency : While RDBMSs use ACID (Atomicity, Consistency, Isolation, Durability) for ensuring the consistency of data, non-relational DBMSs use BASE. BASE stands for Basically Available, Soft state, and Eventual consistency. Eventual consistency is most important because it is responsible for conflict resolution when data is in motion between nodes in a distributed implementation. The data state is maintained by the software, and the access model relies on basic availability.
2.2.3 Integrating Big Data with Traditional Data Warehouses

Q12. What are the challenges faced by big data in the usage of traditional data warehouses?

Ans :
While the worlds of big data and the traditional data warehouse will intersect, they are unlikely to merge anytime soon. Think of a data warehouse as a system of record for business intelligence, much like a customer relationship management (CRM) or accounting system. These systems are highly structured and optimized for specific purposes. In addition, these systems of record tend to be highly centralized.

The diagram shows a typical approach to data flows with warehouses and marts:

Fig. : Typical data flows from transactional systems into warehouses and marts
Organizations will inevitably continue to use data warehouses to manage the type of structured and operational data that characterizes systems of record. These data warehouses will still provide business analysts with the ability to analyze key data, trends, and so on. However, the advent of big data is both challenging the role of the data warehouse and providing a complementary approach.

The key challenges are outlined here and will be discussed with each architecture option.
Data loading

»   With no definitive format, metadata, or schema available up front, loading data into the big data layers differs from the structured loading of a traditional warehouse.
Data availability

»   Data availability has been a challenge for any system that relates to processing and transforming data for use by end users, and big data is no exception. The benefit of Hadoop or NoSQL is to mitigate this risk and make data available for analysis immediately upon acquisition. The challenge is to load the data quickly, as there is no pre-transformation required.

»   Data availability depends on the specificity of the metadata supplied to the SerDe or Avro layers. If data can be adequately cataloged on acquisition, it can be available for analysis and discovery immediately.

»   Since there is no update of data in the big data layers, reprocessing new data containing updates will create duplicate data, and this needs to be handled to minimize the impact on availability.
Data volumes

»   Big data volumes can easily get out of control due to the intrinsic nature of the data. Care needs to be paid to the growth of the data, and administrators need tools and tips to zone the infrastructure to mark the data in its own area, minimizing both risk and performance impact.

»   Data exploration and mining is a very common activity that is a driver for big data acquisition across organizations, and it also produces large data sets as the output of processing. These data sets need to be maintained in the big data system by periodically sweeping and deleting intermediate data sets. This is an area that is normally ignored by organizations and can be a performance drain over a period of time.
Storage performance

»   Disk performance is an important consideration when building big data systems, and the appliance model can provide a better focus on the storage class and tiering architecture. This will provide the starting kit for longer-term planning and growth management of the storage infrastructure.

»   If a combination of in-memory, SSD, and traditional storage architecture is planned for big data processing, the persistence and exchange of data across the different layers can consume both processing time and cycles. Care needs to be extended in this area, and the appliance architecture provides a reference for such complex storage requirements.

Operational costs

Calculating the operational cost for a data warehouse and its big data platform is a complex task that includes initial acquisition costs for infrastructure, plus labor costs for implementing the architecture, plus infrastructure and labor costs for ongoing maintenance, including external help commissioned from consultants and experts.
2.2.4 Big Data Analysis and Data Warehouse

Q13. Discuss about big data and the data warehouse.

Ans :

Big data is analysed to know the present behaviour or trends and to make future predictions. Various big data solutions are used for extracting useful information from big data.

Big data can be useful because it:

»   Enables the storage of very large amounts of heterogeneous data.
»   Holds data in low-cost storage devices.
»   Keeps data in a raw or unstructured format.

You will find value in bringing the capabilities of the data warehouse and the big data environment together. You need to create a hybrid environment where big data can work hand in hand with the data warehouse.
Use the data warehouse for what it has been designed to do: provide a well-vetted version of the truth about a topic that the business wants to analyze. The warehouse might include information about a particular company's product line, its customers, its suppliers, and the details of a year's worth of transactions.
The information managed in the data warehouse or a departmental data mart has been carefully constructed so that the metadata is accurate. With the growth of new web-based information, it is practical and often necessary to analyze this massive amount of data in context with historical data. This is where the hybrid model comes in.

Certain aspects of marrying the data warehouse with big data can be relatively easy. For example, many of the big data sources come from sources that include their own well-designed metadata. Complex e-commerce sites include well-defined data elements. Therefore, when conducting analysis between the warehouse and the big data source, the information management organization is working with two data sets with carefully designed metadata models that have to be rationalized.
Of course, in some situations, the information sources lack explicit metadata. Before an analyst can combine the historical transactional data with the less structured big data, work has to be done. Typically, initial analysis of petabytes of data will reveal interesting patterns that can help predict subtle changes in business or potential solutions to a patient's diagnosis.

The initial analysis can be completed leveraging tools like MapReduce with the Hadoop distributed file system framework. At this point, you can begin to understand whether it is able to help evaluate the problem being addressed.
In the process of analysis, it is just as important to eliminate unnecessary data as it is to identify data relevant to the business context. When this phase is complete, the remaining data needs to be transformed so that the metadata definitions are precise. In this way, when the big data is combined with traditional, historical data from the warehouse, the results will be accurate and meaningful.

The Big Data Integration Lynchpin

This process requires a well-defined data integration strategy. While data integration is a critical element of managing big data, it is equally important when creating a hybrid analysis with the data warehouse. In fact, the process of extracting the data and transforming it in a hybrid environment is very similar to how this process is executed within a traditional data warehouse.
In the data warehouse, data is extracted from traditional source systems such as CRM or ERP systems. It is critical that elements from these sources can be matched and combined in a meaningful way. The typical data warehouse will provide the business with a snapshot of data based on the need to analyze a particular business issue that requires monitoring, such as inventory or sales.
The distributed structure of big data will often lead organizations to first load data into a series of nodes and then perform the extraction and transformation. When creating a hybrid of the traditional data warehouse and the big data environment, the distributed nature of the big data environment can dramatically change the capability of organizations to analyze huge volumes of data in context with the business.
2.2.5 Changing Deployment Models in the Big Data Era

Q14. What are the various deployment models in the big data era? Discuss.

Ans :

With the advent of big data, the deployment models for managing data are changing. The traditional data warehouse is typically implemented on a single, large system within the data center. The costs of this model have led organizations to optimize these warehouses and limit the scope and size of the data being managed.

»   THE BIG DATA APPLIANCE MODEL

A data warehouse appliance packages hardware, storage, and analytical engines together to simplify the process of analyzing data from multiple sources. The appliance is therefore a single-purpose system that typically includes interfaces to make it easier to connect to an existing data warehouse.
»   THE BIG DATA CLOUD MODEL

The cloud is becoming a compelling platform to manage big data and can be used in a hybrid environment with on-premises environments. Some of the new innovations in loading and transferring data are already changing the potential viability of the cloud as a big data warehousing platform.

For example, Aspera, a company that
specializes in fast data transferring between
networks, is partnering with Amazon.com to offer
cloud data management services. Other vendors
such as FileCatalyst and Data Expedition are also
focused on this market. In essence, this technology
category leverages the network and optimizes it for
the purpose of moving files with reduced latency.
As this problem of latency in data transfer
continues to evolve, it will be the norm to store big
data systems in the cloud that can interact with a
data warehouse that is also cloud based or a
warehouse that sits in the data center.
2.3 Processing Your Data with MapReduce

2.3.1 Developing a Simple MapReduce Application

Q15. Explain Word Count - Hadoop MapReduce Example - how it works.

Ans :
The Hadoop WordCount operation occurs in 3 stages:

i.   Mapper Phase
ii.  Shuffle Phase
iii. Reducer Phase
Hadoop WordCount Example - Mapper Phase Execution

The text from the input text file is tokenized into words to form a key-value pair with all the words present in the input text file. The key is the word from the input file and the value is '1'.

For instance, consider the sentence "An elephant is an animal". The mapper phase in the WordCount example will split the string into individual tokens, i.e. words. In this case, the entire sentence will be split into 5 tokens, each with a value of 1, as shown below.

Key-value pairs from Hadoop Map Phase execution:

(an,1)
(elephant,1)
(is,1)
(an,1)
(animal,1)
Hadoop WordCount Example - Shuffle Phase Execution

After the map phase execution is completed successfully, the shuffle phase is executed automatically, wherein the key-value pairs generated in the map phase are taken as input and then sorted in alphabetical order. After the shuffle phase is executed from the WordCount example code, the output will look like this:

(an,1)
(an,1)
(animal,1)
(elephant,1)
(is,1)
Hadoop WordCount Example - Reducer Phase Execution

In the reduce phase, all the keys are grouped together and the values for similar keys are added up to find the occurrences of a particular word. It is like an aggregation phase for the keys generated by the map phase. The reducer phase takes the output of the shuffle phase as input and then reduces the key-value pairs to unique keys with the values added up. In our example "An elephant is an animal", "an" is the only word that appears twice in the sentence. After the execution of the reduce phase of the MapReduce WordCount example program, "an" appears as a key only once but with a count of 2, as shown below:

(an,2)
(animal,1)
(elephant,1)
(is,1)
This is how the MapReduce word count program executes and outputs the number of occurrences of a word in any given input file. An important point to note during the execution of the WordCount example is that the mapper class in the WordCount program will execute completely on the entire input file and not just a single sentence. Suppose the input file has 15 lines; then the mapper class will split the words of all the 15 lines and form initial key-value pairs for the entire data set. The reducer execution will begin only after the mapper phase is executed successfully.
Running the WordCount Example in Hadoop MapReduce using a Java Project with Eclipse

Step 1 : Create a Java project with the name "Sample WordCount":

File > New > Project > Java Project > Next

Enter "Sample WordCount" as the project name and click "Finish".
Step 2 : The next step is to get references to the Hadoop libraries by clicking on Add JARs and selecting the required Hadoop jar files for the project's build path.

Step 3 : Create a new Java package inside the project (New > Package) and name it com.code.dezyre.

Step 4 : Create a new class named WordCount (with a public static void main method stub) inside the package com.code.dezyre.
Step 5 : Create a Mapper class within the WordCount class which extends the MapReduceBase class to implement the Mapper interface. The mapper class will contain:

1.  Code to implement the "map" method.
2.  Code for implementing the mapper-stage business logic, written within this method.

Mapper Class Code for the WordCount Example in Hadoop MapReduce
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
In the mapper class code, we have used the StringTokenizer class, which takes the entire line and breaks it into small tokens (strings/words).
Step 6 : Create a Reducer class within the WordCount class extending the MapReduceBase class to implement the Reducer interface. The reducer class for the WordCount example in Hadoop will contain:

1.  Code to implement the "reduce" method.
2.  Code for implementing the reducer-stage business logic, written within this method.

Reducer Class Code for the WordCount Example in Hadoop MapReduce

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
Step 7 : Create the main() driver method within the WordCount class to configure and run the job:

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    //conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
}
Step 8 : Create the JAR file for the WordCount class by exporting the project (Export > JAR file), selecting the resources to export and the export destination of the JAR file, and clicking Finish.
How to execute the Hadoop MapReduce WordCount program?

>> hadoop jar (jar file name) (className_along_with_packageName) (input file) (output folder path)

hadoop jar dezyre_wordcount.jar com.code.dezyre.WordCount /user/cloudera/Input/war_and_peace /user/cloudera/Output
M. . ee,
Output of Executing Hadoop WordCount Example -
HOFS/userktoudera/Outputpar.. | & | 7
cates lcaidomain on Abie ear Tete gine
Prost vistee™ \lewera loudera Manager hue £°MOFSNamelode ‘ HadooplobMacket | )MBase Mag
2.3.2 Points to Consider While Designing MapReduce

Q16. What are the points that need to be considered while designing MapReduce applications?

Ans :
You need to consider the following guidelines while designing MapReduce applications:

»   When dividing information among the map tasks, make sure that you do not create too many mappers. Having the right number of mappers has the following benefits:

    •   Excessive mappers might create scheduling and infrastructure overhead; in some cases they may even terminate the JobTracker.
    •   Having fewer mappers may lead to under-utilization of some parts of the storage and an excessive burden on the other parts of the storage.
    •   A vast number of small mappers performs a large number of seek operations.

»   Another important factor is the number of reducers configured for the application. Too many or too few reducers may have a negative impact on productivity:

    •   The number of reducers designed for an application is an essential variable for the application.
    •   Besides planning framework overhead, having too many reducers generates a corresponding number of output files, which has a negative effect on the NameNode.
    •   Having too few reducers has the same negative impact as having too few mappers.
»   Use job counters properly, as described below:

    •   Counters are suitable for keeping track of global bits of information; they are not meant to calculate aggregate statistics.
    •   Counters are costly because the JobTracker must keep track of all counters of each map/reduce task.

»   You must compress the output of an application with an appropriate compressor.

»   Use a suitable file format for storing the output of MapReduce jobs.

»   You must use a large output block size for individual input/output documents.
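A short, hedged sketch of how some of these guidelines translate into job configuration; the reducer count of 10 and the Snappy codec are assumed example choices, not recommendations from the text.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobTuningSketch {
    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "tuned-job");

        // Choose the reducer count deliberately instead of accepting the default
        job.setNumReduceTasks(10);

        // Compress the final output with an appropriate codec
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);

        // Also compress intermediate map output to reduce shuffle traffic
        job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);

        return job;
    }
}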
2.4 Customizing MapReduce Execution

2.4.1 Controlling MapReduce Execution with InputFormat

Q17. Explain how MapReduce execution can be controlled with InputFormat.

Ans :

Fig. : Functionality of InputFormat (InputFormat → InputSplits → RecordReaders → input key/value pairs → Partitioners)
»   We have two methods to get the data to the mapper in MapReduce: getSplits() and createRecordReader(), as shown below:

public abstract class InputFormat<K, V> {

    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException;
}

»   An InputSplit in Hadoop MapReduce is the logical representation of data. It describes a unit of work that contains a single map task in a MapReduce program.

»   An InputSplit represents the data which is processed by an individual Mapper. The split is divided into records; hence, the mapper processes each record (which is a key-value pair).

»   MapReduce InputSplit length is measured in bytes, and every InputSplit has storage locations (hostname strings). The MapReduce system uses the storage locations to place map tasks as close to the split's data as possible. Map tasks are processed in the order of the size of the splits, so that the largest one gets processed first (a greedy approximation algorithm); this is done to minimize the job runtime. The important thing to notice is that an InputSplit does not contain the input data; it is just a reference to the data.

»   As a user, we don't need to deal with InputSplits directly, because they are created by the InputFormat (the InputFormat creates the InputSplits and divides them into records). FileInputFormat, by default, breaks a file into 128 MB chunks (the same as blocks in HDFS); by setting the mapred.min.split.size parameter in mapred-site.xml we can control this value, or we can override the parameter in the Job object used to submit a particular MapReduce job. We can also control how the file is broken up into splits by writing a custom InputFormat.

»   An InputSplit in Hadoop is user defined. The user can control the split size according to the size of data in the MapReduce program. Thus the number of map tasks is equal to the number of InputSplits.

»   The client (running the job) can calculate the splits for a job by calling getSplits(); they are then sent to the application master, which uses their storage locations to schedule map tasks that will process them on the cluster. The map task then passes the split to the createRecordReader() method on the InputFormat to get a RecordReader for the split, and the RecordReader generates records (key-value pairs), which it passes to the map function.
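A hedged sketch of overriding the split-size bounds on the Job object rather than editing mapred-site.xml; the 128 MB / 256 MB figures are only assumed example values.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeSketch {
    public static void applySplitBounds(Job job) {
        // Lower and upper bounds used by FileInputFormat when it computes splits
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);   // 128 MB
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);   // 256 MB
        // The equivalent configuration keys in newer Hadoop releases are
        // mapreduce.input.fileinputformat.split.minsize and .split.maxsize.
    }
}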
RecordReader
MapReduce has a simple model of data processing. Inputs and Outputs for the map and reduce
‘unctions are key-value pairs. The map and reduce functions in Hadoop MapReduce have the following
gen2ra. form ;
» map: (K1, V1) — list(K2, V2)
+ reduce: (K2, list(V2)) > list(K3, V3)
»	Now, before processing, the framework needs to know which data to process; this is achieved with the InputFormat class. InputFormat is the class which selects the file from HDFS that should be input to the map function. An InputFormat is also responsible for creating the InputSplits and dividing them into records. The data is divided into a number of splits; each such InputSplit is the input that is processed by a single map task.
»	The InputFormat class calls the getSplits() function, computes the splits for each file and then sends them to the JobTracker, which uses their storage locations to schedule map tasks to process them on the TaskTrackers. The map task then passes the split to the createRecordReader() method on the InputFormat in the TaskTracker to obtain a RecordReader for that split. The RecordReader loads data from its source and converts it into key-value pairs suitable for reading by the mapper.
»	The Hadoop RecordReader uses the data within the boundaries created by the InputSplit and creates key-value pairs for the mapper. The "start" is the byte position in the file where the RecordReader should start generating key/value pairs, and the "end" is the byte position at which it should stop reading records.
2.4.2 Reading Data with Custom RecordReader
Q18. Explain how to read data with a MapReduce RecordReader.
Ans :
A RecordReader is more than an iterator over records; the map task uses it to generate record key-value pairs, which it passes to the map function. We can see this by looking at the Mapper's run() method:
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}
After running setup(), nextKeyValue() is called repeatedly on the context to populate the key and value objects for the mapper. The key and value are retrieved from the RecordReader by way of the context and passed to the map() method to do its work. Each input to the map function, a key-value pair (K, V), gets processed as per the logic mentioned in the map code. When the reader reaches the end of its input split, the nextKeyValue() method returns false.
A RecordReader usually stays within the boundaries created by the InputSplit to generate key-value pairs, but this is not mandatory. A custom implementation can even read data outside its InputSplit, although this is not encouraged.
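For illustration only (the class names CustomInputFormat and CustomRecordReader are hypothetical and not from the text), a custom RecordReader is normally created by a matching custom InputFormat. The sketch below simply delegates to Hadoop's LineRecordReader and marks where custom record logic would go:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class CustomInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new CustomRecordReader();
    }

    public static class CustomRecordReader extends RecordReader<LongWritable, Text> {
        // delegate the low-level reading to the standard line reader
        private final LineRecordReader lineReader = new LineRecordReader();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException {
            lineReader.initialize(split, context);   // position the reader at the split's "start"
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            // custom parsing or filtering of each record would go here
            return lineReader.nextKeyValue();
        }

        @Override
        public LongWritable getCurrentKey() {
            return lineReader.getCurrentKey();
        }

        @Override
        public Text getCurrentValue() {
            return lineReader.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException {
            return lineReader.getProgress();
        }

        @Override
        public void close() throws IOException {
            lineReader.close();
        }
    }
}

The job would then use it with job.setInputFormatClass(CustomInputFormat.class).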
The RecordReader instance is defined by the InputFormat in use. By default the job uses TextInputFormat to convert data into key-value pairs; its LineRecordReader supplies the byte offset of each line as the key and the line itself as the value.
Data set example
Take the below example of a (dummy) data set where all records are separated by the same delimiter string:
pleff lorem monaq morel plaff lerom baple merol plff ipsum ponaq mipsu ploff pimsu
caple supim pluff sumip qonaq issum daple ussum ronaq ossom fap25 abcde tonag fghij
merol pliff ipsum ponaq mipsu ploff pimsu caple supim pluff sumip qonaq issum daple
ussum ronaq ossom faple abc75 tonaq fghij gaple kimno vonag parst haple uvwxy nonaq
qonag issum daple ussum ronaq ossom fal25 abcde
lerom baple merol pif ipsum ponaq mipsu ploff pimsu caple supim pluff sumih Qonaq
Implementation
In that case (records are always separated by the same "10-dash" string), the implementation is more or less available out of the box. Indeed, the default LineReader can take as an argument a recordDelimiterBytes byte array that can be retrieved or set directly from the Hadoop configuration. This parameter will be used as a string delimiter to separate distinct records.
Just make sure to set it up in your MapReduce driver code:

Configuration conf = new Configuration(true);
conf.set("textinputformat.record.delimiter", "----------");

...and to specify the default TextInputFormat for your MapReduce job's InputFormat:

Job job = new Job(conf);
job.setInputFormatClass(TextInputFormat.class);
Instead of processing one given line at a time, you should now be able to process a full multi-line record. Your mapper instances will be supplied the following keys/values:
»	Key is the offset (the location of your record's first line)
»	Value is the record itself
Note that the default delimiter is the newline (LF, optionally preceded by CR) character. Using the Hadoop default configuration, LineReader can therefore be seen as a record-by-record reader that uses a newline delimiter.
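With that delimiter in place, each map() call receives one whole multi-line record as its value. The following mapper is only an illustration (the class name RecordTokenCountMapper and the token-count logic are assumptions, not from the text):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordTokenCountMapper
        extends Mapper<LongWritable, Text, LongWritable, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key   = byte offset of the record's first line
        // value = the full record, possibly spanning several lines
        String[] tokens = value.toString().trim().split("\\s+");
        context.write(key, new IntWritable(tokens.length));
    }
}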
»	The MapReduce job checks that the output directory does not already exist.
»	OutputFormat provides the RecordWriter implementation to be used to write the output files of the job. Output files are stored in a FileSystem.
»	The FileOutputFormat.setOutputPath() method is used to set the output directory. Every Reducer writes a separate file into the common output directory (a driver sketch follows this list).
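A minimal driver sketch (class name, paths and key/value types are illustrative assumptions) showing how the OutputFormat and the output directory are wired in:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputSetupDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "output-setup");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // TextOutputFormat's RecordWriter writes "key<TAB>value" lines
        job.setOutputFormatClass(TextOutputFormat.class);

        // the directory must not exist yet; each reducer writes its own part-r-NNNNN file here
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
    }
}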
Types of OutputFormat in MapReduce
There are various types of Hadoop OutputFormat. Let us see some of them below :
1.	TextOutputFormat
	MapReduce's default OutputFormat is TextOutputFormat, which writes (key, value) pairs on individual lines of text files. Its keys and values can be of any type, since TextOutputFormat turns them into strings by calling toString() on them. Each key-value pair is separated by a tab character, which can be changed using the mapreduce.output.textoutputformat.separator property. KeyValueTextInputFormat is used for reading these output text files, since it breaks lines into key-value pairs based on a configurable separator.
2.	SequenceFileOutputFormat
	It is an OutputFormat which writes sequence files for its output. It is a compact, readily compressible intermediate format for use between MapReduce jobs: it rapidly serializes arbitrary data types to the file, and the corresponding SequenceFileInputFormat will deserialize the file into the same types and present the data to the next mapper in the same manner as it was emitted by the previous reducer. Compression is controlled by the static methods on SequenceFileOutputFormat.
SequenceFileAsBinaryOutputFormat
Itis another form of SequenceFilelnputFormat which writes keys and values to sequence file in binary
format.
4.	MapFileOutputFormat
	It is another form of FileOutputFormat in Hadoop, which is used to write output as map files. The keys in a MapFile must be added in order, so we need to ensure that the reducer emits keys in sorted order.
5.	MultipleOutputs
	It allows writing data to files whose names are derived from the output keys and values, or in fact from an arbitrary string.
6.	LazyOutputFormat
	Sometimes FileOutputFormat will create output files even if they are empty. LazyOutputFormat is a wrapper OutputFormat which ensures that the output file is created only when the first record is emitted for a given partition.
7.	DBOutputFormat
	DBOutputFormat in Hadoop is an OutputFormat for writing to relational databases and to HBase. It sends the reduce output to a SQL table. It accepts key-value pairs, where the key has a type extending DBWritable. The returned RecordWriter writes only the key to the database with a batch SQL query.
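As a small illustration of two of the formats above (the named output "errors" and the key/value classes are assumptions for the example), LazyOutputFormat wraps another OutputFormat in the driver, while MultipleOutputs is declared in the driver and then written to from inside the reducer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MultiOutputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multi-output");

        // create part files only when the first record is actually emitted
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

        // declare an extra named output; inside the reducer it is written with
        // new MultipleOutputs<Text, IntWritable>(context).write("errors", key, value)
        MultipleOutputs.addNamedOutput(job, "errors",
                TextOutputFormat.class, Text.class, IntWritable.class);
    }
}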
2.4.4 Customizing Data with RecordWriter
Q20. Explain the use of the Hadoop RecordWriter.
Ans :
Hadoop RecordWriter
As we know, the Reducer takes as input a set of intermediate key-value pairs produced by the mapper and runs a reducer function on them to generate output that is again zero or more key-value pairs.
RecordWriter writes these output key-value pairs from the Reducer phase to the output files. On the OutputFormat, getRecordWriter() provides the RecordWriter, checkOutputSpecs() checks the validity of the output specification for the job, and getOutputCommitter() returns an OutputCommitter for the job.
The RecordWriter writes key/value pairs to an output file:

public abstract class RecordWriter<K, V> {
  /**
   * Writes a key/value pair.
   * @param key the key to write.
   * @param value the value to write.
   */
  public abstract void write(K key, V value)
      throws IOException, InterruptedException;

  /**
   * Close this RecordWriter to future operations.
   * @param context the context of the task
   */
  public abstract void close(TaskAttemptContext context)
      throws IOException, InterruptedException;
}
A good example of the RecordWriter and the OutputFormat is illustrated in the class TextOutputFormat, where the LineRecordWriter is defined as follows.
public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {
  public static String SEPERATOR = "mapreduce.output.textoutputformat.separator";

  protected static class LineRecordWriter<K, V> extends RecordWriter<K, V> {
    private static final String utf8 = "UTF-8";
    private static final byte[] newline;

    static {
      try {
        // use "\n" as the newline delimiter
        newline = "\n".getBytes(utf8);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + utf8 + " encoding");
      }
    }

    protected DataOutputStream out;
    private final byte[] keyValueSeparator;
    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
      this.out = out;
      try {
        // set the key value separator
        this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + utf8 + " encoding");
      }
    }

    // if no key value separator is defined, use tab as the separator
    public LineRecordWriter(DataOutputStream out) {
      this(out, "\t");
    }
    /**
     * Write the object to the byte stream, handling Text as a special
     * case.
     * @param o the object to print
     * @throws IOException if the write throws, we pass it on
     */
    private void writeObject(Object o) throws IOException {
      if (o instanceof Text) {
        Text to = (Text) o;
        out.write(to.getBytes(), 0, to.getLength());
      } else {
        out.write(o.toString().getBytes(utf8));
      }
    }
    public synchronized void write(K key, V value)
        throws IOException {
      boolean nullKey = key == null || key instanceof NullWritable;
      boolean nullValue = value == null || value instanceof NullWritable;
      // return if both the key and the value are null
      if (nullKey && nullValue) {
        return;
      }
      // write the key first
      if (!nullKey) {
        writeObject(key);
      }
      // write the separator
      if (!(nullKey || nullValue)) {
        out.write(keyValueSeparator);
      }
      // write the value, then the newline delimiter
      if (!nullValue) {
        writeObject(value);
      }
      out.write(newline);
    }
  }
}
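Following the same pattern, a custom OutputFormat can return its own RecordWriter and then be plugged into a job with job.setOutputFormatClass(...). The sketch below is illustrative only (the class name TabSeparatedOutputFormat and the ".txt" extension are assumptions, not from the text):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TabSeparatedOutputFormat<K, V> extends FileOutputFormat<K, V> {

    @Override
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job)
            throws IOException, InterruptedException {
        // one output file per task in the job's work directory
        Path file = getDefaultWorkFile(job, ".txt");
        FileSystem fs = file.getFileSystem(job.getConfiguration());
        final FSDataOutputStream out = fs.create(file, false);

        // write "key<TAB>value\n" for every record handed to the writer
        return new RecordWriter<K, V>() {
            @Override
            public void write(K key, V value) throws IOException {
                out.writeBytes(key.toString() + "\t" + value.toString() + "\n");
            }

            @Override
            public void close(TaskAttemptContext context) throws IOException {
                out.close();
            }
        };
    }
}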
Q21. Explain about the MapReduce Combiner.
Ans :
When we run a MapReduce job on a large dataset, large chunks of intermediate data are generated by the Mapper, and this intermediate data is passed on to the Reducer for further processing, which leads to enormous network congestion. The MapReduce framework provides a function known as the Combiner that plays a key role in reducing this network congestion.
The Combiner in MapReduce is also known as the 'mini-reducer'. The primary job of the Combiner is to process the output data from the Mapper before passing it to the Reducer. It runs after the Mapper and before the Reducer, and its use is optional.
Fig.: MapReduce program without Combiner (the mappers emit a total of 9 keys of intermediate data, all of which are sent to the reducer)
In the above diagram, no combiner is used. The input is split across two mappers and 9 keys are generated by the mappers. We now have 9 key/value pairs of intermediate data; the mappers send all of this data directly to the reducer, and sending it consumes network bandwidth (bandwidth here means the time taken to transfer data between two machines). It will take more time to transfer the data to the reducer if the size of the data is big.
Now, if we use a combiner between the mapper and the reducer, the combiner aggregates the intermediate data (9 key/value pairs) before sending it to the reducer and generates only 4 key/value pairs as output.
Fig.: MapReduce program with Combiner between Mapper and Reducer (the combiner condenses the 9 intermediate keys to 4 before they reach the reducer)
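A common way to get this behaviour is to reuse the reducer class as the combiner, which is valid when the reduce function is commutative and associative (summing, as in this word-count style sketch; the class names are illustrative assumptions, not from the text):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class CombinerSetup {

    // sums the counts per key; safe to run both as the combiner and as the reducer
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "with-combiner");
        job.setCombinerClass(SumReducer.class);   // runs on map output before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }
}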