UNIT – III
Map Reduce Technique
What is MapReduce?
MapReduce is a data processing tool used to process
data in parallel in a distributed fashion. It was
introduced in 2004 in the paper
"MapReduce: Simplified Data Processing on Large Clusters,"
published by Google.
MapReduce is a paradigm with two phases: the
mapper phase and the reducer phase. In the mapper, the
input is given in the form of key-value pairs. The output of the
mapper is fed to the reducer as input, and the reducer runs only
after the mapper is over. The reducer also takes its input in key-
value format, and the output of the reducer is the final output.
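For example, in a word-count job the data flows through the two phases as follows (a simplified trace over made-up input):

Input records:   "deer bear river"  "car car river"
Mapper output:   (deer,1) (bear,1) (river,1) (car,1) (car,1) (river,1)
Reducer input:   (bear,[1]) (car,[1,1]) (deer,[1]) (river,[1,1])
Reducer output:  (bear,1) (car,2) (deer,1) (river,2)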
MapReduce
MapReduce facilitates concurrent processing by
splitting petabytes of data into smaller chunks and
processing them in parallel on Hadoop commodity
servers. In the end, it aggregates the data from the
multiple servers and returns a consolidated output to
the application.
How Does MapReduce Work?
The MapReduce algorithm contains two important tasks, namely Map
and Reduce.
The Map task takes a set of data and converts it into another set of
data, where individual elements are broken down into tuples (key-value
pairs).
The Reduce task takes the output from the Map as an input and
combines those data tuples (key-value pairs) into a smaller set of tuples.
The reduce task is always performed after the map job.
Input Phase − Here we have a Record Reader that translates each record in an
input file and sends the parsed data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-value pairs
and processes each one of them to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known
as intermediate keys.
Combiner − A combiner is a type of local Reducer that groups similar data from
the map phase into identifiable sets. It takes the intermediate keys from the
mapper as input and applies user-defined code to aggregate the values within the
small scope of one mapper. It is not part of the main MapReduce algorithm; it
is optional.
Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It
downloads the grouped key-value pairs onto the local machine, where the
Reducer is running. The individual key-value pairs are sorted by key into a larger
data list. The data list groups the equivalent keys together so that their values can
be iterated easily in the Reducer task.
Reducer − The Reducer takes the grouped key-value paired data as input and
runs a Reducer function on each group. Here, the data can be aggregated,
filtered, and combined in a number of ways, which may require a wide range of
processing. Once the execution is over, it gives zero or more key-value pairs to the
final step.
Output Phase − In the output phase, we have an output formatter that
translates the final key-value pairs from the Reducer function and writes them
onto a file using a record writer.
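All of these phases can be seen together in the classic word-count job. The following is a minimal sketch using the Hadoop Java API; it follows the standard WordCount example that ships with Hadoop, with the reducer doubling as the optional combiner.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for every word in the input line, emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word; also usable as a combiner.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // optional local Reducer
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}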
Anatomy of a Map Reduce Job Run
The MapReduce job run process mainly consists of:
JOB SUBMISSION
JOB INITIALIZATION
TASK ASSIGNMENT
TASK EXECUTION
PROGRESS AND STATUS UPDATES
JOB COMPLETION
FAILURES
A MapReduce job can be run with a single method call: submit() on
a Job object (you can also call waitForCompletion(), which submits
the job if it has not been submitted already and then waits for it
to finish).
Let’s understand the components –
Client: Submits the MapReduce job.
Yarn node manager: Launches and monitors the
compute containers on machines in the cluster.
Yarn resource manager: Coordinates the allocation of compute
resources on the cluster.
MapReduce application master: Coordinates the tasks running
the MapReduce job.
Distributed Filesystem: Shares job files with other entities.
How is a Job Submitted?
The submit() method creates an internal JobSubmitter instance and
calls submitJobInternal() on it. Having submitted the job,
waitForCompletion() polls the job's progress once per second. The
submission process then proceeds as follows:
The resource manager is asked for a new application ID, which is
used for the MapReduce job ID.
The output specification of the job is checked. For example, if the
output directory has not been specified or it already exists, the job
is not submitted and an error is thrown to the MapReduce program.
The input splits for the job are computed. If the splits cannot be
computed (for example, because the input paths do not exist), the job
is not submitted and an error is thrown to the MapReduce program.
The resources needed to run the job are copied to the shared
filesystem, in a directory named after the job ID. These include the
job JAR file, the configuration file, and the computed input splits.
The job JAR is copied with a high replication factor, controlled by
the mapreduce.client.submit.file.replication property, so that there
are a number of copies across the cluster for the node managers to access.
Finally, the job is submitted to the resource manager by calling
submitApplication().
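As a small sketch, the client side of this process looks like the following (the job configuration is elided and the job name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "submission demo");
    // ... mapper, reducer, and input/output setup elided ...
    // Either submit asynchronously and return at once:
    job.submit();
    // ...or submit (if not already submitted) and block until completion,
    // reporting progress roughly once per second:
    // boolean ok = job.waitForCompletion(true);
  }
}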
Failures
In the real world, user code can crash, can be full of bugs, or the
machine itself can fail. Hadoop's ability to handle such failures and
still let the job complete successfully is one of its biggest
benefits. Any of the following components can fail:
Application master
Node manager
Resource manager
Task
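For task failures specifically, the framework retries a failed task a configurable number of times before failing the whole job. A minimal sketch of the relevant knobs (these are the standard MRv2 property names; the values shown are illustrative):

import org.apache.hadoop.conf.Configuration;

public class FailureTuning {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // A map or reduce task is re-attempted up to this many times
    // (4 is the usual default) before it is marked as failed.
    conf.setInt("mapreduce.map.maxattempts", 4);
    conf.setInt("mapreduce.reduce.maxattempts", 4);
    // Optionally let the job succeed even if up to 5% of map tasks fail.
    conf.setInt("mapreduce.map.failures.maxpercent", 5);
  }
}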
Shuffle and Sorting
In this lesson, we will learn about MapReduce
shuffling and sorting. Here we offer a detailed description
of the Hadoop shuffle and sort phase. First we will discuss
what MapReduce shuffling is, then MapReduce sorting, and
then the secondary sorting technique.
Shuffling is the process of transferring the mapper's
intermediate output to the reducers. Each reducer gets one or more
keys and their associated values, depending on the number of
reducers. The intermediate key-value pairs generated by the mapper
are sorted automatically by key. In the sort phase, merging and
sorting of the map output takes place.
Shuffling in MapReduce
The process of moving data from the mappers to the reducers is
shuffling. Shuffling is also the process by which the system
performs the sort and then moves the map output to the
reducer as input. This is why the shuffle phase is
required for the reducers; without it, they would not have any
input. Shuffling can begin even before the map phase has
finished, which saves some time and completes the job sooner.
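Which reducer a given key is shuffled to is decided by the job's partitioner (Hadoop's default is HashPartitioner). As an illustration of the idea, a custom partitioner can be sketched as follows (the class name and routing rule are made up for this example):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative rule: keys with the same first character always go to
// the same reducer. The default HashPartitioner instead computes
// (key.hashCode() & Integer.MAX_VALUE) % numPartitions.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
  }
}

It is plugged into a job with job.setPartitionerClass(FirstLetterPartitioner.class).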
Sorting in MapReduce
The MapReduce framework automatically sorts the keys
generated by the mapper. Therefore, before the reducer
starts, all intermediate key-value pairs are sorted by key
and not by value. The values transferred to each reducer are
not sorted; they can be in any order.
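If the value order does matter, the usual remedy is a secondary sort: part of the value is moved into a composite key, and the job is told how to partition, sort, and group that key. In outline (the three setter methods are the real Job API; the partitioner and comparator classes are assumed to be written by the user):

// Sketch of the three knobs a secondary sort wires together:
job.setPartitionerClass(NaturalKeyPartitioner.class);      // partition on the natural key only
job.setSortComparatorClass(CompositeKeyComparator.class);  // order by natural key, then by the value part
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class); // one reduce() call per natural key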
Map Reduce Types and Formats
MapReduce is the processing unit of Hadoop; it is used to
process the data stored in Hadoop.
The MapReduce task works on <Key, Value> pair.
Two main features of MapReduce are its parallel programming
model and its large-scale distributed model.
MapReduce allows for the distributed processing of the map
and reduction operations.
○ Map procedure (transform): Performs a filtering and sorting
operation.
○ Reduce procedure (aggregate): Performs a summary
operation.
MapReduce Workflow:
The Mapper class's KEYIN must be consistent with the input format (inputformat.class).
The Mapper class's KEYOUT must be consistent with the map output key class (mapreduce.map.output.key.class), and so on.
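For example, with the default TextInputFormat the mapper's input types must be LongWritable and Text, and its output types must agree with the declared map output classes (a minimal sketch):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// TextInputFormat delivers (byte offset, line) records, so
// KEYIN = LongWritable and VALUEIN = Text here.
public class TypedMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  // KEYOUT = Text and VALUEOUT = IntWritable must agree with:
  //   job.setMapOutputKeyClass(Text.class);
  //   job.setMapOutputValueClass(IntWritable.class);
}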
Formats
MapReduce formats are basically classified into two types; these are:
Input formats
Text input format
Binary input format
Multiple input formats
DB input format
Output formats
Text output format
Binary output format
Multiple output formats
Lazy output format
DB output format
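Choosing among these formats is a one-line call for each side of the job. A minimal sketch using the text formats (the paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "format demo");
    job.setInputFormatClass(TextInputFormat.class);   // text input: one (offset, line) pair per line
    job.setOutputFormatClass(TextOutputFormat.class); // text output: tab-separated key/value lines
    FileInputFormat.addInputPath(job, new Path("/input"));
    FileOutputFormat.setOutputPath(job, new Path("/output"));
  }
}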