
UNIT – III

MapReduce Technique


What is MapReduce?
MapReduce is a data processing tool used to process data in parallel across a distributed system. It was introduced in 2004 in the Google paper "MapReduce: Simplified Data Processing on Large Clusters."
MapReduce is a paradigm with two phases: the mapper phase and the reducer phase. In the mapper, the input is given in the form of key-value pairs. The output of the mapper is fed to the reducer as input, and the reducer runs only after all mappers have finished. The reducer also takes its input in key-value format, and the output of the reducer is the final output.
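For example, in a word count over the line "to be or not to be", the mapper emits the pairs (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1), and the reducer sums the values for each key to produce (be, 2), (not, 1), (or, 1), (to, 2).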
MapReduce
MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks and processing them in parallel on Hadoop commodity servers. At the end, it aggregates the data from all the servers and returns a consolidated output to the application.
How MapReduce Works

The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).
The Reduce task takes the output from the Map as input and combines those data tuples (key-value pairs) into a smaller set of tuples. The Reduce task is always performed after the Map task.
Input Phase − A Record Reader translates each record in the input file and sends the parsed data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function that takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
Combiner − A combiner is a type of local reducer that groups similar data from the map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies user-defined code to aggregate the values within the small scope of one mapper. It is not a part of the main MapReduce algorithm; it is optional.
Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value pairs onto the local machine where the reducer is running. The individual key-value pairs are sorted by key into a larger data list, which groups equivalent keys together so that their values can be iterated easily in the Reducer task.
Reducer − The reducer takes the grouped key-value data as input and runs a reducer function on each group. Here the data can be aggregated, filtered, and combined in a number of ways. Once execution is over, it passes zero or more key-value pairs to the final step.
Output Phase − An output formatter translates the final key-value pairs from the reducer function and writes them to a file using a record writer.
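To make these phases concrete, here is a minimal word-count sketch in the Hadoop Java MapReduce API. The class names WordCountMapper and WordCountReducer are illustrative (they do not come from these slides), and each class would normally live in its own source file.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for every word in an input line, emit the intermediate pair (word, 1).
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

// Reducer: sum the counts for each word. Because summing is associative,
// this same class can also be registered as a combiner.
public class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}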
Anatomy of a MapReduce Job Run

A MapReduce job run mainly depends on the following stages:

JOB SUBMISSION
JOB INITIALIZATION
TASK ASSIGNMENT
TASK EXECUTION
PROGRESS AND STATUS UPDATES
JOB COMPLETION
FAILURES
Anatomy of a MapReduce Job Run
You can run a MapReduce job with a single method call: submit() on a Job object (you can also call waitForCompletion(), which submits the job if it hasn't been submitted already, then waits for it to finish).
Let's understand the components:
Client: Submits the MapReduce job.
YARN node manager: Launches and monitors the compute containers on machines in the cluster.
YARN resource manager: Coordinates the allocation of compute resources on the cluster.
MapReduce application master: Coordinates the tasks running the MapReduce job.
Distributed filesystem: Used for sharing job files between the other entities.
How to Submit a Job?
The submit() method creates an internal JobSubmitter instance and calls submitJobInternal() on it. Having submitted the job, waitForCompletion() polls the job's progress once per second. Job submission then proceeds as follows:
The resource manager is asked for a new application ID, which is used as the MapReduce job ID.
The output specification of the job is checked. For example, if the output directory has not been specified, or if it already exists, the job is not submitted and an error is thrown to the MapReduce program.
The input splits for the job are computed. If the splits cannot be computed, the job is not submitted and an error is thrown to the MapReduce program.
The resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, are copied to the shared filesystem in a directory named after the job ID. The job JAR is copied with a high replication factor (controlled by the mapreduce.client.submit.file.replication property) so that there are plenty of copies across the cluster for the node managers to access.
Finally, the job is submitted to the resource manager by calling submitApplication().
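A minimal driver sketch, assuming the illustrative WordCountMapper and WordCountReducer classes from the earlier example, shows the waitForCompletion() call described above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class);  // optional local reducer
    job.setReducerClass(WordCountReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // waitForCompletion() submits the job if it hasn't been submitted
    // already, then polls its progress once per second until it finishes.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}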
Failures
In the real world, user code can be full of bugs, processes can crash, and machines can fail. Hadoop's ability to handle such failures and still allow the job to complete successfully is one of the biggest benefits of using it. Any of the following components can fail:

Application master
Node manager
Resource manager
Task
Shuffle and Sorting
In this lesson, we will learn about MapReduce shuffling and sorting, with a detailed description of the Hadoop shuffle and sort phases. First we will discuss what MapReduce shuffling is, then MapReduce sorting, and then the secondary sorting technique.
Shuffling is the process of transferring the mappers' intermediate output to the reducers. Each reducer receives one or more keys and their associated values. The intermediate key-value pairs generated by the mappers are sorted automatically by key; in the sort phase, merging and sorting of the map output takes place.
Shuffling in MapReduce
Shuffling is the process of moving data from the mappers to the reducers. It is also the process by which the system performs the sort before moving the map output to the reducers as input. This is why the shuffle phase is required for the reducers: without it, they would have no input (or input from every mapper). Shuffling can begin even before the map phase has finished, which saves time and lets the job complete sooner.
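During the shuffle, a partitioner decides which reducer each intermediate key is sent to; Hadoop's default HashPartitioner hashes the key. A minimal custom partitioner sketch (the class name is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate key to a reduce partition. All pairs with the same
// key land in the same partition, so one reducer sees all of that key's values.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered on the job with job.setPartitionerClass(WordPartitioner.class).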
Sorting in MapReduce
The MapReduce framework automatically sorts the keys generated by the mappers. Therefore, before the reducer starts, all intermediate key-value pairs are sorted by key, not by value. The framework does not sort the values passed to each reducer; they can arrive in any order.
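When the order of values does matter, the standard technique is a secondary sort: the value (or part of it) is folded into a composite key, and custom comparators control sorting and grouping. A sketch of the job wiring in the driver, where CompositeKeyComparator, NaturalKeyGroupingComparator, and NaturalKeyPartitioner are hypothetical placeholder classes rather than library classes:

// Sort composite keys (natural key plus value part) before the reduce phase.
job.setSortComparatorClass(CompositeKeyComparator.class);           // hypothetical
// Group records into one reduce() call by the natural key only.
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class); // hypothetical
// Partition by the natural key so all of its records reach the same reducer.
job.setPartitionerClass(NaturalKeyPartitioner.class);               // hypothetical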
MapReduce Types and Formats

• MapReduce is the processing unit of Hadoop, by means of which the data stored in Hadoop is processed.
• The MapReduce task works on <Key, Value> pairs.
• Two main features of MapReduce are its parallel programming model and its large-scale distributed model.
• MapReduce allows for the distributed processing of the map and reduce operations.
○ Map procedure (transform): performs a filtering and sorting operation.
○ Reduce procedure (aggregate): performs a summary operation.
• MapReduce workflow: the Mapper class's KEYIN must be consistent with the job's input format, and its KEYOUT must be consistent with the map output key class declared on the job, as sketched below.
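A sketch of this consistency rule, reusing the illustrative word-count classes from earlier; these lines belong in the job driver:

// TextInputFormat supplies (LongWritable offset, Text line) records, so the
// mapper's KEYIN/VALUEIN generics must be LongWritable and Text.
job.setInputFormatClass(TextInputFormat.class);
job.setMapperClass(WordCountMapper.class);      // Mapper<LongWritable, Text, Text, IntWritable>
// The mapper's KEYOUT/VALUEOUT must match the map output classes declared on the job.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);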
Formats

MapReduce formats are basically classified into two types:

Input formats
Text input format
Binary input format
Multiple input formats
DB input format
Output formats
Text output format
Binary output format
Multiple output formats
Lazy output format
DB output format
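A short sketch of wiring several of these formats onto a job; the paths and mapper class names are illustrative, and these lines belong in a driver like the one shown earlier:

// Multiple input formats (org.apache.hadoop.mapreduce.lib.input.MultipleInputs):
// each input path gets its own input format and its own mapper.
MultipleInputs.addInputPath(job, new Path("/data/plain"),
    TextInputFormat.class, PlainTextMapper.class);         // illustrative mapper
MultipleInputs.addInputPath(job, new Path("/data/seq"),
    SequenceFileInputFormat.class, SequenceMapper.class);  // illustrative mapper

// Lazy output format (org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat):
// only create an output file once the first record is actually written.
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);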
