Introduction: MapReduce
•   The concept of MapReduce was pioneered by Google.
   •   The original paper titled "MapReduce: Simplified Data Processing on Large
       Clusters" was written by Jeffrey Dean and Sanjay Ghemawat, and it was
       published in 2004.
   •   In the paper, they introduced the MapReduce programming model and
       described its implementation at Google for processing large-scale data across
       distributed clusters.
   •   MapReduce became a fundamental framework for distributed computing and
       played a significant role in the development of big data technologies.
   •   While Google introduced the concept, the open-source Apache Hadoop project
       later implemented its own version of MapReduce, making it accessible to a
       broader community of developers and organizations.
        Prerequisites that can help you grasp MapReduce more effectively
1. Programming Languages:
   •   Proficiency in a programming language is crucial.
   •   Java is commonly used in the Hadoop ecosystem, and many MapReduce
       examples are written in Java.
   •   Knowledge of Python can also be useful.
2. Distributed Systems:
   •   Understanding the basics of distributed computing is essential.
   •   Familiarize yourself with concepts like nodes, clusters, parallel processing, and
       the challenges associated with distributed systems.
3. Hadoop Ecosystem:
   •   MapReduce is often associated with the Hadoop framework.
   •   Therefore, it's helpful to have a basic understanding of Hadoop and its
       ecosystem components, such as HDFS (Hadoop Distributed File System) and
       YARN (Yet Another Resource Negotiator).
4. Basic Understanding of Big Data:
   •   MapReduce is commonly used in the context of big data processing.
   •   It's beneficial to have a foundational understanding of what constitutes "big
       data," the challenges associated with large datasets, and the motivation behind
       distributed computing for big data.
5. Linux/Unix Commands:
   •   Many big data platforms, including Hadoop, are typically deployed on Unix-
       like systems.
   •   Familiarity with basic command-line operations in a Unix environment can be
       helpful for interacting with Hadoop clusters.
6. SQL (Structured Query Language):
   •   If you are planning to use tools like Apache Hive, which provides a SQL-like
       interface for querying data in Hadoop, a basic understanding of SQL can be
       beneficial.
7. Concepts of Data Storage and Retrieval:
   •   Understanding how data is stored and retrieved in a distributed environment
       is crucial.
   •   Concepts like Sharding, replication, and indexing are relevant.
8. Algorithmic and Problem-Solving Skills:
   •   MapReduce involves breaking down problems into smaller tasks that can be
       executed in parallel.
   •   Strong algorithmic and problem-solving skills are valuable for designing
       efficient MapReduce jobs.
                                     Explanation
Q: Describe MapReduce Execution steps with a neat diagram 12 M
   •   MapReduce is a programming model and processing technique designed for
       processing and generating large datasets that can be parallelized across a
       distributed cluster of computers.
   •   A job means a MapReduce program.
   •   Each job consists of several smaller units, called MapReduce tasks.
   •   The basic idea behind MapReduce is to divide a large computation into smaller
       tasks that can be performed in parallel across multiple nodes in a cluster.
In a MapReduce job
   1. The data is split into smaller chunks, and a "map" function is applied to each
       chunk independently.
   2. The results are then shuffled and sorted, and a "reduce" function is applied to
       combine the intermediate results into the final output.
The MapReduce programming approach allows for efficient processing of large datasets
in a distributed computing environment, as the local sketch below illustrates.
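To make the flow concrete, here is a minimal plain-Java sketch that simulates split, map,
shuffle/sort, and reduce on an in-memory list. It is an illustration only, not an actual
Hadoop job; the class and variable names are chosen just for this example.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// A local, single-JVM simulation of the MapReduce flow, for illustration only.
public class LocalMapReduceSketch {
    public static void main(String[] args) {
        // 1. The input is split into smaller chunks.
        List<String> chunks = List.of("hello Hadoop hi Hadoop", "Hello MongoDB hi Cassandra Hadoop");

        // 2. Map phase: each chunk is processed independently, emitting intermediate (word, 1) pairs.
        List<Map.Entry<String, Integer>> intermediate = chunks.stream()
                .flatMap(chunk -> Arrays.stream(chunk.split("\\s+")).map(word -> Map.entry(word, 1)))
                .collect(Collectors.toList());

        // 3. Shuffle and sort: group the intermediate values by key (TreeMap keeps keys sorted).
        Map<String, List<Integer>> grouped = intermediate.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey, TreeMap::new,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // 4. Reduce phase: combine each group's list of values into the final (word, count) output.
        grouped.forEach((word, ones) ->
                System.out.println(word + "\t" + ones.stream().mapToInt(Integer::intValue).sum()));
    }
}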
JobTracker and Task Tracker
   •   MapReduce consists of a single master JobTracker and one slave TaskTracker
       per cluster node.
   •   The master is responsible for scheduling the component tasks in a job onto the
       slaves, monitoring them and re-executing the failed tasks.
   •   The slaves execute the tasks as directed by the master.
   •   The MapReduce framework operates exclusively on (key, value) pairs.
   •   The framework views the input to the task as a set of (key, value) pairs and
       produces a set of (key, value) pairs as the output of the task, possibly of
       different types, as the sketch below summarizes.
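The change of key-value types from (k1, v1) to (k2, v2) to (k3, v3) can be summarized
with the following simplified generic signatures. This is a conceptual sketch only; the
actual Hadoop API uses the Mapper and Reducer base classes shown later.

import java.util.List;
import java.util.Map;

// Conceptual type contract only; not the real Hadoop interfaces.
interface MapFunction<K1, V1, K2, V2> {
    List<Map.Entry<K2, V2>> map(K1 key, V1 value);            // map:    (k1, v1) -> list(k2, v2)
}

interface ReduceFunction<K2, V2, K3, V3> {
    List<Map.Entry<K3, V3>> reduce(K2 key, List<V2> values);  // reduce: (k2, list(v2)) -> list(k3, v3)
}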
Map-Tasks
A map task is a task that implements the map() function,
which runs user application code for each key-value pair (k1, v1).
   •   Key k1 is a set of keys.
   •   Key k1 maps to a group of data values.
   •   Values v1 are large strings read from the input file(s).
   •   The output of map() is either zero pairs (when no values are found) or a set of
       intermediate key-value pairs (k2, v2), as in the word-count mapper sketch below.
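As a concrete sketch, a word-count map() written against the Hadoop Java API typically
looks like the standard example below; the class name TokenizerMapper follows the common
tutorial example.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// For each input pair (k1 = byte offset, v1 = line of text), emit (k2 = word, v2 = 1).
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);   // intermediate key-value pair (k2, v2)
        }
    }
}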
Reduce Task
   •   Refers to a task which takes the output (k2, v2) from the map as input and
       combines those data pieces into a smaller set of data, often with the help of a
       combiner.
   •   The reduce task is always performed after the map task.
Key-Value Pair
Each phase (Map phase and Reduce phase) of MapReduce has key-value pairs as
input and output.
Data should be first converted into key-value pairs before it is passed to the Mapper,
as the Mapper only understands key-value pairs of data.
Key-value pairs in Hadoop MapReduce are generated as follows:
   •   InputSplit - Defines a logical representation of the data and presents the split
       data for processing by an individual map().
   •   RecordReader - Communicates with the InputSplit and converts the split into
       records, which are in the form of key-value pairs in a format suitable for reading
       by the Mapper.
   •   RecordReader uses TextInputFormat by default for converting data into key-value
       pairs, as illustrated below.
   •   RecordReader communicates with the InputSplit until the file is read.
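For instance, with the default TextInputFormat, each record handed to the Mapper is a
(byte offset, line of text) pair. The following rough local sketch (plain Java, not Hadoop
code) shows what those records look like for a small split:

// Rough illustration of what TextInputFormat's RecordReader emits:
// key = byte offset of the line within the split, value = the line's text.
public class RecordReaderSketch {
    public static void main(String[] args) {
        String split = "hello Hadoop\nhi Hadoop\nHello MongoDB\n";
        long offset = 0;
        for (String line : split.split("\n")) {
            System.out.println("(" + offset + ", \"" + line + "\")");  // (key, value) for the Mapper
            offset += line.getBytes().length + 1;                      // +1 for the newline
        }
    }
}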
Grouping by Key
   •   When a map task completes, the shuffle process aggregates (combines) all the
       Mapper outputs by grouping the keys of the Mapper output, and each value v2 is
       appended to a list of values.
   •   A "Group By" operation on the intermediate keys creates list(v2).
Shuffle and Sorting Phase
   •   All pairs with the same group key (k2) are collected and grouped together,
       creating one group for each key.
   •   The shuffle output format will be a list of (k2, list(v2)) pairs. Thus, a different
       subset of the intermediate key space is assigned to each reduce node, as the
       partitioning sketch below illustrates.
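The assignment of intermediate keys to reduce nodes is usually done by hashing the key.
A minimal sketch of that idea, mirroring what Hadoop's default HashPartitioner does:

// Decide which reduce task receives a given intermediate key (k2).
public class PartitionSketch {
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;  // non-negative bucket
    }

    public static void main(String[] args) {
        int reducers = 3;
        for (String k2 : new String[] {"hello", "Hadoop", "hi", "MongoDB", "Cassandra"}) {
            System.out.println(k2 + " -> reducer " + partitionFor(k2, reducers));
        }
    }
}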
Reduce Tasks
   •   Implements reduce(), which takes the shuffled and sorted Mapper output, grouped
       by key as (k2, list(v2)), and applies the reduce function in parallel to each group.
   •   The reduce function iterates over the list of values associated with a key and
       produces outputs such as aggregations and statistics.
   •   The reduce function emits zero or more key-value pairs (k3, v3) to the final
       output file, as in the reducer sketch below. Reduce: (k2, list(v2)) -> list(k3, v3)
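As a concrete sketch, the matching word-count reduce() in the Hadoop Java API is
typically written as below; the class name IntSumReducer follows the common tutorial
example.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// For each group (k2 = word, list(v2) = counts), emit (k3 = word, v3 = total count).
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();           // iterate over the list of values for this key
        }
        result.set(sum);
        context.write(key, result);     // final output pair (k3, v3)
    }
}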
                        MapReduce Implementation
•   MapReduce is a programming model and processing technique for handling
    large datasets in a parallel and distributed fashion.
•   The word count problem is a classic example of a task that can be solved using
    MapReduce.
   •   Below is a mathematical representation of the MapReduce algorithm for the
       word count problem.
Example:
Step 1: Input Document:
D="hello Hadoop, hi Hadoop, Hello MongoDB, hi Cassandra Hadoop"
Step 2: Map Function:
The Map function processes each word in the document and emits key-value pairs
where the key is the word, and the value is 1 (indicating the count).
Map("hello”) →{("hello",1)},
Map("Hadoop”) →{("Hadoop",1), ("Hadoop",1), ("Hadoop",1)},
Map("hi”) →{("hi",1), ("hi",1)}, …
Step 3: Shuffle and Sort (Grouping by Key):
Group and sort the intermediate key-value pairs by key.
("hello", [1]), ("Hadoop", [1,1,1]), ("hi", [1,1]), …
Step 4: Reduce Function:
The Reduce function takes each unique key and the list of values and calculates the
sum.
Reduce ("hello", [1]) →{("hello",1)},
Reduce ("Hadoop", [1,1,1]) →{("Hadoop",3)},
Reduce ("hi", [1,1]) →{("hi",2)}, …
Step 5: Final Output:
{("hello",1), ("Hadoop",3), ("hi",2), ("Hello",1), ("MongoDB",1), ("Cassandra",1)}