Big Data Technologies
(IS 365)
                                    Lecture 4
                                   MapReduce
                                      Dr. Wael Abbas
                                       2024 - 2025
All slides in this file are based on the following book: Tom White (2015). Hadoop: The Definitive
Guide, 4th Edition. O'Reilly Media, Inc.
            Reading data in MapReduce
   Hadoop can process many different types of data formats, from flat text
    files to databases.
 There are three main Java classes provided in Hadoop to read data in
    MapReduce:
      1. InputSplit
     2. RecordReader
     3. InputFormat
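These three abstractions fit together through the InputFormat contract: the InputFormat computes
the InputSplits for a job and supplies the RecordReader that turns each split into (key, value)
records. As a reference, the two abstract methods of org.apache.hadoop.mapreduce.InputFormat look
roughly like this (a sketch of the framework class, shown here for orientation only):

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public abstract class InputFormat<K, V> {
  // Computes the logical splits of the input; one map task runs per split.
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  // Creates the RecordReader that turns a split into (key, value) records,
  // which are then passed to the map function.
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}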
                  MapReduce : InputFormat
Common InputFormats:

1. TextInputFormat (Text files): the default format; reads lines of text files.
   Key: the byte offset of the line. Value: the line contents.
2. KeyValueTextInputFormat (Text files): parses each line into a (key, value) pair.
   Key: everything up to the first tab character. Value: the remainder of the line.
3. NLineInputFormat (Text files): each mapper receives a fixed number of lines of input.
   Key: the byte offset of the line. Value: the line contents.
4. SequenceFileInputFormat (Binary files): a Hadoop-specific, high-performance binary format.
   Key: user-defined. Value: user-defined.
                   MapReduce : InputFormat
    Text input format
 TextInputFormat is the default InputFormat.
   Each record is a line of input.
 The key, a LongWritable, is the byte offset within the file of the beginning
    of the line.
 The value is the contents of the line, excluding any line terminators (e.g.,
    newline or carriage return), and is packaged as a Text object. So, a file
    containing the following text:
                          On the top of the Crumpetty Tree
                              The Quangle Wangle sat,
                            But his face you could not see,
                            On account of his Beaver Hat.
                  MapReduce : InputFormat
Text input format
The text is divided into one split of four records. The records are interpreted as
the following key-value pairs:
                      (0, On the top of the Crumpetty Tree)
                         (33, The Quangle Wangle sat,)
                       (57, But his face you could not see,)
                       (89, On account of his Beaver Hat.)
• The byte offset is the number of bytes from the beginning of the file to the start
   of the line (not a count within the line itself).
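To make the types concrete, a mapper that consumes TextInputFormat records could be declared as
follows (a minimal sketch; the class name is an assumption, not code from the slides):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // key   = the byte offset of the line within the file (0, 33, 57, 89 in the example above)
    // value = the line contents, without the line terminator
    context.write(key, value); // simply pass the records through unchanged
  }
}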
                  MapReduce : InputFormat
Text input format
THE RELATIONSHIP BETWEEN INPUT SPLITS AND HDFS BLOCKS
A single file is broken into lines, and the line boundaries do not correspond
with the HDFS block boundaries. Splits honor logical record boundaries (in this
case, lines), so in the book's figure the first split contains line 5, even though that
line spans the first and second blocks. The second split starts at line 6.
       Note: the figure referred to here is from Hadoop: The Definitive Guide.
                 MapReduce : InputFormat
Key-value input format
 TextInputFormat’s keys, being simply the offsets within the file, are not
   normally very useful. It is common for each line in a file to be a key-value
   pair, separated by a delimiter such as a tab character.
 For example, this is the kind of output produced by TextOutputFormat,
   Hadoop’s default OutputFormat.
 To interpret such files correctly, KeyValueTextInputFormat is appropriate.
                 MapReduce : InputFormat
Key-value input format
 You can specify the separator via the
   mapreduce.input.keyvaluelinerecordreader.key.value.separator property.
   It is a tab character by default. Consider the following input file, where →
   represents a (horizontal) tab character:
line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.
                 MapReduce : InputFormat
Key-value input format
Like in the TextInputFormat case, the input is in a single split comprising four
records, although this time the keys are the Text sequences before the tab in
each line:
(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
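In a driver, this format would be selected (and the separator overridden if needed) roughly as
follows. This is a sketch: the class name, job name, and explicit separator value are assumptions
rather than code from the slides.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueDriverSketch {
  public static void main(String[] args) throws IOException {
    // Set the separator on the Configuration before the job is created;
    // tab ("\t") is already the default, so this line only makes the choice explicit.
    Configuration conf = new Configuration();
    conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");
    Job job = Job.getInstance(conf, "kv-example");
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    // With this format, the mapper's input key and value types are both Text,
    // e.g. Mapper<Text, Text, ...>.
  }
}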
                 MapReduce : InputFormat
NLineInputFormat input format
 With TextInputFormat and KeyValueTextInputFormat, each mapper receives
    a variable number of lines of input.
 The number depends on the size of the split and the length of the lines.
   If you want your mappers to receive a fixed number of lines of input, then
    NLineInputFormat is the InputFormat to use.
   Like with TextInputFormat, the keys are the byte offsets within the file and
    the values are the lines themselves.
                MapReduce : InputFormat
NLineInputFormat input format
 N refers to the number of lines of input that each mapper receives. With N set
   to 1 (the default), each mapper receives exactly one line of input.
 The mapreduce.input.lineinputformat.linespermap property controls the
  value of N. By way of example, consider these four lines again:
                         On the top of the Crumpetty Tree
                            The Quangle Wangle sat,
                          But his face you could not see,
                          On account of his Beaver Hat.
 If, for example, N is 2, then each split contains two lines. One mapper will
  receive the first two key-value pairs:
                       (0, On the top of the Crumpetty Tree)
                           (33, The Quangle Wangle sat,)
                MapReduce : InputFormat
NLineInputFormat input format
 And another mapper will receive the second two key-value pairs:
                       (57, But his face you could not see,)
                       (89, On account of his Beaver Hat.)
 The keys and values are the same as those that TextInputFormat produces.
 The difference is in the way the splits are constructed.
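In a driver, NLineInputFormat could be configured roughly as follows (a sketch; the class name,
job name, and the choice of N = 2 are assumptions matching the example above):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineDriverSketch {
  public static void main(String[] args) throws IOException {
    Job job = Job.getInstance(new Configuration(), "nline-example");
    job.setInputFormatClass(NLineInputFormat.class);
    // Equivalent to setting mapreduce.input.lineinputformat.linespermap to 2:
    NLineInputFormat.setNumLinesPerSplit(job, 2);
    // Each mapper now receives exactly two (byte offset, line) records.
  }
}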
                MapReduce : InputFormat
Binary input format
 Hadoop MapReduce is not restricted to processing textual data. It has
  support for binary formats.
 Hadoop’s sequence file format stores sequences of binary key-value pairs.
 Sequence files are well suited as a format for MapReduce data because they
  are splittable (they have sync points so that readers can synchronize with
  record boundaries from an arbitrary point in the file, such as the start of a
  split), they support compression as a part of the format, and they can store
  arbitrary types using a variety of serialization frameworks.
                MapReduce : InputFormat
Binary input format (SequenceFileInputFormat)
 To use data from sequence files as the input to MapReduce, you can use
  SequenceFileInputFormat.
 The keys and values are determined by the sequence file, and you need to
  make sure that your map input types correspond.
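In a driver, selecting the format is a single call; the important part is that the mapper's input
types match what is stored in the sequence file. A hedged sketch (the class name, job name, and
the (Text, BytesWritable) types are assumptions):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class SequenceFileDriverSketch {
  public static void main(String[] args) throws IOException {
    Job job = Job.getInstance(new Configuration(), "seqfile-example");
    job.setInputFormatClass(SequenceFileInputFormat.class);
    // If the sequence file was written with (Text, BytesWritable) pairs,
    // the mapper must be declared as Mapper<Text, BytesWritable, ...> to match.
  }
}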
            What problems does the SequenceFile try to solve?
For HDFS
 SequenceFile is one of the solutions to the small-file problem in Hadoop.
 A small file is one that is significantly smaller than the HDFS block size (128 MB).
 Each file, directory, and block in HDFS is represented as an object in the NameNode's
    memory and occupies about 150 bytes.
 10 million files (each taking one block, so roughly two objects per file) would use
    about 3 gigabytes of NameNode memory.
 A billion files is therefore not feasible.
            What problems does the SequenceFile try to solve?
For MapReduce :
 Map tasks usually process a block of input at a time (using the default
   FileInputFormat).
 The more files there are, the more map tasks are needed, and the slower the job
    can become.
Small file scenario:
 The files are pieces of a larger logical file.
            How can SequenceFile help to solve the problems?
 The idea of SequenceFile is to pack many small files into a single, larger file.
 For example, suppose there are 10,000 files of 100 KB each. We can write a program
    to put them into a single SequenceFile, as in the sketch below, using the filename as
    the key and the file contents as the value.
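A minimal sketch of such a packing program (the class name, paths, and the BytesWritable value
type are assumptions, not taken from the slides):

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path out = new Path("packed.seq"); // assumed output path
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(out),
        SequenceFile.Writer.keyClass(Text.class),              // key   = original filename
        SequenceFile.Writer.valueClass(BytesWritable.class))) { // value = file contents
      for (File f : new File("smallfiles").listFiles()) {       // assumed local input directory
        byte[] content = Files.readAllBytes(f.toPath());
        writer.append(new Text(f.getName()), new BytesWritable(content));
      }
    }
  }
}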
            How can SequenceFile help to solve the problems?
1. Less NameNode memory is needed. Continuing with the example of 10,000 files of
     100 KB each:
         o Before using SequenceFile, the 10,000 files correspond to objects occupying
            about 4.5 MB of NameNode RAM.
         o After packing them into one 1 GB SequenceFile stored as 8 HDFS blocks, the
            corresponding objects occupy only about 3.6 KB of NameNode RAM.
2. SequenceFile is splittable, so it is suitable for MapReduce.
3. SequenceFile supports compression as part of the format.
MapReduce : RecordReader
The RecordReader converts the byte-oriented view of an InputSplit into the record-oriented
(key, value) view that is passed to the map function; for example, TextInputFormat uses a
LineRecordReader that produces the (byte offset, line) pairs shown earlier.
          MapReduce word count : Mapper
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class wordcountmapper extends Mapper<Object, Text, Text, IntWritable> {
  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Split the line into words on spaces and emit (word, 1) for each word.
    String mytext = value.toString();
    String[] allwords = mytext.split(" ");
    for (String x : allwords) {
      context.write(new Text(x), new IntWritable(1));
    }
  }
}
        MapReduce word count : Mapper
• The Mapper class is a generic type, with four formal type parameters
  that specify the input key, input value, output key, and output value
  types of the map function.
• In the word count example, the input key is an Object (the byte offset of the line),
  the input value is a line of text (Text), the output key is a word (Text), and the
  output value is a count (IntWritable).
          MapReduce word count : Reducer
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class wordcountreducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Sum all the counts emitted for this word and write (word, total).
    int sum = 0;
    for (IntWritable iw : values) {
      sum += iw.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
         MapReduce word count : Reducer
• The reducer class is a generic type, with four formal type parameters that
  specify the input key, input value, output key, and output value types of the
  reduce function.
• The input types of the reduce function must match the output types of the
  map function.
• In the word count example, the input key is Text (a word), the input value is
  IntWritable (a count), the output key is Text, and the output value is IntWritable.
                MapReduce word count : Driver
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class wordcountdriver {
  public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
    Configuration c = new Configuration();
    Job j = Job.getInstance(c, "mywordcount");
    // Wire up the mapper and reducer classes (a combiner could optionally reuse the reducer).
    j.setMapperClass(wordcountmapper.class);
    j.setReducerClass(wordcountreducer.class);
    //j.setCombinerClass(wordcountreducer.class);
    j.setJarByClass(wordcountdriver.class);
    // Output key/value types produced by the reduce function.
    j.setOutputKeyClass(Text.class);
    j.setOutputValueClass(IntWritable.class);
    // Input file and output directory on HDFS (the output directory must not already exist).
    FileInputFormat.addInputPath(j, new Path("hdfs://localhost:8020/user/cloudera/input/data.dat"));
    FileOutputFormat.setOutputPath(j, new Path("hdfs://localhost:8020/user/cloudera/2019c"));
    // Submit the job, wait for completion, and exit with 0 on success or 1 on failure.
    System.exit(j.waitForCompletion(true) ? 0 : 1);
  }
}
          MapReduce word count : Driver
• The setOutputKeyClass() and setOutputValueClass() methods control the output
  types of the reduce function, and they must match what the Reducer class produces.
• The setMapOutputKeyClass() and setMapOutputValueClass() methods are used
  when the map output types differ from the reduce output types.
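For example, if a hypothetical job had a mapper emitting (Text, LongWritable) pairs while its
reducer emitted (Text, IntWritable) pairs, the driver would declare both pairs explicitly. The
fragment below continues the style of the word count driver above and is only a sketch, not part
of the word count example (LongWritable would also need to be imported from org.apache.hadoop.io):

// Map output types: must match the Mapper's third and fourth type parameters.
j.setMapOutputKeyClass(Text.class);
j.setMapOutputValueClass(LongWritable.class);
// Reduce output types: must match the Reducer's third and fourth type parameters.
j.setOutputKeyClass(Text.class);
j.setOutputValueClass(IntWritable.class);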
  Serialization and deserialization in Hadoop
• Serialization is the process of turning structured objects into a byte stream
  for transmission over a network or for writing to persistent storage.
• Deserialization is the reverse process of turning a byte stream back into a
  series of structured objects.
• Serialization is used in two quite distinct areas of distributed data processing:
  for interprocess communication and for persistent storage.
• In Hadoop, interprocess communication between nodes in the system is
  implemented using remote procedure calls (RPCs). The RPC protocol uses
  serialization to render the message into a binary stream to be sent to the
  remote node, which then deserializes the binary stream into the original
  message.
  Serialization and deserialization in Hadoop
Why does Hadoop use classes such as IntWritable and Text instead of int and String?
Because Java's built-in Serializable mechanism is too heavyweight for Hadoop; the Writable
interface serializes Hadoop objects in a much more compact and efficient way.
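To illustrate how lightweight the Writable contract is, here is a minimal sketch of a custom
Writable (the class PointWritable and its fields are hypothetical, not from the slides):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PointWritable implements Writable {
  private int x;
  private int y;

  @Override
  public void write(DataOutput out) throws IOException {
    // Only the raw field bytes are written: no class names or metadata,
    // unlike java.io.Serializable.
    out.writeInt(x);
    out.writeInt(y);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // Fields are read back in exactly the order they were written.
    x = in.readInt();
    y = in.readInt();
  }
}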
Why & where Hadoop is used / not used?
 What Hadoop is good for:
1. Massive amounts of data through parallelism
2. A variety of data (structured, unstructured, semi-structured)
3. Inexpensive commodity hardware
 Hadoop is not good for:
1. Processing transactions (random access)
2. Work that cannot be parallelized
3. Low-latency data access
4. Processing lots of small files
5. Intensive calculations with little data