Big Data Technologies
(IS 365)
                                    Lecture 4
                                   MapReduce
                                      Dr. Wael Abbas
                                       2024 - 2025
All slides in this file are based on the following book: Tom White (2015). Hadoop: The Definitive
Guide, 4th Edition. O'Reilly Media, Inc.
            Reading data in MapReduce
   Hadoop can process many different types of data formats, from flat text
    files to databases.
 There are three main Java classes provided in Hadoop to read data in
    MapReduce:
      1. InputSplit
     2. RecordReader
     3. InputFormat
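These three abstractions fit together through the InputFormat contract: the InputFormat computes
the InputSplits for a job and supplies the RecordReader that turns each split into (key, value)
records. As a reference, the two abstract methods of org.apache.hadoop.mapreduce.InputFormat look
roughly like this (a sketch of the framework class, shown here for orientation only):

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public abstract class InputFormat<K, V> {
  // Computes the logical splits of the input; one map task runs per split.
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  // Creates the RecordReader that turns a split into (key, value) records,
  // which are then passed to the map function.
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}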
                  MapReduce : InputFormat
Common InputFormats:

1. TextInputFormat (Text files): the default format; reads lines of text files.
   Key: the byte offset of the line. Value: the line contents.
2. KeyValueTextInputFormat (Text files): parses each line into a (key, value) pair.
   Key: everything up to the first tab character. Value: the remainder of the line.
3. NLineInputFormat (Text files): each mapper receives a fixed number of lines of input.
   Key: the byte offset of the line. Value: the line contents.
4. SequenceFileInputFormat (Binary files): a Hadoop-specific, high-performance binary format.
   Key: user-defined. Value: user-defined.
                   MapReduce : InputFormat
    Text input format
 TextInputFormat is the default InputFormat.
   Each record is a line of input.
 The key, a LongWritable, is the byte offset within the file of the beginning
    of the line.
 The value is the contents of the line, excluding any line terminators (e.g.,
    newline or carriage return), and is packaged as a Text object. So, a file
    containing the following text:
                          On the top of the Crumpetty Tree
                              The Quangle Wangle sat,
                            But his face you could not see,
                            On account of his Beaver Hat.
                  MapReduce : InputFormat
Text input format
The text is divided into one split of four records. The records are interpreted as
the following key-value pairs:
                      (0, On the top of the Crumpetty Tree)
                         (33, The Quangle Wangle sat,)
                       (57, But his face you could not see,)
                       (89, On account of his Beaver Hat.)
• The byte offset is the number of bytes from the beginning of the file to the start
   of the line (not a count within the line itself).
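To make the types concrete, a mapper that consumes TextInputFormat records could be declared as
follows (a minimal sketch; the class name is an assumption, not code from the slides):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // key   = the byte offset of the line within the file (0, 33, 57, 89 in the example above)
    // value = the line contents, without the line terminator
    context.write(key, value); // simply pass the records through unchanged
  }
}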
                  MapReduce : InputFormat
Text input format
THE RELATIONSHIP BETWEEN INPUT SPLITS AND HDFS BLOCKS
A single file is broken into lines, and the line boundaries do not correspond
with the HDFS block boundaries. Splits honor logical record boundaries (in this
case, lines), so in the book's figure the first split contains line 5, even though that
line spans the first and second blocks. The second split starts at line 6.
       Note: the figure referred to here is from Hadoop: The Definitive Guide.
                 MapReduce : InputFormat
Key-value input format
 TextInputFormat’s keys, being simply the offsets within the file, are not
   normally very useful. It is common for each line in a file to be a key-value
   pair, separated by a delimiter such as a tab character.
 For example, this is the kind of output produced by TextOutputFormat,
   Hadoop’s default OutputFormat.
 To interpret such files correctly, KeyValueTextInputFormat is appropriate.
                 MapReduce : InputFormat
Key-value input format
 You can specify the separator via the
   mapreduce.input.keyvaluelinerecordreader.key.value.separator property.
   It is a tab character by default. Consider the following input file, where →
   represents a (horizontal) tab character:
line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.
                 MapReduce : InputFormat
Key-value input format
Like in the TextInputFormat case, the input is in a single split comprising four
records, although this time the keys are the Text sequences before the tab in
each line:
(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
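In a driver, this format would be selected (and the separator overridden if needed) roughly as
follows. This is a sketch: the class name, job name, and explicit separator value are assumptions
rather than code from the slides.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueDriverSketch {
  public static void main(String[] args) throws IOException {
    // Set the separator on the Configuration before the job is created;
    // tab ("\t") is already the default, so this line only makes the choice explicit.
    Configuration conf = new Configuration();
    conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");
    Job job = Job.getInstance(conf, "kv-example");
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    // With this format, the mapper's input key and value types are both Text,
    // e.g. Mapper<Text, Text, ...>.
  }
}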
                 MapReduce : InputFormat
NLineInputFormat input format
 With TextInputFormat and KeyValueTextInputFormat, each mapper receives
    a variable number of lines of input.
 The number depends on the size of the split and the length of the lines.
   If you want your mappers to receive a fixed number of lines of input, then
    NLineInputFormat is the InputFormat to use.
   Like with TextInputFormat, the keys are the byte offsets within the file and
    the values are the lines themselves.
                MapReduce : InputFormat
NLineInputFormat input format
 N refers to the number of lines of input that each mapper receives. With N set
   to 1 (the default), each mapper receives exactly one line of input.
 The mapreduce.input.lineinputformat.linespermap property controls the
  value of N. By way of example, consider these four lines again:
                         On the top of the Crumpetty Tree
                            The Quangle Wangle sat,
                          But his face you could not see,
                          On account of his Beaver Hat.
 If, for example, N is 2, then each split contains two lines. One mapper will
  receive the first two key-value pairs:
                       (0, On the top of the Crumpetty Tree)
                           (33, The Quangle Wangle sat,)
                MapReduce : InputFormat
NLineInputFormat input format
 And another mapper will receive the second two key-value pairs:
                       (57, But his face you could not see,)
                       (89, On account of his Beaver Hat.)
 The keys and values are the same as those that TextInputFormat produces.
 The difference is in the way the splits are constructed.
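In a driver, NLineInputFormat could be configured roughly as follows (a sketch; the class name,
job name, and the choice of N = 2 are assumptions matching the example above):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineDriverSketch {
  public static void main(String[] args) throws IOException {
    Job job = Job.getInstance(new Configuration(), "nline-example");
    job.setInputFormatClass(NLineInputFormat.class);
    // Equivalent to setting mapreduce.input.lineinputformat.linespermap to 2:
    NLineInputFormat.setNumLinesPerSplit(job, 2);
    // Each mapper now receives exactly two (byte offset, line) records.
  }
}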
                MapReduce : InputFormat
Binary input format
 Hadoop MapReduce is not restricted to processing textual data. It has
  support for binary formats.
 Hadoop’s sequence file format stores sequences of binary key-value pairs.
 Sequence files are well suited as a format for MapReduce data because they
  are splittable (they have sync points so that readers can synchronize with
  record boundaries from an arbitrary point in the file, such as the start of a
  split), they support compression as a part of the format, and they can store
  arbitrary types using a variety of serialization frameworks.
                MapReduce : InputFormat
Binary input format (SequenceFileInputFormat)
 To use data from sequence files as the input to MapReduce, you can use
  SequenceFileInputFormat.
 The keys and values are determined by the sequence file, and you need to
  make sure that your map input types correspond.
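In a driver, selecting the format is a single call; the important part is that the mapper's input
types match what is stored in the sequence file. A hedged sketch (the class name, job name, and
the (Text, BytesWritable) types are assumptions):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class SequenceFileDriverSketch {
  public static void main(String[] args) throws IOException {
    Job job = Job.getInstance(new Configuration(), "seqfile-example");
    job.setInputFormatClass(SequenceFileInputFormat.class);
    // If the sequence file was written with (Text, BytesWritable) pairs,
    // the mapper must be declared as Mapper<Text, BytesWritable, ...> to match.
  }
}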
            What problems does the SequenceFile try to solve?
For HDFS
 SequenceFile is one of the solutions to the small-file problem in Hadoop.
 A small file is one that is significantly smaller than the HDFS block size (128 MB).
 Each file, directory, and block in HDFS is represented as an object in the NameNode's
    memory and occupies about 150 bytes.
 10 million files (each taking one block, so roughly two objects per file) would use
    about 3 gigabytes of NameNode memory.
 A billion files is therefore not feasible.
            What problems does the SequenceFile try to solve?
For MapReduce :
 Map tasks usually process a block of input at a time (using the default
   FileInputFormat).
 The more files there are, the more map tasks are needed, and the slower the job
    can become.
Small file scenario:
 The files are pieces of a larger logical file.
            How can SequenceFile help to solve the problems?
 The idea of SequenceFile is to pack many small files into a single, larger file.
 For example, suppose there are 10,000 files of 100 KB each. We can write a program
    to put them into a single SequenceFile, as in the sketch below, using the filename as
    the key and the file contents as the value.
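A minimal sketch of such a packing program (the class name, paths, and the BytesWritable value
type are assumptions, not taken from the slides):

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path out = new Path("packed.seq"); // assumed output path
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(out),
        SequenceFile.Writer.keyClass(Text.class),              // key   = original filename
        SequenceFile.Writer.valueClass(BytesWritable.class))) { // value = file contents
      for (File f : new File("smallfiles").listFiles()) {       // assumed local input directory
        byte[] content = Files.readAllBytes(f.toPath());
        writer.append(new Text(f.getName()), new BytesWritable(content));
      }
    }
  }
}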
            How can SequenceFile help to solve the problems?
1. Less NameNode memory is needed. Continuing with the example of 10,000 files of
     100 KB each:
         o Before using SequenceFile, the 10,000 files correspond to objects occupying
            about 4.5 MB of NameNode RAM.
         o After packing them into one 1 GB SequenceFile stored as 8 HDFS blocks, the
            corresponding objects occupy only about 3.6 KB of NameNode RAM.
2. SequenceFile is splittable, so it is suitable for MapReduce.
3. SequenceFile supports compression as part of the format.
MapReduce : RecordReader
The RecordReader converts the byte-oriented view of an InputSplit into the record-oriented
(key, value) view that is passed to the map function; for example, TextInputFormat uses a
LineRecordReader that produces the (byte offset, line) pairs shown earlier.
          MapReduce word count : Mapper
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class wordcountmapper extends Mapper<Object, Text, Text, IntWritable> {
  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Split the line into words on spaces and emit (word, 1) for each word.
    String mytext = value.toString();
    String[] allwords = mytext.split(" ");
    for (String x : allwords) {
      context.write(new Text(x), new IntWritable(1));
    }
  }
}
        MapReduce word count : Mapper
• The Mapper class is a generic type, with four formal type parameters
  that specify the input key, input value, output key, and output value
  types of the map function.
• In the word count example, the input key is an Object (the byte offset of the line),
  the input value is a line of text (Text), the output key is a word (Text), and the
  output value is a count (IntWritable).
          MapReduce word count : Reducer
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class wordcountreducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Sum all the counts emitted for this word and write (word, total).
    int sum = 0;
    for (IntWritable iw : values) {
      sum += iw.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
         MapReduce word count : Reducer
• The reducer class is a generic type, with four formal type parameters that
  specify the input key, input value, output key, and output value types of the
  reduce function.
• The input types of the reduce function must match the output types of the
  map function.
• In the word count example, the input key is Text (a word), the input value is
  IntWritable (a count), the output key is Text, and the output value is IntWritable.
                MapReduce word count : Driver
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class wordcountdriver {
  public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
    Configuration c = new Configuration();
    Job j = Job.getInstance(c, "mywordcount");
    // Wire up the mapper and reducer classes (a combiner could optionally reuse the reducer).
    j.setMapperClass(wordcountmapper.class);
    j.setReducerClass(wordcountreducer.class);
    //j.setCombinerClass(wordcountreducer.class);
    j.setJarByClass(wordcountdriver.class);
    // Output key/value types produced by the reduce function.
    j.setOutputKeyClass(Text.class);
    j.setOutputValueClass(IntWritable.class);
    // Input file and output directory on HDFS (the output directory must not already exist).
    FileInputFormat.addInputPath(j, new Path("hdfs://localhost:8020/user/cloudera/input/data.dat"));
    FileOutputFormat.setOutputPath(j, new Path("hdfs://localhost:8020/user/cloudera/2019c"));
    // Submit the job, wait for completion, and exit with 0 on success or 1 on failure.
    System.exit(j.waitForCompletion(true) ? 0 : 1);
  }
}
          MapReduce word count : Driver
• The setOutputKeyClass() and setOutputValueClass() methods control the output
  types of the reduce function, and they must match what the Reducer class produces.
• The setMapOutputKeyClass() and setMapOutputValueClass() methods are used
  when the map output types differ from the reduce output types.
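For example, if a hypothetical job had a mapper emitting (Text, LongWritable) pairs while its
reducer emitted (Text, IntWritable) pairs, the driver would declare both pairs explicitly. The
fragment below continues the style of the word count driver above and is only a sketch, not part
of the word count example (LongWritable would also need to be imported from org.apache.hadoop.io):

// Map output types: must match the Mapper's third and fourth type parameters.
j.setMapOutputKeyClass(Text.class);
j.setMapOutputValueClass(LongWritable.class);
// Reduce output types: must match the Reducer's third and fourth type parameters.
j.setOutputKeyClass(Text.class);
j.setOutputValueClass(IntWritable.class);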
  Serialization and deserialization in Hadoop
• Serialization is the process of turning structured objects into a byte stream
  for transmission over a network or for writing to persistent storage.
• Deserialization is the reverse process of turning a byte stream back into a
  series of structured objects.
• Serialization is used in two quite distinct areas of distributed data processing:
  for interprocess communication and for persistent storage.
• In Hadoop, interprocess communication between nodes in the system is
  implemented using remote procedure calls (RPCs). The RPC protocol uses
  serialization to render the message into a binary stream to be sent to the
  remote node, which then deserializes the binary stream into the original
  message.
  Serialization and deserialization in Hadoop
Why does Hadoop use classes such as IntWritable and Text instead of int and String?
Because Java's built-in Serializable mechanism is too heavyweight for Hadoop; the Writable
interface serializes Hadoop objects in a much more compact and efficient way.
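To illustrate how lightweight the Writable contract is, here is a minimal sketch of a custom
Writable (the class PointWritable and its fields are hypothetical, not from the slides):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PointWritable implements Writable {
  private int x;
  private int y;

  @Override
  public void write(DataOutput out) throws IOException {
    // Only the raw field bytes are written: no class names or metadata,
    // unlike java.io.Serializable.
    out.writeInt(x);
    out.writeInt(y);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // Fields are read back in exactly the order they were written.
    x = in.readInt();
    y = in.readInt();
  }
}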
Why & where Hadoop is used / not used?
 What Hadoop is good for:
1. Massive amounts of data through parallelism
2. A variety of data (structured, unstructured, semi-structured)
3. Inexpensive commodity hardware
 Hadoop is not good for:
1. Processing transactions (random access)
2. Work that cannot be parallelized
3. Low-latency data access
4. Processing lots of small files
5. Intensive calculations with little data