Unit 2
INTRODUCTION
TO HADOOP AND
HADOOP
ARCHITECTURE
Dr. Nilesh M. Patil
Associate Professor
Dept. of Computer Engineering, DJSCE
                         Syllabus
• Big Data – Apache Hadoop & Hadoop EcoSystem
• Moving Data in and out of Hadoop – Understanding inputs and
  outputs of MapReduce, Concept of Hadoop
• HDFS Commands
• MapReduce-The Map Tasks, Grouping by Key, The Reduce Tasks,
  Combiners, Details of MapReduce Execution
• 8 Hours
• Marks: 20 (approx.)
                 Introduction to Hadoop
• Hadoop is an open-source framework from Apache and is used to
  store, process, and analyze data that is very huge in volume.
• Founders: Doug Cutting and Mike Cafarella
• Hadoop is written in Java and is not OLAP (online analytical
  processing).
• It is used for batch/offline processing.
• It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn, and
  many more.
• Moreover, it can be scaled up just by adding nodes in the cluster.
History of Hadoop
•   Founders: Doug Cutting and Mike Cafarella
•   2002 – Apache Nutch (Open Source Web Crawler Software Project)
•   2003 – Google introduced GFS, its proprietary distributed file system
•   2004 – Google released a white paper on MapReduce
•   2005 – Doug Cutting and Mike Cafarella introduced NDFS (Nutch Distributed File System)
•   2006 – Doug Cutting joined Yahoo; NDFS was renamed HDFS and the project became
    Hadoop. The Hadoop 0.1.0 version was released.
•   2007 – Yahoo ran 2 clusters of 1000 machines
•   2008 – Hadoop became the fastest system to sort 1 TB of data, on a 900-node cluster,
    in 209 seconds
•   2013 – Hadoop 2.2 was released
•   2017 – Hadoop 3.0 was released
                     Why Hadoop?
• Overcomes the traditional limitations of storage and compute.
• Leverages inexpensive, commodity hardware as the platform.
• Provides linear scalability from 1 to 4000 servers.
• Low-cost, open-source software.
                       Hadoop Goals
1. Scalable: It can scale up from a single server to thousands of
   servers.
2. Fault tolerance: It is designed with a very high degree of fault
   tolerance.
3. Economical: It uses commodity hardware instead of high-end
   hardware.
4. Handle hardware failures: It has the ability to detect and handle
   failures at the application layer.
Core Hadoop
Components
             Hadoop Common Package
• Hadoop Common refers to the collection of common
  utilities and libraries that support other Hadoop
  modules.
• Hadoop Common is also known as Hadoop Core.
• Hadoop Common also contains the necessary Java
  Archive (JAR) files and scripts required to start Hadoop.
• The Hadoop Common package also provides source
  code and documentation, as well as a contribution
  section that includes different projects from the Hadoop
  Community.
      Hadoop Distributed File System (HDFS)
• The Hadoop Distributed File System was developed using a distributed
  file system design.
• It runs on commodity hardware.
• Unlike some other distributed systems, HDFS is highly fault-tolerant and
  designed to use low-cost hardware.
• HDFS holds a very large amount of data and provides easy access.
• To store such huge data, the files are stored across multiple
  machines.
• These files are stored in a redundant fashion to rescue the system
  from possible data losses in case of failure.
• HDFS also makes applications available for parallel processing.
                     HDFS Features
• It is suitable for distributed storage and processing.
• Hadoop provides a command-line interface to interact with HDFS (a small
  Java sketch of interacting with HDFS programmatically follows this list).
• The built-in servers of the NameNode and DataNode help users easily
  check the status of the cluster.
• It provides streaming access to file system data.
• HDFS provides file permissions and authentication.
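The command interface mentioned above (the hdfs dfs shell) is the usual way to interact with HDFS; the same operations can also be performed from Java. The following is a minimal sketch, not taken from these slides, using the org.apache.hadoop.fs.FileSystem API; the file path and the assumption that fs.defaultFS points at a running HDFS (e.g. hdfs://localhost:9000) are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write a small file into HDFS and read back its status.
public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);            // handle to the file system named in fs.defaultFS

        Path file = new Path("/user/demo/hello.txt");    // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");                  // write a small record
        }

        System.out.println("Exists: " + fs.exists(file));
        System.out.println("Size:   " + fs.getFileStatus(file).getLen() + " bytes");
        fs.close();
    }
}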
HDFS Architecture
                        NameNode
• NameNode is the master node that contains the metadata.
• The NameNode is responsible for the workings of the data nodes.
• NameNode is the primary server that manages the file system
  namespace and controls client access to files.
• The NameNode performs file system namespace operations,
  including opening, closing and renaming files and directories.
• The NameNode also governs the mapping of blocks to the
  DataNodes.
                          DataNode
• The DataNodes are called the slaves.
• The DataNodes read, write, process, and replicate the data.
• They also send signals, known as heartbeats, to the NameNode.
  These heartbeats show the status of the DataNode.
• While there is only one NameNode, there can be multiple DataNodes.
• Consider that 30TB of data is loaded
  into the NameNode.
• The NameNode distributes it across
  the DataNodes, and this data is
  replicated among the DataNodes.
• You can see in the image above that the blue, grey, and red data are
  replicated among the three DataNodes.
• Replication of the data is performed
  three times by default. It is done this
  way, so if a commodity machine fails,
  you can replace it with a new
  machine that has the same data.
                        Hadoop MapReduce
• Hadoop MapReduce is the processing unit of Hadoop.
• In the MapReduce approach, the processing is done at the slave nodes, and the
  final result is sent to the master node.
• MapReduce moves the code to the data: a small piece of code is sent to where
  the data resides and is used to process the entire data set. This code is
  usually very small in comparison to the data itself.
• You only need to send a few kilobytes worth of code to perform a heavy-duty
  process on computers.
• A MapReduce program executes in three stages, namely the map stage, the shuffle
  stage, and the reduce stage (a minimal word-count sketch follows this list).
   • Map stage − The map or mapper’s job is to process the input data. Generally, the
     input data is in the form of file or directory and is stored in the Hadoop file system
     (HDFS). The input file is passed to the mapper function line by line. The mapper
     processes the data and creates several small chunks of data.
   • Reduce stage − This stage is the combination of the Shuffle stage and
     the Reduce stage. The Reducer’s job is to process the data that comes from the
     mapper. After processing, it produces a new set of output, which will be stored in
     the HDFS.
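To make the map and reduce stages concrete, below is a minimal word-count sketch (not from the slides) written against the org.apache.hadoop.mapreduce API; the class names WordCountMapper and WordCountReducer are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: for every input line, emit (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);               // intermediate key-value pair
            }
        }
    }
}

// Reduce stage: after shuffle and sort, sum the counts for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));      // final output, stored in HDFS
    }
}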
MapReduce Architecture
         Components of MapReduce Architecture:
         1. Client: The MapReduce client is the one who brings the Job to
            the MapReduce for processing. There can be multiple clients
            available that continuously send jobs for processing to the
            Hadoop MapReduce Manager.
          2. Job: The MapReduce Job is the actual work that the client
             wants to perform, comprising many smaller tasks that the
             client wants to process or execute (a minimal job-submission
             sketch in Java follows this list).
          3. Hadoop MapReduce Master: It divides the particular job into
             subsequent job-parts.
          4. Job-Parts: The tasks or sub-jobs obtained after dividing the
             main job. The results of all the job-parts are combined to
             produce the final output.
         5. Input Data: The data set that is fed to the MapReduce for
            processing.
         6. Output Data: The final result is obtained after the processing.
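As noted in the list above, the client submits a job through a small driver program. The following is a hedged sketch of such a driver for the word-count classes shown earlier; the input and output paths passed on the command line are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Client/driver: describes the job and submits it to the cluster.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // map tasks
        job.setCombinerClass(WordCountReducer.class);   // optional local aggregation
        job.setReducerClass(WordCountReducer.class);    // reduce tasks
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /user/demo/input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}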
MapReduce Example
                         Hadoop YARN
• Hadoop YARN stands for Yet Another Resource Negotiator.
• It is the resource management unit of Hadoop and is available as a
  component of Hadoop version 2.
• Hadoop YARN acts like an OS for Hadoop. It is a resource management
  layer that sits on top of HDFS.
• It is responsible for managing cluster resources to make sure you don't
  overload one machine.
• It performs job scheduling to make sure that the jobs are scheduled in
  the right place.
Hadoop Ecosystem
 HDFS (Hadoop Distributed File System)
• It is the storage component of Hadoop that stores data in the form of
  files.
• Each file is divided into blocks of 128 MB (configurable) that are stored
  on different machines in the cluster.
• It has a master-slave architecture with two main components: Name
  Node and Data Node.
• The Name node is the master node, and there is only one per cluster. Its
  task is to know where each block belonging to a file is lying in the
  cluster.
• The Data node is the slave node that stores the blocks of data, and there
  can be many per cluster. Its task is to retrieve the data as and when
  required. It keeps in constant touch with the Name node through
  heartbeats.
                            MapReduce
• To handle Big Data, Hadoop relies on the MapReduce algorithm introduced
  by Google and makes it easy to distribute a job and run it in parallel in a
  cluster.
• It essentially divides a single task into multiple tasks and processes them on
  different machines.
• In layman’s terms, it works in a divide-and-conquer manner and runs the
  processes on the machines to reduce traffic on the network.
• It has two important phases: Map and Reduce.
• The Map phase filters, groups, and sorts the data. Input data is divided into
  multiple splits. Each map task works on a split of data in parallel on
  different machines and outputs key-value pairs. The output of this phase is
  acted upon in the Reduce phase, which aggregates the data, summarizes the
  result, and stores it on HDFS.
                            YARN
• YARN or Yet Another Resource Negotiator manages resources
  in the cluster and schedules the applications running over Hadoop.
• It allows data stored in HDFS to be processed by various data
  processing engines for batch processing, stream processing,
  interactive processing, graph processing, and many more.
• This increases the efficiency of the cluster.
                        HBase
• HBase is a Column-based NoSQL database.
• It runs on top of HDFS and can handle any type of data.
• It allows real-time processing and random read/write operations
  to be performed on the data.
                             Pig
• Pig was developed for analyzing large datasets and overcomes the
  difficulty of writing map and reduce functions directly.
• It consists of two components: Pig Latin and Pig Engine.
• Pig Latin is a scripting language that is similar to SQL.
• Pig Engine is the execution engine on which Pig Latin runs.
• Internally, the code written in Pig is converted to MapReduce
  functions, which makes it very easy for programmers who aren’t
  proficient in Java.
                             Hive
• Hive is a distributed data warehouse system developed by
  Facebook.
• It allows for easy reading, writing, and managing files on HDFS.
• It has its own query language for this purpose, known as Hive
  Query Language (HQL), which is very similar to SQL.
• This makes it very easy for programmers to express MapReduce
  jobs using simple HQL queries.
                             Sqoop
• A lot of applications still store data in relational databases, thus
  making them a very important source of data.
• Therefore, Sqoop plays an important part in bringing data from
  Relational Databases into HDFS.
• The commands written in Sqoop are internally converted into
  MapReduce tasks that are executed over HDFS.
• It works with almost all relational databases like MySQL,
  Postgres, SQLite, etc.
• It can also be used to export data from HDFS to RDBMS.
                            Flume
• Flume is an open-source, reliable, and available service used to
  efficiently collect, aggregate, and move large amounts of data
  from multiple data sources into HDFS.
• It can collect data in real-time as well as in batch mode.
• It has a flexible architecture and is fault-tolerant with multiple
  recovery mechanisms.
                              Kafka
• There are a lot of applications generating data and a
  commensurate number of applications consuming that data. But
  connecting them individually is a tough task. That’s where Kafka
  comes in.
• It sits between the applications generating data (Producers) and
  the applications consuming data (Consumers).
• Kafka is distributed and has in-built partitioning, replication, and
  fault-tolerance.
• It can handle streaming data and also allows businesses to
  analyze data in real-time.
                                Oozie
• Oozie is a workflow scheduler system that allows users to link jobs
  written on various platforms like MapReduce, Hive, Pig, etc.
• Using Oozie you can schedule a job in advance and can create a
  pipeline of individual jobs to be executed sequentially or in parallel to
  achieve a bigger task.
• For example, you can use Oozie to perform ETL operations on data
  and then save the output in HDFS.
                        Zookeeper
• In a Hadoop cluster, coordinating and synchronizing nodes can
  be a challenging task. Therefore, Zookeeper is the perfect tool
  for the problem.
• It is an open-source, distributed, and centralized service for
  maintaining configuration information, naming, providing
  distributed synchronization, and providing group services across
  the cluster.
                             Mahout
• Mahout offers a platform to develop machine learning software that
  can be scaled.
• Machine learning algorithms enable the creation of self-learning
  systems that learn by themselves without having to be explicitly
  programmed.
• Based on the user’s behavior patterns, data, and previous experiences,
  such systems can make crucial decisions.
• Machine learning can be described as a branch of Artificial Intelligence (AI).
• Mahout provides collaborative filtering, clustering, and classification
  algorithms.
                               Spark
• Spark is an alternative processing framework to Hadoop MapReduce, written
  in Scala, but it supports applications written in Java, Python, etc.
• Compared to MapReduce, it provides in-memory processing, which
  accounts for faster processing.
• In addition to batch processing offered by Hadoop, it can also handle
  real-time processing.
                           Ambari
• Ambari is an Apache Software Foundation project which seeks to
  make the Hadoop ecosystem easier to manage.
• It is a software solution for provisioning and managing Apache
  Hadoop clusters.
Physical Architecture of Hadoop
Description of Hadoop components
• Name Node
    • It is the master of HDFS (the Hadoop file system).
    • Contains the Job Tracker, which keeps track of files distributed to
      different data nodes.
    • Failure of the Name Node will lead to the failure of the full Hadoop
      system.
• Data Node
    • The data node is the slave of HDFS.
    • Data nodes can communicate with each other through the name node
      to avoid replication in the provided task.
    • Data nodes report changes to the name node.
• Job Tracker
    • Determines which files to process.
    • There can be only one job tracker per Hadoop cluster.
• Task Tracker
    • Only a single task tracker is present per slave node.
    • Performs tasks given by the job tracker and also continuously
      communicates with it.
• SNN (Secondary Name Node)
    • Its main purpose is to monitor the process.
    • One SNN is present per cluster.
                                    Working
1.  When the client submits a job, it goes to the NameNode.
2.  The NameNode decides whether to accept the job or not.
3.  After accepting the job, the NameNode transfers the job to the job tracker.
4.  The job tracker divides the job into components and transfers them to the DataNodes.
5.  The DataNodes further transfer the jobs to the task trackers.
6.  The actual processing is done here, i.e., the execution of the submitted job happens on the
    task trackers.
7.  The job tracker continuously communicates with the task trackers. If at any moment the job
    tracker does not get a reply from a task tracker, it considers that task tracker to have failed
    and transfers its work to another one.
8.  After completing the parts of the job assigned to them, the task trackers submit the completed
    tasks to the job tracker via the DataNode.
9.  The task of the secondary NameNode is simply to monitor the whole ongoing process.
10. There is no fixed number of data nodes; there can be as many as required.
Hadoop Advantages
Limitations of Hadoop
                               Hadoop Installation
  Steps to Install Hadoop
  • Install Java JDK 1.8
  • Download Hadoop, extract it, and place it under the C drive
  • Set the path in Environment Variables
  • Edit the config files under the Hadoop directory
  • Create the folders datanode and namenode under the data directory
  • Edit the HDFS and YARN files
  • Set JAVA_HOME in the Hadoop environment file
  • Setup complete. Test by executing start-all.cmd
There are two ways to install Hadoop, i.e.
•Single node
•Multi node
A single node cluster means only one DataNode is running, with the NameNode, DataNode,
ResourceManager and NodeManager all set up on a single machine.
In a multi node cluster, there is more than one DataNode, and each DataNode runs on a different
machine. The multi node cluster is what organizations practically use for analyzing Big Data: when
we deal with petabytes of data, it needs to be distributed across hundreds of machines to be
processed, so we use a multi node cluster.
Setting up a single node Hadoop cluster
• Prerequisites to install Hadoop on Windows
• VIRTUAL BOX (for Linux): used to install a Linux operating system on
  a Windows machine.
• OPERATING SYSTEM: You can install Hadoop on Windows or Linux
  based operating systems. Ubuntu and CentOS are very commonly
  used.
• JAVA: You need to install the Java 8 package on your system.
• HADOOP: You require Hadoop’s latest version
1. Install Java
• Java JDK link to download:
  https://www.oracle.com/java/technologies/javase-jdk8-downloads.html
• Extract and install Java in C:\Java
• Open cmd and type -> javac -version

2. Download Hadoop
• https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
• Extract to C:\Hadoop
3. Set the JAVA_HOME environment variable path
4. Set the HADOOP_HOME environment variable path
5. Configurations
Edit file C:/Hadoop-3.3.0/etc/hadoop/core-site.xml,
paste the xml code and save this file.
<configuration>
 <property>
   <name>fs.defaultFS</name>
   <value>hdfs://localhost:9000</value>
 </property>
</configuration>
Rename “mapred-site.xml.template” to “mapred-site.xml” and edit this file C:/Hadoop-
3.3.0/etc/hadoop/mapred-site.xml, paste xml code and save this file.
<configuration>
 <property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
 </property>
</configuration>
Create folder “data” under “C:\Hadoop-3.3.0”
Create folder “datanode” under “C:\Hadoop-3.3.0\data”
Create folder “namenode” under “C:\Hadoop-3.3.0\data”
Edit file C:\Hadoop-3.3.0/etc/hadoop/hdfs-site.xml,
paste the xml code and save this file.
<configuration>
 <property>
   <name>dfs.replication</name>
   <value>1</value>
 </property>
 <property>
   <name>dfs.namenode.name.dir</name>
   <value>/hadoop-3.3.0/data/namenode</value>
 </property>
 <property>
   <name>dfs.datanode.data.dir</name>
   <value>/hadoop-3.3.0/data/datanode</value>
 </property>
</configuration>

Edit file C:/Hadoop-3.3.0/etc/hadoop/yarn-site.xml,
paste the xml code and save this file.
<configuration>
 <property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
 </property>
 <property>
   <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
 </property>
</configuration>

Edit file C:/Hadoop-3.3.0/etc/hadoop/hadoop-env.cmd by changing the line
set JAVA_HOME=%JAVA_HOME%
to
set JAVA_HOME=C:\Java
6. Hadoop Configurations
• Download
  https://github.com/brainmentorspvtltd/BigData_RDE/blob/master/Hadoop%20Configuration.zip
• or (for Hadoop 3)
  https://github.com/s911415/apache-hadoop-3.1.0-winutils
• Copy the bin folder and replace the existing bin folder in C:\Hadoop-3.3.0\bin
• Format the NameNode: open cmd and type the command "hdfs namenode -format"
   7. Testing
• Open cmd and change directory to C:\Hadoop-3.3.0\sbin
• type start-all.cmd
 (Or you can start like this)
 – Start namenode and datanode with this command
 – type start-dfs.cmd
 – Start yarn through this command
 – type start-yarn.cmd
Make sure these apps are running
– Hadoop Namenode
– Hadoop datanode
– YARN Resource Manager
– YARN Node Manager
Open: http://localhost:8088
Open: http://localhost:9870
MapReduce
MapReduce in a Nutshell
What is MapReduce?
MapReduce Execution Pipeline
Input Files
The data for a MapReduce task is stored in input files, and input files typically live
in HDFS. The format of these files is arbitrary; line-based log files and binary
formats can also be used.
InputFormat
InputFormat defines how these input files are split and read. It selects the files
or other objects that are used for input. InputFormat creates the InputSplits.
InputSplits
An InputSplit is created by the InputFormat and logically represents the data that will
be processed by an individual Mapper. One map task is created for each split; thus the
number of map tasks equals the number of InputSplits. Each split is divided into
records, and each record is processed by the mapper.
RecordReader
It communicates with the InputSplit in Hadoop MapReduce and converts the data into
key-value pairs suitable for reading by the mapper. By default, it uses
TextInputFormat for converting data into key-value pairs. RecordReader
communicates with the InputSplit until the file reading is completed. It assigns a
byte offset (a unique number) to each line present in the file. These key-value
pairs are then sent to the mapper for further processing.
Mapper
It processes each input record (from the RecordReader) and generates a new key-value
pair; the key-value pair generated by the Mapper can be completely different from the
input pair. The output of the Mapper is also known as the intermediate output and is
written to the local disk. The output of the Mapper is not stored on HDFS, as this is
temporary data and writing it to HDFS would create unnecessary copies (HDFS is also a
high-latency system). The mapper's output is passed to the combiner for further processing.
Combiner
The combiner is also known as a 'Mini-reducer'. The Hadoop MapReduce Combiner
performs local aggregation on the mappers' output, which helps to minimize the data
transfer between mapper and reducer (the reducer is described below); a short driver-side
sketch follows this paragraph. Once the combiner has executed, its output is passed to
the partitioner for further work.
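As mentioned above, when the reduce operation is associative and commutative (as in word count), the reducer class itself can usually be reused as the combiner. A minimal driver-side sketch follows; WordCountReducer is the class assumed from the earlier word-count example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Driver fragment: enable local (map-side) aggregation by reusing the reducer as combiner.
public class CombinerSetupDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combiner demo");
        // Safe here because summing partial counts is associative and commutative.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
    }
}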
Partitioner
In Hadoop MapReduce, the Partitioner comes into the picture when we are working with
more than one reducer (for a single reducer the partitioner is not used).
The Partitioner takes the output from the combiners and performs partitioning. Partitioning
of the output takes place on the basis of the key, and the records are then sorted. A hash
function on the key (or a subset of the key) is used to derive the partition.
According to the key value, each combiner output record is partitioned, records with the
same key go into the same partition, and each partition is then sent to a reducer.
Partitioning allows an even distribution of the map output over the reducers. A minimal
custom partitioner sketch is shown after this paragraph.
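A custom partitioner is a small class. The sketch below is illustrative (not from the slides) and mirrors the behaviour of Hadoop's default HashPartitioner: the key's hash decides which reducer receives the record.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Same key -> same partition -> same reducer.
public class HashKeyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is non-negative, then spread keys evenly.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// In the driver (only consulted when there is more than one reducer):
//   job.setPartitionerClass(HashKeyPartitioner.class);
//   job.setNumReduceTasks(4);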
Shuffling and Sorting
The output is now shuffled to the reduce node (a normal slave node on which the reduce
phase runs, hence called the reducer node). Shuffling is the physical movement of the
data over the network. Once all the mappers have finished and their output has been
shuffled to the reducer nodes, this intermediate output is merged and sorted, and then
provided as input to the reduce phase.
Reducer
It takes the set of intermediate key-value pairs produced by the mappers as input and
runs a reducer function on each of them to generate the output. The output of the
reducer is the final output, which is stored in HDFS.
RecordWriter
It writes the output key-value pairs from the Reducer phase to the output files.
OutputFormat
The way these output key-value pairs are written to output files by the RecordWriter is
determined by the OutputFormat. OutputFormat instances provided by Hadoop are used
to write files in HDFS or on the local disk. Thus, the final output of the reducer is
written to HDFS by an OutputFormat instance.
Types of InputFormat in MapReduce
• FileInputFormat
• TextInputFormat
• KeyValueTextInputFormat
• SequenceFileInputFormat
• SequenceFileAsTextInputFormat
• SequenceFileAsBinaryInputFormat
• NLineInputFormat
• DBInputFormat
FileInputFormat
• It is the base class for all file-based InputFormats.
• Hadoop FileInputFormat specifies input directory where data
  files are located.
• When we start a Hadoop job, FileInputFormat is provided with
  a path containing files to read.
• FileInputFormat reads all the files and divides them into
  one or more InputSplits.
TextInputFormat
• TextInputFormat is the default InputFormat.
• Each record is a line of input.
• The key, a LongWritable, is the byte offset of the beginning of the line
  within the file.
• The value is the contents of the line, excluding any line terminators.

Example: consider the following input_data:
  A king should hunt regularly
  A queen should shop daily,
  Other people should just try.

Using TextInputFormat, the records are interpreted as the following key-value pairs:
  Key   Value
  0     A king should hunt regularly
  29    A queen should shop daily,
  55    Other people should just try.
KeyValueTextInputFormat
• It is similar to TextInputFormat, as it also treats each line of input
  as a separate record.
• While TextInputFormat treats the entire line as the value,
  KeyValueTextInputFormat breaks the line itself into a key and a value
  at a tab character ('\t').
• Here the key is everything up to the tab character, while the value is
  the remaining part of the line after the tab character.
• A short driver-side sketch of using this format is shown below.
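A hedged driver-side sketch of enabling this format is shown below; the separator property name is the one used by the new (mapreduce) API, and the job name is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

// Each input line is split into (key, value) at the first separator (tab by default).
public class KeyValueInputDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");

        Job job = Job.getInstance(conf, "kv input demo");
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // With this format the mapper's input types become <Text, Text>.
    }
}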
SequenceFileInputFormat
• Hadoop SequenceFileInputFormat is an InputFormat
  which reads sequence files.
• Sequence files are binary files that store sequences of
  binary key-value pairs.
• Sequence files are block-compressed and provide direct serialization
  and deserialization of several arbitrary data types (not just text).
• Here Key & Value both are user-defined.
SequenceFileAsTextInputFormat
• Hadoop SequenceFileAsTextInputFormat is another form
  of SequenceFileInputFormat which converts the sequence file
  key values to Text objects.
• The conversion is performed by calling toString() on the keys and
  values.
• This InputFormat makes sequence files suitable input for
  streaming.
SequenceFileAsBinaryInputFormat
• Hadoop SequenceFileAsBinaryInputFormat is a variant of
  SequenceFileInputFormat using which we can extract the sequence
  file's keys and values as opaque binary objects.
NLineInputFormat
• Hadoop NLineInputFormat is another form of TextInputFormat where
  the keys are byte offsets of the lines and the values are the contents
  of the lines.
• If we want each mapper to receive a fixed number of lines of input,
  then we use NLineInputFormat.
• N is the number of lines of input that each mapper receives.
• By default (N=1), each mapper receives exactly one line of input.
• If N=2, then each split contains two lines: one mapper receives the
  first two key-value pairs and another mapper receives the next two
  key-value pairs (a short sketch of setting N follows this list).
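A short sketch of configuring N, assuming a word-count-style driver; the value 2 is only an example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// Each mapper receives at most N complete input lines.
public class NLineDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "nline demo");
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 2);   // N = 2
    }
}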
DBInputFormat
• Hadoop DBInputFormat is an InputFormat that reads data
  from a relational database, using JDBC.
• It is best for loading relatively small datasets.
• Here the key is a LongWritable while the value is a DBWritable.
OutputFormat in MapReduce
• TextOutputFormat
• SequenceFileOutputFormat
• SequenceFileAsBinaryOutputFormat
• MapFileOutputFormat
• MultipleOutputs
• LazyOutputFormat
• DBOutputFormat
TextOutputFormat
• The default OutputFormat is TextOutputFormat.
• It writes (key, value) pairs on individual lines of text files.
• Its keys and values can be of any type, because TextOutputFormat
  turns them into strings by calling toString() on them.
• It separates each key-value pair with a tab character (the separator
  is configurable, as the sketch below shows).
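The tab separator can be changed through a configuration property; a minimal sketch follows (the comma is an arbitrary choice, and the job name is illustrative).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Change the key/value separator of TextOutputFormat from a tab to a comma.
public class TextOutputDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.output.textoutputformat.separator", ",");
        Job job = Job.getInstance(conf, "text output demo");
        job.setOutputFormatClass(TextOutputFormat.class);
    }
}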
SequenceFileOutputFormat
• This OutputFormat writes sequence files as its output.
• SequenceFileOutputFormat is also an intermediate format used
  between MapReduce jobs.
• It serializes arbitrary data types to the file.
• It presents the data to the next mapper in the same manner as
  it was emitted by the previous reducer.
SequenceFileAsBinaryOutputFormat
• It is another variant of SequenceFileOutputFormat.
• It also writes keys and values to sequence file in binary format.
MapFileOutputFormat
• It is another form of FileOutputFormat.
• It writes its output as map files (MapFiles).
• The framework adds keys to a MapFile in order, so we need to
  ensure that the reducer emits keys in sorted order.
MultipleOutputs
• This format allows writing data to multiple files whose names are
  derived from the output keys and values (a reducer-side usage
  sketch follows).
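A hedged reducer-side sketch follows; the named output "errors", the routing condition, and the driver call shown in the comment are all illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Routes some records to a separately named output ("errors").
// Driver side (assumed): MultipleOutputs.addNamedOutput(job, "errors",
//     TextOutputFormat.class, Text.class, IntWritable.class);
public class RoutingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        if (key.toString().startsWith("ERROR")) {
            mos.write("errors", key, new IntWritable(sum));   // goes to errors-r-xxxxx files
        } else {
            context.write(key, new IntWritable(sum));         // regular part-r-xxxxx output
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}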
LazyOutputFormat
• In MapReduce job execution, FileOutputFormat sometimes creates
  output files even if they are empty.
• LazyOutputFormat is a wrapper OutputFormat that ensures an output
  file is created only when the first record is actually written
  (a short sketch follows).
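A one-call sketch of wrapping TextOutputFormat lazily (the job name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Part files are only created when a record is actually written to them.
public class LazyOutputDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "lazy output demo");
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
    }
}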
DBOutputFormat
• It is the OutputFormat for writing to relational databases
  and HBase.
• This format sends the reduce output to a SQL table.
• It accepts key-value pairs, where the key has a type implementing
  DBWritable.
MapReduce WordCount Example 1
MapReduce WordCount Example 2
MapReduce WordCount Example 3
MapReduce Wordcount Example 4
MapReduce Wordcount Example 5
MapReduce Matrix Multiplication