Unit 2
INTRODUCTION
TO HADOOP AND
HADOOP
ARCHITECTURE
Dr. Nilesh M. Patil
Associate Professor
Dept. of Computer Engineering, DJSCE
                         Syllabus
• Big Data – Apache Hadoop & Hadoop EcoSystem
• Moving Data in and out of Hadoop – Understanding inputs and
  outputs of MapReduce, Concept of Hadoop
• HDFS Commands
• MapReduce-The Map Tasks, Grouping by Key, The Reduce Tasks,
  Combiners, Details of MapReduce Execution
• 8 Hours
• Marks: 20 (approx.)
                 Introduction to Hadoop
• Hadoop is an open-source framework from Apache and is used to
  store, process, and analyze data that is very huge in volume.
• Founders: Doug Cutting and Mike Cafarella
• Hadoop is written in Java and is not OLAP (online analytical
  processing).
• It is used for batch/offline processing.
• It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn, and
  many more.
• Moreover, it can be scaled up just by adding nodes in the cluster.
History of Hadoop
•   Founders: Doug Cutting and Mike Cafarella
•   2002 – Apache Nutch (Open Source Web Crawler Software Project)
•   2003 – Google introduced GFS, its proprietary distributed file system
•   2004 – Google released a white paper on MapReduce
•   2005 – Doug Cutting and Mike Cafarella introduced NDFS (Nutch Distributed File System)
•   2006 – Doug Cutting joined Yahoo; NDFS was renamed HDFS and the project became
    Hadoop. The Hadoop 0.1.0 version was released.
•   2007 – Yahoo ran 2 clusters of 1000 machines
•   2008 – Hadoop became the fastest system to sort 1 TB of data, on a 900-node cluster,
    in 209 seconds
•   2013 – Hadoop 2.2 was released
•   2017 – Hadoop 3.0 was released
                     Why Hadoop?
• Overcomes the traditional limitations of storage and compute.
• Leverages inexpensive, commodity hardware as the platform.
• Provides linear scalability from 1 to 4000 servers.
• Low-cost, open-source software.
                       Hadoop Goals
1. Scalable: It can scale up from a single server to thousands of
   servers.
2. Fault tolerance: It is designed with a very high degree of fault
   tolerance.
3. Economical: It uses commodity hardware instead of high-end
   hardware.
4. Handle hardware failures: It has the ability to detect and handle
   failures at the application layer.
Core Hadoop
Components
             Hadoop Common Package
• Hadoop Common refers to the collection of common
  utilities and libraries that support other Hadoop
  modules.
• Hadoop Common is also known as Hadoop Core.
• Hadoop Common also contains the necessary Java
  Archive (JAR) files and scripts required to start Hadoop.
• The Hadoop Common package also provides source
  code and documentation, as well as a contribution
  section that includes different projects from the Hadoop
  Community.
      Hadoop Distributed File System (HDFS)
• The Hadoop Distributed File System was developed using a distributed
  file system design.
• It runs on commodity hardware.
• Unlike some other distributed systems, HDFS is highly fault-tolerant and
  designed to use low-cost hardware.
• HDFS holds a very large amount of data and provides easy access.
• To store such huge data, the files are stored across multiple
  machines.
• These files are stored in a redundant fashion to rescue the system
  from possible data losses in case of failure.
• HDFS also makes applications available for parallel processing.
                     HDFS Features
• It is suitable for distributed storage and processing.
• Hadoop provides a command-line interface to interact with HDFS (a small
  Java sketch of interacting with HDFS programmatically follows this list).
• The built-in servers of the NameNode and DataNode help users easily
  check the status of the cluster.
• It provides streaming access to file system data.
• HDFS provides file permissions and authentication.
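The command interface mentioned above (the hdfs dfs shell) is the usual way to interact with HDFS; the same operations can also be performed from Java. The following is a minimal sketch, not taken from these slides, using the org.apache.hadoop.fs.FileSystem API; the file path and the assumption that fs.defaultFS points at a running HDFS (e.g. hdfs://localhost:9000) are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write a small file into HDFS and read back its status.
public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);            // handle to the file system named in fs.defaultFS

        Path file = new Path("/user/demo/hello.txt");    // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");                  // write a small record
        }

        System.out.println("Exists: " + fs.exists(file));
        System.out.println("Size:   " + fs.getFileStatus(file).getLen() + " bytes");
        fs.close();
    }
}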
HDFS Architecture
                        NameNode
• NameNode is the master node that contains the metadata.
• The NameNode is responsible for the workings of the data nodes.
• NameNode is the primary server that manages the file system
  namespace and controls client access to files.
• The NameNode performs file system namespace operations,
  including opening, closing and renaming files and directories.
• The NameNode also governs the mapping of blocks to the
  DataNodes.
                          DataNode
• The DataNodes are called the slaves.
• The DataNodes read, write, process, and replicate the data.
• They also send signals, known as heartbeats, to the NameNode.
  These heartbeats show the status of the DataNode.
• While there is only one NameNode, there can be multiple DataNodes.
• Consider that 30TB of data is loaded
  into the NameNode.
• The NameNode distributes it across
  the DataNodes, and this data is
  replicated among the DataNodes.
• You can see in the image above that the blue, grey, and red data are
  replicated among the three DataNodes.
• Replication of the data is performed
  three times by default. It is done this
  way, so if a commodity machine fails,
  you can replace it with a new
  machine that has the same data.
                        Hadoop MapReduce
• Hadoop MapReduce is the processing unit of Hadoop.
• In the MapReduce approach, the processing is done at the slave nodes, and the
  final result is sent to the master node.
• MapReduce moves the code to the data: a small piece of code is sent to where
  the data resides and is used to process the entire data set. This code is
  usually very small in comparison to the data itself.
• You only need to send a few kilobytes worth of code to perform a heavy-duty
  process on computers.
• A MapReduce program executes in three stages, namely the map stage, the shuffle
  stage, and the reduce stage (a minimal word-count sketch follows this list).
   • Map stage − The map or mapper’s job is to process the input data. Generally, the
     input data is in the form of file or directory and is stored in the Hadoop file system
     (HDFS). The input file is passed to the mapper function line by line. The mapper
     processes the data and creates several small chunks of data.
   • Reduce stage − This stage is the combination of the Shuffle stage and
     the Reduce stage. The Reducer’s job is to process the data that comes from the
     mapper. After processing, it produces a new set of output, which will be stored in
     the HDFS.
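To make the map and reduce stages concrete, below is a minimal word-count sketch (not from the slides) written against the org.apache.hadoop.mapreduce API; the class names WordCountMapper and WordCountReducer are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: for every input line, emit (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);               // intermediate key-value pair
            }
        }
    }
}

// Reduce stage: after shuffle and sort, sum the counts for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));      // final output, stored in HDFS
    }
}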
MapReduce Architecture
         Components of MapReduce Architecture:
         1. Client: The MapReduce client is the one who brings the Job to
            the MapReduce for processing. There can be multiple clients
            available that continuously send jobs for processing to the
            Hadoop MapReduce Manager.
          2. Job: The MapReduce Job is the actual work that the client
             wants to perform, comprising many smaller tasks that the
             client wants to process or execute (a minimal job-submission
             sketch in Java follows this list).
          3. Hadoop MapReduce Master: It divides the particular job into
             subsequent job-parts.
          4. Job-Parts: The tasks or sub-jobs obtained after dividing the
             main job. The results of all the job-parts are combined to
             produce the final output.
         5. Input Data: The data set that is fed to the MapReduce for
            processing.
         6. Output Data: The final result is obtained after the processing.
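As noted in the list above, the client submits a job through a small driver program. The following is a hedged sketch of such a driver for the word-count classes shown earlier; the input and output paths passed on the command line are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Client/driver: describes the job and submits it to the cluster.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // map tasks
        job.setCombinerClass(WordCountReducer.class);   // optional local aggregation
        job.setReducerClass(WordCountReducer.class);    // reduce tasks
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /user/demo/input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}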
MapReduce Example
                         Hadoop YARN
• Hadoop YARN stands for Yet Another Resource Negotiator.
• It is the resource management unit of Hadoop and is available as a
  component of Hadoop version 2.
• Hadoop YARN acts like an OS for Hadoop. It is a resource management
  layer that sits on top of HDFS.
• It is responsible for managing cluster resources to make sure you don't
  overload one machine.
• It performs job scheduling to make sure that the jobs are scheduled in
  the right place.
Hadoop Ecosystem
 HDFS (Hadoop Distributed File System)
• It is the storage component of Hadoop that stores data in the form of
  files.
• Each file is divided into blocks of 128 MB (configurable) that are stored
  on different machines in the cluster.
• It has a master-slave architecture with two main components: Name
  Node and Data Node.
• The Name node is the master node, and there is only one per cluster. Its
  task is to know where each block belonging to a file is lying in the
  cluster.
• The Data node is the slave node that stores the blocks of data, and there
  can be many per cluster. Its task is to retrieve the data as and when
  required. It keeps in constant touch with the Name node through
  heartbeats.
                            MapReduce
• To handle Big Data, Hadoop relies on the MapReduce algorithm introduced
  by Google and makes it easy to distribute a job and run it in parallel in a
  cluster.
• It essentially divides a single task into multiple tasks and processes them on
  different machines.
• In layman’s terms, it works in a divide-and-conquer manner and runs the
  processes on the machines to reduce traffic on the network.
• It has two important phases: Map and Reduce.
• The Map phase filters, groups, and sorts the data. Input data is divided into
  multiple splits. Each map task works on a split of data in parallel on
  different machines and outputs key-value pairs. The output of this phase is
  acted upon in the Reduce phase, which aggregates the data, summarizes the
  result, and stores it on HDFS.
                            YARN
• YARN or Yet Another Resource Negotiator manages resources
  in the cluster and schedules the applications running over Hadoop.
• It allows data stored in HDFS to be processed by various data
  processing engines for batch processing, stream processing,
  interactive processing, graph processing, and many more.
• This increases the efficiency of the cluster.
                        HBase
• HBase is a Column-based NoSQL database.
• It runs on top of HDFS and can handle any type of data.
• It allows real-time processing and random read/write operations
  to be performed on the data.
                             Pig
• Pig was developed for analyzing large datasets and overcomes the
  difficulty of writing map and reduce functions directly.
• It consists of two components: Pig Latin and Pig Engine.
• Pig Latin is a scripting language that is similar to SQL.
• Pig Engine is the execution engine on which Pig Latin runs.
• Internally, the code written in Pig is converted to MapReduce
  functions, which makes it very easy for programmers who aren’t
  proficient in Java.
                             Hive
• Hive is a distributed data warehouse system developed by
  Facebook.
• It allows for easy reading, writing, and managing files on HDFS.
• It has its own query language for this purpose, known as Hive
  Query Language (HQL), which is very similar to SQL.
• This makes it very easy for programmers to express MapReduce
  jobs using simple HQL queries.
                             Sqoop
• A lot of applications still store data in relational databases, thus
  making them a very important source of data.
• Therefore, Sqoop plays an important part in bringing data from
  Relational Databases into HDFS.
• The commands written in Sqoop are internally converted into
  MapReduce tasks that are executed over HDFS.
• It works with almost all relational databases like MySQL,
  Postgres, SQLite, etc.
• It can also be used to export data from HDFS to RDBMS.
                            Flume
• Flume is an open-source, reliable, and available service used to
  efficiently collect, aggregate, and move large amounts of data
  from multiple data sources into HDFS.
• It can collect data in real-time as well as in batch mode.
• It has a flexible architecture and is fault-tolerant with multiple
  recovery mechanisms.
                              Kafka
• There are a lot of applications generating data and a
  commensurate number of applications consuming that data. But
  connecting them individually is a tough task. That’s where Kafka
  comes in.
• It sits between the applications generating data (Producers) and
  the applications consuming data (Consumers).
• Kafka is distributed and has in-built partitioning, replication, and
  fault-tolerance.
• It can handle streaming data and also allows businesses to
  analyze data in real-time.
                                Oozie
• Oozie is a workflow scheduler system that allows users to link jobs
  written on various platforms like MapReduce, Hive, Pig, etc.
• Using Oozie you can schedule a job in advance and can create a
  pipeline of individual jobs to be executed sequentially or in parallel to
  achieve a bigger task.
• For example, you can use Oozie to perform ETL operations on data
  and then save the output in HDFS.
                        Zookeeper
• In a Hadoop cluster, coordinating and synchronizing nodes can
  be a challenging task. Therefore, Zookeeper is the perfect tool
  for the problem.
• It is an open-source, distributed, and centralized service for
  maintaining configuration information, naming, providing
  distributed synchronization, and providing group services across
  the cluster.
                             Mahout
• Mahout offers a platform to develop machine learning software that
  can be scaled.
• Machine learning algorithms enable the creation of self-learning
  systems that learn by themselves without having to be explicitly
  programmed.
• Based on the user’s behavior patterns, data, and previous experiences,
  such systems can make crucial decisions.
• Machine learning can be described as a branch of Artificial Intelligence (AI).
• Mahout provides collaborative filtering, clustering, and classification
  algorithms.
                               Spark
• Spark is an alternative processing framework to Hadoop MapReduce, written
  in Scala, but it supports applications written in Java, Python, etc.
• Compared to MapReduce, it provides in-memory processing, which
  accounts for faster processing.
• In addition to batch processing offered by Hadoop, it can also handle
  real-time processing.
                           Ambari
• Ambari is an Apache Software Foundation project which seeks to
  make the Hadoop ecosystem easier to manage.
• It is a software solution for provisioning and managing Apache
  Hadoop clusters.
Physical Architecture of Hadoop
Description of Hadoop components
• Name Node
    • It is the master of HDFS (the Hadoop file system).
    • Contains the Job Tracker, which keeps track of files distributed to
      different data nodes.
    • Failure of the Name Node will lead to the failure of the full Hadoop
      system.
• Data Node
    • The data node is the slave of HDFS.
    • Data nodes can communicate with each other through the name node
      to avoid replication in the provided task.
    • Data nodes report changes to the name node.
• Job Tracker
    • Determines which files to process.
    • There can be only one job tracker per Hadoop cluster.
• Task Tracker
    • Only a single task tracker is present per slave node.
    • Performs tasks given by the job tracker and also continuously
      communicates with it.
• SNN (Secondary Name Node)
    • Its main purpose is to monitor the process.
    • One SNN is present per cluster.
                                    Working
1.  When the client submits a job, it goes to the NameNode.
2.  The NameNode decides whether to accept the job or not.
3.  After accepting the job, the NameNode transfers the job to the job tracker.
4.  The job tracker divides the job into components and transfers them to the DataNodes.
5.  The DataNodes further transfer the jobs to the task trackers.
6.  The actual processing is done here, i.e., the execution of the submitted job happens on the
    task trackers.
7.  The job tracker continuously communicates with the task trackers. If at any moment the job
    tracker does not get a reply from a task tracker, it considers that task tracker to have failed
    and transfers its work to another one.
8.  After completing the parts of the job assigned to them, the task trackers submit the completed
    tasks to the job tracker via the DataNode.
9.  The task of the secondary NameNode is simply to monitor the whole ongoing process.
10. There is no fixed number of data nodes; there can be as many as required.
Hadoop Advantages
Limitations of Hadoop
                               Hadoop Installation
  Steps to Install Hadoop
  • Install Java JDK 1.8
  • Download Hadoop, extract it, and place it under the C drive
  • Set the path in Environment Variables
  • Edit the config files under the Hadoop directory
  • Create the folders datanode and namenode under the data directory
  • Edit the HDFS and YARN files
  • Set JAVA_HOME in the Hadoop environment file
  • Setup complete. Test by executing start-all.cmd
There are two ways to install Hadoop, i.e.
•Single node
•Multi node
A single node cluster means only one DataNode is running, with the NameNode, DataNode,
ResourceManager and NodeManager all set up on a single machine.
In a multi node cluster, there is more than one DataNode, and each DataNode runs on a different
machine. The multi node cluster is what organizations practically use for analyzing Big Data: when
we deal with petabytes of data, it needs to be distributed across hundreds of machines to be
processed, so we use a multi node cluster.
Setting up a single node Hadoop cluster
• Prerequisites to install Hadoop on Windows
• VIRTUAL BOX (for Linux): used to install a Linux operating system on
  a Windows machine.
• OPERATING SYSTEM: You can install Hadoop on Windows or Linux
  based operating systems. Ubuntu and CentOS are very commonly
  used.
• JAVA: You need to install the Java 8 package on your system.
• HADOOP: You require Hadoop’s latest version
1. Install Java
• Java JDK link to download:
  https://www.oracle.com/java/technologies/javase-jdk8-downloads.html
• Extract and install Java in C:\Java
• Open cmd and type -> javac -version

2. Download Hadoop
• https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
• Extract to C:\Hadoop
3. Set the JAVA_HOME environment variable path
4. Set the HADOOP_HOME environment variable path
5. Configurations
Edit file C:/Hadoop-3.3.0/etc/hadoop/core-site.xml,
paste the xml code and save this file.
<configuration>
 <property>
   <name>fs.defaultFS</name>
   <value>hdfs://localhost:9000</value>
 </property>
</configuration>
Rename “mapred-site.xml.template” to “mapred-site.xml” and edit this file C:/Hadoop-
3.3.0/etc/hadoop/mapred-site.xml, paste xml code and save this file.
<configuration>
 <property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
 </property>
</configuration>
Create folder “data” under “C:\Hadoop-3.3.0”
Create folder “datanode” under “C:\Hadoop-3.3.0\data”
Create folder “namenode” under “C:\Hadoop-3.3.0\data”
Edit file C:\Hadoop-3.3.0/etc/hadoop/hdfs-site.xml,
paste the xml code and save this file.
<configuration>
 <property>
   <name>dfs.replication</name>
   <value>1</value>
 </property>
 <property>
   <name>dfs.namenode.name.dir</name>
   <value>/hadoop-3.3.0/data/namenode</value>
 </property>
 <property>
   <name>dfs.datanode.data.dir</name>
   <value>/hadoop-3.3.0/data/datanode</value>
 </property>
</configuration>

Edit file C:/Hadoop-3.3.0/etc/hadoop/yarn-site.xml,
paste the xml code and save this file.
<configuration>
 <property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
 </property>
 <property>
   <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
 </property>
</configuration>

Edit file C:/Hadoop-3.3.0/etc/hadoop/hadoop-env.cmd by changing the line
set JAVA_HOME=%JAVA_HOME%
to
set JAVA_HOME=C:\Java
6. Hadoop Configurations
• Download
  https://github.com/brainmentorspvtltd/BigData_RDE/blob/master/Hadoop%20Configuration.zip
• or (for Hadoop 3)
  https://github.com/s911415/apache-hadoop-3.1.0-winutils
• Copy the bin folder and replace the existing bin folder in C:\Hadoop-3.3.0\bin
• Format the NameNode: open cmd and type the command "hdfs namenode -format"
   7. Testing
• Open cmd and change directory to C:\Hadoop-3.3.0\sbin
• type start-all.cmd
 (Or you can start like this)
 – Start namenode and datanode with this command
 – type start-dfs.cmd
 – Start yarn through this command
 – type start-yarn.cmd
Make sure these apps are running
– Hadoop Namenode
– Hadoop datanode
– YARN Resource Manager
– YARN Node Manager
Open: http://localhost:8088
Open: http://localhost:9870
MapReduce
MapReduce in a Nutshell
What is MapReduce?
MapReduce Execution Pipeline
Input Files
The data for a MapReduce task is stored in input files, and input files typically live
in HDFS. The format of these files is arbitrary; line-based log files and binary
formats can also be used.
InputFormat
InputFormat defines how these input files are split and read. It selects the files
or other objects that are used for input. InputFormat creates the InputSplits.
InputSplits
An InputSplit is created by the InputFormat and logically represents the data that will
be processed by an individual Mapper. One map task is created for each split; thus the
number of map tasks equals the number of InputSplits. Each split is divided into
records, and each record is processed by the mapper.
RecordReader
It communicates with the InputSplit in Hadoop MapReduce and converts the data into
key-value pairs suitable for reading by the mapper. By default, it uses
TextInputFormat for converting data into key-value pairs. RecordReader
communicates with the InputSplit until the file reading is completed. It assigns a
byte offset (a unique number) to each line present in the file. These key-value
pairs are then sent to the mapper for further processing.
Mapper
It processes each input record (from the RecordReader) and generates a new key-value
pair; the key-value pair generated by the Mapper can be completely different from the
input pair. The output of the Mapper is also known as the intermediate output and is
written to the local disk. The output of the Mapper is not stored on HDFS, as this is
temporary data and writing it to HDFS would create unnecessary copies (HDFS is also a
high-latency system). The mapper's output is passed to the combiner for further processing.
Combiner
The combiner is also known as a 'Mini-reducer'. The Hadoop MapReduce Combiner
performs local aggregation on the mappers' output, which helps to minimize the data
transfer between mapper and reducer (the reducer is described below); a short driver-side
sketch follows this paragraph. Once the combiner has executed, its output is passed to
the partitioner for further work.
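As mentioned above, when the reduce operation is associative and commutative (as in word count), the reducer class itself can usually be reused as the combiner. A minimal driver-side sketch follows; WordCountReducer is the class assumed from the earlier word-count example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Driver fragment: enable local (map-side) aggregation by reusing the reducer as combiner.
public class CombinerSetupDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combiner demo");
        // Safe here because summing partial counts is associative and commutative.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
    }
}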
Partitioner
In Hadoop MapReduce, the Partitioner comes into the picture when we are working with
more than one reducer (for a single reducer the partitioner is not used).
The Partitioner takes the output from the combiners and performs partitioning. Partitioning
of the output takes place on the basis of the key, and the records are then sorted. A hash
function on the key (or a subset of the key) is used to derive the partition.
According to the key value, each combiner output record is partitioned, records with the
same key go into the same partition, and each partition is then sent to a reducer.
Partitioning allows an even distribution of the map output over the reducers. A minimal
custom partitioner sketch is shown after this paragraph.
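A custom partitioner is a small class. The sketch below is illustrative (not from the slides) and mirrors the behaviour of Hadoop's default HashPartitioner: the key's hash decides which reducer receives the record.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Same key -> same partition -> same reducer.
public class HashKeyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is non-negative, then spread keys evenly.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// In the driver (only consulted when there is more than one reducer):
//   job.setPartitionerClass(HashKeyPartitioner.class);
//   job.setNumReduceTasks(4);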
Shuffling and Sorting
The output is now shuffled to the reduce node (a normal slave node on which the reduce
phase runs, hence called the reducer node). Shuffling is the physical movement of the
data over the network. Once all the mappers have finished and their output has been
shuffled to the reducer nodes, this intermediate output is merged and sorted, and then
provided as input to the reduce phase.
Reducer
It takes the set of intermediate key-value pairs produced by the mappers as input and
runs a reducer function on each of them to generate the output. The output of the
reducer is the final output, which is stored in HDFS.
RecordWriter
It writes the output key-value pairs from the Reducer phase to the output files.
OutputFormat
The way these output key-value pairs are written to output files by the RecordWriter is
determined by the OutputFormat. OutputFormat instances provided by Hadoop are used
to write files in HDFS or on the local disk. Thus, the final output of the reducer is
written to HDFS by an OutputFormat instance.
Types of InputFormat in MapReduce
• FileInputFormat
• TextInputFormat
• KeyValueTextInputFormat
• SequenceFileInputFormat
• SequenceFileAsTextInputFormat
• SequenceFileAsBinaryInputFormat
• NLineInputFormat
• DBInputFormat
FileInputFormat
• It is the base class for all file-based InputFormats.
• Hadoop FileInputFormat specifies input directory where data
  files are located.
• When we start a Hadoop job, FileInputFormat is provided with
  a path containing files to read.
• FileInputFormat reads all the files and divides them into
  one or more InputSplits.
TextInputFormat
• TextInputFormat is the default InputFormat.
• Each record is a line of input.
• The key, a LongWritable, is the byte offset of the beginning of the line
  within the file.
• The value is the contents of the line, excluding any line terminators.

Example: consider the following input_data:
  A king should hunt regularly
  A queen should shop daily,
  Other people should just try.

Using TextInputFormat, the records are interpreted as the following key-value pairs:
  Key   Value
  0     A king should hunt regularly
  29    A queen should shop daily,
  55    Other people should just try.
KeyValueTextInputFormat
• It is similar to TextInputFormat, as it also treats each line of input
  as a separate record.
• While TextInputFormat treats the entire line as the value,
  KeyValueTextInputFormat breaks the line itself into a key and a value
  at a tab character ('\t').
• Here the key is everything up to the tab character, while the value is
  the remaining part of the line after the tab character.
• A short driver-side sketch of using this format is shown below.
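A hedged driver-side sketch of enabling this format is shown below; the separator property name is the one used by the new (mapreduce) API, and the job name is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

// Each input line is split into (key, value) at the first separator (tab by default).
public class KeyValueInputDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");

        Job job = Job.getInstance(conf, "kv input demo");
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // With this format the mapper's input types become <Text, Text>.
    }
}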
SequenceFileInputFormat
• Hadoop SequenceFileInputFormat is an InputFormat
  which reads sequence files.
• Sequence files are binary files that store sequences of
  binary key-value pairs.
• Sequence files are block-compressed and provide direct serialization
  and deserialization of several arbitrary data types (not just text).
• Here Key & Value both are user-defined.
SequenceFileAsTextInputFormat
• Hadoop SequenceFileAsTextInputFormat is another form
  of SequenceFileInputFormat which converts the sequence file
  key values to Text objects.
• The conversion is performed by calling toString() on the keys and
  values.
• This InputFormat makes sequence files suitable input for
  streaming.
SequenceFileAsBinaryInputFormat
• Hadoop SequenceFileAsBinaryInputFormat is a variant of
  SequenceFileInputFormat using which we can extract the sequence
  file's keys and values as opaque binary objects.
NLineInputFormat
• Hadoop NLineInputFormat is another form of TextInputFormat where
  the keys are byte offsets of the lines and the values are the contents
  of the lines.
• If we want each mapper to receive a fixed number of lines of input,
  then we use NLineInputFormat.
• N is the number of lines of input that each mapper receives.
• By default (N=1), each mapper receives exactly one line of input.
• If N=2, then each split contains two lines: one mapper receives the
  first two key-value pairs and another mapper receives the next two
  key-value pairs (a short sketch of setting N follows this list).
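A short sketch of configuring N, assuming a word-count-style driver; the value 2 is only an example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// Each mapper receives at most N complete input lines.
public class NLineDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "nline demo");
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 2);   // N = 2
    }
}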
DBInputFormat
• Hadoop DBInputFormat is an InputFormat that reads data
  from a relational database, using JDBC.
• It is best for loading relatively small datasets.
• Here the key is a LongWritable while the value is a DBWritable.
OutputFormat in MapReduce
• TextOutputFormat
• SequenceFileOutputFormat
• SequenceFileAsBinaryOutputFormat
• MapFileOutputFormat
• MultipleOutputs
• LazyOutputFormat
• DBOutputFormat
TextOutputFormat
• The default OutputFormat is TextOutputFormat.
• It writes (key, value) pairs on individual lines of text files.
• Its keys and values can be of any type, because TextOutputFormat
  turns them into strings by calling toString() on them.
• It separates each key-value pair with a tab character (the separator
  is configurable, as the sketch below shows).
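The tab separator can be changed through a configuration property; a minimal sketch follows (the comma is an arbitrary choice, and the job name is illustrative).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Change the key/value separator of TextOutputFormat from a tab to a comma.
public class TextOutputDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.output.textoutputformat.separator", ",");
        Job job = Job.getInstance(conf, "text output demo");
        job.setOutputFormatClass(TextOutputFormat.class);
    }
}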
SequenceFileOutputFormat
• This OutputFormat writes sequence files as its output.
• SequenceFileOutputFormat is also an intermediate format used
  between MapReduce jobs.
• It serializes arbitrary data types to the file.
• It presents the data to the next mapper in the same manner as
  it was emitted by the previous reducer.
SequenceFileAsBinaryOutputFormat
• It is another variant of SequenceFileOutputFormat.
• It also writes keys and values to sequence file in binary format.
MapFileOutputFormat
• It is another form of FileOutputFormat.
• It writes its output as map files (MapFiles).
• The framework adds keys to a MapFile in order, so we need to
  ensure that the reducer emits keys in sorted order.
MultipleOutputs
• This format allows writing data to multiple files whose names are
  derived from the output keys and values (a reducer-side usage
  sketch follows).
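A hedged reducer-side sketch follows; the named output "errors", the routing condition, and the driver call shown in the comment are all illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Routes some records to a separately named output ("errors").
// Driver side (assumed): MultipleOutputs.addNamedOutput(job, "errors",
//     TextOutputFormat.class, Text.class, IntWritable.class);
public class RoutingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        if (key.toString().startsWith("ERROR")) {
            mos.write("errors", key, new IntWritable(sum));   // goes to errors-r-xxxxx files
        } else {
            context.write(key, new IntWritable(sum));         // regular part-r-xxxxx output
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}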
LazyOutputFormat
• In MapReduce job execution, FileOutputFormat sometimes creates
  output files even if they are empty.
• LazyOutputFormat is a wrapper OutputFormat that ensures an output
  file is created only when the first record is actually written
  (a short sketch follows).
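A one-call sketch of wrapping TextOutputFormat lazily (the job name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Part files are only created when a record is actually written to them.
public class LazyOutputDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "lazy output demo");
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
    }
}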
DBOutputFormat
• It is the OutputFormat for writing to relational databases
  and HBase.
• This format sends the reduce output to a SQL table.
• It accepts key-value pairs, where the key has a type implementing
  DBWritable.
MapReduce WordCount Example 1
MapReduce WordCount Example 2
MapReduce WordCount Example 3
MapReduce Wordcount Example 4
MapReduce Wordcount Example 5
MapReduce Matrix Multiplication