
UNIT – 2

What is Hadoop? Explain the features of Hadoop.

• Hadoop is an open source framework that is meant for the storage and processing
of big data in a distributed manner.
• It is a widely used solution for handling big data challenges.

Some important features of Hadoop are –


1. Open Source – Hadoop is an open source framework, which means it is
   available free of cost. Users are also allowed to modify the source
   code as per their requirements.
2. Distributed Processing – Hadoop supports distributed processing of data,
   which makes processing faster. Data in Hadoop HDFS is stored in a distributed
   manner, and MapReduce is responsible for processing it in parallel.
3. Fault Tolerance – Hadoop is highly fault-tolerant. By default, it creates
   three replicas of each block on different nodes.
4. Reliability – Hadoop stores data on the cluster in a reliable manner that is
   independent of any single machine, so the data stored in a Hadoop environment
   is not affected by the failure of a machine.
5. Scalability – Hadoop runs on commodity hardware, and nodes can easily be
   added to or removed from the cluster.
6. High Availability – The data stored in Hadoop remains accessible even after
   a hardware failure; in that case, the data can be read from another node.
Moving Data in and out of Hadoop:

Moving data in and out of Hadoop involves various methods and tools depending on
the specific requirements and data sources. Here are some common approaches for
data movement in Hadoop:

1. Hadoop File System Commands (HDFS commands):

Hadoop provides a set of command-line utilities to interact with the Hadoop Distributed File
System (HDFS). You can use commands like hdfs dfs -put to upload data from the local file system
to HDFS, and hdfs dfs -get to download data from HDFS to the local file system.
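
The same transfers can also be done programmatically through the HDFS Java API. The sketch below
is a minimal example; the file paths are hypothetical placeholders, and it assumes the cluster's
configuration files (core-site.xml, hdfs-site.xml) are on the classpath so FileSystem.get() can
locate the NameNode.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath to locate the cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hdfs dfs -put /local/data/sales.csv /user/demo/sales.csv
        fs.copyFromLocalFile(new Path("/local/data/sales.csv"),
                             new Path("/user/demo/sales.csv"));

        // Equivalent of: hdfs dfs -get /user/demo/sales.csv /local/data/sales-copy.csv
        fs.copyToLocalFile(new Path("/user/demo/sales.csv"),
                           new Path("/local/data/sales-copy.csv"));

        fs.close();
    }
}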

2. Hadoop Streaming:

Hadoop Streaming is a utility that allows you to write MapReduce programs in languages other than
Java, such as Python or Ruby. You can use this approach to process data in Hadoop while reading
input from standard input and writing output to standard output.

3. Apache Sqoop:

Apache Sqoop is a tool designed for efficiently transferring data between Apache
Hadoop and structured data sources such as relational databases. Sqoop can import
data from a database into Hadoop by executing parallel database queries and
transferring the results to HDFS. It can also export data from Hadoop back to the
database.

4. Apache Flume:

Apache Flume is a distributed, reliable, and scalable system for efficiently collecting,
aggregating, and moving large amounts of streaming data into Hadoop. Flume
provides a flexible architecture to ingest data from various sources like web servers,
social media feeds, log files, etc., and transport it to Hadoop for processing.
5. Apache Kafka:

Apache Kafka is a distributed streaming platform that can be used for building real-time data
pipelines and streaming applications. Kafka allows you to publish and subscribe to streams of
records, and it can integrate with Hadoop to move data from Kafka topics to Hadoop clusters for
further processing.
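
As an illustration, the sketch below publishes a few records to a Kafka topic using the Kafka
Java producer API; the broker address and topic name are placeholders. Moving those records into
Hadoop would then be done by a consumer application or a connector (for example, a Kafka Connect
HDFS sink) that reads the topic and writes the records to HDFS.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Publish records to a (hypothetical) "web-logs" topic.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("web-logs", "host1", "GET /index.html 200"));
            producer.send(new ProducerRecord<>("web-logs", "host2", "POST /login 302"));
        }
    }
}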

6. Hadoop Connectors and Integration Tools:

Various connectors and integration tools are available to facilitate data movement
between Hadoop and other systems. For example, Apache NiFi, Apache Falcon, and
Apache Nutch provide data ingestion, integration, and data pipeline management
capabilities, enabling seamless data flow in and out of Hadoop.

7. Custom Code and APIs:

You can also develop custom code using Hadoop APIs (e.g., Hadoop Java APIs or
Hadoop Streaming APIs) to read data from external sources, perform data
transformations, and write the processed data into Hadoop or export it to other
systems.
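
A minimal sketch of this kind of custom ingestion code is shown below, assuming a local log file
as the external source; the class name and both paths are hypothetical. It filters the data while
writing it into HDFS through the FileSystem API.

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IngestErrorsToHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Read from an external source (a local log file here), transform the data,
        // and write the result into HDFS. Both paths are placeholders.
        try (BufferedReader in = new BufferedReader(new FileReader("/var/log/app.log"));
             FSDataOutputStream out = fs.create(new Path("/user/demo/errors.log"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains("ERROR")) {             // keep only error lines
                    out.writeBytes(line.toLowerCase() + "\n");
                }
            }
        }
        fs.close();
    }
}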

UNDERSTANDING INPUTS AND OUTPUTS IN MAPREDUCE

Your data might be XML files sitting behind a number of FTP servers, text log files sitting on a
central web server, or Lucene indexes in HDFS. How does MapReduce support reading and writing to
these different serialization structures across the various storage mechanisms? You’ll need to
know the answer in order to support a specific serialization format.


Data input:-

The two classes that support data input in MapReduce are InputFormat and RecordReader. The
InputFormat class is consulted to determine how the input data should be partitioned for the map
tasks, and the RecordReader performs the reading of data from the inputs.

INPUTFORMAT:-

Every job in MapReduce must define its inputs according to contracts specified in the InputFormat
abstract class. InputFormat implementers must fulfill three contracts: first, they describe type
information for map input keys and values; next, they specify how the input data should be
partitioned; and finally, they indicate the RecordReader instance that should read the data from
the source.
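
The sketch below shows a minimal custom InputFormat (the class name is hypothetical). By extending
Hadoop's FileInputFormat it inherits the logic that partitions input files into splits; the generic
parameters declare the map input key/value types, and createRecordReader() supplies the reader for
each split, which together cover the three contracts described above. A job would select it with
job.setInputFormatClass(SimpleTextInputFormat.class).

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Map input keys are byte offsets (LongWritable) and values are lines of text (Text).
public class SimpleTextInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        // Reuse Hadoop's line-oriented reader; a custom RecordReader could be returned instead.
        return new LineRecordReader();
    }
}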


RECORDREADER:-

The RecordReader class is used by MapReduce in the map tasks to read data from an input split and
provide each record in the form of a key/value pair for use by mappers. A task is commonly created
for each input split, and each task has a single RecordReader that is responsible for reading the
data for that input split.
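
Below is a minimal sketch of a custom RecordReader; the class name and the upper-casing
transformation are hypothetical. It delegates the actual reading to Hadoop's built-in
LineRecordReader and shows the methods every RecordReader must implement. An InputFormat such as
the one sketched above could return it from createRecordReader().

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class UpperCaseLineRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate = new LineRecordReader();
    private final Text current = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        delegate.initialize(split, context);   // open the split (file and byte range)
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!delegate.nextKeyValue()) {
            return false;                      // no more records in this split
        }
        current.set(delegate.getCurrentValue().toString().toUpperCase());
        return true;
    }

    @Override
    public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }

    @Override
    public Text getCurrentValue() { return current; }

    @Override
    public float getProgress() throws IOException { return delegate.getProgress(); }

    @Override
    public void close() throws IOException { delegate.close(); }
}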

Data output:-

MapReduce uses a similar process for supporting output data as it does for input data. Two classes
must exist, an OutputFormat and a RecordWriter. The OutputFormat performs some basic validation
of the data sink properties, and the RecordWriter writes each reducer output to the data sink.

OUTPUTFORMAT:-

Much like the InputFormat class, the OutputFormat class defines the contracts that implementers
must fulfill, including checking the information related to the job output, providing a
RecordWriter, and specifying an output committer, which allows writes to be staged and then made
“permanent” upon task and/or job success.
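
As an illustration, the sketch below is a minimal custom OutputFormat; the class name and the
"key=value" line format are hypothetical. Extending FileOutputFormat means the output-specification
check and the output committer are inherited, so the class only supplies the RecordWriter that
writes each reducer output record to the data sink.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeyEqualsValueOutputFormat extends FileOutputFormat<Text, Text> {
    @Override
    public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext job)
            throws IOException, InterruptedException {
        // FileOutputFormat provides checkOutputSpecs() and the output committer;
        // we only define how each key/value pair is written.
        Path file = getDefaultWorkFile(job, ".txt");
        FSDataOutputStream out = file.getFileSystem(job.getConfiguration()).create(file, false);

        return new RecordWriter<Text, Text>() {
            @Override
            public void write(Text key, Text value) throws IOException {
                out.writeBytes(key + "=" + value + "\n");   // one "key=value" line per record
            }

            @Override
            public void close(TaskAttemptContext context) throws IOException {
                out.close();
            }
        };
    }
}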

What is Hadoop
Hadoop is an open source framework from Apache and is used to store, process, and analyze data
that are very huge in volume. Hadoop is written in Java and is not OLAP (online analytical
processing). It is used for batch/offline processing. It is used by Facebook, Yahoo, Google,
Twitter, LinkedIn, and many more. Moreover, it can be scaled up just by adding nodes to the
cluster.

Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS
   was developed on the basis of it. It states that files will be broken into
   blocks and stored on nodes over the distributed architecture.
2. YARN: Yet Another Resource Negotiator, which is used for job scheduling and
   managing the cluster.
3. MapReduce: This is a framework which helps Java programs do parallel
   computation on data using key/value pairs. The Map task takes input data and
   converts it into a data set which can be computed over as key/value pairs. The
   output of the Map task is consumed by the Reduce task, and the output of the
   reducer gives the desired result (see the word-count sketch after this list).
4. Hadoop Common: These Java libraries are used to start Hadoop and are used
   by other Hadoop modules.
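
A minimal word-count sketch of the Map and Reduce tasks is shown below (class names are
illustrative). The mapper turns each input line into (word, 1) pairs, and the reducer consumes the
grouped mapper output and sums the counts for each word; a driver program would wire these classes
into a Job via job.setMapperClass() and job.setReducerClass().

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map task: converts each input line into (word, 1) key/value pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: consumes the mapper output and sums the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}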

Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the
HDFS (Hadoop Distributed File System). The MapReduce engine can be
MapReduce/MR1 or YARN/MR2.

A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes
the Job Tracker, Task Tracker, NameNode, and DataNode, whereas the slave nodes include the
DataNode and TaskTracker.

Hadoop Distributed File System


The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It uses a
master/slave architecture. This architecture consists of a single NameNode, which performs the
role of master, and multiple DataNodes, which perform the role of slaves.

Both the NameNode and DataNode are capable of running on commodity machines. HDFS is developed in
the Java language, so any machine that supports Java can easily run the NameNode and DataNode
software.

NameNode
o It is the single master server that exists in the HDFS cluster.
o As it is a single node, it may become a single point of failure.
o It manages the file system namespace by executing operations such as opening,
renaming, and closing files.
o It simplifies the architecture of the system.

DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file
system's clients.
o It performs block creation, deletion, and replication upon instruction from the
NameNode.

Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from clients and process the
data by using the NameNode.
o In response, the NameNode provides metadata to the Job Tracker.

Task Tracker
o It works as a slave node for Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the file. This
process can also be called a Mapper.

MapReduce Layer
The MapReduce layer comes into existence when the client application submits a
MapReduce job to the Job Tracker. In response, the Job Tracker sends the request to the
appropriate Task Trackers. Sometimes a TaskTracker fails or times out. In such a case,
that part of the job is rescheduled.

Advantages of Hadoop
o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in
faster retrieval. Even the tools to process the data are often on the same servers, thus
reducing the processing time. Hadoop is able to process terabytes of data in minutes and
petabytes in hours.
o Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so
it is really cost-effective compared to a traditional relational database management
system.
o Resilient to failure: HDFS has the ability to replicate data over the network, so if one
node is down or some other network failure happens, Hadoop takes the other copy of the
data and uses it. Normally, data is replicated three times, but the replication factor is
configurable (see the sketch after this list).
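
As a small illustration of the configurable replication factor, the sketch below sets the cluster
default through the dfs.replication property and overrides it for one (hypothetical) file via the
HDFS Java API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);   // cluster default, normally set in hdfs-site.xml

        FileSystem fs = FileSystem.get(conf);
        // Per-file override: keep only two copies of this (hypothetical) file.
        fs.setReplication(new Path("/user/demo/archive/old-logs.txt"), (short) 2);
        fs.close();
    }
}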
