
UNIT – 2

What is Hadoop? Explain the features of Hadoop.

• Hadoop is an open source framework that is meant for the storage and processing
of big data in a distributed manner.
• It is a widely used solution for handling big data challenges.

Some important features of Hadoop are –


1. Open Source – Hadoop is an open source framework, which means it is
   available free of cost. Users are also allowed to modify the source
   code as per their requirements.
2. Distributed Processing – Hadoop supports distributed processing of data,
   which makes processing faster. Data in Hadoop HDFS is stored in a distributed
   manner, and MapReduce is responsible for processing it in parallel.
3. Fault Tolerance – Hadoop is highly fault-tolerant. By default, it creates
   three replicas of each block on different nodes.
4. Reliability – Hadoop stores data on the cluster in a reliable manner that is
   independent of any single machine, so the data stored in a Hadoop environment
   is not affected by the failure of a machine.
5. Scalability – Hadoop runs on commodity hardware, and nodes can easily be
   added to or removed from the cluster.
6. High Availability – The data stored in Hadoop remains accessible even after
   a hardware failure; in that case, the data can be read from another node.
Moving Data in and out of Hadoop:

Moving data in and out of Hadoop involves various methods and tools depending on
the specific requirements and data sources. Here are some common approaches for
data movement in Hadoop:

1. Hadoop File System Commands (HDFS commands):

Hadoop provides a set of command-line utilities to interact with the Hadoop Distributed File
System (HDFS). You can use commands like hdfs dfs -put to upload data from the local file system
to HDFS, and hdfs dfs -get to download data from HDFS to the local file system.
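
The same transfers can also be done programmatically through the HDFS Java API. The sketch below
is a minimal example; the file paths are hypothetical placeholders, and it assumes the cluster's
configuration files (core-site.xml, hdfs-site.xml) are on the classpath so FileSystem.get() can
locate the NameNode.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath to locate the cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hdfs dfs -put /local/data/sales.csv /user/demo/sales.csv
        fs.copyFromLocalFile(new Path("/local/data/sales.csv"),
                             new Path("/user/demo/sales.csv"));

        // Equivalent of: hdfs dfs -get /user/demo/sales.csv /local/data/sales-copy.csv
        fs.copyToLocalFile(new Path("/user/demo/sales.csv"),
                           new Path("/local/data/sales-copy.csv"));

        fs.close();
    }
}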

2. Hadoop Streaming:

Hadoop Streaming is a utility that allows you to write MapReduce programs in languages other than
Java, such as Python or Ruby. You can use this approach to process data in Hadoop while reading
input from standard input and writing output to standard output.

3. Apache Sqoop:

Apache Sqoop is a tool designed for efficiently transferring data between Apache
Hadoop and structured data sources such as relational databases. Sqoop can import
data from a database into Hadoop by executing parallel database queries and
transferring the results to HDFS. It can also export data from Hadoop back to the
database.

4. Apache Flume:

Apache Flume is a distributed, reliable, and scalable system for efficiently collecting,
aggregating, and moving large amounts of streaming data into Hadoop. Flume
provides a flexible architecture to ingest data from various sources like web servers,
social media feeds, log files, etc., and transport it to Hadoop for processing.
5. Apache Kafka:

Apache Kafka is a distributed streaming platform that can be used for building real-time data
pipelines and streaming applications. Kafka allows you to publish and subscribe to streams of
records, and it can integrate with Hadoop to move data from Kafka topics to Hadoop clusters for
further processing.
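
As an illustration, the sketch below publishes a few records to a Kafka topic using the Kafka
Java producer API; the broker address and topic name are placeholders. Moving those records into
Hadoop would then be done by a consumer application or a connector (for example, a Kafka Connect
HDFS sink) that reads the topic and writes the records to HDFS.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Publish records to a (hypothetical) "web-logs" topic.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("web-logs", "host1", "GET /index.html 200"));
            producer.send(new ProducerRecord<>("web-logs", "host2", "POST /login 302"));
        }
    }
}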

6. Hadoop Connectors and Integration Tools:

Various connectors and integration tools are available to facilitate data movement
between Hadoop and other systems. For example, Apache NiFi, Apache Falcon, and
Apache Nutch provide data ingestion, integration, and data pipeline management
capabilities, enabling seamless data flow in and out of Hadoop.

7. Custom Code and APIs:

You can also develop custom code using Hadoop APIs (e.g., Hadoop Java APIs or
Hadoop Streaming APIs) to read data from external sources, perform data
transformations, and write the processed data into Hadoop or export it to other
systems.
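
A minimal sketch of this kind of custom ingestion code is shown below, assuming a local log file
as the external source; the class name and both paths are hypothetical. It filters the data while
writing it into HDFS through the FileSystem API.

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IngestErrorsToHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Read from an external source (a local log file here), transform the data,
        // and write the result into HDFS. Both paths are placeholders.
        try (BufferedReader in = new BufferedReader(new FileReader("/var/log/app.log"));
             FSDataOutputStream out = fs.create(new Path("/user/demo/errors.log"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains("ERROR")) {             // keep only error lines
                    out.writeBytes(line.toLowerCase() + "\n");
                }
            }
        }
        fs.close();
    }
}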

UNDERSTANDING INPUTS AND OUTPUTS IN MAPREDUCE

Your data might be XML files sitting behind a number of FTP servers, text log files sitting on a
central web server, or Lucene indexes in HDFS. How does MapReduce support reading and writing to
these different serialization structures across the various storage mechanisms? You’ll need to
know the answer in order to support a specific serialization format.


Data input:-

The two classes that support data input in MapReduce are InputFormat and RecordReader. The
InputFormat class is consulted to determine how the input data should be partitioned for the map
tasks, and the RecordReader performs the reading of data from the inputs.

INPUTFORMAT:-

Every job in MapReduce must define its inputs according to contracts specified in the InputFormat
abstract class. InputFormat implementers must fulfill three contracts: first, they describe type
information for map input keys and values; next, they specify how the input data should be
partitioned; and finally, they indicate the RecordReader instance that should read the data from
the source.
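
The sketch below shows a minimal custom InputFormat (the class name is hypothetical). By extending
Hadoop's FileInputFormat it inherits the logic that partitions input files into splits; the generic
parameters declare the map input key/value types, and createRecordReader() supplies the reader for
each split, which together cover the three contracts described above. A job would select it with
job.setInputFormatClass(SimpleTextInputFormat.class).

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Map input keys are byte offsets (LongWritable) and values are lines of text (Text).
public class SimpleTextInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        // Reuse Hadoop's line-oriented reader; a custom RecordReader could be returned instead.
        return new LineRecordReader();
    }
}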


RECORDREADER:-

The RecordReader class is used by MapReduce in the map tasks to read data from an input split and
provide each record in the form of a key/value pair for use by mappers. A task is commonly created
for each input split, and each task has a single RecordReader that is responsible for reading the
data for that input split.
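
Below is a minimal sketch of a custom RecordReader; the class name and the upper-casing
transformation are hypothetical. It delegates the actual reading to Hadoop's built-in
LineRecordReader and shows the methods every RecordReader must implement. An InputFormat such as
the one sketched above could return it from createRecordReader().

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class UpperCaseLineRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate = new LineRecordReader();
    private final Text current = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        delegate.initialize(split, context);   // open the split (file and byte range)
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!delegate.nextKeyValue()) {
            return false;                      // no more records in this split
        }
        current.set(delegate.getCurrentValue().toString().toUpperCase());
        return true;
    }

    @Override
    public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }

    @Override
    public Text getCurrentValue() { return current; }

    @Override
    public float getProgress() throws IOException { return delegate.getProgress(); }

    @Override
    public void close() throws IOException { delegate.close(); }
}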

Data output:-

MapReduce uses a similar process for supporting output data as it does for input data. Two classes
must exist, an OutputFormat and a RecordWriter. The OutputFormat performs some basic validation
of the data sink properties, and the RecordWriter writes each reducer output to the data sink.

OUTPUTFORMAT:-

Much like the InputFormat class, the OutputFormat class defines the contracts that implementers
must fulfill, including checking the information related to the job output, providing a
RecordWriter, and specifying an output committer, which allows writes to be staged and then made
“permanent” upon task and/or job success.
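
As an illustration, the sketch below is a minimal custom OutputFormat; the class name and the
"key=value" line format are hypothetical. Extending FileOutputFormat means the output-specification
check and the output committer are inherited, so the class only supplies the RecordWriter that
writes each reducer output record to the data sink.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeyEqualsValueOutputFormat extends FileOutputFormat<Text, Text> {
    @Override
    public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext job)
            throws IOException, InterruptedException {
        // FileOutputFormat provides checkOutputSpecs() and the output committer;
        // we only define how each key/value pair is written.
        Path file = getDefaultWorkFile(job, ".txt");
        FSDataOutputStream out = file.getFileSystem(job.getConfiguration()).create(file, false);

        return new RecordWriter<Text, Text>() {
            @Override
            public void write(Text key, Text value) throws IOException {
                out.writeBytes(key + "=" + value + "\n");   // one "key=value" line per record
            }

            @Override
            public void close(TaskAttemptContext context) throws IOException {
                out.close();
            }
        };
    }
}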

What is Hadoop
Hadoop is an open source framework from Apache and is used to store, process, and analyze data
that are very huge in volume. Hadoop is written in Java and is not OLAP (online analytical
processing). It is used for batch/offline processing. It is used by Facebook, Yahoo, Google,
Twitter, LinkedIn, and many more. Moreover, it can be scaled up just by adding nodes to the
cluster.

Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS
   was developed on the basis of it. It states that files will be broken into
   blocks and stored on nodes over the distributed architecture.
2. YARN: Yet Another Resource Negotiator, which is used for job scheduling and
   managing the cluster.
3. MapReduce: This is a framework which helps Java programs do parallel
   computation on data using key/value pairs. The Map task takes input data and
   converts it into a data set which can be computed over as key/value pairs. The
   output of the Map task is consumed by the Reduce task, and the output of the
   reducer gives the desired result (see the word-count sketch after this list).
4. Hadoop Common: These Java libraries are used to start Hadoop and are used
   by other Hadoop modules.
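
A minimal word-count sketch of the Map and Reduce tasks is shown below (class names are
illustrative). The mapper turns each input line into (word, 1) pairs, and the reducer consumes the
grouped mapper output and sums the counts for each word; a driver program would wire these classes
into a Job via job.setMapperClass() and job.setReducerClass().

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map task: converts each input line into (word, 1) key/value pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: consumes the mapper output and sums the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}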

Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the
HDFS (Hadoop Distributed File System). The MapReduce engine can be
MapReduce/MR1 or YARN/MR2.

A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes
the Job Tracker, Task Tracker, NameNode, and DataNode, whereas the slave nodes include the
DataNode and TaskTracker.

Hadoop Distributed File System


The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It uses a
master/slave architecture. This architecture consists of a single NameNode, which performs the
role of master, and multiple DataNodes, which perform the role of slaves.

Both the NameNode and DataNode are capable of running on commodity machines. HDFS is developed in
the Java language, so any machine that supports Java can easily run the NameNode and DataNode
software.

NameNode
o It is the single master server that exists in the HDFS cluster.
o As it is a single node, it may become a single point of failure.
o It manages the file system namespace by executing operations such as opening,
renaming, and closing files.
o It simplifies the architecture of the system.

DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file
system's clients.
o It performs block creation, deletion, and replication upon instruction from the
NameNode.

Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from clients and process the
data by using the NameNode.
o In response, the NameNode provides metadata to the Job Tracker.

Task Tracker
o It works as a slave node for Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the file. This
process can also be called a Mapper.

MapReduce Layer
The MapReduce layer comes into existence when the client application submits a
MapReduce job to the Job Tracker. In response, the Job Tracker sends the request to the
appropriate Task Trackers. Sometimes a TaskTracker fails or times out. In such a case,
that part of the job is rescheduled.

Advantages of Hadoop
o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in
faster retrieval. Even the tools to process the data are often on the same servers, thus
reducing the processing time. Hadoop is able to process terabytes of data in minutes and
petabytes in hours.
o Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so
it is really cost-effective compared to a traditional relational database management
system.
o Resilient to failure: HDFS has the ability to replicate data over the network, so if one
node is down or some other network failure happens, Hadoop takes the other copy of the
data and uses it. Normally, data is replicated three times, but the replication factor is
configurable (see the sketch after this list).
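
As a small illustration of the configurable replication factor, the sketch below sets the cluster
default through the dfs.replication property and overrides it for one (hypothetical) file via the
HDFS Java API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);   // cluster default, normally set in hdfs-site.xml

        FileSystem fs = FileSystem.get(conf);
        // Per-file override: keep only two copies of this (hypothetical) file.
        fs.setReplication(new Path("/user/demo/archive/old-logs.txt"), (short) 2);
        fs.close();
    }
}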
