
UNIT-2

Hadoop: Features of Hadoop, key advantages of Hadoop, versions of Hadoop, overview of the
Hadoop ecosystem, Hadoop distributions, need of Hadoop, RDBMS vs Hadoop, distributed
computing challenges, history of Hadoop, Hadoop overview, HDFS

Hadoop

Hadoop is an Apache open-source framework, written in Java, that allows distributed
processing of large datasets across clusters of computers using simple programming models.
A Hadoop framework application works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is designed to
scale up from a single server to thousands of machines, each offering local computation and
storage.

Hadoop Architecture

At its core, Hadoop has two major layers namely −

 Processing/Computation layer (MapReduce), and


 Storage layer (Hadoop Distributed File System).
MapReduce

MapReduce is a parallel programming model for writing distributed applications devised at


Google for efficient processing of large amounts of data (multi-terabyte data-sets), on large
clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The
MapReduce program runs on Hadoop which is an Apache open-source framework.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and
provides a distributed file system that is designed to run on commodity hardware. It has many
similarities with existing distributed file systems. However, the differences from other
distributed file systems are significant. It is highly fault-tolerant and is designed to be
deployed on low-cost hardware. It provides high throughput access to application data and is
suitable for applications having large datasets.

Apart from the above-mentioned two core components, Hadoop framework also includes the
following two modules −

 Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.
 Hadoop YARN − This is a framework for job scheduling and cluster resource management.

How Does Hadoop Work?

It is quite expensive to build bigger servers with heavy configurations that can handle
large-scale processing. As an alternative, you can tie together many commodity computers,
each with a single CPU, into a single functional distributed system; in practice, the
clustered machines can read the dataset in parallel and provide much higher throughput.
Moreover, this is cheaper than one high-end server. So the first motivational factor behind
using Hadoop is that it runs across clusters of low-cost machines.

Hadoop runs code across a cluster of computers. This process includes the following core
tasks that Hadoop performs −

 Data is initially divided into directories and files. Files are divided into uniformly sized blocks
(typically 128 MB; older versions used 64 MB).
 These files are then distributed across various cluster nodes for further processing.
 HDFS, being on top of the local file system, supervises the processing.
 Blocks are replicated for handling hardware failure.
 Checking that the code was executed successfully.
 Performing the sort that takes place between the map and reduce stages.
 Sending the sorted data to a certain computer.
 Writing the debugging logs for each job.

Advantages of Hadoop
 Hadoop framework allows the user to quickly write and test distributed systems. It is
efficient, and it automatically distributes the data and work across the machines and, in turn,
utilizes the underlying parallelism of the CPU cores.
 Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA);
rather, the Hadoop library itself has been designed to detect and handle failures at the
application layer.
 Servers can be added or removed from the cluster dynamically, and Hadoop continues to
operate without interruption.
 Another big advantage of Hadoop is that, apart from being open source, it is compatible with
all platforms since it is Java based.

Apache Hadoop is an open source framework that is used to efficiently store and

process large datasets ranging in size from gigabytes to petabytes of data.

 Instead of using one large computer to store and process the data, Hadoop allows

clustering multiple computers to analyze massive datasets in parallel more quickly.

Hadoop consists of four main modules:

 Hadoop Distributed File System (HDFS) — A distributed file system that runs on

standard or low-end hardware. HDFS provides better data throughput than traditional file

systems, in addition to high fault tolerance and native support of large datasets.
 Yet Another Resource Negotiator (YARN) — Manages and monitors cluster nodes and

resource usage. It schedules jobs and tasks.

 MapReduce — A framework that helps programs do the parallel computation on data.

The map task takes input data and converts it into a dataset that can be computed in key

value pairs. The output of the map task is consumed by reduce tasks to aggregate output

and provide the desired result.

 Hadoop Common — Provides common Java libraries that can be used across all

modules.

How Hadoop Works

Hadoop makes it easier to use all the storage and processing capacity in cluster servers, and to

execute distributed processes against huge amounts of data. Hadoop provides the building

blocks on which other services and applications can be built.

Applications that collect data in various formats can place data into the Hadoop cluster by

using an API operation to connect to the NameNode. The NameNode tracks the file directory

structure and placement of “chunks” for each file, replicated across DataNodes.

To run a job to query the data, provide a MapReduce job made up of many map and reduce

tasks that run against the data in HDFS spread across the DataNodes. Map tasks run on each

node against the input files supplied, and reducers run to aggregate and organize the final

output.
Why is Hadoop used for Big Data Analytics?

Hadoop is well suited for storing and processing Big Data because:

Hadoop stores huge files as they are (raw), without requiring any schema.

 High scalability - we can add any number of nodes, hence enhancing performance

dramatically.

 High availability - in Hadoop, data remains available despite hardware failure. If a

machine or a few pieces of hardware crash, the data can still be accessed from another path.

 Reliability - data is reliably stored on the cluster despite machine failure.

 Economy - Hadoop runs on a cluster of commodity hardware, which is not very

expensive.

Features of Hadoop

a. Open Source

 One of the features of Hadoop is that it is an open-source project, which means its
source code is available to all for modification, inspection, and analysis.
 Because the code is open-source, firms may alter it to meet their own requirements.
 Because of the code's adaptability, enterprises may customise Hadoop to their own
needs.

b. Highly Scalable Cluster

 Hadoop is extremely scalable and capable of handling massive amounts of data by


distributing it over several computers running in parallel, which is one of the features
of Hadoop.
 Unlike typical relational databases, Hadoop allows businesses to manage enormous
datasets spanning thousands of gigabytes by executing applications over numerous
nodes.

c. Fault Tolerance is Available

 Another one of the features of Hadoop is that it achieves fault tolerance by creating
replicas, which spread data blocks from a file among multiple servers in the HDFS
cluster.
 By default, HDFS duplicates each block three times on separate machines in the
cluster, providing data access even if a node fails or goes down.

d. High Availability is Provided

 By duplicating data across numerous DataNodes, Hadoop's fault tolerance feature


assures excellent data availability even in adverse situations.
 High availability Hadoop clusters have multiple NameNodes, with active and passive
nodes in hot standby setups. Passive nodes keep files accessible even if the active node
fails.

e. Cost-Effective

 Hadoop offers businesses working with large data a cost-effective storage alternative
to conventional relational database management systems, which may be expensive to
scale.
 Hadoop's scale-out design allows cost-effective storage of all data, reducing raw data
loss and enabling organizations to store and use their entire dataset at a fraction of the
cost.

f. Hadoop Provides Flexibility

 Hadoop is extremely adaptable and can handle a wide range of data types, including
structured, semi-structured, and unstructured data.
 Hadoop can handle and analyse data that is organised and well-defined, somewhat
organised, or even fully unstructured.

g. Easy to Use

 Hadoop reduces the need for clients to perform distributed computing jobs by
handling all of its intricacies, making it user-friendly.
 Users of Hadoop may concentrate on their data and analytics activities rather than the
complexities of distributed computing, making it easier to use and run.

h. Hadoop uses Data Locality

 The data locality feature of Hadoop allows computation to be conducted near the data,
eliminating the need to transport data and lowering network congestion.
 Hadoop enhances system throughput and overall performance by minimising data
transport and putting computing closer to the data nodes.

i. Provides Faster Data Processing

 Hadoop prioritises distributed processing, which results in speedier data processing


capabilities.
 Data is stored in a distributed manner in Hadoop HDFS, and the MapReduce architecture
allows for concurrent processing of the data.

j. Support for Multiple Data Formats


 Hadoop supports numerous data formats, allowing users to work with a wide range of
data.
 Hadoop offers versatility in managing numerous data types, making it flexible for
varied data processing and analysis needs, whether structured, semi-structured, or
unstructured data.

k. High Processing Speed

 One of the other features of Hadoop is that it is well-known for its fast processing
speed, which enables effective handling of massive amounts of data.
 Hadoop can greatly expedite data processing tasks by exploiting parallel
processing and distributed computing capabilities, resulting in quicker results and
enhanced overall performance.

l. Machine Learning Capabilities

 Hadoop has machine learning capabilities, allowing users to do advanced analytics


and predictive modeling on massive datasets.
 Users may utilize machine learning algorithms to identify insights and trends, and
create data-driven predictions at scale using frameworks like Apache Spark and other
integrated libraries with Hadoop.

m. Integration with Other Tools

 Hadoop integrates seamlessly with a wide range of tools and technologies, allowing
data ecosystem interoperability.
 Users may easily link Hadoop with a wide range of data processing frameworks,
databases, analytics tools, and visualization platforms, increasing the flexibility and
value of their data infrastructure.

n. Secure

 Hadoop delivers comprehensive security capabilities to protect data and maintain


secure ecosystem operations.
 It contains measures for authentication, authorization, and encryption to protect data
privacy and prevent unauthorized access, thereby making Hadoop a secure platform
for storing and processing sensitive data.

o. Community Support

 One of the main features of Hadoop is that it benefits from a vibrant and active
community of developers, users, and contributors.
 The community support for Hadoop is extensive, offering resources, documentation,
forums, and continuous development, ensuring users have access to assistance,
updates, and a wealth of knowledge for leveraging Hadoop effectively.

Key advantages of hadoop


Hadoop is a cornerstone technology in big data ecosystems, and it offers several key
advantages that make it widely adopted for processing massive datasets.

1. Scalability

 Hadoop is highly scalable — it can handle petabytes of data by simply adding more
nodes (computers) to the cluster.
 Grows linearly without a major overhaul.

2. Cost-Effective

 Built on open-source software and runs on commodity hardware, making it far


cheaper than traditional data warehouses or high-end servers.

3. Fault Tolerance

 Data is replicated across multiple nodes, so if one node fails, the system continues
to operate using copies from other nodes.
 Ensures high data reliability and availability.

4. Flexibility in Data Handling

 Can store and process structured, semi-structured, and unstructured data (e.g.,
logs, images, videos, emails).
 Not limited to a specific schema like traditional RDBMS.

5. High Processing Power

 Uses a distributed computing model (MapReduce or Apache Spark) to process large


data sets in parallel across many machines, significantly increasing speed.

6. Open Source & Extensible

 Being open source means it's free to use and has a huge community.
 Easily integrates with other tools like Hive, Pig, HBase, Spark, and Oozie.

7. Storage and Computation Together

 Unlike traditional systems, Hadoop stores and processes data on the same machines,
reducing data transfer time and improving efficiency.

8. Compatible with Cloud Platforms

 Works well with cloud services (e.g., Amazon EMR, Google Cloud Dataproc),
making deployment and scaling even easier.
9. Real-World Proven

 Used by tech giants like Facebook, Yahoo, Netflix, and LinkedIn to manage and
analyze their huge data volumes.

Versions of hadoop
Hadoop has gone through several key versions over time, each introducing important
improvements and new components.

Hadoop 1.x Series (Classic Hadoop)

Key Features:

 MapReduce was the only processing model.


 Used a single NameNode, which was a single point of failure.
 HDFS (Hadoop Distributed File System) for storage.

Limitations:

 No support for real-time or interactive processing.


 Limited scalability and fault tolerance for the NameNode.

Hadoop 2.x Series (Introduction of YARN)

Major Upgrade – Enter YARN:

 YARN (Yet Another Resource Negotiator) decoupled resource management from


MapReduce.
 Allowed multiple data processing engines like Spark, Tez, and Storm to run on
Hadoop.

Other Improvements:

 High Availability (HA) NameNode setup.


 Support for non-MapReduce applications.
 Better scalability and cluster utilization.

Hadoop 3.x Series (Modern Hadoop)

Key Enhancements:

 Support for GPUs in the cluster.


 Erasure Coding – saves ~50% storage space compared to 3x replication in HDFS.
 YARN Federation – better scalability for very large clusters.
 Docker and container integration.
 Improved NameNode scalability and support for more NameNodes.
Notable Release:

 Hadoop 3.3.x – current stable versions as of recent years.

Core Components Across Versions


Component Function

HDFS Storage system for big data

MapReduce Batch data processing framework

YARN Resource management and job scheduling

Common Shared utilities and libraries

Summary Table: Hadoop Version Evolution


Version Key Feature Highlights

1.x Basic HDFS + MapReduce; no YARN

2.x YARN introduced, better scalability

3.x Erasure Coding, Docker, GPU support, HA upgrades

Each version of Hadoop made the platform more scalable, flexible, and enterprise-ready —
moving from a batch-only framework to a modern ecosystem that supports real-time,
machine learning, and interactive analytics.

Overview of hadoop ecosystem


Apache Hadoop is an open-source framework designed to facilitate interaction with big
data. Big data is a term for data sets that cannot be efficiently processed using a
traditional methodology such as an RDBMS. Hadoop has made its place in industries and
companies that need to work on large data sets that are sensitive and need efficient
processing. HDFS, MapReduce, YARN, and Hadoop Common are the major elements
of Hadoop. Most tools or solutions are used to supplement or support these core
elements. All these tools work together to provide services such as data ingestion,
analysis, storage, maintenance, etc.

Components that collectively form a Hadoop Ecosystem

 HDFS: Hadoop Distributed File System

 YARN: Yet Another Resource Negotiator

 MapReduce: Programming-based Data Processing


 Spark: In-Memory data processing

 PIG, HIVE: processing of data services on Query-based

 HBase: NoSQL Database

 Mahout, Spark MLLib: Machine Learning algorithm libraries

 Solr, Lucene: Searching and Indexing

 Zookeeper: Managing cluster

 Oozie: Job Scheduling

HDFS:

 HDFS is the primary or core component of the Hadoop ecosystem. It is

responsible for storing large datasets of structured or unstructured data across

multiple nodes, thereby storing metadata as log files.

 HDFS consists of two basic components, viz.

1. Name Node

2. Data Node

 A Name Node is the primary node that contains metadata (data about data),

requiring comparatively fewer resources than the data nodes that store the actual data.
These data nodes are commodity hardware in the distributed environment, which is
undoubtedly what makes Hadoop cost-effective.

 HDFS maintains all coordination between clusters and hardware, so it works at

the heart of the system.

YARN:

 Yet Another Resource Negotiator, as the name suggests, YARN helps manage

resources across clusters. In short, it performs resource planning and allocation

for the Hadoop system.

 It consists of three main components, viz.

1. Resource Manager

2. Node Manager

3. Application Manager

 The Resource Manager has the privilege of allocating resources for the applications in

the system. In contrast, Node Managers allocate resources such as CPU, memory, and

bandwidth per machine and later report back to the Resource Manager. The Application

Manager acts as an interface between the Resource Manager and the Node Manager and

negotiates according to the requirements of the two.

MapReduce:

 Using distributed and parallel algorithms, MapReduce allows you to offload

processing logic and helps you write applications that transform large datasets

into manageable ones.

 MapReduce uses two functions, i.e., Map() and Reduce(), whose tasks are as follows (a
minimal Java sketch follows this list):

1. Map() performs sorting and filtering of the data and thus organizes it in the

form of a group. The map generates a result in the form of key-value pairs,

which are later processed by the Reduce() method.


2. Reduce(), as the name suggests, performs summarization by aggregating the

mapped data. Simply put, Reduce() takes the output generated by Map() as

input and combines those tuples into a smaller set of tuples.
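
To make the Map()/Reduce() contract concrete, here is a minimal word-count sketch in Java using the standard org.apache.hadoop.mapreduce API; the class names are illustrative, and both classes would normally live in their own files.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): breaks each input line into (word, 1) key-value pairs.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);               // emit (word, 1)
        }
    }
}

// Reduce(): aggregates all the counts that share the same key (word).
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));   // emit (word, total count)
    }
}

The framework groups the (word, 1) pairs emitted by Map() by key, and Reduce() collapses each group into a single (word, total) tuple, exactly as described above.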


PIG:
Pig was essentially developed by Yahoo. It uses Pig Latin, a query-based language
similar to SQL.

 It is a platform for data flow structuring, processing, and analyzing large data

sets.

 Pig does the job of executing the commands, and all the MapReduce activities are

taken care of in the background. After processing, Pig stores the result in

HDFS.

 The Pig Latin language is specifically designed for this framework, which runs on

the Pig Runtime.

 Pig helps to achieve ease of programming and optimization and thus is a core

segment of the Hadoop ecosystem.


HIVE:

 Using an SQL-like methodology and interface, HIVE reads and writes

large data sets. Its query language is known as HQL (Hive Query Language).

 It is highly scalable, as it allows both real-time and batch processing. Also, all

SQL data types are supported by Hive, making query processing easier.

 Like other query-processing frameworks, HIVE comes with two components: JDBC

Drivers and the HIVE Command Line.

 JDBC works alongside ODBC drivers to handle data storage and connection

permissions, while the HIVE command line helps with query processing; a small JDBC sketch follows below.
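
To make the JDBC side concrete, the sketch below opens a connection to a HiveServer2 instance through the standard java.sql API and runs a HiveQL query. The host, port, credentials, and the sales table are placeholder assumptions, and a running HiveServer2 with the Hive JDBC driver on the classpath is assumed.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver shipped with the Hive client libraries.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Placeholder connection string: HiveServer2 host, default port 10000, database "default".
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hiveuser", "");
             Statement stmt = con.createStatement()) {

            // HiveQL looks like SQL; "sales" is a hypothetical table.
            ResultSet rs = stmt.executeQuery(
                    "SELECT region, COUNT(*) FROM sales GROUP BY region");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}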
Mahout:

 Mahout enables machine learning of a system or application. Machine learning, as

the name suggests, helps a system evolve based on certain patterns,

user/environment interactions, or algorithms.

 It provides various libraries or features like collaborative filtering, clustering, and

classification, which are nothing but machine learning concepts. It allows us to

invoke algorithms according to our needs using our libraries.


Apache Spark:

 It is a platform that handles process-intensive tasks such as batch processing,

real-time interactive or iterative processing, graph conversions, and

visualizations.

 It relies heavily on in-memory computation and is therefore faster than disk-based

MapReduce in terms of optimization.

 Spark is best suited for real-time data, while Hadoop MapReduce is best suited for structured

data or batch processing; hence, both are used alongside each other in most companies.
Apache HBase:

 It is a NoSQL database that supports all kinds of data and is capable of handling

anything stored in a Hadoop database. It provides capabilities similar to Google's BigTable,

allowing it to work with large data sets efficiently.

 When we need to search for or retrieve a small piece of data from a huge

database, the request must be processed within a short, fast time frame. At such times,

HBase comes in handy, as it provides a fault-tolerant way of storing and quickly looking up

small amounts of data, as sketched below.
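
A minimal, hedged sketch of such a point lookup with the HBase Java client API is shown below; the table name, column family, and row key are hypothetical, and a reachable HBase cluster (configured via hbase-site.xml on the classpath) is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookupExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();    // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Fetch a single row by its key instead of scanning the whole table.
            Get get = new Get(Bytes.toBytes("user#1001"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}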
Other Components:

Apart from all these, some other components perform the huge task of making Hadoop

capable of processing large data sets. They are as follows:

 Solr, Lucene: These are two services that perform the task of searching and

indexing using Java libraries. Lucene is a Java library that also provides a

spell-checking mechanism; Solr is built on top of Lucene.

 Zookeeper: There was a huge problem with managing coordination and

synchronization between Hadoop resources or components, which often led to

inconsistency. Zookeeper has overcome all the problems by performing

synchronization, inter-component communication, grouping, and maintenance.


 Oozie: Oozie simply acts as a scheduler, so it schedules jobs and joins them

together. There are two kinds of tasks, i.e., Oozie workflow and Oozie

coordinator tasks. An Oozie workflow is the tasks that need to be executed

sequentially in an ordered manner. In contrast, the Oozie coordinator tasks are

triggered when some data or external stimulus is given to it.


Hadoop distributions

A Hadoop distribution refers to a packaged and often enhanced version of the Apache
Hadoop ecosystem, typically provided by companies or organizations that tailor it for
performance, manageability, and enterprise use. These distributions generally include:

 Core Hadoop components: HDFS (Hadoop Distributed File System), YARN (Yet
Another Resource Negotiator), and MapReduce.
 Ecosystem tools: Hive, Pig, HBase, Spark, Flume, Sqoop, Oozie, Zookeeper, etc.
 Management & Monitoring tools: For cluster setup, job management, security, and
system health checks.
 Security & Governance: Integrations for Kerberos authentication, data encryption,
access control (e.g., Ranger, Sentry), and auditing.

Popular Hadoop Distributions

1. Cloudera Data Platform (CDP)


o Merged with Hortonworks in 2019.
o Offers robust security, governance, and multi-cloud capabilities.
o Strong focus on data lakes and machine learning.
2. Hortonworks Data Platform (HDP) (now part of Cloudera)
o Focused on open-source principles with strong community roots.
o Integrated with Apache Ranger, Atlas for security and governance.
3. MapR (Now part of HPE)
o Known for its high-performance file system (MapR-FS).
o Supported real-time streaming and advanced analytics.
o Differentiated by not using HDFS, instead using its own file system.

4. Amazon EMR (Elastic MapReduce)

 A cloud-native Hadoop distribution.


 Integrated with AWS services (S3, DynamoDB, CloudWatch, etc.).
 Auto-scaling, managed cluster provisioning.

5. Microsoft Azure HDInsight

 A cloud-based distribution built on top of Hortonworks.


 Deep integration with Azure services.
 Supports Spark, Hive, HBase, Storm, Kafka.
6. Google Cloud Dataproc

 Lightweight, fast, managed Hadoop and Spark clusters.


 Integrates well with Google Cloud Storage, BigQuery, etc.

Need of hadoop

The need for Hadoop in Big Data Analytics arises from the challenges associated with
processing massive amounts of structured, semi-structured, and unstructured data. Here’s a
breakdown of why Hadoop plays a crucial role in big data environments:

1. Handling Large Volumes of Data

Big data refers to datasets that are too large or complex for traditional data-processing
software. Hadoop is designed to store and process terabytes to petabytes of data across
clusters of machines.

2. Distributed Storage (HDFS)

Hadoop uses the Hadoop Distributed File System (HDFS) to store data across multiple
nodes. This allows:

 Fault tolerance (data is replicated across nodes)


 High availability
 Scalability (easy to add more nodes)

3. Efficient Data Processing (MapReduce)

Hadoop's MapReduce model allows for distributed computation:

 Splits tasks into small parts and processes them in parallel


 Reduces processing time significantly
 Ideal for batch processing large datasets

4. Cost-Effective

Hadoop runs on commodity hardware, making it a low-cost solution for handling big data
compared to traditional high-end servers or data warehouses.

5. Flexibility with Unstructured Data

Hadoop can handle various types of data:


 Text, images, videos, log files, social media content, etc.
 Not limited to structured data (unlike traditional RDBMS)

6. Scalability

 You can start with a few nodes and scale up to thousands


 Suitable for growing datasets

7. Ecosystem Support

Hadoop has a rich ecosystem including tools like:

 Hive (SQL-like querying)


 Pig (scripting for data transformation)
 HBase (NoSQL database)
 Spark (fast, in-memory processing)
 Oozie, Flume, Sqoop, etc.

8. Real-World Applications

 Social media analytics


 Fraud detection
 Recommendation engines
 Healthcare data analysis
 Sensor data processing (IoT)

RDBMS vs Hadoop
Feature | RDBMS (Relational Database Management System) | Hadoop
Data Type | Handles structured data (rows, columns) | Handles structured, semi-structured, and unstructured data
Scalability | Vertical scaling (adding more power to a single machine) | Horizontal scaling (adding more machines to a cluster)
Storage Model | Centralized storage | Distributed storage (HDFS)
Data Volume | Suitable for small to medium datasets | Designed for very large datasets (terabytes to petabytes)
Processing Model | Traditional SQL-based processing | MapReduce (batch), Spark (real-time)
Cost | Expensive due to proprietary hardware/software | Cost-effective; uses commodity hardware and open-source software
Fault Tolerance | Limited; requires backups | Built-in fault tolerance with data replication
Performance with Big Data | Degrades significantly with huge data volumes | Optimized for big data workloads
Schema Dependency | Schema-on-write (must define schema before inserting data) | Schema-on-read (flexible; schema applied when reading data)
Real-Time Processing | Supports real-time processing for transactional data | Batch-oriented (Hadoop MapReduce); real-time via Spark
Use Cases | OLTP systems, small-scale analytics | Large-scale analytics, log processing, IoT, social media analytics

Hadoop is ideal when:

 You work with massive, diverse, or semi-structured data
 You need to process and analyze large datasets
 Data size is huge (TBs or PBs)
 Speed, fault tolerance, and scalability are the more critical requirements

RDBMS is ideal when:

 You work with structured, consistent data
 You need real-time transactional processing
 Data size is manageable (GBs)
 Data integrity and ACID compliance are essential

Distributed computing challenges


Big data systems like Hadoop and Spark rely on distributed computing—dividing data and
tasks across many machines. While powerful, this introduces several challenges:

1. Data Distribution & Partitioning

 Challenge: Efficiently dividing data across nodes without skew.


 Why it matters: Poor partitioning can cause data skew, leading to some nodes being
overloaded while others are idle.

2. Fault Tolerance

 Challenge: Nodes may fail due to hardware issues, network problems, or crashes.
 Why it matters: Systems must detect failures quickly and recover without data loss
or duplication.

3. Data Locality

 Challenge: Moving large datasets across nodes is costly.


 Why it matters: To reduce latency and bandwidth use, systems try to process data
where it's stored (data locality), but this is not always possible.

4. Scalability

 Challenge: Ensuring the system scales efficiently with more nodes and larger
datasets.
 Why it matters: Poorly designed systems face bottlenecks like slow job scheduling,
memory limits, or network congestion.

5. Synchronization & Coordination

 Challenge: Coordinating tasks across many nodes without conflicts or race


conditions.
 Why it matters: Asynchronous execution and communication delays can lead to
inconsistencies and deadlocks.

6. Network Overhead

 Challenge: High data movement across nodes increases latency and consumes
bandwidth.
 Why it matters: Can significantly slow down computation and analytics pipelines.

7. Security and Privacy

 Challenge: Distributed systems often span public/private clouds and multiple


organizations.
 Why it matters: Ensuring secure data access, authentication, and encryption
becomes more complex.

8. Monitoring and Debugging

 Challenge: Debugging across multiple nodes and tracking failures is hard.


 Why it matters: Without good logging and monitoring tools, diagnosing issues in
real-time is difficult.

9. Load Balancing

 Challenge: Distributing workload evenly across nodes.


 Why it matters: Imbalanced loads reduce overall efficiency and increase
computation time.

10. Consistency and Concurrency

 Challenge: Managing simultaneous access to data and ensuring consistent views.


 Why it matters: Distributed systems must maintain eventual or strong consistency
depending on use case.

History of hadoop
1. Origin (2003–2004): The Google Influence

 2003: Google published papers on Google File System (GFS).


 2004: Followed by a paper on MapReduce, Google’s programming model for
processing large datasets.
 These papers inspired open-source alternatives.

2. Doug Cutting & Nutch (2004–2006)

 Doug Cutting and Mike Cafarella were working on an open-source web search
engine called Nutch.
 Nutch needed a scalable storage and processing system.
 They started building an open-source implementation of GFS and MapReduce.
 The storage part became HDFS (Hadoop Distributed File System).
 The processing part became MapReduce.

3. Birth of Hadoop (2006)

 Doug Cutting moved to Yahoo!.


 Nutch was split: the distributed computing parts were extracted and became a new
project — Hadoop.
 Named after Cutting’s son’s toy elephant 🐘.
 2006: Yahoo! dedicated a team to develop Hadoop.

4. Adoption and Growth (2008–2010)

 2008: Hadoop became an Apache top-level project.


 Yahoo! ran Hadoop on 10,000+ cores, processing massive data.
 Used by Facebook, LinkedIn, eBay, Twitter, etc.
 Hadoop ecosystem began to grow: Hive, Pig, HBase, ZooKeeper, Oozie, etc.

5. Commercial Support (2011–Present)

 Emergence of commercial vendors:


o Cloudera (2008)
o Hortonworks (2011)
o MapR (2009)
 These companies offered Hadoop-based platforms with enterprise support.
 2019: Hortonworks merged with Cloudera.

6. Evolving Ecosystem

 Apache Spark emerged for faster, in-memory processing (not a Hadoop component,
but often used with it).
 Hadoop shifted from core MapReduce to more modern engines like YARN and
Spark.
 Cloud-native solutions (e.g., AWS EMR, GCP Dataproc) started replacing on-prem
Hadoop clusters.

Hadoop overview
Hadoop is an Apache open-source framework, written in Java, that allows distributed
processing of large datasets across clusters of computers using simple programming models.
A Hadoop framework application works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is designed to
scale up from a single server to thousands of machines, each offering local computation and
storage.

The Hadoop Architecture Mainly consists of 4 components.

 MapReduce
 HDFS(Hadoop Distributed File System)
 YARN(Yet Another Resource Negotiator)
 Common Utilities or Hadoop Common

1. MapReduce

MapReduce is a programming model, running on top of the YARN framework, whose major
feature is to perform distributed processing in parallel across a Hadoop cluster, which is
what makes Hadoop so fast. When you are dealing with Big Data, serial processing is no
longer of any use. MapReduce has two main tasks, divided phase-wise:
in the first phase Map is used, and in the next phase Reduce is used.
The input is provided to the Map() function, its output is then used as the input to the
Reduce() function, and after that we receive the final output. Let's understand what Map()
and Reduce() do.
The input to Map() is a set of data. The Map() function breaks the data blocks into tuples,
which are nothing but key-value pairs. These key-value pairs are then sent as input to
Reduce(). The Reduce() function combines these tuples based on their key values to form a
smaller set of tuples, performing operations such as sorting or summation, which is then
sent to the final output node. Finally, the output is obtained.
The exact processing done in the Reducer depends on the business requirement of the
industry concerned. This is how first Map() and then Reduce() are used, one after the other.
Let's understand the Map Task and Reduce Task in detail.
Map Task:

 RecordReader: The purpose of the RecordReader is to break the input into records. It is
responsible for providing key-value pairs to the Map() function. The key is typically the
record's locational information and the value is the data associated with it.
 Map: A map is nothing but a user-defined function whose work is to process the tuples
obtained from the RecordReader. The Map() function may generate no key-value pairs at
all or may generate multiple pairs for each input tuple.
 Combiner: The combiner is used for grouping the data in the Map workflow. It is similar to
a local reducer. The intermediate key-value pairs generated by the Map are combined
with the help of this combiner. Using a combiner is optional, not mandatory.
 Partitioner: The partitioner is responsible for fetching the key-value pairs generated in the
Mapper phase. It generates the shards corresponding to each reducer. The hash code of
each key is computed, and the partitioner takes its modulus with the number of
reducers (key.hashCode() % (number of reducers)); a sketch of such a partitioner follows
this list.
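
The sketch below shows a partitioner in the spirit of Hadoop's default HashPartitioner, implementing exactly the key.hashCode() % (number of reducers) rule described above; the class name and key/value types are illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate (key, value) pair to reducer number hash(key) mod numReduceTasks.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is always a valid, non-negative partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}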
Reduce Task
 Shuffle and Sort: The Reducer's task starts with this step. The process in which the
Mapper generates the intermediate key-value pairs and transfers them to the Reducer task
is known as Shuffling. Using the shuffling process, the system can sort the data by its
key value.

Shuffling begins once some of the Map tasks are done; it does not wait for all Mappers to
complete, which is why it is a faster process.
 Reduce: The main task of the Reduce function is to gather the tuples generated by Map
and then perform sorting and aggregation on those key-value pairs depending on their key
element.
 OutputFormat: Once all the operations are performed, the key-value pairs are written
into the file with the help of the RecordWriter, each record on a new line, with the key and
value separated by a space. A driver sketch that ties the Map and Reduce pieces together
follows below.
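
Tying the pieces together, a driver program configures which mapper, combiner, partitioner, reducer, and output paths a job uses, then submits it to the cluster. The sketch below reuses the hypothetical TokenizerMapper, IntSumReducer, and WordPartitioner classes from the earlier sketches; input and output paths are supplied on the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);      // Map phase
        job.setCombinerClass(IntSumReducer.class);      // optional local reducer
        job.setPartitionerClass(WordPartitioner.class); // routes keys to reducers
        job.setReducerClass(IntSumReducer.class);       // Reduce phase

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}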

2. HDFS

HDFS (Hadoop Distributed File System) is the storage layer of Hadoop. It is mainly

designed for working on commodity hardware devices (inexpensive devices), following a
distributed file system design. HDFS is designed in such a way that it prefers storing
data in large blocks rather than in many small blocks.
HDFS provides fault tolerance and high availability to the storage layer and the
other devices present in the Hadoop cluster. The data storage nodes in HDFS are:

 NameNode(Master)
 DataNode(Slave)
 NameNode: The NameNode works as the master in a Hadoop cluster and guides the
DataNodes (slaves). The NameNode is mainly used for storing the metadata, i.e. the data
about the data. Metadata can be the transaction logs that keep track of user
activity in the Hadoop cluster.
 Metadata can also be the name of a file, its size, and information about the
locations (block numbers, block IDs) on the DataNodes, which the NameNode stores to find
the closest DataNode for faster communication. The NameNode instructs the DataNodes
with operations such as delete, create, replicate, etc.
 DataNode: DataNodes work as slaves and are mainly used for storing the data in a
Hadoop cluster; the number of DataNodes can range from 1 to 500 or even
more. The more DataNodes there are, the more data the Hadoop cluster can
store. It is therefore advised that DataNodes have high storage
capacity to hold a large number of file blocks.
 High Level Architecture Of Hadoop

File Block in HDFS: Data in HDFS is always stored in terms of blocks. A single file is
divided into multiple blocks of 128 MB, which is the default size, and you can
also change it manually.
Let's understand this concept of breaking a file down into blocks with an example. Suppose
you upload a file of 400 MB to HDFS. The file gets divided into blocks of
128 MB + 128 MB + 128 MB + 16 MB = 400 MB, meaning 4 blocks are created,
each of 128 MB except the last one. Hadoop neither knows nor cares what data
is stored in these blocks, so the final, smaller block is simply treated as a partial block.
In the Linux file system, the size of a file block is about 4 KB,
which is very much less than the default block size in the Hadoop file system. As
we all know, Hadoop is mainly configured for storing data of very large size, up to
petabytes, and this is what makes the Hadoop file system different from other file systems:
it can be scaled. Nowadays, block sizes of 128 MB to 256 MB are commonly used in Hadoop.
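
The block arithmetic in the 400 MB example can be sketched in a few lines of Java; 128 MB is only the usual default, and the real value comes from the cluster configuration (the dfs.blocksize property).

public class BlockSplitExample {
    public static void main(String[] args) {
        long fileSize = 400L * 1024 * 1024;    // a 400 MB file
        long blockSize = 128L * 1024 * 1024;   // default HDFS block size

        long fullBlocks = fileSize / blockSize;                     // 3 full blocks of 128 MB
        long lastBlockMb = (fileSize % blockSize) / (1024 * 1024);  // 16 MB remainder
        long totalBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0);

        // Prints: 4 blocks: 3 x 128 MB + 16 MB
        System.out.println(totalBlocks + " blocks: " + fullBlocks
                + " x 128 MB + " + lastBlockMb + " MB");
    }
}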

Replication in HDFS: Replication ensures the availability of the data. Replication means

making a copy of something, and the number of times you make a copy of a particular
thing is expressed as its replication factor. As we have seen under File Blocks,
HDFS stores the data in the form of blocks, and Hadoop is also
configured to make copies of those file blocks.
By default, the replication factor in Hadoop is set to 3, which can be configured, meaning
you can change it manually as per your requirement. In the example above we made 4
file blocks, which means that 3 copies (replicas) of each file block are kept, so a total of
4 × 3 = 12 blocks are stored across the cluster for backup purposes.
This is because, for running Hadoop, we are using commodity hardware (inexpensive system
hardware) which can crash at any time; we are not using supercomputers for our
Hadoop setup. That is why we need a feature in HDFS that can make copies of the
file blocks for backup purposes; this is known as fault tolerance.
Note that after making so many replicas of our file blocks we use much more storage,
but for big organizations the data is far more important than the storage, so nobody
worries about this extra storage. You can configure
the replication factor in your hdfs-site.xml file; a small sketch follows below.
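
As a small illustration, the replication factor can also be set from client code through the Hadoop FileSystem API, both for newly created files (via the same dfs.replication property used in hdfs-site.xml) and for an existing file; the path below is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same setting as the dfs.replication property in hdfs-site.xml,
        // applied to files created through this client configuration.
        conf.setInt("dfs.replication", 3);
        FileSystem fs = FileSystem.get(conf);

        // Change the replication factor of an existing file (hypothetical path).
        Path file = new Path("/data/sales/2024/part-00000");
        fs.setReplication(file, (short) 2);
    }
}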
Rack Awareness: A rack is nothing but a physical collection of nodes in our
Hadoop cluster (maybe 30 to 40). A large Hadoop cluster consists of many racks.
With the help of this rack information, the NameNode chooses the closest DataNode to
achieve maximum performance while reading or writing information, which reduces
network traffic.
HDFS Architecture
3. YARN(Yet Another Resource Negotiator)

YARN is the framework on which MapReduce works. YARN performs two operations:
job scheduling and resource management. The purpose of the job scheduler is to divide a big
task into small jobs so that each job can be assigned to various slaves in the Hadoop cluster
and processing can be maximized. The job scheduler also keeps track of which job is
important, which job has higher priority, dependencies between jobs, and other
information such as job timing. The Resource Manager manages all the
resources made available for running the Hadoop cluster.

Features of YARN

 Multi-Tenancy
 Scalability
 Cluster-Utilization
 Compatibility

4. Hadoop common or Common Utilities

Hadoop Common, or the common utilities, is nothing but the Java libraries and Java files
needed by all the other components present in a Hadoop
cluster. These utilities are used by HDFS, YARN, and MapReduce for running the cluster.
Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so failures
need to be handled automatically in software by the Hadoop framework.
HDFS
HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop
applications. This open source framework works by rapidly transferring data between nodes.
It's often used by companies who need to handle and store big data. HDFS is a key
component of many Hadoop systems, as it provides a means for managing big data, as well as
supporting big data analytics.

HDFS stands for Hadoop Distributed File System. HDFS operates as a distributed file system
designed to run on commodity hardware.

HDFS is fault-tolerant and designed to be deployed on low-cost, commodity hardware. HDFS


provides high throughput data access to application data and is suitable for applications that
have large data sets and enables streaming access to file system data in Apache Hadoop.

A core difference between Hadoop and HDFS is that Hadoop is the open source framework
that can store, process and analyze data, while HDFS is the file system of Hadoop that
provides access to data. This essentially means that HDFS is a module of Hadoop.

The HDFS architecture centres on NameNodes and DataNodes. The NameNode is the hardware
that contains the GNU/Linux operating system and the NameNode software. It
acts as the master server and can manage the files, control a client's access to files,
and oversee file operations such as renaming, opening, and closing files.

A DataNode is hardware having the GNU/Linux operating system and the DataNode software.
For every node in an HDFS cluster, you will find a DataNode. These nodes help to control
the data storage of the system: they can perform operations on the file system when the client
requests, and they also create, replicate, and delete blocks when the NameNode instructs.

The HDFS meaning and purpose is to achieve the following goals:


 Manage large datasets - Organizing and storing datasets can be a hard task to handle. HDFS
is used to manage the applications that have to deal with huge datasets. To do this, HDFS
should have hundreds of nodes per cluster.

 Detecting faults - HDFS should have technology in place to scan and detect faults quickly
and effectively as it includes a large number of commodity hardware. Failure of components
is a common issue.

 Hardware efficiency - When large datasets are involved it can reduce the network traffic and
increase the processing speed.

The history of HDFS

What are Hadoop's origins? The design of HDFS was based on the Google File System. It
was originally built as infrastructure for the Apache Nutch web search engine project but has
since become a member of the Hadoop Ecosystem.

In the earlier years of the internet, web crawlers started to pop up as a way for people to
search for information on web pages. This created various search engines such as the likes of
Yahoo and Google.

It also created another search engine called Nutch, which aimed to distribute data and
calculations across multiple computers simultaneously. Nutch then moved to Yahoo!, where it
was divided into two: the web crawler remained Nutch, while the distributed storage and
processing portion became Hadoop. (Apache Spark is a separate, later project: where Hadoop
is designed to handle batch processing, Spark is made to handle real-time data efficiently.)

Nowadays, Hadoop's structure and framework are managed by the Apache software
foundation which is a global community of software developers and contributors.

HDFS was born from this and is designed to replace hardware storage solutions with a better,
more efficient method - a virtual filing system. When it first came onto the scene,
MapReduce was the only distributed processing engine that could use HDFS. More recently,
alternative Hadoop data services components like HBase and Solr also utilize HDFS to store
data.

What is HDFS in the world of big data?

So, what is big data and how does HDFS come into it? The term "big data" refers to all the
data that's difficult to store, process and analyze. HDFS big data is data organized into the
HDFS filing system.

As we now know, Hadoop is a framework that works by using parallel processing and
distributed storage. This can be used to sort and store big data, as it can't be stored in
traditional ways.
In fact, it's the most commonly used software to handle big data, and is used by companies
such as Netflix, Expedia, and British Airways who have a positive relationship with
Hadoop for data storage. HDFS in big data is vital, as this is how many businesses now
choose to store their data.

There are five core elements of big data organized by HDFS services:

 Velocity - How fast data is generated, collated and analyzed.

 Volume - The amount of data generated.

 Variety - The type of data, this can be structured, unstructured, etc.

 Veracity - The quality and accuracy of the data.

 Value - How you can use this data to bring an insight into your business processes.

Advantages of Hadoop Distributed File System

As an open source subproject within Hadoop, HDFS offers several core benefits when dealing
with big data:

1. Fault tolerance. HDFS has been designed to detect faults and automatically recover quickly
ensuring continuity and reliability.

2. Speed. Because of its cluster architecture, it can maintain 2 GB of data per second.

3. Access to more types of data, specifically Streaming data. Because of its design to handle
large amounts of data for batch processing it allows for high data throughput rates making it
ideal to support streaming data.

4. Compatibility and portability. HDFS is designed to be portable across a variety of


hardware setups and compatible with several underlying operating systems ultimately
providing users optionality to use HDFS with their own tailored setup. These advantages are
especially significant when dealing with big data and were made possible with the particular
way HDFS handles data.

(Figure: comparison of a local file system and HDFS.)
5. Scalable. You can scale resources according to the size of your file system. HDFS includes
vertical and horizontal scalability mechanisms.

6. Data locality. When it comes to the Hadoop file system, the data resides in data nodes, as
opposed to having the data move to where the computational unit is. By shortening the
distance between the data and the computing process, it decreases network congestion and
makes the system more effective and efficient.

7. Cost effective. Initially, when we think of data we may think of expensive hardware and
hogged bandwidth. When hardware failure strikes, it can be very costly to fix. With HDFS,
the data is stored inexpensively as it's virtual, which can drastically reduce file system
metadata and file system namespace data storage costs. What's more, because HDFS is open
source, businesses don't need to worry about paying a licensing fee.

8. Stores large amounts of data. Data storage is what HDFS is all about - meaning data of all
varieties and sizes - but particularly large amounts of data from corporations that are
struggling to store it. This includes both structured and unstructured data.

9. Flexible. Unlike some other more traditional storage databases, there's no need to process the
data collected before storing it. You're able to store as much data as you want, with the
opportunity to decide exactly what you'd like to do with it and how to use it later. This also
includes unstructured data like text, videos and images.

How to use HDFS

So, how do you use HDFS? Well, HDFS works with a main NameNode and multiple other
datanodes, all on a commodity hardware cluster. These nodes are organized in the same place
within the data center. Next, it's broken down into blocks which are distributed among the
multiple DataNodes for storage. To reduce the chances of data loss, blocks are often
replicated across nodes. It's a backup system should data be lost.

Let's look at NameNodes. The NameNode is the node within the cluster that knows what the
data contains, what block it belongs to, the block size, and where it should go. NameNodes
are also used to control access to files including when someone can write, read, create,
remove, and replicate data across the various data nodes.

The cluster can also be adapted where necessary in real-time, depending on the server
capacity - which can be useful when there is a surge in data. Nodes can be added or taken
away when necessary.

Now, onto DataNodes. DataNodes are in constant communication with the NameNodes to
identify whether they need to commence and complete a task. This stream of consistent
collaboration means that the NameNode is acutely aware of each DataNode's status.

When a DataNode is found not to be operating the way it should, the NameNode is able
to automatically re-assign that task to another functioning node holding the same data block.
Similarly, DataNodes are also able to communicate with each other, which means they can
collaborate during standard file operations. Because the NameNode is aware of DataNodes
and their performance, they're crucial in maintaining the system.

Data blocks are replicated across multiple DataNodes and tracked by the NameNode.

To use HDFS you need to install and set up a Hadoop cluster. This can be a single node set
up which is more appropriate for first-time users, or a cluster set up for large, distributed
clusters. You then need to familiarize yourself with HDFS commands, such as the below, to
operate and manage your system.

Command Description
-rm Removes a file or directory
-ls Lists files with permissions and other details
-mkdir Creates a directory at the given path in HDFS
-cat Shows the contents of a file
-rmdir Deletes a directory
-put Uploads a file or folder from the local disk to HDFS
-rmr Deletes the file identified by the path, or a folder and its subfolders
-get Copies a file or folder from HDFS to the local file system
-count Counts the number of files, number of directories, and file size
-df Shows free space
-getmerge Merges the files in an HDFS directory into a single local file
-chmod Changes file permissions
-copyToLocal Copies files to the local system
-stat Prints statistics about the file or directory
-head Displays the first kilobyte of a file
-usage Returns the help for an individual command
-chown Changes the owner and group of a file
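
The same operations are also available programmatically through the Hadoop FileSystem Java API. The sketch below mirrors -mkdir, -put, and -ls; the paths are placeholders, and fs.defaultFS is assumed to point at the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path dir = new Path("/user/demo/input");            // hypothetical HDFS directory
        fs.mkdirs(dir);                                     // like: hdfs dfs -mkdir

        fs.copyFromLocalFile(new Path("/tmp/data.txt"),     // like: hdfs dfs -put
                new Path(dir, "data.txt"));

        for (FileStatus status : fs.listStatus(dir)) {      // like: hdfs dfs -ls
            System.out.println(status.getPermission() + "  "
                    + status.getLen() + "  " + status.getPath());
        }
    }
}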
How does HDFS work?

As previously mentioned, HDFS uses NameNodes and DataNodes. HDFS allows the quick
transfer of data between compute nodes. When HDFS takes in data, it's able to break down
the information into blocks, distributing them to different nodes in a cluster.

Data is broken down into blocks and distributed among the DataNodes for storage, these
blocks can also be replicated across nodes which allows for efficient parallel processing. You
can access, move around, and view data through various commands. HDFS DFS options such
as "-get" and "-put" allow you to retrieve and move data around as necessary.

What's more, the HDFS is designed to be highly alert and can detect faults quickly. The file
system uses data replication to ensure every piece of data is saved multiple times and then
assigns it across individual nodes, ensuring at least one copy is on a different rack than the
other copies.

This means when a DataNode is no longer sending signals to the NameNode, it removes the
DataNode from the cluster and operates without it. If this data node then comes back, it can
be allocated to a new cluster. Plus, since the datablocks are replicated across several
DataNodes, removing one will not lead to any file corruptions of any kind.

HDFS components

It's important to know that there are three main components of Hadoop. Hadoop HDFS,
Hadoop MapReduce, and Hadoop YARN. Let's take a look at what these components bring
to Hadoop:

 Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.

 Hadoop MapReduce - Hadoop MapReduce is the processing unit of Hadoop. This software
framework is used to write applications to process vast amounts of data.

 Hadoop YARN - Hadoop YARN is a resource management component of Hadoop. It


processes and runs data for batch, stream, interactive, and graph processing - all of which are
stored in HDFS.
Anatomy of File Read in HDFS

Let’s get an idea of how data flows between the client interacting with HDFS, the name
node, and the data nodes with the help of a diagram. Consider the figure:

Step 1: The client opens the file it wishes to read by calling open() on the File System
Object(which for HDFS is an instance of Distributed File System).
Step 2: Distributed File System( DFS) calls the name node, using remote procedure calls
(RPCs), to determine the locations of the first few blocks in the file. For each block, the
name node returns the addresses of the data nodes that have a copy of that block. The DFS
returns an FSDataInputStream to the client for it to read data from. FSDataInputStream in
turn wraps a DFSInputStream, which manages the data node and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the
data node addresses for the first few blocks in the file, then connects to the first
(closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls read()
repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream closes the connection to
the data node and then finds the best data node for the next block. This happens transparently
to the client, which from its point of view is simply reading a continuous stream. Blocks are
read in order, with the DFSInputStream opening new connections to data nodes as the
client reads through the stream. It will also call the name node to retrieve the data node
locations for the next batch of blocks as needed.
Step 6: When the client has finished reading the file, a function is called, close() on the
FSDataInputStream.
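
From the application's point of view, all of this machinery sits behind a handful of calls. A minimal read sketch is shown below; the HDFS path is hypothetical, and fs.defaultFS is assumed to point at the cluster so that FileSystem.get() returns a DistributedFileSystem.

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        InputStream in = null;
        try {
            // Steps 1-2: open() contacts the name node and returns an FSDataInputStream.
            in = fs.open(new Path("/user/demo/input/data.txt"));
            // Steps 3-5: read() streams the blocks from the closest data nodes.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            // Step 6: close the stream once the whole file has been read.
            IOUtils.closeStream(in);
        }
    }
}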

Anatomy of File Write in HDFS

Next, we’ll check out how files are written to HDFS. Consider figure 1.2 to get a better
understanding of the concept.
Note: HDFS follows the Write once Read many times model. In HDFS we cannot edit the
files which are already stored in HDFS, but we can append data by reopening the files.

Step 1: The client creates the file by calling create() on DistributedFileSystem(DFS).


Step 2: DFS makes an RPC call to the name node to create a new file in the file system's
namespace, with no blocks associated with it. The name node performs various checks to
make sure the file doesn't already exist and that the client has the right permissions to
create the file. If these checks pass, the name node makes a record of the new file;
otherwise, the file can't be created and the client is thrown an error, i.e. an
IOException. The DFS returns an FSDataOutputStream for the client to start writing
data to.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it
writes to an internal queue called the data queue. The data queue is consumed by the
DataStreamer, which is responsible for asking the name node to allocate new blocks by picking
a list of suitable data nodes to store the replicas. The list of data nodes forms a
pipeline, and here we'll assume the replication level is three, so there are three nodes in the
pipeline. The DataStreamer streams the packets to the first data node in the
pipeline, which stores each packet and forwards it to the second data node in the
pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to the third (and
last) data node in the pipeline.
Step 5: The DFSOutputStream sustains an internal queue of packets that are waiting to be
acknowledged by data nodes, called an “ack queue”.
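
Seen from the client, the write path is just as small. A hedged sketch follows; the local and HDFS paths are hypothetical.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        InputStream in = new BufferedInputStream(new FileInputStream("/tmp/local.txt"));
        // Steps 1-2: create() asks the name node to add the file to the namespace
        // and returns an FSDataOutputStream for the client to write to.
        FSDataOutputStream out = fs.create(new Path("/user/demo/output/copy.txt"));
        try {
            // Steps 3-5: packets flow through the data-node pipeline and are acknowledged.
            IOUtils.copyBytes(in, out, 4096, false);
        } finally {
            IOUtils.closeStream(out);
            IOUtils.closeStream(in);
        }
    }
}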
