Bda Unit-2
Hadoop
Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture
At its core, Hadoop has two major layers: a processing/computation layer (MapReduce) and a storage layer, the Hadoop Distributed File System (HDFS). HDFS is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences from other distributed file systems are significant. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications with large datasets.
Apart from the above-mentioned two core components, the Hadoop framework also includes the following two modules −
Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.
Hadoop YARN − This is a framework for job scheduling and cluster resource management.
It is quite expensive to build bigger servers with heavy configurations to handle large-scale processing. As an alternative, you can tie together many commodity computers, each with a single CPU, as a single functional distributed system; practically, the clustered machines can read the dataset in parallel and provide much higher throughput. Moreover, it is cheaper than one high-end server. So this is the first motivational factor behind using Hadoop: it runs across clustered, low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core
tasks that Hadoop performs −
Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 128 MB or 64 MB (preferably 128 MB).
These files are then distributed across various cluster nodes for further processing.
HDFS, being on top of the local file system, supervises the processing.
Blocks are replicated for handling hardware failure.
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.
Sending the sorted data to a certain computer.
Writing the debugging logs for each job.
Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying parallelism of the CPU cores.
Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA),
rather Hadoop library itself has been designed to detect and handle failures at the application
layer.
Servers can be added or removed from the cluster dynamically and Hadoop continues to
operate without interruption.
Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java-based.
Apache Hadoop is an open source framework that is used to efficiently store and process large datasets. Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly. Hadoop consists of four main modules:
Hadoop Distributed File System (HDFS) — A distributed file system that runs on standard or low-end hardware. HDFS provides better data throughput than traditional file systems, in addition to high fault tolerance and native support of large datasets.
Yet Another Resource Negotiator (YARN) — Manages and monitors cluster nodes and resource usage, and schedules jobs and tasks.
MapReduce — A framework that helps programs perform parallel computation on data. The map task takes input data and converts it into a dataset that can be computed in key-value pairs. The output of the map task is consumed by reduce tasks to aggregate output and provide the desired result.
Hadoop Common — Provides common Java libraries that can be used across all modules.
Hadoop makes it easier to use all the storage and processing capacity in cluster servers, and to execute distributed processes against huge amounts of data. Hadoop provides the building blocks on which other services and applications can be built.
Applications that collect data in various formats can place data into the Hadoop cluster by
using an API operation to connect to the NameNode. The NameNode tracks the file directory
structure and placement of “chunks” for each file, replicated across DataNodes.
To run a job to query the data, provide a MapReduce job made up of many map and reduce
tasks that run against the data in HDFS spread across the DataNodes. Map tasks run on each
node against the input files supplied, and reducers run to aggregate and organize the final
output.
Why is Hadoop used for Big Data Analytics?
Hadoop is the best solution for storing and processing Big Data because:
Hadoop stores huge files as they are (raw) without specifying any schema.
High scalability- We can add any number of nodes, hence enhancing performance
dramatically.
Fault tolerance - If a machine or a few pieces of hardware crash, the data can still be accessed from another path, because blocks are replicated.
Cost-effectiveness - Hadoop runs on clusters of commodity hardware, which is not expensive.
Features of Hadoop
a. Open Source
One of the features of Hadoop is that it is an open-source project, which means its
source code is available to all for modification, inspection, and analysis.
Because the code is open source and adaptable, firms and enterprises may alter and customise Hadoop to meet their own requirements.
Another one of the features of Hadoop is that it achieves fault tolerance by creating
replicas, which spread data blocks from a file among multiple servers in the HDFS
cluster.
By default, HDFS duplicates each block three times on separate machines in the
cluster, providing data access even if a node fails or goes down.
e. Cost-Effective
Hadoop offers businesses working with large data a cost-effective storage alternative
to conventional relational database management systems, which may be expensive to
scale.
Hadoop's scale-out design allows cost-effective storage of all data, reducing raw data
loss and enabling organizations to store and use their entire dataset at a fraction of the
cost.
Hadoop is extremely adaptable and can handle a wide range of data types, including
structured, semi-structured, and unstructured data.
Hadoop can handle and analyse data that is organised and well-defined, somewhat
organised, or even fully unstructured.
g. Easy to Use
Hadoop reduces the need for clients to perform distributed computing jobs by
handling all of its intricacies, making it user-friendly.
Users of Hadoop may concentrate on their data and analytics activities rather than the
complexities of distributed computing, making it easier to use and run.
The data locality feature of Hadoop allows computation to be conducted near the data,
eliminating the need to transport data and lowering network congestion.
Hadoop enhances system throughput and overall performance by minimising data
transport and putting computing closer to the data nodes.
One of the other features of Hadoop is that it is well-known for its fast processing
speed, which enables effective handling of massive amounts of data.
Hadoop may greatly expedite data processing processes by exploiting parallel
processing and distributed computing capabilities, resulting in quicker results and
enhanced overall performance.
Hadoop integrates seamlessly with a wide range of tools and technologies, allowing
data ecosystem interoperability.
Users may easily link Hadoop with a wide range of data processing frameworks,
databases, analytics tools, and visualization platforms, increasing the flexibility and
value of their data infrastructure.
n. Secure
Hadoop supports security mechanisms such as Kerberos authentication, access control, and data encryption to help keep the data in a cluster secure.
o. Community Support
One of the main features of Hadoop is that it benefits from a vibrant and active
community of developers, users, and contributors.
The community support for Hadoop is extensive, offering resources, documentation,
forums, and continuous development, ensuring users have access to assistance,
updates, and a wealth of knowledge for leveraging Hadoop effectively.
1. Scalability
Hadoop is highly scalable — it can handle petabytes of data by simply adding more
nodes (computers) to the cluster.
Grows linearly without a major overhaul.
2. Cost-Effective
Runs on inexpensive commodity hardware rather than costly high-end servers.
3. Fault Tolerance
Data is replicated across multiple nodes, so if one node fails, the system continues
to operate using copies from other nodes.
Ensures high data reliability and availability.
4. Flexibility
Can store and process structured, semi-structured, and unstructured data (e.g., logs, images, videos, emails).
Not limited to a specific schema like traditional RDBMS.
5. Open Source
Being open source means it's free to use and has a huge community.
6. Ecosystem Integration
Easily integrates with other tools like Hive, Pig, HBase, Spark, and Oozie.
7. Data Locality
Unlike traditional systems, Hadoop stores and processes data on the same machines, reducing data transfer time and improving efficiency.
8. Cloud Compatibility
Works well with cloud services (e.g., Amazon EMR, Google Cloud Dataproc), making deployment and scaling even easier.
9. Real-World Proven
Used by tech giants like Facebook, Yahoo, Netflix, and LinkedIn to manage and
analyze their huge data volumes.
Versions of Hadoop
Hadoop has gone through several key versions over time, each introducing important
improvements and new components.
Hadoop 1.x provided HDFS plus the original MapReduce engine, with a single JobTracker handling both resource management and job scheduling, which limited scalability and made MapReduce the only processing model.
Hadoop 2.x introduced YARN, separating resource management from data processing so that multiple engines (not just MapReduce) could share a cluster, and added NameNode high availability and HDFS federation.
Hadoop 3.x added erasure coding for more storage-efficient fault tolerance, support for more than two NameNodes, and improved resource management.
Each version of Hadoop made the platform more scalable, flexible, and enterprise-ready —
moving from a batch-only framework to a modern ecosystem that supports real-time,
machine learning, and interactive analytics.
HDFS:
HDFS is the primary storage component of the Hadoop ecosystem. It consists of two kinds of nodes:
1. Name Node
2. Data Node
A Name Node is a primary node that contains metadata (data about data),
requiring comparatively fewer resources than data nodes that store the actual data.
These data nodes are commodity hardware in a distributed environment.
YARN:
Yet Another Resource Negotiator. As the name suggests, YARN helps manage the resources across the cluster. It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager
The Resource Manager allocates resources for the applications in the system. In contrast, Node Managers work to allocate resources such as CPU, memory, and bandwidth per machine and later acknowledge the Resource Manager. The Application Manager works as an interface between the Resource Manager and the Node Manager and performs negotiations according to the requirements of the two.
MapReduce:
Using distributed and parallel algorithms, MapReduce carries over the processing logic and helps you write applications that transform large datasets into manageable ones.
MapReduce uses two functions, i.e., Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of the data and thus organizes it in the form of groups. Map() generates its result as key-value pairs, which are later processed by Reduce().
2. Reduce() does the summarization by aggregating the mapped data. Simply put, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
Pig:
Pig is a platform for data flow structuring, processing, and analyzing large data sets.
Pig does the job of executing the commands, and all the MapReduce activities are taken care of in the background. After processing, Pig stores the result in HDFS.
The Pig Latin language is specifically designed for this framework and runs on the Pig runtime, much as Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and thus is a major segment of the Hadoop ecosystem.
Hive:
With the help of SQL methodology and the HIVE interface, it reads and writes large data sets; its query language is known as HQL (Hive Query Language).
It is highly scalable as it allows both real-time and batch processing. Also, all SQL data types are supported by Hive, making query processing easier.
Like other query-processing frameworks, HIVE comes with two components: JDBC drivers and the HIVE command line.
JDBC works together with ODBC drivers to manage data storage and connection permissions, while the HIVE command line helps with query processing.
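As a small illustration of the JDBC component mentioned above, the sketch below connects to a HiveServer2 instance from Java and runs an HQL query. The hostname, port, credentials, and the sales table are hypothetical placeholders, and the Hive JDBC driver JAR must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // Load the HiveServer2 JDBC driver
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Hostname, port, database, user, and table are placeholders for illustration
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "hiveuser", "");

    Statement stmt = con.createStatement();
    // HQL looks like SQL; Hive translates it into jobs that run on the cluster
    ResultSet rs = stmt.executeQuery(
        "SELECT category, COUNT(*) FROM sales GROUP BY category");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }

    rs.close();
    stmt.close();
    con.close();
  }
}
```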
Mahout:
Mahout brings scalable machine learning to Hadoop, providing libraries for tasks such as collaborative filtering, clustering, and classification.
Apache Spark:
It is a platform that handles compute-intensive tasks such as batch processing, iterative or interactive real-time processing, graph processing, visualizations, etc.
It consumes in-memory resources and is therefore faster than the previous engine (MapReduce) in terms of optimization.
Spark is best suited for real-time data, while Hadoop MapReduce is best suited for structured data or batch processing; hence both are used interchangeably in most companies.
Apache HBase:
It is a NoSQL database that supports all kinds of data and is capable of processing anything within a Hadoop database.
When we need to search or retrieve a small amount of data from a huge database, the request must be processed in a short, fast time frame. At such times, HBase comes in handy, as it gives us a tolerant way of storing limited data.
Other Components:
Apart from all these, some other components perform the huge task of making Hadoop capable of processing large datasets:
Solr, Lucene: These are two services that perform the task of searching and indexing using some Java libraries. Lucene is based on Java and also enables a spell-check mechanism.
Oozie: Oozie works as a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of tasks, i.e., Oozie workflow and Oozie coordinator jobs.
Hadoop Distributions
A Hadoop distribution refers to a packaged and often enhanced version of the Apache
Hadoop ecosystem, typically provided by companies or organizations that tailor it for
performance, manageability, and enterprise use. These distributions generally include:
Core Hadoop components: HDFS (Hadoop Distributed File System), YARN (Yet
Another Resource Negotiator), and MapReduce.
Ecosystem tools: Hive, Pig, HBase, Spark, Flume, Sqoop, Oozie, Zookeeper, etc.
Management & Monitoring tools: For cluster setup, job management, security, and
system health checks.
Security & Governance: Integrations for Kerberos authentication, data encryption,
access control (e.g., Ranger, Sentry), and auditing.
Need of Hadoop
The need for Hadoop in Big Data Analytics arises from the challenges associated with
processing massive amounts of structured, semi-structured, and unstructured data. Here’s a
breakdown of why Hadoop plays a crucial role in big data environments:
Big data refers to datasets that are too large or complex for traditional data-processing
software. Hadoop is designed to store and process terabytes to petabytes of data across
clusters of machines.
Hadoop uses the Hadoop Distributed File System (HDFS) to store data across multiple nodes. This allows the data to be replicated for fault tolerance and read in parallel across the cluster.
4. Cost-Effective
Hadoop runs on commodity hardware, making it a low-cost solution for handling big data
compared to traditional high-end servers or data warehouses.
6. Scalability
7. Ecosystem Support
8. Real-World Applications
RDBMS vs Hadoop

| Feature | RDBMS (Relational Database Management System) | Hadoop |
| --- | --- | --- |
| Data Type | Handles structured data (rows, columns) | Handles structured, semi-structured, and unstructured data |
| Scalability | Vertical scaling (adding more power to a single machine) | Horizontal scaling (adding more machines to a cluster) |
| Storage Model | Centralized storage | Distributed storage (HDFS) |
| Data Volume | Suitable for small to medium datasets | Designed for very large datasets (terabytes to petabytes) |
| Processing Model | Traditional SQL-based processing | MapReduce (batch), Spark (real-time) |
| Cost | Expensive due to proprietary hardware/software | Cost-effective; uses commodity hardware and open-source software |
| Fault Tolerance | Limited; requires backups | Built-in fault tolerance with data replication |
| Performance with Big Data | Degrades significantly with huge data volumes | Optimized for big data workloads |
| Schema Dependency | Schema-on-write (must define schema before inserting data) | Schema-on-read (flexible; schema applied when reading data) |
| Real-Time Processing | Supports real-time for transactional data | Batch-oriented (Hadoop MapReduce); real-time via Spark |
| Use Cases | OLTP systems, small-scale analytics | Large-scale analytics, log processing, IoT, social media analytics |
Challenges in Distributed Data Processing
2. Fault Tolerance
Challenge: Nodes may fail due to hardware issues, network problems, or crashes.
Why it matters: Systems must detect failures quickly and recover without data loss
or duplication.
3. Data Locality
4. Scalability
Challenge: Ensuring the system scales efficiently with more nodes and larger
datasets.
Why it matters: Poorly designed systems face bottlenecks like slow job scheduling,
memory limits, or network congestion.
6. Network Overhead
Challenge: High data movement across nodes increases latency and consumes
bandwidth.
Why it matters: Can significantly slow down computation and analytics pipelines.
9. Load Balancing
History of Hadoop
🔹 1. Origin (2003–2004): The Google Influence
Doug Cutting and Mike Cafarella were working on an open-source web search
engine called Nutch.
Nutch needed a scalable storage and processing system. Around this time, Google published papers describing the Google File System (GFS, 2003) and MapReduce (2004).
They started building an open-source implementation of GFS and MapReduce.
The storage part became HDFS (Hadoop Distributed File System).
The processing part became MapReduce.
6. Evolving Ecosystem
Apache Spark emerged for faster, in-memory processing (not a Hadoop component,
but often used with it).
Hadoop shifted from core MapReduce to more modern engines like YARN and
Spark.
Cloud-native solutions (e.g., AWS EMR, GCP Dataproc) started replacing on-prem
Hadoop clusters.
Hadoop Overview
As described above, Hadoop is an Apache open-source framework, written in Java, that provides distributed storage and computation across clusters of computers, scaling from a single server to thousands of machines. The framework is made up of four main modules:
MapReduce
HDFS(Hadoop Distributed File System)
YARN(Yet Another Resource Negotiator)
Common Utilities or Hadoop Common
1. MapReduce
MapReduce is essentially an algorithmic pattern (a programming model) that runs on top of the YARN framework. The major feature of MapReduce is that it performs distributed processing in parallel across a Hadoop cluster, which is what makes Hadoop work so fast. When you are dealing with Big Data, serial processing is no longer of any use. MapReduce has mainly 2 tasks, which are divided phase-wise: in the first phase Map is utilized, and in the next phase Reduce is utilized.
Here, the input is provided to the Map() function, then its output is used as an input to the Reduce() function, and after that we receive our final output. Let's understand what Map() and Reduce() do.
As we are dealing with Big Data, the input provided to Map() is a set of data. The Map() function breaks these data blocks into tuples, which are nothing but key-value pairs. These key-value pairs are then sent as input to Reduce(). The Reduce() function combines these tuples (key-value pairs) based on their key value, forms a smaller set of tuples, and performs operations like sorting, summation-type jobs, etc., which are then sent to the final output node. Finally, the output is obtained.
The data processing is always done in the Reducer, depending upon the business requirement of that industry. This is how first Map() and then Reduce() are utilized one by one.
Let’s understand the Map Task and Reduce Task in detail.
Map Task: The Map task reads the input data and produces the intermediate key-value pairs, as described above.
Reduce Task:
Shuffle and Sort: Once some of the mapping tasks are done, shuffling begins; this is why shuffling is a faster process, as it does not wait for the completion of all the tasks performed by the Mapper.
Reduce: The main function or task of Reduce is to gather the tuples generated from Map and then perform sorting and aggregation on those key-value pairs depending on their key element.
OutputFormat: Once all the operations are performed, the key-value pairs are written
into the file with the help of record writer, each record in a new line, and the key and
value in a space-separated manner.
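To make the Map and Reduce phases concrete, here is the classic word-count job written against the standard Hadoop MapReduce Java API; it is a generic illustrative example rather than anything specific to these notes. The Mapper emits (word, 1) pairs, the Combiner/Reducer sums the counts for each key, and the default output format writes one key-value pair per line. Input and output paths are supplied as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map(): breaks each input line into (word, 1) key-value pairs
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce(): aggregates the values for each key (word) into a total count
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Such a job would typically be packaged as a JAR and submitted with `hadoop jar wordcount.jar WordCount /input /output`, where the paths are illustrative.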
2. HDFS
NameNode(Master)
DataNode(Slave)
NameNode: The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves). The NameNode is mainly used for storing the metadata, i.e., the data about the data. Metadata can be the transaction logs that keep track of the user's activity in the Hadoop cluster.
Metadata can also be the name of the file, its size, and the information about the location (block number, block IDs) of the DataNodes, which the NameNode stores to find the closest DataNode for faster communication. The NameNode instructs the DataNodes with operations like delete, create, replicate, etc.
DataNode: DataNodes work as slaves. DataNodes are mainly utilized for storing the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The more DataNodes there are, the more data the Hadoop cluster will be able to store. So it is advised that the DataNode should have a high storage capacity in order to store a large number of file blocks.
High Level Architecture Of Hadoop
File Block In HDFS: Data in HDFS is always stored in terms of blocks. So a single file is divided into multiple blocks of size 128 MB, which is the default and which you can also change manually.
Let's understand this concept of breaking down a file into blocks with an example. Suppose you have uploaded a file of 400 MB to your HDFS; this file gets divided into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB. That means 4 blocks are created, each of 128 MB except the last one. Hadoop doesn't know or care about what data is stored in these blocks, so it simply treats the final, smaller block as a partial block. In the Linux file system, the size of a file block is about 4 KB, which is very much less than the default size of file blocks in the Hadoop file system. As we all know, Hadoop is mainly configured for storing large data, in the petabyte range, and this is what makes the Hadoop file system different from other file systems: it can be scaled. Nowadays, file blocks of 128 MB to 256 MB are commonly used in Hadoop.
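As a quick sanity check on the 400 MB example above, the arithmetic can be written out directly. This is plain Java arithmetic, not Hadoop API code; the default block size can be changed through the dfs.blocksize property in hdfs-site.xml.

```java
public class BlockSplitExample {
  public static void main(String[] args) {
    long blockSize = 128L * 1024 * 1024;   // default HDFS block size: 128 MB
    long fileSize  = 400L * 1024 * 1024;   // the 400 MB file from the example

    long fullBlocks = fileSize / blockSize;   // 3 full blocks of 128 MB
    long lastBlock  = fileSize % blockSize;   // 16 MB left over in the final, partial block

    System.out.println("Full 128 MB blocks: " + fullBlocks);
    System.out.println("Last block size (MB): " + lastBlock / (1024 * 1024));
  }
}
```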
3. YARN (Yet Another Resource Negotiator)
YARN is the framework on which MapReduce works. YARN performs two operations: job scheduling and resource management. The purpose of the Job Scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in the Hadoop cluster and processing can be maximized. The Job Scheduler also keeps track of which job is important, which job has higher priority, dependencies between the jobs, and other information such as job timing. The Resource Manager manages all the resources that are made available for running the Hadoop cluster.
Features of YARN
Multi-Tenancy
Scalability
Cluster-Utilization
Compatibility
4. Common Utilities or Hadoop Common
Hadoop Common, or the common utilities, is nothing but the Java library and Java files (or Java scripts) that are needed by all the other components present in a Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce for running the cluster. Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so failures need to be handled automatically in software by the Hadoop framework.
HDFS
HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop
applications. This open source framework works by rapidly transferring data between nodes.
It's often used by companies who need to handle and store big data. HDFS is a key
component of many Hadoop systems, as it provides a means for managing big data, as well as
supporting big data analytics.
HDFS stands for Hadoop Distributed File System. HDFS operates as a distributed file system
designed to run on commodity hardware.
A core difference between Hadoop and HDFS is that Hadoop is the open source framework
that can store, process and analyze data, while HDFS is the file system of Hadoop that
provides access to data. This essentially means that HDFS is a module of Hadoop.
The HDFS architecture focuses on NameNodes and DataNodes. The NameNode is the hardware that contains the GNU/Linux operating system and the NameNode software. It acts as the master server and can manage the files, control a client's access to files, and oversee file operations such as renaming, opening, and closing files.
A DataNode is hardware having the GNU/Linux operating system and DataNode software.
For every node in an HDFS cluster, you will locate a DataNode. These nodes manage the data storage of their system: they perform read-write operations on the file system as the client requests, and they also perform block creation, deletion, and replication when the NameNode instructs them to.
The goals of HDFS include:
Detecting faults - HDFS should have technology in place to scan and detect faults quickly and effectively, as it includes a large amount of commodity hardware and failure of components is a common issue.
Hardware efficiency - When large datasets are involved it can reduce the network traffic and
increase the processing speed.
What are Hadoop's origins? The design of HDFS was based on the Google File System. It
was originally built as infrastructure for the Apache Nutch web search engine project but has
since become a member of the Hadoop Ecosystem.
In the earlier years of the internet, web crawlers started to pop up as a way for people to
search for information on web pages. This created various search engines such as the likes of
Yahoo and Google.
It also created another search engine called Nutch, which wanted to distribute data and calculations across multiple computers simultaneously. Nutch then moved to Yahoo, where it was divided into two: the web-crawler portion remained Nutch, while the distributed storage and processing portion became Hadoop, and the two are now separate entities. Apache Spark is likewise a separate project: where Hadoop is designed to handle batch processing, Spark is made to handle real-time data efficiently.
Nowadays, Hadoop's structure and framework are managed by the Apache software
foundation which is a global community of software developers and contributors.
HDFS was born from this and is designed to replace hardware storage solutions with a better,
more efficient method - a virtual filing system. When it first came onto the scene,
MapReduce was the only distributed processing engine that could use HDFS. More recently,
alternative Hadoop data services components like HBase and Solr also utilize HDFS to store
data.
So, what is big data and how does HDFS come into it? The term "big data" refers to all the
data that's difficult to store, process and analyze. HDFS big data is data organized into the
HDFS filing system.
As we now know, Hadoop is a framework that works by using parallel processing and
distributed storage. This can be used to sort and store big data, as it can't be stored in
traditional ways.
In fact, it's the most commonly used software to handle big data, and is used by companies
such as Netflix, Expedia, and British Airways who have a positive relationship with
Hadoop for data storage. HDFS in big data is vital, as this is how many businesses now
choose to store their data.
There are five core elements of big data organized by HDFS services:
Velocity - How fast data is generated, collated, and analyzed.
Volume - The amount of data generated.
Variety - The type of data, which can be structured, unstructured, etc.
Veracity - The quality and accuracy of the data.
Value - How you can use this data to bring insight into your business processes.
As an open source subproject within Hadoop, HDFS offers five core benefits when dealing
with big data:
1. Fault tolerance. HDFS has been designed to detect faults and automatically recover quickly
ensuring continuity and reliability.
2. Speed, because of its cluster architecture, it can maintain 2 GB of data per second.
3. Access to more types of data, specifically Streaming data. Because of its design to handle
large amounts of data for batch processing it allows for high data throughput rates making it
ideal to support streaming data.
This graph demonstrates the difference between a local file system and HDFS.
1. Scalable. You can scale resources according to the size of your file system. HDFS includes
vertical and horizontal scalability mechanisms.
2. Data locality. When it comes to the Hadoop file system, the data resides in data nodes, as
opposed to having the data move to where the computational unit is. By shortening the
distance between the data and the computing process, it decreases network congestion and
makes the system more effective and efficient.
3. Cost effective. Initially, when we think of data we may think of expensive hardware and
hogged bandwidth. When hardware failure strikes, it can be very costly to fix. With HDFS,
the data is stored inexpensively as it's virtual, which can drastically reduce file system
metadata and file system namespace data storage costs. What's more, because HDFS is open
source, businesses don't need to worry about paying a licensing fee.
4. Stores large amounts of data. Data storage is what HDFS is all about - meaning data of all
varieties and sizes - but particularly large amounts of data from corporations that are
struggling to store it. This includes both structured and unstructured data.
5. Flexible. Unlike some other more traditional storage databases, there's no need to process the
data collected before storing it. You're able to store as much data as you want, with the
opportunity to decide exactly what you'd like to do with it and how to use it later. This also
includes unstructured data like text, videos and images.
So, how do you use HDFS? Well, HDFS works with a main NameNode and multiple other
datanodes, all on a commodity hardware cluster. These nodes are organized in the same place
within the data center. Next, it's broken down into blocks which are distributed among the
multiple DataNodes for storage. To reduce the chances of data loss, blocks are often
replicated across nodes. It's a backup system should data be lost.
Let's look at NameNodes. The NameNode is the node within the cluster that knows what the
data contains, what block it belongs to, the block size, and where it should go. NameNodes
are also used to control access to files, including when someone can write, read, create, remove, and replicate data across the various data nodes.
The cluster can also be adapted where necessary in real-time, depending on the server
capacity - which can be useful when there is a surge in data. Nodes can be added or taken
away when necessary.
Now, onto DataNodes. DataNodes are in constant communication with the NameNode to identify whether they need to commence and complete a task. This stream of consistent collaboration means that the NameNode is acutely aware of each DataNode's status.
When a DataNode is singled out as not operating the way it should, the NameNode is able to automatically reassign that task to another functioning node holding the same data block.
Similarly, DataNodes are also able to communicate with each other, which means they can
collaborate during standard file operations. Because the NameNode is aware of DataNodes
and their performance, they're crucial in maintaining the system.
Data blocks are replicated across multiple DataNodes and are managed by the NameNode.
To use HDFS you need to install and set up a Hadoop cluster. This can be a single node set
up which is more appropriate for first-time users, or a cluster set up for large, distributed
clusters. You then need to familiarize yourself with HDFS commands, such as the below, to
operate and manage your system.
| Command | Description |
| --- | --- |
| -rm | Removes a file or directory |
| -ls | Lists files with permissions and other details |
| -mkdir | Creates a directory named path in HDFS |
| -cat | Shows contents of the file |
| -rmdir | Deletes a directory |
| -put | Uploads a file or folder from a local disk to HDFS |
| -rmr | Deletes the file identified by path, or the folder and its subfolders |
| -get | Moves a file or folder from HDFS to the local file system |
| -count | Counts the number of files, number of directories, and file size |
| -df | Shows free space |
| -getmerge | Merges multiple files in HDFS |
| -chmod | Changes file permissions |
| -copyToLocal | Copies files to the local system |
| -stat | Prints statistics about the file or directory |
| -head | Displays the first kilobyte of a file |
| -usage | Returns the help for an individual command |
| -chown | Allocates a new owner and group for a file |
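The shell commands above also have programmatic equivalents in Hadoop's Java FileSystem API. The sketch below mirrors -mkdir, -put, -ls, and -cat; the NameNode address and the file paths are placeholders chosen for illustration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS normally comes from core-site.xml; set explicitly here for illustration
    conf.set("fs.defaultFS", "hdfs://localhost:9000");
    FileSystem fs = FileSystem.get(conf);

    // -mkdir: create a directory in HDFS
    fs.mkdirs(new Path("/user/demo"));

    // -put: copy a local file into HDFS
    fs.copyFromLocalFile(new Path("/tmp/sample.txt"), new Path("/user/demo/sample.txt"));

    // -ls: list the directory contents
    for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }

    // -cat: print the file contents
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/user/demo/sample.txt"))))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }

    fs.close();
  }
}
```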
How does HDFS work?
As previously mentioned, HDFS uses NameNodes and DataNodes. HDFS allows the quick
transfer of data between compute nodes. When HDFS takes in data, it's able to break down
the information into blocks, distributing them to different nodes in a cluster.
Data is broken down into blocks and distributed among the DataNodes for storage, these
blocks can also be replicated across nodes which allows for efficient parallel processing. You
can access, move around, and view data through various commands. HDFS DFS options such
as "-get" and "-put" allow you to retrieve and move data around as necessary.
What's more, the HDFS is designed to be highly alert and can detect faults quickly. The file
system uses data replication to ensure every piece of data is saved multiple times and then
assigns it across individual nodes, ensuring at least one copy is on a different rack than the
other copies.
This means that when a DataNode stops sending signals (heartbeats) to the NameNode, the NameNode removes that DataNode from the cluster and operates without it. If the DataNode later comes back, it can be added back to the cluster. Plus, since the data blocks are replicated across several DataNodes, removing one will not lead to any file corruption of any kind.
HDFS components
It's important to know that there are three main components of Hadoop. Hadoop HDFS,
Hadoop MapReduce, and Hadoop YARN. Let's take a look at what these components bring
to Hadoop:
Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.
Hadoop MapReduce - Hadoop MapReduce is the processing unit of Hadoop. This software framework is used to write applications to process vast amounts of data.
Hadoop YARN - Hadoop YARN is the resource management unit of Hadoop; it schedules jobs and manages cluster resources.
Let’s get an idea of how data flows between the client interacting with HDFS, the name
node, and the data nodes with the help of a diagram. Consider the figure:
Step 1: The client opens the file it wishes to read by calling open() on the File System
Object(which for HDFS is an instance of Distributed File System).
Step 2: Distributed File System( DFS) calls the name node, using remote procedure calls
(RPCs), to determine the locations of the first few blocks in the file. For each block, the
name node returns the addresses of the data nodes that have a copy of that block. The DFS
returns an FSDataInputStream to the client for it to read data from. FSDataInputStream in
turn wraps a DFSInputStream, which manages the data node and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the first few blocks in the file, then connects to the first (closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls read()
repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream will close the connection to the data node and then find the best data node for the next block. This happens transparently to the client, which from its point of view is simply reading a continuous stream. Blocks are read in order, with the DFSInputStream opening new connections to data nodes as the client reads through the stream. It will also call the name node to retrieve the data node locations for the next batch of blocks as needed.
Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
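The read path in Steps 1 through 6 corresponds to only a few lines of client code against the FileSystem API. Below is a minimal sketch, with a placeholder path: open() triggers the NameNode lookup and returns the FSDataInputStream, the read() loop consumes data streamed from the DataNodes block by block, and close() ends the session.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // a DistributedFileSystem when fs.defaultFS is hdfs://

    // Steps 1-2: open() asks the NameNode (via RPC) for block locations and returns a stream
    try (FSDataInputStream in = fs.open(new Path("/user/demo/sample.txt"))) {
      byte[] buffer = new byte[4096];
      int bytesRead;
      // Steps 3-5: read() streams data from the closest DataNode for each block in turn
      while ((bytesRead = in.read(buffer)) > 0) {
        System.out.write(buffer, 0, bytesRead);
      }
    } // Step 6: close() is called automatically by try-with-resources

    fs.close();
  }
}
```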
Next, we’ll check out how files are written to HDFS. Consider figure 1.2 to get a better
understanding of the concept.
Note: HDFS follows the Write once Read many times model. In HDFS we cannot edit the
files which are already stored in HDFS, but we can append data by reopening the files.
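The figure for the write path is not reproduced here, but the client-side pattern mirrors the note above: create() writes a new file once, existing content cannot be edited in place, and append() reopens the file to add data at the end. A minimal sketch with a placeholder path follows; on some cluster configurations, append support must be explicitly enabled.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteAppendExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/log.txt");   // placeholder path

    // Write once: create() fails if the file already exists (unless overwrite is requested)
    try (FSDataOutputStream out = fs.create(file, /* overwrite = */ false)) {
      out.writeBytes("first line\n");
    }

    // Existing data cannot be edited in place, but the file can be reopened for append
    try (FSDataOutputStream out = fs.append(file)) {
      out.writeBytes("appended line\n");
    }

    fs.close();
  }
}
```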