Hadoop Distributed File System (HDFS)
An Overview
What is HDFS?
• HDFS is a distributed file system designed for large-scale data storage.
• It is a core component of the Hadoop ecosystem.
• Primary Use: Manage and store vast amounts of data across multiple machines.
Key Features
• Fault Tolerance: Automatically replicates data to ensure reliability.
• Scalability: Capable of handling large datasets with ease.
• High Throughput: Optimized for large files and streaming data access.
Architecture of HDFS
• NameNode: The master node managing metadata and access.
• DataNodes: Worker nodes that store and manage the actual data.
• Block Storage: Data is split into blocks and distributed across nodes.
HDFS Architecture
• Hadoop Distributed File System follows the master-slave
architecture.
• Each cluster comprises a single master node and multiple
slave nodes.
• Internally, files are divided into one or more blocks, and each block is stored on different slave machines according to the replication factor.
• The master node stores and manages the file system namespace, that is, metadata about files and their blocks, such as block locations and permissions. The slave nodes store the data blocks of files.
• The master node is the NameNode, and the DataNodes are the slave nodes.
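The master/slave split above can be sketched as a toy model: a NameNode-style table maps each block to the DataNodes holding its replicas. The node names, block IDs, and round-robin placement below are purely illustrative, not the real HDFS placement policy (which is rack-aware):

```python
# Toy model of the NameNode's block -> DataNode mapping.
# Node names, block IDs, and round-robin placement are illustrative only;
# real HDFS uses a rack-aware placement policy.

datanodes = ["dn1", "dn2", "dn3", "dn4"]   # slave nodes
replication_factor = 3

def place_block(block_id, nodes, rf):
    """Pick rf distinct DataNodes for one block (simple round-robin)."""
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]

# NameNode-style namespace metadata: block -> replica locations
namespace = {f"blk_{i}": place_block(i, datanodes, replication_factor)
             for i in range(5)}
for blk, locations in namespace.items():
    print(blk, locations)
```

The sketch assumes the replication factor never exceeds the number of DataNodes; the real NameNode also tracks per-node capacity and rack topology when choosing targets.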
Functions of NameNode
• It executes file system namespace operations such as opening, renaming, and closing files and directories.
• NameNode manages and maintains the DataNodes.
• It determines the mapping of blocks of a file to DataNodes.
• NameNode records each change made to the file system namespace.
• It keeps track of the locations of each block of a file.
• NameNode takes care of the replication factor of all the blocks.
• NameNode receives heartbeats and block reports from all DataNodes, which confirm that each DataNode is alive.
• If a DataNode fails, the NameNode chooses new DataNodes for new replicas.
Functions of DataNode
• DataNode is responsible for serving client read/write requests.
• Based on instructions from the NameNode, DataNodes perform block creation, replication, and deletion.
• DataNodes send heartbeats to the NameNode to report the health of HDFS.
• DataNodes also send block reports to the NameNode listing the blocks they contain.
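The heartbeat check the NameNode performs can be sketched in a few lines. The timeout and node names below are made up for illustration; real HDFS uses its own (much longer) dead-node timeout:

```python
import time

HEARTBEAT_TIMEOUT_S = 10  # illustrative; real HDFS waits much longer before marking a node dead

def dead_datanodes(heartbeats, now=None, timeout=HEARTBEAT_TIMEOUT_S):
    """NameNode-side check: which DataNodes have missed their heartbeat?"""
    now = time.time() if now is None else now
    return [dn for dn, last in heartbeats.items() if now - last > timeout]

# Hypothetical state: dn2's last heartbeat was 60 seconds ago
last_heartbeat = {"dn1": time.time(), "dn2": time.time() - 60}
print(dead_datanodes(last_heartbeat))  # dn2 has missed its heartbeat
```

Once a node lands in this "dead" list, the NameNode schedules new replicas of its blocks on healthy DataNodes, as described above.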
Blocks in HDFS Architecture
• HDFS splits a file into block-sized chunks called blocks. The block size is 128 MB by default and can be configured as required.
• For example, for a file of size 612 MB, HDFS will create four blocks of 128 MB and one block of 100 MB.
• A file smaller than the block size does not occupy the full block's space on disk.
• For example, a file of size 2 MB will occupy only 2 MB of disk space.
• The user does not have any control over the location of the blocks.
Replication Management
• In a distributed system, the data must be replicated in multiple places so that if one machine fails, the data remains accessible from other machines.
• In Hadoop, HDFS stores replicas of a block on multiple
DataNodes based on the replication factor.
• The replication factor is the number of copies to be created for
blocks of a file in HDFS architecture.
• If the replication factor is 3, then three copies of a block get
stored on different DataNodes.
• If one DataNode containing the data block fails, then the block
is accessible from the other DataNode containing a replica of
the block.
• If we store a file of 128 MB and the replication factor is 3, then (3 × 128 = 384) 384 MB of disk space is occupied, as three copies of each block are stored.
• This replication mechanism makes HDFS fault-tolerant.
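The storage arithmetic above is simple enough to check in code (the function name is illustrative):

```python
def replicated_size_mb(file_size_mb, replication_factor=3):
    """Cluster-wide disk space consumed once every block of a file is replicated."""
    return file_size_mb * replication_factor

# The example from the text: a 128 MB file with replication factor 3
print(replicated_size_mb(128))  # -> 384 MB of disk space occupied
```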
Data Replication Mechanism
• HDFS replicates each file block across multiple DataNodes.
• Default replication factor: 3 (can be configured).
• Ensures data availability in case of node failures.
Advantages
• Efficient handling of large datasets.
• Built-in redundancy for high availability.
• Flexible scalability, allowing more nodes to be added as needed.
Challenges
• Small File Problem: Not optimized for managing many small files.
• Single Point of Failure: The NameNode is a critical component; its failure can impact the entire system.
Conclusion
• HDFS is a robust solution for managing distributed
storage.
• It plays a vital role in modern big data infrastructures.
• While highly scalable and efficient, certain challenges,
like small file handling, must be managed.