
Hadoop Distributed File System (HDFS)
An Overview
What is HDFS?

• HDFS is a distributed file system designed for large-scale data storage.
• It is a core component of the Hadoop ecosystem.
• Primary Use: manage and store vast amounts of data across multiple machines.
Key Features

• Fault Tolerance: automatically replicates data to ensure reliability.
• Scalability: capable of handling large datasets with ease.
• High Throughput: optimized for large files and streaming data access.
Architecture of HDFS

• NameNode: the master node managing metadata and access.
• DataNodes: worker nodes that store and manage the actual data.
• Block Storage: data is split into blocks and distributed across nodes.
HDFS Architecture
• Hadoop Distributed File System follows a master-slave architecture.
• Each cluster comprises a single master node and multiple slave nodes.
• Internally, files are divided into one or more blocks, and each block is stored on different slave machines depending on the replication factor.
• The master node stores and manages the file system namespace, that is, metadata about files and their blocks, such as block locations and permissions. The slave nodes store the actual data blocks of files.
• The master node is the NameNode, and the slave nodes are DataNodes.
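Both the block size and the replication factor mentioned here are cluster-level settings. As a sketch, in a standard Hadoop deployment they are set in `hdfs-site.xml` using the standard property names `dfs.blocksize` and `dfs.replication` (the values shown below are the HDFS defaults):

```xml
<!-- hdfs-site.xml: standard HDFS properties; values shown are the defaults -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- 128 MB, expressed in bytes -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- number of replicas kept for each block -->
  </property>
</configuration>
```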
Functions of NameNode

• It executes file system namespace operations such as opening, renaming, and closing files and directories.
• It manages and maintains the DataNodes.
• It determines the mapping of a file's blocks to DataNodes.
• It records each change made to the file system namespace.
• It keeps track of the location of each block of a file.
• It takes care of the replication factor of all blocks.
• It receives heartbeats and block reports from all DataNodes, which confirm that the DataNodes are alive.
• If a DataNode fails, the NameNode chooses new DataNodes for new replicas.
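The heartbeat and re-replication logic above can be sketched as a toy model. This is a deliberately simplified, hypothetical simulation, not Hadoop's actual implementation: real HDFS uses 3-second heartbeats, a roughly 10-minute dead-node timeout, and far more machinery, and all names and timeout values below are illustrative only.

```python
# Toy model of NameNode heartbeat tracking and under-replication detection.
# All names and values are hypothetical; real HDFS works differently in detail.

HEARTBEAT_TIMEOUT = 10  # toy value, in seconds (real HDFS uses ~10 minutes)

class NameNode:
    def __init__(self):
        self.last_heartbeat = {}   # DataNode id -> time of last heartbeat
        self.block_locations = {}  # block id -> set of DataNode ids holding it

    def receive_heartbeat(self, datanode_id, now):
        self.last_heartbeat[datanode_id] = now

    def receive_block_report(self, datanode_id, block_ids):
        # A block report tells the NameNode which blocks a DataNode holds.
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode_id)

    def dead_nodes(self, now):
        # A node that has been silent longer than the timeout is presumed dead.
        return {dn for dn, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT}

    def blocks_needing_replicas(self, now, replication_factor=3):
        # For each block, count live replicas and report any shortfall.
        dead = self.dead_nodes(now)
        missing = {}
        for block_id, nodes in self.block_locations.items():
            live = nodes - dead
            if len(live) < replication_factor:
                missing[block_id] = replication_factor - len(live)
        return missing

nn = NameNode()
for dn in ("dn1", "dn2", "dn3"):
    nn.receive_heartbeat(dn, now=0)
    nn.receive_block_report(dn, ["blk_1"])
# dn3 goes silent; dn1 and dn2 keep sending heartbeats.
nn.receive_heartbeat("dn1", now=20)
nn.receive_heartbeat("dn2", now=20)
print(nn.blocks_needing_replicas(now=20))  # {'blk_1': 1}
```

Once a block shows up as under-replicated, the real NameNode would pick new DataNodes and instruct them to copy the block, restoring the replication factor.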
Functions of DataNode

• The DataNode is responsible for serving client read/write requests.
• Based on instructions from the NameNode, DataNodes perform block creation, replication, and deletion.
• DataNodes send heartbeats to the NameNode to report the health of HDFS.
• DataNodes also send block reports to the NameNode, listing the blocks they contain.
Blocks in HDFS Architecture
• HDFS splits a file into block-sized chunks called blocks. The block size is 128 MB by default; it can be configured as required.
• For example, for a file of size 612 MB, HDFS creates four blocks of 128 MB and one block of 100 MB.
• A file smaller than the block size does not occupy the full block's space on disk.
• For example, a file of size 2 MB occupies only 2 MB of disk space.
• The user does not have any control over the location of the blocks.
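The splitting arithmetic above can be checked with a short calculation. This is illustrative Python, not Hadoop code; the function name is made up for this sketch.

```python
# Illustrative arithmetic for HDFS block splitting: a file is divided into
# fixed-size blocks, and the last block holds only the remainder.

BLOCK_SIZE_MB = 128  # HDFS default block size

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    full_blocks = file_size_mb // block_size_mb
    remainder = file_size_mb % block_size_mb
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)  # last block uses only the leftover space
    return blocks

print(split_into_blocks(612))  # [128, 128, 128, 128, 100]
print(split_into_blocks(2))    # [2] -- a small file occupies only 2 MB
```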
Replication Management

• In a distributed system, data must be stored redundantly in multiple places so that if one machine fails, the data remains accessible from other machines.
• In Hadoop, HDFS stores replicas of a block on multiple DataNodes based on the replication factor.
• The replication factor is the number of copies created for the blocks of a file in the HDFS architecture.
• If the replication factor is 3, then three copies of each block are stored on different DataNodes.
• If one DataNode containing a data block fails, the block remains accessible from another DataNode holding a replica of the block.
• If we store a 128 MB file with a replication factor of 3, then 3 × 128 = 384 MB of disk space is occupied, as three copies of the block are stored.
• This replication mechanism makes HDFS fault-tolerant.
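The 3 × 128 = 384 MB figure above generalizes directly: the raw space consumed across the cluster is the file size times the replication factor. A minimal sketch (illustrative Python, with a made-up function name, not Hadoop code):

```python
# Raw cluster disk space consumed by a file under HDFS replication.

def replicated_disk_usage_mb(file_size_mb, replication_factor=3):
    """Total raw disk space occupied across all DataNodes, in MB."""
    return file_size_mb * replication_factor

print(replicated_disk_usage_mb(128))  # 384 -- matches the 3 * 128 example
print(replicated_disk_usage_mb(612))  # 1836
```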
Data Replication Mechanism

• HDFS replicates each file block across multiple DataNodes.
• Default replication factor: 3 (can be configured).
• Ensures data availability in case of node failures.
Advantages

• Efficient handling of large datasets.
• Built-in redundancy for high availability.
• Flexible scalability, allowing more nodes to be added as needed.
Challenges

• Small File Problem: not optimized for managing many small files.
• Single Point of Failure: the NameNode is a critical component; its failure can impact the entire system.
Conclusion

• HDFS is a robust solution for managing distributed storage.
• It plays a vital role in modern big data infrastructures.
• While highly scalable and efficient, certain challenges, like small file handling, must be managed.
