HDFS

Hadoop Distributed File System (HDFS) is designed to efficiently store and manage large volumes of data, based on Google's File System. It has advantages such as being implementable on commodity hardware and automatic fault recovery, but is not suitable for small files or random access. HDFS consists of components like NameNode for metadata management and DataNodes for data storage, and employs features like rack awareness for improved data accessibility and task processing.

Uploaded by

shivani28ag

HDFS
Hadoop Distributed File System

File system of the Hadoop framework

Designed to store and manage huge volumes of data efficiently

HDFS was developed based on the paper Google published about its own file system,
the Google File System (GFS)

HDFS is a userspace file system


Pros of HDFS
HDFS can be implemented on commodity hardware

HDFS is designed for large files (GB/TB/PB..)

HDFS is suitable for streaming data access


◦ Data is written once but read multiple times. Ex. Log files

Upon discovery of faults, HDFS performs automatic recovery of the file system
Cons of HDFS
HDFS is not suitable for files that are small in size

HDFS is not suitable for low-latency reads from random positions within a file. It is best
suited for sequential reads that stream through a file from the beginning

HDFS does not support multiple concurrent writers to the same file
HDFS Daemons
◦ NameNode (master node)
◦ Secondary NameNode (master node)
◦ DataNode (slave node)
NameNode
Responsible for storing the HDFS metadata
◦ The metadata keeps track of all the files that are present within the HDFS. It stores information
related to the files.
◦ Metadata kept for each file: file name, file permissions, file ownership, file location
◦ Figure shown on the slide: 1 GB ~ 1 million files

[Figure: HDFS with a NameNode and DataNodes]
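The "1 GB ~ 1 million files" figure on this slide can be read as a rough sizing rule for NameNode memory, which also illustrates why many small files are a problem. The sketch below is only a back-of-the-envelope calculation: the function name and the assumption that memory scales linearly with the number of files are illustrative, not part of Hadoop.

```python
# Reading of the slide's figure: about 1 GB of NameNode memory per ~1 million files.
GB_PER_MILLION_FILES = 1.0  # assumption: memory grows linearly with file count


def estimate_namenode_memory_gb(num_files: int) -> float:
    """Rough NameNode memory needed to hold metadata for num_files files."""
    return num_files / 1_000_000 * GB_PER_MILLION_FILES


# The same 1 TB of data stored as 1 MB files versus 1 GB files.
small = estimate_namenode_memory_gb(1_000_000)  # 1,000,000 files of 1 MB each
large = estimate_namenode_memory_gb(1_000)      # 1,000 files of 1 GB each
print(f"1 MB files: {small:.3f} GB, 1 GB files: {large:.3f} GB")
```

Under this rule the same amount of data costs about a thousand times more NameNode memory when it is stored as small files.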
Secondary NameNode
Each and every transaction that occurs in the file system is recorded in an EditLog file.

If the NN fails, fsimage is retrieved from the disk.

Checkpoint (triggered every 1 hour or when the EditLog reaches 64 MB):
◦ The Secondary NN instructs the NN to record new transactions to a new EditLog file (EditLog.new)
◦ The Secondary NN copies the fsimage and EditLog files to its checkpoint directory
◦ The Secondary NN loads fsimage, applies all the transactions from the EditLog files and stores the result in a new, compacted fsimage file

[Figure: transactions are written by the NameNode to EditLog; fsimage and EditLog are copied to the Secondary NameNode's checkpoint directory, which produces the compacted fsimage]
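The "1 hour or 64 MB" condition on this slide can be written as a small predicate. This is a minimal sketch of the trigger only; the function and constant names are hypothetical and do not correspond to Hadoop configuration keys.

```python
import time
from typing import Optional

CHECKPOINT_PERIOD_SECONDS = 60 * 60       # 1 hour, from the slide
CHECKPOINT_SIZE_BYTES = 64 * 1024 * 1024  # 64 MB, from the slide


def should_checkpoint(last_checkpoint: float, editlog_bytes: int,
                      now: Optional[float] = None) -> bool:
    """Return True when either threshold from the slide has been reached."""
    now = time.time() if now is None else now
    too_old = now - last_checkpoint >= CHECKPOINT_PERIOD_SECONDS
    too_big = editlog_bytes >= CHECKPOINT_SIZE_BYTES
    return too_old or too_big


# A small EditLog shortly after the last checkpoint does not trigger one.
print(should_checkpoint(last_checkpoint=0.0, editlog_bytes=10, now=10.0))
```

Whichever threshold is reached first starts the checkpoint, so a busy cluster checkpoints on size and a quiet one on time.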
File System Metadata
Apart from keeping the file system metadata (FSM) in RAM, the NN also stores this information in a set of files:
◦ fsimage stores the complete snapshot of the FSM
◦ EditLog contains all the incremental modifications made to this metadata
fsimage
fsimage stores information about all the blocks that belong to a file, along with filesystem
properties

fsimage is a file stored on the NameNode's local OS filesystem. It contains the complete
directory structure of HDFS and the list of blocks that make up each file. Which DataNode
holds which block is not persisted in fsimage; the NameNode rebuilds that mapping from the
block reports sent by the DataNodes.
EditLog files
EditLog keeps track of all the transactions that have taken place on the filesystem

◦ Ex: When a new file is created, an entry is made in the EditLog file
◦ Ex: A change to a file's replication factor is recorded in the same way
How Hadoop manages File System Metadata
When the NN is started, it reads the fsimage and EditLog files from the disk into RAM and applies all the transactions from the EditLog to the metadata.

A new version of fsimage is then written from memory onto the disk.

All the old entries are truncated from the EditLog, leaving a new, empty EditLog file. This process is known as a Checkpoint.

[Figure: NameNode with fsimage and EditLog; a checkpoint is taken when the EditLog reaches 64 MB]
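The startup sequence on this slide (load fsimage, replay the EditLog, write a new fsimage, start with an empty EditLog) can be illustrated with plain dictionaries and lists. This is a toy model under stated assumptions: the `(operation, path)` record format, the two operations, and the replication value of 3 are all hypothetical stand-ins, not the real on-disk formats.

```python
from typing import Dict, List, Tuple

Transaction = Tuple[str, str]  # (operation, path); hypothetical record format


def replay(fsimage: Dict[str, dict], editlog: List[Transaction]) -> Dict[str, dict]:
    """Apply EditLog transactions, in order, to a copy of the fsimage snapshot."""
    namespace = {path: dict(meta) for path, meta in fsimage.items()}
    for op, path in editlog:
        if op == "create":
            namespace[path] = {"replication": 3}  # assumed default for the example
        elif op == "delete":
            namespace.pop(path, None)
    return namespace


def checkpoint(fsimage: Dict[str, dict], editlog: List[Transaction]):
    """Return the new fsimage built from the replayed state and an empty EditLog."""
    return replay(fsimage, editlog), []


fsimage = {"/user/hadoop/log.txt": {"replication": 3}}
editlog = [("create", "/user/hadoop/a.txt"), ("delete", "/user/hadoop/log.txt")]
new_fsimage, new_editlog = checkpoint(fsimage, editlog)
print(sorted(new_fsimage), new_editlog)
```

Because the replay works on a copy, the old snapshot stays intact until the new one is complete, which mirrors the idea of writing a new version of fsimage rather than editing the old one in place.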
Reading Data from HDFS
For a client to read data from Hadoop Cluster, it needs to have:
◦ Hadoop Client Library
◦ Cluster Configuration Data

Example: file name log.txt, file location /user/hadoop

The client begins the process by contacting the NN and specifying the name and location of the file it would like to read.

The NN validates the user and checks permissions. If everything is OK, the NN responds to the client with the first block ID of the requested file along with a list of all the DNs that hold a copy of that block, sorted by distance from the client.

Since the Hadoop client now has all the information it needs about the file, it can contact a DN directly and read the data from it.

This process of reading data from DNs repeats until all the data blocks of the requested file have been retrieved or the client cancels the process by closing the stream.
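The read path described here can be sketched as a small in-memory simulation: the NameNode hands out block IDs with DataNodes sorted by distance, and the client fetches each block from the closest one. Everything below (`ToyNameNode`, `read_file`, the block IDs and distances) is hypothetical and only mimics the flow; it is not the Hadoop client API. As a simplification, the toy NameNode returns the locations of all blocks in one call.

```python
from typing import Dict, List, Tuple


class ToyNameNode:
    """Holds only metadata: which blocks make up a file and where replicas live."""

    def __init__(self, files: Dict[str, List[str]],
                 replicas: Dict[str, List[Tuple[str, int]]]):
        self.files = files        # path -> ordered block IDs
        self.replicas = replicas  # block ID -> [(DataNode name, distance to client)]

    def get_block_locations(self, path: str, user_ok: bool = True):
        if not user_ok:
            raise PermissionError(path)
        # For each block, list the DataNodes sorted by distance, closest first.
        return [(block, [dn for dn, _ in sorted(self.replicas[block], key=lambda r: r[1])])
                for block in self.files[path]]


def read_file(nn: ToyNameNode, datanodes: Dict[str, Dict[str, bytes]], path: str) -> bytes:
    data = b""
    for block, dns in nn.get_block_locations(path):
        data += datanodes[dns[0]][block]  # read each block from the closest DataNode
    return data


nn = ToyNameNode(
    files={"/user/hadoop/log.txt": ["blk_1", "blk_2"]},
    replicas={"blk_1": [("dn2", 4), ("dn1", 2)], "blk_2": [("dn3", 2), ("dn1", 4)]},
)
datanodes = {
    "dn1": {"blk_1": b"hello ", "blk_2": b"world"},
    "dn2": {"blk_1": b"hello "},
    "dn3": {"blk_2": b"world"},
}
print(read_file(nn, datanodes, "/user/hadoop/log.txt"))
```

Note that the file data never passes through the toy NameNode; it only answers the metadata question, and the bytes come straight from the DataNodes.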
Writing Data to HDFS
The client makes a request to the NN with the name and location of the file to be created.

If the user has appropriate permissions, an entry is made in the metadata list of the NN.

Initially no blocks are associated with this file. The NN now opens a stream to which the client can write data.

As the client writes data to the stream, the data is split into packets of 4 KB and stored in a separate queue in memory.

At the client end, a separate thread is responsible for writing data from the queue onto HDFS. This thread contacts the NN, requesting the list of all DNs on which it can store a copy of this file. The client then makes a direct connection to the first DN.

Once a packet is successfully written to disk, an acknowledgement (ACK) is sent to the client.

The client indicates that it has finished writing by closing the stream. Any remaining packets in the queue are flushed out and the metadata is updated on the NN.

Errors can occur during the write process for various reasons, e.g. the disk or DN on which data is being written fails. In that case the pipeline is immediately closed and any data sent after the last acknowledgement is pushed back to the queue. A new healthy DN is identified, a new block ID is assigned and the data is transferred.

[Figure: Hadoop client (stream, 4 KB packets, data queue, thread) and NameNode; Hadoop cluster with Rack 1 and Rack 2 holding DataNodes 1 to 6]
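The packet queue and acknowledgement handling in the write path can be sketched as follows. This is a simplified, single-threaded illustration: `split_into_packets`, `write_stream` and the `send` callback are hypothetical names, the 4 KB packet size is taken from the slide, and retrying the same callback stands in for switching to a new healthy DataNode.

```python
from collections import deque
from typing import Callable, Deque, List

PACKET_SIZE = 4 * 1024  # 4 KB packets, as stated on the slide


def split_into_packets(data: bytes, size: int = PACKET_SIZE) -> List[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]


def write_stream(data: bytes, send: Callable[[bytes], bool],
                 max_attempts: int = 1000) -> int:
    """Drain the packet queue; `send` returns True when the write is acknowledged.

    A packet that is not acknowledged is pushed back to the front of the queue
    and retried. Returns the number of send attempts.
    """
    queue: Deque[bytes] = deque(split_into_packets(data))
    attempts = 0
    while queue:
        if attempts >= max_attempts:
            raise IOError("too many failed sends")
        packet = queue.popleft()
        attempts += 1
        if not send(packet):
            queue.appendleft(packet)  # no ACK: requeue and retry
    return attempts


stored: List[bytes] = []
failed_once: List[bool] = []


def flaky_send(packet: bytes) -> bool:
    """Hypothetical DataNode that drops the second packet exactly once."""
    if len(stored) == 1 and not failed_once:
        failed_once.append(True)
        return False
    stored.append(packet)
    return True


payload = b"x" * 10_000
attempts = write_stream(payload, flaky_send)
print(len(stored), attempts)
```

Requeueing at the front keeps packets in order, so the data written after a failure is byte-for-byte the same as the data the client produced.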
Rack Awareness
The process of making Hadoop aware of which machine belongs to which rack and how
these racks are connected to each other within the Hadoop cluster
[Figure: Hadoop cluster with Rack 1 (DataNodes 1 to 4), Rack 2 (DataNodes 5 to 8) and Rack 3 (DataNodes 9 to 12), spread across Datacenter 1 and Datacenter 2]
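Once each machine's rack is known, "distance" between two machines can be computed from their position in the datacenter/rack/node tree. The sketch below counts the hops from each node up to their closest common ancestor, which is the usual way tree distance is described for Hadoop clusters. The `/dc/rack/node` paths and the assignment of racks to datacenters are assumptions made for the example.

```python
def distance(a: str, b: str) -> int:
    """Hops from each node up to their closest common ancestor in the tree."""
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


# Hypothetical topology paths: /datacenter/rack/node
client = "/dc1/rack1/dn1"
replicas = ["/dc2/rack3/dn9", "/dc1/rack2/dn5", "/dc1/rack1/dn2"]

# Sort replicas so the closest one is tried first.
print(sorted(replicas, key=lambda r: distance(client, r)))
```

With these paths the same node is at distance 0, a node on the same rack at 2, a node on another rack in the same datacenter at 4, and a node in another datacenter at 6.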
Advantages of Rack Awareness
Data is stored on DataNodes in different racks, so it stays accessible even if one rack fails

Easier task processing for MapReduce, since tasks can run on or near the nodes that hold their data

Time Saving

Efficiency
NameNode Federation
Multiple NameNodes are used, each managing a portion of the file system namespace.

Together, these NameNodes form a NameNode Federation.

Example: one NameNode manages /user, another manages /data and a third manages /share.
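With a federation, something has to decide which NameNode is responsible for a given path. One way to picture this is a mount table with longest-prefix matching, similar in spirit to a client-side mount table. The table below reuses the three paths from the slide; the NameNode names and the `resolve_namenode` function are hypothetical.

```python
from typing import Dict

# Hypothetical mount table based on the three namespaces on the slide.
MOUNT_TABLE: Dict[str, str] = {
    "/user": "namenode-1",
    "/data": "namenode-2",
    "/share": "namenode-3",
}


def resolve_namenode(path: str, mounts: Dict[str, str] = MOUNT_TABLE) -> str:
    """Return the NameNode whose mount point is the longest prefix of path."""
    best = None
    for mount in mounts:
        # Match whole path components only, so /userdata does not match /user.
        if path == mount or path.startswith(mount.rstrip("/") + "/"):
            if best is None or len(mount) > len(best):
                best = mount
    if best is None:
        raise KeyError(f"no NameNode manages {path}")
    return mounts[best]


print(resolve_namenode("/user/hadoop/log.txt"))
```

Each NameNode only needs to hold the metadata for its own portion of the namespace, which is what lets the federation scale beyond the memory of a single NameNode.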


Thank you
