Unit 3 Full
Use Cases
• Big Data Analytics Pipelines: used in processing large datasets (e.g., log analysis, fraud detection).
• Machine Learning: storing and preprocessing training data.
• Cloud Storage Solutions: data lakes and enterprise data warehouses.
• Social Media & Web Analytics: storing and analyzing user interactions.
• Healthcare & Genomics: processing large medical datasets.
HDFS Design
• HDFS (Hadoop Distributed File System) is designed to handle large-scale data storage and processing efficiently across a distributed cluster. The design of HDFS is influenced by the need for scalability, fault tolerance, high throughput, and cost-effectiveness while being optimized for batch processing workloads.
Design Goals of HDFS
• Scalability
• Fault Tolerance
• High Throughput
• Streaming Data Access
• Simple Coherency Model
• Economical Storage
HDFS Architecture
Name nodes, secondary name nodes, data nodes, checkpoint nodes, backup nodes,
and blocks all make up the architecture of HDFS.
The primary difference between Hadoop and Apache HBase is that Apache HBase is a non-relational (NoSQL) database, while Apache Hadoop is a framework for distributed storage and batch processing of large datasets.
HDFS follows a master-slave architecture, which includes the following elements: name nodes, secondary name nodes, data nodes, checkpoint nodes, backup nodes, and blocks.
Name Node
The NameNode (Master Node) manages all DataNode blocks, performing tasks such as:
• Monitoring and controlling DataNodes.
• Granting user access to files.
• Storing block metadata.
• Committing EditLogs to disk after every write for recovery.
DataNodes operate independently, and failures don’t impact the cluster as lost blocks are re-
replicated. Only the NameNode tracks all DataNodes, managing communication centrally.
The NameNode maintains two key files:
FsImage: A snapshot of the entire filesystem, storing directories and file metadata hierarchically.
EditLogs: Records real-time changes, tracking file modifications for recovery.
Secondary NameNode
The Secondary NameNode is not a standby for the NameNode; its job is checkpointing. It periodically downloads the FsImage and EditLogs from the NameNode, merges them into an updated FsImage, and sends the result back, so that the EditLogs do not grow without bound and NameNode restarts stay fast.
DataNode
The DataNodes (slave nodes) store the actual data blocks and perform the following duties:
1. They handle the requested operations on files, such as reading block contents and writing new data.
2. They follow the instructions of the NameNode, such as deleting block replicas and creating new ones for replication.
Checkpoint Node
A checkpoint node periodically creates checkpoints of the namespace: it downloads the current FsImage and EditLogs from the active NameNode, merges them locally, and uploads the new FsImage back to the NameNode.
Backup Node
A backup node maintains an up-to-date, in-memory copy of the file system namespace that is continuously synchronized with the active NameNode, so it can produce checkpoints without first downloading the FsImage and EditLogs. Checkpoint and backup nodes hold namespace metadata only; they do not store file data. The blocks themselves live on the data nodes, and it is block replication across data nodes that protects the data when a DataNode fails.
Blocks
•The block size in HDFS is configurable (128 MB by default) and can be tuned based on performance needs.
•Data is written to DataNodes, and each block is replicated for fault tolerance.
•Automatic recovery occurs if a node fails; lost blocks are restored from their replicas.
•HDFS provides its own file system layer rather than writing directly to raw hard drives.
•Scales horizontally as data and users grow.
•File storage mechanism: if a file exceeds the block size, it is split across multiple blocks.
Example: A 135 MB file with a 128 MB block size creates two blocks—128 MB + 7 MB (see the sketch below).
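The split above is just block-size arithmetic. A minimal sketch in plain Java (not part of HDFS itself) that reproduces the 135 MB / 128 MB example:

public class BlockSplitExample {
    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long blockSize = 128 * mb;               // configured block size
        long fileSize = 135 * mb;                // file to be stored
        long fullBlocks = fileSize / blockSize;  // 1 full block of 128 MB
        long remainder = fileSize % blockSize;   // 7 MB left for the final block
        System.out.println(fullBlocks + " full block(s) of 128 MB + "
                + (remainder / mb) + " MB final block");
    }
}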
Benefits of HDFS (Hadoop Distributed File System)
• Scalability – HDFS is highly scalable and can store and process petabytes of data by adding new nodes to the cluster.
• Fault Tolerance – It replicates data across multiple nodes, ensuring data availability even if a node fails.
• High Throughput – HDFS is optimized for large-scale data processing and supports parallel data access, improving performance.
• Cost-Effective – It runs on commodity hardware, reducing the cost compared to traditional storage solutions.
• Write Once, Read Many (WORM) Model – Data in HDFS is written once and read multiple times, making it ideal for big data analytics.
• Data Locality – HDFS moves computation to data rather than moving data to computation, reducing network bottlenecks.
• Integration with Hadoop Ecosystem – It seamlessly integrates with Hadoop tools like MapReduce, Hive, and Spark for efficient big data processing.
Challenges of HDFS
•Small File Problem – HDFS is optimized for large files; storing many small files can cause
performance issues and increase metadata overhead.
•Latency – Since HDFS is designed for batch processing, it has high latency and is not
suitable for real-time applications.
•Write-Once Limitation – Once written, files in HDFS cannot be modified, making updates
difficult.
•Complex Setup & Maintenance – Managing and configuring an HDFS cluster requires
expertise in Hadoop administration.
•Security Concerns – By default, HDFS lacks advanced security features like encryption and
access control, which must be manually configured.
•Hardware Dependency – Though it runs on commodity hardware, disk failures and
hardware issues still impact performance and require redundancy.
•Memory Overhead – NameNode stores metadata in RAM, which can become a bottleneck
as the cluster grows.
File Size in HDFS
In the Hadoop Distributed File System (HDFS), file size considerations depend on several factors, such as block size, replication factor, and compression. Here are the key points:
1. Default Block Size:
• The default block size in HDFS is 128 MB (256 MB in some configurations).
• Files smaller than the block size do not waste space, but large files are split into multiple blocks.
2. Maximum File Size:
Since HDFS files are divided into blocks, the maximum file size is practically limited by the number of blocks the NameNode can manage. Theoretically, if each block is 128 MB and a NameNode can handle hundreds of millions of blocks, a single file can be petabytes in size.
3. Replication Factor:
• Each block is replicated (default: 3 times) across different DataNodes.
• Effective storage used = file size × replication factor (for example, a 300 MB file with replication factor 3 consumes about 900 MB of raw cluster storage).
4. Small Files Problem:
• HDFS is optimized for large files; storing a very large number of small files inflates the NameNode's metadata (one in-memory entry per file, directory, and block) and hurts performance.
Block Size in HDFS
• Files in HDFS are broken into block-sized chunks called data blocks. These blocks are stored as independent units.
• The size of these HDFS data blocks is 128 MB by default. We can configure the block size as per our requirement by changing the dfs.block.size property in hdfs-site.xml (a per-file alternative is sketched after the conclusions below).
• Hadoop distributes these blocks across different slave machines, and the master machine stores the metadata about block locations.
• All the blocks of a file are the same size except the last one (if the file size is not a multiple of 128 MB). See the example below to understand this fact.
Suppose we have a file of size 612 MB, and we are using the default block
configuration (128 MB). Therefore five blocks are created, the first four blocks
are 128 MB in size, and the fifth block is 100 MB in size (128*4+100=612).
From the above example, we can conclude that:
1.A file in HDFS that is smaller than a single block does not occupy a full block's worth of space on the underlying storage.
2.Each file stored in HDFS does not need to be an exact multiple of the configured block size.
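Besides the cluster-wide dfs.block.size setting, a block size can also be requested for an individual file when it is created. A minimal sketch, assuming a configured client; the path, the 256 MB block size, and the replication factor of 3 are illustrative values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSizeExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());  // uses core-site.xml / hdfs-site.xml
        long blockSize = 256L * 1024 * 1024;                   // 256 MB blocks for this file only
        try (FSDataOutputStream out = fs.create(
                new Path("/user/demo/big-dataset.bin"),        // hypothetical path
                true,                                          // overwrite if it exists
                4096,                                          // I/O buffer size
                (short) 3,                                     // replication factor
                blockSize)) {
            out.writeBytes("example payload\n");
        }
    }
}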
• Now let's see the reasons behind the large size of the data blocks in HDFS.
Why are blocks in HDFS so large?
The default size of an HDFS data block is 128 MB. The reasons for the large block size are:
To minimize the cost of seeks: with large blocks, the time taken to transfer the data from disk is much longer than the time taken to seek to the start of the block. As a result, a file made up of multiple blocks is transferred at close to the disk transfer rate.
In order to provide reliable storage, HDFS stores large files in multiple locations across a large cluster, with each file stored as a sequence of blocks. All blocks of a file are the same size, except for the final block, which fills as data is added.
For added protection, HDFS files are write-once and have only one writer at any time. To help ensure that all data is being replicated as instructed, the NameNode receives a heartbeat (a periodic status report) and a blockreport (the block ID, generation stamp, and length of every block replica) from every DataNode attached to the cluster. Receiving a heartbeat indicates that the DataNode is working correctly.
• The NameNode selects the rack ID for each DataNode by using a process called Hadoop Rack Awareness to help prevent the loss of data if an entire rack fails. This also enables the use of bandwidth from multiple racks when reading data.
How Does HDFS Store a File?
Example: a 300 MB file with a 128 MB block size is split as:
• Block 1 → 128 MB
• Block 2 → 128 MB
• Block 3 → 44 MB
HDFS File Read Request Workflow
Step 2: To find the locations of the file's first blocks, the Distributed File System (DFS) makes a remote procedure call (RPC) to the name node. The addresses of the data nodes that have copies of each block are returned by the name node. The DFS provides the client with an FSDataInputStream so that it can read data from it. The FSDataInputStream in turn wraps a DFSInputStream, which manages the data node and name node I/O.
Step 5: When a block ends, DFSInputStream closes the connection to that data node and locates the best data node for the next block. The client observes this transparently and perceives it as reading one continuous stream. Blocks are read as the client reads through the stream, causing the DFSInputStream to establish new connections with data nodes. When necessary, it will also call the name node to obtain the locations of the data nodes for the next batch of blocks.
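From the client's point of view, this whole read workflow is hidden behind the FileSystem and FSDataInputStream abstractions. A minimal sketch, assuming a configured cluster and a hypothetical file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);               // DistributedFileSystem when fs.defaultFS is hdfs://
        Path file = new Path("/user/demo/input.txt");       // hypothetical path
        try (FSDataInputStream in = fs.open(file)) {        // open() triggers the RPC to the name node
            IOUtils.copyBytes(in, System.out, 4096, false); // blocks are streamed from the data nodes
        }
    }
}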
HDFS File Write Request Workflow
Step 1: The client calls DistributedFileSystem (DFS)'s create() function to create the file.
Step 2: To create a new file in the file system namespace without any blocks attached, DFS sends an RPC call to the name node. In order to ensure that the file is new and that the client has the necessary permissions to create it, the name node runs a number of checks. If these checks pass, the name node creates a record of the new file; if not, the client receives an error, such as an IOException, and the file is not created. The client can then begin writing data to the FSDataOutputStream that the DFS returns.
Step 3: The DFSOutputStream divides the data written by the client into packets, which it places on an internal queue called the data queue. The DataStreamer consumes the data queue and is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The chosen data nodes form a pipeline; in this case, we'll suppose that there are three nodes in the pipeline due to the replication level of three. The first data node in the pipeline receives the packets from the DataStreamer and stores them before forwarding them to the second data node in the pipeline.
Step 4: In a similar manner, the packet is stored by the second data node and forwarded to the third (and final) data node in the pipeline.
Step 5: The DFSOutputStream also maintains an internal "ack queue" of packets awaiting acknowledgement from the data nodes.
Step 6: When the client has finished writing, it closes the stream; this flushes all remaining packets through the data node pipeline and waits for acknowledgements before contacting the name node to signal that the file is complete.
HDFS adheres to the Write Once, Read Many paradigm. Therefore, files that are already saved in HDFS cannot be edited in place, but data can be appended by reopening the file (a minimal sketch follows). Because of this design, HDFS can scale to support many concurrent clients, since data traffic is distributed across all of the cluster's data nodes. As a result, the system's throughput, scalability, and availability all increase.
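A minimal sketch of such an append, assuming the cluster permits appends and that the file already exists; the path is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/events.log");       // must already exist in HDFS
        try (FSDataOutputStream out = fs.append(file)) {     // existing bytes are never rewritten
            out.writeBytes("new record\n");                  // data can only be added at the end
        }
    }
}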
Example
• Saving Project.pdf (300 MB)

Block   Size      Stored On
B1      128 MB    DN1, DN2, DN3
B2      128 MB    DN2, DN4, DN5
B3      44 MB     DN1, DN5, DN6
Java interfaces to HDFS
HDFS provides several Java interfaces/APIs for interacting with the Hadoop
Distributed File System. These interfaces are used for writing Java
programs that can store, read, and manage files in HDFS.
1. File System Interface
This is the main Java interface for interacting with HDFS.
It provides methods to:
• Create and write files
• Open and read files
• Delete files or directories
• Check if a file exists
• List directory contents
Think of this as the controller that lets Java programs perform file
operations in HDFS.
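A minimal sketch of these operations through the FileSystem interface; the directory and file names are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemOpsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/user/demo/reports");

        if (!fs.exists(dir)) {                         // check if a path exists
            fs.mkdirs(dir);                            // create the directory
        }
        for (FileStatus status : fs.listStatus(dir)) { // list directory contents
            System.out.println(status.getPath());
        }
        fs.delete(new Path(dir, "old.csv"), false);    // delete a single file (non-recursive)
    }
}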
Java interfaces to HDFS
2. Path Class
• Represents a file or folder location in HDFS.
• Used to specify where a file should be stored, read from, or
deleted.
• Similar to how you'd use a file path in any file system.
3. Configuration Class
Loads the Hadoop environment settings from configuration files.
It helps the Java program know:
• Where the HDFS is running
• What the default file system is
• Any other Hadoop-specific settings
Java interfaces to HDFS
4. FSDataInputStream and FSDataOutputStream
These are data streams used to read from and write
to HDFS files.
They work similarly to Java's standard input/output
streams but are optimized for HDFS.
5. Other Supporting Interfaces
FileStatus: Used to get information about files,
like size, permissions, and modification date.
RemoteIterator: Used to iterate over files and
directories in HDFS.
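A minimal sketch that ties these pieces together (Configuration, Path, the data streams, FileStatus, and RemoteIterator); the NameNode address and paths are assumptions for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.io.IOUtils;

public class HdfsJavaApiExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");      // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/notes.txt");           // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {  // create and write a file
            out.writeBytes("HDFS via the Java API\n");
        }
        try (FSDataInputStream in = fs.open(file)) {             // read it back
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        FileStatus status = fs.getFileStatus(file);              // file metadata
        System.out.println(status.getLen() + " bytes, modified at " + status.getModificationTime());

        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/user/demo"), true);
        while (it.hasNext()) {                                   // iterate over files recursively
            System.out.println(it.next().getPath());
        }
    }
}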
HDFS Command Line Interface (CLI)
HDFS CLI allows users to interact with the Hadoop Distributed
File System directly from the terminal. It is used for file
management, monitoring, and troubleshooting in HDFS without
writing code.
Core Purpose:
• Enables manual interaction with files stored in HDFS.
• Useful for administrators, developers, and analysts to
upload, organize, and manage large datasets.
• Supports scripting for automation of data workflows.
Common File Commands (invoked as hdfs dfs <command>):
-ls → Lists files and directories in a given HDFS path.
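A few other commonly used hdfs dfs commands:
-put → Copies a file from the local file system into HDFS.
-get → Copies a file from HDFS back to the local file system.
-mkdir → Creates a directory in HDFS.
-rm → Deletes a file from HDFS (with -r, deletes a directory).
-cat → Prints the contents of an HDFS file to the terminal.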
Steps to Create Schemas
• Design an Avro schema in JSON format according to your data structure.
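Hadoop Security
1. Authentication:
Authentication verifies the identity of users and services before they can interact with the Hadoop cluster; in Hadoop this is typically enforced with Kerberos, so that only verified users can submit jobs or access HDFS data.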
2. Authorization:
Authorization governs the actions an authenticated user can take within the Hadoop cluster. It entails creating access
restrictions and permissions depending on the roles and responsibilities of the users. Organizations can enforce the
concept of least privilege by allowing users only the privileges required to complete their tasks, if adequate
authorization procedures are in place. This reduces the possibility of unauthorized data tampering or exposure.
3. Auditing:
Auditing is essential for monitoring and tracking user activity in the Hadoop cluster. Organizations can investigate
suspicious or unauthorized activity by keeping detailed audit logs. Auditing also aids in compliance reporting, allowing
organizations to demonstrate conformity with regulatory standards. Implementing real-time audit log monitoring and
analysis provides for the timely detection of security incidents and the facilitation of proactive measures.
Administering Hadoop
Administering Hadoop involves managing and
maintaining a Hadoop cluster, which includes
tasks like installation, configuration,
monitoring, troubleshooting, and security.
It also encompasses managing the Hadoop Distributed File
System (HDFS) and YARN (Yet Another Resource Negotiator),
as well as other components that work with the core Hadoop
system.
Key Aspects of Hadoop Administration
• Installation and Configuration
• Cluster Management
• Monitoring and Performance Tuning
• Security
• Backup and Recovery
• Troubleshooting
• Administration Tools and Techniques
• Hadoop Administration Roles
Monitoring
• HDFS administration includes monitoring the HDFS file structure, locations, and the updated files.
• MapReduce administration includes monitoring the list of applications, configuration of nodes, application status, etc.
Hadoop Benchmark
• Hadoop benchmarking involves using specific tools and techniques
to evaluate the performance of a Hadoop cluster.
• These benchmarks allow for testing and tuning the cluster's
hardware, software, and network for optimal performance.
Common Benchmark Tools:
• TestDFSIO – Measures HDFS I/O performance.
• TeraSort – Sorts large amounts of data; tests both
MapReduce and HDFS.
• MRBench – Tests performance of MapReduce
framework.
• HiBench – A comprehensive benchmarking suite by
Intel.
Hadoop in Cloud
Hadoop in the cloud, also known as Hadoop as a Service (HaaS), refers to running the Apache Hadoop framework on a cloud provider's infrastructure rather than managing it on-premises. This allows organizations to leverage cloud-based resources for big data analytics without the need to invest in and manage their own hardware and software.
Examples of Cloud Hadoop Services:
1. Amazon EMR:
A fully managed Hadoop service on Amazon Web
Services, offering a wide range of options for
running Hadoop clusters in the cloud.
2. Google Dataproc:
A managed service for running Apache Hadoop and Apache
Spark clusters on Google Cloud Platform.
3. Azure HDInsight:
A managed Hadoop service on Microsoft Azure, providing a
variety of Hadoop and related services.