Unit 3 Full

Hadoop Distributed File System (HDFS) is a distributed file system designed for storing and processing large files across a cluster, providing features like fault tolerance through data replication, scalability, and high throughput. HDFS organizes data into blocks, which are replicated across multiple nodes to ensure data availability and reliability, while also optimizing for batch processing workloads. However, it faces challenges such as handling small files, high latency, and complex setup requirements.


UNIT 3

Hadoop Distributed File System


HADOOP
Hadoop includes two main pieces:
• a distributed architecture for running MapReduce jobs, which are Java (and other) programs used to convert data from one format to another, and
• a distributed file system (HDFS) for storing data in a distributed architecture.
Here we discuss HDFS.
Hadoop Distributed File System
• Apache Hadoop HDFS is a distributed file system which provides redundant storage space for storing files which are huge in size: files in the range of terabytes and petabytes.
• In HDFS, data is stored reliably. Files are broken into blocks and distributed across nodes in a cluster. After that, each block is replicated, meaning copies of the blocks are created on different machines.
• Hence if a machine goes down or crashes, we can still easily retrieve and access our data from the other machines. By default, 3 copies of each block are created on different machines.
• Hence it is highly fault-tolerant. HDFS provides a fast file read and write mechanism, as data is stored on different nodes in a cluster and the user can easily access it from any machine in the cluster. Hence HDFS is widely used as a platform for storing huge volumes and varieties of data worldwide.
Features of HDFS
• Distributed Storage: Data is stored across
multiple nodes in a cluster.
• Fault Tolerance: Data is replicated across
different nodes to prevent data loss.
• Scalability: Easily scales by adding new
nodes to the cluster.
• High Throughput: Supports parallel
processing, ensuring efficient data access.
• Streaming Access: Designed for high-speed
data streaming rather than low-latency
access.
• Write Once, Read Many (WORM): Data is
written once and read multiple times,
making it suitable for big data analytics.
HDFS Use Cases

• Big Data Analytics: processing large datasets (e.g., log analysis, fraud detection).
• Machine Learning Pipelines: storing and preprocessing training data.
• Cloud Storage Solutions: used in data lakes and enterprise data warehouses.
• Social Media & Web Analytics: storing and analyzing user interactions.
• Healthcare & Genomics: processing large medical datasets.
HDFS Design
• HDFS (Hadoop Distributed File System) is designed to handle large-scale data storage and processing efficiently across a distributed cluster. The design of HDFS is influenced by the need for scalability, fault tolerance, high throughput, and cost-effectiveness, while being optimized for batch processing workloads.
Design Goals of HDFS

• Fault Tolerance
• Scalability
• High Throughput
• Streaming Data Access
• Simple Coherency Model
• Economical Storage
HDFS Architecture

• HDFS is an open-source component of the Apache Software Foundation that manages data.
• HDFS has scalability, availability, and replication as key features.
• NameNodes, secondary NameNodes, DataNodes, checkpoint nodes, backup nodes, and blocks all make up the architecture of HDFS.
• HDFS is fault-tolerant and replicated. Files are distributed across the cluster systems using the NameNode and DataNodes.
• The primary difference between Hadoop and Apache HBase is that Apache HBase is a non-relational database, whereas Apache Hadoop (with HDFS) is a non-relational data store, i.e., a distributed file system rather than a database.
HDFS Architecture
HDFS is composed of a master-slave architecture, which includes the following elements:
• name nodes,
• secondary name nodes,
• data nodes,
• checkpoint nodes,
• backup nodes,
• blocks
Name Node
The NameNode (Master Node) manages all DataNode blocks, performing tasks such as:
• Monitoring and controlling DataNodes.
• Granting user access to files.
• Storing block metadata.
• Committing EditLogs to disk after every write for recovery.
DataNodes operate independently, and failures don’t impact the cluster as lost blocks are re-
replicated. Only the NameNode tracks all DataNodes, managing communication centrally.
The NameNode maintains two key files:
FsImage: A snapshot of the entire filesystem, storing directories and file metadata hierarchically.
EditLogs: Records real-time changes, tracking file modifications for recovery.
Secondary NameNode
The secondary NameNode periodically performs a checkpoint: it merges the NameNode's EditLogs into the FsImage so that the edit log does not grow without bound. The secondary NameNode supports the following duties:

• Transaction logs from all sources are stored in one location for easy replay.
• Replication ensures data availability across servers, either directly or via a distributed file system.
• Metadata (information about data locations) is stored in cluster nodes and helps DataNodes read and process data.
• FsImage enables data replication, scaling, and backup. It can be recreated from an old version in case of failure.
• Backups can be stored in another Hadoop cluster or on a local file system.
DataNode
Every slave machine that contains data runs a DataNode. DataNodes store the data blocks in the node's local file system (e.g., ext3 or ext4 format). DataNodes do the following:

1. DataNodes store all of the data blocks.
2. They handle all of the requested operations on files, such as reading file content and creating new data, as described above.
3. They follow all of the NameNode's instructions, including scrubbing (deleting) data on DataNodes, establishing replication partnerships, and so on.
Checkpoint Node

• The checkpoint node establishes checkpoints at specified intervals: it fetches the FsImage and EditLogs from the NameNode and merges them to produce a new image.
• Whenever the checkpoint node generates a new merged image, it creates a checkpoint and delivers it back to the NameNode.
• Its directory structure is always identical to that of the NameNode, so the checkpointed image is always available.
Backup Node

• Backup nodes are used to provide high availability of the namespace metadata.
• The backup node keeps an up-to-date copy of the file system namespace in memory by receiving a continuous stream of edits from the active NameNode, in addition to maintaining its own on-disk FsImage.
• If the active NameNode goes down, the backup node's copy of the namespace can be used to bring a replacement NameNode online quickly.
• Backup nodes do not protect against the loss of the data blocks themselves; for that purpose HDFS relies on the block replicas stored on DataNodes. DataNodes store the data, while the NameNode (together with its checkpoint and backup nodes) maintains the FsImage and EditLogs.
Blocks
• Block size in HDFS is configurable (128 MB by default) and is chosen based on performance needs.
• Data is written to DataNodes as blocks, which are replicated for fault tolerance.
• Automatic recovery occurs if a node fails: lost blocks are restored from their remaining replicas.
• HDFS provides its own file-system layer; clients do not access the hard drives directly.
• It scales horizontally as data and users grow.
• File storage mechanism: if a file exceeds the block size, it is split across multiple blocks.
  Example: A 135 MB file with a 128 MB block size creates two blocks: 128 MB + 7 MB.
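
To make the split arithmetic concrete, here is a small illustrative Java sketch (not HDFS code, just the arithmetic) showing how the 135 MB example divides into blocks:

// Illustrative only: computes how HDFS would split a file into blocks.
public class BlockSplit {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // 128 MB default block size
        long fileSize  = 135L * 1024 * 1024;   // 135 MB file from the example

        long fullBlocks = fileSize / blockSize;  // complete 128 MB blocks
        long remainder  = fileSize % blockSize;  // size of the final, smaller block

        System.out.println("Full blocks: " + fullBlocks);                        // 1
        System.out.println("Last block:  " + remainder / (1024 * 1024) + " MB"); // 7 MB
    }
}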
Benefits of HDFS (Hadoop Distributed File System)
• Scalability – HDFS is highly scalable and can store and process petabytes of data by adding new nodes to the cluster.
• Fault Tolerance – It replicates data across multiple nodes, ensuring data availability even if a node fails.
• High Throughput – HDFS is optimized for large-scale data processing and supports parallel data access, improving performance.
• Cost-Effective – It runs on commodity hardware, reducing the cost compared to traditional storage solutions.
• Write Once, Read Many (WORM) Model – Data in HDFS is written once and read multiple times, making it ideal for big data analytics.
• Data Locality – HDFS moves computation to data rather than moving data to computation, reducing network bottlenecks.
• Integration with Hadoop Ecosystem – It seamlessly integrates with Hadoop tools like MapReduce, Hive, and Spark for efficient big data processing.
Challenges of HDFS
•Small File Problem – HDFS is optimized for large files; storing many small files can cause
performance issues and increase metadata overhead.
•Latency – Since HDFS is designed for batch processing, it has high latency and is not
suitable for real-time applications.
•Write-Once Limitation – Once written, files in HDFS cannot be modified, making updates
difficult.
•Complex Setup & Maintenance – Managing and configuring an HDFS cluster requires
expertise in Hadoop administration.
•Security Concerns – By default, HDFS lacks advanced security features like encryption and
access control, which must be manually configured.
•Hardware Dependency – Though it runs on commodity hardware, disk failures and
hardware issues still impact performance and require redundancy.
•Memory Overhead – NameNode stores metadata in RAM, which can become a bottleneck
as the cluster grows.
File Size in HDFS
In the Hadoop Distributed File System (HDFS), file size considerations depend on several factors, such as block size, replication factor, and compression. Here are the key points:
1. Default Block Size:
• The default block size in HDFS is 128 MB (commonly raised to 256 MB).
• Files smaller than the block size do not waste a full block of space, but large files are split into multiple blocks.
2. Maximum File Size:
Since HDFS files are divided into blocks, the maximum file size is practically limited by the number of blocks the NameNode can manage. Theoretically, if each block is 128 MB and a NameNode can handle hundreds of millions of blocks, a single file can be petabytes in size.
3. Replication Factor:
• Each block is replicated (default: 3 times) across different DataNodes.
• Effective storage used = file size × replication factor
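For example, a 2 GB file stored with the default replication factor of 3 consumes roughly 2 GB × 3 = 6 GB of raw cluster storage, plus a small amount of NameNode metadata.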
4. Small Files Problem:
• HDFS is not efficient for handling a large number of small files, since each file's metadata is stored in NameNode memory.
• Solution: Use HAR (Hadoop Archive), SequenceFile, or Avro to merge small files.
5. Compression (Optional):
• HDFS supports Snappy, Gzip, LZO, etc., which reduce file size and improve efficiency.
Block Size in
HDFS

• Files in HDFS are broken into block-sized chunks called data blocks. These blocks
are stored as independent units.
• The size of these HDFS data blocks is 128 MB by default. We can configure the block
size as per our requirement by changing the dfs.block.size property in hdfs-site.xml
• Hadoop distributes these blocks on different slave machines, and the master machine
stores the metadata about blocks location.
• All the blocks of a file are of the same size except the last one (if the file size is not a multiple of 128 MB). See the example below to understand this fact.
Block Size in
HDFS
Suppose we have a file of size 612 MB, and we are using the default block
configuration (128 MB). Therefore five blocks are created, the first four blocks
are 128 MB in size, and the fifth block is 100 MB in size (128*4+100=612).
From the above example, we can conclude that:
1.A file in HDFS, smaller than a single block does not occupy a full block size
space of the underlying storage.
2.Each file stored in HDFS doesn’t need to be an exact multiple of the configured
block size.
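
The block-size property mentioned above can also be set per client. A minimal Java sketch, assuming the newer property name dfs.blocksize (older releases use dfs.block.size in hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class BlockSizeConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // loads core-site.xml / hdfs-site.xml
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);    // request 256 MB blocks for new files
        FileSystem fs = FileSystem.get(conf);                 // files written via fs now use 256 MB blocks
        System.out.println("Block size in use: " + fs.getDefaultBlockSize());
    }
}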
• Now let’s see the reasons behind the large size of the data blocks in HDFS.
Why are blocks in HDFS huge?

The default size of the HDFS data block is 128 MB. The reasons for the large block size are:

• To minimize the cost of seeks: with large blocks, the time taken to transfer the data from disk is much longer than the time taken to seek to the start of the block, so data is transferred at close to the disk transfer rate.
• If blocks are small, there will be too many blocks in Hadoop HDFS and thus too much metadata to store. Managing such a huge number of blocks and their metadata creates overhead and leads to extra network traffic.
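For illustration (assuming a typical seek time of about 10 ms and a disk transfer rate of about 100 MB/s): keeping seek overhead near 1% of transfer time means a block should take roughly 1 second to transfer, i.e., be on the order of 100 MB, which is why defaults such as 128 MB are used.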
Block Abstraction in HDFS (Hadoop
Distributed File System)
HDFS (Hadoop Distributed File System) stores large files across multiple
nodes in a Hadoop cluster. Instead of treating files as a whole, HDFS
divides them into blocks, which are the basic unit of storage.
Key Concepts of Block Abstraction in HDFS:
1. Block Size:
1. Default block size in HDFS is 128MB (can be configured).
2. Unlike traditional file systems (which have small block sizes like
4KB), HDFS uses large blocks to minimize metadata overhead
and improve performance.
2. Splitting and Storage:
• Large files are split into fixed-size blocks and distributed across multiple nodes.
• Example: A 500 MB file with a block size of 128 MB will be split into 3 full blocks (128 MB each) and 1 smaller block (116 MB).
Block Abstraction in HDFS (Hadoop
Distributed File System)
3. Replication for Fault Tolerance:
•Each block is replicated (default replication factor:
3).
•Replicas are stored on different nodes to ensure data
availability in case of node failure.
4. Block Mapping:
•The NameNode keeps track of where each block is
stored but does not store the actual data.
•The DataNodes store the actual blocks.
5. Processing and Parallelism:
•Since files are split into blocks across multiple nodes,
Hadoop’s MapReduce can process them in parallel,
improving efficiency.
Example of Block Abstraction in HDFS:

Suppose you upload a 300 MB file into HDFS with a block size of 128 MB and replication factor 3. The file is split into three blocks:
• Block 1: 128 MB
• Block 2: 128 MB
• Block 3: 44 MB (remaining data)
Each block is replicated three times across different nodes.
Data Replication in HDFS (Hadoop Distributed File System)
Data replication with multiple copies across many nodes helps protect against data loss. HDFS keeps at least one copy on a different rack from all other copies. This data storage across the nodes of a large cluster increases reliability.
In addition, HDFS can take storage snapshots to save point-in-time (PIT) information.
Data Replication in HDFS (Hadoop Distributed File System)

In order to provide reliable storage, HDFS stores large files in multiple locations across a large cluster, with each file stored as a sequence of blocks. Every block is the same size except for the final one, which fills as data is added.

For added protection, HDFS files are write-once, by only one writer at any time. To help ensure that all data is replicated as instructed, the NameNode receives a heartbeat (a periodic status report) and a blockreport (the block ID, generation stamp, and length of every block replica) from every DataNode attached to the cluster. Receiving a heartbeat indicates that the DataNode is working correctly.
• The NameNode selects the rack ID for each DataNode using a process called Hadoop Rack Awareness, to help prevent the loss of data if an entire rack fails. This also enables the use of bandwidth from multiple racks when reading data.
How Does HDFS Store a File?

Step 1: Client Initiates the Write


• A user (or application) wants to store a file in HDFS.
• The HDFS client contacts the NameNode to request permission and metadata setup for the file.
Step 2: File is Split into Blocks
• The file is divided into fixed-size blocks, typically 128 MB each.
• For example, a 400 MB file becomes 4 blocks: Block1 (128 MB), Block2 (128 MB), Block3
(128 MB), Block4 (16 MB).
Step 3: NameNode Chooses DataNodes
• For each block, the NameNode picks three DataNodes (by default) where replicas will be stored.
• These selections are based on rack awareness and load balancing.
How Does HDFS Store a File?
Step 4: Pipeline is Created for Block Writing
• The client receives the list of chosen DataNodes for Block1.
• Then it starts sending Block1 to the first DataNode.
• That DataNode forwards it to the second, and the second forwards it to the third.
• This forms a write pipeline.
Step 5: Block is Replicated and Acknowledged
• Once Block1 is written and replicated on all three DataNodes, an acknowledgment is sent back to the client.
• This process repeats for each block until the entire file is stored.
Step 6: Metadata Stored by NameNode
• The NameNode stores:
• The filename
• Block IDs
• The DataNode locations for each block
Step 7: File is Ready in HDFS
• Now, the file is officially stored in HDFS and can be accessed by the client for reading.
Example
Say you store a file research.pdf (size = 300 MB):

HDFS splits it into 3 blocks:

• Block 1 → 128 MB
• Block 2 → 128 MB
• Block 3 → 44 MB

Each block is replicated to 3 different DataNodes:

• Block 1 → DN1, DN2, DN3


• Block 2 → DN2, DN4, DN5
• Block 3 → DN1, DN3, DN6

NameNode stores the block mapping info for


research.pdf.
HDFS File Read Request Workflow

Step 1: The client calls open() on the FileSystem object (which, in the case of HDFS, is an instance of DistributedFileSystem) to open the file it wants to read.

Step 2: To find the locations of the file's first blocks, the DistributedFileSystem (DFS) makes a remote procedure call (RPC) to the NameNode. The addresses of the DataNodes that hold copies of each block are returned by the NameNode. The DFS gives the client an FSDataInputStream so that it can read data from it. A DFSInputStream, which manages the DataNode and NameNode I/O, is in turn wrapped by the FSDataInputStream.

Step 3: The client then uses the stream to call read(). DFSInputStream, which has stored the DataNode addresses for the first few blocks in the file, connects to the closest DataNode holding the file's first block.
Step 4: The client repeatedly calls read() on the stream as data arrives from the DataNode.

Step 5: When the end of a block is reached, DFSInputStream closes the connection to that DataNode and locates the best DataNode for the next block. The client observes this transparently and perceives it as reading one continuous stream. Blocks are read in order as the client reads through the stream, with DFSInputStream opening new connections to DataNodes as needed. When necessary, it will also call the NameNode to obtain the locations of the DataNodes for the next batch of blocks.
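
A minimal Java sketch of this read path, using the public FileSystem API (the file path is a placeholder):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                 // DistributedFileSystem when fs.defaultFS is hdfs://
        Path file = new Path("/user/hadoop/sample.txt");      // placeholder HDFS path

        try (FSDataInputStream in = fs.open(file);            // open() triggers the RPC to the NameNode
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {      // read() streams data from the DataNodes
                System.out.println(line);
            }
        }
    }
}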
HDFS File Write Request Workflow

Step 1: The client calls the create() function on DistributedFileSystem (DFS) to create the file.

Step 2: DFS sends an RPC call to the NameNode to create a new file in the file system namespace, with no blocks attached yet. The NameNode runs a number of checks to ensure that the file does not already exist and that the client has the necessary permissions to create it. If these checks pass, the NameNode creates a record of the new file; if not, the client receives an error, such as an IOException, and the file is not created. The DFS returns an FSDataOutputStream, and the client can begin writing data to it.
Step 3: The DFSOutputStream splits the data written by the client into packets, which it writes to the data queue, an internal queue. The DataStreamer consumes the data queue and is responsible for asking the NameNode to allocate new blocks by selecting suitable DataNodes to store the replicas. The chosen DataNodes form a pipeline; here we assume there are three nodes in the pipeline because the replication level is three. The first DataNode in the pipeline receives the packets from the DataStreamer and stores each one before forwarding it to the second DataNode in the pipeline.

Step 4: In a similar manner, the packet is stored by the second DataNode and forwarded to the third (and final) DataNode in the pipeline.
Step 5: The DFSOutputStream also maintains an internal "ack queue" of packets awaiting acknowledgement from the DataNodes.

Step 6: To mark the file as complete, the client closes the stream; this flushes all of the remaining packets to the DataNode pipeline and then contacts the NameNode, awaiting the final acknowledgments.

HDFS adheres to the Write Once, Read Many paradigm. Therefore, files already saved in HDFS cannot be edited, but data can be appended by re-opening the file. Because of this design, HDFS can scale to support many concurrent clients, since data traffic is distributed across all of the cluster's DataNodes. As a result, the system's throughput, scalability, and availability all increase.
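
A minimal Java sketch of this write path, again using the public FileSystem API (the path and contents are placeholders):

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/hadoop/notes.txt");       // placeholder HDFS path

        // create() asks the NameNode to add the file to the namespace,
        // then returns an FSDataOutputStream backed by the DataNode write pipeline.
        try (FSDataOutputStream out = fs.create(file, true)) { // true = overwrite if it exists
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }
        // Closing the stream flushes the remaining packets and tells the NameNode the file is complete.
    }
}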
Example

• Saving Project.pdf (300 MB)

Block | Size   | Stored On
B1    | 128 MB | DN1, DN2, DN3
B2    | 128 MB | DN2, DN4, DN5
B3    | 44 MB  | DN1, DN5, DN6
Java interfaces to HDFS
HDFS provides several Java interfaces/APIs for interacting with the Hadoop
Distributed File System. These interfaces are used for writing Java
programs that can store, read, and manage files in HDFS.
1. File System Interface
This is the main Java interface for interacting with HDFS.
It provides methods to:
• Create and write files
• Open and read files
• Delete files or directories
• Check if a file exists
• List directory contents
Think of this as the controller that lets Java programs perform file
operations in HDFS.
Java interfaces to HDFS
2. Path Class
• Represents a file or folder location in HDFS.
• Used to specify where a file should be stored, read from, or
deleted.
• Similar to how you'd use a file path in any file system.
3. Configuration Class
Loads the Hadoop environment settings from configuration files.
It helps the Java program know:
• Where the HDFS is running
• What the default file system is
• Any other Hadoop-specific settings
Java interfaces to HDFS
4. FSDataInputStream and FSDataOutputStream
These are data streams used to read from and write
to HDFS files.
They work similarly to Java's standard input/output
streams but are optimized for HDFS.
5. Other Supporting Interfaces
FileStatus: Used to get information about files,
like size, permissions, and modification date.
RemoteIterator: Used to iterate over files and
directories in HDFS.
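
A short sketch tying these pieces together (the directory path is a placeholder); it lists a directory with FileStatus and then walks it recursively with a RemoteIterator:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class HdfsListExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();             // Configuration: loads Hadoop settings
        FileSystem fs = FileSystem.get(conf);                  // FileSystem: the main HDFS interface
        Path dir = new Path("/user/hadoop");                   // Path: a location in HDFS (placeholder)

        // FileStatus: metadata such as size, permissions, and modification time
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        // RemoteIterator: walk files recursively without loading everything at once
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, true);
        while (it.hasNext()) {
            System.out.println(it.next().getPath());
        }
    }
}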
HDFS Command Line Interface
(CLI)
HDFS CLI allows users to interact with the Hadoop Distributed
File System directly from the terminal. It is used for file
management, monitoring, and troubleshooting in HDFS without
writing code.
Core Purpose:
• Enables manual interaction with files stored in HDFS.
• Useful for administrators, developers, and analysts to
upload, organize, and manage large datasets.
• Supports scripting for automation of data workflows.
Common File Commands:
-ls → Lists files and directories in a given HDFS path.

-mkdir → Creates new directories in HDFS.

-put → Uploads files from the local file system to HDFS.

-get → Downloads files from HDFS to the local system.

-cat → Displays contents of an HDFS file.

-rm → Deletes files from HDFS.

-du, -df → Shows space usage and available capacity.
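
For example, typical invocations look like the following (paths and file names are placeholders):

hdfs dfs -mkdir /user/hadoop/data
hdfs dfs -put sales.csv /user/hadoop/data/     # local file system -> HDFS
hdfs dfs -ls /user/hadoop/data
hdfs dfs -cat /user/hadoop/data/sales.csv
hdfs dfs -get /user/hadoop/data/sales.csv .    # HDFS -> local file system
hdfs dfs -du -h /user/hadoop
hdfs dfs -rm /user/hadoop/data/sales.csv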


Hadoop File System Interfaces
• HDFS is a distributed file system designed to store large datasets across multiple machines, providing high throughput, scalability, and fault tolerance.
1. Java API Interface
• The primary and most powerful interface for interacting with HDFS.
• Provides extensive functionality for file read/write, directory management, etc.
• Integrated deeply with Hadoop's MapReduce and other tools.
• Used mostly by Java developers building Hadoop-based applications.
2. Hadoop Shell (Command-Line Interface)
• Provides a set of commands similar to a UNIX/Linux shell (e.g., list, copy, move, delete).
• Very useful for administrators and testers.
• Allows direct interaction with HDFS from the terminal or scripts.
• Example commands: list directories, upload/download files, remove files.
3. Web Interface (WebHDFS / REST API)
• A language-agnostic RESTful interface that enables communication over HTTP.
• Allows remote systems or web apps to access HDFS using URL-based requests.
• Suitable for integration with external systems or web dashboards.
• Supports operations like read, write, rename, and delete files.
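For illustration, a WebHDFS read request follows the URL pattern http://<namenode-host>:<http-port>/webhdfs/v1/<path>?op=OPEN (the NameNode's default HTTP port is 9870 in Hadoop 3.x and 50070 in Hadoop 2.x); other operations use op=CREATE, op=LISTSTATUS, op=DELETE, and so on.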
4. C/C++ Interface (libhdfs)
• Provides access to HDFS from native C/C++ applications.
• Based on the Java Native Interface (JNI).
• Useful for legacy systems or performance-critical native applications.
• Includes functions for opening, reading, writing, and closing files.
5. Python Interface
• Widely used by data scientists and analysts in machine learning and AI projects.
• Enables programmatic file handling within Python-based data workflows.
Map Reduce Data Flow
• The data processed by MapReduce should be stored in HDFS, which divides the data into blocks and stores them in a distributed fashion. Below are the steps of the MapReduce data flow:
• Step 1: One block is processed by one mapper at a time. In the mapper, a developer can specify his own business logic as per the requirements. In this manner, Map runs on all the nodes of the cluster and processes the data blocks in parallel.
• Step 2: The output of the mapper, also known as intermediate output, is written to the local disk. The mapper's output is not stored on HDFS because this is temporary data, and writing it to HDFS would create many unnecessary copies.
Map Reduce Data Flow
• Step 3: The output of the mapper is shuffled to the reducer node (a normal slave node on which the reduce phase runs, hence called the reducer node). The shuffling/copying is a physical movement of data over the network.
• Step 4: Once all the mappers have finished and their output has been shuffled to the reducer nodes, this intermediate output is merged and sorted, and then provided as input to the reduce phase.
• Step 5: Reduce is the second phase of processing, where the user can specify his own custom business logic as per the requirements. The input to a reducer is provided from all the mappers. The output of the reducer is the final output, which is written to HDFS.
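
To make the mapper and reducer roles concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API; the class and field names are illustrative, not part of the text above:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: called once per input record (line of a block); emits (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // intermediate output, spilled to local disk
            }
        }
    }
}

// Reducer: receives all counts for one word after shuffle and sort; writes the total to HDFS.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}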
Data Ingest with Flume and Sqoop
In the Apache Hadoop ecosystem, Flume and Sqoop are essential tools for data ingestion, each
catering to specific data sources and targets.
Apache Flume: Streaming Data Ingestion
Flume is a distributed, reliable, and highly available service for collecting, aggregating, and moving
large amounts of streaming data into the Hadoop Distributed File System (HDFS).
Key Features of Flume
Flume offers several features that make it well-suited for streaming data ingestion:
Scalability: Flume can be easily scaled horizontally by adding more agents to accommodate
increasing data volumes.
Fault Tolerance: Flume supports reliable and durable message delivery, ensuring that no data is lost
during the ingestion process.
Extensibility: Flume allows custom components to be developed and integrated, making it highly
adaptable to various data sources and sinks.
Flume Architecture: Key Components

The primary components of Flume’s architecture include:
• Source: The component that receives data from external systems and converts it into Flume events.
• Channel: The storage mechanism that buffers the events between the source and sink, providing durability and reliability.
• Sink: The component that writes the events to the desired destination, such as HDFS, HBase, or another Flume agent.
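
As an illustrative sketch (agent, source, and path names are assumptions, not taken from the text), a Flume agent wiring a source, channel, and sink together is declared in a properties file along these lines:

# Name the components of this agent
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: tail an application log (placeholder command)
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app.log
agent1.sources.src1.channels = ch1

# Channel: buffer events in memory between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Sink: write events into HDFS (placeholder path)
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode:9000/flume/events
agent1.sinks.sink1.channel = ch1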
Apache Sqoop: Bulk Data Transfer

Sqoop is a tool designed to transfer bulk data between Hadoop and structured datastores, such as relational databases and data warehouses.

Key Features of Sqoop
Sqoop offers several features that make it an ideal choice for bulk data transfer:
• Parallelism: Sqoop leverages MapReduce to perform parallel data transfers, improving transfer speeds and efficiency.
• Connectors: Sqoop supports a wide range of connectors for various databases and datastores, providing flexibility and compatibility.
• Incremental Data Import: Sqoop can perform incremental data imports, enabling efficient and up-to-date data synchronization between Hadoop and external systems.
Sqoop Architecture: Key Components
The primary components of Sqoop’s architecture include:
• Connectors: The plugins that enable communication between Sqoop and the supported databases or datastores.
• Import and Export Commands: The commands that initiate and control the data transfer process, allowing users to specify the source, target, and various transfer options.
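
For instance (the connection string, credentials, table, and target directory are placeholders), a typical import command looks like:

sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username report_user -P \
  --table orders \
  --target-dir /user/hadoop/orders \
  --num-mappers 4

Here --num-mappers controls how many parallel map tasks Sqoop uses for the transfer.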
Hadoop Archive
• A Hadoop Archive (HAR) is a special type of file format used in HDFS to bundle many small files into a single archive file. HARs reduce NameNode memory usage by minimizing the number of metadata entries. They are primarily used to improve performance and scalability when dealing with a large number of small files in HDFS.
Requirement of HAR:
• HDFS is designed for large files.
• When thousands or millions of small files are stored, the NameNode gets overloaded as it must keep metadata for each file.
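
As a hedged illustration (the archive name and paths are placeholders), an archive is built with the archive tool:

hadoop archive -archiveName smallfiles.har -p /user/hadoop/input /user/hadoop/archives

The resulting archive can then be addressed through the har:// filesystem scheme.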
Hadoop Archive
HDFS offers scalable, virtually unlimited storage, so it can be used as an archival storage system. The accompanying Data Flow Diagram (DFD) depicts the pattern of HDFS as an archive store.
Hadoop I/O

• Hadoop comes with a set of primitives for data I/O. Some of


these are techniques that are more general than Hadoop, such
as data integrity and compression, but deserve special
consideration when dealing with multiterabyte datasets. Others
are Hadoop tools or APIs that form the building blocks for
developing distributed systems, such as serialization
frameworks and on-disk data structures.
Compression
File compression brings two major benefits:
• It reduces the space needed to store files, and it speeds up
data transfer across the network, or to or from disk.
• When dealing with large volumes of data, both of these savings
can be significant, so it pays to carefully consider how to use
compression in Hadoop.
A summary of compression formats
There are many different compression formats, tools and algorithms, each with different characteristics. Table 1 lists some of the more common ones that can be used with Hadoop.

Compression format | Tool  | Algorithm | Filename extension | Multiple files | Splittable
DEFLATE[a]         | N/A   | DEFLATE   | .deflate           | No             | No
gzip               | gzip  | DEFLATE   | .gz                | No             | No
ZIP                | zip   | DEFLATE   | .zip               | Yes            | Yes, at file boundaries
bzip2              | bzip2 | bzip2     | .bz2               | No             | Yes
LZO                | lzop  | LZO       | .lzo               | No             | No

Serialization
Serialization is the process of turning structured objects into a byte stream for transmission
over a network or for writing to persistent storage. Deserialization is the reverse process
of turning a byte stream back into a series of structured objects.
Serialization appears in two quite distinct areas of distributed data processing: for
interprocess communication and for persistent storage.
In Hadoop, interprocess communication between nodes in the system is implemented
using remote procedure calls (RPCs). The RPC protocol uses serialization to render the
message into a binary stream to be sent to the remote node, which then deserializes the
binary stream into the original message.
Serialization Format
In general, it is desirable that an RPC serialization format is:
• Compact
A compact format makes the best use of network bandwidth, which is the most scarce resource in a data
center.
• Fast
Interprocess communication forms the backbone for a distributed system, so it is essential that there is as
little performance overhead as possible for the serialization and deserialization process.
• Extensible
Protocols change over time to meet new requirements, so it should be straightforward to evolve the protocol
in a controlled manner for clients and servers. For example, it should be possible to add a new argument to a
method call, and have the new servers accept messages in the old format (without the new argument) from
old clients.
Interoperable
For some systems, it is desirable to be able to support clients that are written in different languages to the
server, so the format needs to be designed to make this possible.
Avro
• Apache Avro is a language-neutral data serialization system.
• It was developed by Doug Cutting, the father of Hadoop.
• Avro uses the JSON format to declare data structures. Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby.
Avro Schemas Overview
• Avro depends heavily on its schema.
• It allows data to be written with no prior knowledge of the schema.
• It serializes fast, and the resulting serialized data is smaller in size.
• The schema is stored along with the Avro data in a file for any further processing.
• Avro schemas are defined with JSON, which simplifies implementation in languages with JSON libraries.
• Like Avro, there are other serialization mechanisms in Hadoop, such as SequenceFiles, Protocol Buffers, and Thrift.
Steps to Use Avro for Data Serialization

Step 1: Create the Schema
• Design an Avro schema in JSON format according to your data structure.

Step 2: Read the Schema into Your Program
• Two ways to do this:
  • Generate a class from the schema: compile the schema using Avro tools; this creates a Java class corresponding to the schema.
  • Use the parser library: read the schema dynamically using Avro's Schema.Parser class.
Step 3: Serialize the Data
• Use the serialization API provided in org.apache.avro.specific (or org.apache.avro.generic).
• Create an instance of DatumWriter and DataFileWriter to write the data.

Step 4: Deserialize the Data
• Use the deserialization API provided in org.apache.avro.specific (or org.apache.avro.generic).
• Create an instance of DatumReader and DataFileReader to read the data.
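
A minimal round-trip sketch using Avro's generic Java API (the schema, field values, and file name are illustrative):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroRoundTrip {
    public static void main(String[] args) throws Exception {
        // Steps 1-2: define and parse the JSON schema, here inlined as a string.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        // Step 3: serialize - DatumWriter + DataFileWriter write records (schema is embedded in the file).
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Asha");
        user.put("age", 30);
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("users.avro"));
            writer.append(user);
        }

        // Step 4: deserialize - DatumReader + DataFileReader read the records back.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(new File("users.avro"), new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec.get("name") + " is " + rec.get("age"));
            }
        }
    }
}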
Hadoop Cluster
For the purpose of storing as well as analyzing huge amounts of unstructured data in a distributed computing environment, a special type of computational cluster is designed, which is what we call a Hadoop cluster.
Whenever we talk about Hadoop clusters, two main terms come up: cluster and node. Defining them:
• A collection of nodes is what we call the cluster.
• A node is a point of intersection/connection within a network, i.e., a server.
Setting Up a Hadoop Cluster
• Setting up a Hadoop cluster involves installing
Hadoop on multiple machines (or virtual machines)
and configuring them to work together in a
distributed environment. Here's a step-by-step
guide to set up a basic multi-node Hadoop cluster:
Prerequisites
1.Machines: Minimum 2 (1 Master + 1 or more
Slaves/Workers)
2.OS: Ubuntu (preferred, but others like CentOS also
work)
3.Java: Hadoop requires Java 8 or later.
4.SSH: Passwordless SSH between all nodes.
Setting Up a
Hadoop Cluster
1. Install Java on All Nodes
2. Create Hadoop User on All Nodes
3. Set Up SSH (Passwordless)
4. Download and Extract Hadoop
5. Set Environment Variables
6. Configure Hadoop
7. Distribute Hadoop to All Nodes
8. Format HDFS (On Master)
9. Start Hadoop Cluster
10.Verify Cluster
Hadoop Cluster Specification
Hadoop clusters have two types of machines, Master and Slave, where:
• Master: HDFS NameNode, YARN ResourceManager.
• Slaves: HDFS DataNodes, YARN NodeManagers.
Hadoop Cluster Specification

• NameNode is the HDFS master, which manages the file system namespace and regulates access to files by clients, and also consults with DataNodes (HDFS slaves) while copying data or running MapReduce operations.
• DataNodes manage the storage attached to the nodes that they run on; there are a number of DataNodes, meaning one DataNode per slave in the cluster.
Cluster Setup
and Installation

As part of this example you will set


up a cluster with
• one load balancer,
• one administration server and
• four web server instances with
session replication enabled.
Cluster Setup and Installation
Identify the following machines:
• MachineA — Has both the load balancer and the administration server.
• MachineB, MachineC, MachineD and MachineE — Has the
administration node and the web server instances running.
1. Install Administration Server on MachineA.
2. Install the Administration Node on MachineB, MachineC, MachineD
and MachineE.
3. Configure the Web Application.
Cluster Setup and Installation
4. Configure the Instances.
a) Launch WDM
b) Create a new configuration for the load balancer.
c) Set up the reverse proxy (Load balancer).
d) Create an instance.
e) Deploy the Configuration
5. Create and Start the Cluster.
a. Create a new configuration for the cluster.
b. Enable Session Replication.
c. Add the web application.
d. Create the instances.
e. Start the cluster.
Hadoop Security
3 A's of security
Organizations must prioritize the three core security principles known as the 3 A's:
Authentication, Authorization, and Auditing to manage security concerns in a Hadoop
environment effectively.
1. Authentication:
The authentication process ensures that only authorized users can access the Hadoop cluster. It entails
authenticating users' identities using various mechanisms such as usernames and passwords, digital certificates, or
biometric authentication. Organizations can reduce the risk of unauthorized access and secure their data from
dangerous actors by establishing strong authentication protocols.

2. Authorization:
Authorization governs the actions an authenticated user can take within the Hadoop cluster. It entails creating access
restrictions and permissions depending on the roles and responsibilities of the users. Organizations can enforce the
concept of least privilege by allowing users only the privileges required to complete their tasks, if adequate
authorization procedures are in place. This reduces the possibility of unauthorized data tampering or exposure.

3. Auditing:
Auditing is essential for monitoring and tracking user activity in the Hadoop cluster. Organizations can investigate
suspicious or unauthorized activity by keeping detailed audit logs. Auditing also aids in compliance reporting, allowing
organizations to demonstrate conformity with regulatory standards. Implementing real-time audit log monitoring and
analysis provides for the timely detection of security incidents and the facilitation of proactive measures.
Administering Hadoop
Administering Hadoop involves managing and
maintaining a Hadoop cluster, which includes
tasks like installation, configuration,
monitoring, troubleshooting, and security.
It also encompasses managing the Hadoop Distributed File
System (HDFS) and YARN (Yet Another Resource Negotiator),
as well as other components that work with the core Hadoop
system.
Key Aspects of Hadoop Administration:
• Installation and Configuration
• Cluster Management
• Monitoring and Performance Tuning
• Security
• Backup and Recovery
• Troubleshooting
• Tools and Techniques
• Hadoop Administration Roles
Monitoring
• HDFS administration includes monitoring the HDFS file structure, locations, and the updated files.
• MapReduce administration includes monitoring the list of applications, configuration of nodes, application status, etc.
Hadoop Benchmark
• Hadoop benchmarking involves using specific tools and techniques
to evaluate the performance of a Hadoop cluster.
• These benchmarks allow for testing and tuning the cluster's
hardware, software, and network for optimal performance.
Common Benchmark Tools:
• TestDFSIO – Measures HDFS I/O performance.
• TeraSort – Sorts large amounts of data; tests both
MapReduce and HDFS.
• MRBench – Tests performance of MapReduce
framework.
• HiBench – A comprehensive benchmarking suite by
Intel.
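
As an illustrative example (the examples JAR name and the paths vary by installation and version), TeraSort is typically run in two steps:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen 10000000 /benchmarks/tera-in
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort /benchmarks/tera-in /benchmarks/tera-out

teragen generates the input rows and terasort sorts them, exercising HDFS I/O, the shuffle, and MapReduce together.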
Hadoop in the Cloud
• Hadoop in the cloud, also known as Hadoop as a Service (HaaS), refers to running the Apache Hadoop framework on a cloud provider's infrastructure, rather than managing it on-premises.
• This allows organizations to leverage cloud-based resources for big data analytics without the need to invest in and manage their own hardware and software.
Examples of Cloud Hadoop Services:
1. Amazon EMR:
A fully managed Hadoop service on Amazon Web
Services, offering a wide range of options for
running Hadoop clusters in the cloud.
2. Google Dataproc:
A managed service for running Apache Hadoop and Apache
Spark clusters on Google Cloud Platform.
3. Azure HDInsight:
A managed Hadoop service on Microsoft Azure, providing a
variety of Hadoop and related services.
