Unit-2
WHY HADOOP?
The key consideration (the rationale behind its huge popularity) is its capability to handle massive amounts of data, and different categories of data, fairly quickly.
1. Low cost: Hadoop is an open-source framework and uses commodity hardware (relatively inexpensive and easily obtainable hardware) to store enormous quantities of data.
[Figure 5.2: Key considerations of Hadoop — low cost, computing power, scalability, storage flexibility, and inherent data protection.]
2. Computing power: Hadoop is based on a distributed computing model which processes very large volumes of data fairly quickly. The more computing nodes there are, the more processing power is at hand.
3. Scalability: This boils down to simply adding nodes as the system grows and requires much less administration.
4. Storage flexibility: Unlike the traditional relational databases, in Hadoop data need not be pre-processed
before storing it. Hadoop provides the convenience of storing as much data as one needs and also the
added flexibility of deciding later as to how to use the stored data. In Hadoop, one can store unstructured data like
images, videos, and free-form text.
5. Inherent data protection: Hadoop protects data and application execution against hardware failure.
If a node fails, it automatically redirects the jobs that had been assigned to this node to the other functional and
available nodes and ensures that distributed computing does not fail. It goes a step further
to store multiple copies (replicas) of the data on various nodes across the cluster.
Introduction to Hadoop:
Hadoop is an open-source software framework that is used for storing and processing large amounts of data in a
distributed computing environment. It is designed to handle big data and is based on the MapReduce programming
model, which allows for the parallel processing of large datasets. Its framework is based on Java programming with
some native code in C and shell scripts.
Hadoop is designed to process large volumes of data (Big Data) across many machines without relying on a single
machine. It is built to be scalable, fault-tolerant and cost-effective. Instead of relying on expensive high-end
hardware, Hadoop works by connecting many inexpensive computers (called nodes) in a cluster.
Hadoop is an open-source framework, written primarily in Java, designed for storing and processing massive
datasets across clusters of commodity hardware. It provides a robust, scalable, and fault-tolerant solution for
handling Big Data challenges.
Hadoop Framework Components:
[Diagram: Hadoop framework — MapReduce (distributed computation), HDFS (distributed storage), YARN, and Common framework utilities.]
The core of the Hadoop framework consists of several key modules:
• Hadoop Distributed File System (HDFS):
This is Hadoop's distributed storage system. HDFS breaks down large files into smaller blocks and distributes them
across multiple nodes in a cluster. It is designed for high-throughput access to application data and provides fault
tolerance by replicating data blocks across different nodes.
• Yet Another Resource Negotiator (YARN):
YARN is the resource management layer of Hadoop. It handles the allocation of resources (CPU, memory) to various
applications running on the Hadoop cluster and schedules tasks for execution. YARN separates resource
management from data processing, allowing for more efficient utilization of cluster resources and support for
diverse processing engines beyond MapReduce.
• MapReduce:
This is a programming model and processing engine for parallel processing of large datasets (a minimal word-count sketch in Java follows this list of modules). MapReduce jobs are divided into two main phases:
• Map Phase: Processes input data in parallel, transforming it into key-value pairs.
• Reduce Phase: Aggregates and combines the output of the Map phase to produce the final results.
• Hadoop Common Utilities:
These are a set of common utilities and libraries that support the other Hadoop modules. They provide file system
abstractions and other general-purpose functionalities.
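As referenced above in the MapReduce module description, the classic word-count program illustrates the Map and Reduce phases using Hadoop's Java MapReduce API. The following is a minimal, illustrative sketch (class names and input/output paths are arbitrary), not part of the Hadoop distribution itself:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);   // optional local aggregation on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}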
FEATURES OF HADOOP:
1. Scalability: Hadoop can easily scale to accommodate massive amounts of data by adding more nodes to the
cluster. This horizontal scaling allows it to handle growing data volumes without performance degradation.
2. Fault Tolerance: Hadoop is designed to be fault-tolerant. It replicates data across multiple nodes, ensuring that
data is still accessible even if some nodes fail. This feature, along with automatic failover mechanisms, ensures high
availability and data reliability.
3. Cost-Effectiveness: Hadoop typically runs on commodity hardware, making it a cost-effective solution for big
data storage and processing compared to traditional systems.
4. Open Source: Being open source, Hadoop allows for community-driven development and customization,
fostering a robust ecosystem and constant innovation.
5. Distributed Processing: Hadoop uses a distributed processing model, breaking down large tasks into smaller,
parallelizable jobs that can be executed across multiple nodes in a cluster.
6. High Availability: Hadoop ensures high availability of data and applications through data replication and
automatic failover mechanisms.
7. Flexibility: Hadoop can handle various data formats, including structured, semi-structured, and unstructured data.
Advantages of Hadoop:
• Scalability: Easily scale to thousands of machines.
• Cost-effective: Uses low-cost hardware to process big data.
• Fault Tolerance: Automatic recovery from node failures.
• High Availability: Data replication ensures no loss even if nodes fail.
• Flexibility: Can handle structured, semi-structured and unstructured data.
• Open-source and Community-driven: Constant updates and wide support.
Disadvantages:
• Not ideal for real-time processing (better suited for batch processing).
• Complexity in programming with MapReduce.
• High latency for certain types of queries.
• Requires skilled professionals to manage and develop.
Applications
Hadoop is used across a variety of industries:
1. Banking: Fraud detection, risk modeling.
2. Retail: Customer behavior analysis, inventory management.
3. Healthcare: Disease prediction, patient record analysis.
4. Telecom: Network performance monitoring.
5. Social Media: Trend analysis, user recommendation engines
HDFS CONCEPTS / HDFS ARCHITECTURE:
HDFS (Hadoop Distributed File System) is the primary storage system for Hadoop, designed to handle massive
datasets across a cluster of commodity hardware. It provides a fault-tolerant, scalable, and high-throughput way
to store and access large amounts of data.
HDFS is designed to be highly scalable, reliable, and efficient, enabling the storage and processing of massive
datasets. Its architecture consists of several key components:
1. NameNode
2. DataNode
3. Secondary NameNode
4. HDFS Client
5. Block Structure
NameNode
The NameNode is the master server that manages the filesystem namespace and controls access to files by
clients. It performs operations such as opening, closing, and renaming files and directories. Additionally, the
NameNode maps file blocks to DataNodes, maintaining the metadata and the overall structure of the file
system. This metadata is stored in memory for fast access and persisted on disk for reliability.
Key Responsibilities:
• Maintaining the filesystem tree and metadata.
• Managing the mapping of file blocks to DataNodes.
• Ensuring data integrity and coordinating replication of data blocks.
DataNode
DataNodes are the worker nodes in HDFS, responsible for storing and retrieving actual data blocks as instructed by the NameNode. There are multiple DataNodes per cluster, and during pipelined reads and writes the DataNodes communicate with each other. Each DataNode manages the storage attached to it and periodically reports the list of blocks it stores to the NameNode. A DataNode also continuously sends "heartbeat" messages to the NameNode to confirm connectivity; if the heartbeat from a DataNode stops, the NameNode re-replicates that DataNode's blocks elsewhere in the cluster and keeps running as if nothing had happened.
Key Responsibilities:
• Storing data blocks and serving read/write requests from clients.
• Performing block creation, deletion, and replication upon instruction from the NameNode.
• Periodically sending block reports and heartbeats to the NameNode to confirm its status.
Secondary NameNode
The Secondary NameNode acts as a helper to the primary NameNode. It takes snapshots (checkpoints) of the HDFS metadata at intervals specified in the Hadoop configuration, merging the EditLogs with the current filesystem image (FsImage) to reduce the load on the NameNode and to ensure that the filesystem metadata can be recovered in case of a NameNode failure. Since the memory requirements of the Secondary NameNode are the same as those of the NameNode, it is better to run the NameNode and the Secondary NameNode on different machines. In case of failure of the NameNode, the Secondary NameNode can be configured manually to bring up the cluster; however, it does not record the real-time changes that happen to the HDFS metadata.
Key Responsibilities:
• Merging EditLogs with FsImage to create a new checkpoint.
• Helping to manage the NameNode's namespace metadata.
HDFS Client
The HDFS client is the interface through which users and applications interact with the HDFS. It allows for file
creation, deletion, reading, and writing operations. The client communicates with the NameNode to determine
which DataNodes hold the blocks of a file and interacts directly with the DataNodes for actual data read/write
operations.
Key Responsibilities:
• Facilitating interaction between the user/application and HDFS.
• Communicating with the NameNode for metadata and with DataNodes for data access.
Block Structure
HDFS stores files by dividing them into large blocks, typically 128MB or 256MB in size. Each block is stored
independently across multiple DataNodes, allowing for parallel processing and fault tolerance. The NameNode
keeps track of the block locations and their replicas.
Key Features:
• Large block size reduces the overhead of managing a large number of blocks.
• Blocks are replicated across multiple DataNodes to ensure data availability and fault tolerance.
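The NameNode's block and replica bookkeeping can be inspected through Hadoop's Java FileSystem API. The sketch below is illustrative only; it assumes a running HDFS cluster, the client configuration on the classpath, and a file at the hypothetical path /sample/test.txt:

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws IOException {
        // Reads fs.defaultFS from the client configuration; with HDFS this resolves to the NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        FileStatus status = fs.getFileStatus(new Path("/sample/test.txt")); // hypothetical path
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // Each entry is one block; getHosts() lists the DataNodes holding its replicas.
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}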
HDFS Advantages
HDFS offers several advantages that make it a preferred choice for managing large datasets in distributed
computing environments:
Scalability
HDFS is highly scalable, allowing for the storage and processing of petabytes of data across thousands of
machines. It is designed to handle an increasing number of nodes and storage without significant performance
degradation.
Key Aspects:
• Linear scalability allows the addition of new nodes without reconfiguring the entire system.
• Supports horizontal scaling by adding more DataNodes.
Fault Tolerance
HDFS ensures high availability and fault tolerance through data replication. Each block of data is replicated across
multiple DataNodes, ensuring that data remains accessible even if some nodes fail.
Key Features:
• Automatic block replication ensures data redundancy.
• Configurable replication factor allows administrators to balance storage efficiency and fault tolerance.
High Throughput
HDFS is optimized for high-throughput access to large datasets, making it suitable for data-intensive applications.
It allows for parallel processing of data across multiple nodes, significantly speeding up data read and write
operations.
Key Features:
• Supports large data transfers and batch processing.
• Optimized for sequential data access, reducing seek times and increasing throughput.
Cost-Effective
HDFS is designed to run on commodity hardware, significantly reducing the cost of setting up and maintaining a
large-scale storage infrastructure. Its open-source nature further reduces the total cost of ownership.
Key Features:
• Utilizes inexpensive hardware, reducing capital expenditure.
• Open-source software eliminates licensing costs.
Data Locality
HDFS takes advantage of data locality by moving computation closer to where the data is stored. This minimizes
data transfer over the network, reducing latency and improving overall system performance.
Key Features:
• Data-aware scheduling ensures that tasks are assigned to nodes where the data resides.
• Reduces network congestion and improves processing speed.
Reliability and Robustness
HDFS is built to handle hardware failures gracefully. The NameNode and DataNodes are designed to recover from
failures without losing data, and the system continually monitors the health of nodes to prevent data loss.
Key Features:
• Automatic detection and recovery from node failures.
• Regular health checks and data integrity verification.
HADOOP OVERVIEW:
Hadoop is an open-source software framework to store and process massive amounts of data in a distributed fashion on large clusters of commodity hardware. Basically, Hadoop accomplishes two tasks: 1. Massive data storage. 2. Faster data processing.
Key Aspects of Hadoop
Figure 5.7 describes the key aspects of Hadoop.
RDBMS versus HADOOP
HDFS (HADOOP DISTRIBUTED FILE SYSTEM):
Some key Points of Hadoop Distributed File System are as follows:
1. Storage component of Hadoop.
2. Distributed File System.
3. Modeled after Google File System.
4. Optimized for high throughput (HDFS leverages large block size and moves computation where data
is stored).
5. A file is replicated a configurable number of times, which makes HDFS tolerant of both software and hardware failures.
6. Data blocks are automatically re-replicated when a node fails.
Anatomy of File Read
The steps involved in the File Read are as follows:
1. The client opens the file that it wishes to read from by calling open() on the DistributedFileSystem.
2. DistributedFileSystem communicates with the NameNode to get the location of data blocks.
The NameNode returns the addresses of the DataNodes on which the data blocks are stored. Subsequent to this, the DistributedFileSystem returns an FSDataInputStream to the client to read from the file.
3. The client then calls read() on the stream (DFSInputStream), which has the addresses of the DataNodes for the first few blocks of the file, and connects to the closest DataNode for the first block in the file.
4. The client calls read() repeatedly to stream the data from the DataNode.
5. When end of the block is reached, DFSInputStream closes the connection with the DataNode. It
repeats the steps to find the best DataNode for the next block and subsequent blocks.
6. When the client completes the reading of the file, it calls close() on the FSDataInputStream to close
the connection.
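From the application's point of view, all of the above is hidden behind a few FileSystem API calls. A minimal, illustrative Java sketch of a file read follows (the path is hypothetical; a running cluster and client configuration are assumed):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);  // a DistributedFileSystem when fs.defaultFS points to HDFS

        // Steps 1-2: open() asks the NameNode for the block locations of the file.
        FSDataInputStream in = fs.open(new Path("/sample/test.txt")); // hypothetical path
        try {
            // Steps 3-5: read() streams the data block by block from the DataNodes.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            // Step 6: close the stream.
            IOUtils.closeStream(in);
            fs.close();
        }
    }
}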
Anatomy of File Write
Figure 5.19 describes the anatomy of File Write. The steps involved in anatomy of File Write are as follows:
1. The client calls create() on DistributedFileSystem to create a file.
2. An RPC call to the NameNode happens through the DistributedFileSystem to create a new file.
The NameNode performs various checks to create a new file (checks whether such a file exists or
not). Initially, the NameNode creates a file without associating any data blocks to the file. The
DistributedFileSystem returns an FSDataOutputStream to the client to perform the write.
3. As the client writes data, the data is split into packets by DFSOutputStream and written to an internal queue called the data queue. The DataStreamer consumes the data queue and requests
the NameNode to allocate new blocks by selecting a list of suitable DataNodes to store replicas. This list of
DataNodes makes a pipeline. Here, we will go with the default replication factor of three,
so there will be three nodes in the pipeline for the first block.
4. The DataStreamer streams the packets to the first DataNode in the pipeline, which stores each packet and forwards it to the second DataNode in the pipeline. In the same way, the second DataNode stores the packet and forwards
it to the third DataNode in the pipeline.
5. In addition to the internal queue, DFSOutputStream also manages an "Ack queue" of packets that are waiting
for the acknowledgement by DataNodes. A packet is removed from the "Ack queue" only if it is acknowledged by
all the DataNodes in the pipeline.
6. When the client finishes writing the file, it calls close() on the stream.
7. This flushes all the remaining packets to the DataNode pipeline and waits for the relevant acknowledgements before contacting the NameNode to signal that the file is complete.
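As with reading, the packet, pipeline, and acknowledgement handling is internal to the client library; application code only sees a stream. A minimal, illustrative Java sketch of a file write follows (the path and contents are hypothetical; a running cluster and client configuration are assumed):

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Steps 1-2: create() asks the NameNode to create the file entry (no blocks are allocated yet).
        FSDataOutputStream out = fs.create(new Path("/sample/output.txt")); // hypothetical path
        try {
            // Steps 3-5: the data is split into packets and pushed through the DataNode pipeline internally.
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        } finally {
            // Steps 6-7: close() flushes the remaining packets, waits for acks, and completes the file.
            out.close();
            fs.close();
        }
    }
}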
HDFS Commands:
Objective: To get the list of directories and files at the root of HDFS.
Act:
hadoop fs -ls /
Objective: To get the list of complete directories and files of HDFS.
Act:
hadoop fs -ls -R /
Objective: To create a directory (say, sample) in HDFS.
Act:
hadoop fs -mkdir /sample
Objective: To copy a file from local file system to HDFS.
Act:
hadoop fs -put /root/sample/test.txt /sample/test.txt
Objective: To copy a file from HDFS to local file system.
Act:
hadoop fs -get /sample/test.txt /root/sample/testsample.txt
Objective: To copy a file from local file system to HDFS via copyFromLocal command.
Act:
hadoop fs -copyFromLocal /root/sample/test.txt /sample/testsample.txt
Objective: To copy a file from Hadoop file system to local file system via copyToLocal command.
Act:
hadoop fs -copyToLocal /sample/test.txt /root/sample/testsample1.txt
Objective: To display the contents of an HDFS file on console.
Act:
hadoop fs -cat /sample/test.txt
Objective: To copy a file from one directory to another on HDFS.
Act:
hadoop fs -cp /sample/test.txt /sample1
Objective: To remove a directory from HDFS.
Act:
hadoop fs -rm -r /sample1
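The same operations are also available programmatically through the Java FileSystem API. The following is a small illustrative sketch mirroring some of the commands above (it reuses the illustrative /sample paths and assumes a running cluster and client configuration):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/sample"));                                  // hadoop fs -mkdir /sample
        fs.copyFromLocalFile(new Path("/root/sample/test.txt"),
                             new Path("/sample/test.txt"));              // hadoop fs -put ...
        for (FileStatus status : fs.listStatus(new Path("/"))) {         // hadoop fs -ls /
            System.out.println(status.getPath());
        }
        fs.delete(new Path("/sample1"), true);                           // hadoop fs -rm -r /sample1
        fs.close();
    }
}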
DATA INGEST WITH SQOOP:
Sqoop − SQL to Hadoop and Hadoop to SQL
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases.
Sqoop Import
The import tool imports individual tables from RDBMS to HDFS. Each row in a table is treated as a record in HDFS. All records
are stored as text data in text files or as binary data in Avro and Sequence files.
Sqoop Export
The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain records, which are called rows in the table. These are read and parsed into a set of records and delimited with a user-specified delimiter.
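Imports and exports are normally launched from the sqoop command line, but Sqoop 1.x can also be embedded in a Java program. The sketch below is a rough illustration under stated assumptions: it assumes the Sqoop 1.4.x client library (org.apache.sqoop.Sqoop) and a MySQL JDBC driver are on the classpath, and every connection detail (host, database, credentials, table, target directory) is a hypothetical placeholder.

import org.apache.sqoop.Sqoop;

public class SqoopImportSketch {
    public static void main(String[] args) {
        // Equivalent to running "sqoop import ..." on the command line.
        // All connection details below are illustrative placeholders.
        String[] importArgs = new String[] {
            "import",
            "--connect", "jdbc:mysql://dbhost:3306/sales",
            "--username", "sqoopuser",
            "--password", "secret",
            "--table", "employees",
            "--target-dir", "/data/employees",
            "--num-mappers", "4"
        };
        int exitCode = Sqoop.runTool(importArgs); // 0 indicates success
        System.out.println("Sqoop import finished with exit code " + exitCode);
    }
}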
Sqoop Features
Sqoop has several features, which makes it helpful in the Big Data world:
1. Parallel Import/Export
Sqoop uses the YARN framework to import and export data. This provides fault tolerance on top of parallelism.
2. Import Results of an SQL Query
Sqoop enables us to import the results returned from an SQL query into HDFS.
3. Connectors For All Major RDBMS Databases
Sqoop provides connectors for multiple RDBMSs, such as the MySQL and Microsoft SQL servers.
4. Kerberos Security Integration
Sqoop supports the Kerberos computer network authentication protocol, which enables nodes communicating over an insecure network to authenticate users securely.
5. Provides Full and Incremental Load
Sqoop can load the entire table or parts of the table with a single command.
After going through the features of Sqoop, let us understand the Sqoop architecture.
Sqoop Architecture
The architecture of Sqoop works in the following steps:
1. The client submits the import/export command to import or export data.
2. Sqoop fetches data from different databases. Here, we have an enterprise data warehouse, document-based systems, and a relational
database. We have a connector for each of these; connectors help to work with a range of accessible databases.
3. Multiple mappers perform map tasks to load the data on to HDFS.
4. Similarly, numerous map tasks will export the data from HDFS on to RDBMS using the Sqoop export command.
The following section gives an insight into Sqoop import.
Sqoop Import
The Sqoop import mechanism works as follows:
1. In this example, a company's data is present in the RDBMS. Sqoop performs an introspection of the database to gather metadata (such as primary key information).
2. It then submits a map-only job. Sqoop divides the input dataset into splits and uses individual map tasks to push the splits to HDFS.
A few of the commonly used arguments of Sqoop import are shown below:
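• --connect <jdbc-uri>: JDBC connection string of the source database.
• --username / --password: credentials for the database user.
• --table <table-name>: the table to import.
• --target-dir <dir>: HDFS directory into which the data is imported.
• --columns <col1,col2>: import only the specified columns.
• --where <condition>: import only the rows that satisfy the condition.
• -m / --num-mappers <n>: number of parallel map tasks used for the import.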
Having looked at Sqoop import, let us now dive in to understand Sqoop export.
Sqoop Export
Let us understand the Sqoop export mechanism stepwise:
1. The first step is to gather the metadata through introspection.
2. Sqoop then divides the input dataset into splits and uses individual map tasks to push the splits to RDBMS.
Let us now have a look at a few of the commonly used arguments of Sqoop export:
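• --connect / --username / --password: connection details of the target database.
• --table <table-name>: the relational table to populate.
• --export-dir <dir>: HDFS directory containing the data to export.
• --input-fields-terminated-by <char>: field delimiter used in the input files.
• --update-key <column>: update existing rows matched on this column instead of inserting new ones.
• -m / --num-mappers <n>: number of parallel map tasks used for the export.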
After understanding Sqoop import and export, the next section covers the processing that takes place in Sqoop.
Sqoop Processing
Processing takes place step by step, as shown below:
1. Sqoop runs in the Hadoop cluster.
2. It imports data from the RDBMS or NoSQL database to HDFS.
3. It uses mappers to slice the incoming data into splits and loads the data into HDFS.
4. Exports data back into the RDBMS while ensuring that the schema of the data in the database is maintained.
DATA INGEST WITH FLUME:
Apache Flume is an open-source tool for collecting, aggregating, and moving huge amounts of streaming data from the
external web servers to the central store, say HDFS, HBase, etc.
Features of Apache Flume
• Apache Flume is a robust, fault-tolerant, and highly available service.
• It is a distributed system with tunable reliability mechanisms for fail-over and recovery.
• Apache Flume is horizontally scalable.
• Apache Flume supports complex data flows such as multi-hop flows, fan-in flows, fan-out flows, and contextual routing.
• It provides support for large sets of sources, channels, and sinks.
• Apache Flume can efficiently ingest log data from various servers into a centralized repository.
• With Flume, we can collect data from different web servers in real-time as well as in batch mode.
• We can import large volumes of data generated by social networking sites and e-commerce sites into Hadoop DFS using
Apache Flume.
Apache Flume Architecture
Apache Flume has a simple and flexible architecture. Data generators such as Facebook, Twitter, e-commerce sites, or various other external sources generate huge volumes of data, which are collected by individual Flume agents running on them.
A data collector collects data from the agents, aggregates them, and pushes them into a centralized repository such as HBase or
HDFS.
Flume Event
A Flume event is a basic unit of data that needs to be transferred from source to destination.
Flume Agent
A Flume agent is an independent JVM process in Apache Flume. An agent receives events from clients or other Flume agents and passes them on to its next destination, which can be a sink or another agent.
Flume Agent contains three main components. They are the source, channel, and sink.
Source
A source receives data from the data generators. It transfers the received data to one or more channels in the form of events.
Flume provides support for several types of sources.
Example − Exec source, Thrift source, Avro source, twitter 1% source, etc.
Channel
A channel receives the data or events from the flume source and buffers them till the sinks consume them. It is a transient store.
Flume supports different types of channels.
Example − Memory channel, File system channel, JDBC channel, etc.
Sink
A sink consumes data from the channel and stores it in the destination. The destination can be a centralized store or other Flume agents. Example − HDFS sink.
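The source, channel, and sink live inside the agent; client applications typically hand events to an agent's source over Avro using Flume's client SDK. The sketch below is illustrative only: it assumes the flume-ng-sdk library is on the classpath and that an agent with an Avro source is listening on localhost port 41414 (both the address and the agent setup are assumptions, not defaults).

import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientSketch {
    public static void main(String[] args) throws EventDeliveryException {
        // Connect to a Flume agent whose Avro source listens on localhost:41414
        // (illustrative address; configure it to match your agent).
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            // Wrap a log line in a Flume event and hand it to the agent's source;
            // the agent's channel buffers it until the sink (e.g. an HDFS sink) consumes it.
            Event event = EventBuilder.withBody("sample log line", StandardCharsets.UTF_8);
            client.append(event);
        } finally {
            client.close();
        }
    }
}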
Apache Flume – Data Flow
Flume is a tool used for moving log data into HDFS. Apache Flume supports complex data flows. There are three types of data flow in Apache Flume. They are:
1. Multi-hop Flow
Within Apache Flume, there can be multiple agents. So before reaching the final destination, the flume event may travel through
more than one flume agent. This is called a multi-hop flow.
2. Fan-out Flow
The dataflow from one flume source to multiple channels is called fan-out flow. Fan-out flow is of two types − replicating and
multiplexing.
3. Fan-in Flow
The fan-in flow is the data flow where data is transferred from many sources to one channel.
Flume Advantages
1.Apache Flume enables us to store streaming data into any of the centralized repositories (such as HBase, HDFS).
2. Flume provides a steady data flow between producer and consumer during read/write operations.
3. Flume supports the feature of contextual routing.
4. Apache Flume guarantees reliable message delivery.
5. Flume is reliable, scalable, extensible, fault-tolerant, manageable, and customizable.
Apache Flume Applications
1. Apache Flume is used by e-commerce companies to analyze customer behavior from a particular
region.
2. We can use Apache Flume to move huge amounts of data generated by application servers into
the Hadoop Distributed File System at a higher speed.
3. Apache Flume is used for fraud detection.
4. We can use Apache Flume in IoT applications.
5. Apache Flume can be used for aggregating machine and sensor-generated data.
6. We can use Apache Flume in alerting or SIEM systems.
7. Flume specializes in collecting and aggregating log data from various sources and sending it to
centralized storage like HDFS.
8. Apache Flume can easily integrate with different data streaming tools like Apache Spark.