Big Data Analytics 21CS71
MODULE 2
                                    Introduction to Hadoop (T1)
Introduction to Hadoop
Hadoop is an Apache open-source framework written in Java that enables the distributed processing
of large datasets across clusters of computers using simple programming models. It allows
applications to work in an environment that supports distributed storage and computation. Hadoop is
scalable, meaning it can grow from a single server to thousands of machines, each providing local
computation and storage. It is designed to handle Big Data and enable efficient processing of massive
datasets.
Big Data Store Model
The Big Data store model in Hadoop is based on a distributed file system. Data is stored in blocks,
which are physical divisions of data spread across multiple nodes. The architecture is organized in
clusters and racks:
       Data Nodes: Store data in blocks.
       Racks: Collections of DataNodes housed together; the cluster is scaled out by adding racks.
       Clusters: Racks are grouped into clusters to form the overall storage and processing system.
Hadoop ensures reliability by replicating data blocks across nodes. If a data link or node fails, the
system can still access the replicated data from other nodes.
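The block and replication layout described above can be inspected programmatically. Below is a minimal Java sketch (the file path and cluster settings are assumptions) that uses the HDFS FileSystem API to show which DataNodes hold each block of a file, along with its replication factor:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // connects to the file system in fs.defaultFS
        Path file = new Path("/data/example.txt");     // hypothetical HDFS path

        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block; getHosts() lists the DataNodes holding its replicas
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset() + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }
        System.out.println("replication factor: " + status.getReplication());
        fs.close();
    }
}

Running this against a multi-node cluster shows each block reported on several hosts, which is exactly the replication that keeps data reachable when a node or link fails.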
Big Data Programming Model
In Hadoop's Big Data programming model, jobs and tasks are scheduled to run on the same servers
where the data is stored, minimizing data transfer time. This programming model is enabled by
MapReduce, a powerful tool that divides processing tasks into smaller subtasks that can be executed
in parallel across the cluster.
Example of Jobs in Hadoop
       Query Processing: A job that processes queries on datasets and returns results to an
        application.
       Sorting Data: Sorting performance data from an examination or another large dataset.
Hadoop and Its Ecosystem
The Hadoop framework was developed as part of an Apache project for Big Data storage and
processing, initiated by Doug Cutting and Mike Cafarella. The name Hadoop came from Cutting’s
son, who named his stuffed toy elephant "Hadoop."
Hadoop has two main components:
    1. Hadoop Distributed File System (HDFS): A system for storing data in blocks across
       clusters.
    2. MapReduce: A computational framework that processes data in parallel across the clusters.
Hadoop is written primarily in Java, with some native code in C, and the utilities are managed using
shell scripts. The framework runs on clusters of commodity hardware or on cloud-based infrastructure,
making it a cost-effective solution for managing and processing terabytes of data in minutes.
Characteristics of Hadoop
Hadoop offers several key advantages for managing Big Data:
      Scalable: Easily scales from a few machines to thousands.
      Self-manageable: Requires minimal manual intervention for management.
      Self-healing: Automatically manages node failures by replicating data.
      Distributed File System: Ensures reliable storage and quick access to large datasets.
Hadoop Core Components
The Apache Hadoop framework is made up of several core components, which work together to store
and process large datasets in a distributed computing environment. The core components of Hadoop
are as follows:
   1. Hadoop Common:
           o   Description: This is the foundational module that contains the libraries and utilities
               required by other Hadoop components. It provides various common services like file
               system and input/output operations, serialization, and Remote Procedure Calls
               (RPCs).
           o   Features:
                      Common utilities shared across the Hadoop modules.
                      File-based data structures.
                      Essential interfaces for interacting with the distributed file system.
   2. Hadoop Distributed File System (HDFS):
           o   Description: HDFS is a Java-based distributed file system designed to run on
               commodity hardware. It allows Hadoop to store large datasets by distributing data
               blocks across multiple machines (nodes) in the cluster.
           o   Features:
                      Data is stored in blocks and replicated for fault tolerance.
                      Highly scalable and reliable.
                      Optimized for batch processing and provides high throughput for data access.
    3. MapReduce v1:
            o   Description: MapReduce v1 is a programming model that allows for the processing
                of large datasets in parallel across multiple nodes. The model divides a job into
                smaller sub-tasks, which are then executed across the cluster.
            o   Features:
                       Jobs are divided into Map tasks and Reduce tasks.
                       Suitable for batch processing large sets of data.
    4. YARN (Yet Another Resource Negotiator):
            o   Description: YARN is responsible for managing computing resources in Hadoop. It
                schedules and manages jobs and sub-tasks by allocating resources to applications and
                ensuring they run efficiently in a distributed environment.
            o   Features:
                       Resource management for Hadoop clusters.
                       Parallel execution of tasks across clusters.
                       Supports dynamic allocation of resources to applications.
    5. MapReduce v2:
            o   Description: An updated version of MapReduce that operates under the YARN
                architecture. It improves resource management and scalability compared to
                MapReduce v1.
            o   Features:
                       YARN-based system for distributed parallel processing.
                       Allows better resource allocation for running large applications.
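As a small illustration of how these components appear to a programmer, the sketch below uses Hadoop Common's Configuration class, which loads core-default.xml and core-site.xml and is shared by HDFS, MapReduce, and YARN. The printed values depend on the cluster's site files, so treat this as a non-authoritative example:

import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
    public static void main(String[] args) {
        // Hadoop Common's Configuration loads core-default.xml and core-site.xml automatically
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");      // optionally pull in HDFS settings as well

        System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS"));
        System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
    }
}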
Features of Hadoop
Hadoop has several features that make it an essential tool for handling Big Data:
    1. Scalability and Modularity:
            o   Hadoop is highly scalable, meaning you can add more nodes to the cluster as data
                grows.
            o   Its modular design allows components to be easily added or replaced.
    2. Robust HDFS:
            o   The Hadoop Distributed File System (HDFS) is designed to handle large-scale data
                reliably.
            o   Data is replicated (default: three copies), ensuring backup and recovery in case of
                node failures.
    3. Big Data Processing:
           o   Hadoop processes Big Data characterized by the 3Vs: Volume, Variety, and
               Velocity.
   4. Distributed Cluster Computing with Data Locality:
           o   Hadoop optimizes processing by running tasks on the same nodes where the data is
               stored, enhancing efficiency.
           o   High-speed processing is achieved by distributing tasks across multiple nodes in a
               cluster.
   5. Fault Tolerance:
           o   Hadoop automatically handles hardware failures. If a node fails, the system recovers
               by using data replicated across other nodes.
   6. Open-Source Framework:
           o   Hadoop is open-source, making it cost-effective for handling large data workloads. It
               can run on inexpensive hardware and cloud infrastructure.
   7. Java and Linux Based:
           o   Hadoop is built in Java and runs primarily on Linux. It also includes its own set of
               shell commands for easy management.
Hadoop Ecosystem Components
Hadoop's ecosystem consists of multiple layers, each responsible for different aspects of storage,
resource management, processing, and application support. The key components are:
   1. Distributed Storage Layer:
           o   HDFS: Manages the distributed storage of large datasets.
   2. Resource Manager Layer:
           o   YARN: Manages and schedules the distribution of resources for jobs and sub-tasks in
               the cluster.
   3. Processing Framework Layer:
           o   MapReduce: Processes data in parallel by dividing jobs into Mapper and Reducer
               tasks.
   4. APIs at the Application Support Layer:
           o   Provides application interfaces for interacting with the Hadoop ecosystem.
This layered architecture enables Hadoop to efficiently store, manage, and process vast amounts of
data, making it an essential tool for organizations working with Big Data.
HDFS Data Storage
   1. Data Distribution in Clusters:
           o   Hadoop's storage concept involves distributing data across a cluster. A cluster
               consists of multiple racks, and each rack contains several DataNodes.
           o   DataNodes are responsible for storing the actual data blocks, while the NameNode
               manages the file system metadata and keeps track of where the data is stored.
   2. Data Blocks:
           o   HDFS breaks down large files into data blocks. Each block is stored independently
               across various DataNodes.
           o   By default, HDFS stores replicas of each data block on multiple DataNodes to
               ensure data availability even if some nodes fail.
            o   Default block size: 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later; the
                block size is configurable (for example, 256 MB).
   3. Rack Awareness:
           o   HDFS is aware of the physical distribution of nodes across racks.
            o   When replicating blocks, Hadoop places replicas on more than one rack to improve
                fault tolerance, while keeping some replicas on the same rack to limit cross-rack
                network traffic.
   4. Fault Tolerance:
           o   The replication of blocks ensures that data is not lost if a node goes down. The
               default replication factor is 3, meaning that each block is replicated across three
               different nodes.
           o   In the event of a DataNode failure, the NameNode automatically re-replicates the
               missing blocks on other DataNodes.
   5. Processing and Storage:
           o   DataNodes not only store data but also have the capability to process the data stored
               in them. This enables distributed processing and allows Hadoop to process large
               datasets efficiently across clusters.
   6. Data Block Management:
           o   When a file is uploaded to HDFS, it is split into blocks. Each block is distributed
               across different nodes to optimize read and write performance.
            o   Blocks are immutable: once written, they cannot be modified. Data can only be
                appended to the end of a file; existing content cannot be altered in place.
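A minimal Java sketch of these storage rules follows; the file path is hypothetical and it assumes append support is enabled on the cluster (the default in current HDFS releases). It creates a file with an explicit replication factor and block size and then appends to it, since existing blocks cannot be rewritten in place:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteThenAppend {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/data/log.txt");                          // hypothetical path

        // create(path, overwrite, bufferSize, replication, blockSize):
        // ask for 3 replicas and 128 MB blocks for this particular file
        FSDataOutputStream out = fs.create(p, true, 4096, (short) 3, 128L * 1024 * 1024);
        out.writeBytes("first record\n");
        out.close();

        // Blocks are immutable: existing bytes cannot be rewritten,
        // but new data may be appended at the end of the file.
        FSDataOutputStream app = fs.append(p);
        app.writeBytes("appended record\n");
        app.close();
        fs.close();
    }
}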
Hadoop Physical Organization
In a Hadoop cluster, nodes are divided into MasterNodes and SlaveNodes.
MasterNodes:
      MasterNodes (or simply Masters) are responsible for coordinating the operations within the
       cluster. These nodes handle the overall management of the Hadoop environment, including
       storage and task distribution.
      Key MasterNodes:
           1. NameNode: The central node that manages the file system's metadata, such as file
              block locations, permissions, and access times. It plays a crucial role in managing
              data within HDFS.
            2. Secondary NameNode: Periodically merges the NameNode's edit log with the file
               system image to produce updated metadata checkpoints. Despite its name, it is not a
               failover mechanism and is not a complete replacement for the NameNode.
           3. JobTracker: Oversees the allocation of MapReduce tasks to various nodes and
              ensures job completion by managing job execution across the cluster.
SlaveNodes:
       SlaveNodes (or DataNodes and TaskTrackers) store the actual data blocks and execute
       computational tasks. Each node has a significant amount of disk space and is responsible for
       both data storage and processing.
           o   DataNodes handle the storage and management of data blocks.
           o   TaskTrackers execute the processing tasks sent by the MasterNode and return the
               results.
Physical Distribution of Nodes:
      A typical Hadoop cluster consists of many DataNodes that store data, while MasterNodes
       handle administrative tasks. In a large cluster, multiple MasterNodes are used to balance the
       load and ensure redundancy.
Client-Server Interaction:
      Clients interact with the Hadoop system by submitting queries or applications through various
       Hadoop ecosystem projects, such as Hive, Pig, or Mahout.
      The MasterNode coordinates with the DataNodes to store data and process tasks. For
       example, it organizes how files are distributed across the cluster, assigns jobs to the nodes,
       and monitors the health of the system.
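To make the master/slave split concrete, the hedged sketch below asks the NameNode, through the HDFS client class DistributedFileSystem, for the list of live DataNodes it is currently tracking. It assumes the default file system configured for the client is HDFS:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ListDataNodes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // The NameNode keeps the cluster membership; this call returns
            // one entry per live DataNode (slave node) it knows about.
            for (DatanodeInfo dn : dfs.getDataNodeStats()) {
                System.out.println(dn.getHostName() + " capacity=" + dn.getCapacity()
                        + " used=" + dn.getDfsUsed());
            }
        }
        fs.close();
    }
}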
Hadoop MapReduce Framework and Programming Model
MapReduce is the primary programming model used for processing large datasets in Hadoop. The
framework is divided into two main functions:
   1. Map Function:
           o   The Map function organizes the data into key/value pairs.
           o   Each mapper works on a subset of the data blocks and produces intermediate results
               that are used by the Reduce function.
           o   Mapping distributes the task across different nodes in the cluster, where each node
               processes its portion of the data.
   2. Reduce Function:
           o   The Reduce function takes the intermediate key/value pairs generated by the Map
               function and processes them to produce a final aggregated result.
           o   It applies aggregation, queries, or other functions to the mapped data, reducing it
               into a smaller, cohesive set of results.
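The classic concrete example of these two functions is word counting. The sketch below is a minimal version under simple assumptions (plain-text input, whitespace tokenization): the Mapper emits (word, 1) pairs and the Reducer sums the counts for each word:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: one input line -> (word, 1) pairs
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                ctx.write(word, ONE);              // intermediate key/value pair
            }
        }
    }
}

// Reduce: (word, [1, 1, ...]) -> (word, total count)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        ctx.write(word, new IntWritable(sum));     // final aggregated result
    }
}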
Hadoop MapReduce Execution Process
The MapReduce job execution involves several steps:
   1. Job Submission:
           o   A client submits a request to the JobTracker, which estimates the required resources
               and prepares the cluster for execution.
   2. Task Assignment:
           o   The JobTracker assigns Map tasks to nodes that store the relevant data. This is
               called data locality, which reduces network overhead.
   3. Monitoring:
           o   The progress of each task is monitored, and if any task fails, it is restarted on a
               different node with available resources.
   4. Final Output:
            o   After the Map and Reduce tasks are completed, the results are serialized and
                transferred back to the client, typically using serialization formats such as Apache
                Avro.
MapReduce Programming Model
MapReduce programs can be written in various languages, including Java, C++, and Python. The
basic structure of a MapReduce program includes:
   1. Input Data:
           o   Data is typically stored in HDFS in files or directories, either structured or
               unstructured.
   2. Map Phase:
           o   The map function processes the input data by breaking it down into key/value pairs.
               Each key/value pair is passed to the reduce phase after mapping.
   3. Reduce Phase:
           o   The reduce function collects the output of the map phase and reduces the data by
               aggregating, sorting, or applying user-defined functions.
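A small driver class, sketched below, wires these three parts together. It reuses the hypothetical WordCountMapper and WordCountReducer classes from the earlier sketch and takes the HDFS input and output paths as command-line arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");        // submitted to the cluster
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);            // Map phase
        job.setReducerClass(WordCountReducer.class);          // Reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0])); // input data in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);     // block until the job finishes
    }
}

The driver is packaged into a JAR and submitted to the cluster, after which the framework splits the input, schedules the map and reduce tasks, and writes the results to the output directory.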
Hadoop YARN: Resource Management and Execution Model
YARN (Yet Another Resource Negotiator) is a resource management framework used in Hadoop for
managing and scheduling compute resources in a distributed environment. YARN separates the job
processing function from resource management, improving scalability and efficiency.
Components in YARN:
   1. Resource Manager (RM):
           o   The Resource Manager is the master node in the YARN architecture. There is one
               RM per cluster, and it is responsible for:
                     Managing the overall resources of the cluster.
                     Handling job submissions from clients.
                     Monitoring the availability of node resources (Node Managers).
                     Allocating resources to the applications.
   2. Node Manager (NM):
           o   The Node Manager is a slave component running on each cluster node. It manages
               the individual node's resources and keeps the RM informed of its status.
               Responsibilities include:
                     Monitoring the resource usage (CPU, memory) of containers running on the
                      node.
                     Starting and stopping containers (which run the actual tasks).
                     Sending periodic heartbeat signals to the RM to indicate its availability.
   3. Application Master (AM):
           o   The Application Master is created for each job submitted to YARN. It handles the
               life cycle of an individual application. Its tasks include:
                     Requesting resources (containers) from the RM.
                     Coordinating the execution of tasks across containers.
                     Monitoring task completion and handling failures.
   4. Containers:
           o   Containers are the basic unit of resource allocation in YARN. Each container is a
               collection of resources (memory, CPU) on a single node, assigned by the RM to the
               Application Master for executing tasks.
           o   Containers run the actual tasks of the application in parallel, distributed across
               multiple nodes.
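To make the RM/NM relationship concrete, the hedged Java sketch below uses the YarnClient API to ask the Resource Manager for a report of the Node Managers that are currently sending heartbeats, together with their container counts and resource capabilities. The RM address is taken from yarn-site.xml:

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListNodeManagers {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(conf);                 // contacts the Resource Manager named in yarn-site.xml
        yarn.start();

        // One NodeReport per Node Manager currently heartbeating to the RM
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport n : nodes) {
            System.out.println(n.getNodeId() + " containers=" + n.getNumContainers()
                    + " capability=" + n.getCapability());
        }
        yarn.stop();
    }
}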
YARN-Based Execution Model
The YARN execution model consists of several steps involving the interaction between different
components. Below is a breakdown of the actions in the YARN resource allocation and scheduling
process:
   1. Client Submission:
           o   The Client Node submits a request for an application or job to the Resource
               Manager (RM). The RM then takes responsibility for managing and executing the
               job.
   2. Job History Server:
           o   The Job History Server keeps track of all the jobs that have been completed in the
               cluster. This helps in maintaining job execution history for future analysis or
               debugging.
    3. Node Manager Startup:
            o   In a YARN cluster, multiple Node Managers (NMs) run, one per node, and each
                registers with the RM. For every submitted application, the RM asks one NM to
                launch a container that hosts the Application Master (AM). The AM is responsible
                for managing the lifecycle of that application and requesting resources.
    4. Application Master Initialization:
            o   Once the AM instance is running, it registers itself with the RM. The AM then
                evaluates the resource requirements for the submitted job and requests the necessary
                containers.
   5. Resource Allocation:
           o   The RM analyzes the resource availability in the cluster by tracking heartbeat signals
               from active NMs. The RM allocates the required containers across different nodes
               based on the resource requests from the Application Master.
    6. Container Assignment:
            o   The RM grants the requested containers to the AM, and the corresponding NMs
                launch them. Containers may be located on the same NM or spread across different
                NMs, depending on resource availability. The Application Master uses the assigned
                containers to execute the sub-tasks of the application.
   7. Execution of Application Sub-Tasks:
           o   Once the containers are assigned, the Application Master coordinates the execution
               of sub-tasks across the allocated containers. The job's tasks run in parallel on different
               containers, utilizing the distributed nature of the Hadoop cluster.
   8. Resource Monitoring:
            o   During job execution, each NM monitors the resource utilization of its containers,
                while the AM tracks task progress. If a task fails, the AM requests a replacement
                container from the RM and the failed task is re-run.
Hadoop Ecosystem Tools
1. Zookeeper:
Zookeeper is a centralized coordination service for distributed applications. It provides a reliable,
efficient way to manage configuration, synchronization, and name services across distributed systems.
Zookeeper maintains its data in a hierarchy of small nodes called znodes, ensuring that distributed
systems function cohesively. Its main coordination services include:
       Name Service: Similar to DNS, it maps names to information, tracking servers or services
        and checking their statuses.
       Concurrency Control: Manages concurrent access to shared resources, preventing
        inconsistencies and ensuring that distributed processes run smoothly.
       Configuration Management: A centralized configuration manager that updates nodes with
        the current system configuration when they join the system.
       Failure Management: Automatically recovers from node failures by selecting alternative
        nodes to take over processing tasks.
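A minimal Java sketch of these coordination services is shown below; the ensemble address and the znode path are assumptions. It stores a small configuration value in a znode and reads it back, which is the same pattern used for the name service and configuration management described above:

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect to a ZooKeeper ensemble (host:port is an assumption)
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();                                     // wait until the session is established

        String path = "/demo-config";                          // hypothetical znode
        byte[] value = "jdbc:mysql://db-host/sales".getBytes(StandardCharsets.UTF_8);

        // Znodes act like a small replicated name/configuration service:
        if (zk.exists(path, false) == null) {
            zk.create(path, value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        System.out.println(new String(zk.getData(path, false, null), StandardCharsets.UTF_8));
        zk.close();
    }
}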
2. Oozie:
Apache Oozie is a workflow scheduler for Hadoop that manages and coordinates complex jobs and
tasks in big data processing. Oozie allows you to create, schedule, and manage multiple workflows. It
organizes jobs into Directed Acyclic Graphs (DAGs) and supports:
       Integration of Multiple Jobs: Oozie integrates MapReduce, Hive, Pig, and Sqoop jobs in a
        sequential workflow.
       Time and Data Triggers: Automatically runs workflows based on time or specific data
        availability.
       Batch Management: Manages the timely execution of thousands of jobs in a Hadoop cluster.
Oozie is efficient for automating and scheduling repetitive jobs, simplifying the management of
multiple workflows.
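Workflows are usually defined in a workflow.xml (the DAG definition) stored in HDFS and submitted to the Oozie server. The hedged Java sketch below uses the OozieClient API to submit and start such a workflow; the server URL, HDFS paths, and the nameNode/jobTracker property names are assumptions that must match the workflow definition:

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        // URL of the Oozie server is an assumption
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties props = oozie.createConfiguration();
        // Path to a workflow.xml (the DAG definition) already stored in HDFS
        props.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/wf");
        props.setProperty("nameNode", "hdfs://namenode:8020");
        props.setProperty("jobTracker", "resourcemanager:8032");

        String jobId = oozie.run(props);                       // submit and start the workflow
        WorkflowJob info = oozie.getJobInfo(jobId);
        System.out.println(jobId + " -> " + info.getStatus());
    }
}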
3. Sqoop:
Apache Sqoop is a tool used for efficiently importing and exporting large amounts of data between
Hadoop and relational databases. It uses the MapReduce framework to parallelize data transfer
tasks. The workflow of Sqoop includes:
       Command-Line Parsing: Sqoop processes the arguments passed through the command line
        and prepares map tasks.
       Data Import and Export: Data from external databases is distributed across multiple
        mappers. Each mapper connects to the database using JDBC to fetch and import the data into
        Hadoop, HDFS, Hive, or HBase.
       Parallel Processing: Sqoop leverages Hadoop's parallel processing to transfer data quickly
        and efficiently. It also provides fault tolerance and schema definition for data import.
Sqoop's ability to handle structured data makes it an essential tool for integrating relational databases
with the Hadoop ecosystem.
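Sqoop is normally driven from the command line, but the same tool can also be invoked from Java. The sketch below is a hedged example of an import (the connection string, credentials, and table name are assumptions) that asks for four parallel map tasks, each fetching part of the table over JDBC:

import org.apache.sqoop.Sqoop;

public class ImportOrders {
    public static void main(String[] args) {
        // Same arguments as the sqoop command-line tool; connection details are assumptions
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://db-host/sales",
            "--username", "etl", "--password-file", "/user/etl/.pw",
            "--table", "orders",
            "--target-dir", "/data/orders",
            "--num-mappers", "4"             // four parallel map tasks read via JDBC
        };
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}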
4. Flume:
Apache Flume is a service designed for efficiently collecting, aggregating, and transferring large
volumes of streaming data into Hadoop, particularly into HDFS. It's highly useful for applications
involving continuous data streams, such as logs, social media feeds, or sensor data. Key components
of Flume include:
       Sources: These collect data from servers or applications.
       Sinks: These store the collected data into HDFS or another destination.
       Channels: These act as a buffer, holding event data (typically 4 KB in size) between sources
        and sinks.
       Agents: JVM processes that host sources, channels, and sinks. Interceptors can filter or
        modify events before they are written to the target.
Flume is reliable and fault-tolerant, providing a robust solution for handling massive, continuous data
streams.
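Applications typically push events into Flume through an Avro source. The hedged Java sketch below uses Flume's RpcClient to send one log event to an agent; the agent host and port are assumptions and must match the port on which the agent's source is listening:

import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class SendLogEvent {
    public static void main(String[] args) throws Exception {
        // Connect to a Flume agent whose Avro source listens on this host/port (assumption)
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        try {
            Event event = EventBuilder.withBody(
                    "2024-01-01 12:00:00 INFO user login", StandardCharsets.UTF_8);
            client.append(event);            // the source hands the event to a channel,
                                             // and a sink later writes it to HDFS
        } finally {
            client.close();
        }
    }
}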
----------------------------------------END OF MODULE 2-------------------------------------------------