Unit-2      INTRODUCTION TO HADOOP AND HADOOP ARCHITECTURE
Big Data – Apache Hadoop & Hadoop Ecosystem
Hadoop is an open-source framework provided by Apache to process and
analyze very large volumes of data. It is written in Java and is used by
companies such as Facebook, LinkedIn, Yahoo, and Twitter.
This unit covers the main topics of Big Data Hadoop, including HDFS,
MapReduce, YARN, Hive, HBase, Pig, Sqoop, etc.
Apache Hadoop is an open-source software framework used for distributed
storage and processing of large datasets across clusters of computers using
simple programming models. It's designed to scale up from single servers
to thousands of machines, each offering local computation and storage.
Hadoop consists of four main modules:
1. Hadoop Common: A set of common utilities and libraries that support
other Hadoop modules.
2. Hadoop Distributed File System (HDFS): A distributed file system that
stores data across multiple machines. It provides high-throughput access to
application data and is designed to be fault-tolerant.
3. Hadoop YARN (Yet Another Resource Negotiator): A resource
management layer responsible for managing resources and scheduling
applications on the Hadoop cluster.
4. Hadoop MapReduce: A programming model and processing engine for
large-scale data processing. It allows users to write applications that process
large amounts of data in parallel across a distributed cluster.
Hadoop is widely used in industries such as finance, healthcare, advertising,
and social media for tasks like log processing, data warehousing, machine
learning, and more. It's known for its scalability, fault tolerance, and ability
to handle diverse types of data.
Hadoop Ecosystem
The Hadoop Ecosystem is a group of software tools and frameworks built
around the core components of Apache Hadoop. It provides the
infrastructure needed to store, process, and analyze large amounts of data
by distributing data and processing tasks across clusters of computers.
  Hadoop Ecosystem Components
The Hadoop Ecosystem is composed of several components that work
together to enable the storage and analysis of data.
The table below lists the components that collectively form the Hadoop
Ecosystem.

Components              Description
HDFS                    Hadoop Distributed File System
YARN                    Yet Another Resource Negotiator
MapReduce               Programming-based data processing
Spark                   In-memory data processing
Pig, Hive               Query-based processing of data services
HBase                   NoSQL database
Mahout, Spark MLlib     Machine learning algorithm libraries
Zookeeper               Managing the cluster
Oozie                   Job scheduling
 Now we will learn about each of the components in detail.
 Hadoop Distributed File System
       • HDFS is the primary storage system in the Hadoop Ecosystem.
         •   It is a distributed file system that provides reliable and scalable
             storage of large datasets across multiple computers.
         •   HDFS divides data into blocks and distributes them across the
             cluster for fault tolerance and high availability.
         •   It consists of two basic components:
                     o NameNode
                     o DataNode
         •   The NameNode is the primary (master) node. It stores only
             metadata, so it requires comparatively fewer resources than the
             DataNodes, which store the actual data.
         •   It maintains all the coordination between the cluster and the hardware.
HDFS Architecture
The main purpose of HDFS is to ensure that data is preserved even in the
event of failures such as NameNode failures, DataNode failures, and
network partitions.
HDFS uses a master/slave architecture, where one device (master) controls
one or more other devices (slaves).
Important points about HDFS architecture:
       1. Files are split into fixed-size chunks and replicated across
          multiple DataNodes.
       2. The NameNode contains file system metadata and coordinates
          data access.
       3. Clients interact with HDFS through APIs to read, write, and
          delete files.
       4. DataNodes send heartbeats to the NameNode to report status
          and block information.
       5. HDFS is rack-aware and places replicas on different racks for
          fault tolerance. Checksums are used for data integrity to ensure
          the accuracy of stored data.
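Clients normally reach HDFS through the Java FileSystem API. Below is a minimal sketch of writing and reading a file; the NameNode address hdfs://namenode:9000 and the path /user/demo/hello.txt are placeholders for illustration only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in practice this comes from core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits it into blocks and replicates them
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");
        }

        // Read the file back through the NameNode/DataNode pipeline
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}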
YARN
       •   YARN (Yet Another Resource Negotiator) helps manage resources
           across the cluster.
       •   It has three main components:
                  o   Resource Manager
                  o   Node Manager
                  o   Application Master
       •   The Resource Manager allocates resources to the applications
           running in the system.
       •   The Node Manager manages resources such as CPU, memory, and
           bandwidth on each machine and reports this back to the
           Resource Manager.
       •   The Application Master negotiates resources with the Resource
           Manager and works with the Node Managers according to the
           application's requirements.
YARN Architecture
Key points about the YARN architecture:
       •   The Resource Manager has the authority to allocate resources to
           the applications in the system.
       •   The Node Managers allocate resources such as CPU, memory, and
           bandwidth on each machine and later report back to the
           Resource Manager.
       •   The Application Master acts as an interface between the
           Resource Manager and the Node Managers, negotiating resources
           according to the application's needs.
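For illustration, the YARN client API can be used to list the Node Managers and the resources the Resource Manager sees on each of them. This is a minimal sketch, assuming a configured Hadoop client (i.e. yarn-site.xml on the classpath); the output is only informational.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ClusterNodes {
    public static void main(String[] args) throws Exception {
        // YarnClient talks to the Resource Manager defined in yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // List the running Node Managers and the resources they offer
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " -> " + node.getCapability());
        }
        yarnClient.stop();
    }
}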
MapReduce
     • MapReduce is a programming model and processing framework
        that enables parallel processing of large data sets.
       •   MapReduce works with big data by splitting a job into two
           phases, Map and Reduce, whose tasks can run in parallel.
       •   Map tasks process the data and produce intermediate results.
       •   The intermediate results are then combined by the reduce tasks
           to produce the final output.
         •   MapReduce makes use of two functions, Map() and Reduce():
                    o   Map() sorts and filters the data, organizing it into
                        groups.
                    o   Map() produces results in the form of key-value pairs,
                        which are later processed by the Reduce() method.
                    o   Reduce() performs summarization by aggregating the
                        related data.
         •   Simply put, Reduce() takes the output produced by Map() as
             input and combines those tuples into a smaller set of tuples.
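For example, the classic word-count job implements Map() and Reduce() exactly as described above. Below is a minimal sketch in Java (the class names are illustrative); a driver that configures and submits such a job is sketched later in this unit.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map(): split each input line into words and emit (word, 1) pairs
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // intermediate key-value pair
                }
            }
        }
    }

    // Reduce(): sum the counts for each word and emit (word, total)
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));   // final key-value pair
        }
    }
}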
Apache Hive
      • Hive provides a data warehousing infrastructure based on
         Hadoop.
        •   It provides a SQL-like query language called HiveQL. It allows
            users to query, analyze and manage large datasets stored in
            Hadoop.
         •   Hive translates queries into MapReduce jobs (or jobs for other
             execution engines), enabling data summarization, ad-hoc
             queries, and data analysis.
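HiveQL can be submitted from Java through the Hive JDBC driver. Below is a minimal sketch, assuming HiveServer2 is reachable at localhost:10000 and a table named logs already exists; the host, port, credentials, and table are placeholders for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // The Hive JDBC driver must be on the classpath
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Connect to HiveServer2 (host/port/user are placeholders)
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = con.createStatement();

        // HiveQL looks like SQL; Hive compiles it into MapReduce (or Tez/Spark) jobs
        ResultSet rs = stmt.executeQuery(
                "SELECT level, COUNT(*) FROM logs GROUP BY level");
        while (rs.next()) {
            System.out.println(rs.getString(1) + " : " + rs.getLong(2));
        }
        con.close();
    }
}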
Apache Pig
      • Pig is a high-level scripting language and platform for simplifying
         data processing tasks in Hadoop.
        •   It provides a language called Pig Latin. It allows users to express
            data transformations and analytical operations.
        •   Pig optimizes these operations and transforms them into
            MapReduce jobs for execution.
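Pig Latin can also be run from Java through the PigServer API. Below is a minimal sketch in local mode; the input file students.txt and its tab-separated (name, marks) schema are assumptions made only for this example.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; ExecType.MAPREDUCE runs on the cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Pig Latin: load, filter, and store; Pig compiles this into MapReduce jobs
        pig.registerQuery("students = LOAD 'students.txt' AS (name:chararray, marks:int);");
        pig.registerQuery("toppers = FILTER students BY marks >= 60;");
        pig.store("toppers", "toppers_out");

        pig.shutdown();
    }
}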
HBase
        •   HBase is a distributed columnar NoSQL database that runs on
            Hadoop.
        •   It provides real-time random read/write access to large datasets.
        •   HBase is suitable for applications that require low-latency access
            to data.
       • For example: real-time analytics, time-series data, and Online
         Transaction Processing Systems (OLTP).
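The HBase Java client provides this row-level random access. Below is a minimal sketch, assuming an HBase table named events with a column family info already exists; the table, column, and row names are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("events"))) {

            // Write one cell: row key "row1", column info:status
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("status"), Bytes.toBytes("OK"));
            table.put(put);

            // Random, low-latency read of the same row
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("status"))));
        }
    }
}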
Apache Spark
      • Spark is a fast and versatile cluster computing system that
         extends the capabilities of Hadoop.
       •   It offers in-memory processing, enabling faster data processing
           and iterative analysis.
       • Spark supports batch processing, real-time stream processing,
         and interactive data analysis. Thus, making it a versatile tool in
         the Hadoop Ecosystem.
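Spark jobs can be written in Java as well. Below is a minimal sketch of an in-memory word count in local mode; the input file path is taken from the command line and is an assumption of this example.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile(args[0]);

            // Transformations are chained and executed in memory
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.take(10).forEach(t -> System.out.println(t._1() + " : " + t._2()));
        }
    }
}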
Apache Kafka
      • Kafka is a distributed streaming platform that enables the
         ingestion and processing of real-time data streams.
       •   It provides a publish/subscribe model for streaming data,
           allowing applications to consume and process data as it is
           generated.
       • Kafka is commonly used to build real-time data pipelines, event-
         driven architectures, and streaming analytics applications.
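A Kafka producer publishes records to a topic that downstream consumers subscribe to. Below is a minimal sketch, assuming a broker at localhost:9092 and a topic named clicks; both are placeholders.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one event; consumers subscribed to "clicks" receive it
            producer.send(new ProducerRecord<>("clicks", "user42", "page=/home"));
        }
    }
}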
Apache Sqoop
      • Sqoop is a Hadoop tool that makes it easy to move data
         between Hadoop and structured databases.
       • This tool helps connect traditional databases with the Hadoop
         Ecosystem.
Apache Flume
       •   Flume makes it easier to collect large amounts of live
           (streaming) data and send it to Hadoop.
       •   It helps ingest data from different sources such as log files,
           social media, and sensors.
Hadoop Performance Optimization
Here are some methods for optimizing the Hadoop Ecosystem for Big Data workloads:
Take Advantage of HDFS Block Size Optimization
       • Configure the HDFS block size based on the typical size of your
          data files.
       •  Larger block sizes improve performance for reading and writing
         large files, while smaller block sizes are beneficial for smaller
         files.
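The cluster-wide default block size is set in hdfs-site.xml (property dfs.blocksize), but it can also be overridden by the client for the files it writes. Below is a minimal sketch; the 256 MB value and the file path are only illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Override the block size for files written by this client (256 MB)
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/data/big-file.dat"))) {
            out.writeBytes("large file contents would be streamed here");
        }
        fs.close();
    }
}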
Optimize Data Replication Factor
       •     Adjust the replication factor based on the required fault
             tolerance and cluster storage capacity.
       •  A lower replication factor reduces storage overhead and
         improves performance but at the cost of lower fault tolerance.
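The replication factor is configured cluster-wide with dfs.replication and can also be changed for existing files. Below is a minimal sketch; the path and the factor of 2 are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdjustReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Reduce the replication factor of a rarely accessed data set from 3 to 2
        boolean changed = fs.setReplication(new Path("/archive/2023"), (short) 2);
        System.out.println("Replication changed: " + changed);
        fs.close();
    }
}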
Optimize Your Network Settings
      • Configure network settings such as network buffers and TCP
         settings. It helps to maximize data transfer speeds between
         nodes in your cluster.
       •  Hadoop performance improves when network bandwidth
          increases and latency decreases.
Increase Concurrency
       • Split large computing tasks into smaller, parallelizable tasks to
          make optimal use of your cluster's compute resources.
       • This can be achieved by adjusting the number of mappers and
         reducers in your MapReduce job.
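The number of reduce tasks is set directly on the job, while the number of map tasks follows from the input split size. Below is a minimal sketch of both settings; the values shown are placeholders, not recommendations, and the sketch only configures the job without submitting it.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ConcurrencyTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Smaller maximum split size -> more input splits -> more parallel mappers
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024);

        Job job = Job.getInstance(conf, "concurrency-tuning-demo");
        // More reducers spread the shuffle and reduce work across the cluster
        job.setNumReduceTasks(8);
    }
}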
Optimize Task Scheduling
       •   Configure the Hadoop scheduler (e.g., the Fair Scheduler or
           Capacity Scheduler) for efficient resource allocation.
       •     Fine-tuning the scheduling parameters ensures fair resource
             allocation and maximizes cluster utilization.
Moving Data in and out of Hadoop –
Understanding inputs and outputs of MapReduce
MapReduce is a programming model and associated implementation for
processing and generating large datasets that can be parallelized across a
distributed cluster of computers. It consists of two main phases: the Map
phase and the Reduce phase. The inputs and outputs of each phase are as
follows:
Map Phase
Input:
       •   Key-value pairs: The input data is divided into chunks, and each
           chunk is represented as a key-value pair. Typically, the key is
           used to identify the data record, and the value contains the
           actual data.
Processing:
       •   Mapper function: A user-defined function called the "mapper" is
           applied to each key-value pair independently. The mapper
           function takes the input key-value pair and emits intermediate
           key-value pairs based on the processing logic. It can filter,
           transform, or extract information from the input data.
Output:
       •   Intermediate key-value pairs: The mapper function generates
           intermediate key-value pairs as its output. These key-value
           pairs are usually different from the input key-value pairs and
           are emitted based on the logic defined in the mapper function.
           The intermediate key-value pairs are grouped by key and shuffled
           across the cluster to prepare for the next phase.
Reduce Phase
Input:
       •   Grouped key-value pairs: The intermediate key-value pairs
           generated by the map phase are shuffled and grouped based on
           their keys. All intermediate values associated with the same key
           are collected together and passed to the reducer function.
Processing:
       •   Reducer function: A user-defined function called the "reducer"
           is applied to each group of intermediate values sharing the same
           key. The reducer function aggregates, summarizes, or processes
           these values to produce the final output.
Output:
       •   Final output key-value pairs: The reducer function generates the
           final output key-value pairs based on the processing logic.
           These key-value pairs constitute the result of the MapReduce job
           and typically represent the desired computation or analysis
           performed on the input data.
In summary, the inputs to MapReduce are the initial dataset represented as
key-value pairs, and the outputs are the final processed results also
represented as key-value pairs, with intermediate processing stages in
between.
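The key-value types of both phases are declared in the job driver. Below is a minimal driver sketch for the hypothetical WordCount mapper and reducer shown earlier in this unit; the input and output paths are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Map phase: (offset, line) in, (word, 1) out
        job.setMapperClass(WordCount.TokenizerMapper.class);
        // Optional combiner reduces intermediate data before the shuffle
        job.setCombinerClass(WordCount.SumReducer.class);
        // Reduce phase: (word, [counts]) in, (word, total) out
        job.setReducerClass(WordCount.SumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}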
Data Serialization
What is Serialization?
Serialization is the process of converting a data object—a
combination of code and data represented within a region of data
storage—into a series of bytes that saves the state of the object in
an easily transmittable form. In this serialized form, the data can be
delivered to another data store (such as an in-memory computing
platform), application, or some other destination.
Data serialization is the process of converting an object into a
stream of bytes to more easily save or transmit it.
The reverse process—constructing a data structure or object from a
series of bytes—is deserialization. The deserialization process
recreates the object, thus making the data easier to read and modify
as a native structure in a programming language.
Serialization and deserialization work together to transform/recreate
data objects to/from a portable format.
Serialization enables us to save the state of an object and recreate
the object in a new location. Serialization encompasses both the
storage of the object and exchange of data. Since objects are
composed of several components, saving or delivering all the parts
typically requires significant coding effort, so serialization is a
standard way to capture the object into a sharable format.
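In Java, for example, an object whose class implements java.io.Serializable can be converted to bytes and recreated with the standard object streams. Below is a minimal sketch; the Customer class and its fields are made up for illustration.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationDemo {
    // The state of this object can be saved and recreated elsewhere
    static class Customer implements Serializable {
        String name;
        int orders;
        Customer(String name, int orders) { this.name = name; this.orders = orders; }
    }

    public static void main(String[] args) throws Exception {
        // Serialization: object -> bytes
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Customer("Asha", 3));
        }

        // Deserialization: bytes -> object, recreated in a "new location"
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            Customer copy = (Customer) in.readObject();
            System.out.println(copy.name + " has " + copy.orders + " orders");
        }
    }
}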
With serialization, we can transfer objects:
  •   Over the wire for messaging use cases
  •   From application to application via web services such as REST
      APIs
  •   Through firewalls (as JSON or XML strings)
  •   Across domains
  •   To other data stores
  •   To identify changes in data over time
  •   While honoring security and user-specific details across
      applications
Why Is Data Serialization Important for Distributed Systems?
In some distributed systems, data and its replicas are stored in
different partitions on multiple cluster members. If data is not
present on the local member, the system will retrieve that data from
another member. This requires serialization for use cases such as:
  •   Adding key/value objects to a map
  •   Putting items into a queue, set, or list
  •   Sending a lambda function to another server
  •   Processing an entry within a map
  •   Locking an object
  •   Sending a message to a topic
What Are Common Languages for Data Serialization?
A number of popular object-oriented programming languages
provide either native support for serialization or have libraries that
add non-native capabilities for serialization to their feature set. Java,
.NET, C++, Node.js, Python, and Go, for example, all either have
native serialization support or integrate with serializer libraries.
Data formats such as JSON and XML are often used as the format for
storing serialized data. Custom binary formats are also used, which
tend to be more space-efficient due to less markup/tagging in the
serialization.
What Is Data Serialization in Big Data?
Big data systems often include technologies and data that are described
as “schemaless.” This means that the managed data in these systems is
not structured in a strict format, as defined by a schema. Serialization
provides several benefits in this type of environment:
  •   Structure. By inserting some schema or criteria for a data
      structure through serialization on read, we can avoid reading
      data that misses mandatory fields, is incorrectly classified, or
      lacks some other quality control requirement.
  •   Portability. Big data comes from a variety of systems and may
      be written in a variety of languages. Serialization can provide
      the necessary uniformity to transfer such data to other
      enterprise systems or applications.
  •   Versioning. Big data is constantly changing. Serialization allows
      us to apply version numbers to objects for lifecycle
      management.
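Hadoop itself uses the Writable interface for compact, portable serialization of the keys and values passed between map and reduce tasks. Below is a minimal sketch of a custom Writable; the PageView class and its fields are illustrative.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A custom value type that Hadoop can serialize between map and reduce tasks
public class PageView implements Writable {
    private String url;
    private long hits;

    public PageView() { }                      // no-arg constructor required by Hadoop

    public PageView(String url, long hits) {
        this.url = url;
        this.hits = hits;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialization: write the fields as a stream of bytes
        out.writeUTF(url);
        out.writeLong(hits);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialization: read the fields back in the same order
        url = in.readUTF();
        hits = in.readLong();
    }

    @Override
    public String toString() {
        return url + "\t" + hits;
    }
}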