Hadoop

Hadoop is an open-source framework that enables the distributed storage and processing of large datasets across
clusters of computers using simple programming models. It is designed to scale from single servers to thousands of
machines, each offering local computation and storage.
Key Modules of Hadoop:
1. Hadoop Common: Provides shared utilities and libraries required by other Hadoop modules.
2. HDFS (Hadoop Distributed File System): A distributed file system that stores data across multiple machines,
ensuring fault tolerance and high throughput.
3. YARN (Yet Another Resource Negotiator): Manages resources and schedules jobs across the cluster.
4. Hadoop MapReduce: A processing model for large-scale data processing that splits jobs into smaller tasks
and processes them in parallel.
Hadoop Stack:
1. HDFS: For distributed storage.
2. MapReduce: For parallel data processing.
3. YARN: For resource management.
4. Additional Tools:
o Hive: Data warehouse infrastructure for querying and managing large datasets.
o HBase: A distributed, scalable database built on HDFS.
o Pig: A high-level platform for creating MapReduce programs.
o Spark: A fast and general engine for big data processing.
HDFS Architecture: Master-Slave (PYQ)
HDFS operates on a Master-Slave architecture, consisting of a NameNode (Master) and multiple DataNodes (Slaves), designed
for distributed storage and fault tolerance.
NameNode: Controller
• File System Namespace Management: The NameNode manages the entire file system's hierarchy and metadata, such
as file names, directories, and access permissions.
• Block Mappings: It keeps track of where blocks of a file are stored across DataNodes, mapping file data to the
respective blocks stored in the cluster.
DataNodes: Work Horses
• Block Operations: DataNodes store actual file data in blocks and perform tasks like read/write operations based on
client requests.
• Replication: DataNodes replicate blocks as instructed by the NameNode to maintain redundancy and ensure fault
tolerance.
Secondary NameNode
• Checkpointing: The Secondary NameNode periodically merges the edit logs with the fsimage (file system image),
optimizing the metadata recovery process in case of a NameNode failure.
• It does not serve as a backup but helps the NameNode restart faster by reducing the metadata logs that need to be
processed.
Multiple-Rack Cluster
HDFS follows a rack-awareness model where DataNodes are grouped into racks, and block replicas are distributed across different racks for better fault tolerance and data reliability.
• Reliable Storage: If a DataNode fails, the NameNode replicates the lost blocks to another node.
• Cross-Rack Replication: To enhance reliability, block replicas are stored on different racks. This ensures data recovery even in the event of a full rack failure.
Single Point of Failure
• The NameNode is critical to HDFS operations. If it goes down, the entire file system becomes inaccessible, making it the Single Point of Failure (SPOF) in traditional HDFS setups.
HDFS Inside: NameNode
The NameNode manages the FS image (file system structure) and edit log (recent changes) to track all file
operations. It communicates with DataNodes, receiving periodic heartbeats and block reports.
• The Secondary NameNode helps by periodically merging the edit log with the FS image to reduce log size
and maintain system efficiency.
• It performs housekeeping tasks like backing up metadata to ensure the control information is up-to-date.
HDFS Inside: Read (PYQ)


1. Client connects to NameNode to request data.
2. NameNode provides block locations where the data is stored.
3. Client reads data directly from the DataNodes, bypassing the NameNode.
4. If a DataNode fails, the client connects to another DataNode that holds a replica of the block (see the sketch below).
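The same read path can be exercised over HDFS's WebHDFS REST interface. The snippet below is only a minimal sketch: the NameNode address, the 9870 WebHDFS port (the Hadoop 3.x default), and the file path are assumptions for illustration, not values from these notes.

```python
import requests

# Assumed values for illustration: a NameNode with WebHDFS enabled on the
# Hadoop 3.x default port (9870) and a hypothetical file path.
NAMENODE = "http://namenode.example.com:9870"
PATH = "/data/sample.txt"

# Steps 1-2: ask the NameNode for the data. It answers with a 307 redirect
# whose Location header points at a DataNode holding the block.
resp = requests.get(f"{NAMENODE}/webhdfs/v1{PATH}",
                    params={"op": "OPEN"},
                    allow_redirects=False)
datanode_url = resp.headers["Location"]

# Step 3: read the bytes directly from the DataNode, bypassing the NameNode.
data = requests.get(datanode_url).content
print(data[:80])
```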
HDFS Inside: Write
1. Client connects to NameNode to initiate a write operation.
2. NameNode assigns DataNodes for the data blocks.
3. Client writes blocks directly to DataNodes with the specified replication factor.
4. If a DataNode fails, the NameNode re-replicates the missing blocks to maintain redundancy (see the sketch below).
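The write path over WebHDFS mirrors these steps: the NameNode assigns a DataNode via a redirect, and the client then streams the block to that DataNode directly. Again a minimal sketch, with an assumed cluster address and a hypothetical target path.

```python
import requests

NAMENODE = "http://namenode.example.com:9870"   # assumed NameNode address
PATH = "/data/output.txt"                       # hypothetical target path

# Steps 1-2: the NameNode picks a DataNode for the first block and replies
# with a 307 redirect instead of accepting the data itself.
resp = requests.put(f"{NAMENODE}/webhdfs/v1{PATH}",
                    params={"op": "CREATE", "overwrite": "true"},
                    allow_redirects=False)
datanode_url = resp.headers["Location"]

# Step 3: stream the block directly to the DataNode; the cluster then
# replicates it to other DataNodes according to the replication factor.
requests.put(datanode_url, data=b"hello hdfs\n")
```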
MapReduce
MapReduce is a programming model in Hadoop designed for processing
large datasets in parallel across a distributed cluster. It divides tasks into
two main phases: Map and Reduce. The Map phase processes and filters
data, while the Reduce phase aggregates the results.

MapReduce Architecture Workflow


1. Client Submits Job: The client submits the MapReduce job to the
Job Tracker and uploads the job code to HDFS.
2. Job Tracker Contacts NameNode: The Job Tracker communicates with the NameNode to locate the required
data blocks.
3. Job Execution Plan: The Job Tracker creates an execution plan and assigns tasks to Task Trackers. The plan includes:
o Mapper: Processes input data into key-value pairs.
o Combiner (optional): Performs local aggregation on Mapper output to reduce the amount of data transferred to the Reducer.
o Reducer: Aggregates the results of the Map phase based on keys.
4. Task Execution: The Task Trackers execute the assigned tasks (Map and Reduce) and report their
progress/status back to the Job Tracker.
5. Task Phases Management: The Job Tracker oversees the task phases (Map and Reduce), ensuring they run
smoothly.
6. Job Completion: Once all tasks are completed, the Job Tracker finishes the job and updates the status for
the client.
Example of MapReduce
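The worked figure for this example is not reproduced here. As a stand-in, the classic word-count job can be sketched in the Hadoop Streaming style, with the mapper and reducer as small Python functions (a hedged illustration, not the exact example from the original notes):

```python
import sys

def run_mapper():
    # Map phase: emit one (word, 1) pair per word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def run_reducer():
    # Reduce phase: Hadoop Streaming hands the reducer its input sorted by
    # key, so all counts for a given word arrive consecutively.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    # Local simulation: cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
    run_mapper() if sys.argv[1] == "map" else run_reducer()
```

On a cluster, the mapper and reducer would typically be passed to the Hadoop Streaming jar as separate scripts; and because word count's reduction is associative and commutative, the same reducer code can also serve as the optional Combiner.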
Spark
Apache Spark is an open-source, distributed computing system used for big data processing and analytics. It
supports various workloads, including batch processing, interactive queries, streaming, machine learning, and graph
processing. Spark is known for its in-memory processing, making it much faster than Hadoop's MapReduce.

Spark Architecture
Spark follows a master-slave architecture consisting of several key components:
1. Driver: The main program that defines transformations and actions on data. It coordinates and schedules
tasks.
2. Cluster Manager: Manages the cluster resources and allocates them to applications. Examples include
YARN, Mesos, or a standalone manager.
3. Executors: Workers that run tasks and store data. Each node in the cluster runs an executor.
4. Tasks: Individual units of work sent to the executors.
5. RDD (Resilient Distributed Dataset): A distributed collection of data that is fault-tolerant and can be
operated on in parallel.
Resilient Distributed Datasets (RDDs)
RDDs are the core abstraction in Apache Spark, representing a distributed collection of data that is fault-tolerant
and can be processed in parallel across a cluster. They allow efficient and large-scale data processing, supporting
both in-memory and disk storage.
Key Features of RDDs:
1. Fault Tolerance: RDDs can recover data lost due to failures using lineage information.
2. Parallel Processing: Operations are distributed across multiple nodes in a cluster.
3. Lazy Evaluation: RDD operations are only executed when an action is triggered.
Operations on RDDs
The two major types of operations are transformations and actions (a short PySpark sketch follows the list below).
1. Transformations:
o Return a new, modified RDD based on the original.
o Common transformations include:
▪ map()
▪ filter()
▪ sample()
▪ union()
2. Actions:
o Return a value based on some computation performed on an RDD.
o Common actions include:
▪ reduce()
▪ count()
▪ first()
▪ foreach()
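A minimal PySpark sketch of these operations, assuming a local Spark installation; the names and values are illustrative only:

```python
from pyspark import SparkContext

# Assumption: a local Spark installation; "local[*]" runs everything in-process.
sc = SparkContext("local[*]", "rdd-demo")

nums = sc.parallelize(range(1, 11))

# Transformations are lazy: these lines only record lineage, nothing runs yet.
evens = nums.filter(lambda x: x % 2 == 0)      # filter()
squares = evens.map(lambda x: x * x)           # map()

# Actions trigger execution of the recorded lineage.
print(squares.count())                         # count()  -> 5
print(squares.reduce(lambda a, b: a + b))      # reduce() -> 220
print(squares.first())                         # first()  -> 4

sc.stop()
```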
Iterative Operations on MapReduce
MapReduce lacks efficient support for iterative operations since it writes intermediate results to disk after each
Map and Reduce phase. This is inefficient for algorithms that require multiple passes over the same data (e.g.,
machine learning algorithms), leading to high latency and slower processing times.
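Spark avoids this by keeping the working set in memory between passes. The toy loop below is only a sketch of the idea: the cached RDD is reused on every iteration, where an equivalent MapReduce chain would re-read its input from disk on each pass.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-demo")   # assumed local setup

# The working set is cached in memory once and reused by every iteration,
# instead of being re-read from disk per pass as in MapReduce.
values = sc.parallelize(range(1000)).cache()

estimate = 0.0
for _ in range(10):          # toy update rule, just to force repeated passes
    estimate = 0.5 * (estimate + values.mean())

print(estimate)              # converges toward the mean of the cached data
sc.stop()
```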

Interactive Operations on MapReduce


MapReduce is also inefficient for interactive operations where users query data in real-time. Each query triggers a
separate job, reading from and writing to disk, which results in high latency and slow responses.

Spark Ecosystem
The Spark ecosystem consists of several components and libraries that extend Spark’s capabilities for big data
processing:
1. Spark Core: The foundation that provides basic functionalities like task scheduling and memory
management.
2. Spark SQL: Enables querying of structured data using SQL and the DataFrame API (illustrated in the sketch after this list).
3. Spark Streaming: Allows real-time stream processing.
4. MLlib: A machine learning library for scalable algorithms like classification and clustering.
5. GraphX: Spark’s API for graph processing and graph-parallel computations.
6. SparkR: R language integration for statistical computing.
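A small, hedged Spark SQL illustration: the data, session settings, and names below are assumptions for the example, not part of the original notes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# The DataFrame API and plain SQL are two front ends to the same engine.
df.filter(df.age > 30).show()

df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```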
Spark vs Hadoop (PYQ)
Feature | Apache Spark | Apache Hadoop (MapReduce)
Processing Model | In-memory computing | Disk-based, batch processing
Speed | Faster (due to in-memory processing) | Slower (due to disk I/O for intermediate data)
Ease of Use | APIs in Scala, Java, Python, R; high-level APIs (DataFrames, SQL) | Java-based, lower-level MapReduce programming
Data Processing Type | Batch, real-time (streaming), interactive, and iterative | Primarily batch processing
Fault Tolerance | Uses lineage and the DAG to recompute lost data | Uses data replication across nodes for fault tolerance
Data Storage | Works with various data sources (HDFS, S3, HBase, etc.) | Works mainly with HDFS (Hadoop Distributed File System)
Latency | Low latency (due to in-memory processing) | High latency (disk I/O for every MapReduce step)
Streaming Support | Supports real-time stream processing (via Spark Streaming) | Does not support real-time streaming
Machine Learning | Includes MLlib for machine learning tasks | Requires third-party libraries (e.g., Mahout)
Graph Processing | Provides GraphX for graph computations | Lacks built-in graph processing capabilities
Compatibility | Compatible with the Hadoop ecosystem (can run on YARN, access HDFS) | Runs only in the Hadoop ecosystem (HDFS, YARN)
Resource Management | Can use YARN, Mesos, or a standalone cluster manager | Relies on YARN for resource management
Iterative Algorithms | Optimized for iterative algorithms (e.g., ML algorithms) | Not optimized for iterative algorithms (requires multiple MapReduce jobs)
Maturity | Newer, but rapidly growing in popularity | Older, more stable, and widely used
Spark Scheduler
The Spark scheduler handles the execution of jobs by dividing them into stages and tasks. It uses a Dryad-like
Directed Acyclic Graph (DAG) to represent job execution, where nodes are operations (transformations) and
edges represent dependencies.
Key Features of the Spark Scheduler:
1. Pipelines functions within a stage: Consecutive narrow operations (e.g., map, filter) are grouped into a single stage and applied record by record, without materializing intermediate results between them.
2. Cache-aware work reuse & locality: It optimizes execution by reusing cached data and scheduling tasks
close to where data is stored.
3. Partitioning-aware: To avoid expensive data shuffling, it keeps track of data partitioning.
Example of Stages (a code-level sketch follows this list):
• Stage 1: Operations like groupBy, map (A, B, C, D) are executed.
• Stage 2: Combines transformations like union and join.
• Stage 3: Uses cached data partitions to minimize recomputation.
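In code, the stage boundaries fall wherever a wide dependency forces a shuffle. The following hedged PySpark sketch (not the A/B/C/D partitions from the figure) shows a pipelined narrow transformation followed by a shuffle that starts a new stage:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "stage-demo")

words = sc.parallelize(["a", "b", "a", "c", "b", "a"])

# map() is a narrow transformation: the scheduler pipelines it with the
# parallelize step inside one stage, record by record, with no waiting.
pairs = words.map(lambda w: (w, 1))

# reduceByKey() needs all values for a key in one place, so the DAG
# scheduler inserts a shuffle here and cuts the job into a new stage.
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.collect())      # e.g. [('a', 3), ('c', 1), ('b', 2)]
sc.stop()
```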
Hadoop Ecosystem (PYQ)
• Data Storage:
o HDFS: Distributed file system for storing large files.
o HBase: Columnar database for real-time access to large datasets.
• Data Processing:
o MapReduce: Parallel processing framework for handling large-scale data.
o YARN: Manages cluster resources and job scheduling.
• Data Access:
o Hive: SQL-like interface for querying data in HDFS.
o Pig: Data flow scripting for processing large datasets.
o Mahout: Machine learning library for scalable algorithms.
o Avro: Framework for data serialization and RPC.
o Sqoop: Connects and imports data between Hadoop and relational databases.
• Data Management:
o Oozie: Workflow scheduler for managing Hadoop jobs.
o Chukwa: System for monitoring and collecting data.
o Flume: Collects and aggregates log data from various sources.
o ZooKeeper: Coordination and management for distributed applications.
Pig
Apache Pig is a high-level platform for processing large datasets in Hadoop. It uses a language called Pig Latin for
expressing data transformations. Pig simplifies coding in MapReduce by providing a more accessible scripting
interface.
Why do we need Pig?
• Simplified MapReduce: Writing MapReduce directly is complex; Pig provides a more intuitive approach.
• Data Transformation: Useful for tasks like filtering, grouping, and joining large datasets.
• Less Code: With Pig, operations are concise and easier to maintain.
• Extensibility: Supports user-defined functions (UDFs) for custom tasks.
Features of Pig
• Ease of Programming: Pig Latin is easier than raw MapReduce.
• Data Flow Language: Describes transformations as a data flow.
• Schema Flexibility: Works with both structured and unstructured data.
• Optimization: Automatically optimizes execution by generating efficient MapReduce code.
• Extensibility: Supports UDFs in multiple languages (Java, Python).
Applications of Pig
• Log Analysis: Analyze and process web server logs.
• Data Processing: ETL (Extract, Transform, Load) tasks for large datasets.
• Ad Targeting: For marketing data, processing user behavior data.
• Data Research: Quick prototyping of algorithms in big data analytics.
Apache Pig Architecture
Apache Pig’s architecture is designed to execute Pig Latin scripts efficiently over large datasets. The key components
of Pig's architecture include:
1. Pig Latin Script: The user writes a Pig Latin script to specify data transformations.
2. Parser: Checks the script's syntax and types, then converts it into a logical plan (a series of steps representing the data flow).
3. Optimizer: Optimizes the logical plan for better performance, generating an optimized logical plan.
4. Compiler: Converts the optimized logical plan into a physical plan of MapReduce jobs.
5. Execution Engine: This executes the physical plan as MapReduce jobs on a Hadoop cluster.
6. HDFS (Hadoop Distributed File System): The data storage and retrieval system, where Pig processes the
data.
Pig vs MapReduce
Feature | Pig | MapReduce
Language | Pig Latin (high-level scripting) | Java (low-level programming)
Ease of Use | Easier, with fewer lines of code | Complex; requires more code
Abstraction | Higher level; abstracts MapReduce | Low level; direct MapReduce coding
Development Speed | Faster for developers | Slower; requires detailed coding
Optimization | Automatically optimized | Manual optimization needed
Use Case | ETL, data analysis, and querying | Best for complex and custom operations

Pig vs SQL
Feature | Pig | SQL
Data Type Support | Supports both structured and unstructured data | Primarily structured (RDBMS)
Data Processing | Procedural, step-by-step data flow | Declarative; focuses on "what" to retrieve
Schema Requirement | Can work with or without a schema | Requires a predefined schema
Platform | Designed for Hadoop (big data) | Designed for RDBMS (relational databases)
Flexibility | More flexible with unstructured data | Limited to structured data
Language | Pig Latin (procedural) | SQL (declarative)

Pig vs Hive
Feature | Pig | Hive
Language | Pig Latin (procedural) | HiveQL (SQL-like, declarative)
Data Handling | Works with unstructured and structured data | Primarily for structured data
Use Case | ETL, data processing, and analysis | Data querying and reporting
Execution | Translates scripts into MapReduce jobs | Also translates HiveQL into MapReduce jobs
Learning Curve | Easier for programmers | Easier for SQL users
Optimization | Automatic, with procedural control | Query optimization via SQL-based execution plans
Hive Architecture
1. User Interfaces
• Web UI: Web-based interaction.
• CLI: Command Line Interface for executing HiveQL queries.
• HDInsight: Cloud-based interface on Azure for Hive.
2. Meta Store
• Stores metadata like table schemas, partitions, and data locations.
• Uses RDBMS (e.g., MySQL) for managing metadata.
3. HiveQL Process Engine
• Parsing: Converts HiveQL queries into a logical plan.
• Optimization: Optimizes query execution using metadata.
4. Execution Engine
• Executes the optimized query using MapReduce, Tez, or Spark, depending on the configuration (a Spark-based sketch follows this section).
5. MapReduce
• Hive translates queries into MapReduce jobs for distributed data processing.
6. HDFS or HBase Data Storage
• HDFS: Default data storage for Hive.
• HBase: Supports NoSQL-style data storage for real-time access.
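One way to drive this stack programmatically is through Spark's Hive support, matching the Spark option of the execution engine above. A minimal sketch, assuming a cluster where Hive and its metastore are already configured, and using a hypothetical table name:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() points the session at the Hive metastore described
# above; "web_logs" is a hypothetical table name used only for illustration.
spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()
         .getOrCreate())

# The HiveQL text is parsed, optimized against metastore metadata, and run
# by the configured engine; the result comes back as a DataFrame.
spark.sql("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page").show()

spark.stop()
```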
Big Data
Big Data is a term for extremely large and complex datasets that traditional data processing tools can't handle
efficiently. It involves data from diverse sources and is characterized by its vast scale, rapid growth, and varying
formats.
5 V's of Big Data
1. Volume: Amount of data.
2. Velocity: Speed of data generation.
3. Variety: Types of data.
4. Veracity: Data accuracy.
5. Value: Insights and benefits.
