Hadoop

Hadoop is an open-source framework that enables the distributed storage and processing of large datasets across
clusters of computers using simple programming models. It is designed to scale from single servers to thousands of
machines, each offering local computation and storage.
Key Modules of Hadoop:
1. Hadoop Common: Provides shared utilities and libraries required by other Hadoop modules.
2. HDFS (Hadoop Distributed File System): A distributed file system that stores data across multiple machines,
ensuring fault tolerance and high throughput.
3. YARN (Yet Another Resource Negotiator): Manages resources and schedules jobs across the cluster.
4. Hadoop MapReduce: A processing model for large-scale data processing that splits jobs into smaller tasks
and processes them in parallel.
Hadoop Stack:
1. HDFS: For distributed storage.
2. MapReduce: For parallel data processing.
3. YARN: For resource management.
4. Additional Tools:
o Hive: Data warehouse infrastructure for querying and managing large datasets.
o HBase: A distributed, scalable database built on HDFS.
o Pig: A high-level platform for creating MapReduce programs.
o Spark: A fast and general engine for big data processing.
HDFS Architecture: Master-Slave (PYQ)
HDFS operates on a Master-Slave architecture, consisting of a NameNode (Master) and multiple DataNodes (Slaves), designed
for distributed storage and fault tolerance.
NameNode: Controller
• File System Namespace Management: The NameNode manages the entire file system's hierarchy and metadata, such
as file names, directories, and access permissions.
• Block Mappings: It keeps track of where blocks of a file are stored across DataNodes, mapping file data to the
respective blocks stored in the cluster.
DataNodes: Work Horses
• Block Operations: DataNodes store actual file data in blocks and perform tasks like read/write operations based on
client requests.
• Replication: DataNodes replicate blocks as instructed by the NameNode to maintain redundancy and ensure fault
tolerance.
Secondary NameNode
• Checkpointing: The Secondary NameNode periodically merges the edit logs with the fsimage (file system image),
optimizing the metadata recovery process in case of a NameNode failure.
• It does not serve as a backup but helps the NameNode restart faster by reducing the metadata logs that need to be
processed.
Multiple-Rack Cluster
HDFS follows a rack-awareness model where DataNodes are grouped into racks, and block replicas are distributed across different racks for better fault tolerance and data reliability.
• Reliable Storage: If a DataNode fails, the NameNode replicates the lost blocks to another node.
• Cross-Rack Replication: To enhance reliability, block replicas are stored on different racks. This ensures data recovery even in the event of a full rack failure.
Single Point of Failure
• The NameNode is critical to HDFS operations. If it goes down, the entire file system becomes inaccessible, making it the Single Point of Failure (SPOF) in traditional HDFS setups.
HDFS Inside: NameNode
The NameNode manages the FS image (file system structure) and edit log (recent changes) to track all file
operations. It communicates with DataNodes, receiving periodic heartbeats and block reports.
• The Secondary NameNode helps by periodically merging the edit log with the FS image to reduce log size
and maintain system efficiency.
• It performs housekeeping tasks like backing up metadata to ensure the control information is up-to-date.
HDFS Inside: Read (PYQ)


1. Client connects to NameNode to request data.
2. NameNode provides block locations where the data is stored.
3. Client reads data directly from the DataNodes, bypassing the NameNode.
4. If a DataNode fails, the client connects to another DataNode that holds a replica of the block (see the sketch below).
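The same read path can be exercised over HDFS's WebHDFS REST interface. The snippet below is only a minimal sketch: the NameNode address, the 9870 WebHDFS port (the Hadoop 3.x default), and the file path are assumptions for illustration, not values from these notes.

```python
import requests

# Assumed values for illustration: a NameNode with WebHDFS enabled on the
# Hadoop 3.x default port (9870) and a hypothetical file path.
NAMENODE = "http://namenode.example.com:9870"
PATH = "/data/sample.txt"

# Steps 1-2: ask the NameNode for the data. It answers with a 307 redirect
# whose Location header points at a DataNode holding the block.
resp = requests.get(f"{NAMENODE}/webhdfs/v1{PATH}",
                    params={"op": "OPEN"},
                    allow_redirects=False)
datanode_url = resp.headers["Location"]

# Step 3: read the bytes directly from the DataNode, bypassing the NameNode.
data = requests.get(datanode_url).content
print(data[:80])
```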
HDFS Inside: Write
1. Client connects to NameNode to initiate a write operation.
2. NameNode assigns DataNodes for the data blocks.
3. Client writes blocks directly to DataNodes with the specified replication factor.
4. If a DataNode fails, the NameNode re-replicates the missing blocks to maintain redundancy (see the sketch below).
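The write path over WebHDFS mirrors these steps: the NameNode assigns a DataNode via a redirect, and the client then streams the block to that DataNode directly. Again a minimal sketch, with an assumed cluster address and a hypothetical target path.

```python
import requests

NAMENODE = "http://namenode.example.com:9870"   # assumed NameNode address
PATH = "/data/output.txt"                       # hypothetical target path

# Steps 1-2: the NameNode picks a DataNode for the first block and replies
# with a 307 redirect instead of accepting the data itself.
resp = requests.put(f"{NAMENODE}/webhdfs/v1{PATH}",
                    params={"op": "CREATE", "overwrite": "true"},
                    allow_redirects=False)
datanode_url = resp.headers["Location"]

# Step 3: stream the block directly to the DataNode; the cluster then
# replicates it to other DataNodes according to the replication factor.
requests.put(datanode_url, data=b"hello hdfs\n")
```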
MapReduce
MapReduce is a programming model in Hadoop designed for processing
large datasets in parallel across a distributed cluster. It divides tasks into
two main phases: Map and Reduce. The Map phase processes and filters
data, while the Reduce phase aggregates the results.

MapReduce Architecture Workflow


1. Client Submits Job: The client submits the MapReduce job to the
Job Tracker and uploads the job code to HDFS.
2. Job Tracker Contacts NameNode: The Job Tracker communicates with the NameNode to locate the required
data blocks.
3. Job Execution Plan: The Job Tracker creates an execution plan and assigns tasks to Task Trackers. The plan includes:
o Mapper: Processes input data into key-value pairs.
o Combiner (optional): Performs local aggregation on Mapper output to reduce the amount of data transferred to the Reducer.
o Reducer: Aggregates the results of the Map phase based on keys.
4. Task Execution: The Task Trackers execute the assigned tasks (Map and Reduce) and report their
progress/status back to the Job Tracker.
5. Task Phases Management: The Job Tracker oversees the task phases (Map and Reduce), ensuring they run
smoothly.
6. Job Completion: Once all tasks are completed, the Job Tracker finishes the job and updates the status for
the client.
Example of MapReduce
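The worked figure for this example is not reproduced here. As a stand-in, the classic word-count job can be sketched in the Hadoop Streaming style, with the mapper and reducer as small Python functions (a hedged illustration, not the exact example from the original notes):

```python
import sys

def run_mapper():
    # Map phase: emit one (word, 1) pair per word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def run_reducer():
    # Reduce phase: Hadoop Streaming hands the reducer its input sorted by
    # key, so all counts for a given word arrive consecutively.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    # Local simulation: cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
    run_mapper() if sys.argv[1] == "map" else run_reducer()
```

On a cluster, the mapper and reducer would typically be passed to the Hadoop Streaming jar as separate scripts; and because word count's reduction is associative and commutative, the same reducer code can also serve as the optional Combiner.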
Spark
Apache Spark is an open-source, distributed computing system used for big data processing and analytics. It
supports various workloads, including batch processing, interactive queries, streaming, machine learning, and graph
processing. Spark is known for its in-memory processing, making it much faster than Hadoop's MapReduce.

Spark Architecture
Spark follows a master-slave architecture consisting of several key components:
1. Driver: The main program that defines transformations and actions on data. It coordinates and schedules
tasks.
2. Cluster Manager: Manages the cluster resources and allocates them to applications. Examples include
YARN, Mesos, or a standalone manager.
3. Executors: Workers that run tasks and store data. Each node in the cluster runs an executor.
4. Tasks: Individual units of work sent to the executors.
5. RDD (Resilient Distributed Dataset): A distributed collection of data that is fault-tolerant and can be
operated on in parallel.
Resilient Distributed Datasets (RDDs)
RDDs are the core abstraction in Apache Spark, representing a distributed collection of data that is fault-tolerant
and can be processed in parallel across a cluster. They allow efficient and large-scale data processing, supporting
both in-memory and disk storage.
Key Features of RDDs:
1. Fault Tolerance: RDDs can recover data lost due to failures using lineage information.
2. Parallel Processing: Operations are distributed across multiple nodes in a cluster.
3. Lazy Evaluation: RDD operations are only executed when an action is triggered.
Operations on RDDs
The two major types of operations are transformations and actions (a short PySpark sketch follows the list below).
1. Transformations:
o Return a new, modified RDD based on the original.
o Common transformations include:
▪ map()
▪ filter()
▪ sample()
▪ union()
2. Actions:
o Return a value based on some computation performed on an RDD.
o Common actions include:
▪ reduce()
▪ count()
▪ first()
▪ foreach()
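A minimal PySpark sketch of these operations, assuming a local Spark installation; the names and values are illustrative only:

```python
from pyspark import SparkContext

# Assumption: a local Spark installation; "local[*]" runs everything in-process.
sc = SparkContext("local[*]", "rdd-demo")

nums = sc.parallelize(range(1, 11))

# Transformations are lazy: these lines only record lineage, nothing runs yet.
evens = nums.filter(lambda x: x % 2 == 0)      # filter()
squares = evens.map(lambda x: x * x)           # map()

# Actions trigger execution of the recorded lineage.
print(squares.count())                         # count()  -> 5
print(squares.reduce(lambda a, b: a + b))      # reduce() -> 220
print(squares.first())                         # first()  -> 4

sc.stop()
```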
Iterative Operations on MapReduce
MapReduce lacks efficient support for iterative operations since it writes intermediate results to disk after each
Map and Reduce phase. This is inefficient for algorithms that require multiple passes over the same data (e.g.,
machine learning algorithms), leading to high latency and slower processing times.
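Spark avoids this by keeping the working set in memory between passes. The toy loop below is only a sketch of the idea: the cached RDD is reused on every iteration, where an equivalent MapReduce chain would re-read its input from disk on each pass.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-demo")   # assumed local setup

# The working set is cached in memory once and reused by every iteration,
# instead of being re-read from disk per pass as in MapReduce.
values = sc.parallelize(range(1000)).cache()

estimate = 0.0
for _ in range(10):          # toy update rule, just to force repeated passes
    estimate = 0.5 * (estimate + values.mean())

print(estimate)              # converges toward the mean of the cached data
sc.stop()
```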

Interactive Operations on MapReduce


MapReduce is also inefficient for interactive operations where users query data in real-time. Each query triggers a
separate job, reading from and writing to disk, which results in high latency and slow responses.

Spark Ecosystem
The Spark ecosystem consists of several components and libraries that extend Spark’s capabilities for big data
processing:
1. Spark Core: The foundation that provides basic functionalities like task scheduling and memory
management.
2. Spark SQL: Enables querying of structured data using SQL and the DataFrame API (illustrated in the sketch after this list).
3. Spark Streaming: Allows real-time stream processing.
4. MLlib: A machine learning library for scalable algorithms like classification and clustering.
5. GraphX: Spark’s API for graph processing and graph-parallel computations.
6. SparkR: R language integration for statistical computing.
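A small, hedged Spark SQL illustration: the data, session settings, and names below are assumptions for the example, not part of the original notes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# The DataFrame API and plain SQL are two front ends to the same engine.
df.filter(df.age > 30).show()

df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```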
Spark vs Hadoop (PYQ)
Feature | Apache Spark | Apache Hadoop (MapReduce)
Processing Model | In-memory computing | Disk-based, batch processing
Speed | Faster (due to in-memory processing) | Slower (due to disk I/O for intermediate data)
Ease of Use | APIs in Scala, Java, Python, R; high-level APIs (DataFrames, SQL) | Java-based, lower-level MapReduce programming
Data Processing Type | Batch, real-time (streaming), interactive, and iterative | Primarily batch processing
Fault Tolerance | Uses lineage and the DAG to recompute lost data | Uses data replication across nodes for fault tolerance
Data Storage | Works with various data sources (HDFS, S3, HBase, etc.) | Works mainly with HDFS (Hadoop Distributed File System)
Latency | Low latency (due to in-memory processing) | High latency (disk I/O for every MapReduce step)
Streaming Support | Supports real-time stream processing (via Spark Streaming) | Does not support real-time streaming
Machine Learning | Includes MLlib for machine learning tasks | Requires third-party libraries (e.g., Mahout)
Graph Processing | Provides GraphX for graph computations | Lacks built-in graph processing capabilities
Compatibility | Compatible with the Hadoop ecosystem (can run on YARN, access HDFS) | Runs only in the Hadoop ecosystem (HDFS, YARN)
Resource Management | Can use YARN, Mesos, or a standalone cluster manager | Relies on YARN for resource management
Iterative Algorithms | Optimized for iterative algorithms (e.g., ML algorithms) | Not optimized for iterative algorithms (requires multiple MapReduce jobs)
Maturity | Newer, but rapidly growing in popularity | Older, more stable, and widely used
Spark Scheduler
The Spark scheduler handles the execution of jobs by dividing them into stages and tasks. It uses a Dryad-like
Directed Acyclic Graph (DAG) to represent job execution, where nodes are operations (transformations) and
edges represent dependencies.
Key Features of the Spark Scheduler:
1. Pipelines functions within a stage: Consecutive narrow operations (e.g., map, filter) are grouped into a single stage and applied record by record, without materializing intermediate results between them.
2. Cache-aware work reuse & locality: It optimizes execution by reusing cached data and scheduling tasks
close to where data is stored.
3. Partitioning-aware: To avoid expensive data shuffling, it keeps track of data partitioning.
Example of Stages (a code-level sketch follows this list):
• Stage 1: Operations like groupBy, map (A, B, C, D) are executed.
• Stage 2: Combines transformations like union and join.
• Stage 3: Uses cached data partitions to minimize recomputation.
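In code, the stage boundaries fall wherever a wide dependency forces a shuffle. The following hedged PySpark sketch (not the A/B/C/D partitions from the figure) shows a pipelined narrow transformation followed by a shuffle that starts a new stage:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "stage-demo")

words = sc.parallelize(["a", "b", "a", "c", "b", "a"])

# map() is a narrow transformation: the scheduler pipelines it with the
# parallelize step inside one stage, record by record, with no waiting.
pairs = words.map(lambda w: (w, 1))

# reduceByKey() needs all values for a key in one place, so the DAG
# scheduler inserts a shuffle here and cuts the job into a new stage.
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.collect())      # e.g. [('a', 3), ('c', 1), ('b', 2)]
sc.stop()
```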
Hadoop Ecosystem (PYQ)
• Data Storage:
o HDFS: Distributed file system for storing large files.
o HBase: Columnar database for real-time access to large datasets.
• Data Processing:
o MapReduce: Parallel processing framework for handling large-scale data.
o YARN: Manages cluster resources and job scheduling.
• Data Access:
o Hive: SQL-like interface for querying data in HDFS.
o Pig: Data flow scripting for processing large datasets.
o Mahout: Machine learning library for scalable algorithms.
o Avro: Framework for data serialization and RPC.
o Sqoop: Connects and imports data between Hadoop and relational databases.
• Data Management:
o Oozie: Workflow scheduler for managing Hadoop jobs.
o Chukwa: System for monitoring and collecting data.
o Flume: Collects and aggregates log data from various sources.
o ZooKeeper: Coordination and management for distributed applications.
Pig
Apache Pig is a high-level platform for processing large datasets in Hadoop. It uses a language called Pig Latin for
expressing data transformations. Pig simplifies coding in MapReduce by providing a more accessible scripting
interface.
Why do we need Pig?
• Simplified MapReduce: Writing MapReduce directly is complex; Pig provides a more intuitive approach.
• Data Transformation: Useful for tasks like filtering, grouping, and joining large datasets.
• Less Code: With Pig, operations are concise and easier to maintain.
• Extensibility: Supports user-defined functions (UDFs) for custom tasks.
Features of Pig
• Ease of Programming: Pig Latin is easier than raw MapReduce.
• Data Flow Language: Describes transformations as a data flow.
• Schema Flexibility: Works with both structured and unstructured data.
• Optimization: Automatically optimizes execution by generating efficient MapReduce code.
• Extensibility: Supports UDFs in multiple languages (Java, Python).
Applications of Pig
• Log Analysis: Analyze and process web server logs.
• Data Processing: ETL (Extract, Transform, Load) tasks for large datasets.
• Ad Targeting: For marketing data, processing user behavior data.
• Data Research: Quick prototyping of algorithms in big data analytics.
Apache Pig Architecture
Apache Pig’s architecture is designed to execute Pig Latin scripts efficiently over large datasets. The key components
of Pig's architecture include:
1. Pig Latin Script: The user writes a Pig Latin script to specify data transformations.
2. Parser: Checks the script's syntax and types, then converts it into a logical plan (a series of steps representing the data flow).
3. Optimizer: Optimizes the logical plan for better performance, generating an optimized logical plan.
4. Compiler: Converts the optimized logical plan into a physical plan of MapReduce jobs.
5. Execution Engine: This executes the physical plan as MapReduce jobs on a Hadoop cluster.
6. HDFS (Hadoop Distributed File System): The data storage and retrieval system, where Pig processes the
data.
Pig vs MapReduce
Feature | Pig | MapReduce
Language | Pig Latin (high-level scripting) | Java (low-level programming)
Ease of Use | Easier, with fewer lines of code | Complex; requires more code
Abstraction | Higher level; abstracts MapReduce | Low level; direct MapReduce coding
Development Speed | Faster for developers | Slower; requires detailed coding
Optimization | Automatically optimized | Manual optimization needed
Use Case | ETL, data analysis, and querying | Best for complex and custom operations

Pig vs SQL
Feature | Pig | SQL
Data Type Support | Supports both structured and unstructured data | Primarily structured (RDBMS)
Data Processing | Procedural, step-by-step data flow | Declarative; focuses on "what" to retrieve
Schema Requirement | Can work with or without a schema | Requires a predefined schema
Platform | Designed for Hadoop (big data) | Designed for RDBMS (relational databases)
Flexibility | More flexible with unstructured data | Limited to structured data
Language | Pig Latin (procedural) | SQL (declarative)

Pig vs Hive
Feature | Pig | Hive
Language | Pig Latin (procedural) | HiveQL (SQL-like, declarative)
Data Handling | Works with unstructured and structured data | Primarily for structured data
Use Case | ETL, data processing, and analysis | Data querying and reporting
Execution | Translates scripts into MapReduce jobs | Also translates HiveQL into MapReduce jobs
Learning Curve | Easier for programmers | Easier for SQL users
Optimization | Automatic, with procedural control | Query optimization via SQL-based execution plans
Hive Architecture
1. User Interfaces
• Web UI: Web-based interaction.
• CLI: Command Line Interface for executing HiveQL queries.
• HDInsight: Cloud-based interface on Azure for Hive.
2. Meta Store
• Stores metadata like table schemas, partitions, and data locations.
• Uses RDBMS (e.g., MySQL) for managing metadata.
3. HiveQL Process Engine
• Parsing: Converts HiveQL queries into a logical plan.
• Optimization: Optimizes query execution using metadata.
4. Execution Engine
• Executes the optimized query using MapReduce, Tez, or Spark, depending on the configuration (a Spark-based sketch follows this section).
5. MapReduce
• Hive translates queries into MapReduce jobs for distributed data processing.
6. HDFS or HBase Data Storage
• HDFS: Default data storage for Hive.
• HBase: Supports NoSQL-style data storage for real-time access.
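One way to drive this stack programmatically is through Spark's Hive support, matching the Spark option of the execution engine above. A minimal sketch, assuming a cluster where Hive and its metastore are already configured, and using a hypothetical table name:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() points the session at the Hive metastore described
# above; "web_logs" is a hypothetical table name used only for illustration.
spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()
         .getOrCreate())

# The HiveQL text is parsed, optimized against metastore metadata, and run
# by the configured engine; the result comes back as a DataFrame.
spark.sql("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page").show()

spark.stop()
```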
Big Data
Big Data is a term for extremely large and complex datasets that traditional data processing tools can't handle
efficiently. It involves data from diverse sources and is characterized by its vast scale, rapid growth, and varying
formats.
5 V's of Big Data
1. Volume: Amount of data.
2. Velocity: Speed of data generation.
3. Variety: Types of data.
4. Veracity: Data accuracy.
5. Value: Insights and benefits.
