
CS-402 Parallel and Distributed Systems

Spring 2025

Lecture No. 13
Big Data

Big data refers to extremely large datasets that are complex and grow rapidly over time. These datasets are so vast that traditional data processing software can't manage them efficiently.

Big data is characterized by the three Vs:

Volume: The sheer amount of data.

Velocity: The speed at which data is generated and processed.

Variety: The different types of data (structured, unstructured, and semi-structured).

Big Data Statistics
The amount of data produced daily is staggering and continues to grow rapidly. Here are some of the latest statistics:
Daily Data Production: Approximately 402.74 million terabytes of data are created each day.
Annual Data Generation: In 2024, the global data volume reached 147 zettabytes, and it's projected to hit 181 zettabytes by 2025.
Text Messages: Every minute, 16 million text messages are sent globally.
Emails: Over 361 billion emails are sent each day.
Video Traffic: Videos account for over half (53%) of internet data traffic.

Introduction to Hadoop
 Hadoop is an open-source framework developed by the Apache Software Foundation,
designed to store and process large data sets across clusters of computers using simple
programming models.
 It is known for its ability to handle vast amounts of data and perform large-scale data
processing tasks efficiently.
Key Components of Hadoop
1. Hadoop Distributed File System (HDFS):
 Storage System: HDFS is designed to store large files across multiple machines in a distributed
fashion.
 Fault Tolerance: It replicates data across different nodes to ensure reliability and fault tolerance.
 Scalability: Can scale up from single servers to thousands of machines, each offering local
computation and storage.

Key Components of Hadoop
2. MapReduce:
 Processing Model: This is the core processing engine that works by dividing tasks into two main functions:
Map and Reduce.
 Parallel Processing: MapReduce processes large datasets in parallel by splitting them into smaller tasks.

3. YARN (Yet Another Resource Negotiator):


 Resource Management: YARN manages and allocates resources to various applications running in a
Hadoop cluster.
 Scheduling: It schedules tasks and monitors resource usage.

4. Hadoop Common:
 Utilities: A collection of common utilities and libraries that support the other Hadoop modules.
 Necessary for Functionality: Ensures all other Hadoop components can integrate and work together
smoothly.
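To make the MapReduce component above concrete, below is a minimal word-count mapper and reducer in the style of Hadoop Streaming, which lets ordinary scripts act as the map and reduce functions. This is a sketch under assumed file names (mapper.py, reducer.py), not the lecture's reference code.

# mapper.py -- minimal Hadoop Streaming mapper: reads text lines from
# stdin and emits one "word<TAB>1" record per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py -- minimal Hadoop Streaming reducer: Hadoop sorts the
# mapper output by key, so all records for one word arrive consecutively.
import sys

current, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(value)
if current is not None:
    print(f"{current}\t{count}")

In a streaming job, the two scripts are handed to Hadoop's streaming jar via its -mapper and -reducer options; HDFS and YARN then handle splitting, shuffling, scheduling, and fault tolerance.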

Benefits and Real-World Applications of Hadoop
 Cost-Effective: Utilizes commodity hardware, making it a cost-effective solution for storing and
processing large amounts of data.
 Scalability: Easily scales horizontally by adding more machines to the cluster.
 Flexibility: Can store and process various types of data, whether structured, semi-structured, or
unstructured.
 Fault Tolerance: Automatically replicates data and manages failures within the cluster.
 Data Locality: Moves computation to the data rather than moving data to the computation,
improving processing speed and efficiency.
Real-World Applications
 Big Data Analytics: Used by companies to analyze large datasets for business insights.
 Search Engines: Powers data processing and indexing for search engines.
 Social Media Analysis: Helps analyze user data and interactions on social media platforms.
 Healthcare: Assists in analyzing large amounts of medical data for research and diagnostics.
MapReduce

 MapReduce is a powerful programming model and processing technique for handling large
data sets with a distributed algorithm on a cluster.
 Developed by Google, it's designed to process vast amounts of data in parallel across many
machines in a reliable and fault-tolerant manner. Here's a breakdown of the core concepts:
Key Components:
 Map Function: This function processes input data and transforms it into a set of
intermediate key/value pairs. The input data is typically divided into smaller chunks, and the
map function is applied to each chunk in parallel.
 Reduce Function: After the map function processes the data, the reduce function takes the
intermediate key/value pairs and merges them to produce the final output. It essentially
aggregates the results from the map function.

How MapReduce Works

1. Splitting: The input data is split into smaller, manageable chunks.

2. Mapping: The map function is applied to each chunk, producing intermediate key/value
pairs.

3. Shuffling and Sorting: The intermediate pairs are shuffled and sorted by key. This step
ensures that all values associated with a given key are grouped together.

4. Reducing: The reduce function processes each group of intermediate data to generate the
final output.
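As a single-machine sketch of these four steps (illustrative only; a real framework runs the map and reduce calls in parallel across a cluster):

from collections import defaultdict

def map_fn(chunk):
    # 2. Mapping: emit an intermediate (word, 1) pair for each word.
    return [(word, 1) for word in chunk.split()]

def reduce_fn(key, values):
    # 4. Reducing: aggregate all values for one key.
    return (key, sum(values))

# 1. Splitting: the input arrives as smaller, manageable chunks.
chunks = ["cloud computing involves remote servers",
          "cloud storage and online access to cloud services"]

intermediate = [pair for chunk in chunks for pair in map_fn(chunk)]

# 3. Shuffling and sorting: group all values by key.
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

print([reduce_fn(k, v) for k, v in sorted(groups.items())])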

MapReduce

 “A new abstraction that allows us to express the simple computations we were trying to
perform but hides the messy details of parallelization, fault-tolerance, data distribution and
load balancing in a library.”

 Programming model:
o Provides an abstraction to express the computation
 Library:
o Takes care of the runtime parallelization of the computation

Example: counting the number of occurrences of each word in the text below, from Wikipedia

"CLOUD COMPUTING IS A RECENTLY EVOLVED COMPUTING TERMINOLOGY OR METAPHOR BASED ON UTILITY AND CONSUMPTION OF COMPUTING RESOURCES. CLOUD COMPUTING INVOLVES DEPLOYING GROUPS OF REMOTE SERVERS AND SOFTWARE NETWORKS THAT ALLOW CENTRALIZED DATA STORAGE AND ONLINE ACCESS TO COMPUTER SERVICES OR RESOURCES. CLOUD CAN BE CLASSIFIED AS PUBLIC, PRIVATE OR HYBRID."

WORD: NUMBER OF OCCURRENCES

CLOUD 3
COMPUTING 4
IS 1
A 1
RECENTLY 1
EVOLVED 1
TERMINOLOGY 1
…
Programming Model

 Input: a set of key/value pairs

 Output: a set of key/value pairs

 Computation is expressed using two functions:

1. Map task: a single pair → a list of intermediate pairs
 map(input-key, input-value) → list(out-key, intermediate-value)
 <k_in, v_in> → { <k_int, v_int> }

2. Reduce task: all intermediate pairs with the same k_int → a list of values
 reduce(out-key, list(intermediate-value)) → list(out-values)
 <k_int, {v_int}> → <k_out, v_out>
Example: counting the number of occurrences of each word in a collection of documents

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
MapReduce Example Applications

 The MapReduce model can be applied to many applications:

 Distributed grep (a sketch follows this list):
 Map: emits a line if the line matches the pattern
 Reduce: identity function
 Count of URL access frequency
 Reverse web-link graph
 Inverted index
 Distributed sort
 …

MapReduce Implementation

 The MapReduce implementation presented in the paper matched Google's infrastructure at the time:
1. Large clusters of commodity PCs connected via switched Ethernet
2. Machines are typically dual-processor x86, running Linux, with 2-4 GB of memory (slow machines by today's standards)
3. A cluster of machines, so failures are anticipated
4. Storage with the Google File System (GFS, 2003) on IDE disks attached to the PCs. GFS is a distributed file system that uses replication for availability and reliability.
 Scheduling system:
1. Users submit jobs
2. Each job consists of tasks; the scheduler assigns tasks to machines
Google File System (GFS)
 A file is divided into several chunks of predefined size:
 Typically 16-64 MB
 The system replicates each chunk a number of times:
 Usually three replicas
 To achieve fault tolerance, availability, and reliability
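A toy illustration of chunking and replication follows; the chunk size, replica count, node names, and round-robin placement are assumptions for the sketch, not GFS's actual placement policy.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks
REPLICAS = 3                   # three replicas per chunk
MACHINES = [f"node{i}" for i in range(10)]

def place_chunks(file_size):
    # Split a file into fixed-size chunks and assign each chunk
    # to three distinct machines, round-robin.
    n_chunks = -(-file_size // CHUNK_SIZE)  # ceiling division
    return {c: [MACHINES[(c + r) % len(MACHINES)] for r in range(REPLICAS)]
            for c in range(n_chunks)}

# A 200 MB file needs 4 chunks; each gets 3 replicas on distinct nodes.
print(place_chunks(200 * 1024 * 1024))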

Parallel Execution
 The user specifies:
 M: number of map tasks
 R: number of reduce tasks
 Map:
 The MapReduce library splits the input file into M pieces
 Typically 16-64 MB per piece
 Map tasks are distributed across the machines
 Reduce:
 The intermediate key space is partitioned into R pieces
 hash(intermediate_key) mod R
 Typical setting:
 2,000 machines
 M = 200,000
 R = 5,000
Execution Flow

[Figure: execution-flow diagram of a MapReduce job; not reproduced in this text version.]
Master Data Structures

 For each map/reduce task, the master stores:
 State: {idle, in-progress, completed}
 Identity of the worker machine (for non-idle tasks)

 The locations of intermediate file regions are passed from map tasks to reduce tasks through the master.
 This information is pushed incrementally (as map tasks finish) to workers that have in-progress reduce tasks.
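A minimal Python sketch of this per-task bookkeeping (the field and type names are assumptions; the paper does not prescribe a concrete structure):

from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class TaskInfo:
    state: State = State.IDLE
    worker: Optional[str] = None  # worker machine identity (non-idle tasks only)
    intermediate_files: List[str] = field(default_factory=list)  # regions reported by maps

# As a map task finishes, the master records where its output lives and can
# push that location to workers running in-progress reduce tasks.
task = TaskInfo(state=State.IN_PROGRESS, worker="worker-17")
task.intermediate_files.append("/local/map-0003/region-1")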

Fault-Tolerance
Two types of failures:
1. Worker failures:
 Identified by heartbeat messages sent by the master; if a worker does not respond within a certain amount of time, it is considered dead (a toy sketch follows this slide)
 In-progress and completed map tasks are re-scheduled → idle
 In-progress reduce tasks are re-scheduled → idle
 Workers executing reduce tasks affected by failed map workers are notified of the re-scheduling
 Question: why do completed map tasks have to be re-scheduled?
 Answer: map output is stored on the local file system, while reduce output is stored on GFS
2. Master failure:
1. Rare
2. Can be recovered from checkpoints
3. Solution: abort the MapReduce computation and start again
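A toy sketch of heartbeat-based failure detection (the timeout value and data structures are assumptions):

import time

HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before a worker is presumed dead
last_heartbeat = {"worker-1": time.time(), "worker-2": time.time() - 60}

def detect_failures(now):
    # A worker whose last heartbeat is too old is considered dead.
    return [w for w, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

for worker in detect_failures(time.time()):
    # Re-schedule the dead worker's map tasks (their output on local disk is
    # lost) and its in-progress reduce tasks (completed reduce output is on
    # GFS and survives).
    print(f"{worker} presumed dead; re-scheduling its tasks as idle")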

Disk Locality
 Network bandwidth is a relatively scarce resource, and moving data over it also increases latency
 The goal is to save network bandwidth

 Use of GFS, which typically stores three copies of each data block on different machines
 Map tasks are scheduled "close" to the data:
 On nodes that have the input data (local disk)
 If not possible, on nodes that are nearer to the input data (e.g., same switch)
Task Granularity
 Number of map tasks > number of worker nodes:
 Better load balancing
 Better recovery

 But this increases the load on the master:
 More scheduling
 More state to be saved

 M could be chosen with respect to the block size of the file system:
 For locality properties
 R is usually specified by users:
 Each reduce task produces one output file
Stragglers
 Slow workers delay the overall completion time → stragglers
 Bad disks with soft errors
 Other tasks using up resources
 Machine configuration problems, etc.

 Very close to the end of the MapReduce operation, the master schedules backup executions of the remaining in-progress tasks (see the sketch below).
 A task is marked as complete whenever either the primary or the backup execution completes.

 Example: the sort operation takes 44% longer to complete when the backup task mechanism is disabled.
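A minimal sketch of the backup-execution race on one machine, using threads to stand in for workers (purely illustrative):

import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_task(name):
    # A straggler would sleep far longer here (bad disk, contention, ...).
    time.sleep(random.uniform(0.01, 0.05))
    return name

with ThreadPoolExecutor(max_workers=2) as pool:
    primary = pool.submit(run_task, "primary")
    backup = pool.submit(run_task, "backup")
    # The task is marked complete as soon as either execution finishes.
    done, _ = wait({primary, backup}, return_when=FIRST_COMPLETED)
    print(next(iter(done)).result(), "finished first; task marked complete")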

Refinements: Partitioning Function

 The partitioning function identifies the reduce task:
 Users specify the number of output files they want, R
 But there may be more keys than R
 It uses the intermediate key and R
 Default: hash(key) mod R

 It is important to choose well-balanced partitioning functions (a sketch follows):
 hash(Hostname(urlkey)) mod R
 For output keys that are URLs
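A small sketch of the default partitioner and the URL-aware variant (R and the helper code are illustrative assumptions):

from urllib.parse import urlparse

R = 4  # number of reduce tasks / output files

def default_partition(key):
    # Default partitioning: hash(key) mod R
    return hash(key) % R

def url_partition(url_key):
    # URL-aware variant: hash(Hostname(urlkey)) mod R, so all pages from
    # the same host end up in the same output file.
    return hash(urlparse(url_key).hostname) % R

# Both URLs share a host, so they land in the same partition:
print(url_partition("http://example.com/a") == url_partition("http://example.com/b"))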

Refinements: Combiner Function

 Introduce a mini-reduce phase before intermediate data is sent to reduce tasks:
 Useful when there is significant repetition of intermediate keys
 Merge values of intermediate keys before sending them to the reduce tasks
 Example: word count produces many records of the form <word_name, 1>; merge the records that share the same word_name (see the sketch below)
 Similar to the reduce function

 Saves network bandwidth
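A minimal sketch of a word-count combiner running over a single map task's output (the records are illustrative):

from collections import Counter

def combine(map_output):
    # Locally merge <word, 1> records produced by one map task before
    # they are sent over the network to the reducers.
    combined = Counter()
    for word, count in map_output:
        combined[word] += count
    return list(combined.items())

one_map_output = [("cloud", 1), ("computing", 1), ("cloud", 1), ("cloud", 1)]
print(combine(one_map_output))  # [('cloud', 3), ('computing', 1)] -- 2 records instead of 4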

Evaluation - Setup
 Evaluation on two programs running on a large cluster and processing 1 TB of data:
1. Grep: searches over 10^10 100-byte records, looking for a rare 3-character pattern
2. Sort: sorts 10^10 100-byte records

 Cluster configuration:
 1,800 machines
 Each machine has two 2 GHz Intel Xeon processors, 4 GB of memory, and two 160 GB IDE disks
 Gigabit Ethernet link
 Hosted in the same facility
GREP
 M = 15,000 splits of 64 MB each
 R = 1
 The entire computation finishes in 150 s
 Startup overhead of ~60 s:
 Propagation of the program to the workers
 Delays interacting with GFS to open 1,000 files
 …
 Peaks at 30 GB/s with 1,764 workers
SORT
 M = 15,000 splits, 64 MB each
 R = 4,000 output files
 Workers = 1,700
 Evaluated on three executions:
 With backup tasks
 Without backup tasks
 With machine failures
Sort Results
Top: rate at which input is read
Middle: rate at which data is sent from mappers to reducers
Bottom: rate at which sorted data is written to the output files by reducers

[Figure: the three executions compared side by side.]
 Normal execution with backup tasks
 Without backup tasks: 5 straggling reduce tasks cause a 44% increase in completion time
 With machine failures: 200 out of 1,746 workers killed, a 5% increase over the normal execution time
Implementation
 First MapReduce library in 02/2003
 Use cases (back then):
 Large-scale machine learning problems
 Clustering problems for Google News
 Extraction of data for Google Zeitgeist reports
 Large-scale graph computations

[Figure: number of MapReduce jobs run in 8/2004; chart not reproduced.]
Apache Spark
Architecture: Spark is designed for fast data processing. It uses in-memory computing to speed up data
processing tasks, which can be significantly faster than Hadoop's disk-based processing.

Components:
• Spark Core: The foundation for parallel processing.
• Spark SQL: For structured data processing.
• Spark Streaming: For real-time data processing.
• MLlib: For machine learning.
• GraphX: For graph processing.
Strengths:

• Speed: In-memory processing can be up to 100 times faster than Hadoop for certain tasks.

• Versatility: Supports batch processing, real-time data processing, machine learning, and graph
processing.

• Ease of Use: Provides user-friendly APIs in Java, Scala, Python, and R.
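As a taste of the Python API mentioned above, here is a minimal PySpark word count; the local master setting and input path are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()

counts = (spark.sparkContext.textFile("input.txt")  # hypothetical input file
          .flatMap(lambda line: line.split())       # one record per word
          .map(lambda word: (word, 1))              # emit <word, 1> pairs
          .reduceByKey(lambda a, b: a + b))         # sum counts per word, in memory

print(counts.collect())
spark.stop()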


Apache Spark
When to Use Which?

• Hadoop: Best for batch processing and when dealing with large volumes of data
that don't require real-time processing.

• Spark: Ideal for real-time data processing, iterative algorithms, and machine
learning tasks where speed is crucial.

Both frameworks can be used together to leverage their respective strengths. For
example, Hadoop can handle large-scale data storage and batch processing, while
Spark can be used for real-time analytics and machine learning.

Summary

 MapReduce is a very powerful and expressive model

 Performance depends a lot on implementation details

 Material is from the paper:
"MapReduce: Simplified Data Processing on Large Clusters", by Jeffrey Dean and Sanjay Ghemawat from Google, published in the USENIX OSDI conference, 2004
