
CS-402 Parallel and Distributed Systems

Spring 2025

Lecture No. 13
Big Data

Big data refers to extremely large datasets that are complex and grow rapidly over time. These datasets are so vast that traditional data processing software can't manage them efficiently.

Big data is characterized by the three Vs:

Volume: The sheer amount of data.

Velocity: The speed at which data is generated and processed.

Variety: The different types of data (structured, unstructured, and semi-structured).

Big Data Statistics
The amount of data produced daily is staggering and continues to grow rapidly. Here are some of the latest statistics:
Daily Data Production: Approximately 402.74 million terabytes of data are created each day.
Annual Data Generation: In 2024, the global data volume reached 147 zettabytes, and it's projected to hit 181 zettabytes by 2025.
Text Messages: Every minute, 16 million text messages are sent globally.
Emails: Over 361 billion emails are sent each day.
Video Traffic: Videos account for over half (53%) of internet data traffic.

Introduction to Hadoop
 Hadoop is an open-source framework developed by the Apache Software Foundation,
designed to store and process large data sets across clusters of computers using simple
programming models.
 It is known for its ability to handle vast amounts of data and perform large-scale data
processing tasks efficiently.
Key Components of Hadoop
1. Hadoop Distributed File System (HDFS):
 Storage System: HDFS is designed to store large files across multiple machines in a distributed
fashion.
 Fault Tolerance: It replicates data across different nodes to ensure reliability and fault tolerance.
 Scalability: Can scale up from single servers to thousands of machines, each offering local
computation and storage.

Key Components of Hadoop
2. MapReduce:
 Processing Model: This is the core processing engine that works by dividing tasks into two main functions:
Map and Reduce.
 Parallel Processing: MapReduce processes large datasets in parallel by splitting them into smaller tasks.

3. YARN (Yet Another Resource Negotiator):


 Resource Management: YARN manages and allocates resources to various applications running in a
Hadoop cluster.
 Scheduling: It schedules tasks and monitors resource usage.

4. Hadoop Common:
 Utilities: A collection of common utilities and libraries that support the other Hadoop modules.
 Necessary for Functionality: Ensures all other Hadoop components can integrate and work together
smoothly.
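To make the MapReduce component above concrete, below is a minimal word-count mapper and reducer in the style of Hadoop Streaming, which lets ordinary scripts act as the map and reduce functions. This is a sketch under assumed file names (mapper.py, reducer.py), not the lecture's reference code.

# mapper.py -- minimal Hadoop Streaming mapper: reads text lines from
# stdin and emits one "word<TAB>1" record per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py -- minimal Hadoop Streaming reducer: Hadoop sorts the
# mapper output by key, so all records for one word arrive consecutively.
import sys

current, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(value)
if current is not None:
    print(f"{current}\t{count}")

In a streaming job, the two scripts are handed to Hadoop's streaming jar via its -mapper and -reducer options; HDFS and YARN then handle splitting, shuffling, scheduling, and fault tolerance.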

Benefits and Real-World Applications of Hadoop
 Cost-Effective: Utilizes commodity hardware, making it a cost-effective solution for storing and
processing large amounts of data.
 Scalability: Easily scales horizontally by adding more machines to the cluster.
 Flexibility: Can store and process various types of data, whether structured, semi-structured, or
unstructured.
 Fault Tolerance: Automatically replicates data and manages failures within the cluster.
 Data Locality: Moves computation to the data rather than moving data to the computation,
improving processing speed and efficiency.
Real-World Applications
 Big Data Analytics: Used by companies to analyze large datasets for business insights.
 Search Engines: Powers data processing and indexing for search engines.
 Social Media Analysis: Helps analyze user data and interactions on social media platforms.
 Healthcare: Assists in analyzing large amounts of medical data for research and diagnostics.
MapReduce

 MapReduce is a powerful programming model and processing technique for handling large
data sets with a distributed algorithm on a cluster.
 Developed by Google, it's designed to process vast amounts of data in parallel across many
machines in a reliable and fault-tolerant manner. Here's a breakdown of the core concepts:
Key Components:
 Map Function: This function processes input data and transforms it into a set of
intermediate key/value pairs. The input data is typically divided into smaller chunks, and the
map function is applied to each chunk in parallel.
 Reduce Function: After the map function processes the data, the reduce function takes the
intermediate key/value pairs and merges them to produce the final output. It essentially
aggregates the results from the map function.

How MapReduce Works

1. Splitting: The input data is split into smaller, manageable chunks.

2. Mapping: The map function is applied to each chunk, producing intermediate key/value
pairs.

3. Shuffling and Sorting: The intermediate pairs are shuffled and sorted by key. This step
ensures that all values associated with a given key are grouped together.

4. Reducing: The reduce function processes each group of intermediate data to generate the
final output.
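As a single-machine sketch of these four steps (illustrative only; a real framework runs the map and reduce calls in parallel across a cluster):

from collections import defaultdict

def map_fn(chunk):
    # 2. Mapping: emit an intermediate (word, 1) pair for each word.
    return [(word, 1) for word in chunk.split()]

def reduce_fn(key, values):
    # 4. Reducing: aggregate all values for one key.
    return (key, sum(values))

# 1. Splitting: the input arrives as smaller, manageable chunks.
chunks = ["cloud computing involves remote servers",
          "cloud storage and online access to cloud services"]

intermediate = [pair for chunk in chunks for pair in map_fn(chunk)]

# 3. Shuffling and sorting: group all values by key.
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

print([reduce_fn(k, v) for k, v in sorted(groups.items())])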

MapReduce

 “A new abstraction that allows us to express the simple computations we were trying to
perform but hides the messy details of parallelization, fault-tolerance, data distribution and
load balancing in a library.”

 Programming model:
o Provides an abstraction to express the computation
 Library:
o Takes care of the runtime parallelization of the computation

Example: counting the number of occurrences of each word in the text below, from Wikipedia

"CLOUD COMPUTING IS A RECENTLY EVOLVED COMPUTING TERMINOLOGY OR METAPHOR BASED ON UTILITY AND CONSUMPTION OF COMPUTING RESOURCES. CLOUD COMPUTING INVOLVES DEPLOYING GROUPS OF REMOTE SERVERS AND SOFTWARE NETWORKS THAT ALLOW CENTRALIZED DATA STORAGE AND ONLINE ACCESS TO COMPUTER SERVICES OR RESOURCES. CLOUD CAN BE CLASSIFIED AS PUBLIC, PRIVATE OR HYBRID."

WORD: NUMBER OF OCCURRENCES

CLOUD 3
COMPUTING 4
IS 1
A 1
RECENTLY 1
EVOLVED 1
TERMINOLOGY 1
…
Programming Model

 Input: a set of key/value pairs

 Output: a set of key/value pairs

 Computation is expressed using two functions:

1. Map task: a single pair → a list of intermediate pairs
 map(input-key, input-value) → list(out-key, intermediate-value)
 <k_in, v_in> → { <k_int, v_int> }

2. Reduce task: all intermediate pairs with the same k_int → a list of values
 reduce(out-key, list(intermediate-value)) → list(out-values)
 <k_int, {v_int}> → <k_out, v_out>
Example: counting the number of occurrences of each word in a collection of documents

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
MapReduce Example Applications

 The MapReduce model can be applied to many applications:

 Distributed grep (a sketch follows this list):
 Map: emits a line if the line matches the pattern
 Reduce: identity function
 Count of URL access frequency
 Reverse web-link graph
 Inverted index
 Distributed sort
 …

MapReduce Implementation

 The MapReduce implementation presented in the paper matched Google's infrastructure at the time:
1. Large clusters of commodity PCs connected via switched Ethernet
2. Machines are typically dual-processor x86, running Linux, with 2-4 GB of memory (slow machines by today's standards)
3. A cluster of machines, so failures are anticipated
4. Storage with the Google File System (GFS, 2003) on IDE disks attached to the PCs. GFS is a distributed file system that uses replication for availability and reliability.
 Scheduling system:
1. Users submit jobs
2. Each job consists of tasks; the scheduler assigns tasks to machines
Google File System (GFS)
 A file is divided into several chunks of predefined size:
 Typically 16-64 MB
 The system replicates each chunk a number of times:
 Usually three replicas
 To achieve fault tolerance, availability, and reliability
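A toy illustration of chunking and replication follows; the chunk size, replica count, node names, and round-robin placement are assumptions for the sketch, not GFS's actual placement policy.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks
REPLICAS = 3                   # three replicas per chunk
MACHINES = [f"node{i}" for i in range(10)]

def place_chunks(file_size):
    # Split a file into fixed-size chunks and assign each chunk
    # to three distinct machines, round-robin.
    n_chunks = -(-file_size // CHUNK_SIZE)  # ceiling division
    return {c: [MACHINES[(c + r) % len(MACHINES)] for r in range(REPLICAS)]
            for c in range(n_chunks)}

# A 200 MB file needs 4 chunks; each gets 3 replicas on distinct nodes.
print(place_chunks(200 * 1024 * 1024))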

Parallel Execution
 The user specifies:
 M: number of map tasks
 R: number of reduce tasks
 Map:
 The MapReduce library splits the input file into M pieces
 Typically 16-64 MB per piece
 Map tasks are distributed across the machines
 Reduce:
 The intermediate key space is partitioned into R pieces
 hash(intermediate_key) mod R
 Typical setting:
 2,000 machines
 M = 200,000
 R = 5,000
Execution Flow

[Figure: execution-flow diagram of a MapReduce job; not reproduced in this text version.]
Master Data Structures

 For each map/reduce task, the master stores:
 State: {idle, in-progress, completed}
 Identity of the worker machine (for non-idle tasks)

 The locations of intermediate file regions are passed from map tasks to reduce tasks through the master.
 This information is pushed incrementally (as map tasks finish) to workers that have in-progress reduce tasks.
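A minimal Python sketch of this per-task bookkeeping (the field and type names are assumptions; the paper does not prescribe a concrete structure):

from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class TaskInfo:
    state: State = State.IDLE
    worker: Optional[str] = None  # worker machine identity (non-idle tasks only)
    intermediate_files: List[str] = field(default_factory=list)  # regions reported by maps

# As a map task finishes, the master records where its output lives and can
# push that location to workers running in-progress reduce tasks.
task = TaskInfo(state=State.IN_PROGRESS, worker="worker-17")
task.intermediate_files.append("/local/map-0003/region-1")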

Fault-Tolerance
Two types of failures:
1. Worker failures:
 Identified by heartbeat messages sent by the master; if a worker does not respond within a certain amount of time, it is considered dead (a toy sketch follows this slide)
 In-progress and completed map tasks are re-scheduled → idle
 In-progress reduce tasks are re-scheduled → idle
 Workers executing reduce tasks affected by failed map workers are notified of the re-scheduling
 Question: why do completed map tasks have to be re-scheduled?
 Answer: map output is stored on the local file system, while reduce output is stored on GFS
2. Master failure:
1. Rare
2. Can be recovered from checkpoints
3. Solution: abort the MapReduce computation and start again
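A toy sketch of heartbeat-based failure detection (the timeout value and data structures are assumptions):

import time

HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before a worker is presumed dead
last_heartbeat = {"worker-1": time.time(), "worker-2": time.time() - 60}

def detect_failures(now):
    # A worker whose last heartbeat is too old is considered dead.
    return [w for w, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

for worker in detect_failures(time.time()):
    # Re-schedule the dead worker's map tasks (their output on local disk is
    # lost) and its in-progress reduce tasks (completed reduce output is on
    # GFS and survives).
    print(f"{worker} presumed dead; re-scheduling its tasks as idle")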

Disk Locality
 Network bandwidth is a relatively scarce resource, and moving data over it also increases latency
 The goal is to save network bandwidth

 Use of GFS, which typically stores three copies of each data block on different machines
 Map tasks are scheduled "close" to the data:
 On nodes that have the input data (local disk)
 If not possible, on nodes that are nearer to the input data (e.g., same switch)
Task Granularity
 Number of map tasks > number of worker nodes:
 Better load balancing
 Better recovery

 But this increases the load on the master:
 More scheduling
 More state to be saved

 M could be chosen with respect to the block size of the file system:
 For locality properties
 R is usually specified by users:
 Each reduce task produces one output file
Stragglers
 Slow workers delay the overall completion time → stragglers
 Bad disks with soft errors
 Other tasks using up resources
 Machine configuration problems, etc.

 Very close to the end of the MapReduce operation, the master schedules backup executions of the remaining in-progress tasks (see the sketch below).
 A task is marked as complete whenever either the primary or the backup execution completes.

 Example: the sort operation takes 44% longer to complete when the backup task mechanism is disabled.
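A minimal sketch of the backup-execution race on one machine, using threads to stand in for workers (purely illustrative):

import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_task(name):
    # A straggler would sleep far longer here (bad disk, contention, ...).
    time.sleep(random.uniform(0.01, 0.05))
    return name

with ThreadPoolExecutor(max_workers=2) as pool:
    primary = pool.submit(run_task, "primary")
    backup = pool.submit(run_task, "backup")
    # The task is marked complete as soon as either execution finishes.
    done, _ = wait({primary, backup}, return_when=FIRST_COMPLETED)
    print(next(iter(done)).result(), "finished first; task marked complete")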

Refinements: Partitioning Function

 The partitioning function identifies the reduce task:
 Users specify the number of output files they want, R
 But there may be more keys than R
 It uses the intermediate key and R
 Default: hash(key) mod R

 It is important to choose well-balanced partitioning functions (a sketch follows):
 hash(Hostname(urlkey)) mod R
 For output keys that are URLs
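A small sketch of the default partitioner and the URL-aware variant (R and the helper code are illustrative assumptions):

from urllib.parse import urlparse

R = 4  # number of reduce tasks / output files

def default_partition(key):
    # Default partitioning: hash(key) mod R
    return hash(key) % R

def url_partition(url_key):
    # URL-aware variant: hash(Hostname(urlkey)) mod R, so all pages from
    # the same host end up in the same output file.
    return hash(urlparse(url_key).hostname) % R

# Both URLs share a host, so they land in the same partition:
print(url_partition("http://example.com/a") == url_partition("http://example.com/b"))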

Refinements: Combiner Function

 Introduce a mini-reduce phase before intermediate data is sent to reduce tasks:
 Useful when there is significant repetition of intermediate keys
 Merge values of intermediate keys before sending them to the reduce tasks
 Example: word count produces many records of the form <word_name, 1>; merge the records that share the same word_name (see the sketch below)
 Similar to the reduce function

 Saves network bandwidth
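A minimal sketch of a word-count combiner running over a single map task's output (the records are illustrative):

from collections import Counter

def combine(map_output):
    # Locally merge <word, 1> records produced by one map task before
    # they are sent over the network to the reducers.
    combined = Counter()
    for word, count in map_output:
        combined[word] += count
    return list(combined.items())

one_map_output = [("cloud", 1), ("computing", 1), ("cloud", 1), ("cloud", 1)]
print(combine(one_map_output))  # [('cloud', 3), ('computing', 1)] -- 2 records instead of 4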

Evaluation - Setup
 Evaluation on two programs running on a large cluster and processing 1 TB of data:
1. Grep: searches over 10^10 100-byte records, looking for a rare 3-character pattern
2. Sort: sorts 10^10 100-byte records

 Cluster configuration:
 1,800 machines
 Each machine has two 2 GHz Intel Xeon processors, 4 GB of memory, and two 160 GB IDE disks
 Gigabit Ethernet link
 Hosted in the same facility
GREP
 M = 15,000 splits of 64 MB each
 R = 1
 The entire computation finishes in 150 s
 Startup overhead of ~60 s:
 Propagation of the program to the workers
 Delays interacting with GFS to open 1,000 files
 …
 Peaks at 30 GB/s with 1,764 workers
SORT
 M = 15,000 splits, 64 MB each
 R = 4,000 output files
 Workers = 1,700
 Evaluated on three executions:
 With backup tasks
 Without backup tasks
 With machine failures
Sort Results
Top: rate at which input is read
Middle: rate at which data is sent from mappers to reducers
Bottom: rate at which sorted data is written to the output files by reducers

[Figure: the three executions compared side by side.]
 Normal execution with backup tasks
 Without backup tasks: 5 straggling reduce tasks cause a 44% increase in completion time
 With machine failures: 200 out of 1,746 workers killed, a 5% increase over the normal execution time
Implementation
 First MapReduce library in 02/2003
 Use cases (back then):
 Large-scale machine learning problems
 Clustering problems for Google News
 Extraction of data for Google Zeitgeist reports
 Large-scale graph computations

[Figure: number of MapReduce jobs run in 8/2004; chart not reproduced.]
Apache Spark
Architecture: Spark is designed for fast data processing. It uses in-memory computing to speed up data
processing tasks, which can be significantly faster than Hadoop's disk-based processing.

Components:
• Spark Core: The foundation for parallel processing.
• Spark SQL: For structured data processing.
• Spark Streaming: For real-time data processing.
• MLlib: For machine learning.
• GraphX: For graph processing.
Strengths:

• Speed: In-memory processing can be up to 100 times faster than Hadoop for certain tasks.

• Versatility: Supports batch processing, real-time data processing, machine learning, and graph
processing.

• Ease of Use: Provides user-friendly APIs in Java, Scala, Python, and R.
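As a taste of the Python API mentioned above, here is a minimal PySpark word count; the local master setting and input path are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()

counts = (spark.sparkContext.textFile("input.txt")  # hypothetical input file
          .flatMap(lambda line: line.split())       # one record per word
          .map(lambda word: (word, 1))              # emit <word, 1> pairs
          .reduceByKey(lambda a, b: a + b))         # sum counts per word, in memory

print(counts.collect())
spark.stop()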


Apache Spark
When to Use Which?

• Hadoop: Best for batch processing and when dealing with large volumes of data
that don't require real-time processing.

• Spark: Ideal for real-time data processing, iterative algorithms, and machine
learning tasks where speed is crucial.

Both frameworks can be used together to leverage their respective strengths. For
example, Hadoop can handle large-scale data storage and batch processing, while
Spark can be used for real-time analytics and machine learning.

Summary

 MapReduce is a very powerful and expressive model

 Performance depends a lot on implementation details

 Material is from the paper:
"MapReduce: Simplified Data Processing on Large Clusters", by Jeffrey Dean and Sanjay Ghemawat from Google, published in the USENIX OSDI conference, 2004
