
Introduction to Hadoop

ID2210
Jim Dowling
Large Scale Distributed Computing
In #Nodes
- BitTorrent (millions)
- Peer-to-Peer

In #Instructions/sec
- Teraflops, Petaflops, Exascale
- Super-Computing

In #Bytes stored
- Facebook: 300+ Petabytes (April 2014)*
- Hadoop

In #Bytes processed/time
- Google processed 24 petabytes of data per day in 2013
- Colossus, Spanner, BigQuery, BigTable, Borg, Omega, ..

*http://www.adweek.com/socialtimes/orcfile/434041
Where does Big Data Come From?

•On-line services: PBs per day

•Scientific instruments: PBs per minute

•Whole-genome sequencing: 250 GB per person

•Internet-of-Things: will be lots!


What is Big Data?

[Figure: Small Data vs. Big Data]


Why is Big Data “hot”?

•Companies like Google and Facebook have shown how to extract value from Big Data

- Orbitz looks for higher prices from Safari users [WSJ’12]
Why is Big Data “hot”?

•Big Data helped Obama win the 2012 election through data-driven decision making*

- Data said: middle-aged females like contests, dinners and celebrity

*http://swampland.time.com/2012/11/07/inside-the-secret-world-of-quants-and-data-crunchers-who-helped-obama-win/
Why is Big Data Important in Science?

•In a wide array of academic fields, the ability to effectively process data is superseding other, more classical modes of research.

“More data trumps better algorithms”*

*“The Unreasonable Effectiveness of Data” [Halevy et al. ’09]


4 Vs of Big Data

•Volume

•Velocity

•Variety

•Veracity/Variability/Value
A quick historical tour of data systems
Batch Sequential Processing

Scan → Sort

IBM 082 Punch Card Sorter (no fault tolerance)


1960s
First Database Management Systems
[Diagram: COBOL application on top of the DBMS]

Hierarchical and Network Database Management Systems

- You had to know what data you wanted, and how to find it
- Early DBMSs were disk-aware
Codd's Relational Model

Just tell me the data you want; the system will find it.
SystemR

CREATE TABLE Students(
  id INT PRIMARY KEY,
  firstname VARCHAR(96),
  lastname VARCHAR(96)
);

SELECT * FROM Students WHERE id > 10;

[Diagram: layered architecture: Views; Relations (Structured Query Language); Disk Access Methods and Indexes; Disk]
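The declarative idea above ("just say what you want") can be tried directly against any relational engine; a minimal sketch using Python's built-in sqlite3 module (the row values are invented for illustration):

```python
import sqlite3

# In-memory database: we declare *what* data we want; the engine finds it.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Students(
    id INTEGER PRIMARY KEY,
    firstname VARCHAR(96),
    lastname VARCHAR(96))""")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?)",
    [(5, "Ada", "Lovelace"), (11, "Edgar", "Codd"), (12, "Jim", "Gray")])

# The query optimizer decides how to find the matching rows.
rows = conn.execute(
    "SELECT * FROM Students WHERE id > 10 ORDER BY id").fetchall()
print(rows)  # only the rows with id 11 and 12 qualify
```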
Finding the Data using a Query Optimizer
Each color represents a program in this plan diagram.

•Each program produces the same result for the query.

•Each program has different performance characteristics, depending on changes in the data characteristics.
Data Characteristics Change
What if I have lots of Concurrent Queries?

•Data Integrity using Transactions*

ACID
Atomicity Consistency Isolation Durability

*Jim Gray, ”The Transaction Concept: Virtues and Limitations”


In the 1990s
Data Read Rates Increased Dramatically
Distribute within a Data Center
Master-Slave Replication

Data-location awareness is back:

- Clients read from slaves, write to the master
- Possibility of reading stale data
In the 2000s
Data Write Rates Increased Dramatically
Unstructured Data explodes

Source: IDC whitepaper, “As the Economy Contracts, the Digital Universe Explodes”, 2009
Key-Value stores don’t do Big Data yet.
Existing Big Data systems currently only
work for a single Data Centre.*

*The usual Google Exception applies


Storage and Processing of Big Data
What is Apache Hadoop?

•Huge data sets and large files
- Gigabyte files, petabyte data sets
- Scales to thousands of nodes on commodity hardware

•No schema required
- Data can be just copied in; extract required columns later

•Fault tolerant

•Network topology-aware, data-location-aware

•Optimized for analytics: high-throughput file access


Hadoop (version 1)

Application

MapReduce

Hadoop Filesystem
HDFS: Hadoop Filesystem

[Diagram: a client writes “/crawler/bot/jd.io/1”; the NameNode tracks under-replicated blocks, receives heartbeats from the DataNodes, rebalances, and re-replicates blocks; blocks 1-6 are spread, with replicas, across the DataNodes]
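The re-replication behaviour in the figure can be illustrated with a toy model in Python (class and method names are invented for illustration; this is not the Hadoop API):

```python
REPLICATION = 3  # HDFS's default replication factor

class ToyNameNode:
    """Toy stand-in for the NameNode: tracks which DataNodes hold each block."""
    def __init__(self, block_map):
        self.block_map = {blk: set(dns) for blk, dns in block_map.items()}

    def datanode_died(self, dead, live_nodes):
        """When heartbeats stop, re-replicate every block the dead node held."""
        for block, holders in self.block_map.items():
            holders.discard(dead)
            while len(holders) < REPLICATION:
                # Pick any live node that lacks the block (topology ignored here).
                target = next(n for n in live_nodes if n not in holders)
                holders.add(target)

nn = ToyNameNode({"blk_1": {"dn1", "dn2", "dn3"},
                  "blk_2": {"dn1", "dn4", "dn5"}})
nn.datanode_died("dn1", live_nodes=["dn2", "dn3", "dn4", "dn5", "dn6"])
print(nn.block_map)  # every block is back at 3 replicas, none on dn1
```

A real NameNode additionally picks re-replication targets with topology awareness (e.g. spreading replicas across racks), which this sketch ignores.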
HDFS v2 Architecture

Active-Standby Replication of the NN Log

- Journal Nodes store the NN log
- Zookeeper Nodes: agreement on the Active NameNode
- Faster recovery: cut the NN log

[Diagram: Active NameNode, Standby NameNode, and Snapshot Node; HDFS clients talk to the Active NameNode; DataNodes below]
HopsFS Architecture

[Diagram: HopsFS clients reach the NameNodes (one of them the Leader) via a Load Balancer; NameNode metadata is stored in NDB; HDFS clients and DataNodes as before]
Processing Big Data
Big Data Processing with No Data Locality
submit Job(“/crawler/bot/jd.io/1”)

Workflow Manager → Compute Grid Node runs the Job

This doesn’t scale: bandwidth is the bottleneck.

[Diagram: the input blocks 1-6 live, replicated, on remote storage nodes and must be shipped over the network to the compute node]
MapReduce – Data Locality
submit Job(“/crawler/bot/jd.io/1”)

Job Tracker

[Diagram: the Job Tracker schedules Job tasks on Task Trackers co-located with the DataNodes (DN) that hold input blocks 1-6; R = result file(s), written locally]
MapReduce*

1. Programming Paradigm

2. Processing Pipeline (moving computation to data)

*Dean et al, OSDI’04


MapReduce Programming Paradigm

map(record) ->
{(key1, value1), ..., (keyn, valuen)}

reduce(keyi, {value1, ..., valuem}) -> output
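These two signatures are enough to express a whole job; a minimal single-machine sketch in Python, with the shuffle done by a dict and word count as the example job:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: each record emits (key, value) pairs.
    groups = defaultdict(list)  # Shuffle phase: group values by key.
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: fold each key's values into one output.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count expressed as map/reduce:
def map_fn(line):
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    return sum(counts)

result = run_mapreduce(["hadoop map reduce", "map reduce"], map_fn, reduce_fn)
print(result)  # {'hadoop': 1, 'map': 2, 'reduce': 2}
```

In real Hadoop the map and reduce phases run in parallel across the cluster and the shuffle moves data over the network; the program structure is the same.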


MapReduce Programming Paradigm

•Also found in:

Functional programming languages

MongoDB

Cassandra
Example: Building a Web Search Index

map(url, doc) ->
{(term1, url), ..., (termn, url)}

reduce(term, {url1, ..., urlk}) ->
(term, (posting list of urls, count))
Example: Building a Web Search Index

map(("jd.io", "A hipster website with news")) ->
{
  emit("a", "jd.io"),
  emit("hipster", "jd.io"),
  emit("website", "jd.io"),
  emit("with", "jd.io"),
  emit("news", "jd.io")
}
Example: Building a Web Search Index

map(("hn.io", "Hacker hipster news")) ->
{
  emit("hacker", "hn.io"),
  emit("hipster", "hn.io"),
  emit("news", "hn.io")
}
Example: Building a Web Search Index

reduce("hipster", {"jd.io", "hn.io"}) ->
("hipster", (["jd.io", "hn.io"], 2))
Example: Building a Web Search Index

reduce("website", {"jd.io"}) ->
("website", (["jd.io"], 1))
Example: Building a Web Search Index

reduce("news", {"jd.io", "hn.io"}) ->
("news", (["jd.io", "hn.io"], 2))
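Running the map and reduce steps above over the two example pages produces exactly these posting lists; a direct Python transcription of the job:

```python
from collections import defaultdict

def map_fn(url, doc):
    # Emit (term, url) for every term on the page.
    return [(term, url) for term in doc.lower().split()]

docs = [("jd.io", "A hipster website with news"),
        ("hn.io", "Hacker hipster news")]

# Shuffle phase: group urls by term.
groups = defaultdict(list)
for url, doc in docs:
    for term, u in map_fn(url, doc):
        groups[term].append(u)

# Reduce phase: emit (term, (posting list, count)).
index = {term: (urls, len(urls)) for term, urls in groups.items()}
print(index["hipster"])  # (['jd.io', 'hn.io'], 2)
print(index["website"])  # (['jd.io'], 1)
```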
Map Phase
MapReduce
map(url, doc) -> {(term1, url), ..., (termn, url)}

[Diagram: Mappers 1-6 run next to the DataNodes (DN) holding blocks 1-6, producing intermediate outputs 1’-6’]
Shuffle Phase
MapReduce
group by term

Shuffle over the network using a Partitioner

[Diagram: intermediate outputs 1’-6’ are partitioned into key ranges A-D, E-H, I-L, M-P, Q-T, U-Z across the DataNodes (DN)]
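The alphabetic ranges in the figure are a range partitioner: every mapper applies the same function to a key to decide which reducer receives it, so all values for one key meet at one reducer. A sketch of such a partitioner (illustrative only; Hadoop's default partitioner actually hashes the key):

```python
RANGES = ["A-D", "E-H", "I-L", "M-P", "Q-T", "U-Z"]

def range_partition(key):
    """Route a term to one of six reducers by its first letter, as in the slide."""
    first = key[0].upper()
    for i, rng in enumerate(RANGES):
        lo, hi = rng.split("-")
        if lo <= first <= hi:
            return i
    return len(RANGES) - 1  # digits/punctuation fall into the last bucket

print(range_partition("hipster"))  # 1 (bucket E-H)
print(range_partition("news"))     # 3 (bucket M-P)
```

Because the function is deterministic, mappers on different machines route the same term to the same reducer without any coordination.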
Reduce Phase
MapReduce
reduce(term, {url1, ..., urlk}) ->
(term, (posting list of urls, count))

[Diagram: Reducers 1-6 each pull one key range (A-D ... U-Z) and write outputs A’-D’ ... U’-Z’ to the DataNodes (DN)]
Hadoop 2.x

Hadoop 1.x: Single Processing Framework (batch apps)
- MapReduce (resource mgmt, job scheduler, data processing)
- HDFS (distributed storage)

Hadoop 2.x: Multiple Processing Frameworks (batch, interactive, streaming, …)
- MapReduce (data processing) and others (Spark, MPI, Giraph, etc.)
- YARN (resource mgmt, job scheduler)
- HDFS (distributed storage)
MapReduce and MPI as YARN Applications

[Murthy et al., “Apache Hadoop YARN: Yet Another Resource Negotiator”, SOCC’13]
Data Locality in Hadoop v2
Limitations of MapReduce [Zaharia ’11]

•MapReduce is based on an acyclic data flow from stable storage to stable storage.
- Slow: writes data to HDFS at every stage in the pipeline

•Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
- Iterative algorithms (machine learning, graphs)
- Interactive data mining tools (R, Excel, Python)

[Diagram: Input → Map / Map / Map → Reduce / Reduce → Output]
Iterative Data Processing Frameworks

val input = TextFile(textInput)

val words = input
  .flatMap { line => line.split(" ") }

val counts = words
  .groupBy { word => word }
  .count()

val output = counts
  .write(wordsOutput, RecordDataSinkFormat())

val plan = new ScalaPlan(Seq(output))

Spark – Resilient Distributed Datasets

•Allow apps to keep working sets in memory for efficient reuse

•Retain the attractive properties of MapReduce
- Fault tolerance, data locality, scalability

•Resilient distributed datasets (RDDs)
- Immutable, partitioned collections of objects
- Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
- Can be cached for efficient reuse

•Actions on RDDs
- Count, reduce, collect, save, …
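The cache-for-reuse idea can be mimicked in plain Python: a lazily evaluated dataset that records transformations and materializes at most once. The class below is an invented toy, not Spark's API; real RDDs are partitioned across workers and recover lost partitions from their lineage:

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, evaluates on demand."""
    def __init__(self, source, op=None, parent=None):
        self.source, self.op, self.parent = source, op, parent
        self._cached = None

    def filter(self, pred):
        return LazyDataset(None, op=lambda xs: [x for x in xs if pred(x)],
                           parent=self)

    def map(self, fn):
        return LazyDataset(None, op=lambda xs: [fn(x) for x in xs],
                           parent=self)

    def cache(self):
        self._cached = self.collect()  # materialize once, reuse afterwards
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached
        if self.parent is None:
            return list(self.source)
        return self.op(self.parent.collect())  # recompute from the lineage

    def count(self):
        return len(self.collect())

# Mirror of the log-mining pattern: filter errors, cache, query repeatedly.
lines = LazyDataset(["ERROR\tdisk\tfoo", "INFO\tok\tbar", "ERROR\tnet\tbar"])
messages = (lines.filter(lambda l: l.startswith("ERROR"))
                 .map(lambda l: l.split("\t")[2])
                 .cache())
print(messages.count())                               # 2
print(messages.filter(lambda m: "bar" in m).count())  # 1
```

Each query after `cache()` reads the in-memory list instead of re-running the whole pipeline, which is exactly the speedup Spark gets for iterative and interactive workloads.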
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile(“hdfs://...”)
errors = lines.filter(_.startsWith(“ERROR”))
messages = errors.map(_.split(‘\t’)(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains(“foo”)).count
cachedMsgs.filter(_.contains(“bar”)).count

[Diagram: the Driver ships tasks to Workers; each Worker reads one block of the file and caches its partition of cachedMsgs, returning results to the Driver]

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data); full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Apache Flink – DataFlow Operators*

Map, Reduce, Join, CoGroup, Union, Iterate, Delta Iterate, FlatMap, GroupReduce, Project, Aggregate, Distinct, Filter, Vertex Update, Accumulators

[Diagram: a dataflow with two Sources feeding Maps, a Join, Reduce steps (one inside an Iterate), and a Sink]

*Alexandrov et al., “The Stratosphere Platform for Big Data Analytics”, VLDB Journal 5/2014
Built-in vs. driver-based looping

•Loop outside the system, in the driver program (client)
- The iterative program looks like many independent jobs

•Dataflows with feedback (reduction) edges (Flink)
- The system is iteration-aware and can optimize the job
Hadoop on the Cloud

•Cloud Computing traditionally separates storage and computation.

- Amazon Web Services: EC2 (compute), Elastic Block Storage, S3 (object storage)
- OpenStack: Nova (Compute), Swift (Object Storage), Glance (VM Images)
Data Locality for Hadoop on the Cloud

•Cloud hardware configurations should support data locality

•Hadoop’s original topology awareness breaks

•Placement of >1 VM containing block replicas for the same file on the same physical host increases correlated failures

•VMware introduced a NodeGroup-aware topology (HADOOP-8468)
Conclusions

•Hadoop is the open-source enabling technology for Big Data

•YARN is rapidly becoming the operating system for the Data Center

•Apache Spark and Flink are in-memory processing frameworks for Hadoop
References

•Dean et al., “MapReduce: Simplified Data Processing on Large Clusters”, OSDI’04.

•Shvachko, “HDFS Scalability: The Limits to Growth”, USENIX ;login:, April 2010.

•Murthy et al., “Apache Hadoop YARN: Yet Another Resource Negotiator”, SOCC’13.

•“Processing a Trillion Cells per Mouse Click”, VLDB’12.
