
Introduction to Hadoop

ID2210
Jim Dowling
Large Scale Distributed Computing
In #Nodes
- BitTorrent (millions)
- Peer-to-Peer

In #Instructions/sec
- Teraflops, Petaflops, Exascale
- Super-Computing

In #Bytes stored
- Facebook: 300+ Petabytes (April 2014)*
- Hadoop

In #Bytes processed/time
- Google processed 24 petabytes of data per day in 2013
- Colossus, Spanner, BigQuery, BigTable, Borg, Omega, ..

*http://www.adweek.com/socialtimes/orcfile/434041
Where does Big Data Come From?

•On-line services: PBs per day

•Scientific instruments: PBs per minute

•Whole-genome sequencing: 250 GB per person

•Internet-of-Things: will be lots!


What is Big Data?

[Figure: Small Data vs. Big Data]


Why is Big Data “hot”?

•Companies like Google and Facebook have shown how to extract value from Big Data

- Orbitz looks for higher prices from Safari users [WSJ’12]
Why is Big Data “hot”?

•Big Data helped Obama win the 2012 election through data-driven decision making*

- Data said: middle-aged females like contests, dinners and celebrity

*http://swampland.time.com/2012/11/07/inside-the-secret-world-of-quants-and-data-crunchers-who-helped-obama-win/
Why is Big Data Important in Science?

•In a wide array of academic fields, the ability to effectively process data is superseding other, more classical modes of research.

“More data trumps better algorithms”*

*“The Unreasonable Effectiveness of Data” [Halevy et al. ’09]


4 Vs of Big Data

•Volume

•Velocity

•Variety

•Veracity/Variability/Value
A quick historical tour of data systems
Batch Sequential Processing

Scan → Sort

IBM 082 Punch Card Sorter (no fault tolerance)


1960s
First Database Management Systems
[Diagram: COBOL application on top of the DBMS]

Hierarchical and Network Database Management Systems

- You had to know what data you wanted, and how to find it
- Early DBMSs were disk-aware
Codd's Relational Model

Just tell me the data you want; the system will find it.
SystemR

CREATE TABLE Students(
  id INT PRIMARY KEY,
  firstname VARCHAR(96),
  lastname VARCHAR(96)
);

SELECT * FROM Students WHERE id > 10;

[Diagram: layered architecture: Views; Relations (Structured Query Language); Disk Access Methods and Indexes; Disk]
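The declarative idea above ("just say what you want") can be tried directly against any relational engine; a minimal sketch using Python's built-in sqlite3 module (the row values are invented for illustration):

```python
import sqlite3

# In-memory database: we declare *what* data we want; the engine finds it.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Students(
    id INTEGER PRIMARY KEY,
    firstname VARCHAR(96),
    lastname VARCHAR(96))""")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?)",
    [(5, "Ada", "Lovelace"), (11, "Edgar", "Codd"), (12, "Jim", "Gray")])

# The query optimizer decides how to find the matching rows.
rows = conn.execute(
    "SELECT * FROM Students WHERE id > 10 ORDER BY id").fetchall()
print(rows)  # only the rows with id 11 and 12 qualify
```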
Finding the Data using a Query Optimizer
Each color represents a program in this plan diagram.

•Each program produces the same result for the query.

•Each program has different performance characteristics, depending on changes in the data characteristics.
Data Characteristics Change
What if I have lots of Concurrent Queries?

•Data Integrity using Transactions*

ACID
Atomicity Consistency Isolation Durability

*Jim Gray, ”The Transaction Concept: Virtues and Limitations”


In the 1990s
Data Read Rates Increased Dramatically
Distribute within a Data Center
Master-Slave Replication

Data-location awareness is back:

- Clients read from slaves, write to the master
- Possibility of reading stale data
In the 2000s
Data Write Rates Increased Dramatically
Unstructured Data explodes

Source: IDC whitepaper, “As the Economy Contracts, the Digital Universe Explodes”, 2009
Key-Value stores don’t do Big Data yet.
Existing Big Data systems currently only
work for a single Data Centre.*

*The usual Google Exception applies


Storage and Processing of Big Data
What is Apache Hadoop?

•Huge data sets and large files
- Gigabyte files, petabyte data sets
- Scales to thousands of nodes on commodity hardware

•No schema required
- Data can be just copied in; extract required columns later

•Fault tolerant

•Network topology-aware, data-location-aware

•Optimized for analytics: high-throughput file access


Hadoop (version 1)

Application

MapReduce

Hadoop Filesystem
HDFS: Hadoop Filesystem

[Diagram: a client writes “/crawler/bot/jd.io/1”; the NameNode tracks under-replicated blocks, receives heartbeats from the DataNodes, rebalances, and re-replicates blocks; blocks 1-6 are spread, with replicas, across the DataNodes]
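The re-replication behaviour in the figure can be illustrated with a toy model in Python (class and method names are invented for illustration; this is not the Hadoop API):

```python
REPLICATION = 3  # HDFS's default replication factor

class ToyNameNode:
    """Toy stand-in for the NameNode: tracks which DataNodes hold each block."""
    def __init__(self, block_map):
        self.block_map = {blk: set(dns) for blk, dns in block_map.items()}

    def datanode_died(self, dead, live_nodes):
        """When heartbeats stop, re-replicate every block the dead node held."""
        for block, holders in self.block_map.items():
            holders.discard(dead)
            while len(holders) < REPLICATION:
                # Pick any live node that lacks the block (topology ignored here).
                target = next(n for n in live_nodes if n not in holders)
                holders.add(target)

nn = ToyNameNode({"blk_1": {"dn1", "dn2", "dn3"},
                  "blk_2": {"dn1", "dn4", "dn5"}})
nn.datanode_died("dn1", live_nodes=["dn2", "dn3", "dn4", "dn5", "dn6"])
print(nn.block_map)  # every block is back at 3 replicas, none on dn1
```

A real NameNode additionally picks re-replication targets with topology awareness (e.g. spreading replicas across racks), which this sketch ignores.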
HDFS v2 Architecture

Active-Standby Replication of the NN Log

- Journal Nodes store the NN log
- Zookeeper Nodes: agreement on the Active NameNode
- Faster recovery: cut the NN log

[Diagram: Active NameNode, Standby NameNode, and Snapshot Node; HDFS clients talk to the Active NameNode; DataNodes below]
HopsFS Architecture

[Diagram: HopsFS clients reach the NameNodes (one of them the Leader) via a Load Balancer; NameNode metadata is stored in NDB; HDFS clients and DataNodes as before]
Processing Big Data
Big Data Processing with No Data Locality
submit Job(“/crawler/bot/jd.io/1”)

Workflow Manager → Compute Grid Node runs the Job

This doesn’t scale: bandwidth is the bottleneck.

[Diagram: the input blocks 1-6 live, replicated, on remote storage nodes and must be shipped over the network to the compute node]
MapReduce – Data Locality
submit Job(“/crawler/bot/jd.io/1”)

Job Tracker

[Diagram: the Job Tracker schedules Job tasks on Task Trackers co-located with the DataNodes (DN) that hold input blocks 1-6; R = result file(s), written locally]
MapReduce*

1. Programming Paradigm

2. Processing Pipeline (moving computation to data)

*Dean et al, OSDI’04


MapReduce Programming Paradigm

map(record) ->
{(key1, value1), ..., (keyn, valuen)}

reduce(keyi, {value1, ..., valuem}) -> output
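These two signatures are enough to express a whole job; a minimal single-machine sketch in Python, with the shuffle done by a dict and word count as the example job:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: each record emits (key, value) pairs.
    groups = defaultdict(list)  # Shuffle phase: group values by key.
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: fold each key's values into one output.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count expressed as map/reduce:
def map_fn(line):
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    return sum(counts)

result = run_mapreduce(["hadoop map reduce", "map reduce"], map_fn, reduce_fn)
print(result)  # {'hadoop': 1, 'map': 2, 'reduce': 2}
```

In real Hadoop the map and reduce phases run in parallel across the cluster and the shuffle moves data over the network; the program structure is the same.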


MapReduce Programming Paradigm

•Also found in:

Functional programming languages

MongoDB

Cassandra
Example: Building a Web Search Index

map(url, doc) ->
{(term1, url), ..., (termn, url)}

reduce(term, {url1, ..., urlk}) ->
(term, (posting list of urls, count))
Example: Building a Web Search Index

map(("jd.io", "A hipster website with news")) ->
{
  emit("a", "jd.io"),
  emit("hipster", "jd.io"),
  emit("website", "jd.io"),
  emit("with", "jd.io"),
  emit("news", "jd.io")
}
Example: Building a Web Search Index

map(("hn.io", "Hacker hipster news")) ->
{
  emit("hacker", "hn.io"),
  emit("hipster", "hn.io"),
  emit("news", "hn.io")
}
Example: Building a Web Search Index

reduce("hipster", {"jd.io", "hn.io"}) ->
("hipster", (["jd.io", "hn.io"], 2))
Example: Building a Web Search Index

reduce("website", {"jd.io"}) ->
("website", (["jd.io"], 1))
Example: Building a Web Search Index

reduce("news", {"jd.io", "hn.io"}) ->
("news", (["jd.io", "hn.io"], 2))
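Running the map and reduce steps above over the two example pages produces exactly these posting lists; a direct Python transcription of the job:

```python
from collections import defaultdict

def map_fn(url, doc):
    # Emit (term, url) for every term on the page.
    return [(term, url) for term in doc.lower().split()]

docs = [("jd.io", "A hipster website with news"),
        ("hn.io", "Hacker hipster news")]

# Shuffle phase: group urls by term.
groups = defaultdict(list)
for url, doc in docs:
    for term, u in map_fn(url, doc):
        groups[term].append(u)

# Reduce phase: emit (term, (posting list, count)).
index = {term: (urls, len(urls)) for term, urls in groups.items()}
print(index["hipster"])  # (['jd.io', 'hn.io'], 2)
print(index["website"])  # (['jd.io'], 1)
```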
Map Phase
MapReduce
map(url, doc) -> {(term1, url), ..., (termn, url)}

[Diagram: Mappers 1-6 run next to the DataNodes (DN) holding blocks 1-6, producing intermediate outputs 1’-6’]
Shuffle Phase
MapReduce
group by term

Shuffle over the network using a Partitioner

[Diagram: intermediate outputs 1’-6’ are partitioned into key ranges A-D, E-H, I-L, M-P, Q-T, U-Z across the DataNodes (DN)]
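The alphabetic ranges in the figure are a range partitioner: every mapper applies the same function to a key to decide which reducer receives it, so all values for one key meet at one reducer. A sketch of such a partitioner (illustrative only; Hadoop's default partitioner actually hashes the key):

```python
RANGES = ["A-D", "E-H", "I-L", "M-P", "Q-T", "U-Z"]

def range_partition(key):
    """Route a term to one of six reducers by its first letter, as in the slide."""
    first = key[0].upper()
    for i, rng in enumerate(RANGES):
        lo, hi = rng.split("-")
        if lo <= first <= hi:
            return i
    return len(RANGES) - 1  # digits/punctuation fall into the last bucket

print(range_partition("hipster"))  # 1 (bucket E-H)
print(range_partition("news"))     # 3 (bucket M-P)
```

Because the function is deterministic, mappers on different machines route the same term to the same reducer without any coordination.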
Reduce Phase
MapReduce
reduce(term, {url1, ..., urlk}) ->
(term, (posting list of urls, count))

[Diagram: Reducers 1-6 each pull one key range (A-D ... U-Z) and write outputs A’-D’ ... U’-Z’ to the DataNodes (DN)]
Hadoop 2.x

Hadoop 1.x: Single Processing Framework (batch apps)
- MapReduce (resource mgmt, job scheduler, data processing)
- HDFS (distributed storage)

Hadoop 2.x: Multiple Processing Frameworks (batch, interactive, streaming, …)
- MapReduce (data processing) and others (Spark, MPI, Giraph, etc.)
- YARN (resource mgmt, job scheduler)
- HDFS (distributed storage)
MapReduce and MPI as YARN Applications

[Murthy et al., “Apache Hadoop YARN: Yet Another Resource Negotiator”, SOCC’13]
Data Locality in Hadoop v2
Limitations of MapReduce [Zaharia ’11]

•MapReduce is based on an acyclic data flow from stable storage to stable storage.
- Slow: writes data to HDFS at every stage in the pipeline

•Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
- Iterative algorithms (machine learning, graphs)
- Interactive data mining tools (R, Excel, Python)

[Diagram: Input → Map / Map / Map → Reduce / Reduce → Output]
Iterative Data Processing Frameworks

val input = TextFile(textInput)

val words = input
  .flatMap { line => line.split(" ") }

val counts = words
  .groupBy { word => word }
  .count()

val output = counts
  .write(wordsOutput, RecordDataSinkFormat())

val plan = new ScalaPlan(Seq(output))

Spark – Resilient Distributed Datasets

•Allow apps to keep working sets in memory for efficient reuse

•Retain the attractive properties of MapReduce
- Fault tolerance, data locality, scalability

•Resilient distributed datasets (RDDs)
- Immutable, partitioned collections of objects
- Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
- Can be cached for efficient reuse

•Actions on RDDs
- Count, reduce, collect, save, …
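The cache-for-reuse idea can be mimicked in plain Python: a lazily evaluated dataset that records transformations and materializes at most once. The class below is an invented toy, not Spark's API; real RDDs are partitioned across workers and recover lost partitions from their lineage:

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, evaluates on demand."""
    def __init__(self, source, op=None, parent=None):
        self.source, self.op, self.parent = source, op, parent
        self._cached = None

    def filter(self, pred):
        return LazyDataset(None, op=lambda xs: [x for x in xs if pred(x)],
                           parent=self)

    def map(self, fn):
        return LazyDataset(None, op=lambda xs: [fn(x) for x in xs],
                           parent=self)

    def cache(self):
        self._cached = self.collect()  # materialize once, reuse afterwards
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached
        if self.parent is None:
            return list(self.source)
        return self.op(self.parent.collect())  # recompute from the lineage

    def count(self):
        return len(self.collect())

# Mirror of the log-mining pattern: filter errors, cache, query repeatedly.
lines = LazyDataset(["ERROR\tdisk\tfoo", "INFO\tok\tbar", "ERROR\tnet\tbar"])
messages = (lines.filter(lambda l: l.startswith("ERROR"))
                 .map(lambda l: l.split("\t")[2])
                 .cache())
print(messages.count())                               # 2
print(messages.filter(lambda m: "bar" in m).count())  # 1
```

Each query after `cache()` reads the in-memory list instead of re-running the whole pipeline, which is exactly the speedup Spark gets for iterative and interactive workloads.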
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile(“hdfs://...”)
errors = lines.filter(_.startsWith(“ERROR”))
messages = errors.map(_.split(‘\t’)(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains(“foo”)).count
cachedMsgs.filter(_.contains(“bar”)).count

[Diagram: the Driver ships tasks to Workers; each Worker reads one block of the file and caches its partition of cachedMsgs, returning results to the Driver]

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data); full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Apache Flink – DataFlow Operators*

Map, Reduce, Join, CoGroup, Union, Iterate, Delta Iterate, FlatMap, GroupReduce, Project, Aggregate, Distinct, Filter, Vertex Update, Accumulators

[Diagram: a dataflow with two Sources feeding Maps, a Join, Reduce steps (one inside an Iterate), and a Sink]

*Alexandrov et al., “The Stratosphere Platform for Big Data Analytics”, VLDB Journal 5/2014
Built-in vs. driver-based looping

•Loop outside the system, in the driver program (client)
- The iterative program looks like many independent jobs

•Dataflows with feedback (reduction) edges (Flink)
- The system is iteration-aware and can optimize the job
Hadoop on the Cloud

•Cloud Computing traditionally separates storage and computation.

- Amazon Web Services: EC2 (compute), Elastic Block Storage, S3 (object storage)
- OpenStack: Nova (Compute), Swift (Object Storage), Glance (VM Images)
Data Locality for Hadoop on the Cloud

•Cloud hardware configurations should support data locality

•Hadoop’s original topology awareness breaks

•Placement of >1 VM containing block replicas for the same file on the same physical host increases correlated failures

•VMware introduced a NodeGroup-aware topology (HADOOP-8468)
Conclusions

•Hadoop is the open-source enabling technology for Big Data

•YARN is rapidly becoming the operating system for the Data Center

•Apache Spark and Flink are in-memory processing frameworks for Hadoop
References

•Dean et al., “MapReduce: Simplified Data Processing on Large Clusters”, OSDI’04.

•Shvachko, “HDFS Scalability: The Limits to Growth”, USENIX ;login:, April 2010.

•Murthy et al., “Apache Hadoop YARN: Yet Another Resource Negotiator”, SOCC’13.

•“Processing a Trillion Cells per Mouse Click”, VLDB’12.
