Advanced Spark
Reynold Xin, July 2, 2014 @ Spark Summit Training
This Talk
Formalize RDD concept
Life of a Spark Application
Performance Debugging
Assumes you can write word count and know what a transformation/action is
Mechanical sympathy (Jackie Stewart): a driver does not need to know how to build an engine, but they need to know the fundamentals of how one works to get the best out of it
Reynold Xin
Apache Spark committer (worked on almost every
module: core, sql, mllib, graph)
Product & open-source eng @ Databricks
On leave from PhD @ UC Berkeley AMPLab
Example Application
val sc = new SparkContext(...)
val file = sc.textFile("hdfs://...")             // file, errors: resilient distributed datasets (RDDs)
val errors = file.filter(_.contains("ERROR"))
errors.cache()
errors.count()                                   // action
Quiz: what is an RDD?
A: distributed collection of objects on disk
B: distributed collection of objects in memory
C: distributed collection of objects in Cassandra
Answer: could be any of the above!
Scientific Answer: RDD is an Interface!
1. Set of partitions ("splits" in Hadoop)
2. List of dependencies on parent RDDs
3. Function to compute a partition (as an Iterator) given its parent(s)
4. (Optional) partitioner (hash, range)
5. (Optional) preferred location(s) for each partition
Properties 1-3 capture the lineage; 4-5 enable optimized execution.
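A minimal sketch of this interface in Scala (simplified; the real org.apache.spark.rdd.RDD class adds a TaskContext parameter, ClassTags, and more machinery):

import org.apache.spark.{Dependency, Partition, Partitioner}

// Simplified sketch of the five properties -- not the actual Spark class.
trait SimpleRDD[T] {
  def partitions: Array[Partition]                              // 1. set of partitions
  def dependencies: Seq[Dependency[_]]                          // 2. lineage: parent RDDs
  def compute(split: Partition): Iterator[T]                    // 3. compute one partition
  def partitioner: Option[Partitioner] = None                   // 4. optional partitioner
  def preferredLocations(split: Partition): Seq[String] = Nil   // 5. optional locality hints
}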
Example: HadoopRDD
partitions = one per HDFS block
dependencies = none
compute(part) = read corresponding block
preferredLocations(part) = HDFS block location
partitioner = none
Example: Filtered RDD
partitions = same as parent RDD
dependencies = one-to-one on parent
compute(part) = compute parent and filter it
preferredLocations(part) = none (ask parent)
partitioner = none
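As a rough sketch of how those properties map to code, a filter could be written as an RDD subclass along these lines (a simplified, hypothetical re-implementation, not Spark's internal FilteredRDD):

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

class MyFilteredRDD[T: ClassTag](prev: RDD[T], pred: T => Boolean)
  extends RDD[T](prev) {  // the one-arg RDD constructor sets up a one-to-one dependency on prev

  // partitions = same as parent RDD
  override def getPartitions: Array[Partition] = prev.partitions

  // compute(part) = compute the parent partition, then filter it
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    prev.iterator(split, context).filter(pred)

  // partitioner and preferredLocations keep their defaults (none / ask parent)
}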
RDD Graph (DAG of tasks)
Dataset-level view:
> file:   HadoopRDD (path = hdfs://...)
> errors: FilteredRDD (func = _.contains(), shouldCache = true)
Partition-level view:
> Task1, Task2, ...
Example: JoinedRDD
partitions = one per reduce task
dependencies = shuffle on each parent
compute(partition) = read and join shuffled data
preferredLocations(part) = none
partitioner = HashPartitioner(numTasks)
Spark will now know this data is hashed!
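For example (a shell-style sketch assuming an existing SparkContext sc): pre-hashing both sides with the same partitioner lets the later join reuse that partitioning instead of shuffling again.

import org.apache.spark.HashPartitioner

val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(8)).cache()
val right = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(new HashPartitioner(8)).cache()

// both parents share the HashPartitioner, so the join's dependencies are narrow
val joined = left.join(right)
joined.collect()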
Dependency Types
Narrow (pipeline-able):
> map, filter
> union
> join with inputs co-partitioned
Wide (shuffle):
> groupByKey on non-partitioned data
> join with inputs not co-partitioned
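A small shell-style sketch of the difference (assuming an existing SparkContext sc); the lineage printed by toDebugString should show a shuffle boundary only in the wide case:

val pairs  = sc.parallelize(1 to 1000).map(i => (i % 10, i))

val narrow = pairs.filter { case (_, v) => v > 5 }   // narrow: pipelined with the map
val wide   = pairs.groupByKey()                      // wide: requires a shuffle

println(narrow.toDebugString)   // no shuffle in the lineage
println(wide.toDebugString)     // lineage includes a ShuffledRDD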
Recap
Each RDD consists of 5 properties:
1. partitions
2. dependencies
3. compute
4. (optional) partitioner
5. (optional) preferred locations
Life of a Spark Application
Spark Application
Your program (JVM / Python):
> sc = new SparkContext(...)
> f = sc.textFile(...)
> f.filter(...).count()
> ...
Spark driver (app master): RDD graph, scheduler, block tracker, shuffle tracker
Spark executors (multiple): task threads, block manager
Cluster manager
Storage: HDFS, HBase, ...
A single application often contains multiple actions
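A minimal sketch of that last point (assumes the master is supplied by spark-submit; the path is a placeholder): each action becomes a separate job within the same application in the web UI.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("multi-action-app"))
val logs = sc.textFile("hdfs://...").cache()   // placeholder path

logs.count()                                   // action 1 -> job 1
logs.filter(_.contains("ERROR")).count()       // action 2 -> job 2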
Job Scheduling Process
RDD Objects -> Scheduler (DAGScheduler) -> Executors
> RDD Objects: build the operator DAG, e.g. rdd1.join(rdd2).groupBy().filter().count()
> Scheduler (DAGScheduler): split the graph into stages of tasks; submit each stage as ready
> Executors (task threads + block manager): execute tasks; store and serve blocks
DAG Scheduler
Input: RDD and partitions to compute
Output: output from actions on those partitions
Roles:
> Build stages of tasks
> Submit them to lower level scheduler (e.g. YARN,
Mesos, Standalone) as ready
> Lower level scheduler schedules tasks based on data locality
> Resubmit failed stages if outputs are lost
Scheduler Optimizations
Pipelines operations within a stage
Picks join algorithms based on partitioning (minimize shuffles)
Reuses previously cached data
[Figure: RDD DAG with RDDs A-G grouped into Stage 1 (groupBy), Stage 2 (map, union), and Stage 3 (join); shaded boxes = previously computed partitions.]
Task
Unit of work to execute in an executor thread
Unlike MR, there is no map vs reduce task
Each task either partitions its output for a shuffle, or sends the output back to the driver
Shuffle
Redistributes data among partitions
Partition keys into buckets (user-defined partitioner)
Optimizations:
> Avoided when possible, if data is already properly partitioned
> Partial aggregation reduces data movement (see the sketch below)
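A shell-style sketch of partial aggregation (assuming an existing SparkContext sc): reduceByKey combines values within each partition before the shuffle, while groupByKey ships every record.

val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, 1))

val viaGroup  = pairs.groupByKey().mapValues(_.sum)   // shuffles every (key, 1) record
val viaReduce = pairs.reduceByKey(_ + _)              // partial sums computed before the shuffle

viaReduce.collect()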
Shuffle
Write intermediate files to disk
Fetched by the next stage of tasks (reduce in MR)
Recap: Job Scheduling
RDD Objects -> Scheduler (DAGScheduler) -> Executors
> RDD Objects: build the operator DAG, e.g. rdd1.join(rdd2).groupBy().filter().count()
> Scheduler (DAGScheduler): split the graph into stages of tasks; submit each stage as ready
> Executors (task threads + block manager): execute tasks; store and serve blocks
Performance Debugging
Performance Debugging
Distributed performance: program slow due to scheduling, coordination, or data distribution
Local performance: program slow because whatever I'm running is just slow on a single node
Two useful tools:
> Application web UI (default port 4040)
> Executor logs (spark/work)
Find Slow Stage(s)
Stragglers?
Some tasks are just slower than others.
Easy to identify from the summary metrics in the web UI.
Stragglers due to slow nodes
sc.parallelize(1 to 15, 15).map { index =>
  val host = java.net.InetAddress.getLocalHost.getHostName
  if (host == "ip-172-31-2-222") {
    Thread.sleep(10000)
  } else {
    Thread.sleep(1000)
  }
}.count()
Stragglers due to slow nodes
Turn speculation on to mitigate this problem.
Speculation: Spark identifies slow tasks (by looking
at runtime distribution), and re-launches those tasks
on other nodes.
spark.speculation true
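The same setting applied programmatically (a sketch; the quantile shown is the documented default, not a recommendation from this talk):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("speculation-demo")
  .set("spark.speculation", "true")            // re-launch suspiciously slow tasks elsewhere
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before speculating
val sc = new SparkContext(conf)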
Demo Time: slow node
Stragglers due to data skew
sc.parallelize(1 to 15, 15)
  .flatMap { i => 1 to i }
  .map { i => Thread.sleep(1000) }
  .count()
Speculation is not going to help because the
problem is inherent in the algorithm/data.
Pick a different algorithm or restructure the data.
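One possible restructuring for this toy example (a sketch, at the cost of an extra shuffle): rebalance the skewed flatMap output before doing the expensive per-record work.

sc.parallelize(1 to 15, 15)
  .flatMap { i => 1 to i }
  .repartition(15)                    // spread the skewed output evenly across 15 partitions
  .map { i => Thread.sleep(1000) }
  .count()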
Demo Time
Tasks are just slow
Garbage collection
Performance of the code running in each task
Garbage Collection
Look at the GC Time column in the web UI
What if the task is still running?
To discover whether GC is the problem:
1. Set spark.executor.extraJavaOptions to include:
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps
2. Look at spark/work/app/[n]/stdout on
executors
3. Short GC times are OK. Long ones are bad.
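For example (a sketch in the same programmatic style; equivalently, put the line in spark-defaults.conf):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")   // GC details end up in each executor's stdout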
jmap: heap analysis
jmap -histo [pid]
Gets a histogram of objects in the JVM heap
jmap -histo:live [pid]
Gets a histogram of objects in the heap after GC
(thus live)
Find out which objects are causing the trouble
Demo: GC log & jmap
Reduce GC impact
class DummyObject(var i: Int) {
  def toInt = i
}

// Allocates a new object for every record (heavy GC pressure)
sc.parallelize(1 to 100 * 1000 * 1000, 1).map { i =>
  val obj = new DummyObject(i)  // new object every record
  obj.toInt
}

// Reuses one object per partition (much less GC pressure)
sc.parallelize(1 to 100 * 1000 * 1000, 1).mapPartitions { iter =>
  val obj = new DummyObject(0)  // reuse the same object
  iter.map { i =>
    obj.i = i
    obj.toInt
  }
}
Local Performance
Each Spark executor runs a JVM/Python process
Insert your favorite JVM/Python profiling tool:
> jstack
> YourKit
> VisualVM
> println
> (sorry, I don't know a whole lot about Python)
Example: identify expensive computation
def someCheapComputation(record: Int): Int = record + 1
def someExpensiveComputation(record: Int): String = {
  Thread.sleep(1000)
  record.toString
}

sc.parallelize(1 to 100000).map { record =>
  val step1 = someCheapComputation(record)
  val step2 = someExpensiveComputation(step1)
  step2
}.saveAsTextFile("hdfs:/tmp1")
Demo Time
jstack
Can often pinpoint problems just by running jstack a few times
YourKit (free for open source dev)
Debugging Tip
Local Debugging
Run in local mode (i.e. Spark master local) and
debug with your favorite debugger
> IntelliJ
> Eclipse
> println
With a sample dataset
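A sketch of that setup (the sample path is a placeholder): build the context against master "local[*]" so breakpoints inside transformations hit in a single JVM.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("local-debug")
val sc = new SparkContext(conf)

val sample = sc.textFile("data/sample.txt")            // hypothetical small sample dataset
println(sample.filter(_.contains("ERROR")).count())    // set a breakpoint inside the filter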
What have we learned?
RDD abstraction
> lineage info: partitions, dependencies, compute
> optimization info: partitioner, preferred locations
Execution process (from RDD to tasks)
Performance & debugging
Thank You!