Parallel Programming With Spark

Matei Zaharia
UC Berkeley
www.spark-project.org
 
What is Spark?

Fast and expressive cluster computing system compatible with Apache Hadoop
- Works with any Hadoop-supported storage system and data format (HDFS, S3, SequenceFile, ...)

Improves efficiency through:
- In-memory computing primitives
- General computation graphs
- As much as 30× faster

Improves usability through rich Scala and Java APIs and an interactive shell
- Often 2-10× less code
 
How to Run It

- Local multicore: just a library in your program
- EC2: scripts for launching a Spark cluster
- Private cluster: Mesos, YARN*, standalone*

*Coming soon in Spark 0.6
 
Scala vs Java APIs

- Spark originally written in Scala, which allows concise function syntax and interactive use
- Recently added Java API for standalone apps (dev branch on GitHub)
- Interactive shell still in Scala
- This course: mostly Scala, with translations to Java
 
Outline

- Introduction to Scala & functional programming
- Spark concepts
- Tour of Spark operations
- Job execution
 
About Scala

- High-level language for the Java VM
  - Object-oriented + functional programming
- Statically typed
  - Comparable in speed to Java
  - But often no need to write types, due to type inference
- Interoperates with Java
  - Can use any Java class, inherit from it, etc.; can also call Scala code from Java
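
A small sketch of those last two points (type inference and Java interop); the HashMap example is illustrative, not from the slides:

import java.util.{HashMap => JHashMap}     // any Java class can be used directly

val counts = new JHashMap[String, Int]()   // type written once, on the right-hand side
counts.put("spark", 1)

val doubled = counts.get("spark") * 2      // doubled is inferred to be Int
println(doubled)                           // prints 2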
 
Best Way to Learn Scala

- Interactive shell: just type scala
- Supports importing libraries, tab completion, and all constructs in the language
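
For example, a hypothetical session (output elided):

$ scala
scala> import scala.collection.mutable.ArrayBuffer
scala> val buf = ArrayBuffer(1, 2, 3)
scala> buf.sum        // => 6
scala> buf.re<Tab>    // tab completion lists reduce, reverse, ...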
 
Quick Tour

Declaring variables:

var x: Int = 7
var x = 7          // type inferred
val y = "hi"       // read-only

Java equivalent:

int x = 7;
final String y = "hi";

Functions:

def square(x: Int): Int = x*x

def square(x: Int): Int = {
  x*x              // last expression in block is returned
}

def announce(text: String) {
  println(text)
}

Java equivalent:

int square(int x) {
  return x*x;
}

void announce(String text) {
  System.out.println(text);
}
Quick Tour

Generic types:

var arr = new Array[Int](8)
var lst = List(1, 2, 3)   // type of lst is List[Int]; List(...) is a factory method

Java equivalent:

int[] arr = new int[8];
List<Integer> lst =               // can't hold primitive types
  new ArrayList<Integer>();
lst.add(...)

Indexing:

arr(5) = 7
println(lst(5))

Java equivalent:

arr[5] = 7;
System.out.println(lst.get(5));
Quick Tour

Processing collections with functional programming:

val list = List(1, 2, 3)

list.foreach(x => println(x))   // prints 1, 2, 3; x => println(x) is a function expression (closure)
list.foreach(println)           // same

list.map(x => x + 2)            // => List(3, 4, 5)
list.map(_ + 2)                 // same, with placeholder notation

list.filter(x => x % 2 == 1)    // => List(1, 3)
list.filter(_ % 2 == 1)         // => List(1, 3)

list.reduce((x, y) => x + y)    // => 6
list.reduce(_ + _)              // => 6

All of these leave the list unchanged (List is immutable).
 
Scala Closure Syntax

(x: Int) => x + 2   // full version
x => x + 2          // type inferred
_ + 2               // when each argument is used exactly once
x => {              // when body is a block of code
  val numberToAdd = 2
  x + numberToAdd
}

// If a closure is too long, can always pass a named function
def addTwo(x: Int): Int = x + 2
list.map(addTwo)

Scala also allows defining a local function inside another function.
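
A quick sketch of that last point:

def addAll(list: List[Int], amount: Int): List[Int] = {
  def add(x: Int): Int = x + amount   // local function, closes over amount
  list.map(add)
}

addAll(List(1, 2, 3), 2)   // => List(3, 4, 5)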
 
Other Collection Methods

Scala collections provide many other functional methods; for example, Google for "Scala Seq".

Method on Seq[T]                      Explanation
map(f: T => U): Seq[U]                Pass each element through f
flatMap(f: T => Seq[U]): Seq[U]       One-to-many map
filter(f: T => Boolean): Seq[T]       Keep elements passing f
exists(f: T => Boolean): Boolean      True if one element passes f
forall(f: T => Boolean): Boolean      True if all elements pass f
reduce(f: (T, T) => T): T             Merge elements using f
groupBy(f: T => K): Map[K, List[T]]   Group elements by f(element)
sortBy(f: T => K): Seq[T]             Sort elements by f(element)
...
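
A quick sketch of a few of these on a plain local Seq:

val words = Seq("to", "be", "or", "not", "to", "be")

words.exists(_ == "be")    // => true
words.groupBy(_.head)      // groups words by first letter: 't' -> List(to, to), 'b' -> List(be, be), ...
words.sortBy(_.length)     // => List(to, be, or, to, be, not)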
Outline

- Introduction to Scala & functional programming
- Spark concepts
- Tour of Spark operations
- Job execution
 
Spark Overview

Goal: work with distributed collections as you would with local ones

Concept: resilient distributed datasets (RDDs)
- Immutable collections of objects spread across a cluster
- Built through parallel transformations (map, filter, etc.)
- Automatically rebuilt on failure
- Controllable persistence (e.g. caching in RAM) for reuse
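
A minimal sketch of that goal, assuming sc is the SparkContext that spark-shell provides (introduced later in these slides):

// local Scala collection
List(1, 2, 3).map(_ * 2).filter(_ > 2)                              // => List(4, 6)

// distributed RDD: the same style of code
sc.parallelize(List(1, 2, 3)).map(_ * 2).filter(_ > 2).collect()    // => Array(4, 6)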
 
Main Primitives

Resilient distributed datasets (RDDs)
- Immutable, partitioned collections of objects

Transformations (e.g. map, filter, groupBy, join)
- Lazy operations to build RDDs from other RDDs

Actions (e.g. count, collect, save)
- Return a result or write it to storage
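
A small sketch of the lazy-transformation / eager-action split, again assuming an sc from spark-shell:

val nums = sc.parallelize(1 to 1000)
val evens = nums.filter(_ % 2 == 0)   // transformation: nothing runs yet
evens.count()                         // action: triggers the computation, => 500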
 
Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

val lines = spark.textFile("hdfs://...")            // base RDD
val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
val messages = errors.map(_.split("\t")(2))
messages.cache()

messages.filter(_.contains("foo")).count            // action: driver ships tasks to workers
messages.filter(_.contains("bar")).count            // later queries hit the in-memory cache

[Diagram: the driver sends tasks to workers, which read their input blocks (Block 1-3) from storage, keep the filtered messages cached in RAM (Cache 1-3), and return results to the driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
 
RDD Fault Tolerance

RDDs track the series of transformations used to build them (their lineage) to recompute lost data.

E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split("\t")(2))

HadoopRDD (path = hdfs://...)  ->  FilteredRDD (func = _.contains(...))  ->  MappedRDD (func = _.split())
 
Fault Recovery Test

[Chart: iteration time (s) over 10 iterations of a job; one iteration takes 119 s, the iteration marked "Failure happens" takes 81 s while lost data is recomputed, and the remaining iterations take 56-59 s.]
 
Behavior with Less RAM

[Chart: iteration time (s) vs. % of working set in cache: roughly 69 s with the cache disabled, 58 s at 25%, 41 s at 50%, 30 s at 75%, and 12 s fully cached.]
 
How it Looks in Java

Scala:

lines.filter(_.contains("error")).count()

Java:

JavaRDD<String> lines = ...;
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) {
    return s.contains("error");
  }
}).count();

More examples in the next talk.
 
Outline

- Introduction to Scala & functional programming
- Spark concepts
- Tour of Spark operations
- Job execution
 
Learning Spark

- Easiest way: the Spark interpreter (spark-shell)
  - Modified version of the Scala interpreter for cluster use
- Runs in local mode on 1 thread by default, but you can control this through the MASTER environment variable:

MASTER=local      ./spark-shell   # local, 1 thread
MASTER=local[2]   ./spark-shell   # local, 2 threads
MASTER=host:port  ./spark-shell   # run on Mesos
First Stop: SparkContext

- Main entry point to Spark functionality
- Created for you in spark-shell as the variable sc
- In standalone programs, you'd make your own (see later for details)
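
A minimal standalone sketch, assuming the 0.5-era constructor new SparkContext(master, jobName) and the spark package root used at the time:

import spark.SparkContext
import spark.SparkContext._   // implicit conversions, e.g. for pair RDD operations

object MyApp {
  def main(args: Array[String]) {
    // master can be "local", "local[2]", or a Mesos host:port, as with spark-shell
    val sc = new SparkContext("local[2]", "MyApp")
    println(sc.parallelize(1 to 100).reduce(_ + _))   // => 5050
  }
}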
 
Creating	
 RDDs	
 
// Turn a Scala collection into an RDD
sc.parallelize(List(1, 2, 3))
// Load text file from local FS, HDFS, or S3
sc.textFile(file.txt)
sc.textFile(directory/*.txt)
sc.textFile(hdfs://namenode:9000/path/file)
// Use any existing Hadoop InputFormat
sc.hadoopFile(keyClass, valClass, inputFmt, conf)
	
 
Basic Transformations

val nums = sc.parallelize(List(1, 2, 3))

// Pass each element through a function
val squares = nums.map(x => x*x)        // {1, 4, 9}

// Keep elements passing a predicate
val even = squares.filter(_ % 2 == 0)   // {4}

// Map each element to zero or more others
nums.flatMap(x => 1 to x)               // => {1, 1, 2, 1, 2, 3}
                                        // 1 to x is a Range object (sequence of numbers 1, 2, ..., x)
 
Basic Actions

val nums = sc.parallelize(List(1, 2, 3))

// Retrieve RDD contents as a local collection
nums.collect()       // => Array(1, 2, 3)

// Return first K elements
nums.take(2)         // => Array(1, 2)

// Count number of elements
nums.count()         // => 3

// Merge elements with an associative function
nums.reduce(_ + _)   // => 6

// Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")
Working with Key-Value Pairs

Spark's "distributed reduce" transformations operate on RDDs of key-value pairs.

Scala pair syntax:

val pair = (a, b)   // sugar for new Tuple2(a, b)

Accessing pair elements:

pair._1   // => a
pair._2   // => b
Some Key-Value Operations

val pets = sc.parallelize(
  List(("cat", 1), ("dog", 1), ("cat", 2)))

pets.reduceByKey(_ + _)   // => {(cat, 3), (dog, 1)}
pets.groupByKey()         // => {(cat, Seq(1, 2)), (dog, Seq(1))}
pets.sortByKey()          // => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey also automatically implements combiners on the map side.
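
For intuition, reduceByKey gives the same result as grouping and then reducing per key, but it pre-combines values on the map side before shuffling; a rough sketch of the equivalence:

// same result as pets.reduceByKey(_ + _), but ships every value across the network first
pets.groupByKey().map { case (k, vs) => (k, vs.reduce(_ + _)) }   // => {(cat, 3), (dog, 1)}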
 
Example:	
 Word	
 Count	
 
val lines = sc.textFile(hamlet.txt)
val counts = lines.flatMap(line => line.split( ))
.map(word => (word, 1))
.reduceByKey(_ + _)
to	
 be	
 or	
 
to	
 
be	
 
or	
 
(to,	
 1)	
 
(be,	
 1)	
 
(or,	
 1)	
 
(be,	
 2)	
 
(not,	
 1)	
 
not	
 to	
 be	
 
not	
 
to	
 
be	
 
(not,	
 1)	
 
(to,	
 1)	
 
(be,	
 1)	
 
(or,	
 1)	
 
(to,	
 2)	
 
Other Key-Value Operations

val visits = sc.parallelize(List(
  ("index.html", "1.2.3.4"),
  ("about.html", "3.4.5.6"),
  ("index.html", "1.3.3.1")))

val pageNames = sc.parallelize(List(
  ("index.html", "Home"), ("about.html", "About")))

visits.join(pageNames)
// ("index.html", ("1.2.3.4", "Home"))
// ("index.html", ("1.3.3.1", "Home"))
// ("about.html", ("3.4.5.6", "About"))

visits.cogroup(pageNames)
// ("index.html", (Seq("1.2.3.4", "1.3.3.1"), Seq("Home")))
// ("about.html", (Seq("3.4.5.6"), Seq("About")))
Controlling the Number of Reduce Tasks

All the pair RDD operations take an optional second parameter for the number of tasks:

words.reduceByKey(_ + _, 5)
words.groupByKey(5)
visits.join(pageViews, 5)

Can also set the spark.default.parallelism property.
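
A sketch of setting it, assuming the 0.5/0.6-era convention of passing Spark settings as Java system properties before the SparkContext is created:

// assumed mechanism: set the property before constructing the context
System.setProperty("spark.default.parallelism", "10")
val sc = new spark.SparkContext("local[4]", "ParallelismDemo")   // app name is illustrative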
 
Using Local Variables

Any external variables you use in a closure will automatically be shipped to the cluster:

val query = Console.readLine()
pages.filter(_.contains(query)).count()

Some caveats:
- Each task gets a new copy (updates aren't sent back)
- The variable must be Serializable
- Don't use fields of an outer object (ships all of it!)
 
Closure	
 Mishap	
 Example	
 
class MyCoolRddApp {
val param = 3.14
val log = new Log(...)
...
How	
 to	
 get	
 around	
 it:	
 
class MyCoolRddApp {
...
def work(rdd: RDD[Int]) {
val param_ = param
rdd.map(x => x + param_)
.reduce(...)
}
def work(rdd: RDD[Int]) {
rdd.map(x => x + param)
.reduce(...)
}
}
NotSerializableException:	
 
MyCoolRddApp	
 (or	
 Log)	
 
References	
 only	
 local	
 variable	
 
instead	
 of	
 this.param
Other RDD Operations

- sample(): deterministically sample a subset
- union(): merge two RDDs
- cartesian(): cross product
- pipe(): pass through external program
- ...

(A short usage sketch follows below.)

See the Programming Guide for more:
www.spark-project.org/documentation.html
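
A rough usage sketch; the sample() argument order shown (withReplacement, fraction, seed) is an assumption about the 0.5-era API:

val a = sc.parallelize(1 to 4)
val b = sc.parallelize(5 to 6)

a.union(b).collect()       // => Array(1, 2, 3, 4, 5, 6)
a.cartesian(b).count()     // => 8, one element per (x, y) pair
a.sample(false, 0.5, 42)   // assumed signature: (withReplacement, fraction, seed)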
 
Outline

- Introduction to Scala & functional programming
- Spark concepts
- Tour of Spark operations
- Job execution
 
Software Components

- Spark runs as a library in your program (1 instance per app)
- Runs tasks locally or on Mesos
  - dev branch also supports YARN and standalone deployment
- Accesses storage systems via the Hadoop InputFormat API
  - Can use HBase, HDFS, S3, ...

[Diagram: your application creates a SparkContext with local threads; the SparkContext talks to a Mesos master, which runs Spark workers on slave nodes; all of them access HDFS or other storage.]
 
Task Scheduler

- Runs general task graphs
- Pipelines functions where possible
- Cache-aware data reuse & locality
- Partitioning-aware to avoid shuffles

[Diagram: a job DAG over RDDs A-F, cut into Stage 1 (groupBy), Stage 2 (map, filter), and Stage 3 (join); boxes mark RDD partitions, with cached partitions highlighted.]
 
Data Storage

- Cached RDDs normally stored as Java objects
  - Fastest access on the JVM, but can be larger than ideal
- Can also store in serialized format
  - Spark 0.5: spark.cache.class=spark.SerializingCache
- Default serialization library is Java serialization
  - Very slow for large data!
  - Can customize through spark.serializer (see later)
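
A sketch of turning these on, again assuming settings are passed as Java system properties before creating the SparkContext, and that spark.KryoSerializer is the alternative serializer class name of that era:

// assumed mechanism: 0.5-era settings via system properties
System.setProperty("spark.cache.class", "spark.SerializingCache")
System.setProperty("spark.serializer", "spark.KryoSerializer")   // assumed class name
val sc = new spark.SparkContext("local[2]", "SerializedCacheDemo")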
 
How	
 to	
 Get	
 Started	
 
git clone git://github.com/mesos/spark
cd spark
sbt/sbt compile
./spark-shell
More Information

Scala resources:
- www.artima.com/scalazine/articles/steps.html ("First Steps to Scala")
- www.artima.com/pins1ed (free book)

Spark documentation:
- www.spark-project.org/documentation.html