Big Data Analysis with Scala and Spark
Heather Miller
Transformations and Actions
Recall transformers and accessors from Scala sequential and parallel
collections.
Transformers. Return new collections as results. (Not single values.)
Examples: map, filter, flatMap, groupBy
map(f: A => B): Traversable[B]
Accessors. Return single values as results. (Not collections.)
Examples: reduce, fold, aggregate.
reduce(op: (A, A) => A): A
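On plain Scala collections the distinction looks like this (a minimal sketch using only the standard library; the sample values are illustrative):

```scala
// Transformer: map returns a new collection, not a single value.
val words = List("spark", "scala", "rdd")
val lengths = words.map(_.length)      // a new List[Int]

// Accessor: reduce returns a single value, not a collection.
val total = lengths.reduce(_ + _)
```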
Transformations and Actions
Similarly, Spark defines transformations and actions on RDDs.
They seem similar to transformers and accessors, but there are some
important differences.
Transformations. Return new RDDs as results.
They are lazy; their result RDD is not immediately computed.
Actions. Compute a result based on an RDD, which is either returned or
saved to an external storage system (e.g., HDFS).
They are eager; their result is immediately computed.
Laziness/eagerness is how we can limit network communication using the
programming model.
Example
Consider the following simple example:

val largeList: List[String] = ...
val wordsRdd = sc.parallelize(largeList)
val lengthsRdd = wordsRdd.map(_.length)

What has happened on the cluster at this point?
Nothing. Execution of map (a transformation) is deferred.
To kick off the computation and wait for its result, we can add an action:

val totalChars = lengthsRdd.reduce(_ + _)
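For intuition only, here is the same pipeline run eagerly on an ordinary Scala List, with no cluster or SparkContext involved (the sample data is made up):

```scala
val largeList: List[String] = List("big", "data", "analysis")

// On a plain List these run immediately; on an RDD the map would be
// deferred until the reduce (an action) triggers the computation.
val lengths    = largeList.map(_.length)
val totalChars = lengths.reduce(_ + _)
```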
Common Transformations in the Wild

map          map[B](f: A => B): RDD[B]
             Apply a function to each element in the RDD and
             return an RDD of the result.

flatMap      flatMap[B](f: A => TraversableOnce[B]): RDD[B]
             Apply a function to each element in the RDD and return
             an RDD of the contents of the iterators returned.

filter       filter(pred: A => Boolean): RDD[A]
             Apply a predicate function to each element in the RDD and
             return an RDD of the elements that pass the predicate
             condition, pred.

distinct     distinct(): RDD[T]
             Return an RDD with duplicates removed.
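These four transformations mirror their counterparts on standard Scala collections, which is a quick way to build intuition for what each returns (a local sketch, not Spark itself):

```scala
val xs = List("to", "be", "or", "not", "to", "be")

val upper    = xs.map(_.toUpperCase)     // one output element per input
val chars    = xs.flatMap(_.toList)      // per-element results, flattened
val twoChars = xs.filter(_.length == 2)  // only elements passing the predicate
val uniques  = xs.distinct               // duplicates removed
```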
Common Actions in the Wild

collect      collect(): Array[T]
             Return all elements from the RDD.

count        count(): Long
             Return the number of elements in the RDD.

take         take(num: Int): Array[T]
             Return the first num elements of the RDD.

reduce       reduce(op: (A, A) => A): A
             Combine the elements in the RDD together using the op
             function and return the result.

foreach      foreach(f: T => Unit): Unit
             Apply a function to each element in the RDD.
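By analogy with ordinary collections (on an RDD, collect additionally gathers the distributed partitions back to the driver program, so these are only local approximations):

```scala
val nums = List(4, 1, 3, 2)

val n     = nums.length          // count() on an RDD returns this as a Long
val first = nums.take(2)         // the first num elements
val sum   = nums.reduce(_ + _)   // combine elements with op

// foreach is run purely for its side effect and returns Unit.
val seen = scala.collection.mutable.ListBuffer.empty[Int]
nums.foreach(seen += _)
```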
Another Example
Let's assume that we have an RDD[String] which contains gigabytes of
logs collected over the previous year. Each element of this RDD represents
one line of logging.

Assuming that dates come in the form YYYY-MM-DD:HH:MM:SS, and errors
are logged with a prefix that includes the word "error"...

How would you determine the number of errors that were logged in
December 2016?

val lastYearsLogs: RDD[String] = ...
val numDecErrorLogs
  = lastYearsLogs.filter(lg => lg.contains("2016-12") && lg.contains("error"))
                 .count()
Benefits of Laziness for Large-Scale Data
Spark computes RDDs the first time they are used in an action.
This helps when processing large amounts of data.

Example:
val lastYearsLogs: RDD[String] = ...
val firstLogsWithErrors = lastYearsLogs.filter(_.contains("ERROR")).take(10)

The execution of filter is deferred until the take action is applied.
Spark leverages this by analyzing and optimizing the chain of operations before
executing it.
Spark will not compute intermediate RDDs. Instead, as soon as 10 elements of the
filtered RDD have been computed, firstLogsWithErrors is done. At this point, Spark
stops working, saving the time and space it would have spent computing elements
of filter's result that are never used.
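Scala's own LazyList exhibits the same behavior in miniature: filter and take are deferred, and forcing the result evaluates only as many elements as needed. This local sketch (not Spark) counts evaluations to make that visible:

```scala
// Tag every third line as an error; count how many lines get evaluated.
var evaluated = 0
val logs = LazyList.from(1).map { i =>
  evaluated += 1
  if (i % 3 == 0) s"ERROR line $i" else s"INFO line $i"
}

// Nothing has been computed yet; forcing take(2) evaluates only the
// handful of lines needed to find the first two errors.
val firstErrors = logs.filter(_.contains("ERROR")).take(2).toList
```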
Transformations on Two RDDs
RDDs also support set-like operations, like union and intersection.
Two-RDD transformations combine two RDDs into one.

union        union(other: RDD[T]): RDD[T]
             Return an RDD containing elements from both RDDs.

intersection intersection(other: RDD[T]): RDD[T]
             Return an RDD containing only elements found in
             both RDDs.

subtract     subtract(other: RDD[T]): RDD[T]
             Return an RDD with the contents of the other RDD
             removed.

cartesian    cartesian[U](other: RDD[U]): RDD[(T, U)]
             Cartesian product with the other RDD.
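Standard collections provide close local analogues, useful for predicting results (note that the RDD versions operate over distributed partitions and handle duplicates somewhat differently, so treat this as intuition only):

```scala
val a = List(1, 2, 3)
val b = List(3, 4, 5)

val uni  = a ++ b            // like union: elements from both
val both = a.intersect(b)    // like intersection: elements in both
val sub  = a.diff(b)         // like subtract: a with b's contents removed
val cart = for (x <- a; y <- b) yield (x, y)   // cartesian product pairs
```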
Other Useful RDD Actions
RDDs also contain other important actions unrelated to regular Scala
collections, but which are useful when dealing with distributed data.

takeSample         takeSample(withRepl: Boolean, num: Int): Array[T]
                   Return an array with a random sample of num elements of
                   the dataset, with or without replacement.

takeOrdered        takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]
                   Return the first num elements of the RDD using either
                   their natural order or a custom comparator.

saveAsTextFile     saveAsTextFile(path: String): Unit
                   Write the elements of the dataset as a text file in
                   the local filesystem or HDFS.

saveAsSequenceFile saveAsSequenceFile(path: String): Unit
                   Write the elements of the dataset as a Hadoop
                   SequenceFile in the local filesystem or HDFS.
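takeOrdered has a direct analogue on plain collections: sort, then take. A small local sketch (not Spark):

```scala
val words = List("spark", "rdd", "scala", "hdfs")

// Natural ordering, like takeOrdered(2) on an RDD[String].
val firstTwo = words.sorted.take(2)

// Custom ordering, like takeOrdered(2)(Ordering.by(_.length)).
val shortest = words.sortBy(_.length).take(2)
```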