Big Data Analysis With Scala and Spark: Heather Miller

The document discusses transformations and actions that can be performed on Resilient Distributed Datasets (RDDs) in Apache Spark. It explains that transformations return new RDDs as results lazily without computing the results immediately, while actions trigger computation by returning a value to the driver program. Some common transformations are map, filter, flatMap, and union, while common actions include count, collect, reduce, and saveAsTextFile. The document also provides examples of how laziness allows for optimizations in Spark.


Big Data Analysis with Scala and Spark

Heather Miller
Transformations and Actions

Recall transformers and accessors from Scala sequential and parallel
collections.

Transformers. Return new collections as results. (Not single values.)
Examples: map, filter, flatMap, groupBy

  map(f: A => B): Traversable[B]

Accessors. Return single values as results. (Not collections.)
Examples: reduce, fold, aggregate.

  reduce(op: (A, A) => A): A
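A minimal sketch of this distinction on an ordinary Scala collection (no Spark involved):

```scala
// Transformer vs. accessor on a plain Scala List.
val words = List("spark", "scala", "rdd")

// map is a transformer: it returns a new collection.
val lengths = words.map(_.length)   // List(5, 5, 3)

// reduce is an accessor: it returns a single value.
val total = lengths.reduce(_ + _)   // 13
```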
Transformations and Actions

Similarly, Spark defines transformations and actions on RDDs.

They seem similar to transformers and accessors, but there are some
important differences.

Transformations. Return new RDDs as results.
They are lazy, their result RDD is not immediately computed.

Actions. Compute a result based on an RDD, and either return it to the
driver program or save it to an external storage system (e.g., HDFS).
They are eager, their result is immediately computed.

Laziness/eagerness is how we can limit network communication using the
programming model.
Example

Consider the following simple example:

val largeList: List[String] = ...

val wordsRdd = sc.parallelize(largeList)
val lengthsRdd = wordsRdd.map(_.length)

What has happened on the cluster at this point?

Nothing. Execution of map (a transformation) is deferred.

To kick off the computation and wait for its result...
Example

Consider the following simple example:

val largeList: List[String] = ...

val wordsRdd = sc.parallelize(largeList)
val lengthsRdd = wordsRdd.map(_.length)
val totalChars = lengthsRdd.reduce(_ + _)

... we can add an action


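The lazy-transformation/eager-action split can be imitated with Scala views, which defer map the same way an RDD transformation is deferred until an action runs. This is a plain-Scala analogy, not Spark itself:

```scala
// A rough analogue of Spark's laziness using a Scala view:
// map on a view is recorded but not evaluated (like a transformation);
// sum forces evaluation (like an action).
var evaluated = 0
val lengthsView = List("spark", "scala", "rdd").view.map { s =>
  evaluated += 1
  s.length
}
// At this point evaluated is still 0 -- nothing has been computed.

val totalChars = lengthsView.sum
// Forcing the view evaluated all three elements; evaluated is now 3.
```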
Common Transformations in the Wild

map       map[B](f: A => B): RDD[B]
          Apply function to each element in the RDD and
          return an RDD of the result.

flatMap   flatMap[B](f: A => TraversableOnce[B]): RDD[B]
          Apply a function to each element in the RDD and return
          an RDD of the contents of the iterators returned.

filter    filter(pred: A => Boolean): RDD[A]
          Apply predicate function to each element in the RDD and
          return an RDD of elements that have passed the predicate
          condition, pred.

distinct  distinct(): RDD[A]
          Return RDD with duplicates removed.
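The same four operations exist with analogous shapes on plain Scala collections, which is a convenient way to check what each one does without a cluster:

```scala
// Plain-collections analogues of map, flatMap, filter, and distinct.
val nums = List(1, 2, 2, 3)

val doubled = nums.map(_ * 2)               // List(2, 4, 4, 6)
val pairs   = nums.flatMap(n => List(n, n)) // List(1, 1, 2, 2, 2, 2, 3, 3)
val evens   = nums.filter(_ % 2 == 0)       // List(2, 2)
val uniques = nums.distinct                 // List(1, 2, 3)
```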
Common Actions in the Wild

collect   collect(): Array[T]
          Return all elements from RDD.

count     count(): Long
          Return the number of elements in the RDD.

take      take(num: Int): Array[T]
          Return the first num elements of the RDD.

reduce    reduce(op: (A, A) => A): A
          Combine the elements in the RDD together using op
          function and return result.

foreach   foreach(f: T => Unit): Unit
          Apply function to each element in the RDD.
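A quick sketch of what each action returns, again mimicked on a plain List (on an RDD these would trigger distributed computation; here they are just ordinary method calls):

```scala
val nums = List(3, 1, 4, 1, 5)

val everything = nums.toArray        // like collect(): all elements as an Array
val howMany    = nums.size           // like count(): 5
val firstTwo   = nums.take(2)        // like take(2): List(3, 1)
val total      = nums.reduce(_ + _)  // like reduce(_ + _): 14
nums.foreach(println)                // like foreach: side effects only, returns Unit
```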
Another Example

Let's assume that we have an RDD[String] which contains gigabytes of
logs collected over the previous year. Each element of this RDD represents
one line of logging.

Assuming that dates come in the form YYYY-MM-DD:HH:MM:SS, and errors
are logged with a prefix that includes the word "error"...

How would you determine the number of errors that were logged in
December 2016?

val lastYearsLogs: RDD[String] = ...

val numDecErrorLogs
  = lastYearsLogs.filter(lg => lg.contains("2016-12") && lg.contains("error"))
                 .count()
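The same filter-then-count pattern can be checked on a tiny in-memory list; the log lines below are made up for illustration, standing in for the RDD[String]:

```scala
// Hypothetical log lines; only the first matches both December 2016 and "error".
val sampleLogs = List(
  "2016-12-01:09:15:00 error: disk full",
  "2016-12-02:10:00:00 info: heartbeat",
  "2016-11-30:23:59:59 error: timeout"
)

val numDecErrors =
  sampleLogs.filter(lg => lg.contains("2016-12") && lg.contains("error"))
            .size   // size plays the role of count() here
```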
Benefits of Laziness for Large-Scale Data

Spark computes RDDs the first time they are used in an action.
This helps when processing large amounts of data.

Example:

val lastYearsLogs: RDD[String] = ...

val firstLogsWithErrors = lastYearsLogs.filter(_.contains("ERROR")).take(10)

The execution of filter is deferred until the take action is applied.

Spark leverages this by analyzing and optimizing the chain of operations before
executing it.

Spark will not compute intermediate RDDs. Instead, as soon as 10 elements of the
filtered RDD have been computed, firstLogsWithErrors is done. At this point Spark
stops working, saving time and space computing elements of the unused result of filter.
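The early-termination behaviour of take can be demonstrated with a lazy Iterator and a counter: only as many elements are ever generated as are needed for the first 10 matches. A sketch, not Spark itself:

```scala
// Count how many lines are actually generated before take(10) is satisfied.
var examined = 0
val logLines = Iterator.tabulate(1000000) { i =>
  examined += 1
  if (i % 2 == 0) s"ERROR line $i" else s"INFO line $i"
}

val firstTenErrors = logLines.filter(_.contains("ERROR")).take(10).toList
// Only a handful of the million lines were ever generated.
```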
Transformations on Two RDDs

RDDs also support set-like operations, like union and intersection.
Two-RDD transformations combine two RDDs into one.

union         union(other: RDD[T]): RDD[T]
              Return an RDD containing elements from both RDDs.

intersection  intersection(other: RDD[T]): RDD[T]
              Return an RDD containing elements only found in
              both RDDs.

subtract      subtract(other: RDD[T]): RDD[T]
              Return an RDD with the contents of the other RDD
              removed.

cartesian     cartesian[U](other: RDD[U]): RDD[(T, U)]
              Cartesian product with the other RDD.
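Plain Scala collections offer close analogues of all four, which makes the semantics easy to check (note that, like Spark's union, ++ keeps duplicates):

```scala
val xs = List(1, 2, 3)
val ys = List(3, 4)

val unioned    = xs ++ ys                            // List(1, 2, 3, 3, 4)
val common     = xs.intersect(ys)                    // List(3)
val subtracted = xs.diff(ys)                         // List(1, 2)
val crossed    = for (x <- xs; y <- ys) yield (x, y) // all 6 (x, y) pairs
```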
Other Useful RDD Actions

RDDs also contain other important actions unrelated to regular Scala
collections, but which are useful when dealing with distributed data.

takeSample          takeSample(withRepl: Boolean, num: Int): Array[T]
                    Return an array with a random sample of num elements of
                    the dataset, with or without replacement.

takeOrdered         takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]
                    Return the first n elements of the RDD using either
                    their natural order or a custom comparator.

saveAsTextFile      saveAsTextFile(path: String): Unit
                    Write the elements of the dataset as a text file in
                    the local filesystem or HDFS.

saveAsSequenceFile  saveAsSequenceFile(path: String): Unit
                    Write the elements of the dataset as a Hadoop
                    SequenceFile in the local filesystem or HDFS.
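The behaviour of takeOrdered and takeSample can be sketched on a plain collection: sort (under a chosen Ordering) then take, and shuffle then take, respectively. This is only an analogy for intuition; on an RDD these run distributed:

```scala
val nums = List(5, 1, 4, 2, 3)

// takeOrdered(2) analogue: the two smallest under the natural order...
val smallestTwo = nums.sorted.take(2)                        // List(1, 2)
// ...or the two largest under a reversed Ordering.
val largestTwo  = nums.sorted(Ordering[Int].reverse).take(2) // List(5, 4)

// takeSample(withRepl = false, 2) analogue: a random sample of 2 elements.
val sample = scala.util.Random.shuffle(nums).take(2)
```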
