Big Data Analysis with Scala and Spark
Heather Miller
Transformations and Actions
Recall transformers and accessors from Scala sequential and parallel
collections.
Transformers. Return new collections as results. (Not single values.)
Examples: map, filter, flatMap, groupBy
map(f: A => B): Traversable[B]
Accessors. Return single values as results. (Not collections.)
Examples: reduce, fold, aggregate.
reduce(op: (A, A) => A): A
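On plain Scala collections the distinction looks like this (a minimal sketch using only the standard library; the sample values are illustrative):

```scala
// Transformer: map returns a new collection, not a single value.
val words = List("spark", "scala", "rdd")
val lengths = words.map(_.length)      // a new List[Int]

// Accessor: reduce returns a single value, not a collection.
val total = lengths.reduce(_ + _)
```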
Transformations and Actions
Similarly, Spark defines transformations and actions on RDDs.
They seem similar to transformers and accessors, but there are some
important differences.
Transformations. Return new RDDs as results.
They are lazy; their result RDD is not immediately computed.
Actions. Compute a result based on an RDD, which is either returned or
saved to an external storage system (e.g., HDFS).
They are eager; their result is immediately computed.
Laziness/eagerness is how we can limit network communication using the
programming model.
Example
Consider the following simple example:

val largeList: List[String] = ...
val wordsRdd = sc.parallelize(largeList)
val lengthsRdd = wordsRdd.map(_.length)

What has happened on the cluster at this point?
Nothing. Execution of map (a transformation) is deferred.
To kick off the computation and wait for its result, we can add an action:

val totalChars = lengthsRdd.reduce(_ + _)
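For intuition only, here is the same pipeline run eagerly on an ordinary Scala List, with no cluster or SparkContext involved (the sample data is made up):

```scala
val largeList: List[String] = List("big", "data", "analysis")

// On a plain List these run immediately; on an RDD the map would be
// deferred until the reduce (an action) triggers the computation.
val lengths    = largeList.map(_.length)
val totalChars = lengths.reduce(_ + _)
```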
Common Transformations in the Wild

map          map[B](f: A => B): RDD[B]
             Apply a function to each element in the RDD and
             return an RDD of the result.

flatMap      flatMap[B](f: A => TraversableOnce[B]): RDD[B]
             Apply a function to each element in the RDD and return
             an RDD of the contents of the iterators returned.

filter       filter(pred: A => Boolean): RDD[A]
             Apply a predicate function to each element in the RDD and
             return an RDD of the elements that pass the predicate
             condition, pred.

distinct     distinct(): RDD[T]
             Return an RDD with duplicates removed.
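These four transformations mirror their counterparts on standard Scala collections, which is a quick way to build intuition for what each returns (a local sketch, not Spark itself):

```scala
val xs = List("to", "be", "or", "not", "to", "be")

val upper    = xs.map(_.toUpperCase)     // one output element per input
val chars    = xs.flatMap(_.toList)      // per-element results, flattened
val twoChars = xs.filter(_.length == 2)  // only elements passing the predicate
val uniques  = xs.distinct               // duplicates removed
```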
Common Actions in the Wild

collect      collect(): Array[T]
             Return all elements from the RDD.

count        count(): Long
             Return the number of elements in the RDD.

take         take(num: Int): Array[T]
             Return the first num elements of the RDD.

reduce       reduce(op: (A, A) => A): A
             Combine the elements in the RDD together using the op
             function and return the result.

foreach      foreach(f: T => Unit): Unit
             Apply a function to each element in the RDD.
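By analogy with ordinary collections (on an RDD, collect additionally gathers the distributed partitions back to the driver program, so these are only local approximations):

```scala
val nums = List(4, 1, 3, 2)

val n     = nums.length          // count() on an RDD returns this as a Long
val first = nums.take(2)         // the first num elements
val sum   = nums.reduce(_ + _)   // combine elements with op

// foreach is run purely for its side effect and returns Unit.
val seen = scala.collection.mutable.ListBuffer.empty[Int]
nums.foreach(seen += _)
```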
Another Example
Let's assume that we have an RDD[String] which contains gigabytes of
logs collected over the previous year. Each element of this RDD represents
one line of logging.

Assuming that dates come in the form YYYY-MM-DD:HH:MM:SS, and errors
are logged with a prefix that includes the word "error"...

How would you determine the number of errors that were logged in
December 2016?

val lastYearsLogs: RDD[String] = ...
val numDecErrorLogs
  = lastYearsLogs.filter(lg => lg.contains("2016-12") && lg.contains("error"))
                 .count()
Benefits of Laziness for Large-Scale Data
Spark computes RDDs the first time they are used in an action.
This helps when processing large amounts of data.

Example:
val lastYearsLogs: RDD[String] = ...
val firstLogsWithErrors = lastYearsLogs.filter(_.contains("ERROR")).take(10)

The execution of filter is deferred until the take action is applied.
Spark leverages this by analyzing and optimizing the chain of operations before
executing it.
Spark will not compute intermediate RDDs. Instead, as soon as 10 elements of the
filtered RDD have been computed, firstLogsWithErrors is done. At this point, Spark
stops working, saving the time and space it would have spent computing elements
of filter's result that are never used.
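Scala's own LazyList exhibits the same behavior in miniature: filter and take are deferred, and forcing the result evaluates only as many elements as needed. This local sketch (not Spark) counts evaluations to make that visible:

```scala
// Tag every third line as an error; count how many lines get evaluated.
var evaluated = 0
val logs = LazyList.from(1).map { i =>
  evaluated += 1
  if (i % 3 == 0) s"ERROR line $i" else s"INFO line $i"
}

// Nothing has been computed yet; forcing take(2) evaluates only the
// handful of lines needed to find the first two errors.
val firstErrors = logs.filter(_.contains("ERROR")).take(2).toList
```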
Transformations on Two RDDs
RDDs also support set-like operations, like union and intersection.
Two-RDD transformations combine two RDDs into one.

union        union(other: RDD[T]): RDD[T]
             Return an RDD containing elements from both RDDs.

intersection intersection(other: RDD[T]): RDD[T]
             Return an RDD containing only elements found in
             both RDDs.

subtract     subtract(other: RDD[T]): RDD[T]
             Return an RDD with the contents of the other RDD
             removed.

cartesian    cartesian[U](other: RDD[U]): RDD[(T, U)]
             Cartesian product with the other RDD.
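Standard collections provide close local analogues, useful for predicting results (note that the RDD versions operate over distributed partitions and handle duplicates somewhat differently, so treat this as intuition only):

```scala
val a = List(1, 2, 3)
val b = List(3, 4, 5)

val uni  = a ++ b            // like union: elements from both
val both = a.intersect(b)    // like intersection: elements in both
val sub  = a.diff(b)         // like subtract: a with b's contents removed
val cart = for (x <- a; y <- b) yield (x, y)   // cartesian product pairs
```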
Other Useful RDD Actions
RDDs also contain other important actions unrelated to regular Scala
collections, but which are useful when dealing with distributed data.

takeSample         takeSample(withRepl: Boolean, num: Int): Array[T]
                   Return an array with a random sample of num elements of
                   the dataset, with or without replacement.

takeOrdered        takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]
                   Return the first num elements of the RDD using either
                   their natural order or a custom comparator.

saveAsTextFile     saveAsTextFile(path: String): Unit
                   Write the elements of the dataset as a text file in
                   the local filesystem or HDFS.

saveAsSequenceFile saveAsSequenceFile(path: String): Unit
                   Write the elements of the dataset as a Hadoop
                   SequenceFile in the local filesystem or HDFS.
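takeOrdered has a direct analogue on plain collections: sort, then take. A small local sketch (not Spark):

```scala
val words = List("spark", "rdd", "scala", "hdfs")

// Natural ordering, like takeOrdered(2) on an RDD[String].
val firstTwo = words.sorted.take(2)

// Custom ordering, like takeOrdered(2)(Ordering.by(_.length)).
val shortest = words.sortBy(_.length).take(2)
```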