Apache Spark
Lorenzo Di Gaetano
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
What is Apache Spark?
• A general-purpose framework for big data processing
• It interfaces with many distributed storage systems, such as HDFS (the Hadoop Distributed File System), Amazon S3, Apache Cassandra and many others
• Up to 100 times faster than Hadoop MapReduce for in-memory computation
Multilanguage API
• You can write applications in various languages
• Java
• Python
• Scala
• R
• In the context of this course we will consider
Python
Built-in Libraries
• On top of the core engine, Spark ships with built-in libraries: Spark SQL, Spark Streaming, MLlib (machine learning) and GraphX (graph processing)
Third party libraries
• Many third party libraries are available
• http://spark-packages.org/
• We used spark-csv in our examples
• We will see later how to use an external jar in our application
Getting started with Spark!
• Spark can run on Hadoop, on the cloud, and can also be installed as a standalone application on your PC
• Spark comes pre-installed in all the major
Hadoop distributions
• We will consider the stand-alone installation in
this course
Prerequisites
• Python (for using pySpark) and Java must be
installed.
• All environment variables must be correctly set.
Local installation on Windows (1/2)
• Download Spark from https://spark.apache.org/downloads.html
• Download “Pre-built for Hadoop 2.6 and later”
• Unzip it
• Set the SPARK_HOME environment variable to
point where you unzipped Spark
• Add %SPARK_HOME%/bin to your PATH
Local installation on Windows (2/2)
• Download http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe
• Move it into %SPARK_HOME%/bin
• Create an environment variable: HADOOP_HOME=%SPARK_HOME%
• Now test it: open a terminal and launch pySpark
PySpark up and running!
Running Spark
• Once Spark is correctly installed, you can use it in two ways:
• spark-submit: the CLI command used to launch Python Spark applications
• pyspark: launches an interactive Python shell
Running Spark
• Using spark-submit, you have to manually create a SparkContext object in your code (see the sketch below)
• Using the pyspark interactive shell, a SparkContext is automatically available in a variable called sc
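• As a minimal sketch (the application name and the file name are illustrative), a script launched with spark-submit could start like this:
from pyspark import SparkContext

sc = SparkContext(appName="myApp")   # created manually when running via spark-submit
# ... transformations and actions on RDDs ...
sc.stop()
• It would then be launched with: spark-submit myApp.py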
What is a Spark Application?
• A Spark application is made of a Driver Program
• The Driver Program runs the main function, which executes parallel operations on the cluster
RDD
• Spark works on RDDs – Resilient Distributed Datasets
• An RDD is a collection of elements partitioned across the nodes of the cluster. Spark operates on them in parallel
RDD
• An RDD can be created from a file on a Hadoop-compatible filesystem (or from an existing collection, as shown next)
• RDDs can be made persistent in memory
How to create a RDD
• There are two ways:
• Parallelizing a pre-existing collection in the driver program
• Reading an external dataset on the filesystem
RDD from collections
• You can create an RDD from a collection using the parallelize() method of the SparkContext
data = [1,2,3,4,5]
rdd = sc.parallelize(data)
• Then you can operate on rdd in parallel, as sketched below
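• For example (a quick check, run in the pyspark shell where sc already exists):
rdd.count()     # 5 – counted in parallel across the partitions
rdd.collect()   # [1, 2, 3, 4, 5] – brings the elements back to the driver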
RDD from external datasets
• You can create an RDD from various kinds of external datasets: the local filesystem, HDFS, Cassandra, etc.
• For example we can read a text file, obtaining a
collection of lines.
rdd = sc.textFile("textfile.txt")
Operations on RDD
• There are two types of operations:
• Transformations: Take an existing dataset and return
another dataset
• Actions: Take an existing dataset and return a value
Transformations
• For example, map is a transformation that takes all elements of the dataset, passes them to a function and returns another RDD with the results
resultRDD = originalRDD.map(myFunction)
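• A minimal sketch (the data and the lambda are illustrative):
numbers = sc.parallelize([1, 2, 3, 4])
doubled = numbers.map(lambda x: x * 2)   # transformation: evaluated lazily
doubled.collect()                        # [2, 4, 6, 8]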
Actions
• For example, reduce is an action. It aggregates
all elements of the RDD using a function and
returns the result to the driver program
result = rdd.reduce(function)
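• A minimal sketch (the data and the function are illustrative):
numbers = sc.parallelize([1, 2, 3, 4])
total = numbers.reduce(lambda a, b: a + b)   # action: returns 10 to the driver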
Passing functions to Spark
• There are three ways (sketched after this list):
• lambda expressions: used to pass simple expressions
which return a value.
• Local function definition: for more complex code
• Functions inside a module
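• A sketch of the three options (the function and module names are illustrative):
# 1. lambda expression, for simple one-line functions
rdd.map(lambda s: len(s))

# 2. local function definition, for more complex code
def cleanLine(line):
    return line.strip().lower()
rdd.map(cleanLine)

# 3. function defined inside a module (mymodule is hypothetical)
import mymodule
rdd.map(mymodule.parseLine)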
Example: word counting
def countWords(s):
    words = s.split(" ")
    return len(words)

sc = SparkContext(...)
sc.textFile("textFile.txt").map(countWords)   # transformation only: add an action (e.g. .sum()) to get a result
Shared Variables
• We must always keep in mind that when we pass a function to a Spark operation, the function is executed on separate cluster nodes. Every node receives a COPY of the variables used inside the function
• Changes to the local value of such a variable are not propagated back to the driver program (see the sketch below)
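• A minimal sketch of the pitfall (illustrative data):
counter = 0

def increment(x):
    global counter
    counter += x

sc.parallelize([1, 2, 3]).foreach(increment)
print(counter)   # still 0 on the driver: each worker updated its own copy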
Shared Variables
• To solve this problem, Spark offers two types of shared variables:
• Broadcast variables
• Accumulators
Broadcast variables
• Instead of shipping a copy of the variable with every task, broadcast variables allow the programmer to keep a cached, read-only variable on every machine
• We can create a broadcast variable with the SparkContext.broadcast(var) method, which returns a reference to the broadcast variable
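• A minimal sketch (the lookup table is illustrative):
lookup = sc.broadcast({"a": 1, "b": 2})          # cached, read-only, on every machine
letters = sc.parallelize(["a", "b", "a"])
letters.map(lambda k: lookup.value[k]).collect() # [1, 2, 1]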
Accumulators
• Accumulators are used to keep a shared "counter" across the machines: a special variable that can only be "added" to
• We can create an accumulator with the SparkContext.accumulator(initialValue) method
• Once the accumulator is created, we can add values to it with the add() method
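• A minimal sketch:
acc = sc.accumulator(0)
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: acc.add(x))
print(acc.value)   # 10 – only the driver program reads the value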
Making RDD persistent
• Spark can persist (or cache) a dataset in memory
during operations
• If you persist an RDD, every node stores the partitions it computes in memory and reuses them in other actions, which then run faster
• You can persist an RDD using the persist() or
cache() methods
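• A minimal sketch (the file name is illustrative):
lines = sc.textFile("textfile.txt")
lines.cache()    # same as persist() with the default storage level
lines.count()    # first action: reads the file and caches the partitions
lines.count()    # second action: reuses the cached partitions and runs faster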
Printing elements
• When working on a single machine we can simply print every element with rdd.foreach()
• But when we work in cluster mode, the printing is executed on the executors' stdout, so we will not see anything on the driver node!
• Instead, we can use rdd.take(n) and print the first n elements on the driver; this also makes sure we do not run out of memory (see the sketch below)
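• A minimal Python sketch:
def printElement(x):
    print(x)

rdd.foreach(printElement)      # output ends up on the executors' stdout in cluster mode

for element in rdd.take(10):   # safe: only 10 elements are brought back to the driver
    print(element)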
Removing data
• Spark performs a sort of garbage collection and automatically deletes old data partitions
• We can manually remove an RDD using the
unpersist() method
SparkSQL and DataFrames
• SparkSQL is the Spark module for structured data processing
• The DataFrame API is one of the ways to interact with SparkSQL
DataFrames
• A DataFrame is a collection of data organized into named columns
• Similar to tables in relational databases
• It can be created from various sources: structured data files, Hive tables, external databases, CSV files, etc.
Creating a DataFrame
• Given a SparkContext (sc), the first thing to do is to create a SQLContext
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
• Then read the data, for example in JSON format:
df = sqlContext.read.json('file.json')
Creating a DataFrame from csv file
• We will now see how to load a csv file into a DataFrame using an external library called spark-csv: https://github.com/databricks/spark-csv
• We need to download two files: spark-csv_2.11-1.4.0.jar and commons-csv-1.2.jar
Creating a DataFrame from csv file
• When launching our code with spark-submit we have to add the external jars to the classpath; assuming we put the jars into the lib directory of our project:
spark-submit --jars lib/spark-csv_2.11-1.4.0.jar,lib/commons-csv-1.2.jar myCode.py
Creating a DataFrame from csv file
• Then we read our file into a DataFrame
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
    nullValue='NULL').load(myCsvFile, inferSchema='true')
• Note the inferSchema='true' option. With this option activated, Spark tries to guess the type of every field
• We can specify the schema manually by creating a StructType
Creating a DataFrame from csv file
• Create the schema structure (the types come from pyspark.sql.types)
from pyspark.sql.types import StructType, StructField, \
    IntegerType, FloatType, TimestampType, StringType

customSchema = StructType([
    StructField("field1", IntegerType(), True),
    StructField("field2", FloatType(), True),
    StructField("field3", TimestampType(), True),
    StructField("field4", StringType(), True)
])
• And then pass it to our load function through the schema parameter
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
nullValue='NULL').load(myFile, schema = customSchema)
Example operations on DataFrames
• To show the content of the DataFrame
• df.show()
• To print the Schema of the DataFrame
• df.printSchema()
• To select a column
• df.select('columnName').show()
• To filter rows by some condition
• df.filter(df['columnName'] > N).show()
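• These operations can be chained; a sketch assuming hypothetical columns 'name' and 'age':
df.filter(df['age'] > 21).select('name', 'age').show()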
A complete example: group and avg
• We have a table like this
+----------+------+-------+--------------+------------------+-----------+-------+
|codice_pdv|prdkey| week|vendite_valore|vendite_confezioni|flag_sconto| sconto|
+----------+------+-------+--------------+------------------+-----------+-------+
| 4567|730716|2015009| 190.8500061| 196.0| 0.0|0.98991|
| 4567|730716|2013048| 174.6000061| 153.0| null| null|
| 4567|730716|2015008| 160.6000061| 165.0| 0.0|0.98951|
| 4567|730716|2015006| 171.92999268| 176.0| 0.0|0.99329|
| 4567|730716|2015005| 209.47999573| 213.0| 0.0|1.00447|
+----------+------+-------+--------------+------------------+-----------+-------+
• We want to group by prdkey and calculate the
average vendite_valore for every group
Preparing the schema
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, \
    IntegerType, FloatType, TimestampType, StringType

sc = SparkContext("local", "PrezzoMedio")
sqlContext = SQLContext(sc)

customSchema = StructType([
    StructField("codice_pdv", IntegerType(), True),
    StructField("prdkey", IntegerType(), True),
    StructField("week", IntegerType(), True),
    StructField("vendite_valore", FloatType(), True),
    StructField("vendite_confezioni", FloatType(), True),
    StructField("data_ins", TimestampType(), True),
    StructField("num_riga", FloatType(), True),
    StructField("flag_sconto", FloatType(), True),
    StructField("sconto", FloatType(), True),
    StructField("idricefile", FloatType(), True),
    StructField("iddettzipfile", FloatType(), True),
    StructField("uteins", StringType(), True),
    StructField("datamod", TimestampType(), True),
    StructField("utemod", StringType(), True)
])
Read data and do the job
from time import time

myCsvFile = r"C:\TransFatt1000.csv"
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', nullValue='NULL') \
    .load(myCsvFile, schema=customSchema)

t0 = time()
averageForProductKey = df.groupBy("prdkey").avg("vendite_valore").collect()
tt = time() - t0

print(averageForProductKey)
print("Query performed in {:03.2f} seconds.".format(tt))
With parquet table on HDFS
myParquetTable = "hdfs://bigdata-mnn.hcl.istat.it:8020/user/hive/warehouse/scanner_data.db/trans_fatt_p"
df = sqlContext.read.load(myParquetTable, schema=customSchema)

t0 = time()
averageForProductKey = df.groupBy("prdkey").avg("vendite_valore").collect()
tt = time() - t0

print(averageForProductKey)
print("Query performed in {:03.2f} seconds.".format(tt))
References
• http://spark.apache.org/
• http://spark.apache.org/sql/
• http://spark.apache.org/docs/latest/sql-programming-guide.html