Learning Apache Spark With Python

This document provides an overview of how to create RDDs (Resilient Distributed Datasets) and DataFrames in Apache Spark with Python. It discusses several common approaches: 1. Using the parallelize() function to distribute an existing collection across a Spark cluster. 2. Using the createDataFrame() function to create a DataFrame from a list of tuples. 3. Reading data from external sources such as CSV files or a database, using read.format or read.jdbc to load the data as a DataFrame. Once created, RDDs can be operated on using transformations and actions. The document provides examples of each of these methods.

Chapter 4. An Introduction to Apache Spark

1. Driver
• separate process to execute user applications

• creates SparkContext to schedule jobs execution and negotiate with cluster manager
2. Executors
• run tasks scheduled by driver
• store computation results in memory, on disk or off-heap
• interact with storage systems
3. Cluster Manager
• Mesos
• YARN
• Spark Standalone
The Spark driver contains several more components, which are responsible for translating user code into the
actual jobs executed on the cluster:

• SparkContext
– represents the connection to a Spark cluster, and can be used to create RDDs,
accumulators and broadcast variables on that cluster
• DAGScheduler
– computes a DAG of stages for each job and submits them to the TaskScheduler,
determines preferred locations for tasks (based on cache status or shuffle file
locations), and finds a minimum schedule to run the jobs
• TaskScheduler
– responsible for sending tasks to the cluster, running them, retrying if there are failures,
and mitigating stragglers
• SchedulerBackend
– backend interface for scheduling systems that allows plugging in different
implementations (Mesos, YARN, Standalone, local)
• BlockManager
– provides interfaces for putting and retrieving blocks both locally and remotely into
various stores (memory, disk, and off-heap)

4.3 Architecture

4.4 How Spark Works?

Spark has a small code base, and the system is divided into several layers. Each layer has its own
responsibilities, and the layers are independent of each other.
The first layer is the interpreter: Spark uses a Scala interpreter with some modifications. As you enter
your code in the Spark console (creating RDDs and applying operators), Spark creates an operator graph.
When the user runs an action (such as collect), the graph is submitted to the DAG scheduler. The DAG
scheduler divides the operator graph into (map and reduce) stages. A stage is composed of tasks based on
partitions of the input data. The DAG scheduler pipelines operators together to optimize the graph; for
example, many map operators can be scheduled in a single stage. This optimization is key to Spark's
performance. The final result of the DAG scheduler is a set of stages. The stages are passed on to the task
scheduler, which launches tasks via the cluster manager (Spark Standalone, YARN, or Mesos). The task
scheduler doesn't know about dependencies among stages.
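
The pipelining step can be illustrated with a toy model. The sketch below is plain Python, not Spark's scheduler: it assumes each operator in a linear chain is tagged as narrow (such as map or filter) or wide (one that needs a shuffle, such as reduceByKey), and it simply starts a new stage at every wide operator. The function name split_into_stages and the (name, is_wide) tuples are hypothetical, chosen only for this illustration.

```python
# Toy model of stage construction; names and structure are illustrative only.
def split_into_stages(operators):
    """Group a linear chain of (name, is_wide) operators into stages.

    Consecutive narrow operators are pipelined into one stage; a wide
    operator (one that needs a shuffle) starts a new stage.
    """
    stages = []
    current = []
    for name, is_wide in operators:
        if is_wide and current:
            stages.append(current)  # close the stage before the shuffle
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages


chain = [
    ("map", False),
    ("filter", False),
    ("map", False),
    ("reduceByKey", True),  # shuffle boundary in this toy model
    ("map", False),
]
stages = split_into_stages(chain)
for i, stage in enumerate(stages):
    print(f"stage {i}: {' -> '.join(stage)}")
```

In this model the three leading narrow operators share one stage, which is the kind of grouping the paragraph above describes for many map operators.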

CHAPTER FIVE

PROGRAMMING WITH RDDS

Chinese proverb
If you only know yourself, but not your opponent, you may win or may lose. If you know neither
yourself nor your enemy, you will always endanger yourself – idiom, from Sunzi’s Art of War

RDD stands for Resilient Distributed Dataset. An RDD in Spark is simply an immutable distributed
collection of objects. Each RDD is split into multiple partitions (smaller sets with the same structure),
which may be computed on different nodes of the cluster.
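
As a rough illustration of partitioning, the sketch below splits a local list into contiguous slices and processes each slice independently, much as different nodes could. It is plain Python rather than Spark code; the helper name split_into_partitions and its slicing rule are assumptions made for this example.

```python
def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous slices of near-equal size."""
    n = len(data)
    return [
        data[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]


data = list(range(1, 11))
partitions = split_into_partitions(data, 3)
print(partitions)

# Each partition can be processed on its own; partial results are then combined.
partial_sums = [sum(p) for p in partitions]
print(partial_sums, sum(partial_sums))
```

On a real RDD, rdd.getNumPartitions() reports how many partitions Spark created.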

5.1 Create RDD

Usually, there are two popular ways to create RDDs: loading an external dataset, or distributing a
collection of objects. The following examples show some of the simplest ways to create RDDs, starting with
the parallelize() function, which takes an already existing collection in your program and passes it
to the SparkContext.
1. By using parallelize( ) function

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark create RDD example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.sparkContext.parallelize([(1, 2, 3, 'a b c'),
                                     (4, 5, 6, 'd e f'),
                                     (7, 8, 9, 'g h i')]).toDF(['col1', 'col2', 'col3', 'col4'])

Then you will get the data (toDF() converts the RDD into a DataFrame):

df.show()

+----+----+----+-----+
|col1|col2|col3| col4|
+----+----+----+-----+
|   1|   2|   3|a b c|
|   4|   5|   6|d e f|
|   7|   8|   9|g h i|
+----+----+----+-----+

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark create RDD example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

myData = spark.sparkContext.parallelize([(1,2), (3,4), (5,6), (7,8), (9,10)])

Then you will get the RDD data:

myData.collect()

[(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]

2. By using createDataFrame( ) function

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark create RDD example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

Employee = spark.createDataFrame([
    ('1', 'Joe', '70000', '1'),
    ('2', 'Henry', '80000', '2'),
    ('3', 'Sam', '60000', '2'),
    ('4', 'Max', '90000', '1')],
    ['Id', 'Name', 'Salary', 'DepartmentId']
)

Employee.show()

Then you will get the data as a DataFrame:

+---+-----+------+------------+
| Id| Name|Salary|DepartmentId|
+---+-----+------+------------+
|  1|  Joe| 70000|           1|
|  2|Henry| 80000|           2|
|  3|  Sam| 60000|           2|
|  4|  Max| 90000|           1|
+---+-----+------+------------+

3. By using read and load functions

a. Read dataset from .csv file

## set up SparkSession
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark create RDD example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load("/home/feng/Spark/Code/data/Advertising.csv", header=True)

df.show(5)
df.printSchema()

Then you will get the data as a DataFrame:

+---+-----+-----+---------+-----+
|_c0|   TV|Radio|Newspaper|Sales|
+---+-----+-----+---------+-----+
|  1|230.1| 37.8|     69.2| 22.1|
|  2| 44.5| 39.3|     45.1| 10.4|
|  3| 17.2| 45.9|     69.3|  9.3|
|  4|151.5| 41.3|     58.5| 18.5|
|  5|180.8| 10.8|     58.4| 12.9|
+---+-----+-----+---------+-----+
only showing top 5 rows

Once created, RDDs offer two types of operations: transformations and actions.
b. Read dataset from database

## set up SparkSession
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark create RDD example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

## User information
user = 'your_username'
pw = 'your_password'

## Database information
table_name = 'table_name'
url = 'jdbc:postgresql://##.###.###.##:5432/dataset?user=' + user + '&password=' + pw
properties = {'driver': 'org.postgresql.Driver', 'password': pw, 'user': user}

df = spark.read.jdbc(url=url, table=table_name, properties=properties)

df.show(5)
df.printSchema()

Then you will get the data as a DataFrame.

Note: Reading tables from a database requires the proper driver for that database. For example,
the above demo needs org.postgresql.Driver, which you need to download and put in the jars folder
of your Spark installation path. I downloaded postgresql-42.1.1.jar from the official website and put
it in the jars folder.

c. Read dataset from HDFS

from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext('local', 'example')
hc = HiveContext(sc)

# read a text file from HDFS as an RDD of lines
tf1 = sc.textFile("hdfs://cdhstltest/user/data/demo.CSV")
print(tf1.first())

# query a Hive table as a DataFrame
hc.sql("use intg_cme_w")
spf = hc.sql("SELECT * FROM spf LIMIT 100")
spf.show(5)  # show() prints the rows itself and returns None

5.2 Spark Operations

Warning: All the figures below are from Jeffrey Thompson. The interested reader is referred to pyspark
pictures

There are two main types of Spark operations: Transformations and Actions [Karau2015].

Note: Some people define three types of operations: Transformations, Actions and Shuffles.

5.2.1 Spark Transformations

Transformations construct a new RDD from a previous one. For example, one common transformation is
filtering data that matches a predicate.

5.2.2 Spark Actions

Actions, on the other hand, compute a result based on an RDD, and either return it to the driver program or
save it to an external storage system (e.g., HDFS).
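
The difference between the two kinds of operation can be sketched without a cluster. The toy class below is plain Python, not Spark: ToyRDD, load and reads are hypothetical names, and the method names only mirror the RDD methods map, filter, collect, count and reduce. Transformations merely record a step and return a new object; actions run the recorded steps and return a result to the caller.

```python
from functools import reduce


class ToyRDD:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source       # zero-argument callable returning an iterable
        self._steps = tuple(steps)  # recorded transformations, in order

    # Transformations: return a new ToyRDD and compute nothing yet.
    def map(self, f):
        return ToyRDD(self._source, self._steps + (("map", f),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        items = iter(self._source())
        for kind, f in self._steps:
            items = map(f, items) if kind == "map" else filter(f, items)
        return items

    # Actions: evaluate the recorded steps and return a result to the caller.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def reduce(self, f):
        return reduce(f, self._run())


reads = []


def load():
    reads.append("read")  # lets us observe when the data is actually touched
    return [1, 2, 3, 4, 5, 6]


numbers = ToyRDD(load)
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(len(reads))               # 0: no action has run yet
print(evens_squared.collect())  # [4, 16, 36]
print(evens_squared.count())    # 3
print(numbers.collect())        # [1, 2, 3, 4, 5, 6], the original is unchanged
```

Note that this toy re-reads the source on every action, which is why the reads list grows with each call.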

5.3 rdd.DataFrame vs pd.DataFrame

5.3.1 Create DataFrame

1. From List

my_list = [['a', 1, 2], ['b', 2, 3],['c', 3, 4]]

col_name = ['A', 'B', 'C']

:: Python Code:

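import pandas as pd

# pandas: pass the column names with the columns= keyword
pd.DataFrame(my_list, columns=col_name)
# Spark: the second positional argument is the schema (column names)
spark.createDataFrame(my_list, col_name).show()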

:: Comparison:

                 +---+---+---+
                 |  A|  B|  C|
   A  B  C       +---+---+---+
0  a  1  2       |  a|  1|  2|
1  b  2  3       |  b|  2|  3|
2  c  3  4       |  c|  3|  4|
                 +---+---+---+

Attention: Pay attention to the columns= parameter in pd.DataFrame. Without the keyword, the second
positional argument is interpreted as the index, so the names become row labels instead of column names.
:: Python Code:
# with columns=, the names label the columns
pd.DataFrame(my_list, columns=col_name)
# without it, the names are taken as the index (row labels)
pd.DataFrame(my_list, col_name)

:: Comparison:
   A  B  C             0  1  2
0  a  1  2          A  a  1  2
1  b  2  3          B  b  2  3
2  c  3  4          C  c  3  4
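
The same point can be checked programmatically. This is a small sketch that assumes pandas is installed; the variable names by_keyword and by_position are introduced here only for illustration. It inspects the column and index labels of both frames.

```python
import pandas as pd

my_list = [['a', 1, 2], ['b', 2, 3], ['c', 3, 4]]
col_name = ['A', 'B', 'C']

by_keyword = pd.DataFrame(my_list, columns=col_name)
by_position = pd.DataFrame(my_list, col_name)  # second positional argument is index

print(list(by_keyword.columns), list(by_keyword.index))    # ['A', 'B', 'C'] [0, 1, 2]
print(list(by_position.columns), list(by_position.index))  # [0, 1, 2] ['A', 'B', 'C']
```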
