import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("nik").getOrCreate()
# The underlying SparkContext is available from the session if needed
sc = spark.sparkContext
df = spark.read.csv("simple-zipcodes.csv", header=True)
df.printSchema()
root
|-- RecordNumber: string (nullable = true)
|-- Country: string (nullable = true)
|-- City: string (nullable = true)
|-- Zipcode: string (nullable = true)
|-- State: string (nullable = true)
df.show()
+------------+-------+-------------------+-------+-----+
|RecordNumber|Country| City|Zipcode|State|
+------------+-------+-------------------+-------+-----+
| 1| US| PARC PARQUE| 704| PR|
| 2| US|PASEO COSTA DEL SUR| 704| PR|
| 10| US| BDA SAN LUIS| 709| PR|
| 49347| US| HOLT| 32564| FL|
| 49348| US| HOMOSASSA| 34487| FL|
| 61391| US| CINGULAR WIRELESS| 76166| TX|
| 61392| US| FORT WORTH| 76177| TX|
| 61393| US| FT WORTH| 76177| TX|
| 54356| US| SPRUCE PINE| 35585| AL|
| 76511| US| ASH HILL| 27007| NC|
| 4| US| URB EUGENE RICE| 704| PR|
| 39827| US| MESA| 85209| AZ|
| 39828| US| MESA| 85210| AZ|
| 49345| US| HILLIARD| 32046| FL|
| 49346| US| HOLDER| 34445| FL|
| 3| US| SECT LANAUSSE| 704| PR|
| 54354| US| SPRING GARDEN| 36275| AL|
| 54355| US| SPRINGVILLE| 35146| AL|
| 76512| US| ASHEBORO| 27203| NC|
| 76513| US| ASHEBORO| 27204| NC|
+------------+-------+-------------------+-------+-----+
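Note that every column was read as a string. If you want Spark to infer column types instead, a quick sketch (inferSchema is a standard option of the CSV reader; df_typed is just an illustrative name):
# inferSchema=True makes Spark scan the file once more to guess column types,
# e.g. RecordNumber becomes an integer instead of a string
df_typed = spark.read.csv("simple-zipcodes.csv", header=True, inferSchema=True)
df_typed.printSchema()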
# To run SQL queries on Spark, first register the DataFrame as a temporary view
df.createOrReplaceTempView("Zipcodes")
# Spark SQL to Select Columns
# The select() function of the DataFrame API is used to select specific columns from the DataFrame.
df.select("country","city","zipcode","state").show(5)
+-------+-------------------+-------+-----+
|country| city|zipcode|state|
+-------+-------------------+-------+-----+
| US| PARC PARQUE| 704| PR|
| US|PASEO COSTA DEL SUR| 704| PR|
| US| BDA SAN LUIS| 709| PR|
| US| HOLT| 32564| FL|
| US| HOMOSASSA| 34487| FL|
+-------+-------------------+-------+-----+
only showing top 5 rows
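select() also accepts Column expressions, which lets you rename columns on the fly; for example, using col() and alias() from pyspark.sql.functions:
from pyspark.sql.functions import col
# alias() renames the column in the result only; the DataFrame is unchanged
df.select(col("City").alias("city_name"), col("Zipcode")).show(3)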
# In SQL, you can achieve the same using a SELECT ... FROM query, as shown below.
# SQL Select query
spark.sql("SELECT country, city, zipcode, state FROM ZIPCODES").show(5)
# Both examples above yield the output below.
+-------+-------------------+-------+-----+
|country| city|zipcode|state|
+-------+-------------------+-------+-----+
| US| PARC PARQUE| 704| PR|
| US|PASEO COSTA DEL SUR| 704| PR|
| US| BDA SAN LUIS| 709| PR|
| US| HOLT| 32564| FL|
| US| HOMOSASSA| 34487| FL|
+-------+-------------------+-------+-----+
only showing top 5 rows
Filter Rows
To filter the rows of a DataFrame, use the where() function (an alias of filter()) from the DataFrame API.
df.select("country","city","zipcode","state").where("state == 'AZ'").show(5)
+-------+----+-------+-----+
|country|city|zipcode|state|
+-------+----+-------+-----+
| US|MESA| 85209| AZ|
| US|MESA| 85210| AZ|
+-------+----+-------+-----+
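The same filter can also be written with a Column expression instead of a SQL string:
from pyspark.sql.functions import col
# Column-based predicate, equivalent to the SQL-string form above
df.select("country", "city", "zipcode", "state") \
  .where(col("state") == "AZ") \
  .show(5)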
Similarly, in SQL you can use the WHERE clause as follows.
spark.sql(""" SELECT country, city, zipcode, state FROM ZIPCODES WHERE state = 'AZ' """).show(5)
+-------+----+-------+-----+
|country|city|zipcode|state|
+-------+----+-------+-----+
| US|MESA| 85209| AZ|
| US|MESA| 85210| AZ|
+-------+----+-------+-----+
Sorting
To sort rows on a specific column, use the orderBy() function of the DataFrame API.
df.select("country","city","zipcode","state")\
.where("state in ('PR','AZ','FL')") \
.orderBy("state") \
.show(10)
+-------+-------------------+-------+-----+
|country| city|zipcode|state|
+-------+-------------------+-------+-----+
| US| MESA| 85209| AZ|
| US| MESA| 85210| AZ|
| US| HOLT| 32564| FL|
| US| HOMOSASSA| 34487| FL|
| US| HILLIARD| 32046| FL|
| US| HOLDER| 34445| FL|
| US| PARC PARQUE| 704| PR|
| US|PASEO COSTA DEL SUR| 704| PR|
| US| BDA SAN LUIS| 709| PR|
| US| URB EUGENE RICE| 704| PR|
+-------+-------------------+-------+-----+
only showing top 10 rows
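orderBy() sorts ascending by default; to reverse the order, pass a descending column expression:
from pyspark.sql.functions import col
# desc() reverses the sort order for that column
df.select("country", "city", "zipcode", "state") \
  .where("state in ('PR','AZ','FL')") \
  .orderBy(col("state").desc()) \
  .show(10)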
In SQL, you can achieve sorting by using the ORDER BY clause.
spark.sql(""" SELECT country, city, zipcode, state FROM ZIPCODES
WHERE state in ('PR','AZ','FL') order by state """) \
.show(10)
+-------+-------------------+-------+-----+
|country| city|zipcode|state|
+-------+-------------------+-------+-----+
| US| MESA| 85209| AZ|
| US| MESA| 85210| AZ|
| US| HOLT| 32564| FL|
| US| HOMOSASSA| 34487| FL|
| US| HILLIARD| 32046| FL|
| US| HOLDER| 34445| FL|
| US| PARC PARQUE| 704| PR|
| US|PASEO COSTA DEL SUR| 704| PR|
| US| BDA SAN LUIS| 709| PR|
| US| URB EUGENE RICE| 704| PR|
+-------+-------------------+-------+-----+
only showing top 10 rows
Grouping
Use groupBy() followed by an aggregation such as count() to group the rows of a DataFrame.
df.groupBy("state").count() \
.show()
+-----+-----+
|state|count|
+-----+-----+
| AZ| 2|
| NC| 3|
| AL| 3|
| TX| 3|
| FL| 4|
| PR| 5|
+-----+-----+
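groupBy() can also feed agg() to compute several aggregations at once; for example, counting rows and distinct city names per state:
from pyspark.sql.functions import count, countDistinct
# count("*") counts all rows; countDistinct("city") counts unique city names
df.groupBy("state") \
  .agg(count("*").alias("count"), countDistinct("city").alias("cities")) \
  .show()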
You can achieve grouping in Spark SQL by using the GROUP BY clause.
spark.sql(""" SELECT state, count(*) as count FROM ZIPCODES
GROUP BY state""") \
.show()
+-----+-----+
|state|count|
+-----+-----+
| AZ| 2|
| NC| 3|
| AL| 3|
| TX| 3|
| FL| 4|
| PR| 5|
+-----+-----+
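To filter groups after aggregation, SQL provides the HAVING clause; for example, keeping only states with more than two rows:
spark.sql(""" SELECT state, count(*) as count FROM ZIPCODES
          GROUP BY state HAVING count(*) > 2 """) \
          .show()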
SQL Join Operations
Similarly, if you have two tables, you can perform join operations in Spark. Here is an example.
emp = [(1, "Smith", -1, "2018", "10", "M", 3000),
       (2, "Rose", 1, "2010", "20", "M", 4000),
       (3, "Williams", 1, "2010", "10", "M", 1000),
       (4, "Jones", 2, "2005", "10", "F", 2000),
       (5, "Brown", 2, "2010", "40", "", -1),
       (6, "Brown", 2, "2010", "50", "", -1)]
empColumns = ["emp_id", "name", "superior_emp_id", "year_joined", "emp_dept_id", "gender", "salary"]
# SQLContext is deprecated since Spark 3.0; create the DataFrame directly from the session
empDF = spark.createDataFrame(emp, empColumns)
empDF.show()
+------+--------+---------------+-----------+-----------+------+------+
|emp_id| name|superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
| 1| Smith| -1| 2018| 10| M| 3000|
| 2| Rose| 1| 2010| 20| M| 4000|
| 3|Williams| 1| 2010| 10| M| 1000|
| 4| Jones| 2| 2005| 10| F| 2000|
| 5| Brown| 2| 2010| 40| | -1|
| 6| Brown| 2| 2010| 50| | -1|
+------+--------+---------------+-----------+-----------+------+------+
dept = [("Finance", 10),
        ("Marketing", 20),
        ("Sales", 30),
        ("IT", 40)]
deptColumns = ["dept_name", "dept_id"]
deptDF = spark.createDataFrame(dept, deptColumns)
deptDF.show()
+---------+-------+
|dept_name|dept_id|
+---------+-------+
| Finance| 10|
|Marketing| 20|
| Sales| 30|
| IT| 40|
+---------+-------+
# To run SQL queries against these DataFrames, register them as temporary views
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")
spark.sql("select * from EMP e, DEPT d where e.emp_dept_id == d.dept_id")\
.show()
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id| name|superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
| 1| Smith| -1| 2018| 10| M| 3000| Finance| 10|
| 3|Williams| 1| 2010| 10| M| 1000| Finance| 10|
| 4| Jones| 2| 2005| 10| F| 2000| Finance| 10|
| 2| Rose| 1| 2010| 20| M| 4000|Marketing| 20|
| 5| Brown| 2| 2010| 40| | -1| IT| 40|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
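The comma-separated FROM list above is the implicit join syntax; the same inner join can be written explicitly, in both the DataFrame API and SQL:
# DataFrame API join; the third argument also accepts "left", "right", "full", etc.
empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "inner").show()
# Equivalent ANSI SQL join
spark.sql("select * from EMP e INNER JOIN DEPT d ON e.emp_dept_id = d.dept_id").show()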