Scenario Based Interview: PySpark vs Spark SQL
Ganesh. R
#Problem Statement
Product recommendation of the basic kind (“customers who bought this also bought…”) is, in its simplest form, an outcome of basket analysis. In this solution, we will find the products that are most frequently bought together using simple SQL. Based on this purchase history, an e-commerce website can recommend products to a new user.
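The approach in the code below: self-join the orders table on order_id so that every pair of different products bought in the same order becomes one row, remove duplicate pairs, and then count in how many distinct orders each pair appears. The pairs with the highest counts become the “frequently bought together” recommendations.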
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Initialize Spark session
spark = SparkSession.builder \
    .appName("OrdersProducts") \
    .getOrCreate()
# Define schema for orders
orders_schema = StructType([
    StructField("order_id", IntegerType(), True),
    StructField("customer_id", IntegerType(), True),
    StructField("product_id", IntegerType(), True)
])
# Define schema for products
products_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True)
])
# Create data for orders
orders_data = [
    (1, 1, 1),
    (1, 1, 2),
    (1, 1, 3),
    (2, 2, 1),
    (2, 2, 2),
    (2, 2, 4),
    (3, 1, 5)
]
# Create data for products
products_data = [
    (1, 'A'),
    (2, 'B'),
    (3, 'C'),
    (4, 'D'),
    (5, 'E')
]
# Create DataFrame for orders
orders_df = spark.createDataFrame(orders_data, schema=orders_schema)
# Create DataFrame for products
products_df = spark.createDataFrame(products_data, schema=products_schema)
# Show the data (display() works in Databricks notebooks; use .show() in plain PySpark)
orders_df.display()
products_df.display()
# Create temporary views for SQL queries
orders_df.createOrReplaceTempView("orders")
products_df.createOrReplaceTempView("products")
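For reference, in this sample data order 1 contains products A, B and C, order 2 contains A, B and D, and order 3 contains only E. So the A/B pair occurs in two distinct orders and every other pair in exactly one; that is the frequency ranking both solutions below should produce (E never appears in any pair).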
###PySpark
from pyspark.sql.functions import col, concat, monotonically_increasing_id, row_number, countDistinct
from pyspark.sql.window import Window
# Alias the orders DataFrame for joining
a = orders_df.alias("a")
b = orders_df.alias("b")
# Perform the join and necessary transformations
t1 = a.join(b, (a.order_id == b.order_id) & (a.product_id != b.product_id)) \
    .join(products_df.alias("p1"), col("a.product_id") == col("p1.id"), "left") \
    .join(products_df.alias("p2"), col("b.product_id") == col("p2.id"), "left") \
    .select(
        col("a.order_id").alias("order_id"),
        col("a.customer_id").alias("customer_id"),
        col("p1.name").alias("name1"),
        col("p2.name").alias("name2"),
        (col("p1.id") + col("p2.id")).alias("pair_sum"),
        monotonically_increasing_id().alias("idf")
    )
# Define window specification for row_number
window_spec = Window.partitionBy("order_id", "pair_sum").orderBy("idf")
# Apply row_number function to filter duplicates
t2 = t1.withColumn("rnk", row_number().over(window_spec))
# Filter rows to keep only the first occurrence of each pair_sum within each order
t3 = t2.filter(col("rnk") == 1) \
    .withColumn("pair", concat(col("name1"), col("name2")))
# Perform final aggregation
result_df = t3.groupBy("pair") \
    .agg(countDistinct("order_id").alias("frequency")) \
    .orderBy(col("frequency").desc())
# Show the result
result_df.display()
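Note that the pair_sum trick has an edge case: two different pairs in the same order can share the same id sum (for example ids 1 + 4 and 2 + 3), and the rnk = 1 filter would then drop one of them. A common simplification, shown here only as an alternative sketch and not as part of the original solution, is to keep each unordered pair exactly once by joining on a.product_id < b.product_id, which also makes the pair label deterministic:

# Alternative sketch (assumption: same orders_df / products_df as above).
# Joining on a.product_id < b.product_id keeps each unordered pair once,
# so no pair_sum / row_number dedup step is needed.
from pyspark.sql.functions import col, concat_ws, countDistinct

pairs_df = (
    orders_df.alias("a")
    .join(orders_df.alias("b"),
          (col("a.order_id") == col("b.order_id")) &
          (col("a.product_id") < col("b.product_id")))
    .join(products_df.alias("p1"), col("a.product_id") == col("p1.id"), "left")
    .join(products_df.alias("p2"), col("b.product_id") == col("p2.id"), "left")
    .select(col("a.order_id").alias("order_id"),
            concat_ws(" ", col("p1.name"), col("p2.name")).alias("pair"))
    .groupBy("pair")
    .agg(countDistinct("order_id").alias("frequency"))
    .orderBy(col("frequency").desc())
)

pairs_df.display()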
###Spark SQL
%sql
with t1 as (
  Select a.order_id, a.customer_id, p1.name as name1, p2.name as name2,
         (p1.id + p2.id) as pair_sum, monotonically_increasing_id() as idf
  from orders a
  inner join orders b on a.order_id = b.order_id and a.product_id <> b.product_id
  left join products p1 on a.product_id = p1.id
  left join products p2 on b.product_id = p2.id
),
t2 as (
  Select order_id, customer_id, name1, name2, pair_sum,
         row_number() over (partition by order_id, pair_sum order by idf asc) as rnk
  from t1
),
t3 as (
  Select *, concat(name1, ' ', name2) as pair
  from t2
  where rnk = 1
)
Select pair, count(distinct order_id) as frequency
from t3
group by pair
order by 2 desc
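The same simplification can be written in Spark SQL. The sketch below assumes the same orders and products temp views and joins on a.product_id < b.product_id, so each pair is counted once without the pair_sum / row_number step:

%sql
-- Alternative sketch: one row per unordered pair, so no dedup CTEs are needed
Select concat(p1.name, ' ', p2.name) as pair,
       count(distinct a.order_id) as frequency
from orders a
inner join orders b on a.order_id = b.order_id and a.product_id < b.product_id
left join products p1 on a.product_id = p1.id
left join products p2 on b.product_id = p2.id
group by 1
order by 2 desc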
If you found this post useful, please save it.

Ganesh. R | +91-9030485102 | Hyderabad, Telangana | rganesh0203@gmail.com
Medium: https://medium.com/@rganesh0203 | Portfolio: https://rganesh203.github.io/Portfolio/
GitHub: https://github.com/rganesh203 | LinkedIn: https://www.linkedin.com/in/r-ganesh-a86418155/
Instagram: https://www.instagram.com/rg_data_talks/ | Topmate: https://topmate.io/ganesh_r0203