Can you explain the different transformations you've done in your project?
Be prepared: learn 50 PySpark transformations to stand out.
Abhishek Agrawal
Azure Data Engineer
1. Normalization
Scaling data to a range between 0 and 1.
from pyspark.ml.feature import MinMaxScaler
scaler = MinMaxScaler(inputCol="features", outputCol="scaled_features")
scaled_data = scaler.fit(data).transform(data)
2. Standardization
Transforming data to have zero mean and unit variance.
from pyspark.ml.feature import StandardScaler
# withMean=True centers the data to zero mean (requires dense feature vectors)
scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True, withStd=True)
scaled_data = scaler.fit(data).transform(data)
3. Log Transformation
Applying a logarithmic transformation to handle skewed data.
from pyspark.sql.functions import log
# Apply a natural log transformation to the positive, skewed 'value' column
data = data.withColumn("log_value", log(data["value"]))
4. Binning
Grouping continuous values into discrete bins.
from pyspark.sql.functions import when
# Add a new column 'bin_column' based on conditions
data = data.withColumn(
"bin_column",
when(data["value"] < 10, "Low")
.when(data["value"] < 20, "Medium")
.otherwise("High")
)
5. One-Hot Encoding
Converting categorical variables into binary columns.
from pyspark.ml.feature import OneHotEncoder, StringIndexer
# Step 1: Indexing the categorical column
indexer = StringIndexer(inputCol="category", outputCol="category_index")
indexed_data = indexer.fit(data).transform(data)
# Step 2: One-hot encoding the indexed column
encoder = OneHotEncoder(inputCol="category_index", outputCol="category_onehot")
encoded_data = encoder.fit(indexed_data).transform(indexed_data)
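If you prefer a single fit/transform call, the two stages can also be chained with an ML Pipeline; a minimal sketch reusing the indexer and encoder defined above:
from pyspark.ml import Pipeline
# Chain indexing and encoding into one pipeline
pipeline = Pipeline(stages=[indexer, encoder])
encoded_data = pipeline.fit(data).transform(data)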
6. Label Encoding
Converting categorical values into integer labels.
from pyspark.ml.feature import StringIndexer
# Step 1: Create a StringIndexer to index the 'category' column
indexer = StringIndexer(inputCol="category", outputCol="category_index")
# Step 2: Fit the indexer on the data and transform it
indexed_data = indexer.fit(data).transform(data)
7. Pivoting
Pivoting is the process of transforming long-format data (where each row
represents a single observation or record) into wide-format data (where
each column represents a different attribute or category). This
transformation is typically used when you want to turn a categorical
variable into columns and aggregate values accordingly.
# Pivot the data to create a summary of sales by month for each ID
pivoted_data = data.groupBy("id") \
    .pivot("month") \
    .agg({"sales": "sum"})
8. Unpivoting
Unpivoting is the opposite of pivoting. It transforms wide-format data
(where each column represents a different category or attribute) into
long-format data (where each row represents a single observation). This
is useful when you want to turn column headers back into values.
# Unpivoting the data to convert columns into rows
unpivoted_data = data.selectExpr(
"id",
"stack(2, 'Jan', Jan, 'Feb', Feb) as (month, sales)"
)
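On Spark 3.4+ the same result can be produced with the built-in DataFrame.unpivot (aliased as melt); a sketch assuming the same 'Jan' and 'Feb' columns:
# Built-in unpivot (Spark 3.4+): id column, value columns, variable name, value name
unpivoted_data = data.unpivot("id", ["Jan", "Feb"], "month", "sales")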
9. Aggregation
Summarizing data by applying functions like sum(), avg(), etc.
# Aggregating data by category to compute the sum of values
aggregated_data = data.groupBy("category") \
.agg({"value": "sum"})
10. Feature Extraction
Extracting useful features from raw data.
from pyspark.sql.functions import year, month, dayofmonth
# Add year, month, and day columns to the DataFrame
data = (
data
.withColumn("year", year(data["timestamp"]))
.withColumn("month", month(data["timestamp"]))
.withColumn("day", dayofmonth(data["timestamp"]))
)
11. Outlier Removal
Filtering out extreme values (outliers).
# Filter rows where the 'value' column is less than 1000
filtered_data = data.filter(data["value"] < 1000)
12. Data Imputation
Filling missing values with the mean or median.
from pyspark.ml.feature import Imputer
# Create an Imputer instance
imputer = Imputer(inputCols=["column"], outputCols=["imputed_column"])
# Fit the imputer model and transform the data
imputed_data = imputer.fit(data).transform(data)
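To impute with the median rather than the default mean, set the strategy parameter:
# Impute missing values with the median instead of the mean
imputer = Imputer(inputCols=["column"], outputCols=["imputed_column"], strategy="median")
imputed_data = imputer.fit(data).transform(data)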
13. Date/Time Parsing
Converting string to datetime objects.
from pyspark.sql.functions import to_timestamp
# Convert the 'date_string' column to a timestamp with the specified format
data = data.withColumn("timestamp", to_timestamp(data["date_string"], "yyyy-MM-dd"))
14. Text Transformation
Converting text to lowercase.
from pyspark.sql.functions import lower
# Convert the text in 'text_column' to lowercase and store it in a new column
data = data.withColumn("lowercase_text", lower(data["text_column"]))
15. Data Merging
Combining two datasets based on a common column.
# Perform an inner join between data1 and data2 on the 'id' column
merged_data = data1.join(data2, data1["id"] == data2["id"], "inner")
16. Data Joining
Joining data using inner, left, or right joins.
# Perform a left join between data1 and data2 on the 'id' column
joined_data = data1.join(data2, on="id", how="left")
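The same call covers the other join types mentioned above by changing the how argument:
# Inner and right joins use the same syntax with a different 'how' value
inner_joined = data1.join(data2, on="id", how="inner")
right_joined = data1.join(data2, on="id", how="right")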
17. Filtering Rows
Filtering rows based on a condition.
# Filter rows where the 'value' column is greater than 10
filtered_data = data.filter(data["value"] > 10)
18. Column Renaming
Renaming columns for clarity.
# Rename the column 'old_column' to 'new_column'
data = data.withColumnRenamed("old_column", "new_column")
19. Column Dropping
Removing unnecessary columns.
# Drop the 'unwanted_column' from the DataFrame
data = data.drop("unwanted_column")
20. Column Conversion
Converting a column from one data type to another.
from pyspark.sql.functions import col
# Convert 'column_string' to an integer and create a new column 'column_int'
data = data.withColumn("column_int", col("column_string").cast("int"))
21. Type Casting
Changing the type of a column (e.g., from string to integer).
# Convert 'column_string' to an integer and create a new column 'column_int'
data = data.withColumn("column_int", data["column_string"].cast("int"))
22. Duplicate Removal
Removing duplicate rows based on specified columns.
# Remove duplicate rows based on 'column1' and 'column2'
data = data.dropDuplicates(["column1", "column2"])
23. Null Value Removal
Filtering rows with null values in specified columns.
# Filter rows where the 'column' is not null
cleaned_data = data.filter(data["column"].isNotNull())
24. Windowing Functions
Using window functions to rank or aggregate data.
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
# Define a window specification partitioned by 'category' and ordered by 'value'
window_spec = Window.partitionBy("category").orderBy("value")
# Add a 'rank' column based on the window specification
data = data.withColumn("rank", rank().over(window_spec))
25. Column Combination
Combining multiple columns into one.
from pyspark.sql.functions import concat
# Concatenate 'first_name' and 'last_name' columns to create 'full_name'
data = data.withColumn("full_name", concat(data["first_name"], data["last_name"])
)
26. Cumulative Sum
Calculating a running total of a column.
from pyspark.sql.window import Window
from pyspark.sql.functions import sum
# Define a window ordered by 'date' from the first row up to the current row
window_spec = (
    Window.orderBy("date")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
# Add a 'cumulative_sum' column that computes the running total of 'value'
data = data.withColumn("cumulative_sum", sum("value").over(window_spec))
27. Rolling Average
Calculating a moving average over a window of rows.
from pyspark.sql.window import Window
from pyspark.sql.functions import avg
window_spec = Window.orderBy("date").rowsBetween(-2, 2)
data = data.withColumn("rolling_avg", avg("value").over(window_spec))
28. Value Mapping
Mapping values of a column to new values.
from pyspark.sql.functions import when
# Map 'value' column: set 'mapped_column' to 'A' if 'value' is 1, otherwise 'B'
data = data.withColumn("mapped_column", when(data["value"] == 1, "A").
otherwise("B"))
29. Subsetting Columns
Selecting only a subset of columns from the dataset.
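A minimal example, assuming the columns of interest are 'column1' and 'column2':
# Keep only the required columns
subset_data = data.select("column1", "column2")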
30. Column Operations
Performing arithmetic operations on columns.
# Create a new column 'new_column' as the sum of 'value1' and 'value2'
data = data.withColumn("new_column", data["value1"] + data["value2"])
31. String Splitting
Splitting a string column into multiple columns based on a delimiter.
from pyspark.sql.functions import split
# Split the values in 'column' by a comma and store the result in 'split_column'
data = data.withColumn("split_column", split(data["column"], ","))
32. Data Flattening
Flattening nested structures (e.g., JSON) into a tabular format.
from pyspark.sql.functions import explode
# Flatten the array or map in 'nested_column' into multiple rows in 'flattened_column'
data = data.withColumn("flattened_column", explode(data["nested_column"]))
33. Sampling Data
Taking a random sample of the data.
# Sample 10% of the data
sampled_data = data.sample(fraction=0.1)
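Passing a seed makes the sample reproducible across runs:
# Sample 10% of the data with a fixed seed for reproducibility
sampled_data = data.sample(fraction=0.1, seed=42)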
34. Stripping Whitespace
Removing leading and trailing whitespace from string columns.
from pyspark.sql.functions import trim
# Remove leading and trailing spaces from 'string_column' and create 'trimmed_column'
data = data.withColumn("trimmed_column", trim(data["string_column"]))
35. String Replacing
Replacing substrings within a string column.
from pyspark.sql.functions import regexp_replace
# Replace occurrences of 'old_value' with 'new_value' in 'text_column' and create 'updated_column'
data = data.withColumn("updated_column", regexp_replace(data["text_column"], "old_value", "new_value"))
36. Date Difference
Calculating the difference between two date columns.
from pyspark.sql.functions import datediff
# Calculate the difference in days between 'end_date' and 'start_date' in a new 'date_diff' column
data = data.withColumn("date_diff", datediff(data["end_date"], data["start_date"]))
37. Window Rank
Ranking rows based on a specific column.
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
# Define a window specification ordered by 'value'
window_spec = Window.orderBy("value")
# Add a 'rank' column based on the window specification
data = data.withColumn("rank", rank().over(window_spec))
38. Multi-Column Aggregation
Performing multiple aggregation operations on different columns.
# Group by 'category' and calculate the sum of 'value1' and the average of 'value2'
aggregated_data = data.groupBy("category").agg(
{"value1": "sum", "value2": "avg"}
)
39. Date Truncation
Truncating a date column to a specific unit (e.g., year, month).
from pyspark.sql.functions import trunc
# Truncate 'date_column' to the beginning of the month and add it as a new column
data = data.withColumn("truncated_date", trunc(data["date_column"], "MM"))
40. Repartitioning Data
Changing the number of partitions for better performance.
# Repartition the DataFrame into 4 partitions
data = data.repartition(4)
41. Adding Sequence Numbers
Assigning a unique sequence number to each row.
from pyspark.sql.functions import monotonically_increasing_id
# Add a 'row_id' column with a unique, monotonically increasing (not necessarily consecutive) ID
data = data.withColumn("row_id", monotonically_increasing_id())
42. Shuffling Data
Randomly shuffling rows in a dataset.
from pyspark.sql.functions import rand
# Shuffle the DataFrame by ordering rows randomly
shuffled_data = data.orderBy(rand())
43. Array Aggregation
Combining values into an array.
from pyspark.sql.functions import collect_list
# Group by 'id' and aggregate 'value' into a list stored in a new column 'values_array'
data = data.groupBy("id").agg(collect_list("value").alias("values_array"))
44. Scaling
Scaling features by a specific factor.
from pyspark.sql.functions import col
# Multiply the 'value' column by a constant factor of 10 to scale it
data = data.withColumn("scaled_value", col("value") * 10)
45. Bucketing
Grouping continuous data into buckets.
from pyspark.ml.feature import Bucketizer
# Define split points for bucketing
splits = [0, 10, 20, 30, 40, 50]
# Initialize the Bucketizer with splits, input column, and output column
bucketizer = Bucketizer(splits=splits, inputCol="value", outputCol="bucketed_value")
# Apply the bucketizer transformation to the DataFrame
bucketed_data = bucketizer.transform(data)
46. Boolean Operations
Performing boolean operations on columns.
from pyspark.sql.functions import col
# Add a new column 'is_valid' indicating whether the 'value' column is greater than 10
data = data.withColumn("is_valid", col("value") > 10)
47. Extracting Substrings
Extracting a portion of a string from a column.
from pyspark.sql.functions import substring, col
# Add a new column 'substring' containing the first 5 characters of 'text_column'
data = data.withColumn("substring", substring(col("text_column"), 1, 5))
48. JSON Parsing
Parsing JSON data into structured columns.
from pyspark.sql.functions import from_json, col
# Parse the JSON in 'json_column' into a structured column 'json_data' using the specified schema
data = data.withColumn("json_data", from_json(col("json_column"), schema))
49. String Length
Finding the length of a string column.
from pyspark.sql.functions import length, col
# Add a new column 'string_length' containing the length of the strings in 'text_column'
data = data.withColumn("string_length", length(col("text_column")))
50. Row-wise Operations
Applying a custom function to each row of a column using a User-Defined Function (UDF).
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType
# Define a function that adds 2 to the input value
def add_two(value):
    return value + 2
# Register the function as a UDF
add_two_udf = udf(add_two, IntegerType())
# Apply the UDF to the 'value' column and create a new column 'incremented_value'
data = data.withColumn("incremented_value", add_two_udf(col("value")))
Follow for more
content like this
Abhishek Agrawal
Azure Data Engineer