Module – 5 Apache Spark
1. Explain Spark Operations with an example?
Solution:
Apache Spark is an open-source, distributed computing framework that provides a fast and
general-purpose cluster computing system for big data processing. Spark operates on
distributed datasets and performs various operations to process and transform data. These
operations can be categorized into two types: transformations and actions.
TRANSFORMATIONS:
Transformations are operations that create a new dataset from an existing one. They are
executed lazily, which means that they are not computed immediately but instead build a
lineage of transformations that will be executed when an action is called. Here are some
common transformations:
map: Applies a function to each element of the dataset, producing a new dataset. For
example, if you have a dataset of numbers and you want to square each number (in the
snippets below, sc is an existing SparkContext):
input_data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(input_data)
squared_rdd = rdd.map(lambda x: x**2)
flatMap: Similar to map, but each input item can be mapped to zero or more output items.
For example, splitting lines of text into words:
lines = ["Hello world", "Spark is great", "Big Data"]
rdd = sc.parallelize(lines)
words_rdd = rdd.flatMap(lambda line: line.split(" "))
distinct: Returns a new dataset with distinct elements.
input_data = [1, 2, 2, 3, 3, 4, 4, 5]
rdd = sc.parallelize(input_data)
distinct_rdd = rdd.distinct()
ACTIONS:
Actions are operations that trigger the execution of transformations and return a value to
the driver program or write data to an external storage system. Actions are the operations
that actually perform computation. Here are some common actions:
collect: Retrieves all the elements of the dataset and returns them to the driver program.
Use this with caution as it can be memory-intensive for large datasets.
input_data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(input_data)
result = rdd.collect()
count: Returns the number of elements in the dataset.
input_data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(input_data)
count = rdd.count()
first: Returns the first element of the dataset.
input_data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(input_data)
first_element = rdd.first()
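Putting transformations and actions together: the minimal sketch below (assuming a local SparkContext named sc, as in the examples that follow) chains flatMap, map, and reduceByKey, and nothing executes until the collect action is called, which illustrates lazy evaluation.
from pyspark import SparkContext
# Assumed setup: a local SparkContext (any existing sc would also work)
sc = SparkContext("local", "WordCountExample")
lines = ["Hello world", "Spark is great", "Hello Spark"]
rdd = sc.parallelize(lines)
# Transformations only build the lineage; no computation happens yet
words = rdd.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda x, y: x + y)
# The action triggers execution of the whole lineage
print(counts.collect())   # e.g. [('Hello', 2), ('Spark', 2), ...] (order may vary)
sc.stop()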
2. Perform the below RDD operations
a. How to read multiple text files into an RDD
SOLUTION:
from pyspark import SparkContext
# Create a SparkContext
sc = SparkContext("local", "ReadTextFilesExample")
# Specify a path pattern that matches the text files
input_directory = "dbfs:/FileStore/txtfile/*.txt"
# Use textFile to read the text files
text_files_rdd = sc.textFile(input_directory)
# Print the entire contents of the RDD using collect()
all_elements = text_files_rdd.collect()
# Print each element in the RDD
for element in all_elements:
print(element)
# stop the SparkContext
sc.stop()
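Two related options, run on the same SparkContext before sc.stop() (the file names below are hypothetical, for illustration only): textFile also accepts a comma-separated list of paths, and wholeTextFiles returns (file path, file content) pairs, which is useful when you need to know which file each record came from.
# textFile with a comma-separated list of specific files
paths = "dbfs:/FileStore/txtfile/a.txt,dbfs:/FileStore/txtfile/b.txt"
lines_rdd = sc.textFile(paths)
# wholeTextFiles returns (file path, entire file content) pairs
files_rdd = sc.wholeTextFiles("dbfs:/FileStore/txtfile/")
print(files_rdd.keys().collect())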
b. Read a CSV file into an RDD
Solution:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("ReadCSVIntoDataFrameExample").getOrCreate()
# Specify the path to the CSV file
csv_file_path = "dbfs:/FileStore/Address.csv"
# Read the CSV file into a DataFrame
df = spark.read.csv(csv_file_path, header=True, inferSchema=True)
# Show the DataFrame
df.show()
# Stop the SparkSession
spark.stop()
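The snippet above produces a DataFrame; since the question asks for an RDD, here is a hedged sketch of two common follow-ups (run before spark.stop(), using the same csv_file_path as above):
# Option 1: convert the DataFrame into an RDD of Row objects
row_rdd = df.rdd
# Option 2: read the raw CSV text and split each line on commas
raw_rdd = spark.sparkContext.textFile(csv_file_path)
header = raw_rdd.first()
csv_rdd = raw_rdd.filter(lambda line: line != header).map(lambda line: line.split(","))
print(csv_rdd.take(5))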
c. Ways to create an RDD
Solution:
There are three ways to create an RDD, as sketched below:
1. Parallelizing an existing collection: pass a Python list (or other collection) to sc.parallelize().
2. Reading from external storage: use sc.textFile() or a similar method on files in HDFS, S3, DBFS, or the local file system.
3. Creating an RDD from another RDD: apply a transformation such as map or filter to an existing RDD.
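A minimal sketch of all three approaches, assuming an existing SparkContext sc and a sample file path of your own:
# 1. Parallelizing an existing collection
rdd_from_list = sc.parallelize([1, 2, 3, 4, 5])
# 2. Reading from external storage (replace the path with one of your files)
rdd_from_file = sc.textFile("dbfs:/FileStore/txtfile/sample.txt")
# 3. Creating an RDD from another RDD via a transformation
rdd_from_rdd = rdd_from_list.map(lambda x: x * 10)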
d. Create an empty RDD
Solution:
We can create an empty RDD by calling the parallelize method with an empty list:
from pyspark import SparkContext
# Create a SparkContext
sc = SparkContext("local", "EmptyRDDExample")
# Create an empty RDD
empty_rdd = sc.parallelize([])
# Perform operations on the empty RDD (for example, count the elements)
count = empty_rdd.count()
# Print the result
print(f"Number of elements in the empty RDD: {count}")
# Stop the SparkContext
sc.stop()
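As an alternative not shown above, SparkContext also provides emptyRDD(), which creates an RDD with no elements and no partitions; call it before sc.stop():
empty_rdd2 = sc.emptyRDD()
print(empty_rdd2.isEmpty())           # True
print(empty_rdd2.getNumPartitions())  # 0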
e. RDD Pair Functions
Solutions:
reduceByKey(func)
Combines values for each key using a specified reduction function.
data = [("apple", 3), ("banana", 2), ("apple", 5), ("banana", 1)]
rdd = sc.parallelize(data)
reduced_rdd = rdd.reduceByKey(lambda x, y: x + y)
reduced_rdd.collect()
sortByKey(ascending=True)
Sorts the RDD by its keys.
data = [("apple", 3), ("banana", 2), ("orange", 5), ("grape", 1)]
rdd = sc.parallelize(data)
sorted_rdd = rdd.sortByKey()
sorted_rdd.collect()
mapValues(func)
Applies a function to each value in the RDD without changing the keys.
data = [("apple", 3), ("banana", 2), ("orange", 5)]
rdd = sc.parallelize(data)
mapped_rdd = rdd.mapValues(lambda x: x * 2)
mapped_rdd.collect()
f. Generate Data Frame from RDD
Solution:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("RDDToDataFrameExample").getOrCreate()
# Create an RDD
data = [("apple", 3), ("banana", 2), ("apple", 5), ("banana", 1)]
rdd = spark.sparkContext.parallelize(data)
# Define the schema for the DataFrame
schema = ["fruit", "quantity"]
# Convert the RDD to a DataFrame
df = rdd.toDF(schema)
# Show the DataFrame
df.show()
# Stop the SparkSession
spark.stop()
3. How does Spark shuffle work?
Spark shuffle is a crucial operation in Apache Spark that redistributes data across partitions so
that transformations which group or aggregate data by key can bring the related records
together. It plays a vital role in executing such wide operations efficiently on large datasets.
Understanding Spark Shuffle
Spark shuffle typically occurs when data needs to be grouped or aggregated based on certain
keys. It involves several steps:
1. Map Phase: The input data is divided into partitions, and each partition is assigned to a Spark
executor. Each executor applies a mapper function to each record, generating key-value
pairs.
2. Sort Phase: The key-value pairs are sorted within each partition using a sorting algorithm.
This ensures that all records with the same key are grouped together.
3. Hash Partitioning: The sorted key-value pairs are hashed and distributed to a specified
number of shuffle partitions. This ensures that records with the same key end up in the same
partition, regardless of their original partition.
4. Reduce Phase: The shuffled key-value pairs are aggregated or reduced within each shuffle
partition using a reducer function. This consolidates the data associated with each key.
5. Write to Disk: The aggregated data is written to disk, typically in temporary files. This is
necessary when the data doesn't fit in memory or when multiple stages of a Spark job
require the same shuffled data.
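To make this concrete, here is a minimal PySpark sketch (assuming an existing SparkContext sc): reduceByKey forces a shuffle, and its numPartitions argument controls how many shuffle partitions the hashed key-value pairs are spread across.
data = [("apple", 3), ("banana", 2), ("apple", 5), ("banana", 1), ("orange", 4)]
rdd = sc.parallelize(data, 4)   # 4 map-side input partitions
# reduceByKey triggers a shuffle: each key is hashed to one of the
# shuffle partitions, and its values are combined with the reducer function
reduced = rdd.reduceByKey(lambda x, y: x + y, numPartitions=2)
print(reduced.getNumPartitions())  # 2 shuffle partitions
print(reduced.collect())           # e.g. [('apple', 8), ('banana', 3), ('orange', 4)] (order may vary)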