Module – 5 Apache Spark
1. Explain Spark Operations with an example?
Solution:
Apache Spark is an open-source, distributed computing framework that provides a fast and
general-purpose cluster computing system for big data processing. Spark operates on
distributed datasets and performs various operations to process and transform data. These
operations can be categorized into two types: transformations and actions.
TRANSFORMATIONS:
Transformations are operations that create a new dataset from an existing one. They are
executed lazily, which means that they are not computed immediately but instead build a
lineage of transformations that will be executed when an action is called. Here are some
common transformations:
map: Applies a function to each element of the dataset, producing a new dataset. For
example, if you have a dataset of numbers and you want to square each number (in the
snippets below, sc is an existing SparkContext):
input_data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(input_data)
squared_rdd = rdd.map(lambda x: x**2)
flatMap: Similar to map, but each input item can be mapped to zero or more output items.
For example, splitting lines of text into words:
lines = ["Hello world", "Spark is great", "Big Data"]
rdd = sc.parallelize(lines)
words_rdd = rdd.flatMap(lambda line: line.split(" "))
distinct: Returns a new dataset with distinct elements.
input_data = [1, 2, 2, 3, 3, 4, 4, 5]
rdd = sc.parallelize(input_data)
distinct_rdd = rdd.distinct()
ACTIONS:
Actions are operations that trigger the execution of transformations and return a value to
the driver program or write data to an external storage system. Actions are the operations
that actually perform computation. Here are some common actions:
collect: Retrieves all the elements of the dataset and returns them to the driver program.
Use this with caution as it can be memory-intensive for large datasets.
input_data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(input_data)
result = rdd.collect()
count: Returns the number of elements in the dataset.
input_data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(input_data)
count = rdd.count()
first: Returns the first element of the dataset.
input_data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(input_data)
first_element = rdd.first()
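Putting transformations and actions together: the minimal sketch below (assuming a local SparkContext named sc, as in the examples that follow) chains flatMap, map, and reduceByKey, and nothing executes until the collect action is called, which illustrates lazy evaluation.
from pyspark import SparkContext
# Assumed setup: a local SparkContext (any existing sc would also work)
sc = SparkContext("local", "WordCountExample")
lines = ["Hello world", "Spark is great", "Hello Spark"]
rdd = sc.parallelize(lines)
# Transformations only build the lineage; no computation happens yet
words = rdd.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda x, y: x + y)
# The action triggers execution of the whole lineage
print(counts.collect())   # e.g. [('Hello', 2), ('Spark', 2), ...] (order may vary)
sc.stop()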
2. Perform the below RDD operations
a. How to read multiple text files into an RDD
SOLUTION:
from pyspark import SparkContext
# Create a SparkContext
sc = SparkContext("local", "ReadTextFilesExample")
# Specify a path pattern that matches the text files
input_directory = "dbfs:/FileStore/txtfile/*.txt"
# Use textFile to read the text files
text_files_rdd = sc.textFile(input_directory)
# Print the entire contents of the RDD using collect()
all_elements = text_files_rdd.collect()
# Print each element in the RDD
for element in all_elements:
print(element)
# stop the SparkContext
sc.stop()
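Two related options, run on the same SparkContext before sc.stop() (the file names below are hypothetical, for illustration only): textFile also accepts a comma-separated list of paths, and wholeTextFiles returns (file path, file content) pairs, which is useful when you need to know which file each record came from.
# textFile with a comma-separated list of specific files
paths = "dbfs:/FileStore/txtfile/a.txt,dbfs:/FileStore/txtfile/b.txt"
lines_rdd = sc.textFile(paths)
# wholeTextFiles returns (file path, entire file content) pairs
files_rdd = sc.wholeTextFiles("dbfs:/FileStore/txtfile/")
print(files_rdd.keys().collect())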
b. Read a CSV file into an RDD
Solution:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("ReadCSVIntoDataFrameExample").getOrCreate()
# Specify the path to the CSV file
csv_file_path = "dbfs:/FileStore/Address.csv"
# Read the CSV file into a DataFrame
df = spark.read.csv(csv_file_path, header=True, inferSchema=True)
# Show the DataFrame
df.show()
# Stop the SparkSession
spark.stop()
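The snippet above produces a DataFrame; since the question asks for an RDD, here is a hedged sketch of two common follow-ups (run before spark.stop(), using the same csv_file_path as above):
# Option 1: convert the DataFrame into an RDD of Row objects
row_rdd = df.rdd
# Option 2: read the raw CSV text and split each line on commas
raw_rdd = spark.sparkContext.textFile(csv_file_path)
header = raw_rdd.first()
csv_rdd = raw_rdd.filter(lambda line: line != header).map(lambda line: line.split(","))
print(csv_rdd.take(5))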
c. Ways to create an RDD
Solution:
There are three ways to create an RDD, as sketched below:
1. Parallelizing an existing collection: pass a Python list (or other collection) to sc.parallelize().
2. Reading from external storage: use sc.textFile() or a similar method on files in HDFS, S3, DBFS, or the local file system.
3. Creating an RDD from another RDD: apply a transformation such as map or filter to an existing RDD.
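A minimal sketch of all three approaches, assuming an existing SparkContext sc and a sample file path of your own:
# 1. Parallelizing an existing collection
rdd_from_list = sc.parallelize([1, 2, 3, 4, 5])
# 2. Reading from external storage (replace the path with one of your files)
rdd_from_file = sc.textFile("dbfs:/FileStore/txtfile/sample.txt")
# 3. Creating an RDD from another RDD via a transformation
rdd_from_rdd = rdd_from_list.map(lambda x: x * 10)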
d. Create an empty RDD
Solution:
We can create an empty RDD by calling the parallelize method with an empty list:
from pyspark import SparkContext
# Create a SparkContext
sc = SparkContext("local", "EmptyRDDExample")
# Create an empty RDD
empty_rdd = sc.parallelize([])
# Perform operations on the empty RDD (for example, count the elements)
count = empty_rdd.count()
# Print the result
print(f"Number of elements in the empty RDD: {count}")
# Stop the SparkContext
sc.stop()
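As an alternative not shown above, SparkContext also provides emptyRDD(), which creates an RDD with no elements and no partitions; call it before sc.stop():
empty_rdd2 = sc.emptyRDD()
print(empty_rdd2.isEmpty())           # True
print(empty_rdd2.getNumPartitions())  # 0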
e. RDD Pair Functions
Solutions:
reduceByKey(func)
Combines values for each key using a specified reduction function.
data = [("apple", 3), ("banana", 2), ("apple", 5), ("banana", 1)]
rdd = sc.parallelize(data)
reduced_rdd = rdd.reduceByKey(lambda x, y: x + y)
reduced_rdd.collect()
sortByKey(ascending=True)
Sorts the RDD by its keys.
data = [("apple", 3), ("banana", 2), ("orange", 5), ("grape", 1)]
rdd = sc.parallelize(data)
sorted_rdd = rdd.sortByKey()
sorted_rdd.collect()
mapValues(func)
Applies a function to each value in the RDD without changing the keys.
data = [("apple", 3), ("banana", 2), ("orange", 5)]
rdd = sc.parallelize(data)
mapped_rdd = rdd.mapValues(lambda x: x * 2)
mapped_rdd.collect()
f. Generate Data Frame from RDD
Solution:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("RDDToDataFrameExample").getOrCreate()
# Create an RDD
data = [("apple", 3), ("banana", 2), ("apple", 5), ("banana", 1)]
rdd = spark.sparkContext.parallelize(data)
# Define the schema for the DataFrame
schema = ["fruit", "quantity"]
# Convert the RDD to a DataFrame
df = rdd.toDF(schema)
# Show the DataFrame
df.show()
# Stop the SparkSession
spark.stop()
3. How does Spark shuffle work?
Spark shuffle is a crucial operation in Apache Spark that redistributes data across partitions so
that transformations which group or aggregate data by key can bring the related records
together. It plays a vital role in executing such wide operations efficiently on large datasets.
Understanding Spark Shuffle
Spark shuffle typically occurs when data needs to be grouped or aggregated based on certain
keys. It involves several steps:
1. Map Phase: The input data is divided into partitions, and each partition is assigned to a Spark
executor. Each executor applies a mapper function to each record, generating key-value
pairs.
2. Sort Phase: The key-value pairs are sorted within each partition using a sorting algorithm.
This ensures that all records with the same key are grouped together.
3. Hash Partitioning: The sorted key-value pairs are hashed and distributed to a specified
number of shuffle partitions. This ensures that records with the same key end up in the same
partition, regardless of their original partition.
4. Reduce Phase: The shuffled key-value pairs are aggregated or reduced within each shuffle
partition using a reducer function. This consolidates the data associated with each key.
5. Write to Disk: The aggregated data is written to disk, typically in temporary files. This is
necessary when the data doesn't fit in memory or when multiple stages of a Spark job
require the same shuffled data.
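To make this concrete, here is a minimal PySpark sketch (assuming an existing SparkContext sc): reduceByKey forces a shuffle, and its numPartitions argument controls how many shuffle partitions the hashed key-value pairs are spread across.
data = [("apple", 3), ("banana", 2), ("apple", 5), ("banana", 1), ("orange", 4)]
rdd = sc.parallelize(data, 4)   # 4 map-side input partitions
# reduceByKey triggers a shuffle: each key is hashed to one of the
# shuffle partitions, and its values are combined with the reducer function
reduced = rdd.reduceByKey(lambda x, y: x + y, numPartitions=2)
print(reduced.getNumPartitions())  # 2 shuffle partitions
print(reduced.collect())           # e.g. [('apple', 8), ('banana', 3), ('orange', 4)] (order may vary)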