Unit – V
Spark: Installing Spark, Spark applications, Jobs, stages and Tasks, Resilient Distributed Datasets, Shared
Variables, Anatomy of a Spark job run.
SPARK
Spark is an Apache Software Foundation project intended to speed up Hadoop's computational
processing. Spark includes its own cluster management, so Hadoop is only one of the ways in which
Spark can be deployed.
Spark can use Hadoop in two ways: for storage and for processing. Since Spark has its own cluster
management, it typically uses Hadoop for storage purposes only.
Apache Spark is a distributed, open-source processing system used for 'big data' workloads. Spark
uses optimized query execution and in-memory caching to run fast queries against data of any size. In
short, it is a general-purpose, fast engine for large-scale data processing.
Spark is versatile because it can be used for many things, such as working with data streams or graphs,
running machine learning algorithms, ingesting data into databases, building data pipelines, executing
distributed SQL, and more.
Installing Spark
To install Apache Spark, first ensure that the Java Development Kit (JDK) is installed on your system
because Spark requires Java to run. You can download and install Java from the official Oracle website or
use terminal commands on Linux/Mac. After installing, verify it by typing java -version in the terminal.
Next, if you plan to use PySpark (Spark with Python), install Python 3.x from python.org and verify it by
typing python --version. After setting up Java and Python, visit the official Spark website
spark.apache.org and download the latest version of Spark. Choose a pre-built version compatible with
Hadoop or standalone, depending on your requirement. Once downloaded, extract the Spark folder
using a tool like WinRAR (for Windows) or the tar -xvzf command (for Linux/Mac).
After extracting, you must set environment variables. On Windows, go to System Properties →
Environment Variables and set SPARK_HOME to your Spark folder path and add %SPARK_HOME%\bin to
the system PATH. On Linux or Mac, open .bashrc or .bash_profile and add export commands for
SPARK_HOME and update the PATH, then run source ~/.bashrc. If you want Spark to work with Hadoop
Distributed File System (HDFS), you also need to install Hadoop, although Spark can run independently in
standalone mode.
To check if Spark has been installed successfully, open Command Prompt or Terminal and type spark-
shell. If the Spark interactive shell starts without any error, it means the installation is successful. If you
wish to use Spark with Python, install PySpark by running the command pip install pyspark. With this,
you will be ready to develop and run Spark applications either locally or on a cluster environment.
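To confirm the setup from Python, a minimal PySpark script such as the one below can be run (the file name check_spark.py is just an example); it assumes PySpark was installed with pip as described above.

# check_spark.py - a minimal sketch to confirm a local PySpark installation works
from pyspark.sql import SparkSession

# Build a local SparkSession; "local[*]" uses all CPU cores on this machine.
spark = SparkSession.builder \
    .appName("InstallCheck") \
    .master("local[*]") \
    .getOrCreate()

print("Spark version:", spark.version)

# Run a tiny job to exercise the installation end to end.
df = spark.range(5)              # DataFrame with a single 'id' column: 0..4
print("Row count:", df.count())

spark.stop()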
IN SIMPLE WORDS
There are a few different ways to install and use Spark. We can install Spark on our machine as a
stand-alone framework, or use Spark VM (virtual machine) images available from vendors such as
MapR, Hortonworks, and Cloudera. We can also use Spark configured and installed in the cloud (such
as Databricks Cloud).
Spark applications
Spark applications are programs written using the Apache Spark framework, which is designed
for large-scale data processing. These applications can perform various tasks on big data sets,
such as data transformation, analysis, machine learning, and graph processing. Here are some
common types of Spark applications used in big data:
Data Processing: Spark can handle and process large datasets quickly, allowing you to filter, aggregate,
join, and transform data in parallel.
Machine Learning: With Spark's MLlib, you can perform scalable machine learning tasks like
classification, regression, clustering, and recommendation.
Graph Processing: Spark GraphX lets you work with graph data, applying algorithms like PageRank or
community detection to analyze relationships and structures.
Real-time Streaming: Spark Streaming processes live data streams (e.g., from Kafka or Twitter) for tasks
like detecting anomalies, analyzing sentiment, or providing real-time insights.
SQL and Analytics: Spark SQL allows you to run SQL queries on large datasets, making it easier to analyze
and process structured data using the DataFrame API (a short sketch follows this list).
ETL Operations: Spark is commonly used for ETL (Extract, Transform, Load) tasks, where data is pulled
from different sources, cleaned, and then stored in data lakes or warehouses for further analysis.
Text and Language Processing: Spark NLP helps with large-scale text processing tasks like sentiment
analysis, text classification, and entity recognition.
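As a brief illustration of the SQL and Analytics point above, here is a minimal PySpark sketch; the employee data and column names are invented for the example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQLExample").master("local[*]").getOrCreate()

# Small in-memory dataset; in practice this would usually be read from CSV/Parquet/HDFS.
data = [("Alice", "Sales", 3000), ("Bob", "Sales", 4000), ("Carol", "IT", 5000)]
df = spark.createDataFrame(data, ["name", "dept", "salary"])

# DataFrame API: average salary per department.
df.groupBy("dept").avg("salary").show()

# The same analysis expressed as SQL over a temporary view.
df.createOrReplaceTempView("employees")
spark.sql("SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept").show()

spark.stop()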
Jobs, stages and Tasks
In the data processing landscape, Apache Spark stands as one of the most popular and efficient
frameworks for big data analytics. Spark's distinguishing feature is its ability to process large datasets
at high speed, thanks to its in-memory computing capabilities. As a programmer working with Spark,
it is important to understand its internal workings, particularly the core concepts of Jobs, Stages, and
Tasks. This section looks at these concepts in detail.
Jobs in Spark:
Definition: A job in Spark represents the entire work that needs to be done on a given dataset and is
created when an action is triggered. Actions in Spark trigger the execution of transformations, causing
the job to be divided into stages and tasks.
Action vs. Transformation: Spark operations are divided into two types:
Transformations: These are operations like map(), filter(), and reduceByKey() that define the data
pipeline but are lazy, meaning they do not execute until an action is called.
Actions: These include operations like count(), collect(), and save(), and they trigger the execution of the
transformations that have been defined earlier.
Job Creation: When an action is invoked, Spark constructs a job. For instance, when you call count() on
an RDD, a job is created to perform this action. If another action like collect() is invoked, it triggers
another job. This means that a single Spark application may generate multiple jobs.
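As a rough PySpark sketch of this behaviour (the data here is made up), the transformations below run nothing by themselves, while each action submits its own job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JobsDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1, 101))           # distribute some sample data
squares = rdd.map(lambda x: x * x)            # transformation: lazy, nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)  # transformation: still lazy

print(evens.count())        # first action  -> Spark submits job 1
print(evens.collect()[:5])  # second action -> Spark submits job 2 (recomputed unless cached)

spark.stop()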
Jobs and Stages: Each job is divided into stages. The stages in a Spark job are determined by the
transformations that involve shuffling of data (e.g., groupByKey(), join()). A stage is a collection of tasks
that can be executed without shuffling data.
Execution Flow: A job consists of one or more stages, and each stage contains tasks that perform a
specific operation on a partition of data. These tasks are executed in parallel across a cluster of
machines.
Fault Tolerance: If a task fails during execution, Spark will reattempt the failed task. This is possible due
to the concept of RDD lineage, which tracks the transformations applied to the RDD and allows Spark to
recompute lost data if necessary.
Job Monitoring: Jobs in Spark can be monitored through the Spark UI, which shows details like the
number of stages, tasks, time taken, and the amount of data shuffled. This is useful for performance
tuning and debugging.
Impact on Performance: The number of jobs and the complexity of their stages can affect the
performance of a Spark application. Efficient job and stage design can help optimize resource utilization
and execution time.
Stages in Spark:
Definition of Stage: A stage is a set of operations in Spark that can be executed without the need for data
shuffling between partitions. Stages are a lower-level abstraction within a job.
Stage Creation: Stages are created when Spark encounters wide transformations (like groupByKey(),
reduceByKey(), or join()), which require data shuffling. Each stage consists of a sequence of narrow
transformations (like map() or filter()) that can be executed independently on each partition.
Narrow vs. Wide Transformations:
Narrow transformations (e.g., map(), filter()) do not require reshuffling of data and can be executed
within a single partition. These transformations belong to the same stage.
Wide transformations (e.g., groupByKey(), join()) require reshuffling data between partitions, which
necessitates the creation of a new stage. These transformations define the boundaries between stages.
Shuffling: The boundary between two stages is drawn whenever Spark needs to shuffle data between
partitions. This can significantly affect performance, as shuffling is an expensive operation that involves
disk I/O and network communication.
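A small PySpark sketch of where a stage boundary appears (the word list is invented): map() is a narrow transformation and stays in the same stage, while reduceByKey() forces a shuffle and therefore starts a new stage.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StagesDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "hadoop", "spark", "hive", "spark"])

pairs = words.map(lambda w: (w, 1))             # narrow: same stage as parallelize
counts = pairs.reduceByKey(lambda a, b: a + b)  # wide: shuffle -> stage boundary here

# The action triggers one job with two stages:
#   stage 1: parallelize + map (shuffle map tasks write shuffle output)
#   stage 2: reduceByKey + collect (result tasks return data to the driver)
print(counts.collect())

spark.stop()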
Stage Execution: Each stage is divided into tasks, and each task is executed on a different partition of the
data. Independent stages can run in parallel across the cluster, but a stage that depends on the output of
another stage cannot start until that parent stage has completed.
Stage Dependencies: The execution of stages in a Spark job follows a directed acyclic graph (DAG) of
stages. The DAG allows Spark to understand the dependencies between stages and determine the order
of execution. If a stage fails, Spark can retry it without affecting other stages.
Optimizing Stages: Minimizing the number of stages and reducing the amount of data shuffled between
stages are key strategies for optimizing Spark job performance. This can be achieved by carefully
structuring the transformations used in the application.
Monitoring Stages: The Spark UI provides details about each stage's progress, including the number of
tasks, data shuffled, and time taken. This is important for identifying bottlenecks and optimizing the
overall execution.
Tasks in Spark:
Definition of Task: A task is a unit of execution in Spark that corresponds to a single computation on a
partition of an RDD. It is the basic unit of work that is scheduled and executed by Spark executors.
Task Division: When a stage is created, it is divided into tasks, and each task operates on a single
partition of the data. The number of tasks in a stage is equal to the number of partitions in the RDD
being processed.
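A short PySpark sketch of the partition-to-task relationship (the numbers are arbitrary): each partition of the RDD becomes one task in the stage that processes it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TasksDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000), numSlices=8)   # create the RDD with 8 partitions
print(rdd.getNumPartitions())                    # 8 -> a stage over this RDD runs 8 tasks

wider = rdd.repartition(16)                      # more partitions -> more (smaller) tasks
print(wider.getNumPartitions())                  # 16

print(rdd.map(lambda x: x * 2).count())          # action: one task per partition is scheduled

spark.stop()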
Parallel Execution: Tasks within a stage are executed in parallel on different executors in a Spark cluster.
Each executor can run several tasks concurrently (typically one per available core), with each task working
on its own subset of the data.
Task Scheduling: The Spark scheduler assigns tasks to available executors in the cluster. The tasks are
distributed based on the number of partitions in the RDD, and Spark tries to ensure that tasks are
balanced across available nodes for optimal performance.
Task Failures: If a task fails due to node failure or other reasons, Spark automatically retries the task.
Spark can also recompute the data using the RDD lineage, ensuring fault tolerance.
Task Granularity: Tasks are highly granular units of work, typically operating on a single partition of data.
The task performs the transformations (such as map(), filter(), etc.) that were specified in the job.
Task Execution Time: The time taken by a task to complete can be monitored using the Spark UI. If some
tasks take longer than others, it may indicate skewed data distribution or resource bottlenecks.
Task Dependencies: Tasks within a stage are independent of each other, meaning they can be executed in
parallel. However, tasks between stages are dependent, as the output of one stage must be available
before the next stage can start.
OVERALL: In Spark, jobs, stages, and tasks are the building blocks that enable distributed data
processing. A job is the top-level unit of execution, consisting of one or more stages. Each stage is made
up of tasks that are executed in parallel across the cluster. The breakdown into jobs, stages, and tasks is
essential for Spark's performance and scalability, allowing it to process large datasets efficiently while
ensuring fault tolerance and parallel execution.
Resilient Distributed Datasets (RDDs)
The RDD has been the primary user-facing API in Spark since its inception. At its core, an RDD is an
immutable distributed collection of elements of your data, partitioned across the nodes in your
cluster, that can be operated on in parallel with a low-level API offering transformations and
actions.
5 reasons to use RDDs:
You want low-level transformations and actions and control over your dataset;
Your data is unstructured, such as media streams or streams of text;
You want to manipulate your data with functional programming constructs rather than domain-
specific expressions;
You don't care about imposing a schema, such as a columnar format, while processing or
accessing data attributes by name or column; and
You can forgo some of the optimization and performance benefits available with DataFrames
and Datasets for structured and semi-structured data.
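A minimal PySpark sketch of the low-level RDD API applied to unstructured text (the sample lines are made up); the functional transformations are lazy, and the final collect() is the action that runs them.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Unstructured text handled with functional constructs - no schema is imposed.
lines = sc.parallelize(["to be or not to be", "that is the question"])

word_counts = (lines
               .flatMap(lambda line: line.split())  # transformation
               .map(lambda w: (w, 1))               # transformation
               .reduceByKey(lambda a, b: a + b))    # wide transformation (shuffle)

print(word_counts.collect())                        # action: triggers the computation

spark.stop()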
Shared Variables
In Apache Spark, when tasks are distributed across different nodes in a cluster, each task typically gets its
own copy of the variables used in the function. This leads to inefficiencies, especially when large
amounts of data need to be shared among tasks. To solve this, Spark provides shared variables that allow
efficient sharing of data across tasks without the overhead of shipping data with every task. There are
two main types of shared variables: Broadcast Variables and Accumulators. Broadcast variables allow
large read-only data (like lookup tables or machine learning models) to be cached and reused across all
executors instead of being sent with every task. This greatly reduces network traffic and memory usage,
making the program run faster and more efficiently.
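A small PySpark sketch of a broadcast variable (the lookup table here is invented): the dictionary is shipped to each executor once and read by every task.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BroadcastDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Read-only lookup table cached on every executor instead of shipped with each task.
country_lookup = sc.broadcast({"IN": "India", "US": "United States", "DE": "Germany"})

codes = sc.parallelize(["IN", "US", "IN", "DE"])
names = codes.map(lambda c: country_lookup.value.get(c, "Unknown"))

print(names.collect())   # ['India', 'United States', 'India', 'Germany']

spark.stop()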
On the other hand, Accumulators are designed for tasks that need to update a shared variable, usually
for counting or summing information across tasks. Tasks can only add to the accumulator, and only the
driver program can read its final value once all tasks complete. This is very useful for debugging,
monitoring, or keeping track of statistics during a Spark job. Accumulators ensure consistency because
they prevent tasks from reading intermediate values during execution. Together, broadcast variables and
accumulators make Spark applications more powerful and scalable by providing ways to safely and
efficiently share and collect data across the distributed tasks.
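A small PySpark sketch of an accumulator (the records and the parse helper are invented for the example): tasks only add to it, and the driver reads the final value after the action completes.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("AccumulatorDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

bad_records = sc.accumulator(0)   # tasks can only add; only the driver reads the total

def parse(value):
    global bad_records
    try:
        return int(value)
    except ValueError:
        bad_records.add(1)        # counted on the executors
        return 0

data = sc.parallelize(["10", "20", "oops", "30", "??"])
total = data.map(parse).sum()     # action: runs the tasks across the partitions

print("Sum of valid values:", total)           # 60
print("Bad records seen:", bad_records.value)  # 2, read on the driver after the job

spark.stop()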
Overall, shared variables are essential for performance tuning in Spark. They avoid unnecessary data
duplication and help maintain clean communication between the driver and executors. Without shared
variables, Spark would face heavy data transmission overheads, making jobs much slower and more
resource-intensive. Shared variables ensure that large datasets are efficiently distributed and collected,
leading to better cluster utilization and faster job completion. This makes them a core concept every
Spark developer must understand when building large-scale distributed data applications.
Anatomy of a Spark Job Run
When a Spark application runs, it mainly operates with two important entities: Driver and Executors. The
Driver is like the "brain" of the Spark application. It is responsible for maintaining all the information
about the application, responding to the user's program or UI requests, analyzing and distributing tasks
to executors, and combining the results back to the user. The driver hosts the SparkContext, which is the
entry point to the cluster and manages the overall job execution process.
On the other hand, Executors are the "workers" of the Spark system. These are separate processes that
run on cluster nodes and are responsible for executing the individual tasks that make up the application.
Executors also provide in-memory storage for RDDs (Resilient Distributed Datasets) that are cached by
user programs through Spark operations like persist() or cache(). Each application has its own executors,
and they stay alive throughout the lifetime of the application unless they are manually shut down or if
they fail.
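A short PySpark sketch of executor-side caching (the log lines are invented): after the first action computes the filtered RDD, cache() keeps its partitions in executor memory for later actions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

logs = sc.parallelize(["INFO ok", "ERROR disk", "INFO ok", "ERROR net"])
errors = logs.filter(lambda line: line.startswith("ERROR")).cache()  # mark for in-memory storage

print(errors.count())     # first action: computes the RDD and caches its partitions on the executors
print(errors.collect())   # second action: served from the cached in-memory copy

spark.stop()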
Job Submission
A Spark Job is triggered when an action operation is performed on an RDD, DataFrame, or Dataset.
Actions are operations like count(), collect(), saveAsTextFile(), which produce a result or output.
Internally, when an action is called, Spark invokes the runJob() method on the SparkContext. This
method hands over the job to the schedulers for execution. The schedulers run inside the driver and are
responsible for breaking down and organizing the job into smaller units that can be executed efficiently
across a cluster. Hence, job submission is automatic and is tightly integrated into Spark’s lazy evaluation
model, where transformations build a logical plan but actions trigger actual computation.
Schedulers in Spark
Spark has a two-level scheduling system made up of the DAG Scheduler and the Task Scheduler.
The DAG (Directed Acyclic Graph) Scheduler is responsible for dividing a job into smaller sets called
stages. It analyzes the logical execution plan and identifies boundaries where shuffling of data
(movement of data between nodes) occurs. Based on these boundaries, it creates a Directed Acyclic
Graph (DAG) of stages, ensuring that each stage can be processed independently as much as possible.
DAG Scheduler also handles fault tolerance by re-running only the failed stages.
After the DAG is created, the Task Scheduler takes over. It schedules the individual tasks within each
stage to the appropriate executors. The task scheduler considers factors like data locality (whether the
data is available close to the computation) and resource availability while distributing tasks. It manages
retries in case of task failures and speculative task execution for tasks that seem to be running slower
than expected.
DAG Construction in Detail
When a job is submitted, the DAG Scheduler constructs a DAG of stages based on the computation logic
and data shuffles. A stage is a group of tasks that can be executed without any shuffling of data. The DAG
ensures that the computations are split into logical units which can be independently executed and re-
executed in case of failures.
There are two major types of tasks created during DAG construction:
Shuffle Map Tasks: These tasks perform computations on partitions of an RDD and output intermediate
files (shuffled data) that can be fetched by tasks in later stages. These occur in all stages except the final
stage.
Result Tasks: These are the tasks that perform the final computation and return results back to the
driver. Each result task processes one partition and the driver collects the results from all tasks to form
the final output.
The DAG scheduler ensures that child stages are only submitted after their parent stages have been
successfully completed. This dependency management guarantees correctness even in the presence of
task failures.
Task Scheduling
Once stages are ready, the Task Scheduler takes the set of tasks from each stage and maps them to
available executors for execution. It prioritizes data locality during this mapping process to minimize data
transfer overhead, thereby improving performance. The preference order is:
Process-local: When data is available within the same JVM process.
Node-local: When data is on the same machine but a different process.
Rack-local: When data is on a different machine but within the same network rack.
Any/Random: When data locality is not possible, Spark picks any available executor.
After tasks are assigned, executors continuously report their status back to the driver. If a task fails, the
Task Scheduler retries it on a different executor. Moreover, Spark can enable speculative execution — a
feature where slow-running tasks are duplicated on different executors. The first one to finish is
accepted, and the slower duplicates are killed, improving overall performance in cases of straggling
tasks.
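These behaviours are controlled by standard Spark configuration properties such as spark.speculation and spark.locality.wait; the sketch below sets them from PySpark with illustrative values.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("SchedulingConfigDemo")
         .master("local[*]")
         .config("spark.speculation", "true")   # re-launch suspiciously slow (straggling) tasks
         .config("spark.locality.wait", "3s")   # how long to wait for a data-local slot before falling back
         .getOrCreate())

print(spark.conf.get("spark.speculation"))
print(spark.conf.get("spark.locality.wait"))

spark.stop()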
Task Execution
Once a task reaches an executor, the executor first ensures that it has all the required JAR files, libraries,
and other files needed for the task. It caches these dependencies locally for efficiency. After that, the
executor deserializes the received task code — that is, it reconstructs the user's functions and logic from
a serialized (compact binary) format sent by the driver.
After deserialization, the executor executes the task logic on the data partition. If the task is a shuffle
map task, the output is stored locally for the next stage. If it’s a result task, the executor serializes the
result and sends it back to the driver. This whole process happens seamlessly and is repeated for all tasks
until the job finishes successfully.
Simple Summary Flow
User action triggers a job.
SparkContext calls runJob().
DAG Scheduler builds stages and handles dependencies.
Task Scheduler assigns tasks to executors based on data locality.
Executors fetch tasks, load dependencies, run the code, and return results.
Driver collects the results and finishes the job.
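The following minimal PySpark script (with made-up data) exercises this whole flow with a single action; while it runs, the job, its two stages, and the per-partition tasks can be inspected in the Spark UI.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("AnatomyDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Transformations only build the lineage / logical plan; nothing runs yet.
words = sc.parallelize(["a", "b", "a", "c", "b", "a"], numSlices=4)
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)

# The action below makes SparkContext submit the job:
# the DAG scheduler splits it at the reduceByKey shuffle into two stages,
# the task scheduler ships one task per partition to the executors,
# and the driver collects the output of the result tasks.
result = counts.collect()
print(result)

spark.stop()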