Unit – V
Spark: Installing Spark, Spark applications, Jobs, stages and Tasks, Resilient Distributed Datasets, Shared
Variables, Anatomy of a Spark job run.
SPARK
Spark is an Apache Software Foundation project intended to speed up Hadoop's computational
processing. Spark includes its own cluster management, so Hadoop is only one of the ways in which
Spark can be deployed.
Spark can use Hadoop in two ways: for storage and for processing. Since Spark has its own cluster
management, it typically uses Hadoop for storage purposes only.
Apache Spark is a distributed, open-source processing system used for 'big data' workloads. Spark
uses optimized query execution and in-memory caching to run fast queries against data of any size. In
short, it is a general-purpose, fast engine for large-scale data processing.
Spark is versatile because it can be used for many things, such as working with data streams or graphs,
running machine learning algorithms, ingesting data into databases, building data pipelines, executing
distributed SQL, and more.
Installing Spark
To install Apache Spark, first ensure that the Java Development Kit (JDK) is installed on your system
because Spark requires Java to run. You can download and install Java from the official Oracle website or
use terminal commands on Linux/Mac. After installing, verify it by typing java -version in the terminal.
Next, if you plan to use PySpark (Spark with Python), install Python 3.x from python.org and verify it by
typing python --version. After setting up Java and Python, visit the official Spark website
spark.apache.org and download the latest version of Spark. Choose a pre-built version compatible with
Hadoop or standalone, depending on your requirement. Once downloaded, extract the Spark folder
using a tool like WinRAR (for Windows) or the tar -xvzf command (for Linux/Mac).
After extracting, you must set environment variables. On Windows, go to System Properties →
Environment Variables and set SPARK_HOME to your Spark folder path and add %SPARK_HOME%\bin to
the system PATH. On Linux or Mac, open .bashrc or .bash_profile and add export commands for
SPARK_HOME and update the PATH, then run source ~/.bashrc. If you want Spark to work with Hadoop
Distributed File System (HDFS), you also need to install Hadoop, although Spark can run independently in
standalone mode.
To check if Spark has been installed successfully, open Command Prompt or Terminal and type spark-
shell. If the Spark interactive shell starts without any error, it means the installation is successful. If you
wish to use Spark with Python, install PySpark by running the command pip install pyspark. With this,
you will be ready to develop and run Spark applications either locally or on a cluster environment.
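To confirm the setup from Python, a minimal PySpark script such as the one below can be run (the file name check_spark.py is just an example); it assumes PySpark was installed with pip as described above.

# check_spark.py - a minimal sketch to confirm a local PySpark installation works
from pyspark.sql import SparkSession

# Build a local SparkSession; "local[*]" uses all CPU cores on this machine.
spark = SparkSession.builder \
    .appName("InstallCheck") \
    .master("local[*]") \
    .getOrCreate()

print("Spark version:", spark.version)

# Run a tiny job to exercise the installation end to end.
df = spark.range(5)              # DataFrame with a single 'id' column: 0..4
print("Row count:", df.count())

spark.stop()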
IN SIMPLE WORDS
There are a few different ways to install and use Spark. We can install Spark on our machine as a
stand-alone framework, or use Spark VM (virtual machine) images available from vendors such as
MapR, Hortonworks, and Cloudera. We can also use Spark configured and installed in the cloud (such
as Databricks Cloud).
Spark applications
Spark applications are programs written using the Apache Spark framework, which is designed
for large-scale data processing. These applications can perform various tasks on big data sets,
such as data transformation, analysis, machine learning, and graph processing. Here are some
common types of Spark applications used in big data:
Data Processing: Spark can handle and process large datasets quickly, allowing you to filter, aggregate,
join, and transform data in parallel.
Machine Learning: With Spark's MLlib, you can perform scalable machine learning tasks like
classification, regression, clustering, and recommendation.
Graph Processing: Spark GraphX lets you work with graph data, applying algorithms like PageRank or
community detection to analyze relationships and structures.
Real-time Streaming: Spark Streaming processes live data streams (e.g., from Kafka or Twitter) for tasks
like detecting anomalies, analyzing sentiment, or providing real-time insights.
SQL and Analytics: Spark SQL allows you to run SQL queries on large datasets, making it easier to analyze
and process structured data using the DataFrame API (a short sketch follows this list).
ETL Operations: Spark is commonly used for ETL (Extract, Transform, Load) tasks, where data is pulled
from different sources, cleaned, and then stored in data lakes or warehouses for further analysis.
Text and Language Processing: Spark NLP helps with large-scale text processing tasks like sentiment
analysis, text classification, and entity recognition.
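As a brief illustration of the SQL and Analytics point above, here is a minimal PySpark sketch; the employee data and column names are invented for the example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQLExample").master("local[*]").getOrCreate()

# Small in-memory dataset; in practice this would usually be read from CSV/Parquet/HDFS.
data = [("Alice", "Sales", 3000), ("Bob", "Sales", 4000), ("Carol", "IT", 5000)]
df = spark.createDataFrame(data, ["name", "dept", "salary"])

# DataFrame API: average salary per department.
df.groupBy("dept").avg("salary").show()

# The same analysis expressed as SQL over a temporary view.
df.createOrReplaceTempView("employees")
spark.sql("SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept").show()

spark.stop()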
Jobs, stages and Tasks
In the data processing landscape, Apache Spark stands as one of the most popular and efficient
frameworks for big data analytics. Spark's distinguishing feature is its ability to process large datasets
at high speed, thanks to its in-memory computing capabilities. As a programmer working with Spark,
it is important to understand its internal workings, particularly the core concepts of Jobs, Stages, and
Tasks. This section looks at these concepts in detail.
Jobs in Spark:
Definition: A job in Spark represents the entire work that needs to be done on a given dataset and is
created when an action is triggered. Actions in Spark trigger the execution of transformations, causing
the job to be divided into stages and tasks.
Action vs. Transformation: Spark operations are divided into two types:
Transformations: These are operations like map(), filter(), and reduceByKey() that define the data
pipeline but are lazy, meaning they do not execute until an action is called.
Actions: These include operations like count(), collect(), and save(), and they trigger the execution of the
transformations that have been defined earlier.
Job Creation: When an action is invoked, Spark constructs a job. For instance, when you call count() on
an RDD, a job is created to perform this action. If another action like collect() is invoked, it triggers
another job. This means that a single Spark application may generate multiple jobs.
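As a rough PySpark sketch of this behaviour (the data here is made up), the transformations below run nothing by themselves, while each action submits its own job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JobsDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1, 101))           # distribute some sample data
squares = rdd.map(lambda x: x * x)            # transformation: lazy, nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)  # transformation: still lazy

print(evens.count())        # first action  -> Spark submits job 1
print(evens.collect()[:5])  # second action -> Spark submits job 2 (recomputed unless cached)

spark.stop()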
Jobs and Stages: Each job is divided into stages. The stages in a Spark job are determined by the
transformations that involve shuffling of data (e.g., groupByKey(), join()). A stage is a collection of tasks
that can be executed without shuffling data.
Execution Flow: A job consists of one or more stages, and each stage contains tasks that perform a
specific operation on a partition of data. These tasks are executed in parallel across a cluster of
machines.
Fault Tolerance: If a task fails during execution, Spark will reattempt the failed task. This is possible due
to the concept of RDD lineage, which tracks the transformations applied to the RDD and allows Spark to
recompute lost data if necessary.
Job Monitoring: Jobs in Spark can be monitored through the Spark UI, which shows details like the
number of stages, tasks, time taken, and the amount of data shuffled. This is useful for performance
tuning and debugging.
Impact on Performance: The number of jobs and the complexity of their stages can affect the
performance of a Spark application. Efficient job and stage design can help optimize resource utilization
and execution time.
Stages in Spark:
Definition of Stage: A stage is a set of operations in Spark that can be executed without the need for data
shuffling between partitions. Stages are a lower-level abstraction within a job.
Stage Creation: Stages are created when Spark encounters wide transformations (like groupByKey(),
reduceByKey(), or join()), which require data shuffling. Each stage consists of a sequence of narrow
transformations (like map() or filter()) that can be executed independently on each partition.
Narrow vs. Wide Transformations:
Narrow transformations (e.g., map(), filter()) do not require reshuffling of data and can be executed
within a single partition. These transformations belong to the same stage.
Wide transformations (e.g., groupByKey(), join()) require reshuffling data between partitions, which
necessitates the creation of a new stage. These transformations define the boundaries between stages.
Shuffling: The boundary between two stages is drawn whenever Spark needs to shuffle data between
partitions. This can significantly affect performance, as shuffling is an expensive operation that involves
disk I/O and network communication.
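A small PySpark sketch of where a stage boundary appears (the word list is invented): map() is a narrow transformation and stays in the same stage, while reduceByKey() forces a shuffle and therefore starts a new stage.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StagesDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "hadoop", "spark", "hive", "spark"])

pairs = words.map(lambda w: (w, 1))             # narrow: same stage as parallelize
counts = pairs.reduceByKey(lambda a, b: a + b)  # wide: shuffle -> stage boundary here

# The action triggers one job with two stages:
#   stage 1: parallelize + map (shuffle map tasks write shuffle output)
#   stage 2: reduceByKey + collect (result tasks return data to the driver)
print(counts.collect())

spark.stop()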
Stage Execution: Each stage is divided into tasks, and each task is executed on a different partition of the
data. Independent stages can run in parallel across the cluster, but a stage that depends on the output of
another stage cannot start until that parent stage has completed.
Stage Dependencies: The execution of stages in a Spark job follows a directed acyclic graph (DAG) of
stages. The DAG allows Spark to understand the dependencies between stages and determine the order
of execution. If a stage fails, Spark can retry it without affecting other stages.
Optimizing Stages: Minimizing the number of stages and reducing the amount of data shuffled between
stages are key strategies for optimizing Spark job performance. This can be achieved by carefully
structuring the transformations used in the application.
Monitoring Stages: The Spark UI provides details about each stage's progress, including the number of
tasks, data shuffled, and time taken. This is important for identifying bottlenecks and optimizing the
overall execution.
Tasks in Spark:
Definition of Task: A task is a unit of execution in Spark that corresponds to a single computation on a
partition of an RDD. It is the basic unit of work that is scheduled and executed by Spark executors.
Task Division: When a stage is created, it is divided into tasks, and each task operates on a single
partition of the data. The number of tasks in a stage is equal to the number of partitions in the RDD
being processed.
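A short PySpark sketch of the partition-to-task relationship (the numbers are arbitrary): each partition of the RDD becomes one task in the stage that processes it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TasksDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000), numSlices=8)   # create the RDD with 8 partitions
print(rdd.getNumPartitions())                    # 8 -> a stage over this RDD runs 8 tasks

wider = rdd.repartition(16)                      # more partitions -> more (smaller) tasks
print(wider.getNumPartitions())                  # 16

print(rdd.map(lambda x: x * 2).count())          # action: one task per partition is scheduled

spark.stop()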
Parallel Execution: Tasks within a stage are executed in parallel on different executors in a Spark cluster.
Each executor can run several tasks concurrently (typically one per available core), with each task working
on its own subset of the data.
Task Scheduling: The Spark scheduler assigns tasks to available executors in the cluster. The tasks are
distributed based on the number of partitions in the RDD, and Spark tries to ensure that tasks are
balanced across available nodes for optimal performance.
Task Failures: If a task fails due to node failure or other reasons, Spark automatically retries the task.
Spark can also recompute the data using the RDD lineage, ensuring fault tolerance.
Task Granularity: Tasks are highly granular units of work, typically operating on a single partition of data.
The task performs the transformations (such as map(), filter(), etc.) that were specified in the job.
Task Execution Time: The time taken by a task to complete can be monitored using the Spark UI. If some
tasks take longer than others, it may indicate skewed data distribution or resource bottlenecks.
Task Dependencies: Tasks within a stage are independent of each other, meaning they can be executed in
parallel. However, tasks between stages are dependent, as the output of one stage must be available
before the next stage can start.
OVERALL: In Spark, jobs, stages, and tasks are the building blocks that enable distributed data
processing. A job is the top-level unit of execution, consisting of one or more stages. Each stage is made
up of tasks that are executed in parallel across the cluster. The breakdown into jobs, stages, and tasks is
essential for Spark's performance and scalability, allowing it to process large datasets efficiently while
ensuring fault tolerance and parallel execution.
Resilient Distributed Datasets (RDDs)
The RDD has been the primary user-facing API in Spark since its inception. At its core, an RDD is an
immutable distributed collection of elements of your data, partitioned across the nodes in your
cluster, that can be operated on in parallel with a low-level API offering transformations and
actions.
5 reasons to use RDDs:
You want low-level transformations and actions and control over your dataset;
Your data is unstructured, such as media streams or streams of text;
You want to manipulate your data with functional programming constructs rather than domain-
specific expressions;
You don't care about imposing a schema, such as a columnar format, while processing or
accessing data attributes by name or column; and
You can forgo some of the optimization and performance benefits available with DataFrames
and Datasets for structured and semi-structured data.
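A minimal PySpark sketch of the low-level RDD API applied to unstructured text (the sample lines are made up); the functional transformations are lazy, and the final collect() is the action that runs them.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Unstructured text handled with functional constructs - no schema is imposed.
lines = sc.parallelize(["to be or not to be", "that is the question"])

word_counts = (lines
               .flatMap(lambda line: line.split())  # transformation
               .map(lambda w: (w, 1))               # transformation
               .reduceByKey(lambda a, b: a + b))    # wide transformation (shuffle)

print(word_counts.collect())                        # action: triggers the computation

spark.stop()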
Shared Variables
In Apache Spark, when tasks are distributed across different nodes in a cluster, each task typically gets its
own copy of the variables used in the function. This leads to inefficiencies, especially when large
amounts of data need to be shared among tasks. To solve this, Spark provides shared variables that allow
efficient sharing of data across tasks without the overhead of shipping data with every task. There are
two main types of shared variables: Broadcast Variables and Accumulators. Broadcast variables allow
large read-only data (like lookup tables or machine learning models) to be cached and reused across all
executors instead of being sent with every task. This greatly reduces network traffic and memory usage,
making the program run faster and more efficiently.
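A small PySpark sketch of a broadcast variable (the lookup table here is invented): the dictionary is shipped to each executor once and read by every task.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BroadcastDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Read-only lookup table cached on every executor instead of shipped with each task.
country_lookup = sc.broadcast({"IN": "India", "US": "United States", "DE": "Germany"})

codes = sc.parallelize(["IN", "US", "IN", "DE"])
names = codes.map(lambda c: country_lookup.value.get(c, "Unknown"))

print(names.collect())   # ['India', 'United States', 'India', 'Germany']

spark.stop()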
On the other hand, Accumulators are designed for tasks that need to update a shared variable, usually
for counting or summing information across tasks. Tasks can only add to the accumulator, and only the
driver program can read its final value once all tasks complete. This is very useful for debugging,
monitoring, or keeping track of statistics during a Spark job. Accumulators ensure consistency because
they prevent tasks from reading intermediate values during execution. Together, broadcast variables and
accumulators make Spark applications more powerful and scalable by providing ways to safely and
efficiently share and collect data across the distributed tasks.
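A small PySpark sketch of an accumulator (the records and the parse helper are invented for the example): tasks only add to it, and the driver reads the final value after the action completes.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("AccumulatorDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

bad_records = sc.accumulator(0)   # tasks can only add; only the driver reads the total

def parse(value):
    global bad_records
    try:
        return int(value)
    except ValueError:
        bad_records.add(1)        # counted on the executors
        return 0

data = sc.parallelize(["10", "20", "oops", "30", "??"])
total = data.map(parse).sum()     # action: runs the tasks across the partitions

print("Sum of valid values:", total)           # 60
print("Bad records seen:", bad_records.value)  # 2, read on the driver after the job

spark.stop()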
Overall, shared variables are essential for performance tuning in Spark. They avoid unnecessary data
duplication and help maintain clean communication between the driver and executors. Without shared
variables, Spark would face heavy data transmission overheads, making jobs much slower and more
resource-intensive. Shared variables ensure that large datasets are efficiently distributed and collected,
leading to better cluster utilization and faster job completion. This makes them a core concept every
Spark developer must understand when building large-scale distributed data applications.
Anatomy of a Spark Job Run
When a Spark application runs, it mainly operates with two important entities: Driver and Executors. The
Driver is like the "brain" of the Spark application. It is responsible for maintaining all the information
about the application, responding to the user's program or UI requests, analyzing and distributing tasks
to executors, and combining the results back to the user. The driver hosts the SparkContext, which is the
entry point to the cluster and manages the overall job execution process.
On the other hand, Executors are the "workers" of the Spark system. These are separate processes that
run on cluster nodes and are responsible for executing the individual tasks that make up the application.
Executors also provide in-memory storage for RDDs (Resilient Distributed Datasets) that are cached by
user programs through Spark operations like persist() or cache(). Each application has its own executors,
and they stay alive throughout the lifetime of the application unless they are manually shut down or if
they fail.
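A short PySpark sketch of executor-side caching (the log lines are invented): after the first action computes the filtered RDD, cache() keeps its partitions in executor memory for later actions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

logs = sc.parallelize(["INFO ok", "ERROR disk", "INFO ok", "ERROR net"])
errors = logs.filter(lambda line: line.startswith("ERROR")).cache()  # mark for in-memory storage

print(errors.count())     # first action: computes the RDD and caches its partitions on the executors
print(errors.collect())   # second action: served from the cached in-memory copy

spark.stop()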
Job Submission
A Spark Job is triggered when an action operation is performed on an RDD, DataFrame, or Dataset.
Actions are operations like count(), collect(), saveAsTextFile(), which produce a result or output.
Internally, when an action is called, Spark invokes the runJob() method on the SparkContext. This
method hands over the job to the schedulers for execution. The schedulers run inside the driver and are
responsible for breaking down and organizing the job into smaller units that can be executed efficiently
across a cluster. Hence, job submission is automatic and is tightly integrated into Spark’s lazy evaluation
model, where transformations build a logical plan but actions trigger actual computation.
Schedulers in Spark
Spark has a two-level scheduling system made up of the DAG Scheduler and the Task Scheduler.
The DAG (Directed Acyclic Graph) Scheduler is responsible for dividing a job into smaller sets called
stages. It analyzes the logical execution plan and identifies boundaries where shuffling of data
(movement of data between nodes) occurs. Based on these boundaries, it creates a Directed Acyclic
Graph (DAG) of stages, ensuring that each stage can be processed independently as much as possible.
DAG Scheduler also handles fault tolerance by re-running only the failed stages.
After the DAG is created, the Task Scheduler takes over. It schedules the individual tasks within each
stage to the appropriate executors. The task scheduler considers factors like data locality (whether the
data is available close to the computation) and resource availability while distributing tasks. It manages
retries in case of task failures and speculative task execution for tasks that seem to be running slower
than expected.
DAG Construction in Detail
When a job is submitted, the DAG Scheduler constructs a DAG of stages based on the computation logic
and data shuffles. A stage is a group of tasks that can be executed without any shuffling of data. The DAG
ensures that the computations are split into logical units which can be independently executed and re-
executed in case of failures.
There are two major types of tasks created during DAG construction:
Shuffle Map Tasks: These tasks perform computations on partitions of an RDD and output intermediate
files (shuffled data) that can be fetched by tasks in later stages. These occur in all stages except the final
stage.
Result Tasks: These are the tasks that perform the final computation and return results back to the
driver. Each result task processes one partition and the driver collects the results from all tasks to form
the final output.
The DAG scheduler ensures that child stages are only submitted after their parent stages have been
successfully completed. This dependency management guarantees correctness even in the presence of
task failures.
Task Scheduling
Once stages are ready, the Task Scheduler takes the set of tasks from each stage and maps them to
available executors for execution. It prioritizes data locality during this mapping process to minimize data
transfer overhead, thereby improving performance. The preference order is:
Process-local: When data is available within the same JVM process.
Node-local: When data is on the same machine but a different process.
Rack-local: When data is on a different machine but within the same network rack.
Any/Random: When data locality is not possible, Spark picks any available executor.
After tasks are assigned, executors continuously report their status back to the driver. If a task fails, the
Task Scheduler retries it on a different executor. Moreover, Spark can enable speculative execution — a
feature where slow-running tasks are duplicated on different executors. The first one to finish is
accepted, and the slower duplicates are killed, improving overall performance in cases of straggling
tasks.
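These behaviours are controlled by standard Spark configuration properties such as spark.speculation and spark.locality.wait; the sketch below sets them from PySpark with illustrative values.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("SchedulingConfigDemo")
         .master("local[*]")
         .config("spark.speculation", "true")   # re-launch suspiciously slow (straggling) tasks
         .config("spark.locality.wait", "3s")   # how long to wait for a data-local slot before falling back
         .getOrCreate())

print(spark.conf.get("spark.speculation"))
print(spark.conf.get("spark.locality.wait"))

spark.stop()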
Task Execution
Once a task reaches an executor, the executor first ensures that it has all the required JAR files, libraries,
and other files needed for the task. It caches these dependencies locally for efficiency. After that, the
executor deserializes the received task code — that is, it reconstructs the user's functions and logic from
a serialized (compact binary) format sent by the driver.
After deserialization, the executor executes the task logic on the data partition. If the task is a shuffle
map task, the output is stored locally for the next stage. If it’s a result task, the executor serializes the
result and sends it back to the driver. This whole process happens seamlessly and is repeated for all tasks
until the job finishes successfully.
Simple Summary Flow
User action triggers a job.
SparkContext calls runJob().
DAG Scheduler builds stages and handles dependencies.
Task Scheduler assigns tasks to executors based on data locality.
Executors fetch tasks, load dependencies, run the code, and return results.
Driver collects the results and finishes the job.
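The following minimal PySpark script (with made-up data) exercises this whole flow with a single action; while it runs, the job, its two stages, and the per-partition tasks can be inspected in the Spark UI.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("AnatomyDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Transformations only build the lineage / logical plan; nothing runs yet.
words = sc.parallelize(["a", "b", "a", "c", "b", "a"], numSlices=4)
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)

# The action below makes SparkContext submit the job:
# the DAG scheduler splits it at the reduceByKey shuffle into two stages,
# the task scheduler ships one task per partition to the executors,
# and the driver collects the output of the result tasks.
result = counts.collect()
print(result)

spark.stop()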