● The interpreter is the first layer: Spark uses a modified Scala interpreter to interpret the code.
● Spark creates an operator graph when you enter your code in the Spark console.
● When we call an action on a Spark RDD, Spark submits the operator graph to the DAG Scheduler (see the sketch after this list).
● The DAG Scheduler divides the operators into stages of tasks. A stage contains tasks based on the partitions of the input data. The DAG Scheduler pipelines operators together; for example, consecutive map operators are scheduled in a single stage.
● The stages are passed on to the Task Scheduler, which launches tasks through the cluster manager. The Task Scheduler is unaware of the dependencies between stages.
● The workers execute the tasks on the slave nodes.
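To make the flow above concrete, here is a minimal sketch (assuming a local Spark installation and a hypothetical HDFS input path) in which the transformations only build up the operator graph, and the final action is what submits that graph to the DAG Scheduler:

```scala
import org.apache.spark.sql.SparkSession

object OperatorGraphSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("operator-graph-sketch")
      .master("local[*]")                       // illustrative local setup
      .getOrCreate()
    val sc = spark.sparkContext

    // Transformations are lazy: each call only adds a node to the operator graph.
    val lines  = sc.textFile("hdfs:///tmp/input.txt")   // hypothetical path
    val words  = lines.flatMap(_.split("\\s+"))
    val pairs  = words.map(w => (w, 1))
    val counts = pairs.reduceByKey(_ + _)

    // The action submits the operator graph to the DAG Scheduler, which splits
    // it into stages and hands the resulting tasks to the Task Scheduler.
    counts.take(10).foreach(println)

    spark.stop()
  }
}
```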
A DAG is a finite directed graph with no directed cycles: there are finitely many vertices and edges, each edge is directed from one vertex to another, and the vertices can be arranged in a sequence such that every edge points from an earlier vertex to a later one. The DAG model is a strict generalization of the MapReduce model, and DAG-based execution can perform better global optimization than systems like MapReduce. The benefit of the DAG becomes clearer in more complex jobs.
Apache Spark's DAG visualization allows the user to dive into any stage and expand its details. In the stage view, the details of all RDDs belonging to that stage are shown. The scheduler splits the Spark RDD into stages based on the transformations applied. (You can refer to this link to learn about RDD Transformations and Actions in detail.) Each stage is composed of tasks, based on the partitions of the RDD, which perform the same computation in parallel. The "graph" here refers to the navigation between RDDs, and "directed" and "acyclic" refer to how that navigation is done.
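As an illustration of how the scheduler derives stages from the transformations applied, the RDD lineage can be inspected with `toDebugString`; the indentation in its output marks the shuffle boundaries that separate stages. This is only a sketch, reusing the `sc` SparkContext and the hypothetical input path from the earlier example:

```scala
// Narrow transformations (flatMap, map) are pipelined into a single stage;
// the wide transformation (reduceByKey) introduces a shuffle and therefore
// a new stage, with one task per partition in each stage.
val counts = sc.textFile("hdfs:///tmp/input.txt")   // hypothetical path
  .flatMap(_.split("\\s+"))
  .map(w => (w, 1))
  .reduceByKey(_ + _)

// Prints the RDD lineage; indented blocks correspond to separate stages.
println(counts.toDebugString)
```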
Need for DAG
The limitations of Hadoop MapReduce were a key reason for introducing the DAG in Spark. Computation in MapReduce proceeds in three steps:
● The data is read from HDFS.
● Map and Reduce operations are then applied.
● The computed result is written back to HDFS.
Each MapReduce operation is independent of the others, and Hadoop has no knowledge of which MapReduce job will come next. For iterative workloads, it is often unnecessary to read and write the intermediate results between two MapReduce jobs; in such cases, storage in HDFS and disk I/O are wasted.
In a multi-step workflow, each job is blocked from starting until the previous job has completed. As a result, a complex computation can take a long time even on a small data volume.
In Spark, by contrast, a DAG (Directed Acyclic Graph) of consecutive computation stages is formed. In this way, the execution plan is optimized, e.g. to minimize shuffling data around. In MapReduce, such optimization has to be done manually by tuning each MapReduce step.
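For example, Spark can keep a working dataset in memory and reuse it across several jobs, instead of writing intermediate results back to HDFS after every step as MapReduce does. A minimal sketch, assuming the `sc` SparkContext from the earlier example and a hypothetical CSV file whose third column is a numeric rating:

```scala
// Cache the parsed dataset in memory once; later jobs reuse the in-memory
// partitions instead of re-reading from HDFS for each step.
val ratings = sc.textFile("hdfs:///tmp/ratings.csv")   // hypothetical path
  .map(_.split(",")(2).toDouble)
  .cache()

// Two actions over the same cached data: no intermediate result is written
// back to stable storage between them, unlike chained MapReduce jobs.
val count = ratings.count()
val total = ratings.sum()
println(s"mean rating = ${total / count}")
```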