What is Spark?
Apache Spark is an open-source cluster computing framework. Its primary
purpose is to process large-scale data, including data generated in real time.
Spark was built on top of Hadoop MapReduce and is optimized to run in
memory, whereas alternative approaches such as Hadoop's MapReduce write
data to and from disk. As a result, Spark processes data much faster than
these alternatives.
History of Apache Spark
Spark was started by Matei Zaharia at UC Berkeley's AMPLab in 2009 and was
open-sourced in 2010 under a BSD license.
In 2013, the project was donated to the Apache Software Foundation, and in
2014 Spark became a Top-Level Apache Project.
Features of Apache Spark
   •Fast - It provides high performance for both batch and streaming data,
   using a state-of-the-art DAG scheduler, a query optimizer, and a physical
   execution engine.
   •Easy to Use - Applications can be written in Java, Scala, Python, R, and
   SQL. It also provides more than 80 high-level operators.
   •Generality - It provides a collection of libraries including SQL and
   DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.
   •Lightweight - It is a light unified analytics engine used for
   large-scale data processing.
   •Runs Everywhere - It can easily run on Hadoop, Apache Mesos, Kubernetes,
   standalone, or in the cloud.
Uses of Spark
   •Data integration: The data generated by different systems is rarely
   consistent enough to be combined for analysis. To obtain consistent data
   from these systems, we can use processes such as Extract, Transform, and
   Load (ETL). Spark is used to reduce the cost and time required for this
   ETL process.
   •Stream processing: It is always difficult to handle real-time generated
   data such as log files. Spark can operate on streams of data and, for
   example, identify and reject potentially fraudulent operations.
   •Machine learning: Machine learning approaches become more feasible
   and increasingly accurate as the volume of data grows. Because Spark can
   store data in memory and run repeated queries quickly, it is well suited
   to machine learning algorithms.
   •Interactive analytics: Spark is able to respond rapidly, so instead of
   running only pre-defined queries, we can explore the data interactively.
Spark Architecture
Spark follows a master-slave architecture. A Spark cluster consists of a single
master and multiple slaves.
The Spark architecture depends upon two abstractions:
      •Resilient Distributed Dataset (RDD)
      •Directed Acyclic Graph (DAG)
Resilient Distributed Datasets (RDD)
Resilient Distributed Datasets are groups of data items that can be stored
in memory on worker nodes. Here,
      •Resilient: Able to restore data on failure.
      •Distributed: Data is distributed among different nodes.
      •Dataset: A group of data.
We will learn about RDD later in detail.
Directed Acyclic Graph (DAG)
A Directed Acyclic Graph is a finite directed graph that represents a sequence
of computations performed on data. Each node is an RDD partition, and each
edge is a transformation applied on top of the data. The graph describes the
flow of the computation, while "directed" and "acyclic" describe how it
proceeds: edges point in one direction and never form a cycle.
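As a rough illustration, here is a small spark-shell sketch (it assumes the sc
SparkContext provided by the shell; the numbers are made up). The chained
transformations only build up a lineage, which can be inspected with
toDebugString; nothing is computed until the count action is called.

val nums = sc.parallelize(1 to 10)       // base RDD
val doubled = nums.map(_ * 2)            // transformation: recorded in the DAG, not executed
val evens = doubled.filter(_ % 4 == 0)   // another transformation
println(evens.toDebugString)             // prints the lineage (the DAG behind this RDD)
println(evens.count())                   // action: triggers execution of the DAG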
Let's understand the Spark architecture.
Driver Program
The Driver Program is a process that runs the main() function of the application
and creates the SparkContext object. The purpose of SparkContext is to
coordinate the spark applications, running as independent sets of processes on
a cluster.
To run on a cluster, the SparkContext connects to one of several types of
cluster managers and then performs the following tasks (a minimal driver
program is sketched after this list):
     •It acquires executors on nodes in the cluster.
     •Then, it sends your application code to the executors. Here, the
     application code can be defined by JAR or Python files passed to the
     SparkContext.
     •At last, the SparkContext sends tasks to the executors to run.
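Below is a minimal driver-program sketch (not from the Spark documentation);
the application name and the local[*] master URL are illustrative assumptions,
and in a real deployment the master would point at a cluster manager instead.

import org.apache.spark.{SparkConf, SparkContext}

object SimpleDriver {
  def main(args: Array[String]): Unit = {
    // The driver program creates the SparkContext.
    val conf = new SparkConf().setAppName("SimpleDriver").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val data = sc.parallelize(1 to 100)   // data is partitioned across executors
    println(data.sum())                   // action: tasks run on executors, result returns to the driver

    sc.stop()
  }
}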
Cluster Manager
     •The role of the cluster manager is to allocate resources across
     applications. Spark can run on a large number of clusters.
     •There are several types of cluster managers, such as Hadoop YARN,
     Apache Mesos, and the Standalone Scheduler.
       •Here, the Standalone Scheduler is Spark's own cluster manager, which
       makes it easy to install Spark on an empty set of machines.
Worker Node
       •The worker node is a slave node
       •Its role is to run the application code in the cluster.
Executor
       •An executor is a process launched for an application on a worker node.
       •It runs tasks and keeps data in memory or disk storage across them.
       •It reads and writes data to external sources.
       •Every application has its own executors.
Task
       •A unit of work that will be sent to one executor.
Spark Components
The Spark project consists of different types of tightly integrated components.
At its core, Spark is a computational engine that can schedule, distribute and
monitor multiple applications.
Let's understand each Spark component in detail.
Spark Core
   •The Spark Core is the heart of Spark and performs the core functionality.
   •It holds the components for task scheduling, fault recovery, interacting
   with storage systems and memory management.
Spark SQL
   •Spark SQL is built on top of Spark Core. It provides support for
   structured data.
   •It allows data to be queried via SQL (Structured Query Language) as well
   as the Apache Hive variant of SQL, called HQL (Hive Query Language).
   •It supports JDBC and ODBC connections that establish a relation between
   Java objects and existing databases, data warehouses and business
   intelligence tools.
   •It also supports various data sources such as Hive tables, Parquet, and
   JSON (see the sketch after this list).
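For example, a small Spark SQL sketch might look like the following; the
people.json file and its name and age fields are assumptions made for
illustration.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SqlExample").getOrCreate()
val df = spark.read.json("people.json")       // structured data as a DataFrame

df.createOrReplaceTempView("people")          // expose the DataFrame to SQL
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()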
Spark Streaming
   •Spark Streaming is a Spark component that supports scalable and fault-
   tolerant processing of streaming data.
   •It uses Spark Core's fast scheduling capability to perform streaming
   analytics.
   •It accepts data in mini-batches and performs RDD transformations on
   that data.
     •Its design ensures that the applications written for streaming data can
     be reused to analyze batches of historical data with little modification.
      •Log files generated by web servers are a real-world example of a data
      stream (a minimal streaming sketch follows this list).
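The DStream sketch below is only illustrative: the socket source on
localhost:9999 and the 5-second batch interval are assumptions, and sc is the
SparkContext from the shell or driver.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))        // 5-second mini-batches
val lines = ssc.socketTextStream("localhost", 9999)   // assumed text source
val counts = lines.flatMap(_.split(" "))              // RDD-style transformations per batch
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print()

ssc.start()              // start receiving and processing data
ssc.awaitTermination()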
MLlib
     •The MLlib is a Machine Learning library that contains various machine
     learning algorithms.
     •These include correlations and hypothesis testing, classification and
     regression, clustering, and principal component analysis.
      •It is nine times faster than the disk-based implementation used by
      Apache Mahout (a small example follows this list).
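As one small illustration of MLlib's statistics utilities, the sketch below
computes the Pearson correlation between two numeric RDDs; the data values are
made up.

import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val seriesX: RDD[Double] = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0))
val seriesY: RDD[Double] = sc.parallelize(Array(2.0, 4.0, 6.0, 8.0))

val correlation = Statistics.corr(seriesX, seriesY, "pearson")
println(s"Pearson correlation: $correlation")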
GraphX
     •The GraphX is a library that is used to manipulate graphs and perform
     graph-parallel computations.
      •It makes it easy to create a directed graph with arbitrary properties
      attached to each vertex and edge.
      •To manipulate graphs, it supports fundamental operators such as
      subgraph, joinVertices, and aggregateMessages (see the sketch after
      this list).
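The sketch below builds a tiny property graph; the vertex names and edge
labels are illustrative.

import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)                 // directed graph with vertex and edge properties
println(graph.numVertices)                         // 3
println(graph.inDegrees.collect().mkString(", "))  // in-degree of each vertex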
What is RDD?
The RDD (Resilient Distributed Dataset) is Spark's core abstraction. It is a
collection of elements partitioned across the nodes of the cluster so that we
can execute various parallel operations on it.
There are two ways to create RDDs:
      •Parallelizing an existing collection in the driver program
        •Referencing a dataset in an external storage system, such as a shared
        filesystem, HDFS, HBase, or any data source offering a Hadoop
        InputFormat.
  Parallelized Collections
  To create a parallelized collection, call SparkContext's parallelize method on
  an existing collection in the driver program. Each element of the collection is
  copied to form a distributed dataset that can be operated on in parallel.
1. val info = Array(1, 2, 3, 4)
2. val distinfo = sc.parallelize(info)
   Now, we can operate on the distributed dataset (distinfo) in parallel, for
   example distinfo.reduce((a, b) => a + b).
External Datasets
In Spark, distributed datasets can be created from any storage source
supported by Hadoop, such as HDFS, Cassandra, HBase, and even our local file
system. Spark provides support for text files, SequenceFiles, and other
Hadoop InputFormats.
SparkContext's textFile method can be used to create an RDD from a text file.
This method takes a URI for the file (either a local path on the machine or an
hdfs:// URI) and reads the file's data.
We can then operate on the data with dataset operations; for example, we can
add up the sizes of all the lines using the map and reduce operations as follows:
data.map(s => s.length).reduce((a, b) => a + b).
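For instance, the steps above could look like this in spark-shell; the file
name data.txt is illustrative.

val data = sc.textFile("data.txt")                                 // create an RDD from a text file
val totalLength = data.map(s => s.length).reduce((a, b) => a + b)  // sum of all line lengths
println(totalLength)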
RDD Operations
The RDD supports two types of operations:
     •Transformation
     •Action
Transformation
In Spark, the role of a transformation is to create a new dataset from an
existing one. Transformations are considered lazy, as they are only computed
when an action requires a result to be returned to the driver program.
Let's see some of the frequently used RDD Transformations.
   •map(func) - It returns a new distributed dataset formed by passing each
   element of the source through a function func.
   •filter(func) - It returns a new dataset formed by selecting those
   elements of the source on which func returns true.
   •flatMap(func) - It is similar to map, but each input item can be mapped
   to zero or more output items, so func should return a sequence rather
   than a single item.
   •mapPartitions(func) - It is similar to map, but runs separately on each
   partition (block) of the RDD, so func must be of type Iterator<T> =>
   Iterator<U> when running on an RDD of type T.
   •mapPartitionsWithIndex(func) - It is similar to mapPartitions, but also
   provides func with an integer value representing the index of the
   partition, so func must be of type (Int, Iterator<T>) => Iterator<U>
   when running on an RDD of type T.
   •sample(withReplacement, fraction, seed) - It samples a fraction fraction
   of the data, with or without replacement, using a given random number
   generator seed.
   •union(otherDataset) - It returns a new dataset that contains the union
   of the elements in the source dataset and the argument.
   •intersection(otherDataset) - It returns a new RDD that contains the
   intersection of elements in the source dataset and the argument.
   •distinct([numPartitions]) - It returns a new dataset that contains the
   distinct elements of the source dataset.
   •groupByKey([numPartitions]) - When called on a dataset of (K, V) pairs,
   it returns a dataset of (K, Iterable<V>) pairs.
   •reduceByKey(func, [numPartitions]) - When called on a dataset of (K, V)
   pairs, it returns a dataset of (K, V) pairs where the values for each key
   are aggregated using the given reduce function func, which must be of
   type (V, V) => V.
   •aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions]) - When called
   on a dataset of (K, V) pairs, it returns a dataset of (K, U) pairs where
   the values for each key are aggregated using the given combine functions
   and a neutral "zero" value.
   •sortByKey([ascending], [numPartitions]) - It returns a dataset of
   key-value pairs sorted by keys in ascending or descending order, as
   specified in the boolean ascending argument.
   •join(otherDataset, [numPartitions]) - When called on datasets of type
   (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs with all
   pairs of elements for each key. Outer joins are supported through
   leftOuterJoin, rightOuterJoin, and fullOuterJoin.
   •cogroup(otherDataset, [numPartitions]) - When called on datasets of type
   (K, V) and (K, W), it returns a dataset of (K, (Iterable<V>,
   Iterable<W>)) tuples. This operation is also called groupWith.
   •cartesian(otherDataset) - When called on datasets of types T and U, it
   returns a dataset of (T, U) pairs (all pairs of elements).
   •pipe(command, [envVars]) - It pipes each partition of the RDD through a
   shell command, e.g. a Perl or bash script.
   •coalesce(numPartitions) - It decreases the number of partitions in the
   RDD to numPartitions.
   •repartition(numPartitions) - It reshuffles the data in the RDD randomly
   to create either more or fewer partitions and balances it across them.
   •repartitionAndSortWithinPartitions(partitioner) - It repartitions the
   RDD according to the given partitioner and, within each resulting
   partition, sorts records by their keys.
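As a short sketch, several of the transformations above can be chained; the
input lines are made up, and nothing runs until the collect action at the end.

val lines = sc.parallelize(Seq("spark makes big data simple", "spark is fast"))
val frequentWords = lines.flatMap(_.split(" "))    // one line -> many words
                         .map(word => (word, 1))   // word -> (word, 1)
                         .reduceByKey(_ + _)       // aggregate counts per key
                         .filter(_._2 > 1)         // keep words that occur more than once
println(frequentWords.collect().mkString(", "))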
Action
In Spark, the role of action is to return a value to the driver program after
running a computation on the dataset.
Let's see some of the frequently used RDD Actions.
   •reduce(func) - It aggregates the elements of the dataset using a
   function func (which takes two arguments and returns one). The function
   should be commutative and associative so that it can be computed
   correctly in parallel.
   •collect() - It returns all the elements of the dataset as an array at
   the driver program. This is usually useful after a filter or other
   operation that returns a sufficiently small subset of the data.
   •count() - It returns the number of elements in the dataset.
   •first() - It returns the first element of the dataset (similar to
   take(1)).
   •take(n) - It returns an array with the first n elements of the dataset.
   •takeSample(withReplacement, num, [seed]) - It returns an array with a
   random sample of num elements of the dataset, with or without
   replacement, optionally pre-specifying a random number generator seed.
   •takeOrdered(n, [ordering]) - It returns the first n elements of the RDD
   using either their natural order or a custom comparator.
   •saveAsTextFile(path) - It is used to write the elements of the dataset
   as a text file (or set of text files) in a given directory in the local
   filesystem, HDFS or any other Hadoop-supported file system. Spark calls
   toString on each element to convert it to a line of text in the file.
   •saveAsSequenceFile(path) (Java and Scala) - It is used to write the
   elements of the dataset as a Hadoop SequenceFile in a given path in the
   local filesystem, HDFS or any other Hadoop-supported file system.
   •saveAsObjectFile(path) (Java and Scala) - It is used to write the
   elements of the dataset in a simple format using Java serialization,
   which can then be loaded using SparkContext.objectFile().
   •countByKey() - It is only available on RDDs of type (K, V). It returns
   a hashmap of (K, Int) pairs with the count of each key.
   •foreach(func) - It runs a function func on each element of the dataset
   for side effects such as updating an Accumulator or interacting with
   external storage systems.
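A short sketch of a few of these actions on a small, made-up dataset:

val nums = sc.parallelize(List(5, 1, 4, 2, 3))
println(nums.count())           // 5
println(nums.first())           // 5
println(nums.take(3).toList)    // List(5, 1, 4)
println(nums.reduce(_ + _))     // 15
nums.saveAsTextFile("out")      // writes the elements as text files under the "out" directory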
RDD Persistence
Spark provides a convenient way to work on a dataset by persisting it in
memory across operations. When an RDD is persisted, each node stores any
partitions of it that it computes in memory and reuses them in other
operations on that dataset.
We can use either the persist() or the cache() method to mark an RDD to be
persisted. Spark's cache is fault-tolerant: if any partition of an RDD is lost,
it will automatically be recomputed using the transformations that originally
created it.
Different storage levels are available for storing persisted RDDs. Choose a
level by passing a StorageLevel object (Scala, Java, Python) to persist(). The
cache() method uses the default storage level, which is
StorageLevel.MEMORY_ONLY.
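As a small sketch (the file name and filter condition are illustrative),
cache() uses the default MEMORY_ONLY level, while persist() accepts an
explicit StorageLevel:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("logs.txt")
val errors = logs.filter(_.contains("error"))

errors.cache()                                    // same as persist(StorageLevel.MEMORY_ONLY)
// errors.persist(StorageLevel.MEMORY_AND_DISK)   // or choose a level explicitly

println(errors.count())   // first action: computes and caches the partitions
println(errors.count())   // second action: served from the cache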
The following are the set of storage levels:
   •MEMORY_ONLY - It stores the RDD as deserialized Java objects in the
   JVM. This is the default level. If the RDD doesn't fit in memory, some
   partitions will not be cached and will be recomputed each time they're
   needed.
   •MEMORY_AND_DISK - It stores the RDD as deserialized Java objects in the
   JVM. If the RDD doesn't fit in memory, the partitions that don't fit are
   stored on disk and read from there when they're needed.
   •MEMORY_ONLY_SER (Java and Scala) - It stores the RDD as serialized Java
   objects (i.e. one byte array per partition). This is generally more
   space-efficient than deserialized objects.
   •MEMORY_AND_DISK_SER (Java and Scala) - It is similar to
   MEMORY_ONLY_SER, but spills partitions that don't fit in memory to disk
   instead of recomputing them.
   •DISK_ONLY - It stores the RDD partitions only on disk.
   •MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. - Same as the levels above, but
   each partition is replicated on two cluster nodes.
   •OFF_HEAP (experimental) - It is similar to MEMORY_ONLY_SER, but stores
   the data in off-heap memory. Off-heap memory must be enabled.
RDD Shared Variables
In Spark, when a function is passed to a transformation operation, it is
executed on a remote cluster node. The node works on separate copies of all
the variables used in the function. These variables are copied to each
machine, and no updates to the variables on the remote machines are
propagated back to the driver program.
Broadcast variable
Broadcast variables allow a read-only variable to be cached on each machine
rather than shipping a copy of it with tasks. Spark uses efficient broadcast
algorithms to distribute broadcast variables in order to reduce communication
cost.
The execution of spark actions passes through several stages, separated by
distributed "shuffle" operations. Spark automatically broadcasts the common
data required by tasks within each stage. The data broadcasted this way is
cached in serialized form and deserialized before running each task.
To create a broadcast variable (say, v), call SparkContext.broadcast(v). Let's
understand this with an example.
1. scala> val v = sc.broadcast(Array(1, 2, 3))
2. scala> v.value
  Accumulator
  Accumulators are variables that are used to perform associative and
  commutative operations such as counters or sums. Spark provides support
  for accumulators of numeric types, and we can add support for new types.
  To create a numeric accumulator, call SparkContext.longAccumulator() or
  SparkContext.doubleAccumulator() to accumulate the values of Long or Double
  type.
1. scala> val a=sc.longAccumulator("Accumulator")
2. scala> sc.parallelize(Array(2,5)).foreach(x=>a.add(x))
3. scala> a.value
  Spark Map function
  In Spark, the Map passes each element of the source through a function and
  forms a new distributed dataset.
  Example of Map function
  In this example, we add a constant value 10 to each element.
        •To open the spark in Scala mode, follow the below command
1. $ spark-shell
        •Create an RDD using parallelized collection.
1. scala> val data = sc.parallelize(List(10,20,30))
        •Now, we can read the generated result by using the following command.
1. scala> data.collect
        •Apply the map function and pass the expression required to perform.
1. scala> val mapfunc = data.map(x => x+10)
        •Now, we can read the generated result by using the following command.
1. scala> mapfunc.collect
  Here, we got the desired output.
  Spark Filter Function
  In Spark, the Filter function returns a new dataset formed by selecting those
  elements of the source on which the function returns true. So, it retrieves only
  the elements that satisfy the given condition.
  Example of Filter function
  In this example, we filter the given data and retrieve all the values except 35.
        •To open the spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using parallelized collection.
1. scala> val data = sc.parallelize(List(10,20,35,40))
        •Now, we can read the generated result by using the following command.
1. scala> data.collect
        •Apply filter function and pass the expression required to perform.
1. scala> val filterfunc = data.filter(x => x!=35)
        •Now, we can read the generated result by using the following command.
1. scala> filterfunc.collect
  Here, we got the desired output.
  Spark Count Function
  In Spark, the Count function returns the number of elements present in the
  dataset.
  Example of Count function
  In this example, we count the number of elements that exist in the dataset.
        •Create an RDD using parallelized collection.
1. scala> val data = sc.parallelize(List(1,2,3,4,5))
        •Now, we can read the generated result by using the following command.
1. scala> data.collect
        •Apply count() function to count number of elements.
1. scala> val countfunc = data.count()
  Here, we got the desired output.
  Spark Distinct Function
  In Spark, the Distinct function returns the distinct elements from the provided
  dataset.
  Example of Distinct function
  In this example, we ignore the duplicate elements and retrieve only the
  distinct elements.
        •To open the spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using parallelized collection.
1. scala> val data = sc.parallelize(List(10,20,20,40))
        •Now, we can read the generated result by using the following command.
1. scala> data.collect
        •Apply distinct() function to ignore duplicate elements.
1. scala> val distinctfunc = data.distinct()
        •Now, we can read the generated result by using the following command.
1. scala> distinctfunc.collect
  Here, we got the desired output.
  Spark Union Function
  In Spark, Union function returns a new dataset that contains the combination of
  elements present in the different datasets.
  Example of Union function
  In this example, we combine the elements of two datasets.
        •To open the spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using parallelized collection.
1. scala> val data1 = sc.parallelize(List(1,2))
        •Now, we can read the generated result by using the following command.
1. scala> data1.collect
        •Create another RDD using parallelized collection.
1. scala> val data2 = sc.parallelize(List(3,4,5))
        •Now, we can read the generated result by using the following command.
1. scala> data2.collect
        •Apply union() function to return the union of the elements.
1. scala> val unionfunc = data1.union(data2)
        •Now, we can read the generated result by using the following command.
1. scala> unionfunc.collect
  Here, we got the desired output.
  Spark Intersection Function
  In Spark, the Intersection function returns a new dataset that contains the
  intersection of the elements present in two datasets, i.e. only the elements
  common to both. This function behaves just like the INTERSECT query in SQL.
  Example of Intersection function
  In this example, we intersect the elements of two datasets.
        •To open the Spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using the parallelized collection.
1. scala> val data1 = sc.parallelize(List(1,2,3))
        •Now, we can read the generated result by using the following command.
1. scala> data1.collect
        •Create another RDD using parallelized collection.
1. scala> val data2 = sc.parallelize(List(3,4,5))
        •Now, we can read the generated result by using the following command.
1. scala> data2.collect
        •Apply intersection() function to return the intersection of the elements.
1. scala> val intersectfunc = data1.intersection(data2)
        •Now, we can read the generated result by using the following command.
1. scala> intersectfunc.collect
  Here, we got the desired output.
  Spark Cartesian Function
  In Spark, the Cartesian function generates a Cartesian product of two datasets
  and returns all possible combinations of pairs. Here, each element of one
  dataset is paired with each element of the other dataset.
  Example of Cartesian function
  In this example, we generate a Cartesian product of two datasets.
        •To open the Spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using the parallelized collection.
1. scala> val data1 = sc.parallelize(List(1,2,3))
        •Now, we can read the generated result by using the following command.
1. scala> data1.collect
        •Create another RDD using the parallelized collection.
1. scala> val data2 = sc.parallelize(List(3,4,5))
        •Now, we can read the generated result by using the following command.
1. scala> data2.collect
        •Apply cartesian() function to return the Cartesian product of the
        elements.
1. scala> val cartesianfunc = data1.cartesian(data2)
        •Now, we can read the generated result by using the following command.
1. scala> cartesianfunc.collect
  Here, we got the desired output.
  Spark sortByKey Function
  In Spark, the sortByKey function sorts elements by key. It receives key-value
  pairs (K, V) as input, sorts the elements by key in ascending or descending
  order, and generates an ordered dataset.
  Example of sortByKey Function
  In this example, we arrange the elements of dataset in ascending and
  descending order.
        •To open the Spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using the parallelized collection.
1. scala> val data = sc.parallelize(Seq(("C",3),("A",1),("D",4),("B",2),("E",5)))
   Now, we can read the generated result by using the following command.
1. scala> data.collect
  For ascending,
•Apply the sortByKey() function to sort the elements in ascending order.
1. scala> val sortfunc = data.sortByKey()
        •Now, we can read the generated result by using the following command.
1. scala> sortfunc.collect
  Here, we got the desired output.
  For descending,
        •Apply the sortByKey() function and pass false as the parameter to sort
        in descending order.
1. scala> val sortfunc = data.sortByKey(false)
        •Now, we can read the generated result by using the following command.
1. scala> sortfunc.collect
  Here, we got the desired output.
  Spark groupByKey Function
  In Spark, the groupByKey function is a frequently used transformation
  operation that performs shuffling of data. It receives key-value pairs (K, V)
  as input, groups the values based on the key, and generates a dataset of
  (K, Iterable<V>) pairs as an output.
  Example of groupByKey Function
  In this example, we group the values based on the key.
        •To open the Spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using the parallelized collection.
1. scala> val data = sc.parallelize(Seq(("C",3),("A",1),("B",4),("A",2),("B",5)))
   Now, we can read the generated result by using the following command.
1. scala> data.collect
        •Apply groupByKey() function to group the values.
1. scala> val groupfunc = data.groupByKey()
        •Now, we can read the generated result by using the following command.
1. scala> groupfunc.collect
  Here, we got the desired output.
  Spark reduceByKey Function
  In Spark, the reduceByKey function is a frequently used transformation
  operation that performs aggregation of data. It receives key-value pairs (K, V)
  as an input, aggregates the values based on the key and generates a dataset
  of (K, V) pairs as an output.
  Example of reduceByKey Function
  In this example, we aggregate the values on the basis of key.
        •To open the Spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using the parallelized collection.
1. scala> val data = sc.parallelize(Array(("C",3),("A",1),("B",4),("A",2),("B",5)))
   Now, we can read the generated result by using the following command.
1. scala> data.collect
        •Apply reduceByKey() function to aggregate the values.
1. scala> val reducefunc = data.reduceByKey((value, x) => (value + x))
        •Now, we can read the generated result by using the following command.
1. scala> reducefunc.collect
  Here, we got the desired output.
  Spark cogroup Function
  In Spark, the cogroup function operates on two datasets, say (K, V) and
  (K, W), and returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This
  operation is also called groupWith.
  Example of cogroup Function
  In this example, we perform the groupWith operation.
        •To open the Spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using the parallelized collection.
1. scala> val data1 = sc.parallelize(Seq(("A",1),("B",2),("C",3)))
   Now, we can read the generated result by using the following command.
1. scala> data1.collect
       •Create another RDD using the parallelized collection.
1. scala> val data2 = sc.parallelize(Seq(("B",4),("E",5)))
   Now, we can read the generated result by using the following command.
1. scala> data2.collect
        •Apply cogroup() function to group the values.
1. scala> val cogroupfunc = data1.cogroup(data2)
        •Now, we can read the generated result by using the following command.
1. scala> cogroupfunc.collect
  Here, we got the desired output.
  Spark First Function
  In Spark, the First function always returns the first element of the dataset. It is
  similar to take(1).
  Example of First function
  In this example, we retrieve the first element of the dataset.
        •To open the Spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using the parallelized collection.
1. scala> val data = sc.parallelize(List(10,20,30,40,50))
        •Now, we can read the generated result by using the following command.
1. scala> data.collect
        •Apply first() function to retrieve the first element of the dataset.
1. scala> val firstfunc = data.first()
  Here, we got the desired output.
  Spark Take Function
  In Spark, the take function returns an array. It receives an integer value
  (say, n) as a parameter and returns an array of the first n elements of the
  dataset.
  Example of Take function
  In this example, we return the first n elements of an existing dataset.
        •To open the Spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using the parallelized collection.
1. scala> val data = sc.parallelize(List(10,20,30,40,50))
        •Now, we can read the generated result by using the following command.
1. scala> data.collect
        •Apply take() function to return an array of elements.
1. scala> val takefunc = data.take(3)
Here, we got the desired output.
  Spark Word Count Example
  In this Spark word count example, we find the frequency of each word that
  exists in a particular file. Here, we use the Scala language to perform Spark
  operations.
  Steps to execute Spark word count example
  In this example, we find and display the number of occurrences of each word.
        •Create a text file in your local machine and write some text into it.
1. $ nano sparkdata.txt
        •Check the text written in the sparkdata.txt file.
1. $ cat sparkdata.txt
        •Create a directory in HDFS where the text file will be kept.
1. $ hdfs dfs -mkdir /spark
        •Upload the sparkdata.txt file on HDFS in the specific directory.
1. $ hdfs dfs -put /home/codegyani/sparkdata.txt /spark
        •Now, follow the below command to open the spark in Scala mode.
1. $ spark-shell
        •Let's create an RDD by using the following command.
1. scala> val data = sc.textFile("/spark/sparkdata.txt")
   Here, pass the path of the file that contains the data.
        •Now, we can read the generated result by using the following command.
1. scala> data.collect;
        •Here, we split the existing data in the form of individual words by using
        the following command.
1. scala> val splitdata = data.flatMap(line => line.split(" "));
        •Now, we can read the generated result by using the following command.
1. scala> splitdata.collect;
        •Now, perform the map operation.
1. scala> val mapdata = splitdata.map(word => (word,1));
   Here, we are assigning a value 1 to each word.
        •Now, we can read the generated result by using the following command.
1. scala> mapdata.collect;
        •Now, perform the reduce operation
1. scala> val reducedata = mapdata.reduceByKey(_+_);
   Here, we are summing the counts for each word.
        •Now, we can read the generated result by using the following command.
1. scala> reducedata.collect;
  Here, we got the desired output.
  Spark Char Count Example
  In this Spark char count example, we find the frequency of each character
  that exists in a particular file. Here, we use the Scala language to perform
  Spark operations.
  Steps to execute Spark char count example
  In this example, we find and display the number of occurrences of each
  character.
        •Create a text file in your local machine and write some text into it.
1. $ nano sparkdata.txt
        •Check the text written in the sparkdata.txt file.
1. $ cat sparkdata.txt
        •Create a directory in HDFS where the text file will be kept.
1. $ hdfs dfs -mkdir /spark
        •Upload the sparkdata.txt file on HDFS in the specific directory.
1. $ hdfs dfs -put /home/codegyani/sparkdata.txt /spark
        •Now, follow the below command to open the spark in Scala mode.
1. $ spark-shell
        •Let's create an RDD by using the following command.
1. scala> val data = sc.textFile("/spark/sparkdata.txt");
   Here, pass the path of the file that contains the data.
        •Now, we can read the generated result by using the following command.
1. scala> data.collect;
        •Here, we split the existing data into individual characters by using
        the following command.
1. scala> val splitdata = data.flatMap(line => line.split(""));
        •Now, we can read the generated result by using the following command.
1. scala> splitdata.collect;
        •Now, perform the map operation.
1. scala> val mapdata = splitdata.map(word => (word,1));
   Here, we are assigning a value 1 to each character.
        •Now, we can read the generated result by using the following command.
1. scala> mapdata.collect;
        •Now, perform the reduce operation
1. scala> val reducedata = mapdata.reduceByKey(_+_);
   Here, we are summing the counts for each character.
        •Now, we can read the generated result by using the following command.
1. scala> reducedata.collect;
  Here, we got the desired output.