What is Spark?
Apache Spark is an open-source cluster computing framework. Its primary
purpose is to process large-scale data, including data generated in real time.
Spark was built on top of Hadoop MapReduce and is optimized to run in
memory, whereas alternative approaches such as Hadoop's MapReduce write
data to and from disk. As a result, Spark processes data much faster than
these alternatives.
History of Apache Spark
Spark was started by Matei Zaharia at UC Berkeley's AMPLab in 2009 and was
open-sourced in 2010 under a BSD license.
In 2013, the project was donated to the Apache Software Foundation, and in
2014 Spark became a Top-Level Apache Project.
Features of Apache Spark
   •Fast - It provides high performance for both batch and streaming data,
   using a state-of-the-art DAG scheduler, a query optimizer, and a physical
   execution engine.
   •Easy to Use - Applications can be written in Java, Scala, Python, R, and
   SQL. It also provides more than 80 high-level operators.
   •Generality - It provides a collection of libraries including SQL and
   DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.
   •Lightweight - It is a light unified analytics engine used for
   large-scale data processing.
   •Runs Everywhere - It can easily run on Hadoop, Apache Mesos, Kubernetes,
   standalone, or in the cloud.
Uses of Spark
   •Data integration: The data generated by different systems is rarely
   consistent enough to be combined for analysis. To obtain consistent data
   from these systems, we can use processes such as Extract, Transform, and
   Load (ETL). Spark is used to reduce the cost and time required for this
   ETL process.
   •Stream processing: It is always difficult to handle real-time generated
   data such as log files. Spark can operate on streams of data and, for
   example, identify and reject potentially fraudulent operations.
   •Machine learning: Machine learning approaches become more feasible
   and increasingly accurate as the volume of data grows. Because Spark can
   store data in memory and run repeated queries quickly, it is well suited
   to machine learning algorithms.
   •Interactive analytics: Spark is able to respond rapidly, so instead of
   running only pre-defined queries, we can explore the data interactively.
Spark Architecture
Spark follows a master-slave architecture. A Spark cluster consists of a single
master and multiple slaves.
The Spark architecture depends upon two abstractions:
      •Resilient Distributed Dataset (RDD)
      •Directed Acyclic Graph (DAG)
Resilient Distributed Datasets (RDD)
Resilient Distributed Datasets are groups of data items that can be stored
in memory on worker nodes. Here,
      •Resilient: Able to restore data on failure.
      •Distributed: Data is distributed among different nodes.
      •Dataset: A group of data.
We will learn about RDD later in detail.
Directed Acyclic Graph (DAG)
A Directed Acyclic Graph is a finite directed graph that represents a sequence
of computations performed on data. Each node is an RDD partition, and each
edge is a transformation applied on top of the data. The graph describes the
flow of the computation, while "directed" and "acyclic" describe how it
proceeds: edges point in one direction and never form a cycle.
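As a rough illustration, here is a small spark-shell sketch (it assumes the sc
SparkContext provided by the shell; the numbers are made up). The chained
transformations only build up a lineage, which can be inspected with
toDebugString; nothing is computed until the count action is called.

val nums = sc.parallelize(1 to 10)       // base RDD
val doubled = nums.map(_ * 2)            // transformation: recorded in the DAG, not executed
val evens = doubled.filter(_ % 4 == 0)   // another transformation
println(evens.toDebugString)             // prints the lineage (the DAG behind this RDD)
println(evens.count())                   // action: triggers execution of the DAG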
Let's understand the Spark architecture.
Driver Program
The Driver Program is a process that runs the main() function of the application
and creates the SparkContext object. The purpose of SparkContext is to
coordinate the spark applications, running as independent sets of processes on
a cluster.
To run on a cluster, the SparkContext connects to one of several types of
cluster managers and then performs the following tasks (a minimal driver
program is sketched after this list):
     •It acquires executors on nodes in the cluster.
     •Then, it sends your application code to the executors. Here, the
     application code can be defined by JAR or Python files passed to the
     SparkContext.
     •At last, the SparkContext sends tasks to the executors to run.
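Below is a minimal driver-program sketch (not from the Spark documentation);
the application name and the local[*] master URL are illustrative assumptions,
and in a real deployment the master would point at a cluster manager instead.

import org.apache.spark.{SparkConf, SparkContext}

object SimpleDriver {
  def main(args: Array[String]): Unit = {
    // The driver program creates the SparkContext.
    val conf = new SparkConf().setAppName("SimpleDriver").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val data = sc.parallelize(1 to 100)   // data is partitioned across executors
    println(data.sum())                   // action: tasks run on executors, result returns to the driver

    sc.stop()
  }
}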
Cluster Manager
     •The role of the cluster manager is to allocate resources across
     applications. Spark can run on a large number of clusters.
     •There are several types of cluster managers, such as Hadoop YARN,
     Apache Mesos, and the Standalone Scheduler.
       •Here, the Standalone Scheduler is Spark's own cluster manager, which
       makes it easy to install Spark on an empty set of machines.
Worker Node
       •The worker node is a slave node
       •Its role is to run the application code in the cluster.
Executor
       •An executor is a process launched for an application on a worker node.
       •It runs tasks and keeps data in memory or disk storage across them.
       •It reads and writes data to external sources.
       •Every application has its own executors.
Task
       •A unit of work that will be sent to one executor.
Spark Components
The Spark project consists of different types of tightly integrated components.
At its core, Spark is a computational engine that can schedule, distribute and
monitor multiple applications.
Let's understand each Spark component in detail.
Spark Core
   •The Spark Core is the heart of Spark and performs the core functionality.
   •It holds the components for task scheduling, fault recovery, interacting
   with storage systems and memory management.
Spark SQL
   •Spark SQL is built on top of Spark Core. It provides support for
   structured data.
   •It allows data to be queried via SQL (Structured Query Language) as well
   as the Apache Hive variant of SQL, called HQL (Hive Query Language).
   •It supports JDBC and ODBC connections that establish a relation between
   Java objects and existing databases, data warehouses and business
   intelligence tools.
   •It also supports various data sources such as Hive tables, Parquet, and
   JSON (see the sketch after this list).
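For example, a small Spark SQL sketch might look like the following; the
people.json file and its name and age fields are assumptions made for
illustration.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SqlExample").getOrCreate()
val df = spark.read.json("people.json")       // structured data as a DataFrame

df.createOrReplaceTempView("people")          // expose the DataFrame to SQL
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()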
Spark Streaming
   •Spark Streaming is a Spark component that supports scalable and fault-
   tolerant processing of streaming data.
   •It uses Spark Core's fast scheduling capability to perform streaming
   analytics.
   •It accepts data in mini-batches and performs RDD transformations on
   that data.
     •Its design ensures that the applications written for streaming data can
     be reused to analyze batches of historical data with little modification.
      •Log files generated by web servers are a real-world example of a data
      stream (a minimal streaming sketch follows this list).
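The DStream sketch below is only illustrative: the socket source on
localhost:9999 and the 5-second batch interval are assumptions, and sc is the
SparkContext from the shell or driver.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))        // 5-second mini-batches
val lines = ssc.socketTextStream("localhost", 9999)   // assumed text source
val counts = lines.flatMap(_.split(" "))              // RDD-style transformations per batch
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print()

ssc.start()              // start receiving and processing data
ssc.awaitTermination()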
MLlib
     •The MLlib is a Machine Learning library that contains various machine
     learning algorithms.
     •These include correlations and hypothesis testing, classification and
     regression, clustering, and principal component analysis.
      •It is nine times faster than the disk-based implementation used by
      Apache Mahout (a small example follows this list).
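As one small illustration of MLlib's statistics utilities, the sketch below
computes the Pearson correlation between two numeric RDDs; the data values are
made up.

import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val seriesX: RDD[Double] = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0))
val seriesY: RDD[Double] = sc.parallelize(Array(2.0, 4.0, 6.0, 8.0))

val correlation = Statistics.corr(seriesX, seriesY, "pearson")
println(s"Pearson correlation: $correlation")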
GraphX
     •The GraphX is a library that is used to manipulate graphs and perform
     graph-parallel computations.
      •It makes it easy to create a directed graph with arbitrary properties
      attached to each vertex and edge.
      •To manipulate graphs, it supports fundamental operators such as
      subgraph, joinVertices, and aggregateMessages (see the sketch after
      this list).
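The sketch below builds a tiny property graph; the vertex names and edge
labels are illustrative.

import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)                 // directed graph with vertex and edge properties
println(graph.numVertices)                         // 3
println(graph.inDegrees.collect().mkString(", "))  // in-degree of each vertex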
What is RDD?
The RDD (Resilient Distributed Dataset) is Spark's core abstraction. It is a
collection of elements partitioned across the nodes of the cluster so that we
can execute various parallel operations on it.
There are two ways to create RDDs:
      •Parallelizing an existing collection in the driver program
        •Referencing a dataset in an external storage system, such as a shared
        filesystem, HDFS, HBase, or any data source offering a Hadoop
        InputFormat.
  Parallelized Collections
  To create a parallelized collection, call SparkContext's parallelize method on
  an existing collection in the driver program. Each element of the collection is
  copied to form a distributed dataset that can be operated on in parallel.
1. val info = Array(1, 2, 3, 4)
2. val distinfo = sc.parallelize(info)
   Now, we can operate on the distributed dataset (distinfo) in parallel, for
   example distinfo.reduce((a, b) => a + b).
External Datasets
In Spark, distributed datasets can be created from any storage source
supported by Hadoop, such as HDFS, Cassandra, HBase, and even our local file
system. Spark provides support for text files, SequenceFiles, and other
Hadoop InputFormats.
SparkContext's textFile method can be used to create an RDD from a text file.
This method takes a URI for the file (either a local path on the machine or an
hdfs:// URI) and reads the file's data.
We can then operate on the data with dataset operations; for example, we can
add up the sizes of all the lines using the map and reduce operations as follows:
data.map(s => s.length).reduce((a, b) => a + b).
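For instance, the steps above could look like this in spark-shell; the file
name data.txt is illustrative.

val data = sc.textFile("data.txt")                                 // create an RDD from a text file
val totalLength = data.map(s => s.length).reduce((a, b) => a + b)  // sum of all line lengths
println(totalLength)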
RDD Operations
The RDD supports two types of operations:
     •Transformation
     •Action
Transformation
In Spark, the role of a transformation is to create a new dataset from an
existing one. Transformations are considered lazy, as they are only computed
when an action requires a result to be returned to the driver program.
Let's see some of the frequently used RDD Transformations.
   •map(func) - It returns a new distributed dataset formed by passing each
   element of the source through a function func.
   •filter(func) - It returns a new dataset formed by selecting those
   elements of the source on which func returns true.
   •flatMap(func) - It is similar to map, but each input item can be mapped
   to zero or more output items, so func should return a sequence rather
   than a single item.
   •mapPartitions(func) - It is similar to map, but runs separately on each
   partition (block) of the RDD, so func must be of type Iterator<T> =>
   Iterator<U> when running on an RDD of type T.
   •mapPartitionsWithIndex(func) - It is similar to mapPartitions, but also
   provides func with an integer value representing the index of the
   partition, so func must be of type (Int, Iterator<T>) => Iterator<U>
   when running on an RDD of type T.
   •sample(withReplacement, fraction, seed) - It samples a fraction fraction
   of the data, with or without replacement, using a given random number
   generator seed.
   •union(otherDataset) - It returns a new dataset that contains the union
   of the elements in the source dataset and the argument.
   •intersection(otherDataset) - It returns a new RDD that contains the
   intersection of elements in the source dataset and the argument.
   •distinct([numPartitions]) - It returns a new dataset that contains the
   distinct elements of the source dataset.
   •groupByKey([numPartitions]) - When called on a dataset of (K, V) pairs,
   it returns a dataset of (K, Iterable<V>) pairs.
   •reduceByKey(func, [numPartitions]) - When called on a dataset of (K, V)
   pairs, it returns a dataset of (K, V) pairs where the values for each key
   are aggregated using the given reduce function func, which must be of
   type (V, V) => V.
   •aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions]) - When called
   on a dataset of (K, V) pairs, it returns a dataset of (K, U) pairs where
   the values for each key are aggregated using the given combine functions
   and a neutral "zero" value.
   •sortByKey([ascending], [numPartitions]) - It returns a dataset of
   key-value pairs sorted by keys in ascending or descending order, as
   specified in the boolean ascending argument.
   •join(otherDataset, [numPartitions]) - When called on datasets of type
   (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs with all
   pairs of elements for each key. Outer joins are supported through
   leftOuterJoin, rightOuterJoin, and fullOuterJoin.
   •cogroup(otherDataset, [numPartitions]) - When called on datasets of type
   (K, V) and (K, W), it returns a dataset of (K, (Iterable<V>,
   Iterable<W>)) tuples. This operation is also called groupWith.
   •cartesian(otherDataset) - When called on datasets of types T and U, it
   returns a dataset of (T, U) pairs (all pairs of elements).
   •pipe(command, [envVars]) - It pipes each partition of the RDD through a
   shell command, e.g. a Perl or bash script.
   •coalesce(numPartitions) - It decreases the number of partitions in the
   RDD to numPartitions.
   •repartition(numPartitions) - It reshuffles the data in the RDD randomly
   to create either more or fewer partitions and balances it across them.
   •repartitionAndSortWithinPartitions(partitioner) - It repartitions the
   RDD according to the given partitioner and, within each resulting
   partition, sorts records by their keys.
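As a short sketch, several of the transformations above can be chained; the
input lines are made up, and nothing runs until the collect action at the end.

val lines = sc.parallelize(Seq("spark makes big data simple", "spark is fast"))
val frequentWords = lines.flatMap(_.split(" "))    // one line -> many words
                         .map(word => (word, 1))   // word -> (word, 1)
                         .reduceByKey(_ + _)       // aggregate counts per key
                         .filter(_._2 > 1)         // keep words that occur more than once
println(frequentWords.collect().mkString(", "))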
Action
In Spark, the role of action is to return a value to the driver program after
running a computation on the dataset.
Let's see some of the frequently used RDD Actions.
   •reduce(func) - It aggregates the elements of the dataset using a
   function func (which takes two arguments and returns one). The function
   should be commutative and associative so that it can be computed
   correctly in parallel.
   •collect() - It returns all the elements of the dataset as an array at
   the driver program. This is usually useful after a filter or other
   operation that returns a sufficiently small subset of the data.
   •count() - It returns the number of elements in the dataset.
   •first() - It returns the first element of the dataset (similar to
   take(1)).
   •take(n) - It returns an array with the first n elements of the dataset.
   •takeSample(withReplacement, num, [seed]) - It returns an array with a
   random sample of num elements of the dataset, with or without
   replacement, optionally pre-specifying a random number generator seed.
   •takeOrdered(n, [ordering]) - It returns the first n elements of the RDD
   using either their natural order or a custom comparator.
   •saveAsTextFile(path) - It is used to write the elements of the dataset
   as a text file (or set of text files) in a given directory in the local
   filesystem, HDFS or any other Hadoop-supported file system. Spark calls
   toString on each element to convert it to a line of text in the file.
   •saveAsSequenceFile(path) (Java and Scala) - It is used to write the
   elements of the dataset as a Hadoop SequenceFile in a given path in the
   local filesystem, HDFS or any other Hadoop-supported file system.
   •saveAsObjectFile(path) (Java and Scala) - It is used to write the
   elements of the dataset in a simple format using Java serialization,
   which can then be loaded using SparkContext.objectFile().
   •countByKey() - It is only available on RDDs of type (K, V). It returns
   a hashmap of (K, Int) pairs with the count of each key.
   •foreach(func) - It runs a function func on each element of the dataset
   for side effects such as updating an Accumulator or interacting with
   external storage systems.
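A short sketch of a few of these actions on a small, made-up dataset:

val nums = sc.parallelize(List(5, 1, 4, 2, 3))
println(nums.count())           // 5
println(nums.first())           // 5
println(nums.take(3).toList)    // List(5, 1, 4)
println(nums.reduce(_ + _))     // 15
nums.saveAsTextFile("out")      // writes the elements as text files under the "out" directory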
RDD Persistence
Spark provides a convenient way to work on a dataset by persisting it in
memory across operations. When an RDD is persisted, each node stores any
partitions of it that it computes in memory and reuses them in other
operations on that dataset.
We can use either the persist() or the cache() method to mark an RDD to be
persisted. Spark's cache is fault-tolerant: if any partition of an RDD is lost,
it will automatically be recomputed using the transformations that originally
created it.
Different storage levels are available for storing persisted RDDs. Choose a
level by passing a StorageLevel object (Scala, Java, Python) to persist(). The
cache() method uses the default storage level, which is
StorageLevel.MEMORY_ONLY.
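As a small sketch (the file name and filter condition are illustrative),
cache() uses the default MEMORY_ONLY level, while persist() accepts an
explicit StorageLevel:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("logs.txt")
val errors = logs.filter(_.contains("error"))

errors.cache()                                    // same as persist(StorageLevel.MEMORY_ONLY)
// errors.persist(StorageLevel.MEMORY_AND_DISK)   // or choose a level explicitly

println(errors.count())   // first action: computes and caches the partitions
println(errors.count())   // second action: served from the cache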
The following are the set of storage levels:
   •MEMORY_ONLY - It stores the RDD as deserialized Java objects in the
   JVM. This is the default level. If the RDD doesn't fit in memory, some
   partitions will not be cached and will be recomputed each time they're
   needed.
   •MEMORY_AND_DISK - It stores the RDD as deserialized Java objects in the
   JVM. If the RDD doesn't fit in memory, the partitions that don't fit are
   stored on disk and read from there when they're needed.
   •MEMORY_ONLY_SER (Java and Scala) - It stores the RDD as serialized Java
   objects (i.e. one byte array per partition). This is generally more
   space-efficient than deserialized objects.
   •MEMORY_AND_DISK_SER (Java and Scala) - It is similar to
   MEMORY_ONLY_SER, but spills partitions that don't fit in memory to disk
   instead of recomputing them.
   •DISK_ONLY - It stores the RDD partitions only on disk.
   •MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. - Same as the levels above, but
   each partition is replicated on two cluster nodes.
   •OFF_HEAP (experimental) - It is similar to MEMORY_ONLY_SER, but stores
   the data in off-heap memory. Off-heap memory must be enabled.
RDD Shared Variables
In Spark, when a function is passed to a transformation operation, it is
executed on a remote cluster node. The node works on separate copies of all
the variables used in the function. These variables are copied to each
machine, and no updates to the variables on the remote machines are
propagated back to the driver program.
Broadcast variable
Broadcast variables allow a read-only variable to be cached on each machine
rather than shipping a copy of it with tasks. Spark uses efficient broadcast
algorithms to distribute broadcast variables in order to reduce communication
cost.
The execution of spark actions passes through several stages, separated by
distributed "shuffle" operations. Spark automatically broadcasts the common
data required by tasks within each stage. The data broadcasted this way is
cached in serialized form and deserialized before running each task.
To create a broadcast variable (say, v), call SparkContext.broadcast(v). Let's
understand this with an example.
1. scala> val v = sc.broadcast(Array(1, 2, 3))
2. scala> v.value
  Accumulator
  Accumulators are variables that are used to perform associative and
  commutative operations such as counters or sums. Spark provides support
  for accumulators of numeric types, and we can add support for new types.
  To create a numeric accumulator, call SparkContext.longAccumulator() or
  SparkContext.doubleAccumulator() to accumulate the values of Long or Double
  type.
1. scala> val a=sc.longAccumulator("Accumulator")
2. scala> sc.parallelize(Array(2,5)).foreach(x=>a.add(x))
3. scala> a.value
  Spark Map function
  In Spark, the Map passes each element of the source through a function and
  forms a new distributed dataset.
  Example of Map function
  In this example, we add a constant value 10 to each element.
        •To open the spark in Scala mode, follow the below command
1. $ spark-shell
        •Create an RDD using parallelized collection.
1. scala> val data = sc.parallelize(List(10,20,30))
        •Now, we can read the generated result by using the following command.
1. scala> data.collect
        •Apply the map function and pass the expression required to perform.
1. scala> val mapfunc = data.map(x => x+10)
        •Now, we can read the generated result by using the following command.
1. scala> mapfunc.collect
  Here, we got the desired output.
  Spark Filter Function
  In Spark, the Filter function returns a new dataset formed by selecting those
  elements of the source on which the function returns true. So, it retrieves only
  the elements that satisfy the given condition.
  Example of Filter function
  In this example, we filter the given data and retrieve all the values except 35.
        •To open the spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using parallelized collection.
1. scala> val data = sc.parallelize(List(10,20,35,40))
        •Now, we can read the generated result by using the following command.
1. scala> data.collect
        •Apply filter function and pass the expression required to perform.
1. scala> val filterfunc = data.filter(x => x!=35)
        •Now, we can read the generated result by using the following command.
1. scala> filterfunc.collect
  Here, we got the desired output.
  Spark Count Function
  In Spark, the Count function returns the number of elements present in the
  dataset.
  Example of Count function
  In this example, we count the number of elements that exist in the dataset.
        •Create an RDD using parallelized collection.
1. scala> val data = sc.parallelize(List(1,2,3,4,5))
        •Now, we can read the generated result by using the following command.
1. scala> data.collect
        •Apply count() function to count number of elements.
1. scala> val countfunc = data.count()
  Here, we got the desired output.
  Spark Distinct Function
  In Spark, the Distinct function returns the distinct elements from the provided
  dataset.
  Example of Distinct function
  In this example, we ignore the duplicate elements and retrieve only the
  distinct elements.
        •To open the spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using parallelized collection.
1. scala> val data = sc.parallelize(List(10,20,20,40))
        •Now, we can read the generated result by using the following command.
1. scala> data.collect
        •Apply distinct() function to ignore duplicate elements.
1. scala> val distinctfunc = data.distinct()
        •Now, we can read the generated result by using the following command.
1. scala> distinctfunc.collect
  Here, we got the desired output.
  Spark Union Function
  In Spark, Union function returns a new dataset that contains the combination of
  elements present in the different datasets.
  Example of Union function
  In this example, we combine the elements of two datasets.
        •To open the spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using parallelized collection.
1. scala> val data1 = sc.parallelize(List(1,2))
        •Now, we can read the generated result by using the following command.
1. scala> data1.collect
        •Create another RDD using parallelized collection.
1. scala> val data2 = sc.parallelize(List(3,4,5))
        •Now, we can read the generated result by using the following command.
1. scala> data2.collect
        •Apply union() function to return the union of the elements.
1. scala> val unionfunc = data1.union(data2)
        •Now, we can read the generated result by using the following command.
1. scala> unionfunc.collect
  Here, we got the desired output.
  Spark Intersection Function
  In Spark, the Intersection function returns a new dataset that contains the
  intersection of the elements present in two datasets, i.e. only the elements
  common to both. This function behaves just like the INTERSECT query in SQL.
  Example of Intersection function
  In this example, we intersect the elements of two datasets.
        •To open the Spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using the parallelized collection.
1. scala> val data1 = sc.parallelize(List(1,2,3))
        •Now, we can read the generated result by using the following command.
1. scala> data1.collect
        •Create another RDD using parallelized collection.
1. scala> val data2 = sc.parallelize(List(3,4,5))
        •Now, we can read the generated result by using the following command.
1. scala> data2.collect
        •Apply intersection() function to return the intersection of the elements.
1. scala> val intersectfunc = data1.intersection(data2)
        •Now, we can read the generated result by using the following command.
1. scala> intersectfunc.collect
  Here, we got the desired output.
  Spark Cartesian Function
  In Spark, the Cartesian function generates a Cartesian product of two datasets
  and returns all possible combinations of pairs. Here, each element of one
  dataset is paired with each element of the other dataset.
  Example of Cartesian function
  In this example, we generate a Cartesian product of two datasets.
        •To open the Spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using the parallelized collection.
1. scala> val data1 = sc.parallelize(List(1,2,3))
        •Now, we can read the generated result by using the following command.
1. scala> data1.collect
        •Create another RDD using the parallelized collection.
1. scala> val data2 = sc.parallelize(List(3,4,5))
        •Now, we can read the generated result by using the following command.
1. scala> data2.collect
        •Apply cartesian() function to return the Cartesian product of the
        elements.
1. scala> val cartesianfunc = data1.cartesian(data2)
        •Now, we can read the generated result by using the following command.
1. scala> cartesianfunc.collect
  Here, we got the desired output.
  Spark sortByKey Function
  In Spark, the sortByKey function sorts elements by key. It receives key-value
  pairs (K, V) as input, sorts the elements by key in ascending or descending
  order, and generates an ordered dataset.
  Example of sortByKey Function
  In this example, we arrange the elements of dataset in ascending and
  descending order.
        •To open the Spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using the parallelized collection.
1. scala> val data = sc.parallelize(Seq(("C",3),("A",1),("D",4),("B",2),("E",5)))
   Now, we can read the generated result by using the following command.
1. scala> data.collect
  For ascending,
•Apply the sortByKey() function to sort the elements in ascending order.
1. scala> val sortfunc = data.sortByKey()
        •Now, we can read the generated result by using the following command.
1. scala> sortfunc.collect
  Here, we got the desired output.
  For descending,
        •Apply the sortByKey() function and pass false as the parameter to sort
        in descending order.
1. scala> val sortfunc = data.sortByKey(false)
        •Now, we can read the generated result by using the following command.
1. scala> sortfunc.collect
  Here, we got the desired output.
  Spark groupByKey Function
  In Spark, the groupByKey function is a frequently used transformation
  operation that performs shuffling of data. It receives key-value pairs (K, V)
  as input, groups the values based on the key, and generates a dataset of
  (K, Iterable<V>) pairs as an output.
  Example of groupByKey Function
  In this example, we group the values based on the key.
        •To open the Spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using the parallelized collection.
1. scala> val data = sc.parallelize(Seq(("C",3),("A",1),("B",4),("A",2),("B",5)))
   Now, we can read the generated result by using the following command.
1. scala> data.collect
        •Apply groupByKey() function to group the values.
1. scala> val groupfunc = data.groupByKey()
        •Now, we can read the generated result by using the following command.
1. scala> groupfunc.collect
  Here, we got the desired output.
  Spark reduceByKey Function
  In Spark, the reduceByKey function is a frequently used transformation
  operation that performs aggregation of data. It receives key-value pairs (K, V)
  as an input, aggregates the values based on the key and generates a dataset
  of (K, V) pairs as an output.
  Example of reduceByKey Function
  In this example, we aggregate the values on the basis of key.
        •To open the Spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using the parallelized collection.
1. scala> val data = sc.parallelize(Array(("C",3),("A",1),("B",4),("A",2),("B",5)))
   Now, we can read the generated result by using the following command.
1. scala> data.collect
        •Apply reduceByKey() function to aggregate the values.
1. scala> val reducefunc = data.reduceByKey((value, x) => (value + x))
        •Now, we can read the generated result by using the following command.
1. scala> reducefunc.collect
  Here, we got the desired output.
  Spark cogroup Function
  In Spark, the cogroup function operates on two datasets, say (K, V) and
  (K, W), and returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This
  operation is also called groupWith.
  Example of cogroup Function
  In this example, we perform the groupWith operation.
        •To open the Spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using the parallelized collection.
1. scala> val data1 = sc.parallelize(Seq(("A",1),("B",2),("C",3)))
   Now, we can read the generated result by using the following command.
1. scala> data1.collect
       •Create another RDD using the parallelized collection.
1. scala> val data2 = sc.parallelize(Seq(("B",4),("E",5)))
   Now, we can read the generated result by using the following command.
1. scala> data2.collect
        •Apply cogroup() function to group the values.
1. scala> val cogroupfunc = data1.cogroup(data2)
        •Now, we can read the generated result by using the following command.
1. scala> cogroupfunc.collect
  Here, we got the desired output.
  Spark First Function
  In Spark, the First function always returns the first element of the dataset. It is
  similar to take(1).
  Example of First function
  In this example, we retrieve the first element of the dataset.
        •To open the Spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using the parallelized collection.
1. scala> val data = sc.parallelize(List(10,20,30,40,50))
        •Now, we can read the generated result by using the following command.
1. scala> data.collect
        •Apply first() function to retrieve the first element of the dataset.
1. scala> val firstfunc = data.first()
  Here, we got the desired output.
  Spark Take Function
  In Spark, the take function returns an array. It receives an integer value
  (say, n) as a parameter and returns an array of the first n elements of the
  dataset.
  Example of Take function
  In this example, we return the first n elements of an existing dataset.
        •To open the Spark in Scala mode, follow the below command.
1. $ spark-shell
        •Create an RDD using the parallelized collection.
1. scala> val data = sc.parallelize(List(10,20,30,40,50))
        •Now, we can read the generated result by using the following command.
1. scala> data.collect
        •Apply take() function to return an array of elements.
1. scala> val takefunc = data.take(3)
Here, we got the desired output.
  Spark Word Count Example
  In this Spark word count example, we find the frequency of each word that
  exists in a particular file. Here, we use the Scala language to perform Spark
  operations.
  Steps to execute Spark word count example
  In this example, we find and display the number of occurrences of each word.
        •Create a text file in your local machine and write some text into it.
1. $ nano sparkdata.txt
        •Check the text written in the sparkdata.txt file.
1. $ cat sparkdata.txt
        •Create a directory in HDFS where the text file will be kept.
1. $ hdfs dfs -mkdir /spark
        •Upload the sparkdata.txt file on HDFS in the specific directory.
1. $ hdfs dfs -put /home/codegyani/sparkdata.txt /spark
        •Now, follow the below command to open the spark in Scala mode.
1. $ spark-shell
        •Let's create an RDD by using the following command.
1. scala> val data = sc.textFile("/spark/sparkdata.txt")
   Here, pass the path of the file that contains the data.
        •Now, we can read the generated result by using the following command.
1. scala> data.collect;
        •Here, we split the existing data in the form of individual words by using
        the following command.
1. scala> val splitdata = data.flatMap(line => line.split(" "));
        •Now, we can read the generated result by using the following command.
1. scala> splitdata.collect;
        •Now, perform the map operation.
1. scala> val mapdata = splitdata.map(word => (word,1));
   Here, we are assigning a value 1 to each word.
        •Now, we can read the generated result by using the following command.
1. scala> mapdata.collect;
        •Now, perform the reduce operation
1. scala> val reducedata = mapdata.reduceByKey(_+_);
   Here, we are summing the counts for each word.
        •Now, we can read the generated result by using the following command.
1. scala> reducedata.collect;
  Here, we got the desired output.
  Spark Char Count Example
  In this Spark char count example, we find the frequency of each character
  that exists in a particular file. Here, we use the Scala language to perform
  Spark operations.
  Steps to execute Spark char count example
  In this example, we find and display the number of occurrences of each
  character.
        •Create a text file in your local machine and write some text into it.
1. $ nano sparkdata.txt
        •Check the text written in the sparkdata.txt file.
1. $ cat sparkdata.txt
        •Create a directory in HDFS where the text file will be kept.
1. $ hdfs dfs -mkdir /spark
        •Upload the sparkdata.txt file on HDFS in the specific directory.
1. $ hdfs dfs -put /home/codegyani/sparkdata.txt /spark
        •Now, follow the below command to open the spark in Scala mode.
1. $ spark-shell
        •Let's create an RDD by using the following command.
1. scala> val data = sc.textFile("/spark/sparkdata.txt");
   Here, pass the path of the file that contains the data.
        •Now, we can read the generated result by using the following command.
1. scala> data.collect;
        •Here, we split the existing data into individual characters by using
        the following command.
1. scala> val splitdata = data.flatMap(line => line.split(""));
        •Now, we can read the generated result by using the following command.
1. scala> splitdata.collect;
        •Now, perform the map operation.
1. scala> val mapdata = splitdata.map(word => (word,1));
   Here, we are assigning a value 1 to each character.
        •Now, we can read the generated result by using the following command.
1. scala> mapdata.collect;
        •Now, perform the reduce operation
1. scala> val reducedata = mapdata.reduceByKey(_+_);
   Here, we are summing the counts for each character.
        •Now, we can read the generated result by using the following command.
1. scala> reducedata.collect;
  Here, we got the desired output.