Assignment 3
1. Which abstraction in Apache Spark allows for parallel execution and
      distributed data processing?
         a. DataFrame
         b. RDD (Resilient Distributed Dataset)
         c. Dataset
         d. Spark SQL
Ans- b) RDD (Resilient Distributed Dataset)
Option a: DataFrame - Incorrect. DataFrames provide a higher-level API
for structured data, but they are not the fundamental abstraction for parallel
execution.
Option b: RDD (Resilient Distributed Dataset) - Correct. RDDs are the
fundamental abstraction in Spark for distributed, fault-tolerant, and parallel
processing of large datasets.
Option c: Dataset - Incorrect. Datasets are a more recent addition to
Spark, combining the benefits of RDDs and DataFrames. However, RDDs
are the core concept for parallel execution.
Option d: Spark SQL - Incorrect. Spark SQL is a SQL engine built on top
of Spark, providing SQL-like capabilities. While it uses RDDs internally, it's
not the direct abstraction for parallel execution.
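A minimal PySpark sketch of this parallelism, assuming a running
SparkContext named sc (as in the pyspark shell): the collection is split
into partitions, and each partition is processed in parallel across the
cluster.

    # Distribute a local collection across the cluster as an RDD with 8 partitions
    rdd = sc.parallelize(range(1, 1001), numSlices=8)
    # Transformations run per partition, in parallel on the executors
    squared = rdd.map(lambda x: x * x)
    # Actions aggregate the partition results back to the driver
    total = squared.reduce(lambda a, b: a + b)
    print(total)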
2. What component resides on top of Spark Core?
      A) Spark Streaming
      B) Spark SQL
      C) RDDs
      D) None of the above
Ans - B) Spark SQL
Option A: Spark Streaming - Incorrect. Spark Streaming also builds on top
of Spark Core to provide stream-processing capabilities, but it is not the
answer here; Spark SQL is the component most directly tied to structured
data processing.
Option B: Spark SQL - Correct. Spark SQL is a component that resides
on top of Spark Core. It provides a higher-level API for querying structured
data using SQL syntax and integrates with Spark’s core functionalities
through DataFrames and Datasets.
Option C: RDDs - Incorrect. RDDs are the fundamental abstraction in
Spark Core for parallel execution and distributed processing. They are not
a component that resides on top of Spark Core but rather part of the core
abstraction.
Option D: None of the above - Incorrect. Spark SQL is indeed a
component that resides on top of Spark Core, making this option incorrect.
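As a small illustration of this layering (a minimal sketch; the table name
people is made up): Spark SQL parses and plans the SQL text, while the work
is still executed by Spark Core.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-on-core").getOrCreate()
    df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
    df.createOrReplaceTempView("people")  # expose the DataFrame to the SQL engine
    # Spark SQL plans the query; Spark Core executes the resulting tasks
    spark.sql("SELECT name FROM people WHERE age > 40").show()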
3. Which statements about Cassandra and its snitches are correct?
      Statement 1: In Cassandra, during a write operation, when hinted
handoff is enabled and a replica is down, the coordinator writes to all
other replicas and keeps the write locally until the down replica comes back
up.
      Statement 2: In Cassandra, Ec2Snitch is a simple snitch for Amazon EC2
deployments where all nodes are in a single region. In Ec2Snitch, the region
name refers to the data center, and the availability zone refers to the rack
in the cluster.
     A) Only Statement 1 is correct.
     B) Only Statement 2 is correct.
     C) Both Statement 1 and Statement 2 are correct.
     D) Neither Statement 1 nor Statement 2 is correct.
Ans- C) Both Statement 1 and Statement 2 are correct.
Statement 1: Correct. In Cassandra, when hinted handoff is enabled, if
any replica is down during a write operation, the coordinator writes to all
other available replicas and keeps a hint for the down replica. Once the
down replica comes back online, the coordinator will hand off the hinted
write to that replica.
Statement 2: Correct. Ec2Snitch is a snitch used in Cassandra for
Amazon EC2 deployments. It assumes that all nodes are within a single
region. In Ec2Snitch, the term "region" corresponds to the data center, and
"availability zone" corresponds to the rack within the cluster, helping to
optimize data placement and replication.
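Both behaviors are driven by server configuration. A hedged excerpt from
cassandra.yaml (key names as in Cassandra 3.x; defaults vary by version):

    # Coordinator stores hints for replicas that are down during a write
    hinted_handoff_enabled: true
    # How long hints are retained before a full repair is required instead
    max_hint_window_in_ms: 10800000  # 3 hours
    # Ec2Snitch maps the EC2 region to the data center and the
    # availability zone to the rack
    endpoint_snitch: Ec2Snitch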
4. Which of the following is a module for structured data processing?
   a. GraphX
   b. MLlib
   c. Spark SQL
   d. SparkR
Ans- c) Spark SQL
Option a: GraphX - Incorrect. This module is for graph processing and
analytics, allowing for the manipulation and analysis of graph data
structures.
Option b: MLlib - Incorrect. This is Spark's machine learning library,
providing algorithms and utilities for machine learning tasks, not specifically
for structured data processing.
Option c: Spark SQL - Correct. Spark SQL is the module designed
specifically for structured data processing. It provides a programming
interface for working with structured and semi-structured data, allows
querying data via SQL, integrates with DataFrames and Datasets, and
provides optimizations through the Catalyst optimizer and the Tungsten
execution engine.
Option d: SparkR - Incorrect. This module provides support for using R
with Spark, primarily aimed at statistical computing and data analysis rather
than structured data processing specifically.
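A brief sketch of structured processing with Spark SQL's DataFrame API (the
path and column names here are hypothetical); explain() prints the physical
plan produced by the Catalyst optimizer:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("structured").getOrCreate()
    # Read semi-structured JSON into a DataFrame with an inferred schema
    df = spark.read.json("/data/events.json")  # hypothetical path
    result = df.filter(F.col("status") == "ok").groupBy("user").count()
    result.explain()  # shows the plan chosen by the Catalyst optimizer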
5. A healthcare provider wants to store and query patient records in a
NoSQL database with high write throughput and low-latency access. Which
Hadoop ecosystem technology is most suitable for this requirement?
A) Apache Hadoop
B) Apache Spark
C) Apache HBase
D) Apache Pig
Ans- C) Apache HBase
Option A: Apache Hadoop - Incorrect. Apache Hadoop is a framework
for distributed storage and processing of large data sets using the Hadoop
Distributed File System (HDFS) and MapReduce. It is not specifically
optimized for low-latency access or high write throughput.
Option B: Apache Spark - Incorrect. Apache Spark is a fast, in-memory
data processing engine that can handle large-scale data analytics and
processing. While it offers low-latency data processing, it is not a NoSQL
database and is not designed primarily for high write throughput.
Option C: Apache HBase - Correct. Apache HBase is a distributed,
scalable, NoSQL database that runs on top of HDFS. It is designed for high
write throughput and low-latency access to large volumes of data, making it
suitable for storing and querying patient records efficiently.
Option D: Apache Pig - Incorrect. Apache Pig is a high-level platform for
creating MapReduce programs used with Hadoop. It is primarily used for
data transformation and analysis, not for high write throughput or
low-latency NoSQL data storage.
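A minimal write/read sketch using the third-party happybase Python client,
assuming an HBase Thrift server on localhost and a pre-created patients
table with a column family named info (both names are made up):

    import happybase

    conn = happybase.Connection("localhost")  # connects to the HBase Thrift server
    table = conn.table("patients")
    # Writes are row-keyed and low-latency, which supports high write throughput
    table.put(b"patient:1001", {b"info:name": b"Jane Doe", b"info:dob": b"1980-05-12"})
    # Point reads by row key are likewise low-latency
    row = table.row(b"patient:1001")
    print(row[b"info:name"])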
6. The primary Machine Learning API for Spark is now the _____-based API.
   a. DataFrame
   b. Dataset
   c. RDD
   d. All of the above
Ans- a) DataFrame
Option A: DataFrame - Correct. The primary Machine Learning API for Spark
is now based on DataFrames. Spark's MLlib has adopted DataFrames as the
primary API for building and training machine learning models. This
provides a higher-level API and better integration with Spark SQL, offering
optimized performance and ease of use.
Option B: Dataset - Incorrect. While Datasets are a powerful API in Spark
that provides type safety and functional programming constructs, the
primary Machine Learning API is based on DataFrames, not Datasets.
Option C: RDD - Incorrect. RDDs (Resilient Distributed Datasets) were the
original abstraction in Spark and were used in earlier versions of MLlib,
but the primary Machine Learning API has shifted to DataFrames for better
integration and performance.
Option D: All of the above - Incorrect. While RDDs, DataFrames, and
Datasets are all important abstractions in Spark, the primary Machine
Learning API is now specifically based on DataFrames.
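A minimal sketch of the DataFrame-based API (pyspark.ml), following the
pattern in the Spark documentation: training data lives in a DataFrame with
label and features columns, and the estimator fits directly on it.

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("mllib-df").getOrCreate()
    # Training data as a DataFrame with 'label' and 'features' columns
    train = spark.createDataFrame(
        [(1.0, Vectors.dense(0.0, 1.1)), (0.0, Vectors.dense(2.0, 1.0))],
        ["label", "features"])
    lr = LogisticRegression(maxIter=10)
    model = lr.fit(train)  # fit directly on the DataFrame
    model.transform(train).select("label", "prediction").show()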
7. How does Apache Spark's performance compare to Hadoop
MapReduce?
     a) Apache Spark is up to 10 times faster in memory and up to 100
     times faster on disk.
     b) Apache Spark is up to 100 times faster in memory and up to 10
     times faster on disk.
     c) Apache Spark is up to 10 times faster both in memory and on disk
     compared to Hadoop MapReduce.
     d) Apache Spark is up to 100 times faster both in memory and on
     disk compared to Hadoop MapReduce.
Ans- b) Apache Spark is up to 100 times faster in memory and up to 10
times faster on disk.
Option a: Incorrect. This option reverses the figures: Spark is up to 100
times faster in memory and up to 10 times faster on disk, not the other way
around.
Option b: Correct. Apache Spark is known for its significant performance
improvements over Hadoop MapReduce. It can be up to 100 times faster when
processing data in memory, thanks to its in-memory computation
capabilities, and up to 10 times faster on disk because of its efficient
execution and reduced disk I/O.
Option c: Incorrect. Spark is not just 10 times faster both in memory and
on disk; it achieves up to 100 times faster performance in memory and up to
10 times faster on disk.
Option d: Incorrect. While Spark can be up to 100 times faster in memory,
it is not characterized as being up to 100 times faster on disk; the usual
figure is up to 10 times faster on disk.
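The in-memory advantage comes from avoiding repeated disk reads. A small
sketch, assuming an active SparkSession named spark and a hypothetical
Parquet dataset:

    # Cache the dataset in executor memory so repeated actions avoid disk I/O
    df = spark.read.parquet("/data/ratings.parquet")  # hypothetical path
    df.cache()
    df.count()                       # first action materializes the cache
    df.filter("rating > 4").count()  # later passes read from memory, not disk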
8. Which action in Apache Spark triggers the execution of all previously
defined transformations in the DAG and returns the count of elements in the
resulting RDD or DataFrame?
   a. collect()
   b. count()
   c. take()
   d. first()
Ans- b) count()
Option a: collect() - Incorrect. The collect() action triggers the
execution of all previously defined transformations and retrieves every
element of the RDD or DataFrame to the driver program. It returns the
complete dataset, not the count of elements.
Option b: count() - Correct. The count() action triggers the execution of
all previously defined transformations in the DAG and returns the number
of elements in the resulting RDD or DataFrame. It is specifically designed
to return the count of elements.
Option c: take() - Incorrect. The take() action triggers execution and
retrieves a specified number of elements from the RDD or DataFrame, but
it does not return the count of all elements.
Option d: first() - Incorrect. The first() action triggers execution and
retrieves the first element of the RDD or DataFrame, but it does not return
the count of elements.
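A minimal sketch, assuming a SparkContext named sc, showing that
transformations stay lazy until count() runs the DAG:

    rdd = sc.parallelize(range(100))
    evens = rdd.filter(lambda x: x % 2 == 0)  # lazy: nothing has executed yet
    n = evens.count()  # action: runs the whole DAG and returns 50
    print(n)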
9. What is Apache Spark Streaming primarily used for?
   a. Real-time processing of streaming data
   b. Batch processing of static datasets
   c. Machine learning model training
   d. Graph processing
Ans- a) Real-time processing of streaming data
Option a: Real-time processing of streaming data - Correct. Spark
Streaming is designed for processing continuous streams of data in real
time.
Option b: Batch processing of static datasets - Incorrect. Batch
processing of static datasets is handled by Spark's core APIs, not Spark
Streaming.
Option c: Machine learning model training - Incorrect. While Spark can be
used for machine learning (via MLlib), Spark Streaming is specifically for
streaming data.
Option d: Graph processing - Incorrect. Graph processing is handled by
GraphX; Spark Streaming is focused on streaming data.
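The classic DStream word count, adapted from the Spark Streaming
programming guide; it assumes text lines arriving on a local socket (e.g.
started with nc -lk 9999):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "NetworkWordCount")
    ssc = StreamingContext(sc, 1)  # 1-second batch interval
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()  # print each micro-batch's word counts
    ssc.start()
    ssc.awaitTermination()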
10. Which of the following represents the smallest unit of data processed
by Apache Spark Streaming?
   a. Batch
   b. Window
   c. Micro-batch
   d. Record
Ans- c) Micro-batch
Option a: Batch - Incorrect. "Batch" refers to processing a complete,
static dataset at once; it is not the unit of work in Spark Streaming.
Option b: Window - Incorrect. A window spans a time interval covering
several micro-batches, so it is larger than a micro-batch, not smaller.
Option c: Micro-batch - Correct. Spark Streaming divides the incoming
stream into micro-batches, which are the smallest unit of data the engine
schedules and processes.
Option d: Record - Incorrect. A record is a single data item within a
micro-batch, but Spark Streaming processes data at the granularity of
micro-batches rather than individual records.
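A short fragment (reusing sc and the imports from the previous sketch)
showing how the batch interval fixes the micro-batch size, while a window
spans several micro-batches:

    # The StreamingContext's batch interval (here 1 second) defines the micro-batch
    ssc = StreamingContext(sc, 1)
    lines = ssc.socketTextStream("localhost", 9999)
    # A 30-second window sliding every 10 seconds groups many 1-second micro-batches
    windowed = lines.window(30, 10)
    windowed.count().pprint()  # one count per window of micro-batches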