QB BDA Solution
Ans :
Big data is a term used in Big Data Analytics (BDA) to describe datasets that are so large and complex that traditional data processing techniques are not sufficient to process and analyze them. These datasets can be generated from a variety of sources, including social media, Internet of Things (IoT) devices, sensors, and other sources.
Big data is typically characterized by three main characteristics, often referred to as the three Vs: volume, velocity, and variety.
1) Volume: Refers to the vast amount of data generated every day. Big data sets can range from terabytes to petabytes and beyond, and can contain millions or even billions of records, which is too large to be managed and analyzed using traditional database management systems. The volume of data is usually measured by the size of the dataset, typically in petabytes, exabytes, or zettabytes. Traditional data processing tools may struggle to handle this amount of data, but big data technologies such as Hadoop and Spark are designed to handle it.
2) Velocity: Refers to the speed at which data is generated and processed. Big data is often
generated in real-time or near real-time, making it difficult to manage and process using
traditional systems. Streaming data, social media data, and IoT data are examples of data
that is generated at high velocity.
3) Variety: Refers to the different types of data that exist, including structured, semi-
structured, and unstructured data. Structured data is typically found in databases and is
easily searchable and analyzed using traditional tools. Semi-structured data is often found in
XML, JSON, or other formats that have some structure but are not as easily searchable as
structured data. Unstructured data includes things like text, images, audio, and video files
that have no inherent structure and require specialized tools to search and analyze.
In addition to the three Vs, big data is often described by two further characteristics:
4) Veracity: Refers to the accuracy and quality of the data. Big data often includes data from a wide range of sources, and it can be difficult to ensure that the data is accurate and trustworthy enough to be used for making informed decisions.
5) Value: Refers to the potential value that can be extracted from big data. While big data can
be overwhelming, it also offers tremendous opportunities to gain insights and make better
decisions. By using advanced analytics tools, organizations can extract valuable insights
from big data that can help them improve their business processes, products, and services.
Q2 ) What are the benefits of Big Data? Discuss challenges under Big Data. How Big Data Analytics can be useful in the development of smart cities. (Discuss one application.)
ANS :
There are many benefits to utilizing big data in business and other fields. Here are some of the key
benefits:
1) Better decision-making: Big data provides insights that can help decision-makers make
more informed decisions. By analyzing large datasets, organizations can identify patterns
and trends that might not be apparent with smaller datasets, and make data-driven
decisions based on these insights.
2) Improved operational efficiency: Big data analytics can help organizations optimize their
operations by identifying areas where they can improve efficiency, reduce waste, and lower
costs. For example, analyzing sensor data from manufacturing equipment can help identify
potential maintenance issues before they become costly problems.
3) Enhanced customer experience: Big data analytics can help organizations understand their
customers better and provide them with more personalized experiences. By analyzing
customer data, organizations can identify trends in customer behavior, preferences, and
needs, and tailor their products and services accordingly.
4) Increased revenue: Big data analytics can help organizations identify new revenue
opportunities and improve their sales and marketing efforts. For example, analyzing
customer data can help identify cross-selling and upselling opportunities, and improve
customer retention rates.
5) Improved risk management: Big data analytics can help organizations identify potential
risks and mitigate them before they become major problems. For example, analyzing
financial data can help identify potential fraud or other financial irregularities.
Overall, big data offers tremendous opportunities to gain insights and make better
decisions, improve efficiency, and drive innovation and growth in organizations. By
leveraging big data analytics, organizations can stay ahead of the competition and make
data-driven decisions that lead to better outcomes.
While big data offers many benefits, there are also several challenges associated with
managing and analyzing large datasets. Here are some of the key challenges:
1) Data quality: With so much data being generated from a variety of sources, ensuring the
accuracy and quality of the data can be a challenge. Poor quality data can lead to inaccurate
insights and decisions, so it's important to ensure that the data being used is accurate and
reliable.
2) Data integration: Big data often comes from a variety of sources and in different formats,
making it difficult to integrate and analyze. Organizations need to be able to bring together
data from different sources and make it usable for analysis.
3) Data security: Big data can contain sensitive information that needs to be protected.
Organizations need to ensure that the data is secure and that appropriate security measures
are in place to prevent unauthorized access.
4) Lack of skilled personnel: Managing and analyzing big data requires specialized skills and
expertise that may not be readily available. Organizations may need to invest in training or
hire new personnel with the necessary skills to manage and analyze big data effectively.
5) Infrastructure: Storing and processing large datasets requires significant computing power
and storage capacity. Organizations need to have the appropriate infrastructure in place to
manage and analyze big data effectively.
6) Cost: Managing and analyzing big data can be expensive. Organizations need to invest in
the appropriate technology, personnel, and infrastructure, which can be costly.
How Big Data Analytics can be useful in the development of smart cities
Big data analytics can be a valuable tool for developing and managing smart cities. One specific
application of big data analytics in smart cities is in the management of traffic and transportation
systems.
Traffic congestion is a major problem in many cities around the world, leading to wasted time and
increased air pollution. By using big data analytics, cities can analyze traffic patterns and develop
more efficient transportation systems.
For example, cities can use data from traffic sensors, GPS devices, and other sources to analyze
traffic patterns and identify areas where congestion is most likely to occur. This data can be used
to develop more efficient traffic flow patterns, optimize traffic signals, and adjust speed limits to
reduce congestion.
Cities can also use big data analytics to improve public transportation systems. By analyzing data
from bus and train schedules, passenger flows, and other sources, cities can optimize routes and
schedules to reduce wait times and improve service.
Q3) What is Big data? Discuss it in terms of four dimensions: volume, velocity, variety and veracity.
OR
Q3) What is big data analytics? Explain the four ‘V’s of Big data. Briefly discuss applications of big data.
ANS :
Big data analytics refers to the process of analyzing and extracting insights from large and complex
datasets. It involves using advanced analytical techniques and tools to process and analyze data
that is too large and complex to be handled by traditional data processing systems.
The goal of big data analytics is to uncover patterns, trends, and insights that can be used to make
informed decisions and drive business value. This involves processing and analyzing large volumes
of structured and unstructured data from a variety of sources, including social media, sensors, and
other digital devices.
Some of the key tools and technologies used in big data analytics include Hadoop, Spark,
NoSQL databases, machine learning algorithms, and data visualization tools.
Overall, big data analytics offers tremendous opportunities to gain insights and make better
decisions, but it also presents significant challenges related to data quality, integration,
security, and infrastructure. Effective big data analytics requires a combination of advanced
technology, skilled personnel, and sound data management practices.
The four V's of Big Data are Volume, Velocity, Variety, and Veracity. These four V's help to
define the characteristics of Big Data and provide a framework for understanding how it
differs from traditional data.
1) Volume: Refers to the sheer amount of data that is generated and collected. Big Data is
characterized by its massive volume, which is typically too large to be handled by
traditional data processing methods. This volume includes structured data, like customer
data and sales figures, as well as unstructured data, like social media posts, images, and
videos.
2) Velocity: Refers to the speed at which data is generated and must be processed. Big
Data is often generated in real-time or near real-time, and businesses need to be able to
analyze this data quickly to make timely decisions. The velocity of Big Data is increasing
due to the proliferation of devices and sensors that are constantly generating data.
3) Variety: Refers to the different types of data that are generated and collected. Big Data
includes structured, semi-structured, and unstructured data, each of which requires
different processing methods. Structured data is organized and easily searchable, while
unstructured data is not. Semi-structured data is a combination of both.
4) Veracity: Refers to the accuracy and reliability of the data. Big Data can come from many different sources, and not all of it is trustworthy. Veracity refers to the ability to determine the accuracy of the data and ensure that it is reliable.
Together, these four V's provide a framework for understanding the characteristics of Big
Data and the challenges associated with managing and analyzing it. Big Data Analytics tools
and technologies, such as Hadoop, Spark, and NoSQL databases, are designed to help
businesses manage and analyze Big Data efficiently, extract valuable insights, and make
data-driven decisions.
Big Data has many applications across various industries and domains. Here are some
examples of how Big Data is being used in different fields:
1) Healthcare: Big Data is being used to improve patient outcomes and reduce costs. By
analyzing patient data from electronic health records, medical devices, and wearables,
healthcare providers can identify patterns and trends that can inform treatment
decisions and improve patient care.
2) Marketing: Big Data is used in marketing to target customers more effectively. By
analyzing data from social media, online searches, and purchase history, businesses can
identify consumer preferences and behavior to deliver more personalized and targeted
advertising.
3) Finance: Big Data is used in finance to identify fraud, assess risk, and improve customer
service. By analyzing transactional data, social media sentiment, and other data sources,
financial institutions can make better-informed decisions about lending, investing, and
risk management.
4) Manufacturing: Big Data is used in manufacturing to optimize production and reduce
downtime. By analyzing data from sensors and other sources, manufacturers can identify
inefficiencies in production processes and implement changes to improve efficiency and
reduce costs.
5) Transportation: Big Data is used in transportation to improve safety and efficiency. By
analyzing data from GPS devices, traffic sensors, and other sources, transportation
companies can optimize routes, reduce congestion, and improve safety on the roads.
The traditional business approach involves using structured data from internal systems such as
enterprise resource planning (ERP) and customer relationship management (CRM) to support
decision-making processes. This approach relies on databases that can be queried and analyzed
using traditional data analysis tools such as SQL. This type of data is often limited in volume,
velocity, and variety, which means that businesses may miss out on valuable insights if they rely
solely on this type of data.
On the other hand, the Big Data business approach involves using both structured and
unstructured data from various sources such as social media, mobile devices, sensors, and other
IoT devices to support decision-making processes. This approach relies on Big Data Analytics tools
and technologies to store, process, and analyze large volumes of data, often in real-time. This type
of data is characterized by high volume, velocity, and variety, which makes it difficult to analyze
using traditional data analysis tools.
The Big Data approach provides businesses with more comprehensive insights into customer
behavior, market trends, and operational efficiencies. It allows businesses to identify patterns and
trends that were previously impossible to detect, enabling them to make more informed decisions.
For example, a retailer can use Big Data to analyze customer browsing and purchasing behavior to
identify which products are popular and why, and then adjust their inventory and pricing strategy
accordingly.
Conventional systems, also known as legacy systems, often face a range of challenges that can
limit their effectiveness and hinder business operations. Here are some of the main challenges of
conventional systems:
Limited functionality: Conventional systems often have limited functionality, as they were
designed to perform specific tasks or functions. They may lack the flexibility to adapt to changing
business needs or integrate with other systems.
Data silos: Conventional systems may create data silos, where data is stored in isolated systems
that are not connected to other systems or applications. This can make it difficult to share
information across different departments or business units, leading to inefficiencies and delays.
Security vulnerabilities: Conventional systems may have security vulnerabilities due to outdated
technology or lack of updates. These vulnerabilities can make them susceptible to cyber-attacks or
data breaches, which can result in significant financial and reputational damage.
Limited scalability: Conventional systems may not be scalable, meaning that they cannot easily
accommodate increased demand or growth. This can result in bottlenecks and slowdowns, leading
to reduced productivity and customer dissatisfaction.
Lack of integration: Conventional systems may not be designed to integrate with other systems
or applications, making it difficult to streamline processes and improve efficiency. This can result in
manual processes and data entry, leading to errors and delays.
Costly maintenance: Conventional systems can be costly to maintain, as they may require
specialized skills and knowledge to keep them running effectively. This can result in high IT costs
and limited resources for other business needs.
Q.5). What are the benefits of Big Data? Discuss challenges under Big Data. How Big Data Analytics can be
useful in the development of smart cities
ANS : Refer to the answer to Q2 above.
Q.6). Discuss big data in healthcare, transportation and medicine.
ANS:
All DataNodes continuously communicate with the NameNode and inform it about the blocks they are storing. Furthermore, the DataNodes also perform block creation, deletion, and replication as instructed by the NameNode.
Relationship Between NameNode and DataNode
The NameNode and DataNodes operate according to a master-slave architecture in the Hadoop Distributed File System (HDFS).
Difference Between NameNode and DataNode
Definition
NameNode is the controller and manager of HDFS whereas DataNode is a node other than the NameNode in
HDFS that is controlled by the NameNode. Thus, this is the main difference between NameNode and
DataNode in Hadoop.
Synonyms
Moreover, Master node is another name for NameNode while Slave node is another name for DataNode.
Main Functionality
While the NameNode handles the metadata of all the files in HDFS and controls the DataNodes, the DataNodes store and retrieve blocks according to the master node’s instructions. Hence, this is another difference between NameNode and DataNode in Hadoop.
Conclusion
The main difference between NameNode and DataNode in Hadoop is that the NameNode is the master node
in HDFS that manages the file system metadata while the DataNode is a slave node in HDFS that stores the
actual data as instructed by the NameNode. In brief, NameNode controls and manages a single or multiple
data nodes.
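As an illustrative sketch of this master-slave interaction (not part of the original answer), the Java snippet below uses the standard HDFS FileSystem client API: the client asks the NameNode, addressed via fs.defaultFS, for file and block metadata, while the actual block data is written to and read from DataNodes. The NameNode address and file path are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS points at the NameNode (master); the client obtains metadata from it
    // and then streams block data directly to/from the DataNodes (slaves).
    conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // hypothetical NameNode address
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/demo/hello.txt"); // hypothetical path
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("stored as blocks on DataNodes"); // blocks are replicated across DataNodes
    }
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF()); // read back using block locations supplied by the NameNode
    }
  }
}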
INPUT FORMATS
● InputFormat describes the input-specification for execution of the Map-Reduce job.
● In MapReduce job execution, InputFormat is the first step. InputFormat describes how to split and read
input files.
● InputFormat is responsible for splitting the input data file into records, which are used for the map-reduce operation.
○ InputFormat selects the files or other objects for input.
○ It defines the Data splits. It defines both the size of individual Map tasks and its potential execution
server.
○ InputFormat defines the RecordReader. It is also responsible for reading actual records from the
input files.
1. FileInputFormat: It is the base class for all file-based InputFormats. When we start a MapReduce job execution, FileInputFormat provides the path containing the files to read. It reads all the files and divides them into one or more InputSplits.
2. TextInputFormat: It is the default InputFormat. This InputFormat treats each line of each input file as a
separate record. It performs no parsing. TextInputFormat is useful for unformatted data or line-based
records like log files.
3. KeyValueTextInputFormat: It is similar to TextInputFormat, as it also treats each line of input as a separate record. The difference is that TextInputFormat treats the entire line as the value, whereas KeyValueTextInputFormat breaks the line itself into a key and a value.
4. SequenceFileInputFormat: It is an InputFormat that reads sequence files. Sequence files are binary files that store sequences of binary key-value pairs. They are block-compressed and provide direct serialization and deserialization of arbitrary data types.
5. NLineInputFormat: It is another form of TextInputFormat where the keys are the byte offsets of the lines and the values are the contents of the lines. With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input; if we want each mapper to receive a fixed number of lines of input, we use NLineInputFormat.
6. DBInputFormat: This InputFormat reads data from a relational database, using JDBC. It also loads small
datasets, perhaps for joining with large datasets from HDFS using MultipleInputs.
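As a sketch of how an InputFormat is chosen in practice (assuming the standard Hadoop MapReduce Java API; the input path is illustrative), the snippet below switches a job from the default TextInputFormat to KeyValueTextInputFormat, and shows in comments how NLineInputFormat could be configured to give each mapper a fixed number of lines.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class InputFormatConfig {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "input format demo");

    // Default is TextInputFormat (key = byte offset, value = whole line).
    // KeyValueTextInputFormat instead splits each line into a key and a value.
    job.setInputFormatClass(KeyValueTextInputFormat.class);

    // Alternatively, NLineInputFormat gives each mapper a fixed number of lines:
    // job.setInputFormatClass(NLineInputFormat.class);
    // NLineInputFormat.setNumLinesPerSplit(job, 100);

    // FileInputFormat supplies the path containing the files to read.
    FileInputFormat.addInputPath(job, new Path(args[0]));
  }
}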
OUTPUT FORMATS
● The outputFormat decides the way the output key-value pairs are written in the output files by
RecordWriter.
The OutputFormat and InputFormat functions are similar. OutputFormat instances are used to write to files on the local disk or in HDFS. In MapReduce job execution, on the basis of the output specification:
● the Hadoop MapReduce job checks that the output directory does not already exist;
● OutputFormat provides the RecordWriter implementation to be used to write the output files of the job. The output files are then stored in a FileSystem.
5. MultipleOutputs: This format allows writing data to files whose names are derived from the output
keys and values.
7. DBOutputFormat: It is the OutputFormat for writing to relational databases and HBase. This format sends the reduce output to a SQL table. It accepts key-value pairs in which the key has a type extending DBWritable.
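Similarly, a hedged sketch of configuring the output side (standard Hadoop MapReduce Java API; the output path is illustrative): the job writes key-value pairs with the default TextOutputFormat, SequenceFileOutputFormat is shown in a comment as the binary alternative, and the output directory must not already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputFormatConfig {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "output format demo");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Default: TextOutputFormat writes "key<TAB>value" lines via its RecordWriter.
    job.setOutputFormatClass(TextOutputFormat.class);
    // For block-compressed binary key-value output, SequenceFileOutputFormat could be used:
    // job.setOutputFormatClass(SequenceFileOutputFormat.class);

    // The output directory must not already exist, otherwise job submission fails.
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
  }
}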
Q.12) . Explain Job Scheduling in Map Reduce. How it is done in case of
ANS :
Job scheduling in MapReduce is an important aspect of the Hadoop framework that involves
allocating resources to different jobs and tasks running on a cluster. There are several scheduling
policies available in Hadoop, but the two most commonly used policies are the Fair Scheduler and
the Capacity Scheduler.
The Fair Scheduler is a scheduling policy that allows multiple jobs to run on a Hadoop cluster in a
fair and efficient manner. In the Fair Scheduler, jobs are allocated resources based on their relative
priorities, and the available resources are distributed evenly across all jobs. This ensures that no
single job dominates the cluster and that all jobs receive a fair share of the available resources.
The Fair Scheduler works by dividing the available resources into several pools, each with its own
set of allocation rules. Each job is assigned to a specific pool based on its priority, and the Fair
Scheduler ensures that each pool receives a fair share of the available resources. If a job requires
more resources than are available in its pool, it will be placed in a waiting queue until sufficient
resources become available.
The Capacity Scheduler is a scheduling policy that allocates resources based on pre-defined
capacities for different jobs and queues. In the Capacity Scheduler, resources are divided into
several queues, each with its own capacity limits. Each job is assigned to a specific queue based on
its priority, and the Capacity Scheduler ensures that each queue receives a fair share of the
available resources.
The Capacity Scheduler works by maintaining a set of queues, each with its own capacity limits. If a job requires more resources than are available in its queue, it is placed in a waiting queue until sufficient resources become available.
In both Fair Scheduler and Capacity Scheduler, the job scheduler uses a JobTracker and
TaskTrackers to allocate resources and manage the processing of data. The JobTracker is
responsible for tracking the progress of each job and allocating resources to the different tasks
that make up the job. The TaskTrackers are responsible for executing the tasks assigned to them by
the JobTracker and reporting their status back to the JobTracker.
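As a small illustrative sketch (assuming the Capacity Scheduler or Fair Scheduler is enabled on the cluster, and that a queue named "analytics" has been defined by the administrator — both assumptions), a client can direct a MapReduce job to a particular queue through the mapreduce.job.queuename property:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueSubmission {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // With the Capacity Scheduler (or a Fair Scheduler pool), the job is placed in a queue;
    // "analytics" is a hypothetical queue name configured by the cluster administrator.
    conf.set("mapreduce.job.queuename", "analytics");

    Job job = Job.getInstance(conf, "scheduled job");
    // ... set mapper, reducer, input and output as usual, then submit the job.
  }
}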
ANS :
In Hadoop, the JobTracker and TaskTrackers play a crucial role in processing data by managing the
allocation of resources and the execution of tasks on a Hadoop cluster.
The JobTracker is the central component of the Hadoop MapReduce framework. It is responsible
for coordinating the execution of MapReduce jobs on the cluster. When a user submits a
MapReduce job, the JobTracker divides the job into smaller tasks and assigns these tasks to the
TaskTrackers for execution. The JobTracker also monitors the progress of each task and manages
the allocation of resources on the cluster.
The TaskTrackers are responsible for executing the tasks assigned to them by the JobTracker. Each
TaskTracker is allocated a set of tasks to execute, and it communicates with the JobTracker to
report the progress of these tasks. The TaskTracker is also responsible for monitoring the health of
the machine it is running on, and it reports any failures or errors to the JobTracker.
The JobTracker and TaskTrackers work together to ensure the efficient processing of data on a
Hadoop cluster. The JobTracker manages the allocation of resources and the scheduling of tasks
on the cluster, while the TaskTrackers execute these tasks and report their progress back to the
JobTracker. This allows Hadoop to process large datasets in parallel across a cluster of computers,
enabling it to handle data-intensive workloads that cannot be processed on a single machine.
In summary, the JobTracker and TaskTrackers are essential components of the Hadoop MapReduce
framework that play a crucial role in processing data. The JobTracker manages the allocation of
resources and the scheduling of tasks, while the TaskTrackers execute these tasks and report their
progress back to the JobTracker. Together, they enable Hadoop to process large datasets in
parallel across a cluster of computers.
Q .13) .
Q .14) .
Q .15) .
Q .16) .
Q.17) .
Q .18) . What is data serialization? With proper examples discuss and
differentiate structured, unstructured and semi-structured data. Make a
note on how type of data affects data serialization
ANS :
Data serialization is the process of converting data objects present in complex data structures into a byte stream for storage, transfer and distribution purposes on physical devices.
Computer systems may vary in their hardware architecture, OS, addressing mechanisms.
Internal binary representations of data also vary accordingly in every environment. Storing
and exchanging data between such varying environments requires a platform-and-language-
neutral data format that all systems understand.
Once the serialized data is transmitted from the source machine to the destination machine,
the reverse process of creating objects from the byte sequence called deserialization is
carried out. Reconstructed objects are clones of the original object.
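A minimal sketch of serialization and deserialization, using plain Java object serialization (the Employee class and its field values are hypothetical; Hadoop itself uses Writable/Avro-style serialization rather than java.io.Serializable):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationDemo {
  // A hypothetical structured record.
  static class Employee implements Serializable {
    private static final long serialVersionUID = 1L;
    String name;
    int age;
    Employee(String name, int age) { this.name = name; this.age = age; }
  }

  public static void main(String[] args) throws Exception {
    Employee original = new Employee("Asha", 30);

    // Serialization: object -> byte stream (could be stored on disk or sent over a network).
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(original);
    }

    // Deserialization: byte stream -> a clone of the original object.
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      Employee copy = (Employee) in.readObject();
      System.out.println(copy.name + " " + copy.age); // prints: Asha 30
    }
  }
}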
Q. 20). Explain the working of MapReduce with reference to the WordCount program for a file having input as below: Red apple Red wine Green apple Green peas Pink rose Blue whale Blue sky Green city Clean city
ANS :
In Hadoop, MapReduce is a computation that decomposes large manipulation jobs
into individual tasks that can be executed in parallel across a cluster of servers. The
results of tasks can be joined together to compute final results.
MapReduce consists of 2 steps:
• Map Function – It takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples (Key-Value pair).
Input (set of data): Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN
• Reduce Function – Takes the output from Map as an input and combines
those data tuples into a smaller set of tuples.
Output (converted into a smaller set of tuples): (BUS,7), (CAR,7), (TRAIN,4)
Reduce: The intermediate key-value pairs that serve as input to the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on its key-value pairs as per the reducer algorithm written by the developer.
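The sketch below is a minimal WordCount in Java, essentially the canonical example written against the standard Hadoop MapReduce API: the mapper emits (word, 1) for every word of a line, and the reducer sums the counts per word. For the input given in the question (Red apple Red wine Green apple Green peas Pink rose Blue whale Blue sky Green city Clean city), the expected output (counts are case-sensitive) is: (Red,2), (apple,2), (wine,1), (Green,3), (peas,1), (Pink,1), (rose,1), (Blue,2), (whale,1), (sky,1), (city,2), (Clean,1).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      // Split each input line into words and emit (word, 1) for every word.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      // Sum all counts emitted for the same word.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}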
How Job tracker and the task tracker deal with MapReduce:
Job Tracker: The work of the Job Tracker is to manage all the resources and all the jobs across the cluster, and also to schedule each map task on the Task Tracker running on the same data node, since there can be hundreds of data nodes available in the cluster.
Task Tracker: The Task Tracker can be considered as the actual slaves that are working on
the instruction given by the Job Tracker. This Task Tracker is deployed on each of the nodes
available in the cluster that executes the Map and Reduce task as instructed by Job Tracker.
There is also one important component of the MapReduce architecture known as the Job History Server. The Job History Server is a daemon process that saves and stores historical information about tasks or applications; for example, the logs generated during or after job execution are stored on the Job History Server.
Q1 ) What is NoSQL database? Discuss key characteristics and advantages of NoSQL database.
ANS :
NoSQL databases are non-relational databases that provide a flexible and scalable alternative to traditional
relational databases.
NoSQL databases do not use tables and SQL (Structured Query Language) for data storage and retrieval.
Instead, they use various data models such as key-value, document, column-family, or graph databases to
store and manage data.
There are several types of NoSQL databases, each with its own data model:
1. Key-value stores: Key-value stores are the simplest type of NoSQL database and store data
as a collection of key-value pairs.
2. Document databases: Document databases store data in JSON or BSON documents, which
can be nested and have a flexible structure.
3. Column-family databases: Column-family databases store data in columns, rather than rows,
and are optimized for storing and querying large amounts of data.
4. Graph databases: Graph databases store data as nodes and edges, making them ideal for
analyzing complex relationships between data.
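As a tiny illustrative sketch (plain Java, with made-up user data) of the difference between the key-value and document models: a key-value store sees only an opaque value stored under a key, whereas a document store understands the structure of the document and can query or index its individual fields.

import java.util.HashMap;
import java.util.Map;

public class NoSqlModelsSketch {
  public static void main(String[] args) {
    // Key-value model: the database treats the value as an opaque blob stored under a key.
    Map<String, String> keyValueStore = new HashMap<>();
    keyValueStore.put("user:101", "{\"name\":\"Asha\",\"city\":\"Pune\"}");

    // Document model: the same data, but the store understands the JSON structure,
    // so individual fields (e.g. "city") and nested arrays can be queried and indexed.
    String document = "{ \"_id\": \"101\", \"name\": \"Asha\", \"city\": \"Pune\", "
        + "\"orders\": [ { \"item\": \"book\", \"qty\": 2 } ] }";

    System.out.println(keyValueStore.get("user:101"));
    System.out.println(document);
  }
}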
Key characteristics of NoSQL databases include:
1. Schema-less: NoSQL databases do not require a fixed schema to be defined before data can be stored. This means that they can handle unstructured or semi-structured data that does not fit into a relational model.
2. Distributed: NoSQL databases are designed to be distributed across multiple servers or
nodes. This allows them to scale horizontally by adding more nodes to the system, making it
easier to handle large volumes of data.
3. High availability and fault tolerance: NoSQL databases are designed to provide high
availability and fault tolerance, which means that the system remains operational even if one
or more nodes fail.
4. Performance: NoSQL databases are often faster and more efficient than traditional relational
databases, especially when it comes to handling large amounts of data.
5. Flexibility: NoSQL databases can handle a wide variety of data types and structures, making
it easier to work with complex data.
There are several advantages to using NoSQL databases:
1. Scalability: NoSQL databases can scale horizontally by adding more nodes to the system.
This makes it easy to handle large amounts of data and maintain performance as the system
grows.
2. Flexibility: NoSQL databases can handle a wide variety of data types and structures, making
it easier to work with complex data.
3. Speed: NoSQL databases are optimized for high-speed data access and retrieval, making
them ideal for applications that require real-time data processing.
4. Cost-effective: NoSQL databases are often more cost-effective than traditional relational
databases, especially when it comes to scaling and managing large volumes of data.
5. Availability: NoSQL databases are designed to provide high availability and fault tolerance,
ensuring that data is always available, even in the event of node failures.
Q2 ) Why NoSQL? Explain Advantages & Disadvantages.
ANS : Here are some reasons why NoSQL would be used :-
1. Multi-Model
2. Easily Scalable
3. Distributed
The masterless architecture of many NoSQL databases allows multiple copies of data to be maintained across different nodes. If one node goes down, another node will have a copy of the data for easy and fast access. This leads to zero downtime in the NoSQL database. When one considers the cost of downtime, this is a big deal.
Advantages:
1. Scalability: NoSQL databases are highly scalable and can easily handle a large amount of
data. They are designed to work with large data sets and can distribute the data across
multiple servers.
2. Flexibility: NoSQL databases are highly flexible and can handle different types of data such
as structured, semi-structured, and unstructured data. They allow for easy modification of
data models without having to make changes to the entire database schema.
3. Performance: NoSQL databases are optimized for performance and can handle high-speed
data processing. They are designed to perform well in distributed environments and can
handle large volumes of data with high concurrency.
4. Availability: NoSQL databases are designed for high availability and can withstand hardware
failures without losing data. They are highly fault-tolerant and can continue to operate even
in the event of network failures.
Disadvantages:
1. Complexity: NoSQL databases are generally more complex than traditional relational
databases, and may require specialized knowledge to manage and optimize them.
2. Limited functionality: NoSQL databases do not offer the same level of functionality as
traditional relational databases. They lack support for advanced query languages and
transaction processing.
3. Lack of standardization: There is no standard for NoSQL databases, and each database has
its own way of storing and retrieving data. This can make it difficult to switch between
different NoSQL databases.
4. Consistency: NoSQL databases sacrifice consistency for scalability and availability, meaning
that data may not always be consistent across different nodes in the database cluster.
Q3) Write differences between NoSQL and SQL.
ANS :
1. ACID compliance: SQL databases follow ACID (Atomicity, Consistency, Isolation, Durability), whereas NoSQL databases follow BASE (Basically Available, Soft state, Eventual consistency).
2. Data integrity: SQL databases are highly structured and enforce data integrity, whereas NoSQL databases are loosely structured and may not enforce data integrity.
3. Consistency model: SQL databases use a strong consistency model, whereas NoSQL databases use an eventual consistency model.
4. Hosting options: SQL databases are typically hosted on a single server, whereas NoSQL databases are typically hosted on multiple servers or in the cloud.
Q.5). What is NoSQL database?
ANS :
NoSQL databases, also known as non-relational databases, are a type of database that do not use
the traditional relational model with tables, rows and columns that is used in relational databases
such as SQL. NoSQL databases are designed to handle large amounts of unstructured or semi-
structured data, and offer more flexibility in data models, scalability, and availability compared to
traditional relational databases.
NoSQL databases can store various types of data such as documents, graphs, key-value pairs, and
columns. They are often used in modern applications where there is a need for high scalability and
high performance, such as in big data and real-time web applications. Examples of NoSQL
databases include MongoDB, Cassandra, Redis, and Neo4j. Unlike SQL databases, NoSQL
databases do not use SQL for querying and data manipulation, but offer their own query
languages or APIs.
NoSQL (Not Only SQL) and relational databases are two different types of database management
systems that have their own unique features and characteristics. Here are some of the key
differences between them:
1. Data model: Relational databases use a structured data model with tables, columns, and
rows, while NoSQL databases use a flexible data model, which can include documents, key-
value pairs, graphs, or column-family.
2. Schema: Relational databases enforce a strict schema, which means that all data must be
pre-defined and follow a consistent structure, while NoSQL databases have a flexible
schema, which allows for more dynamic and agile data modeling.
3. Scalability: NoSQL databases are designed to be horizontally scalable, which means they can
handle large volumes of data and high traffic loads by adding more servers or nodes to the
system. Relational databases, on the other hand, are vertically scalable, which means they
can handle more traffic and data by adding more processing power or memory to the
existing server.
4. Performance: NoSQL databases can offer faster performance for certain types of queries and
workloads, such as those involving complex relationships or unstructured data. Relational
databases may perform better for simple queries or those involving transactions and
consistency.
5. ACID compliance: Relational databases are generally designed to ensure ACID compliance,
which means they guarantee atomicity, consistency, isolation, and durability for transactions.
NoSQL databases may not always provide ACID guarantees, but may offer other forms of
consistency or eventual consistency.
6. Cost: NoSQL databases are often open-source or free to use, while relational databases
typically require licensing fees or upfront costs for commercial use.
7. Use cases: Relational databases are commonly used for transactional systems, such as online
banking or e-commerce, where data integrity and consistency are critical. NoSQL databases
are often used for big data applications, real-time analytics, and distributed systems, where
scalability and agility are important.
It's important to note that there are many types of NoSQL databases, each with its own unique features and characteristics, and that many modern databases offer hybrid features that combine elements of both NoSQL and relational models. Choosing the right type of database depends on the specific requirements and characteristics of the application.
ANS : (This answer is not confirmed; in short, it could not be found.)
Q.7 ). Explain different types of consistencies.
OR
ANS :
Read consistency and update consistency are two important concepts in NoSQL databases that affect how data is
read and updated in a distributed environment.
1. Read Consistency: Read consistency refers to the level of assurance that a read operation returns the
most up-to-date data. In NoSQL databases, read consistency can be achieved through various
techniques, such as quorum reads and vector clocks.
For example, consider a NoSQL database that has three replicas of data, and a quorum of two replicas is required
to read data. In this scenario, a read operation will return the most recent data that has been written to at least
two of the three replicas. This ensures that the read operation returns the most up-to-date data while still
maintaining high availability.
2. Update Consistency: Update consistency refers to the level of assurance that a write operation updates all
replicas of the data correctly. In NoSQL databases, update consistency can be achieved through various
techniques, such as quorum writes and version vectors.
For example, consider a NoSQL database that has three replicas of data, and a quorum of two replicas is required
to write data. In this scenario, a write operation will update at least two of the three replicas with the new data.
This ensures that all replicas are updated correctly, while still maintaining high availability.
In summary, read and update consistency are important concepts in NoSQL databases that ensure data
consistency and high availability in a distributed environment. By choosing the right consistency level for read
and write operations, developers can design scalable and reliable applications that meet the needs of their users.
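A minimal sketch (plain Java, not tied to any particular NoSQL product) of the quorum rule behind these examples: with N replicas, a read quorum R and a write quorum W are guaranteed to overlap in at least one replica whenever R + W > N, which is what lets a quorum read see the latest quorum write.

public class QuorumCheck {
  // With N replicas, a read quorum R and a write quorum W overlap in at least one replica
  // whenever R + W > N, so a quorum read is guaranteed to see the latest quorum write.
  static boolean readSeesLatestWrite(int n, int r, int w) {
    return r + w > n;
  }

  public static void main(String[] args) {
    int n = 3; // three replicas, as in the examples above
    System.out.println(readSeesLatestWrite(n, 2, 2)); // true:  R=2, W=2 -> the quorums overlap
    System.out.println(readSeesLatestWrite(n, 1, 1)); // false: reads may return stale data
  }
}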
ANS :
The CAP theorem, originally introduced as the CAP principle, can be used to explain
some of the competing requirements in a distributed system with replication. It is a tool
used to make system designers aware of the trade-offs while designing networked
shared-data systems.
The three letters in CAP refer to three desirable properties of distributed systems with
replicated data: consistency (among replicated copies), availability (of the system for
read and write operations) and partition tolerance (in the face of the nodes in the
system being partitioned by a network fault).
The CAP theorem states that it is not possible to guarantee all three of the desirable
properties – consistency, availability, and partition tolerance at the same time in a
distributed system with data replication.
The theorem states that networked shared-data systems can only strongly support two of
the following three properties:
• Consistency –
Consistency means that the nodes will have the same copies of a replicated
data item visible for various transactions. A guarantee that every node in a
distributed cluster returns the same, most recent and a successful write.
Consistency refers to every client having the same view of the data. There are
various types of consistency models. Consistency in CAP refers to sequential
consistency, a very strong form of consistency.
• Availability –
Availability means that each read or write request for a data item will either be
processed successfully or will receive a message that the operation cannot be
completed. Every non-failing node returns a response for all the read and write
requests in a reasonable amount of time. The key word here is “every”. In
simple terms, every node (on either side of a network partition) must be able to
respond in a reasonable amount of time.
• Partition Tolerance –
Partition tolerance means that the system can continue operating even if the
network connecting the nodes has a fault that results in two or more partitions,
where the nodes in each partition can only communicate among each other.
That means, the system continues to function and upholds its consistency
guarantees in spite of network partitions. Network partitions are a fact of life.
Distributed systems guaranteeing partition tolerance can gracefully recover from
partitions once the partition heals.
The use of the word consistency in CAP and its use in ACID do not refer to the same
identical concept.
In CAP, the term consistency refers to the consistency of the values in different copies of
the same data item in a replicated distributed system. In ACID, it refers to the fact that a
transaction will not violate the integrity constraints specified on the database schema.
Q.9). Explain Following terms:
1. Relaxing Consistency
2. Relaxing Durability
3. Quorum
4. Version stamp
ANS :
1. Relaxing Consistency: Relaxing consistency in NoSQL databases means allowing for some
inconsistency in the data to improve performance and scalability. This means that not all
replicas of the data are always kept in sync with each other, and a read operation may not
always return the most recent write. Relaxing consistency is a trade-off between consistency
and performance and is commonly used in distributed systems where high availability is
critical.
2. Relaxing Durability: Relaxing durability in NoSQL databases means allowing for the
possibility of data loss in exchange for improved performance. This means that not all writes
to the database are immediately persisted to disk, and some writes may be lost in the event
of a system failure. Relaxing durability is a trade-off between durability and performance
and is commonly used in systems where high write throughput is critical, and data loss is
acceptable.
3. Quorum: Quorum in NoSQL databases is the minimum number of replicas that must be
available to perform a read or write operation. A quorum is used to ensure that the read or
write operation is performed on a majority of the replicas, ensuring data consistency and
avoiding conflicts. For example, a quorum of two in a database with three replicas means
that at least two replicas must be available to perform a read or write operation.
4. Version stamp: A version stamp in NoSQL databases is a metadata tag that is associated
with each piece of data. The version stamp is used to determine the order of updates to the
data and is commonly used in conflict resolution. When multiple clients update the same
piece of data concurrently, the version stamp is used to determine which update is the most
recent and should be applied.
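A minimal sketch (plain Java, with hypothetical values) of how a version stamp supports conflict detection through optimistic concurrency control: an update is accepted only if the caller supplies the version it last read; otherwise it is rejected as a conflict.

public class VersionStampSketch {
  // A value paired with a version stamp: the stamp is bumped on every successful write.
  private String value = "initial";
  private long version = 0;

  synchronized boolean tryUpdate(long expectedVersion, String newValue) {
    if (version != expectedVersion) {
      return false; // another client updated first -> conflict, caller must re-read and retry
    }
    value = newValue;
    version++;      // the new version stamp records that this write is now the most recent
    return true;
  }

  public static void main(String[] args) {
    VersionStampSketch cell = new VersionStampSketch();
    System.out.println(cell.tryUpdate(0, "A")); // true:  version stamp moves from 0 to 1
    System.out.println(cell.tryUpdate(0, "B")); // false: stale version stamp -> conflict detected
  }
}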
In summary, NoSQL databases have different features and trade-offs compared to traditional
relational databases. By understanding concepts such as relaxing consistency, relaxing durability,
quorum, and version stamp, developers can design scalable and reliable applications that meet the
needs of their users.
Q.10) .
Q. 11) .
Q. 12) .
Q. 13) . Differentiate Strong Vs. Eventual Consistency.
1. Eventual Consistency : Eventual consistency is a consistency model that enables the data
store to be highly available. It is also known as optimistic replication & is key to distributed
systems. So, how exactly does it work? Let us understand this with the help of a use case.
Real World Use Case:
Think of a popular microblogging site deployed across the world in different geographical
regions like Asia, America, and Europe. Moreover, each geographical region has multiple
data center zones: North, East, West, and South.
Furthermore, each zone has multiple clusters, each with multiple server nodes running. So, we have many datastore nodes spread across the world that the micro-blogging site uses for persisting data. Since there are so many nodes running, there is no single point of failure, and the data store service is highly available: even if a few nodes go down, the persistence service is still up. Let’s say a celebrity makes a post on the website that everybody starts liking around the world.
At a point in time, a user in Japan likes a post which increases the “Like” count of the post
from say 100 to 101. At the same point in time, a user in America, in a different geographical
zone, clicks on the post, and he sees “Like” count as 100, not 101.
This is simply because the newly updated value of the post’s “Like” counter needs some time to move from Japan to America and update the server nodes running there. Though the value of the counter at that point in time was 101, the user in America sees the old, inconsistent value.
But when he refreshes his web page after a few seconds “Like” counter value shows as 101.
So, data was initially inconsistent but eventually got consistent across server nodes deployed
around the world. This is what eventual consistency is.
2. Strong Consistency: Strong Consistency simply means the data must be strongly consistent at all times. All the server nodes across the world should contain the same value for an entity at any point in time. The only way to implement this behavior is by locking down the nodes while they are being updated.
Real World Use Case:
Let’s continue the same Eventual Consistency example from the previous lesson. To ensure
Strong Consistency in the system, when a user in Japan likes posts, all nodes across different
geographical zones must be locked down to prevent any concurrent updates.
This means at one point in time, only one user can update the post “Like” counter value. So,
once a user in Japan updates the “Like” counter from 100 to 101. The value gets replicated
globally across all nodes. Once all nodes reach consensus, locks get lifted. Now, other users
can Like posts.
If the nodes take a while to reach a consensus, they must wait until then. Well, this is surely
not desired in the case of social applications. But think of a stock market application where
the users are seeing different prices of the same stock at one point in time and updating it
concurrently. This would create chaos. Therefore, to avoid this confusion we need our
systems to be Strongly Consistent.
The nodes must be locked down for updates. Queuing all requests is one good way of
making a system Strongly Consistent. The Strong Consistency model limits the capability of the system to be Highly Available and to perform concurrent updates. This is how strongly consistent ACID transactions are implemented.
------------------------------------------------------------------DISCLAIMER ! --------------------------------------------------------------------
PLEASE FIND THE ANSWERS THAT ARE STILL MISSING, EDIT THEM BACK IN, AND THEN WRITE THEM DOWN.
AND IF YOU CANNOT FIND AN ANSWER, THEN AT LEAST MESSAGE AND CONSULT SOMEONE ELSE ABOUT IT.