QB BDA Solution
Ans :
Big data is a term used in Big Data Analytics (BDA) to describe datasets that are so large and complex that traditional data processing techniques are not sufficient to process and analyze them. These datasets can be generated from a variety of sources, including social media, Internet of Things (IoT) devices, sensors, and other sources.
Big data is typically characterized by three main characteristics, often referred to as the three Vs: volume, velocity, and variety.
1) Volume: Refers to the vast amount of data generated every day. Big data sets can range from terabytes to petabytes and beyond, and can contain millions or even billions of records, which is too large to be managed and analyzed using traditional database management systems. The volume of data is usually measured by the size of the dataset, typically in petabytes, exabytes, or zettabytes. Traditional data processing tools may struggle to handle this amount of data, but big data technologies such as Hadoop and Spark are designed to handle it.
2) Velocity: Refers to the speed at which data is generated and processed. Big data is often
generated in real-time or near real-time, making it difficult to manage and process using
traditional systems. Streaming data, social media data, and IoT data are examples of data
that is generated at high velocity.
3) Variety: Refers to the different types of data that exist, including structured, semi-
structured, and unstructured data. Structured data is typically found in databases and is
easily searchable and analyzed using traditional tools. Semi-structured data is often found in
XML, JSON, or other formats that have some structure but are not as easily searchable as
structured data. Unstructured data includes things like text, images, audio, and video files
that have no inherent structure and require specialized tools to search and analyze.
In addition to the three Vs, big data is often described by two further characteristics:
4) Veracity: Refers to the accuracy and quality of the data. Big data often includes data from a wide range of sources, and it can be difficult to ensure that the data is accurate and trustworthy enough to be used for making informed decisions.
5) Value: Refers to the potential value that can be extracted from big data. While big data can
be overwhelming, it also offers tremendous opportunities to gain insights and make better
decisions. By using advanced analytics tools, organizations can extract valuable insights
from big data that can help them improve their business processes, products, and services.
Q2 ) What are the benefits of Big Data? Discuss challenges under Big Data. How Big Data Analytics can be useful in the development of smart cities. (Discuss one application.)
ANS :
There are many benefits to utilizing big data in business and other fields. Here are some of the key
benefits:
1) Better decision-making: Big data provides insights that can help decision-makers make
more informed decisions. By analyzing large datasets, organizations can identify patterns
and trends that might not be apparent with smaller datasets, and make data-driven
decisions based on these insights.
2) Improved operational efficiency: Big data analytics can help organizations optimize their
operations by identifying areas where they can improve efficiency, reduce waste, and lower
costs. For example, analyzing sensor data from manufacturing equipment can help identify
potential maintenance issues before they become costly problems.
3) Enhanced customer experience: Big data analytics can help organizations understand their
customers better and provide them with more personalized experiences. By analyzing
customer data, organizations can identify trends in customer behavior, preferences, and
needs, and tailor their products and services accordingly.
4) Increased revenue: Big data analytics can help organizations identify new revenue
opportunities and improve their sales and marketing efforts. For example, analyzing
customer data can help identify cross-selling and upselling opportunities, and improve
customer retention rates.
5) Improved risk management: Big data analytics can help organizations identify potential
risks and mitigate them before they become major problems. For example, analyzing
financial data can help identify potential fraud or other financial irregularities.
Overall, big data offers tremendous opportunities to gain insights and make better
decisions, improve efficiency, and drive innovation and growth in organizations. By
leveraging big data analytics, organizations can stay ahead of the competition and make
data-driven decisions that lead to better outcomes.
While big data offers many benefits, there are also several challenges associated with
managing and analyzing large datasets. Here are some of the key challenges:
1) Data quality: With so much data being generated from a variety of sources, ensuring the
accuracy and quality of the data can be a challenge. Poor quality data can lead to inaccurate
insights and decisions, so it's important to ensure that the data being used is accurate and
reliable.
2) Data integration: Big data often comes from a variety of sources and in different formats,
making it difficult to integrate and analyze. Organizations need to be able to bring together
data from different sources and make it usable for analysis.
3) Data security: Big data can contain sensitive information that needs to be protected.
Organizations need to ensure that the data is secure and that appropriate security measures
are in place to prevent unauthorized access.
4) Lack of skilled personnel: Managing and analyzing big data requires specialized skills and
expertise that may not be readily available. Organizations may need to invest in training or
hire new personnel with the necessary skills to manage and analyze big data effectively.
5) Infrastructure: Storing and processing large datasets requires significant computing power
and storage capacity. Organizations need to have the appropriate infrastructure in place to
manage and analyze big data effectively.
6) Cost: Managing and analyzing big data can be expensive. Organizations need to invest in
the appropriate technology, personnel, and infrastructure, which can be costly.
How Big Data Analytics can be useful in the development of smart cities
Big data analytics can be a valuable tool for developing and managing smart cities. One specific
application of big data analytics in smart cities is in the management of traffic and transportation
systems.
Traffic congestion is a major problem in many cities around the world, leading to wasted time and
increased air pollution. By using big data analytics, cities can analyze traffic patterns and develop
more efficient transportation systems.
For example, cities can use data from traffic sensors, GPS devices, and other sources to analyze
traffic patterns and identify areas where congestion is most likely to occur. This data can be used
to develop more efficient traffic flow patterns, optimize traffic signals, and adjust speed limits to
reduce congestion.
Cities can also use big data analytics to improve public transportation systems. By analyzing data
from bus and train schedules, passenger flows, and other sources, cities can optimize routes and
schedules to reduce wait times and improve service.
Q3) What is Big data? Discuss it in terms of four dimensions: volume, velocity, variety and veracity.
OR
Q3) What is big data analytics? Explain the four ‘V’s of Big data. Briefly discuss applications of big data.
ANS :
Big data analytics refers to the process of analyzing and extracting insights from large and complex
datasets. It involves using advanced analytical techniques and tools to process and analyze data
that is too large and complex to be handled by traditional data processing systems.
The goal of big data analytics is to uncover patterns, trends, and insights that can be used to make
informed decisions and drive business value. This involves processing and analyzing large volumes
of structured and unstructured data from a variety of sources, including social media, sensors, and
other digital devices.
Some of the key tools and technologies used in big data analytics include Hadoop, Spark,
NoSQL databases, machine learning algorithms, and data visualization tools.
Overall, big data analytics offers tremendous opportunities to gain insights and make better
decisions, but it also presents significant challenges related to data quality, integration,
security, and infrastructure. Effective big data analytics requires a combination of advanced
technology, skilled personnel, and sound data management practices.
The four V's of Big Data are Volume, Velocity, Variety, and Veracity. These four V's help to
define the characteristics of Big Data and provide a framework for understanding how it
differs from traditional data.
1) Volume: Refers to the sheer amount of data that is generated and collected. Big Data is
characterized by its massive volume, which is typically too large to be handled by
traditional data processing methods. This volume includes structured data, like customer
data and sales figures, as well as unstructured data, like social media posts, images, and
videos.
2) Velocity: Refers to the speed at which data is generated and must be processed. Big
Data is often generated in real-time or near real-time, and businesses need to be able to
analyze this data quickly to make timely decisions. The velocity of Big Data is increasing
due to the proliferation of devices and sensors that are constantly generating data.
3) Variety: Refers to the different types of data that are generated and collected. Big Data
includes structured, semi-structured, and unstructured data, each of which requires
different processing methods. Structured data is organized and easily searchable, while
unstructured data is not. Semi-structured data is a combination of both.
4) Veracity: Refers to the accuracy and reliability of the data. Big Data can come from many different sources, and not all of it is trustworthy. Veracity refers to the ability to determine the accuracy of the data and ensure that it is reliable.
Together, these four V's provide a framework for understanding the characteristics of Big
Data and the challenges associated with managing and analyzing it. Big Data Analytics tools
and technologies, such as Hadoop, Spark, and NoSQL databases, are designed to help
businesses manage and analyze Big Data efficiently, extract valuable insights, and make
data-driven decisions.
Big Data has many applications across various industries and domains. Here are some
examples of how Big Data is being used in different fields:
1) Healthcare: Big Data is being used to improve patient outcomes and reduce costs. By
analyzing patient data from electronic health records, medical devices, and wearables,
healthcare providers can identify patterns and trends that can inform treatment
decisions and improve patient care.
2) Marketing: Big Data is used in marketing to target customers more effectively. By
analyzing data from social media, online searches, and purchase history, businesses can
identify consumer preferences and behavior to deliver more personalized and targeted
advertising.
3) Finance: Big Data is used in finance to identify fraud, assess risk, and improve customer
service. By analyzing transactional data, social media sentiment, and other data sources,
financial institutions can make better-informed decisions about lending, investing, and
risk management.
4) Manufacturing: Big Data is used in manufacturing to optimize production and reduce
downtime. By analyzing data from sensors and other sources, manufacturers can identify
inefficiencies in production processes and implement changes to improve efficiency and
reduce costs.
5) Transportation: Big Data is used in transportation to improve safety and efficiency. By
analyzing data from GPS devices, traffic sensors, and other sources, transportation
companies can optimize routes, reduce congestion, and improve safety on the roads.
The traditional business approach involves using structured data from internal systems such as
enterprise resource planning (ERP) and customer relationship management (CRM) to support
decision-making processes. This approach relies on databases that can be queried and analyzed
using traditional data analysis tools such as SQL. This type of data is often limited in volume,
velocity, and variety, which means that businesses may miss out on valuable insights if they rely
solely on this type of data.
On the other hand, the Big Data business approach involves using both structured and
unstructured data from various sources such as social media, mobile devices, sensors, and other
IoT devices to support decision-making processes. This approach relies on Big Data Analytics tools
and technologies to store, process, and analyze large volumes of data, often in real-time. This type
of data is characterized by high volume, velocity, and variety, which makes it difficult to analyze
using traditional data analysis tools.
The Big Data approach provides businesses with more comprehensive insights into customer
behavior, market trends, and operational efficiencies. It allows businesses to identify patterns and
trends that were previously impossible to detect, enabling them to make more informed decisions.
For example, a retailer can use Big Data to analyze customer browsing and purchasing behavior to
identify which products are popular and why, and then adjust their inventory and pricing strategy
accordingly.
Conventional systems, also known as legacy systems, often face a range of challenges that can
limit their effectiveness and hinder business operations. Here are some of the main challenges of
conventional systems:
Limited functionality: Conventional systems often have limited functionality, as they were
designed to perform specific tasks or functions. They may lack the flexibility to adapt to changing
business needs or integrate with other systems.
Data silos: Conventional systems may create data silos, where data is stored in isolated systems
that are not connected to other systems or applications. This can make it difficult to share
information across different departments or business units, leading to inefficiencies and delays.
Security vulnerabilities: Conventional systems may have security vulnerabilities due to outdated
technology or lack of updates. These vulnerabilities can make them susceptible to cyber-attacks or
data breaches, which can result in significant financial and reputational damage.
Limited scalability: Conventional systems may not be scalable, meaning that they cannot easily
accommodate increased demand or growth. This can result in bottlenecks and slowdowns, leading
to reduced productivity and customer dissatisfaction.
Lack of integration: Conventional systems may not be designed to integrate with other systems
or applications, making it difficult to streamline processes and improve efficiency. This can result in
manual processes and data entry, leading to errors and delays.
Costly maintenance: Conventional systems can be costly to maintain, as they may require
specialized skills and knowledge to keep them running effectively. This can result in high IT costs
and limited resources for other business needs.
Q.5). What are the benefits of Big Data? Discuss challenges under Big Data. How Big Data Analytics can be
useful in the development of smart cities
ANS : Refer to the answer to Q2 above.
Q.6). Discuss big data in healthcare, transportation and medicine.
ANS:
All DataNodes continuously communicate with the NameNode and inform it about the blocks they are storing. Furthermore, the DataNodes also perform block creation, deletion, and replication as instructed by the NameNode.
Relationship Between NameNode and DataNode
The NameNode and DataNodes operate according to a master-slave architecture in the Hadoop Distributed File System (HDFS).
Difference Between NameNode and DataNode
Definition
NameNode is the controller and manager of HDFS whereas DataNode is a node other than the NameNode in
HDFS that is controlled by the NameNode. Thus, this is the main difference between NameNode and
DataNode in Hadoop.
Synonyms
Moreover, Master node is another name for NameNode while Slave node is another name for DataNode.
Main Functionality
While the NameNode handles the metadata of all the files in HDFS and controls the DataNodes, the DataNodes store and retrieve blocks according to the master node’s instructions. Hence, this is another difference between NameNode and DataNode in Hadoop.
Conclusion
The main difference between NameNode and DataNode in Hadoop is that the NameNode is the master node
in HDFS that manages the file system metadata while the DataNode is a slave node in HDFS that stores the
actual data as instructed by the NameNode. In brief, NameNode controls and manages a single or multiple
data nodes.
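As an illustrative sketch of this master-slave interaction (not part of the original answer), the Java snippet below uses the standard HDFS FileSystem client API: the client asks the NameNode, addressed via fs.defaultFS, for file and block metadata, while the actual block data is written to and read from DataNodes. The NameNode address and file path are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS points at the NameNode (master); the client obtains metadata from it
    // and then streams block data directly to/from the DataNodes (slaves).
    conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // hypothetical NameNode address
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/demo/hello.txt"); // hypothetical path
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("stored as blocks on DataNodes"); // blocks are replicated across DataNodes
    }
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF()); // read back using block locations supplied by the NameNode
    }
  }
}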
INPUT FORMATS
● InputFormat describes the input-specification for execution of the Map-Reduce job.
● In MapReduce job execution, InputFormat is the first step. InputFormat describes how to split and read
input files.
● InputFormat is responsible for splitting the input data file into records, which are used for the map-reduce operation.
○ InputFormat selects the files or other objects for input.
○ It defines the Data splits. It defines both the size of individual Map tasks and its potential execution
server.
○ InputFormat defines the RecordReader. It is also responsible for reading actual records from the
input files.
1. FileInputFormat: It is the base class for all file-based InputFormats. When we start a MapReduce job execution, FileInputFormat provides the path containing the files to read. It reads all the files and divides them into one or more InputSplits.
2. TextInputFormat: It is the default InputFormat. This InputFormat treats each line of each input file as a
separate record. It performs no parsing. TextInputFormat is useful for unformatted data or line-based
records like log files.
3. KeyValueTextInputFormat: It is similar to TextInputFormat, as it also treats each line of input as a separate record. The difference is that TextInputFormat treats the entire line as the value, whereas KeyValueTextInputFormat breaks the line itself into a key and a value.
4. SequenceFileInputFormat: It is an InputFormat that reads sequence files. Sequence files are binary files that store sequences of binary key-value pairs. They are block-compressed and provide direct serialization and deserialization of arbitrary data types.
5. NLineInputFormat: It is another form of TextInputFormat where the keys are the byte offsets of the lines and the values are the contents of the lines. With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input; if we want each mapper to receive a fixed number of lines of input, we use NLineInputFormat.
6. DBInputFormat: This InputFormat reads data from a relational database, using JDBC. It also loads small
datasets, perhaps for joining with large datasets from HDFS using MultipleInputs.
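As a sketch of how an InputFormat is chosen in practice (assuming the standard Hadoop MapReduce Java API; the input path is illustrative), the snippet below switches a job from the default TextInputFormat to KeyValueTextInputFormat, and shows in comments how NLineInputFormat could be configured to give each mapper a fixed number of lines.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class InputFormatConfig {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "input format demo");

    // Default is TextInputFormat (key = byte offset, value = whole line).
    // KeyValueTextInputFormat instead splits each line into a key and a value.
    job.setInputFormatClass(KeyValueTextInputFormat.class);

    // Alternatively, NLineInputFormat gives each mapper a fixed number of lines:
    // job.setInputFormatClass(NLineInputFormat.class);
    // NLineInputFormat.setNumLinesPerSplit(job, 100);

    // FileInputFormat supplies the path containing the files to read.
    FileInputFormat.addInputPath(job, new Path(args[0]));
  }
}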
OUTPUT FORMATS
● The outputFormat decides the way the output key-value pairs are written in the output files by
RecordWriter.
The OutputFormat and InputFormat functions are similar. OutputFormat instances are used to write to files on the local disk or in HDFS. In MapReduce job execution, on the basis of the output specification:
● the Hadoop MapReduce job checks that the output directory does not already exist;
● OutputFormat provides the RecordWriter implementation to be used to write the output files of the job. The output files are then stored in a FileSystem.
5. MultipleOutputs: This format allows writing data to files whose names are derived from the output
keys and values.
7. DBOutputFormat: It is the OutputFormat for writing to relational databases and HBase. This format sends the reduce output to a SQL table. It accepts key-value pairs in which the key has a type extending DBWritable.
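Similarly, a hedged sketch of configuring the output side (standard Hadoop MapReduce Java API; the output path is illustrative): the job writes key-value pairs with the default TextOutputFormat, SequenceFileOutputFormat is shown in a comment as the binary alternative, and the output directory must not already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputFormatConfig {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "output format demo");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Default: TextOutputFormat writes "key<TAB>value" lines via its RecordWriter.
    job.setOutputFormatClass(TextOutputFormat.class);
    // For block-compressed binary key-value output, SequenceFileOutputFormat could be used:
    // job.setOutputFormatClass(SequenceFileOutputFormat.class);

    // The output directory must not already exist, otherwise job submission fails.
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
  }
}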
Q.12) . Explain Job Scheduling in Map Reduce. How it is done in case of
ANS :
Job scheduling in MapReduce is an important aspect of the Hadoop framework that involves
allocating resources to different jobs and tasks running on a cluster. There are several scheduling
policies available in Hadoop, but the two most commonly used policies are the Fair Scheduler and
the Capacity Scheduler.
The Fair Scheduler is a scheduling policy that allows multiple jobs to run on a Hadoop cluster in a
fair and efficient manner. In the Fair Scheduler, jobs are allocated resources based on their relative
priorities, and the available resources are distributed evenly across all jobs. This ensures that no
single job dominates the cluster and that all jobs receive a fair share of the available resources.
The Fair Scheduler works by dividing the available resources into several pools, each with its own
set of allocation rules. Each job is assigned to a specific pool based on its priority, and the Fair
Scheduler ensures that each pool receives a fair share of the available resources. If a job requires
more resources than are available in its pool, it will be placed in a waiting queue until sufficient
resources become available.
The Capacity Scheduler is a scheduling policy that allocates resources based on pre-defined
capacities for different jobs and queues. In the Capacity Scheduler, resources are divided into
several queues, each with its own capacity limits. Each job is assigned to a specific queue based on
its priority, and the Capacity Scheduler ensures that each queue receives a fair share of the
available resources.
The Capacity Scheduler works by maintaining a set of queues, each with its own capacity limits. If a job requires more resources than are available in its queue, it is placed in a waiting queue until sufficient resources become available.
In both Fair Scheduler and Capacity Scheduler, the job scheduler uses a JobTracker and
TaskTrackers to allocate resources and manage the processing of data. The JobTracker is
responsible for tracking the progress of each job and allocating resources to the different tasks
that make up the job. The TaskTrackers are responsible for executing the tasks assigned to them by
the JobTracker and reporting their status back to the JobTracker.
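As a small illustrative sketch (assuming the Capacity Scheduler or Fair Scheduler is enabled on the cluster, and that a queue named "analytics" has been defined by the administrator — both assumptions), a client can direct a MapReduce job to a particular queue through the mapreduce.job.queuename property:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueSubmission {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // With the Capacity Scheduler (or a Fair Scheduler pool), the job is placed in a queue;
    // "analytics" is a hypothetical queue name configured by the cluster administrator.
    conf.set("mapreduce.job.queuename", "analytics");

    Job job = Job.getInstance(conf, "scheduled job");
    // ... set mapper, reducer, input and output as usual, then submit the job.
  }
}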
ANS :
In Hadoop, the JobTracker and TaskTrackers play a crucial role in processing data by managing the
allocation of resources and the execution of tasks on a Hadoop cluster.
The JobTracker is the central component of the Hadoop MapReduce framework. It is responsible
for coordinating the execution of MapReduce jobs on the cluster. When a user submits a
MapReduce job, the JobTracker divides the job into smaller tasks and assigns these tasks to the
TaskTrackers for execution. The JobTracker also monitors the progress of each task and manages
the allocation of resources on the cluster.
The TaskTrackers are responsible for executing the tasks assigned to them by the JobTracker. Each
TaskTracker is allocated a set of tasks to execute, and it communicates with the JobTracker to
report the progress of these tasks. The TaskTracker is also responsible for monitoring the health of
the machine it is running on, and it reports any failures or errors to the JobTracker.
The JobTracker and TaskTrackers work together to ensure the efficient processing of data on a
Hadoop cluster. The JobTracker manages the allocation of resources and the scheduling of tasks
on the cluster, while the TaskTrackers execute these tasks and report their progress back to the
JobTracker. This allows Hadoop to process large datasets in parallel across a cluster of computers,
enabling it to handle data-intensive workloads that cannot be processed on a single machine.
In summary, the JobTracker and TaskTrackers are essential components of the Hadoop MapReduce
framework that play a crucial role in processing data. The JobTracker manages the allocation of
resources and the scheduling of tasks, while the TaskTrackers execute these tasks and report their
progress back to the JobTracker. Together, they enable Hadoop to process large datasets in
parallel across a cluster of computers.
Q .13) .
Q .14) .
Q .15) .
Q .16) .
Q.17) .
Q .18) . What is data serialization? With proper examples discuss and
differentiate structured, unstructured and semi-structured data. Make a
note on how type of data affects data serialization
ANS :
Data serialization is the process of converting data objects present in complex data structures into a byte stream for storage, transfer and distribution purposes on physical devices.
Computer systems may vary in their hardware architecture, OS, addressing mechanisms.
Internal binary representations of data also vary accordingly in every environment. Storing
and exchanging data between such varying environments requires a platform-and-language-
neutral data format that all systems understand.
Once the serialized data is transmitted from the source machine to the destination machine,
the reverse process of creating objects from the byte sequence called deserialization is
carried out. Reconstructed objects are clones of the original object.
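A minimal sketch of serialization and deserialization, using plain Java object serialization (the Employee class and its field values are hypothetical; Hadoop itself uses Writable/Avro-style serialization rather than java.io.Serializable):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationDemo {
  // A hypothetical structured record.
  static class Employee implements Serializable {
    private static final long serialVersionUID = 1L;
    String name;
    int age;
    Employee(String name, int age) { this.name = name; this.age = age; }
  }

  public static void main(String[] args) throws Exception {
    Employee original = new Employee("Asha", 30);

    // Serialization: object -> byte stream (could be stored on disk or sent over a network).
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(original);
    }

    // Deserialization: byte stream -> a clone of the original object.
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      Employee copy = (Employee) in.readObject();
      System.out.println(copy.name + " " + copy.age); // prints: Asha 30
    }
  }
}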
Q. 20). Explain the working of MapReduce with reference to the WordCount program for a file having input as below: Red apple Red wine Green apple Green peas Pink rose Blue whale Blue sky Green city Clean city
ANS :
In Hadoop, MapReduce is a computation that decomposes large manipulation jobs
into individual tasks that can be executed in parallel across a cluster of servers. The
results of tasks can be joined together to compute final results.
MapReduce consists of 2 steps:
• Map Function – It takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples (Key-Value pair).
Input (set of data): Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN
• Reduce Function – Takes the output from Map as an input and combines
those data tuples into a smaller set of tuples.
Output (converted into a smaller set of tuples): (BUS,7), (CAR,7), (TRAIN,4)
Reduce: The intermediate key-value pairs that serve as input to the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on its key-value pairs as per the reducer algorithm written by the developer.
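The sketch below is a minimal WordCount in Java, essentially the canonical example written against the standard Hadoop MapReduce API: the mapper emits (word, 1) for every word of a line, and the reducer sums the counts per word. For the input given in the question (Red apple Red wine Green apple Green peas Pink rose Blue whale Blue sky Green city Clean city), the expected output (counts are case-sensitive) is: (Red,2), (apple,2), (wine,1), (Green,3), (peas,1), (Pink,1), (rose,1), (Blue,2), (whale,1), (sky,1), (city,2), (Clean,1).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      // Split each input line into words and emit (word, 1) for every word.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      // Sum all counts emitted for the same word.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}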
How Job tracker and the task tracker deal with MapReduce:
Job Tracker: The work of the Job Tracker is to manage all the resources and all the jobs across the cluster, and also to schedule each map task on the Task Tracker running on the same data node, since there can be hundreds of data nodes available in the cluster.
Task Tracker: The Task Tracker can be considered as the actual slaves that are working on
the instruction given by the Job Tracker. This Task Tracker is deployed on each of the nodes
available in the cluster that executes the Map and Reduce task as instructed by Job Tracker.
There is also one important component of the MapReduce architecture known as the Job History Server. The Job History Server is a daemon process that saves and stores historical information about tasks or applications; for example, the logs generated during or after job execution are stored on the Job History Server.
Q1 ) What is NoSQL database? Discuss key characteristics and advantages of NoSQL database.
ANS :
NoSQL databases are non-relational databases that provide a flexible and scalable alternative to traditional
relational databases.
NoSQL databases do not use tables and SQL (Structured Query Language) for data storage and retrieval.
Instead, they use various data models such as key-value, document, column-family, or graph databases to
store and manage data.
There are several types of NoSQL databases, each with its own data model:
1. Key-value stores: Key-value stores are the simplest type of NoSQL database and store data
as a collection of key-value pairs.
2. Document databases: Document databases store data in JSON or BSON documents, which
can be nested and have a flexible structure.
3. Column-family databases: Column-family databases store data in columns, rather than rows,
and are optimized for storing and querying large amounts of data.
4. Graph databases: Graph databases store data as nodes and edges, making them ideal for
analyzing complex relationships between data.
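As a tiny illustrative sketch (plain Java, with made-up user data) of the difference between the key-value and document models: a key-value store sees only an opaque value stored under a key, whereas a document store understands the structure of the document and can query or index its individual fields.

import java.util.HashMap;
import java.util.Map;

public class NoSqlModelsSketch {
  public static void main(String[] args) {
    // Key-value model: the database treats the value as an opaque blob stored under a key.
    Map<String, String> keyValueStore = new HashMap<>();
    keyValueStore.put("user:101", "{\"name\":\"Asha\",\"city\":\"Pune\"}");

    // Document model: the same data, but the store understands the JSON structure,
    // so individual fields (e.g. "city") and nested arrays can be queried and indexed.
    String document = "{ \"_id\": \"101\", \"name\": \"Asha\", \"city\": \"Pune\", "
        + "\"orders\": [ { \"item\": \"book\", \"qty\": 2 } ] }";

    System.out.println(keyValueStore.get("user:101"));
    System.out.println(document);
  }
}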
Key characteristics of NoSQL databases include:
1. Schema-less: NoSQL databases do not require a fixed schema to be defined before data can be stored. This means that they can handle unstructured or semi-structured data that does not fit into a relational model.
2. Distributed: NoSQL databases are designed to be distributed across multiple servers or
nodes. This allows them to scale horizontally by adding more nodes to the system, making it
easier to handle large volumes of data.
3. High availability and fault tolerance: NoSQL databases are designed to provide high
availability and fault tolerance, which means that the system remains operational even if one
or more nodes fail.
4. Performance: NoSQL databases are often faster and more efficient than traditional relational
databases, especially when it comes to handling large amounts of data.
5. Flexibility: NoSQL databases can handle a wide variety of data types and structures, making
it easier to work with complex data.
There are several advantages to using NoSQL databases:
1. Scalability: NoSQL databases can scale horizontally by adding more nodes to the system.
This makes it easy to handle large amounts of data and maintain performance as the system
grows.
2. Flexibility: NoSQL databases can handle a wide variety of data types and structures, making
it easier to work with complex data.
3. Speed: NoSQL databases are optimized for high-speed data access and retrieval, making
them ideal for applications that require real-time data processing.
4. Cost-effective: NoSQL databases are often more cost-effective than traditional relational
databases, especially when it comes to scaling and managing large volumes of data.
5. Availability: NoSQL databases are designed to provide high availability and fault tolerance,
ensuring that data is always available, even in the event of node failures.
Q2 ) Why NoSQL? Explain Advantages & Disadvantages.
ANS : Here are some reasons why NoSQL would be used :-
1. Multi-Model
2. Easily Scalable
3. Distributed
The masterless architecture of many NoSQL databases allows multiple copies of data to be maintained across different nodes. If one node goes down, another node will have a copy of the data for easy and fast access. This leads to zero downtime in the NoSQL database. When one considers the cost of downtime, this is a big deal.
Advantages:
1. Scalability: NoSQL databases are highly scalable and can easily handle a large amount of
data. They are designed to work with large data sets and can distribute the data across
multiple servers.
2. Flexibility: NoSQL databases are highly flexible and can handle different types of data such
as structured, semi-structured, and unstructured data. They allow for easy modification of
data models without having to make changes to the entire database schema.
3. Performance: NoSQL databases are optimized for performance and can handle high-speed
data processing. They are designed to perform well in distributed environments and can
handle large volumes of data with high concurrency.
4. Availability: NoSQL databases are designed for high availability and can withstand hardware
failures without losing data. They are highly fault-tolerant and can continue to operate even
in the event of network failures.
Disadvantages:
1. Complexity: NoSQL databases are generally more complex than traditional relational
databases, and may require specialized knowledge to manage and optimize them.
2. Limited functionality: NoSQL databases do not offer the same level of functionality as
traditional relational databases. They lack support for advanced query languages and
transaction processing.
3. Lack of standardization: There is no standard for NoSQL databases, and each database has
its own way of storing and retrieving data. This can make it difficult to switch between
different NoSQL databases.
4. Consistency: NoSQL databases sacrifice consistency for scalability and availability, meaning
that data may not always be consistent across different nodes in the database cluster.
Q3) Write differences between NoSQL and SQL.
ANS :
1. ACID compliance: SQL databases follow ACID (Atomicity, Consistency, Isolation, Durability), whereas NoSQL databases follow BASE (Basically Available, Soft state, Eventual consistency).
2. Data integrity: SQL databases are highly structured and enforce data integrity, whereas NoSQL databases are loosely structured and may not enforce data integrity.
3. Consistency model: SQL databases use a strong consistency model, whereas NoSQL databases use an eventual consistency model.
4. Hosting options: SQL databases are typically hosted on a single server, whereas NoSQL databases are typically hosted on multiple servers or in the cloud.
Q.5). What is NoSQL database?
ANS :
NoSQL databases, also known as non-relational databases, are a type of database that do not use
the traditional relational model with tables, rows and columns that is used in relational databases
such as SQL. NoSQL databases are designed to handle large amounts of unstructured or semi-
structured data, and offer more flexibility in data models, scalability, and availability compared to
traditional relational databases.
NoSQL databases can store various types of data such as documents, graphs, key-value pairs, and
columns. They are often used in modern applications where there is a need for high scalability and
high performance, such as in big data and real-time web applications. Examples of NoSQL
databases include MongoDB, Cassandra, Redis, and Neo4j. Unlike SQL databases, NoSQL
databases do not use SQL for querying and data manipulation, but offer their own query
languages or APIs.
NoSQL (Not Only SQL) and relational databases are two different types of database management
systems that have their own unique features and characteristics. Here are some of the key
differences between them:
1. Data model: Relational databases use a structured data model with tables, columns, and
rows, while NoSQL databases use a flexible data model, which can include documents, key-
value pairs, graphs, or column-family.
2. Schema: Relational databases enforce a strict schema, which means that all data must be
pre-defined and follow a consistent structure, while NoSQL databases have a flexible
schema, which allows for more dynamic and agile data modeling.
3. Scalability: NoSQL databases are designed to be horizontally scalable, which means they can
handle large volumes of data and high traffic loads by adding more servers or nodes to the
system. Relational databases, on the other hand, are vertically scalable, which means they
can handle more traffic and data by adding more processing power or memory to the
existing server.
4. Performance: NoSQL databases can offer faster performance for certain types of queries and
workloads, such as those involving complex relationships or unstructured data. Relational
databases may perform better for simple queries or those involving transactions and
consistency.
5. ACID compliance: Relational databases are generally designed to ensure ACID compliance,
which means they guarantee atomicity, consistency, isolation, and durability for transactions.
NoSQL databases may not always provide ACID guarantees, but may offer other forms of
consistency or eventual consistency.
6. Cost: NoSQL databases are often open-source or free to use, while relational databases
typically require licensing fees or upfront costs for commercial use.
7. Use cases: Relational databases are commonly used for transactional systems, such as online
banking or e-commerce, where data integrity and consistency are critical. NoSQL databases
are often used for big data applications, real-time analytics, and distributed systems, where
scalability and agility are important.
It's important to note that there are many types of NoSQL databases, each with its own unique features and characteristics, and that many modern databases offer hybrid features that combine elements of both NoSQL and relational models. Choosing the right type of database depends on the specific requirements and characteristics of the application.
ANS : (This answer is not confirmed; in short, it could not be found.)
Q.7 ). Explain different types of consistencies.
OR
ANS :
Read consistency and update consistency are two important concepts in NoSQL databases that affect how data is
read and updated in a distributed environment.
1. Read Consistency: Read consistency refers to the level of assurance that a read operation returns the
most up-to-date data. In NoSQL databases, read consistency can be achieved through various
techniques, such as quorum reads and vector clocks.
For example, consider a NoSQL database that has three replicas of data, and a quorum of two replicas is required
to read data. In this scenario, a read operation will return the most recent data that has been written to at least
two of the three replicas. This ensures that the read operation returns the most up-to-date data while still
maintaining high availability.
2. Update Consistency: Update consistency refers to the level of assurance that a write operation updates all
replicas of the data correctly. In NoSQL databases, update consistency can be achieved through various
techniques, such as quorum writes and version vectors.
For example, consider a NoSQL database that has three replicas of data, and a quorum of two replicas is required
to write data. In this scenario, a write operation will update at least two of the three replicas with the new data.
This ensures that all replicas are updated correctly, while still maintaining high availability.
In summary, read and update consistency are important concepts in NoSQL databases that ensure data
consistency and high availability in a distributed environment. By choosing the right consistency level for read
and write operations, developers can design scalable and reliable applications that meet the needs of their users.
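A minimal sketch (plain Java, not tied to any particular NoSQL product) of the quorum rule behind these examples: with N replicas, a read quorum R and a write quorum W are guaranteed to overlap in at least one replica whenever R + W > N, which is what lets a quorum read see the latest quorum write.

public class QuorumCheck {
  // With N replicas, a read quorum R and a write quorum W overlap in at least one replica
  // whenever R + W > N, so a quorum read is guaranteed to see the latest quorum write.
  static boolean readSeesLatestWrite(int n, int r, int w) {
    return r + w > n;
  }

  public static void main(String[] args) {
    int n = 3; // three replicas, as in the examples above
    System.out.println(readSeesLatestWrite(n, 2, 2)); // true:  R=2, W=2 -> the quorums overlap
    System.out.println(readSeesLatestWrite(n, 1, 1)); // false: reads may return stale data
  }
}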
ANS :
The CAP theorem, originally introduced as the CAP principle, can be used to explain
some of the competing requirements in a distributed system with replication. It is a tool
used to make system designers aware of the trade-offs while designing networked
shared-data systems.
The three letters in CAP refer to three desirable properties of distributed systems with
replicated data: consistency (among replicated copies), availability (of the system for
read and write operations) and partition tolerance (in the face of the nodes in the
system being partitioned by a network fault).
The CAP theorem states that it is not possible to guarantee all three of the desirable
properties – consistency, availability, and partition tolerance at the same time in a
distributed system with data replication.
The theorem states that networked shared-data systems can only strongly support two of
the following three properties:
• Consistency –
Consistency means that the nodes will have the same copies of a replicated
data item visible for various transactions. A guarantee that every node in a
distributed cluster returns the same, most recent and a successful write.
Consistency refers to every client having the same view of the data. There are
various types of consistency models. Consistency in CAP refers to sequential
consistency, a very strong form of consistency.
• Availability –
Availability means that each read or write request for a data item will either be
processed successfully or will receive a message that the operation cannot be
completed. Every non-failing node returns a response for all the read and write
requests in a reasonable amount of time. The key word here is “every”. In
simple terms, every node (on either side of a network partition) must be able to
respond in a reasonable amount of time.
• Partition Tolerance –
Partition tolerance means that the system can continue operating even if the
network connecting the nodes has a fault that results in two or more partitions,
where the nodes in each partition can only communicate among each other.
That means, the system continues to function and upholds its consistency
guarantees in spite of network partitions. Network partitions are a fact of life.
Distributed systems guaranteeing partition tolerance can gracefully recover from
partitions once the partition heals.
The use of the word consistency in CAP and its use in ACID do not refer to the same
identical concept.
In CAP, the term consistency refers to the consistency of the values in different copies of
the same data item in a replicated distributed system. In ACID, it refers to the fact that a
transaction will not violate the integrity constraints specified on the database schema.
Q.9). Explain Following terms:
1. Relaxing Consistency
2. Relaxing Durability
3. Quorum
4. Version stamp
ANS :
1. Relaxing Consistency: Relaxing consistency in NoSQL databases means allowing for some
inconsistency in the data to improve performance and scalability. This means that not all
replicas of the data are always kept in sync with each other, and a read operation may not
always return the most recent write. Relaxing consistency is a trade-off between consistency
and performance and is commonly used in distributed systems where high availability is
critical.
2. Relaxing Durability: Relaxing durability in NoSQL databases means allowing for the
possibility of data loss in exchange for improved performance. This means that not all writes
to the database are immediately persisted to disk, and some writes may be lost in the event
of a system failure. Relaxing durability is a trade-off between durability and performance
and is commonly used in systems where high write throughput is critical, and data loss is
acceptable.
3. Quorum: Quorum in NoSQL databases is the minimum number of replicas that must be
available to perform a read or write operation. A quorum is used to ensure that the read or
write operation is performed on a majority of the replicas, ensuring data consistency and
avoiding conflicts. For example, a quorum of two in a database with three replicas means
that at least two replicas must be available to perform a read or write operation.
4. Version stamp: A version stamp in NoSQL databases is a metadata tag that is associated
with each piece of data. The version stamp is used to determine the order of updates to the
data and is commonly used in conflict resolution. When multiple clients update the same
piece of data concurrently, the version stamp is used to determine which update is the most
recent and should be applied.
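A minimal sketch (plain Java, with hypothetical values) of how a version stamp supports conflict detection through optimistic concurrency control: an update is accepted only if the caller supplies the version it last read; otherwise it is rejected as a conflict.

public class VersionStampSketch {
  // A value paired with a version stamp: the stamp is bumped on every successful write.
  private String value = "initial";
  private long version = 0;

  synchronized boolean tryUpdate(long expectedVersion, String newValue) {
    if (version != expectedVersion) {
      return false; // another client updated first -> conflict, caller must re-read and retry
    }
    value = newValue;
    version++;      // the new version stamp records that this write is now the most recent
    return true;
  }

  public static void main(String[] args) {
    VersionStampSketch cell = new VersionStampSketch();
    System.out.println(cell.tryUpdate(0, "A")); // true:  version stamp moves from 0 to 1
    System.out.println(cell.tryUpdate(0, "B")); // false: stale version stamp -> conflict detected
  }
}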
In summary, NoSQL databases have different features and trade-offs compared to traditional
relational databases. By understanding concepts such as relaxing consistency, relaxing durability,
quorum, and version stamp, developers can design scalable and reliable applications that meet the
needs of their users.
Q.10) .
Q. 11) .
Q. 12) .
Q. 13) . Differentiate Strong Vs. Eventual Consistency.
1. Eventual Consistency : Eventual consistency is a consistency model that enables the data
store to be highly available. It is also known as optimistic replication & is key to distributed
systems. So, how exactly does it work? Let us understand this with the help of a use case.
Real World Use Case:
Think of a popular microblogging site deployed across the world in different geographical
regions like Asia, America, and Europe. Moreover, each geographical region has multiple
data center zones: North, East, West, and South.
Furthermore, each zone has multiple clusters, each with multiple server nodes running. So, we have many datastore nodes spread across the world that the micro-blogging site uses for persisting data. Since there are so many nodes running, there is no single point of failure, and the data store service is highly available: even if a few nodes go down, the persistence service is still up. Let’s say a celebrity makes a post on the website that everybody starts liking around the world.
At a point in time, a user in Japan likes a post which increases the “Like” count of the post
from say 100 to 101. At the same point in time, a user in America, in a different geographical
zone, clicks on the post, and he sees “Like” count as 100, not 101.
This is simply because the newly updated value of the post’s “Like” counter needs some time to move from Japan to America and update the server nodes running there. Though the value of the counter at that point in time was 101, the user in America sees the old, inconsistent value.
But when he refreshes his web page after a few seconds “Like” counter value shows as 101.
So, data was initially inconsistent but eventually got consistent across server nodes deployed
around the world. This is what eventual consistency is.
2. Strong Consistency: Strong Consistency simply means the data must be strongly consistent at all times. All the server nodes across the world should contain the same value for an entity at any point in time. The only way to implement this behavior is by locking down the nodes while they are being updated.
Real World Use Case:
Let’s continue the same Eventual Consistency example from the previous lesson. To ensure
Strong Consistency in the system, when a user in Japan likes posts, all nodes across different
geographical zones must be locked down to prevent any concurrent updates.
This means at one point in time, only one user can update the post “Like” counter value. So,
once a user in Japan updates the “Like” counter from 100 to 101. The value gets replicated
globally across all nodes. Once all nodes reach consensus, locks get lifted. Now, other users
can Like posts.
If the nodes take a while to reach a consensus, they must wait until then. Well, this is surely
not desired in the case of social applications. But think of a stock market application where
the users are seeing different prices of the same stock at one point in time and updating it
concurrently. This would create chaos. Therefore, to avoid this confusion we need our
systems to be Strongly Consistent.
The nodes must be locked down for updates. Queuing all requests is one good way of
making a system Strongly Consistent. The Strong Consistency model limits the capability of the system to be Highly Available and to perform concurrent updates. This is how strongly consistent ACID transactions are implemented.
------------------------------------------------------------------DISCLAIMER ! --------------------------------------------------------------------
PLEASE FIND THE ANSWERS THAT ARE STILL MISSING, EDIT THEM BACK IN, AND THEN WRITE THEM DOWN.
AND IF YOU CANNOT FIND AN ANSWER, THEN AT LEAST MESSAGE AND CONSULT SOMEONE ELSE ABOUT IT.