Unit 2

Copyright © All Rights Reserved
Unit-II : NOSQL Data Management

• Introduction to NoSQL,
• Aggregate data models,
• Key-value and document data Models,
• Relationships, Graph databases,
• Schemaless databases,
• Materialized views, distribution models,
• Sharding, master-slave replication,
• Peer-to-peer replication,
• Sharding and replication
Introduction to NoSQL

• NoSQL is a type of database management system (DBMS) that is
designed to store and handle large volumes of unstructured and
semi-structured data, unlike traditional relational databases, which
use tables with pre-defined schemas to store data.

• NoSQL databases use flexible data models that can adapt to
changes in data structures and are capable of scaling horizontally to
handle growing amounts of data.

• The term NoSQL originally referred to "non-SQL" or "non-relational"
databases, but the term has since evolved to mean "not
only SQL," as NoSQL databases have expanded to include a wide
range of different database architectures and data models.
In order to better understand what NoSQL is, we should compare it with relational
databases:

Criteria            Relational Database Management   NoSQL Database Management
Data model          Tables and schemas               Partition keys to retrieve data
ACID properties     Strictly followed                No strict adherence
Scalability         Vertical scalability             Horizontal scalability
Data manipulation   Using queries in SQL,            Using object-based APIs
                    executed by the RDBMS
Velocity of data    Moderate                         Very high
Suitability         Structured data                  Structured, semi-structured
                                                     and unstructured data

SQL                                      NoSQL
Relational database management system    Non-relational or distributed
(RDBMS)                                  database system
These databases have a fixed, static     They have a dynamic schema
or predefined schema
These databases are not suited for       These databases are best suited for
hierarchical data storage                hierarchical data storage
These databases are best suited for      These databases are not so good
complex queries                          for complex queries
Vertically scalable                      Horizontally scalable
Follows ACID properties                  Follows CAP (consistency,
                                         availability, partition tolerance)
Features of NoSQL :
• Dynamic schema: NoSQL databases do not have a fixed
schema and can accommodate changing data structures without
the need for migrations or schema alterations.

• Horizontal scalability: NoSQL databases are designed to scale
out by adding more nodes to a database cluster, making them
well-suited for handling large amounts of data and high levels of
traffic.

• Document-based: Some NoSQL databases, such as MongoDB,
use a document-based data model, where data is stored in a
semi-structured format, such as JSON or BSON.

• Key-value-based: Other NoSQL databases, such as Redis, use a
key-value data model, where data is stored as a collection of
key-value pairs.

• Column-based: Some NoSQL databases, such as Cassandra, use a
column-based data model, where data is organized into columns
instead of rows.

• Distributed and high availability: NoSQL databases are often
designed to be highly available and to automatically handle node
failures and data replication across multiple nodes in a database
cluster.

• Flexibility: NoSQL databases allow developers to store and
retrieve data in a flexible and dynamic manner, with support for
multiple data types and changing data structures.

• Performance: NoSQL databases are optimized for high
performance and can handle a high volume of reads and writes,
making them suitable for big data and real-time applications.
Advantages of NoSQL:
There are many advantages of working with NoSQL databases such as
MongoDB and Cassandra. The main advantages are high scalability and
high availability.
• High scalability: NoSQL databases use sharding for horizontal
scaling. Sharding means partitioning the data and placing the
partitions on multiple machines in such a way that the order of the
data is preserved. Vertical scaling means adding more resources to
the existing machine, whereas horizontal scaling means adding more
machines to handle the data. Vertical scaling eventually runs into
the limits of a single machine, while horizontal scaling can continue
by adding machines. Examples of horizontally scaling databases are
MongoDB, Cassandra, etc.

• Flexibility: NoSQL databases are designed to handle unstructured or
semi-structured data, which means that they can accommodate
dynamic changes to the data model. This makes NoSQL databases a
good fit for applications that need to handle changing data
requirements.

• High availability: The automatic replication feature in NoSQL
databases makes them highly available, because after a failure the
data can be restored from a replica to the last consistent state.

• Scalability: NoSQL databases are highly scalable, which means
that they can handle large amounts of data and traffic with ease.
This makes them a good fit for applications that need to handle
large amounts of data or traffic.

• Performance: NoSQL databases are designed to handle large
amounts of data and traffic, which means that for such workloads
they can offer improved performance compared to traditional
relational databases.

• Cost-effectiveness: NoSQL databases are often more cost-effective
than traditional relational databases, as they are typically less
complex and do not require expensive hardware or software.

• Agility: Ideal for agile development.


Disadvantages of NoSQL:
• Lack of standardization: There are many different types of NoSQL
databases, each with its own unique strengths and weaknesses. This
lack of standardization can make it difficult to choose the right
database for a specific application.

• Lack of ACID compliance: Many NoSQL databases are not fully
ACID-compliant, which means that they do not guarantee the
consistency, integrity, and durability of data to the same degree.
This can be a drawback for applications that require strong data
consistency guarantees.

• Narrow focus: NoSQL databases have a narrow focus. They are
mainly designed for storage and provide little functionality beyond
it. Relational databases are a better choice than NoSQL in the field
of transaction management.

• Open-source: Many NoSQL databases are open source, and there is
no reliable standard for NoSQL yet. In other words, two database
systems are unlikely to be equivalent.

• Lack of support for complex queries: NoSQL databases are not
designed to handle complex queries, which means that they are not a good
fit for applications that require complex data analysis or reporting.
• Lack of maturity: NoSQL databases are relatively new and lack the
maturity of traditional relational databases. This can make them less
reliable and less secure than traditional databases.
• Management challenge: The purpose of big data tools is to make the
management of a large amount of data as simple as possible, but this is
not so easy. Data management in NoSQL is much more complex than in a
relational database. NoSQL, in particular, has a reputation for being
challenging to install and even more demanding to manage on a daily basis.
• GUI is not available: GUI tools for accessing the database are not
widely available in the market.
• Backup: Backup is a weak point for some NoSQL databases. Taking a
consistent backup of a distributed deployment, for example a sharded
MongoDB cluster, needs extra coordination.
• Large document size: Some database systems like MongoDB and
CouchDB store data in JSON format. This means that documents can be
quite large (a concern for big data, network bandwidth and speed), and
descriptive key names actually hurt since they increase the document size.
Aggregate Data Models in NoSQL
• Aggregate means a collection of objects that are treated as a unit. In
NoSQL databases, an aggregate is a collection of data that interacts as a
unit. Moreover, these units of data, or aggregates, form the
boundaries for ACID operations.

• Aggregate data models make it easier for the database to
manage data storage over clusters, as an aggregate can
reside on any of the machines. Whenever data is retrieved from the
database, all the data of the aggregate comes along with it.

• Aggregate data models generally do not support ACID transactions
that span multiple aggregates; atomic updates are usually limited to a
single aggregate. With the help of aggregate data models, you can
easily perform OLAP operations on the database.

• You can achieve high efficiency with aggregate data models in a
NoSQL database if the data transactions and interactions take place within
the same aggregate.
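To make the idea of an aggregate as a unit concrete, here is a minimal sketch in Python. The in-memory `store` dict, the `save_order` and `load_order` helpers and the order fields are all assumptions made for illustration; they do not correspond to any particular NoSQL product.

```python
import copy

# Hypothetical in-memory "database": each key maps to one whole aggregate.
store = {}

def save_order(order):
    """Write the whole aggregate in one step (the unit of atomic update)."""
    store[order["order_id"]] = copy.deepcopy(order)

def load_order(order_id):
    """Read the whole aggregate back; all nested data comes along with it."""
    return copy.deepcopy(store[order_id])

# One aggregate: the order, its customer details and its line items together.
order = {
    "order_id": "o-1001",
    "customer": {"id": "c-7", "name": "Anay"},
    "items": [
        {"sku": "pen", "qty": 2, "price": 10},
        {"sku": "book", "qty": 1, "price": 150},
    ],
}
save_order(order)

loaded = load_order("o-1001")
total = sum(item["qty"] * item["price"] for item in loaded["items"])
print(total)  # 170
```

Because everything needed to compute the total lives inside one aggregate, the read touches a single key, which is what lets the aggregate sit on any one machine of a cluster.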
Types of Aggregate Data Models in NoSQL Databases
The aggregate data models in NoSQL are broadly classified into
4 data models, listed below:
1. Key-Value Model
• The Key-Value Data Model contains the key or an ID
used to access or fetch the data of the aggregate
corresponding to the key. In this aggregate data model,
the aggregate is opaque to the database: the value is
stored as a blob that can only be looked up through its key.
Use Cases:
• Key-value data models are used for storing user session
data.
• Key-value data models are used for
maintaining schema-less user profiles.
• They are used for storing user preferences and
shopping cart data
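The session-storage use case can be sketched as follows. The `KeyValueStore` class, the `session:42` key format and the session fields are hypothetical; the point is only that the store sees keys and byte values and never looks inside the value.

```python
import json

class KeyValueStore:
    """Minimal key-value store sketch: values are bytes, looked up only by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value: bytes):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

sessions = KeyValueStore()

# The application, not the store, decides how the value is encoded.
session = {"user": "bhagya", "cart": ["pen", "book"]}
sessions.put("session:42", json.dumps(session).encode("utf-8"))

raw = sessions.get("session:42")
restored = json.loads(raw.decode("utf-8"))
print(restored["cart"])  # ['pen', 'book']
```

Since the store cannot query by anything except the key, a lookup such as "all sessions containing a pen" would need a different data model or an extra index maintained by the application.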
2. Document Model
The Document Data Model allows access to the parts of
aggregates. In this aggregate data model, the data can
be accessed in a flexible way. The database stores and
retrieves documents, which can be XML, JSON, BSON, etc. There
are some restrictions on the data structure and data types of the
aggregates that can be used in this data model.

Use Cases:
• Document data models are widely used in
e-commerce platforms.
• They are used for storing data from content management systems.
• Document data models are well suited for blogging and
analytics platforms.
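A small sketch of what access to the parts of an aggregate means in practice: documents in one collection can carry different fields, and queries can match or return individual fields. The `find` and `project` helpers and the product documents (including the author name) are made-up sample code and data, not the API of a real document database.

```python
# Documents in the same collection do not need to share the same fields.
products = [
    {"_id": 1, "name": "Pen", "price": 10, "tags": ["stationery"]},
    {"_id": 2, "name": "Book", "price": 150, "author": "R. Nayak"},
    {"_id": 3, "name": "Bag", "price": 700, "tags": ["travel"], "color": "blue"},
]

def find(collection, **criteria):
    """Return the documents whose fields equal every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

def project(collection, *fields):
    """Return only the requested fields, skipping fields a document lacks."""
    return [{f: doc[f] for f in fields if f in doc} for doc in collection]

cheap = find(products, price=10)
names_and_authors = project(products, "name", "author")
print(cheap[0]["name"])       # Pen
print(names_and_authors[1])   # {'name': 'Book', 'author': 'R. Nayak'}
```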
3. Column Family Model
• The column family model is an aggregate data model, usually with
a Bigtable-style structure, whose databases are referred to as column
stores. It is also called a two-level map, as it offers a two-level
aggregate structure.

• In this aggregate data model, the first level of the column
family contains the keys that act as row identifiers and are used to select
the aggregate data, whereas the second-level values are referred to as
columns.

Use Cases:
• Column family data models are used in systems that maintain counters.
• They are used for services that have
expiring usage.
• They are used in systems that have heavy write requests.
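The two-level map can be sketched with nested dictionaries: a row key selects the row, and a column name selects a value inside it. The `family:column` naming, the helper functions and the sample rows are assumptions for illustration only; real column stores add storage and distribution details that are left out here.

```python
# First level: row key -> row. Second level: column name -> value.
table = {}

def put(row_key, column, value):
    table.setdefault(row_key, {})[column] = value

def increment(row_key, column, by=1):
    """Counter column, as in the 'systems that maintain counters' use case."""
    row = table.setdefault(row_key, {})
    row[column] = row.get(column, 0) + by
    return row[column]

def get_family(row_key, family):
    """Return only the columns of one family, e.g. everything under 'profile:'."""
    prefix = family + ":"
    return {c: v for c, v in table.get(row_key, {}).items() if c.startswith(prefix)}

put("user:1", "profile:name", "Tanmay")
put("user:1", "profile:branch", "Computer")
increment("user:1", "stats:logins")
logins = increment("user:1", "stats:logins")

profile = get_family("user:1", "profile")
print(profile)  # {'profile:name': 'Tanmay', 'profile:branch': 'Computer'}
print(logins)   # 2
```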
Working of Columnar Data Model:
A columnar data model organizes information into columns instead of rows.
The columns can still be read together like the tables in a relational
database, but because each column is stored separately, this type of data
model is more flexible. The example below will help in understanding the
columnar data model:

Row-Oriented Table:
S.No.  Name       Course  Branch       ID
01.    Tanmay     B-Tech  Computer     2
02.    Abhishek   B-Tech  Electronics  5
03.    Samriddha  B-Tech  IT           7
04.    Aditi      B-Tech  E & TC       8

Column-Oriented Table:
S.No.  Name       ID
01.    Tanmay     2
02.    Abhishek   5
03.    Samriddha  7
04.    Aditi      8

S.No.  Course  ID
01.    B-Tech  2
02.    B-Tech  5
03.    B-Tech  7
04.    B-Tech  8

S.No.  Branch       ID
01.    Computer     2
02.    Electronics  5
03.    IT           7
04.    E & TC       8

Columnar Data Model uses the concept of keyspace,
which is like a schema in relational models.
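The difference between the two layouts can be shown with the student data from the example. This is a simplified in-memory sketch; the variable names are assumptions, and a real column store would also compress and persist each column.

```python
# Row-oriented layout: one record per student.
rows = [
    {"s_no": 1, "name": "Tanmay", "course": "B-Tech", "branch": "Computer", "id": 2},
    {"s_no": 2, "name": "Abhishek", "course": "B-Tech", "branch": "Electronics", "id": 5},
    {"s_no": 3, "name": "Samriddha", "course": "B-Tech", "branch": "IT", "id": 7},
    {"s_no": 4, "name": "Aditi", "course": "B-Tech", "branch": "E & TC", "id": 8},
]

# Column-oriented layout: one list per column, all in the same row order.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An aggregation only has to read the single column it needs.
btech_count = columns["course"].count("B-Tech")
id_sum = sum(columns["id"])

print(columns["branch"])  # ['Computer', 'Electronics', 'IT', 'E & TC']
print(btech_count)        # 4
print(id_sum)             # 22
```

Counting the B-Tech students reads only the `course` list, whereas the row layout would have to visit every full record to answer the same question.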
Advantages of Columnar Data Model:
• Well structured: These data models compress well, because each column holds values of the
same kind, so storage is well organized.
• Flexibility: There is a large amount of flexibility, as the columns do not need to look like each
other, which means one can add new and different columns without disrupting the whole database.
• Aggregation queries are fast: Most importantly, aggregation queries are quite fast because
the information they need is stored together in a column. An example would be adding up the total
number of students enrolled in one year.
• Scalability: It can be spread across large clusters of machines, even numbering in the thousands.
• Load times: A row table can typically be loaded in a few seconds, so load times are very
good.
Disadvantages of Columnar Data Model:
• Designing an indexing schema: Designing an effective, working schema is difficult and very
time-consuming.
• Suboptimal data loading: Incremental data loading is suboptimal and should be avoided, but this
might not be an issue for some users.
• Security vulnerabilities: If security is one of the priorities, note that the columnar
data model lacks built-in security features; in this case, one must look into relational databases.
• Online Transaction Processing (OLTP): OLTP applications are
also not compatible with columnar data models because of the way data is stored.
Applications of Columnar Data Model:
• Columnar data models are widely used in blogging platforms.
• They are used in content management systems, the category that includes WordPress, Joomla, etc.
• They are used in systems that maintain counters.
• They are used in systems that require heavy write requests.
• They are used in services that have expiring usage.
Graph Database
A graph database (GDB) is a database that uses graph structures for storing data. It uses
nodes, edges, and properties instead of tables or documents to represent and store data. The
edges represent relationships between the nodes. This helps in retrieving data more easily and,
in many cases, with one operation. Graph databases are commonly referred to as NoSQL.
Ex: Neo4j, Amazon Neptune, ArangoDB etc.

Representation:
The graph database is based on graph theory. The data is stored in the nodes of the graph, and
the relationships between the data are represented by the edges between the nodes.
When do we need a Graph Database?
1. It solves many-to-many relationship problems
• Relationships such as friends of friends are many-to-many
relationships.
• It is used when the equivalent query in a relational database is very complex.

2. When relationships between data elements are more important
• For example, a profile has some specific information in it,
but the major selling point is the relationship between the different profiles,
that is, how you get connected within a network.
• In the same way, a graph database may hold many user data elements,
but the relationships between them are the deciding factor
for all the data elements stored inside the
graph database.

3. Low latency with large-scale data
• When you add lots of relationships in a relational database, the data sets
become huge, and when you query them, the queries become more
complex and take longer than usual.
• A graph database, however, is specifically designed for this
purpose, and one can query relationships with ease.
Why do Graph Databases matter?
Because graphs are good at handling
relationships, some databases store data in
the form of a graph.

Example: We have a social network in
which five friends are all connected. These
friends are Anay, Bhagya, Chaitanya,
Dilip, and Erica. A database that
stores their personal information may
look something like this, with one table of people and a second table
listing each friendship as a (user_id, friend_id) pair:

id  first name  last name  email                  phone
1   Anay        Agarwal    anay@example.net       555-111-5555
2   Bhagya      Kumar      bhagya@example.net     555-222-5555
3   Chaitanya   Nayak      chaitanya@example.net  555-333-5555
4   Dilip       Jain       dilip@example.net      555-444-5555
5   Erica       Emmanuel   erica@example.net      555-555-5555

user_id  friend_id
1        2
1        3
1        4
1        5
2        1
2        3
2        4
2        5
3        1
3        2
3        4
3        5
4        1
4        2
4        3
4        5
5        1
5        2
5        3
5        4
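The same social network can be sketched as a graph in Python, with each person as a node and each (user_id, friend_id) pair as an edge. The adjacency-set representation and the helper functions are assumptions for illustration, and the sixth person added at the end is hypothetical and not part of the example data.

```python
from collections import defaultdict

names = {1: "Anay", 2: "Bhagya", 3: "Chaitanya", 4: "Dilip", 5: "Erica"}

# The 20 (user_id, friend_id) pairs: every person is a friend of every other.
friendships = [(u, f) for u in names for f in names if u != f]

# Adjacency sets: node -> set of directly connected nodes.
graph = defaultdict(set)
for user_id, friend_id in friendships:
    graph[user_id].add(friend_id)

def friends_of(user_id):
    return sorted(names[f] for f in graph[user_id])

def friends_of_friends(user_id):
    """People two hops away who are not already direct friends."""
    direct = graph[user_id]
    two_hops = set()
    for friend in direct:
        two_hops |= graph[friend]
    return two_hops - direct - {user_id}

print(friends_of(1))  # ['Bhagya', 'Chaitanya', 'Dilip', 'Erica']
before = friends_of_friends(1)  # empty: everyone is already a direct friend

# Hypothetical sixth person who only knows Erica.
names[6] = "Farah"
graph[5].add(6)
graph[6].add(5)
after = friends_of_friends(1)
print(after)  # {6}
```

The friends-of-friends lookup follows edges directly from node to node, which is the kind of traversal that would need repeated self-joins on the friendship table in a relational database.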
Advantages:
• The graph model handles frequent schema changes, large volumes of data, real-time
query response times, and more intelligent data activation requirements.

Disadvantages:
• Note that graph databases aren't always the best solution for an application. We
need to assess the needs of the application before deciding on the architecture.

Limitations of Graph Databases:
• Graph databases may not offer a better choice than other NoSQL variations.
• If the application needs to scale horizontally, this may introduce poor performance.
• They are not very efficient when all nodes with a given parameter need to be updated.
Relationships
• Relationships are associations between different collections of
data in a database. You can create relationships and define
their object properties for NoSQL databases using the following
methods:
Embedding
• Embeds the related data in collections into a single or
multiple structured collections.
Referencing
• Relates the data in multiple collections as identifying or
non-identifying relationships.
Embedding Method
• You can select either of the following options to create
relationships for collections in NoSQL databases:
Normalization
• Splits the fields in a collection into multiple collections
based on the selected relationship type
Denormalization
• Embeds different collections into a single collection based
on the selected embedding type
Referencing Method

• Identifying: Specifies that the child collection is dependent on the
parent collection for its identity and cannot exist without it

• Non-Identifying: Specifies that the child collection is not dependent on
the parent collection for its identity and can exist without it

• Views: Specifies the relationships between the View objects
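As a rough sketch of the two methods, the same author-and-posts data is modelled below once by embedding and once by referencing. The collection names, field names and sample titles are made up for this example.

```python
# Embedding (denormalized): the posts live inside the author document.
author_embedded = {
    "_id": "a1",
    "name": "Dilip",
    "posts": [{"title": "Intro to NoSQL"}, {"title": "Sharding basics"}],
}

# Referencing (normalized): posts are a separate collection that points
# back to the author through author_id.
authors = {"a1": {"_id": "a1", "name": "Dilip"}}
posts = [
    {"_id": "p1", "author_id": "a1", "title": "Intro to NoSQL"},
    {"_id": "p2", "author_id": "a1", "title": "Sharding basics"},
]

def titles_embedded(author):
    """One read: the related data arrives with the parent document."""
    return [post["title"] for post in author["posts"]]

def titles_referenced(author_id):
    """Second lookup: related documents are found through the reference."""
    return [post["title"] for post in posts if post["author_id"] == author_id]

print(titles_embedded(author_embedded))  # ['Intro to NoSQL', 'Sharding basics']
print(titles_referenced("a1"))           # ['Intro to NoSQL', 'Sharding basics']
```

Embedding answers the question with a single read but duplicates data if posts are shared; referencing keeps each post in one place at the cost of an extra lookup.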


Schemaless database
• Schemaless databases are a type of NoSQL database that do not have a
predefined schema or structure for data. This means that data can be
inserted and retrieved without adhering to a specific structure, and the
database can adapt to changes in data over time without requiring schema
migrations or changes.
Benefits of using schemaless databases:
Flexibility: Schemaless databases allow for greater flexibility in data
modeling, as there are no constraints on the structure of the data. This
allows for a more agile development process, as changes to the data
model can be made quickly and easily.

Scalability: Schemaless databases are designed for scalability, as they can
handle large amounts of unstructured data with ease. This makes them
well-suited for use cases such as big data analytics and real-time data processing.

Reduced complexity: Schemaless databases can reduce the complexity of
data modeling and development, as there is no need to define and
maintain a complex schema. This can lead to faster development times and
lower maintenance costs.
Advantages
• Ability for rapid and easy scaling of servers (sharding/clustering)
supported by most NoSQL technologies.
• Little to no requirement to conform to a rigid schema
• No enforcement of data type limitations
• Ability to store all formats, including missing fields
• Ability to store both unstructured and structured data
• Faster to set up due to no schema model to be designed
• Generally less overhead and better performance
• Assists applications to be backward and forward compatible
• Developers have control of what objects (schema) they want to build
with ease and on the fly, without the need to involve a Database
Administrator (DBA)
• Popular solution for collecting mountains of application logs!
Challenges associated with using schemaless databases:
Lack of structure: The lack of a predefined schema can make it
more difficult to understand and analyze the data, especially as
the data set grows in size and complexity.
Data quality issues: Without a defined schema, it can be more
difficult to ensure that data is consistent and accurate. This can
lead to data quality issues and errors in analysis.

Performance issues: In some cases, schemaless databases may
have performance issues due to the lack of structure and indexing.
Queries may need to scan large amounts of data to find relevant
information, leading to slower query times.

When using schemaless databases, it is important to carefully
consider the specific requirements of the application and the
potential trade-offs between flexibility, scalability, and data quality.
Disadvantages
• All documents require parsing because there are no columns
• This is mitigated as the schema-less design is stored in memory

• Lack of metadata: the application requires investigation to learn
this information (field validation, data size and types)

• No control over what data goes in; generally there are no filters for
bad data while it is being loaded

• Higher network traffic, due to sending entire documents

• Inability to apply database normalization standards, which can lead to
higher data storage demands

• Loss of functionality that SQL databases would give, especially
if you need automatic enforcement or referential integrity
Materialized view
• A materialized view is a duplicate data table created by combining
data from multiple existing tables for faster data retrieval.
• A materialized view is a view where the query has been executed
and the results have been stored as a physical table.
• In fact, it is a real table that you can index, declare constraints on, etc.
When accessing a materialized view, you are accessing the
pre-computed results.
• You are NOT executing the underlying query. There are several
strategies for keeping the materialized view up to date.
Difference Between Materialized View And View:

View                                     Materialized View (Snapshot)
• A view is the logical structure of     • A materialized view (snapshot) is also
  the table, which retrieves data          a logical structure, but its data is
  from one or more tables.                 physically stored in the database.
• You need the CREATE VIEW privilege     • You need the CREATE MATERIALIZED VIEW
  to create a simple or complex view.      privilege to create a materialized view.
• Data access is not as fast as with     • Data retrieval is fast compared to a
  materialized views.                      simple view because the data is read
                                           directly from its physical location.
• There are 2 types of views:            • There are the following types of
  1. Simple view                           materialized views:
  2. Complex view                          1. Refresh on auto
                                           2. Refresh on demand
• At application level, views are        • Materialized views are used in
  used to restrict data from the           data warehousing.
  database.
Create a view:
    CREATE VIEW V_Employee AS
    SELECT E.Employee_num, E.Employee_name, D.Department_Name
    FROM Employee E, Department D
    WHERE E.Dept_no = D.Dept_no;
• Fetch the records from the view:
    SELECT * FROM V_Employee;
• Suppose this fetches 1 million records with their associated departments,
and that fetching them takes 2 minutes, that is, 120 seconds.
Let us create a materialized view that refreshes automatically (the
statement below assumes Oracle-style syntax):
    CREATE MATERIALIZED VIEW MV_Employee
    REFRESH ON COMMIT
    AS
    SELECT E.Employee_num, E.Employee_name, D.Department_Name
    FROM Employee E, Department D
    WHERE E.Dept_no = D.Dept_no;
• We have now created a materialized view in SQL. Fetch the records from it:
    SELECT * FROM MV_Employee;
• Suppose this fetches the same 1 million records in 60 seconds. In this
illustration, the materialized view doubles performance.
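The difference between a view and a materialized view can be tried out with Python's built-in `sqlite3` module. SQLite has no materialized views, so this sketch emulates one with an ordinary table that is rebuilt by a hypothetical `refresh_mv` helper ("refresh on demand"); the sample rows are made up, and the timings mentioned above are not reproduced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Department (Dept_no INTEGER PRIMARY KEY, Department_Name TEXT);
    CREATE TABLE Employee (Employee_num INTEGER PRIMARY KEY,
                           Employee_name TEXT, Dept_no INTEGER);
    INSERT INTO Department VALUES (10, 'Sales'), (20, 'IT');
    INSERT INTO Employee VALUES (1, 'Anay', 10), (2, 'Bhagya', 20);
""")

JOIN_QUERY = """
    SELECT E.Employee_num, E.Employee_name, D.Department_Name
    FROM Employee E JOIN Department D ON E.Dept_no = D.Dept_no
"""

# Ordinary view: the join runs again every time the view is queried.
conn.execute("CREATE VIEW V_Employee AS " + JOIN_QUERY)

def refresh_mv(connection):
    """Emulated materialized view: store the query results as a real table."""
    connection.execute("DROP TABLE IF EXISTS MV_Employee")
    connection.execute("CREATE TABLE MV_Employee AS " + JOIN_QUERY)

refresh_mv(conn)

# A new employee appears in the view at once, but not in the stored results.
conn.execute("INSERT INTO Employee VALUES (3, 'Erica', 20)")
view_count = conn.execute("SELECT COUNT(*) FROM V_Employee").fetchone()[0]
mv_stale = conn.execute("SELECT COUNT(*) FROM MV_Employee").fetchone()[0]

refresh_mv(conn)
mv_fresh = conn.execute("SELECT COUNT(*) FROM MV_Employee").fetchone()[0]
print(view_count, mv_stale, mv_fresh)  # 3 2 3
```

The stale count before the second refresh shows the trade-off: reads from the stored table skip the join, but the stored results are only as current as the last refresh.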
Advantages of materialized views
Speed
• Read queries scan through different tables and rows of data to gather the necessary
information. With materialized views, you can query data directly from your new view
instead of having to compute new information every time. The more complex your query is,
the more time you will save using a materialized view.
Data storage simplicity
• Materialized views allow you to consolidate complex query logic in one table. This makes
data transformations and code maintenance easier for developers. It can also help make
complex queries more manageable. You can also use data subsetting to decrease the amount
of data you need to replicate in the view.
Consistency
• Materialized views provide a consistent view of data captured at a specific moment. You can
configure read consistency in materialized views and make data accessible even in multi-
user environments where concurrency control is essential.
• Materialized views also provide data access even if the source data changes or is deleted.
Over time, this means that you can use materialized views to report on time-based data
snapshots. The level of isolation from source tables ensures that you have a greater degree of
consistency across your data.
Improved access control
• You can use a materialized view to control who has access to specific data. You can filter
information for users without giving them access to the source tables. This approach is
practical if you want to control who has access to what data and how much of it they can see
and interact with.
Distribution Models
• The primary driver of interest in NoSQL has been its ability to run
databases on a large cluster.
• As data volumes increase, it becomes more difficult and expensive
to scale up, that is, to buy a bigger server to run the database on.
• Aggregate orientation fits well with scaling out because the
aggregate is a natural unit to use for distribution.
• Depending on your distribution model, you can get a data store that
will give you the ability to handle larger quantities of data, the
ability to process a greater read or write traffic, or more availability
in the face of network slowdowns or breakages
• Broadly, there are two paths to data distribution: replication and
sharding.
• Replication takes the same data and copies it over multiple nodes.
Sharding puts different data on different nodes.
• Replication and sharding are orthogonal techniques: You can use
either or both of them.
Sharding
• Sharding is a method of storing data records across many server
instances. This is done through storage area networks to make
hardware perform like a single server.
• The NoSQL framework is natively designed to support automatic
distribution of the data across multiple servers including the query
load.
• Sharding is a method for distributing a single dataset across multiple
databases, which can then be stored on multiple machines.
• This allows for larger datasets to be split into smaller chunks and
stored in multiple data nodes, increasing the total storage capacity of
the system
In its simplest configuration (a single shard), a sharded cluster will look like this:
[Figure: a sharded cluster with a single shard]

Sharding Benefits
• Increased read/write throughput: You can
take advantage of parallelism by distributing
the data set across multiple shards. Let's say
one shard can process one thousand operations
per second. In the ideal case, each additional shard
would add another one thousand
operations per second of throughput.
• Increased storage capacity: Similarly, by increasing the number of
shards, you can also increase overall total storage capacity. Let's say
one shard can hold 4TB of data. Each additional shard would increase your
total storage by 4TB. This allows storage capacity to keep growing as
shards are added.
• Data locality: Zone sharding allows you to easily create distributed
databases to support geographically distributed apps, with policies
enforcing data residency within specific regions. Each zone can have
one or more shards.
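One common way to decide which shard holds a record is to hash its key. The sketch below assumes hash-based sharding over three in-memory shards; the `shard_for`, `put` and `get` helpers are hypothetical, and the capacity figures simply repeat the ideal-case numbers from the bullets above.

```python
import hashlib

NUM_SHARDS = 3
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for one server

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard number using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def put(key, value):
    shards[shard_for(key, NUM_SHARDS)][key] = value

def get(key):
    return shards[shard_for(key, NUM_SHARDS)].get(key)

for i in range(1000):
    put(f"user:{i}", {"id": i})

sizes = [len(shard) for shard in shards]
print(sum(sizes))      # 1000: every record lives on exactly one shard
print(get("user:42"))  # {'id': 42}

# Ideal-case arithmetic from the text: 1000 ops/s and 4 TB per shard.
ideal_ops_per_second = NUM_SHARDS * 1000
ideal_capacity_tb = NUM_SHARDS * 4
print(ideal_ops_per_second, ideal_capacity_tb)  # 3000 12
```

A fixed modulo like this moves most keys when the number of shards changes, which is one reason production systems tend to use range-based chunks or consistent hashing instead.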
Replication

NoSQL data replication is also a robust feature that allows you to
seamlessly copy and store your structured, unstructured, and
semi-structured data and prevent data loss in case of a server crash.

Replication comes in two forms in NoSQL:

• Master-slave replication makes one node the authoritative copy that
handles writes while slaves synchronize with the master and may
handle reads.

• Peer-to-peer replication allows writes to any node; the nodes
coordinate to synchronize their copies of the data.

• Master-slave replication reduces the chance of update conflicts, but
peer-to-peer replication avoids loading all writes onto a single point
of failure.
Master-slave replication

Master
• Master is the authoritative
source for the data
• Master is responsible for
processing any updates to that
data
• Master can be appointed
manually or automatically
Slaves
• A replication process
synchronizes the slaves with the
master
• After a failure of the master, a
slave can be appointed as new
master very quickly
Pros and cons of Master-Slave Replication
Pros
1. More read requests can be handled:
• Add more slave nodes
• Ensure that all read requests are routed to the slaves
2. Should the master fail, the slaves can still handle read requests
3. Good for read-intensive datasets

Cons
1. The master is a bottleneck: the cluster is limited by the master's ability
to process updates and to pass those updates on
• Its failure eliminates the ability to handle writes until:
• the master is restored or
• a new master is appointed
2. Inconsistency due to slow propagation of changes to the slaves
3. Bad for datasets with heavy write traffic
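The read scaling and the inconsistency caused by slow propagation can both be seen in a small simulation. The `MasterSlaveCluster` class below is a simplified in-memory sketch with made-up names: writes go to the master, reads are spread over the slaves, and `replicate` stands in for the replication process.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}

class MasterSlaveCluster:
    def __init__(self, num_slaves=2):
        self.master = Node("master")
        self.slaves = [Node(f"slave-{i}") for i in range(num_slaves)]
        self._pending = []    # writes accepted by the master, not yet propagated
        self._next_slave = 0  # round-robin pointer for reads

    def write(self, key, value):
        """All writes are processed by the master only."""
        self.master.data[key] = value
        self._pending.append((key, value))

    def read(self, key):
        """Reads are routed to the slaves in turn."""
        slave = self.slaves[self._next_slave % len(self.slaves)]
        self._next_slave += 1
        return slave.data.get(key)

    def replicate(self):
        """Replication process: push pending writes to every slave."""
        for key, value in self._pending:
            for slave in self.slaves:
                slave.data[key] = value
        self._pending.clear()

cluster = MasterSlaveCluster(num_slaves=2)
cluster.write("stock", 10)

stale = cluster.read("stock")  # the slave has not received the write yet
cluster.replicate()
fresh = cluster.read("stock")
print(stale, fresh)  # None 10
```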
Peer-to-Peer Replication
• All the replicas have equal weight; they can all accept writes
• The loss of any of them doesn't prevent access to the data store.
Pros and cons of peer-to-peer replication
Pros:
• you can ride over node failures without losing access to data
• you can easily add nodes to improve your performance

Cons:
Inconsistency!
• Slow propagation of changes to copies on different nodes
• Inconsistencies on read lead to problems but are relatively
transient
• Two people can update different copies of the same record stored
on different nodes at the same time - a write-write conflict.
• Inconsistent writes are forever.
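A write-write conflict can be reproduced with two peers that both accept a write to the same record before they synchronize. The `Peer` class and `sync` function are illustrative assumptions, and the per-key version counter is a deliberate simplification; real systems typically use richer mechanisms such as vector clocks.

```python
class Peer:
    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def write(self, key, value):
        """Any peer accepts writes and bumps its local version for the key."""
        version, _ = self.data.get(key, (0, None))
        self.data[key] = (version + 1, value)

def sync(a, b, key):
    """Exchange one key between two peers and report what happened."""
    va, vb = a.data.get(key), b.data.get(key)
    if va == vb:
        return "in sync"
    if va is None or (vb is not None and vb[0] > va[0]):
        a.data[key] = vb
        return "copied b -> a"
    if vb is None or va[0] > vb[0]:
        b.data[key] = va
        return "copied a -> b"
    # Same version on both sides but different values: neither write "wins".
    return "write-write conflict"

node_a, node_b = Peer("a"), Peer("b")
node_a.write("stock", 10)
first = sync(node_a, node_b, "stock")

# Two people update different copies of the same record at the same time.
node_a.write("stock", 9)
node_b.write("stock", 8)
second = sync(node_a, node_b, "stock")
print(first)   # copied a -> b
print(second)  # write-write conflict
```

After the conflict, the two peers still hold different values for the same record until some policy (for example, asking the user or merging) resolves it.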
Combining Sharding and Replication
Replication and sharding are strategies that can be combined. If we use both
master-slave replication and sharding, this means that we have multiple masters, but each
data item only has a single master. Depending on your configuration, you may choose
a node to be a master for some data and a slave for others, or you may dedicate nodes
for master or slave duties.

Using peer-to-peer replication and sharding is a common strategy for column-family
databases. In a scenario like this you might have tens or hundreds of nodes in a cluster with
data sharded over them. A good starting point for peer-to-peer replication is to have a
replication factor of 3, so each shard is present on three nodes. Should a node fail, then the
shards on that node will be rebuilt on the other nodes.
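A replication factor of 3 combined with sharding can be sketched as a placement table that assigns every shard to three different nodes. The six-node cluster, the twelve shards and the ring-style `replicas_for` rule are assumptions chosen for this example, not the placement policy of any specific database.

```python
import hashlib

NODES = [f"node-{i}" for i in range(6)]
NUM_SHARDS = 12
REPLICATION_FACTOR = 3

def shard_of(key: str) -> int:
    """Sharding: a stable hash of the key picks the shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def replicas_for(shard, nodes=NODES, rf=REPLICATION_FACTOR):
    """Replication: place the shard on rf consecutive nodes of the ring."""
    start = shard % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]

placement = {shard: replicas_for(shard) for shard in range(NUM_SHARDS)}

def surviving_replicas(shard, failed_node):
    """Nodes that still hold the shard after one node fails."""
    return [node for node in placement[shard] if node != failed_node]

print(placement[5])                     # ['node-5', 'node-0', 'node-1']
print(surviving_replicas(5, "node-0"))  # ['node-5', 'node-1']
```

With three copies of each shard, losing one node still leaves at least two copies of every shard it held, and those copies are the source from which a replacement copy can be rebuilt on another node.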
1. What is NoSQL? List the features of NoSQL and explain the difference between
NoSQL and SQL.
2. What is aggregation? Explain the different aggregate data models and
their features with suitable examples.
3. Explain relationships, schemaless databases and materialized views
with suitable examples.
4. What are distribution models? Explain the differences between the
distribution models with a suitable example.
5. Discuss and differentiate between sharding and replication.
