0 ratings0% found this document useful (0 votes) 159 views30 pagesBda - Unit 2
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content,
claim it here.
Available Formats
Download as PDF or read online on Scribd
UNIT II
NoSQL Data Management
Syllabus
Introduction to NoSQL - aggregate data models - key-value and document data models -
relationships - graph databases - schemaless databases - materialized views - distribution models -
master-slave replication - consistency - Cassandra - Cassandra data model - Cassandra examples
~ Cassandra clients
Contents
24
22
23
24
25
26
27
28
Introduction to NoSQL
Aggregate Data Models
Schemaless Databases
Materialized Views
Distribution Models
Consistency
Cassandra ‘
Two Marks Questions with Answers
(2-1)NoSQL Di j
Big Data Analytics 2:2 SOF Pata Managemen,
ERG introduction to NoSQL |
NoSQL means Not Only SQL, it solves the problem of handling huge volume o,
data that relational databases cannot handle. NoSQL databases are schema free
and are non-relational databases. Most of the NoSQL databases are open source, |
NoSQL is also type of distributed database, which means that information jg |
copied and stored on various servers, which can be remote or local. This ensures |
availability and reliability of data, If some of the data goes offline, the rest of the
database can continue to run, |
NoSQL encompasses
polymorphic data.
structured data, semi-structured data, unstructured data ang
|
No SQL database provides a mechanism for storage ‘and retrieval of data that
employs less constrained consistency models than traditional relational databases,
NoSQL is a response to nowadays business data related factors :
1. Volume and velocity, referring to the ability to handle large datasets that arrive
quickly;
2. Variability, referring to how diverse data types don't fit into structured tables;
Agility, referring to how fast an organization responds to business changes.
NoSQL databases are very often referred to as data stores rather than data-bases,
NoSQL. sy
computer s
ems work on multiple processors and can run on low-cost separate
stems - No need for expensive nodes to get high-speed performance,
It supports linear scalability. Every time we add more processors, we get a
consistent increase in performance.
History of NoSQL =
© The acronym NoSQL was first used in 1998 by Carlo Strozzi while naming,
his lightweight, open-source "relational" database that did not use SQL. The
name came up again in 2009 when Erie Evans and Johan Oskarsson used It
to describe non-relational database
» Relational databases are oft
referred to as SQL systems, The term NoSQL
can mean either No SQL systems” or the more commonly accepted
translation of “Not only SQ1," to emphasize the fact wome systems might
support SQLlike query languages,
# NoSQL de
ped at Teast Jn the beginning ana reaponne to web data, the
need for processing unstructured data and the need for faster processing:
The NoSQL model uses a distributed databane system, meaning a nystem
with multiple computers.
TEGHMIOAL PUBLICATIONS” « an up-thrust for KnowledgeBig Data Analytics 2-3
NoSQL Data Management
Not only can NoSQL systems handle both structured and unstructured data,
but they can also process unstructured big data quickly. This led to
organizations such as Facebook, Twitter, LinkedIn, and Google adopting
NoSQL systems, These organizations process tremendous amounts of
unstructured data, coordinating it to find
patterns and gain business insights.
Big data became an official term in 2005,
Why NoSQL 7
It can handle large volumes of structured, semi-structured and unstructured data.
= Agile sprints, quick iteration and frequent code pushes.
= Object-oriented programming that is easy to use and flexible.
= Scale-out architecture,
Types of NoSQL Stores :
1
Column Oriented (Accumulo, Cassandra, Hbase)
2, Document Oriented (MongoDB, Couchbase, Clusterpoint)
3.
4, Graph (Allegro, Neodj, OrientDB)
Key-value (Dynamo, MemeacheDB, Riak)
NoSQL databases are guaranteed to adhere to two of the
CAP properties. Such
databases are of several types,
1, Key-value store : Stores in the form of a hash table (Example - Rink, Amazon
$3 (Dynamo), Redis)
2, Document-based store : Stores objects, mostly JSON, which is web friendly or
supports ODM (Object Document Mappings). (Example - CouchDB, MongoDB)
3. Column-based store : Ench storage block contains data from only one column
(Example - HBase, Cassandra)
4. Graph-based : Graph representation of relationships, mostly used by social
networks, (Example - NeodJ)
The Dofinition of Four Types of NoSQL Databases
NoSQL database provides a mechanism for storage and retrieval of data that is
modeled in means other than the tabular relations used in relational databases,
NoSQL Is often interpreted as Not-only-SQL to emphasize that they may also
support SQL-like query languages. Most NoSQL databases are designed to store
Inege quantities of data in a fault-tolerant way,
NoSQL tn simply the term that Is used to describe a family of databases that are
All nor-relational, While the technologies, data types and use casen vary widely
Smount them, it I generally agreed that there are four types of NoSQL databases s
TECHNICAL PUBLICATIONS!
40 upethruat for knowledge<4 NoSQL Date Menagon,
Big Data Anelytics |
using any of four primary aul
7, manage information
© NoSQL databases can 8 mn based and graph based, “4,
models : Key-value store, document-based, colu
[ERE Example and Advantages
© Examples of NoSQL databases
a) Apache CouchDB, an open source, JSON document-based database that y,.
JavaScript as its query language. |
b) Elasticsearch, a document-based database that includes a full-text search enging
|
©) Couchbase, a key-value and document database that empowers developers 1.
build responsive and flexible applications for cloud, mobile and edgy
computing. t
Advantages
a) NoSQL databases have a simple and flexible structure. They are schema-free,
b) NoSQL databases are based on key-value pairs.
©) Some store types of NoSQL databases include column store, document store
key value store, graph store, object store, XML, store and other data store
modes.
|
|
@) Each value in the database has a key. Some NoSQL database stores also allow
developers to store serialized objects into the database, not just simple string)
values.
e) Open-source NoSQL databases do not require expensive licensing fees and can|
run on inexpensive hardware, rendering their deployment cost-effective. |
Disadvantages : |
2) Most NoSQL databases do not support reliability features that are natively!
supported by relational database systems. |
b) In order to support reliability and consistency features, developers mus
implement their own proprietary code, which adds more complexity to the
/\ system.
CAP Theorem
JS Fig. 2.1.1 shows the three properties of the CAP theorem.
© The theorem states that distributed data systems will offer a trade-off between”
consistency, availability and partition tolerance. And, that any database can only
guarantee two of the three properties : f
© Consistency : Every node in the cluster responds with the most recent data, even”
if the system must block the request until all replicas update. If you query *
TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Data Analytics NoSQL Data Management
Avaliablity
Partition
tolerance
Fig. 2.1.1 CAP theorem
“consistent system” for an item that is currently updating, you will wait for that
i
response until all replicas successfully update. However, you will receive the most
current data.
Availability : Every node returns an immediate response, even if that response is |
not the most recent data. If you query an “available system" for an item that is j
updating, you will get the best possib
le answer the service can provide at that
moment.
Partition tolerance : Guarantees the system continues to operate even if a :
replicated data node fails or loses connectivity with other replicated data nodes.
Comparison of SQL and NoSQL Databases
Sr. No. SQL
1 SQL databases are relational.
2. SQL databases are vertically scalable.
SQL databases use strisctured
language and have a predefined
|
Lan as ae
4. SQL databases are table-based. NoSQL databases are: “document,
_key-value, graph, or wide-column stores,
|
SQL databases are better for multi-row while NoSQL is better for unstructured
transactions. data like documents or JSON.
at We a |
EBV Keshrosat Data Models
* Aggregate means a collection of objects that are treated as a unit. In NoSQL
Databases, an aggregate is a collection of data that interact as a unit. Moreover,
these units of data or aggregates of data form the boundaries for the ACID
operations.
ao
TECHNICAL PUBLICATIONS® - an up-thrust for knowiedgeData Ansitios 2-6 NoSQL Data Menagom
en
. dseeee's a models in NoSQL make it easier for the databases to manage
= a a a as the aggregate data or unit can now reside on se
sane nats, Whenever data is retrieved from the databate all the da i
S aggregate data models in NoSQL.
er eon in NoSQL do not support ACID transactions and SACtificg
pan ot toe AC Properties. With the help of aggregate data models in Nosqy
cen easily perform OLAP operations on the database. We can achieve higy,
efficiency of the aggregate data models in the NoSQL database if the d
transactions and interactions take place within the same aggregate. .:
Key-value Store
* In the key-value structure, the key is usually a simple string of characters and the
value is 2 series of uninterrupted bytes that are opaque to the database. Key-value
store is like 2 relational database with only two columns : The key or attribute
name and the value.
Fig. 22.1 shows key-value store.
Key Value
oo > [sereete_]
Aadher 222222222222
cay —— it
Fig. 2.2.1 Key-value store
2. comes
© Saves data as a group of key value pairs, which are made up of two data items
that are linked. The link between the items is a "key" which acts as an identifier
for an item within the data and the "value" that is the data that has been
identified. .
The data itself is usually some primitive data type (string, integer, array) or a
more complex object that an application needs to persist and access directly.
This replaces the rigidity of relational schemas with a more flexible data model
that allows developers to easily modify fields and object structures as. their
applications evolve.
¢ systems treat the data as a single opaque collection which may have
Key valu
different fields for every record.
In each key value pair,
a) The key is represented by an arbitrary string
b) The value can be any kind of data like an image, file, text or document.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Data Analytics 2-7 NoSQL Data Management
« In general, key-value stores have no query language. They simply Provide a way
to store, retrieve, and update data using simple GET, PUT and DELETE
commands.
The simplicity of this model makes a key-value store fast, easy to use, scalable,
portable and flexible.
Advantages of key value stores : 2 ‘i
a) The secret to its speed lies in its simplicity. The Path to retrieve data is a direct
request to the object in memory or on disk.
b) The relationship between data does not have to be calculated by a query
language, there is no optimization performed.
©) They can exist on distributed systems and do not need to worry about where to
store indexes.
Disadvantages of key value stores :
a) No complex query filters
b) All joins must be done in code
©) No foreign key constraints
4) No trigger.
Document-based
* A document is an object and keys (strings) that have values of recognizable types,
including numbers, Booleans and strings, as well as nested arrays and dictionaries.
All data is stored in one table, so there is no need for cross-referencing and
instead of storing information in a table, it is stored in a document.
*° Fig. 2.2.2 shows document based data model.
Documents Documents
Fig. 2.2.2 Document based data model
* Document databases are designed for flexibility. They are not typically forced to
have a schema and are therefore easy to modify.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Date Analytics 2-8
(E283 column-based
< amounts of data, document databases are a good oP!
NoSQL Data Menogn,. |
nt
o store varying attributes along with jg,
ion. Be
including XML and JSON, 7,
ypedance match. is
If an application requires the ability t
Document stores work with multiple formats in¢
allows for storage and retrieval of data without an #™
Terminologied in document data store are as follows :
a) A table is called a collection
b) A row is called a document
c) A column in called a field.
Typical use cases for document stores include the storage
blog posts, news articles and data analysis.
MongoDB and Apache CouchDB are
databases.
Do not use document databases for transactions across multiple documents
and retrieval of catalog,
examples of popular document-baseg
(records) and Ad hoc cross-document queries.
Advantages of document based model :
a) Faster retrieval of data.
b) Dynamic architecture for unstructured data and storage options
¢) Sharing for horizontal scalability
4) Replication is managed internally, so chances of accidental loss of data is
negligible.
Disadvantages of document data model :
a) No views, triggers, scripts or stored procedure.
b) Relationship not well defined.
©) No support for transactions, which could lead to data corruption. !
Column-based is also called 'wide column’ models enabling very quick data access
using a row key, column name and cell timestamp.
The flexible schema of these types of databases means that the columns do nt
have to be consistent across records and you can add a column to specific 1o¥8 |
without having to add them to every single record. |
|
It is also called a two-level map as it offers a two-level aggregate structure.
‘As data is organized into columns, we have better indexing compared to oth
key-value stores. Also, when it comes to updates, multiple column block update
can be aggregated.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeig Data Analytics 2-9
NoSQL Data Management
Column store databases were born when
of a Column store NoSQL database called
well-known Google e-mail service,
NoSQL Database.
The wide, columnar storés data model, like
derived from Google's BigTable paper.
Google open sourced its implementation
Big Table. Apparently, the data for the
Gmail, is stored in the Google Big Table
that found in Apache Cassandra, are
Organizations mostly use Column data stores for data warehousing and data
processing, which is evident in services suchas Amazon Redshift.
Advantages of column data stores :
a) Column stores are very efficient at data compression and/or partitioning.
b) Columnar stores can be loaded extremely fast,
©) Columnar databases are very scalable.
4) Due to their structure, columnar databases perform particularly well with
aggregation queries.
Disadvantages of column data store : -
a) Updates can be inefficient. The fact that columnar families group attributes, as
opposed to rows of tuples, works against it.
b) If multiple attributes are touched by a join or query, this may also lead to
column storage experiencing slower performance.
©) It is also slower when deleting rows from columnar systems, as a record needs
to be deleted from each of the record files.
EJ craph-based
The modern graph database is a data storage and processing engine that makes
the persistence and exploration of data and relationships more efficient.
Graph-based data models store data in nodes that are connected by edges. These
Aggregate Data Models in NoSQL are widely used for storing the huge volumes
of complex aggregates and multidimensional data having many interconnections
between them.
In graph theory, structures are composed of vertices and edges, or what would
later be called "data relationships".
Graphs behave similarly to how people think, in specific relationships between
discrete units of data. This database type is particularly useful for visualizing,
analyzing, or helping to find connections between different pieces of data.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Dota Analytics 2-10
: No standard language.
J joSQL Key/Value Database : MongoDB
.
NoSQL Data Me
1eG0m on |
}
ologies for recommendation en
|
ines |
ies of graph-based NoSQL dat Binoy |
As a result, businesses leverage graph tech
Abate
fraud analytics and network analysis. Examp!
include Neodj and JanusGraph.
tions, social media any
analyze customer interac
Graph databases can be used to
0 traverse Jong relationship graph |,
|
scientific applications where it is crucial ( |
better understand data. |
Advantages of graph data :
a) More descriptive queries |
b) Greater flexibility in adapting your model
¢) Greater performance when traversing data relationships. |
Disadvantages of graph data stores : i
a) Difficult to scale |
MongoDB is an open-source document database that provides high performance, |
high availability and automatic scaling. MongoDB is one of the most popular
open-source NoSQL databases written in C++. As of February 2015, MongoDB is |
the fourth most popular database management system. It was developed by a |
company 10gen which is now known as MongoDB Inc.
Why use MongoDB ? }
a) Simple queries i
b) Functionality provided applicable to most web applications
c) Easy and fast integration of data
d) No ERD diagram
¢) Not well suited for heavy and complex transactions systems. t
MongoDB did not provides any command to create a “database”. Actually, you do |
not need to create it manually, because, MangoDB will create it on the fly, during —
the first time you save the value into the defined collection (or table in SQL) and
database. |
MongoDB is a document-oriented database which stores data in JSON-like }
documents with dynamic schema. It means you can store your records without!
worrying about the data structure such as the number of fields or types of fields
to store values. MongoDB documents are similar to JSON objects. }
MongoDB stores data records 2s BSON documents, BSON is a binary |
representation of JSONdocuments, though it contains more data types than JSON.
|
,
TECHNICAL PUBLICATIONS” - an up-thrust for knewdodyoBoa Anas aq NoSQL Data Management
« MongoDB stores data in documents in-spite of tables, You can change the
structure of records simply by adding new fields or deleting existing ones. This
ability of MongoDB helps you to represent hierarchical relationships, to store
arrays and other more complex structures easily.
+ MongoDB uses Mongo server and Mongo shell commands to fetch records or the
information from the database (i.e. collections). Few areas where MongoDB is ideal
are big data, user data management, mobile and social infrastructure, content
management and delivery, data hub.
+ A MongoDB instance may have zero or more databases. A database may have
zero or more ‘collections’. A collection may have zero or more ‘documents’. A
document may have one or more ‘fields’. MongoDB ‘Indexes’ function much like
their RDBMS counterparts.
+ Database is a physical container for collections. Collection is a group of documents
and is similar to an RDBMS table, A document is a set of key-value pairs.
Documents have dynamic schema.
+ MongoDB documents are composed of field-and-value pairs and have the
following structure :
fleld1: valuel,
flold2; value2,
fleld3: valuo3,
floldN: valueNy
}
* The value of a field can be any of the BSON data types, including other
documents, arrays and arrays of documents. MongoDB supports many data types
such as : String, integer, boolean, double, arrays, timestamp, object, null, symbol,
date, code and binary data.
* Fig. 2.2.3 shows relation between SQL terms and MangoBD terms.
weve
Fig. 2.2.3 Relation between SQL terms and MangoBD terms
TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Data Analytics NoSQL Data Managemen
supports Ad hoc queries
* MongoDB uses MongoDB language and
ee f MongoDB that helps it 4,
replication and sharding. Sherding is a feature o|
operate as a distributed data system.
* Sharding is used by MongoDB to store data across multiple machines. Tt use.
horizontal scaling to add more machines to distribute data and operation with
Tespect to the growth of load and demand.
* Sharding arrangement in MongoDB has mainly three components :
Fig. 22.4 Sharding by MongoDB
2) Shards or replica sets : Each shard serves 2s a separate replica set. They store
all the date. They target to increase the consistency and availability of the data.
are like the managers of the clusters. These
metadatz. They actually have the mapping of the
The
b) Configuration servers
router is mongo instances which serve as interfaces
y take in the user queries from the applications and
ne required results
serve the applications v
Advantages of MangoDB :
»goDB is 2 schema - less document type database.
© MongoDB supports field, range based query, regular expression for searching the
from the stored data.
TECHNICAL PUBLICATIONS® - an upthnust for knowtedgeBig Dato Analytics
2-13 NoSQL Data Management
MongoDB is very easy to scale up or down,
It uses internal memo,
ry for storing the working temporary datasets for which it is
much faster.
MongoDB support primary and Secondary indexes on any field.
MongoDB supports replication of databases,
MongoDB can be used as a file storage system which is known as a GridES,
EE] schemaless Databases
Since NoSQL does not require a schema, there is no blueprint on how data should
be stored and therefore varies between databases, Generally, there are two ways
that NoSQL data storage functions :
1. On-the-disk using B-Trees, with the top of it being permanently in RAM.
2. Inmemory where it is all on RAM using RB-Trees and anything stored on the
disc is just an append.
Schemaless databases are a type of No:
Predefined schema or structure for data. This means that data can be inserted and
retrieved without adhering to a specific structure and the database can adapt to
changes in data over time without requiring schema migrations or changes.
Schemaless database manages information without the need for a blueprint. The
onset of building a schemaless datab
jase does not rely on conforming to certain
fields, tables, or data model structures.
SQL databases that do not have a
There is no Relational Database Management System (RDBMS) to enforce any
specific kind of structure. In other words, it is a nor-relational database thet can
handle any database type, whether that be a key-value store, document store,
in-memory, column-oriented, or graph data model.
In actuality, there is no such thing as schema-less dataset :
1. In a relational database, the schema is explicit and created separately in
advance.
2. In column-based databases, we create a fresh schema for each row and in fact,
We often reuse schema fragments from rows that are grouped together. The
same is true for document databases.
3. Im column-based and also in document databases, users directly query data
based on the schema.
4. In graph-based databases, we are in essence building the schema as we build
the data.
TECHNICAL PUBLICATIONS® - an upthrust for knowledgeig Dota Analytics et4 NoSQL Data Monogeme |
ony |
In schemaless databases, information is stored in JSON-style documents which «,
have varying sets of fields with different data types for each field. So, a collects.
I
{
could look like this : |
t
name : "Joe", ago : 30, interests : ‘football’ } a t
name : ‘Kate’, age : 25
In the above condition, the data itself normally has a fairly consistent structuy |
With the schemaless MongoDB database, there is some additional structure, tp, |
system namespace contains an explicit list of collections and indexes. Collection,
may be implicitly or explicitly created, indexes must be explicitly declared. |
Benefits of using schemaless databases :
1. Flexibility : Schemaless databases allow for greater flexibility in data modeling, |
2. Scalability : Schemaless databases are designed for scalability, ag
they can handle large amounts of unstructured data with ease.
3. Reduced complexity : Schemaless databases can reduce the complexity of data|
modeling and development. |
4. Good support for non-uniform data. |
Disadvantages :
1, Potentially inconsistent names and data types for a single value,
2. Management of the implicit schema migrates into the application layer.
Ba Materialized Views |
Materialized views solve the problem of views. The views provide a mechanism t
hide from the client whether data is derived data or base data, Views are used
when data is to be accessed infrequently and the data in a table gets updated on a_
frequent basis. 7
‘A materialized view is a replica of a target master from a single point in time. The *
master can be either a master table at a master site or a master materialized view |
at a materialized view site. A materialized view is like a cache, a copy of the dalt_
that can be accessed quickly. |
If a regular view is a saved query, a materialized view is a saved query plus it
results stored as a table. }
NoSQL databases do not have views, they may have precomputed and cached
queries and they reuse the term "materialized view" to describe them NoSql.
|
TECHNICAL PUBLICATIONS® - an up-hrust for knowiadgo |Big Data Analytics
15 NoSQL Data Management
We can use mat
ized vi i
1. Ease network lower 4 VMS ' achieve one or more of the following goals :
2. Create a mass deployment environment
3. Enable data subsetting
4. Enable disconnected ¢
* Two methods are used f
‘or building a materialized view «
1. Eager approach : user . eviews:
‘omputing,
it becomes more difficult and «
bigger server to run the database on.
ERI single server
* Database is run on a single machine which handles all the reads and writes to the
data store. Organizations prefer a single server because it eliminates all the
complexities that the other options introduce.
* Single server is eas
'y to manage for application developers. Lot of NoSQL
databases are designe
d around the idea of running on a cluster, it can make sense
to use NoSQL with a single-server distribution model if the data model of the
NoSQL store is more suited to the application.
Single-server configuration is suitable for graph-database.
If data usage is mostly about Processing aggregates, then a single-server document
or key-value store may be useful.
EEF sharding
* Sharding is a method for distributing a single dataset across multiple databases,
which can then be stored on multiple machines. This allows for larger datasets to
be split into smaller chunks and stored in multiple data nodes, increasing the total
storage capacity of the system.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Date Analytics 2-16
NoSQL Data Managemen
q
as horizontal scaling or scale-out,
he load. Horizontal scaling allows |
intense workloads. tf
Sharding is a form of scaling known
additional nodes are brought on to share #
near-limitless scalability to handle big data and
Sharding is also known as data partitioning. Many NoSQL databases of, |
auto-sharding. Fig. 2.5.1 shows Sharding.
-|—_ eee
Fig. 2.5.1 Sharding
Sharding is the process of splitting a large dataset into many small partitions
which are placed on different machines. Each partition is known as a “shard”,
Each shard has the same database schema as the original database. Most data is
distributed such that each row appears in exactly one shard. The combined data
from all shards is the same as the data from the original database. The load is
balanced out nicely between servers, for example, if we have five servers, each one
only has to handle 20 % of the load.
The NoSQL framework is natively designed to support automatic distribution of
the data across multiple servers including the query load. Both data and query
replacements are automatically distributed across multiple servers located in the |
different geographic regions and this facilitates rapid, automatic and transparent —
replacement of the data or query instances without any disruption.
Sharding is particularly valuable for performance because it can improve both read i
and write performance. Using replication, particularly with caching, can greatly
improve read performance but does little for applications that have a lot of writes.
Advantages of Sharding
a) Faster performance : There are more servers available to handle input/output.
b) Horizontal scaling : We can quickly add additional servers to a cluster.
©) Costs : Horizontal scaling can often be less expensive than vertical scaling.
d) Distribution/uptime : A horizontally scaled distributed database can achieve
better uptime than a traditional single server.
TECHNICAL PUBLICATIONS® - an up-thrst for knowledgeBig Data Analytics a 2-17 NoSQL Data Management
© Disadvantages of Sharding
a) Complexity : Depending on the database system, sharding complexity can vary.
b) Rebalancing : When adding additional machines to a cluster, the shards will
likely need to be rebalanced to distribute data evenly.
c) Increased infrastructure costs.
Master-slave Replication
* We replicate data across multiple nodes. One node is designed as primary
(master), others as secondary (slaves). Master is responsible for processing any
updates to that data. A replication process synchronizes the slaves with the
master.
Master is the authoritative source for the data. It is responsible for processing any
updates to that data. Masters can be appointed manually or automatically.
Slaves is a replication process that synchronizes the slaves with the master. After
a failure of the master, a slave can be appointed as new master very quickly.
Fig. 2.5.2 shows master-slave replication.
All updates aro,
Read
made to master Masse ——
Changés propogates to all slaves
Fig. 2.5.2 Master-slave replication
Master-slave replication is most helpful for scaling when we have a read-intensive
dataset. It will scale horizontally to handle more reads.
This design offers read resilience. Even if one or more of the servers fails, the
remaining servers can keep offering read access. This can help a lot with
read-heavy applications, but will offer little benefit to write-intensive applications.
As the slaves are exact replicas of the master server, one of them can assume the
role of the master in case the master fails. In fact most of the time you can simply
create a set of nodes and have them automatically decide who would be the
master. There are some consistency issues that occur due to the delay in updating
between master and slaves.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Date Analytics 2-18 NoSQL Data Menagemen,
automatically. In manual appointing
d we configure one node as the
* Masters can be appointed manually 0!
Juster of nodes and they clog,
performed when we configure our cluster am
master. With automatic appointment, we create a ¢
one of themselves to be the master.
* Problems of master-slave replication :
1. Does not help with scalability of writes
2. Provides resilience against failure of a slave, but not of a master
3. The master is still a bottleneck.
EBERT Peer-to-Peer Replication
Any node
© In a peerto-peer replication setup the various nodes are all “equals’.
can accept reads as well as writes and they communicate these writes to each
other, In peer-to-peer replication updates on any one server are replicated to all
other associated servers.
Fig. 2.5.3 shows peer-to-peer replications.
Requests:
Node]
Node]
Node}
Fig. 2.5.3 Peer-to-peer replications
« The advantage of this setup is its read and write resilience. One node's failure
does not cause problems, as the remaining nodes can continue their work without
losing a beat.
The problem that arises is that of consistency. For example we may have
conflicting write requests that come to different nodes and then those nodes
attempt to communicate those requests to the rest of the nodes. This could lead to
considerable inconsistencies.
There are various ways to resolve this problem. The most standard approach
would be to have the replicas communicate their writes first before they “accept,
them. Once a majority of the replicas has confirmed a write, it can now be
considered as having been successfully performed and a response sent to the
client. This requires a certain amount of network traffic in coordinating these
writes.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledgelg Date Anelytics 2-19 NoSOL Data Management
There is a problem of write-write conflict. Two users can update different copies
of the same record stored on different nodes at the same time is called a
write-write conflict.
EI combining Sharding and Replication
« Sharding and replication can be combined to get a better response. If we use both
master-slave replication With Sharding and Peer-to-peer replication with Sharding.
4, Master-slave replication and Sharding :
We have multiple masters, but each data item only has a single master.
« A-node can be a master for some data and a slave for others.
2, Peer-to-peer replication and Sharding :
+ A common strategy for column-family databases.
« A good starting point for peer-to-peer replication is to have a replication factor of
3, so each shard is present on three nodes.
[EGES Difference between Replication and Sharding
Replication Sharding
‘sr. No
The primary server node copies data Sharding : Handles horizontal scaling:
onto secondary server nodes. This can across servers using a shard key.
help increase data availability and act as |
a backup, in case the primary server
=|
fails.
serve multiple s
Replication copies data across multiple Sharding distributes different data =
Each
__ places.
of data can be found in multiple
a subset of data.
Each server acts as the single source for }
de Replicated servers contain identical Sharded database servers each contain a |
copies of the entire database. part of the overall data, ie. they store |
|
different data on separate nodes:
More read
AA Gonsistony
* The CAP theorem is important when considering a distributed database, since we
must make a decision about what we are willing to give up. The database we
choose will lose either availability or consistency. Reading about NoSQL databases
we can face the concept of quorum. A quorum is the minimal number of nodes
that_must respond to a read or write operation to be considered complete.
It can improve both reads and writes.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Date Analytics 2-20 NOSGt Pala Managemen
Update Consistency
OF course having a maximum quorum and querying all servers is the way we ,
determine the correct result. :
Consistency can be simply defined by how the copies from the same data may vary quip,
the same replicated database system, ‘
Nowadays systems need to scale, The “traditional monolithic databas, |
architecture, based on a powerful server, does not guarantee the high availabitin
and network partition required by today’s web-scale systems, as demonstrated py
the CAP theorem. To achieve such requirements, systems cannot impose strong
consistency.
In the past, almost all architectures used in database systems were strongly
consistent. In these cases, most architectures would have a single database instang
only responding to a few hundred clients. Nowadays, many systems are accesseq
by hundreds of thousands of clients, so there was a mandatory requirement tg
system's architectures that scale. However, considering the CAP theorem, |
high-availability and consistency do conflict on distributed systems when subjeq
to a network partition event.
Two users updating the same data item at the same time is called write-write
conflict.
When the writes reach the server of the two users, the server will serialize them
and decide to apply one, then the other. First user's update would be applied and
immediately overwritten by the second user.
In this case first user's is a lost update. Here the lost update is not a big problem.
We see this as a failure of consistency because second user's update was based on
the state before first user's update, yet was applied after it.
Approaches for maintaining consistency in the case of concurrency are often
described as pessimistic or optimistic.
A pessimistic approach works by preventing conflicts from occurring; an optimistic
approach lets conflicts occur, but detects them and takes action to sort them out.
For update conflicts, the most common pessimistic approach is to have write locks,
so that in order to change a value we need to acquire a lock and the system
ensures that only one client can get a lock at a time. 7
So both users would attempt to acquire the write lock, but only the first user
would succeed, Second user would then see the result of the first user's write
before deciding whether to make his own update.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Data Analytics 2-21 NoSQL Data Management
+ A common optimistic approach is a conditional update where any client that does
an update tests the value just before updating it to see if it is changed since his
last read.
+ Both the pessimistic and optimistic approaches rely on a consistent serialization of
the updates and it is possible for a single server.
« Two general solutions for write-write conflict are as follows :
1. Pessimistic approach : Preventing conflicts from occurring. Also acquire write
locks before update.
2. Optimistic approach : Lets conflicts occur, but detects them and takes actions to
resolve them.
« If there are more than one server ie. peer-to-peer replication, then two nodes
might apply the updates in a different order, resulting in a different value for the
telephone number on each peer. Sequential consistency is used in distributed
systems.
* Optimistic way to handle a write-write conflict is to save both updates and record
that they are in conflict. Replication makes it much more likely to run into
write-write conflicts. If different nodes have different copies of some data which
can be independently updated, then we will get conflicts unless we take specific
measures to avoid them. Using a single node as the target for all writes for some
data makes it much easier to maintain update consistency.
[EQ Read Consistency
* Problem : One user reads in the middle of another user's writing.
+ It is called read-write conflict, inconsistent reading. This leads to logical
inconsistency,
* In NoSQL databases, read consistency refers to the level of consistency between
multiple read operations on the same data. In a distributed database, where data
can be replicated across multiple nodes, ensuring read consistency can be
challenging.
* Aggregate-oriented databases do support atomic updates, but only within a single
aggregate. This means that we will have logical consistency within an aggregate
but not between aggregates.
* The length of time an inconsistency is present is called the inconsistency window.
A NoSQL system may have a quite short inconsistency window.
* There are different levels of read consistency available in NoSQL databases,
ranging from eventual consistency to strong consistency,
TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Date Analytics 2-22 NeS2L Data Manag,
Eventual consistency allows for a cersin degree of Ineosistency t0 ocqy
different replicas of data, In this model, the database guarantees that ay)
“i . but it makes no guarantees aboys 4 Peay
Will eventually propagate to all nodes, but ill be applied, ho
this will take or about the order in which updates will De applted.
Read-your-writes consistency means that once we have updated a Tecord,
Yall
our subsequent reads of that record will return the updated value, Nyy
* Session consistency means read-yourswrites consistency but at session ley
Session can be identified with a conversation between a client and a sere,
long as the conversation continues, we will read everything we haye Wri
during this conversation. If the session ends and we start another session, with _
same server, there is no guarantee that we can read values we haye Mite
during previous conversation.
* Session consistency is of two types : Sticky session and version stamps,
Quorums
* Quorum consistency is used in systems where consistency is more important thy
availability (CAP theorem) for write and read.
+ In systems with multiple replicas there is a possibility that the user rea
inconsistent data. This happens say when there are 2 replicas, N1 and N2 in
cluster and 2 user writes value v1 to node NI and then another user reads fron!
node N2 which is still behind Ni and thus will not have the value v1, so th
second user will not get the consistent state of data.
* In order to achieve a state where at least one node has consistent data we wi
quorum consistency.
© Fig. 2.6.1 shows write and read quorums
Write (key, A) Write (key, of Bh (key, A) ‘Write (key, B),.
(a) Write quorums
Fig. 2.6.1
TECHNICAL PUBLICATIONS®
~ an up-thrust for knowledge2-23 NoSQL Data Management
pig Dota Analytics
Quorum is achieved when nodes follow the below protocol: w+r>n
where n= Nodes in the quorum group,
w = Minimum write nodes,
1 = Minimum read nodes
Here w is our write quorum and r is our read quorum.
oo Relaxing Durability
1a write is committed, the change is permanent.
« Whe
strict durability is not essential and it
In some cases,
(write performance).
relax durability is to store data in memory and flush to disk
we lose updates in memory.
can be traded for scalability
+ A simple way to
regularly. f the system shuts down,
ERA Cejsancra
base. It was initially developed by Facebook to
Cassandra is a column NoSQL datat
fulfill the needs of the company's Inbox Search services. In 2009, it became an
‘Apache Project.
‘Apache Cassandra is an open source, distributed, NoSQL database. Apache
Cassandra is a distributed database system using a shared nothing architecture.
Apache Cassandra was initially designed at Facebook using a Staged Event-Driven
‘Architecture (SEDA) to implement a combination of Amazon's Dynamo distributed
storage and replication techniques and Google's Bigtable data and storage engine
model.
+ A columnar database, also called a column-oriented database or a wide-column
store, is a database that stores the values of each column together, rather than
storing the values of each row together.
* Columnar databases are well suited for big data processing, Business Intelligence
(BD and analytics.
* Cassandra provides tunable consistency i.e. users can determine the consistency
level by tuning it via read and write operations. Cassandra enables users to
configure the number of replicas in a cluster that must acknowledge a read or
write operation before considering the operation successful.
* Cassandra uses a gossip protocol to discover node state for all nodes in a cluster.
Cassandra is designed to handle “big data" workloads by distributing data, reads
and writes (eventually) across multiple nodes with no single point of failure.
TECHNICAL PUBLICATIONS? an up-thrust for knowledgeBig Data Anaiytios 2.
NoSQL. Data Managomont
it allows to add more
alata as per requirement,
Features of Cassandra :
1) Blastic scalability: Cassandra is highly. seal)
le;
hardware te accommodate more customers andl More
point of failure.
2) Always on architecture : Cassandra has no single
is Hnearly scalable, ie, it increases
3) Fast linearseale performance : Cassandra
in the cluster.
throughput as we increase the number of node
4 Mlenible data storage : Cassandra accommodtates all poss
incleding : Structured, semi-structured and unstructured
5) Transaction support : Cassandra supports properties Tike ACID.
canta provides the flexibility to distribute data
datacenters.
ible data formats
6) Easy data distribution : C
where you need by replicating data across multiple
Cassandra Architecture
© Components of C
shows Cassandra architecture.
Fig, 2.7.1 Cassandra architecture
andra architecture are node, data center, cluster, commit log,
Table, Bloom Filters and Cassandra query Janguage.
memtable,
Node ; A Cassandra node is a place where data is stored.
Data center ; Data center is a collection of related nodes.
Cluster ; A cluster is a component which contains one or more data centers.
Commit log : In Cassandra, the commit log is a crash-recovery mechanism. Every
write operation is written to the commit log.
Mem+-table : A menvtable is a memory-resident data structure. After commit log,
the data will be written to the mem-table, Sometimes, for a single-column family,
there will be multiple memtables.
TECHNICAL PUBLICATIONS® - an up-thrust for knowtedge5
© SSToble : tt ig 4 disk file
te
contents reach a thr shold vale ea nian
Bloom fi
ment is
Bloo :
° Bloom filter ; Hers are very fast, nondete in
whether an ole 4 member of a 4, ae
filters are Accessed after every query, et. Tt is a Special
algorithms for
testing
kind of cache,
Bloom
* Some of the features o}
1) Data in Cassandta i
f Cassandra data model ai
is stored as a
2) Tables are also called column fa
re as follows ;
Set of rows that are organized into tables,
milies.
3) Bach row is identified by a primary key value,
*) Data is partitioned by the primary key,
Data modelling in Cassandra uses a Query-driven approach, in which specific
Gueries are the key to organizing the data, The mein goal of Cassandra data
modeling is to develop and design a hightperformance ang well-organized Cluster,
Apache Cassandra data model components include keyspaces, tables and
columns :
a) Cassandra stores data as a set of rows organized into tables or column families
b) A primary key value identifies each row
©) The primary key partitions data
4) We can fetch data in part or in its entirety based on the primary key.
Cassandra data model provides a mechanism for data storage, The
components of
Cassandra data model are keyspaces, tables and columns,
TECHNICAL PUBLICATIONS® - an upthrust for knowledgeNoSQL Data Man
Big Data Analytics 2-26 =aemony
© Fig, 2.7.2 shows Cassandra data model
Cluster |
KeySpacet
Column familyt ‘Column femily4 L Column family2
Row Row Row Ro, |
KeySpace2 |
coun ICotumn2}| Cotumnt}|Cotumn2 cone coun Cotumng [cote cota
Value |) Value |) Value |
Fig. 2.7.2 Cassandra data model
Value
Value |} Value |) Value vaiue | Value
4. Keyspaces :
* Ata high level, the Cassandra NoSQL data model consists of data containers
called keyspaces. Keyspaces are similar’ to the schema in a relational database,
‘Typically, there are many tables in a keyspace. i
© Features of keyspaces are :
a) A keyspace needs to be defined before creating tables, as there is no default
keyspace. a
b) A keyspace can contain any number of tables and a table bélongs only td one
keyspace. This represents a one-to-many relationship.
©) Replication is specified at the keyspace level. For example, replication of three
implies that each data row in the keyspace will have three copies. f
2. Tables : j
* Tables, also called column families in eatlier iterations of Cassandra, aré (defined
within the keyspaces. Tables store data in a set of rows and contain a primary key
and a set of columns.
* Cassandra tables are used to hold the actual data in the form of rows and
columns. A table in Cassandra must be created with the primary key during table
creation time, post that it can not be altered.
© To alter the table new tables should be created with existing data, The primary
key would be used to locate and order the data,
TECHNICAL PUBLICATIONS® - an up.tnrust for knowledgeBig Data Analytics 2-27 NoSQL Data Management
* Some of the features of tables are :
a) Tables have multiple rows and col
P lumns. As mentioned earlier, a table is also
called column family in the earlier versions of Cassandra.
b) It is still referred to as column family in some of the error messages and
documents of Cassandra.
¢) It is important to define a primary key for a table.
3. Columns :
+ Columns define data structure within
a table. There are various types of columns,
such as Boolean, double,
integer and text.
* Cassandra column is used to store a sin;
gle piece of data. The column can consist
of various types of data such as big inte;
ger, double, text, float and Boolean.
Each column value has a timestamp associated with it that shows the time of
update. Cassandra provides the collection type of columns such as list, set and
map.
© Some of its features'are :
a) Columns consist of various types,
such as integer, big integer, text, float, double
and Boolean.
b) Cassandra also provides collection types such as set, list and map.
©) Further,
column values have an associated time stamp representing the time of
update.
d) This timestamp can be retrieved using the function write time.
Cassandra Clients
* Thrift is the driver-level interface; it provides the API for client implementations in
a wide variety of languages. Thrift was developed at Facebook.
A Client holds connections to a Cassandra cluster, allowing it to be queried. Each
Client instance maintains multiple connections to the cluster nodes, provides
Policies to choose which node to use for each query and handles retries for failed
query etc...
Client instances are designed to be long-lived and usually a single instance is
enough per application, As a given Client can only be “logged” into one keyspace
at a time, it can make sense to create one client per keyspace used. This is
however not necessary to query multiple keyspaces since it is always possible to
use a single session with fully qualified table name in queries.
The Cassandra cluster is denoted as a ring. The idea behind this representation is
to show token distribution.
TECHNICAL PUBLICATIONS® ~ an up-thrust for knowledgeaie NoSOL Date Managem, |
iace to write the date to. The task oy
‘sczs) that are responsible to hold g.
strategy to do that.
ctlizes 2 replication
axe idected, # sends the RowMutetion message to §
- hace nodes, but # does not weit for all the reptie,
the node waits
west individually. Every node Gist writes the
cad Gen write: the mutztion to the memiable
. Sls is Gokhed to an immmnteble structure celled as SSTable (Sorted
Sting Table). The commit log is used for playback purposes in case data
Som the memizble is lost due to node feilure.
EZ] Two Marks Questions with Answers
az What is consistency In a distributed system ?
Ans. stem, consistency will be defined as one that responds with
| the seme output for the same request at the same time across all the replicas.
z Hbuting 2 single dataset across multiple databases,
which can then be stored on multiple machines. This allows for larger datasets to be
s and stored in multiple deta nodes, increasing the total storage
TECHNICAL PUBLICATIONS” - an up-thrust for bmoutedgeQ6 What are write-write and read-erite conficts 7
Ans. : Write-write confiicts ocr when tro diews =
middle of another dien!’s write,
Q7 Define Cassandra.
Cessendra is a peerto-peer distributed system made up
any nods can accept 2 read or write request.
Qs What is the use of Bloom filters in Cassandra 7
Ans. : Bloom filters are used as 2 performance booster: Bloom flies ae vey Sst |
nondeterministic algorithms for testing whether an clement is 2 member of 2 set They
are nondeterministic because it is possible to get 2 telsepositive read fom 2 Bloom
filter, but not 2 false-negative. Bloom filtess work by mapping the values ina doe
into a bit array and condensing a larger data set into 2 digest sting Bloom filter is 2 |
special kind of cache.
Q3 Explain sorted strings table.
Ans. : Sorted strings table which is 2 fle format used by Cassandra to store the statics
and the data from the memtzbles. The Cassendre $STables are ixmutzble hence any
update on the table creates 2 new SSTables fle The dete structie focmat wed by
SSTables is Log-Structured Mezge which is qualified for writing intense heavy dete sets
compared to the traditional B tree structure.
TECHNICAL PUBLICATIONS? - en ustteust for Imontedgeig Dots Analytics oo NoSOL Data Managemen,
Q.10 Explain Cassandra data center.
Ans. : Cassandra data center is the collection of related nodes which are configured in
| the cluster to perform the replication. The data centers can be physical data centers o,
| logical data center and depending upon the workload a separate data center can be
| used,
|
| Qi1 Explain advantages and disadvantages of graph data.
| Ans. : © Advantages of graph data:
| 2) More descriptive queries
| b) Greater flexibility in adapting your model
<) Greater performance when traversing data relationships.
| © Disadvantages of graph data stores :
| 2) Difficult to scale,
j b) No standard language.
| Q32 Describe session consistency.
| Ans. : Session consistency means read-your-writes consistency but at session level.
| Session can be identified with 2 conversation between a client and a server. As long as
the conversation continues, we will read everything we have written during this
conversation. If the session ends and we start another session with the same server,
there is no guarantee that we can read values we have written during previous
conversation.
Qi3 What are schemaless databases 7
Ans. : Schemaless databases are a type of NoSQL databases that do not have a
predefined schema or structure for data. This means that data can be inserted and
ved without adhering to a specific structure and the database can adapt to
changes in date over time without requiring schema migrations or changes.
qaa
+ an up-thrust for knowledge