BIG DATA ANALYTICS     niketamoda@gmail.com
      UNIT 3: NO SQL
What is NoSQL?
Not Only SQL
A versatile family of database technologies engineered to address
specialized data challenges that traditional relational database
models cannot efficiently handle.
Schema-less Design
Dynamic data structures that eliminate the need for predefined
schemas, allowing developers to iterate quickly and adapt seamlessly
to evolving business requirements.
Horizontal Scalability
Architected to distribute workloads across multiple commodity servers,
enabling cost-effective scaling for high-volume applications without requiring
expensive hardware upgrades.
Evolution of Data Storage
     1970s-1990s
     Relational databases revolutionized enterprise computing with structured
     SQL queries, ACID transactions, and normalized data models that became
     industry standard.
     2000s
     The explosive growth of web applications strained traditional RDBMS
     capabilities, revealing scalability challenges as data volumes grew
     exponentially and global distribution became essential.
     2010s-Present
     NoSQL databases transformed the landscape with specialized solutions for
     diverse workloads, offering horizontal scalability, flexible schemas, and
     distributed architectures optimized for cloud environments.
Making Sense of NoSQL: A guide for managers and the rest of us
 NOT ONLY SQL
A DBMS has the following advantages:
• ACID properties
• Designed for general-purpose use
• Strong consistency, concurrency, recovery
• Standard query language (SQL)
ACID PROPERTIES
1) Atomicity:
A transaction is treated as a single unit: either all of its operations are executed
completely or none of them is executed at all.
An operation must not stop partway through or be applied partially.
If any step of a transaction fails, the changes made by its earlier steps are rolled back.
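The all-or-nothing behaviour described above can be sketched with Python's built-in sqlite3 module. The accounts table, the names A and B, and the transfer() helper are hypothetical and chosen only for illustration.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Abort: the debit above must not survive on its own.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

ok = transfer(conn, "A", "B", 30)       # both updates are applied
failed = transfer(conn, "A", "B", 500)  # the partial debit is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer the balances are the same as before it started, which is the behaviour atomicity requires.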
ACID PROPERTIES
2) Consistency:
Consistency means that the integrity of the data is always preserved.
In a DBMS, every change must leave the database in a valid state that satisfies all
defined rules and constraints.
A transaction therefore takes the database from one consistent state to another, so the
database is consistent both before and after the transaction.
The data should always be correct.
ACID PROPERTIES
3) Isolation:
The term 'isolation' means separation.
In a DBMS, isolation is the property that transactions running concurrently do not
affect one another.
The result should be the same as if one transaction began only after the previous
one had completed.
If two transactions operate on the same data at the same time, neither should see or
overwrite the intermediate values of the other.
When two or more transactions occur simultaneously, the consistency of the database
must be maintained.
ACID PROPERTIES
4) Durability:
Durability ensures permanence.
In a DBMS, durability means that once an operation has executed successfully, its
changes become permanent in the database.
Committed data must survive even if the system fails or crashes.
If data is lost in a failure, it is the responsibility of the recovery manager to
restore it and so ensure the durability of the database. Changes become permanent only
when they are committed, so the COMMIT command must be issued after making changes.
 NOT ONLY SQL
Need
Data requirements in real-world applications
change frequently.
Large amounts of data are readily available with Facebook,
Google, Twitter, etc.
This data includes user information,
social graphs, geographic locations, etc.
To provide quality services to the
user, we must be able to use the relevant
technology that can operate on this
data.
NoSQL Key Characteristics
     Horizontal Scalability
     Seamlessly expand capacity by adding more servers to the cluster
     Distributed Architecture
     Intelligently partitions data across multiple nodes for redundancy and
     performance
     Schema Flexibility
     Dynamically evolve data models without downtime or migrations
NoSQL Business Drivers Overview
     Volume
     Scale effortlessly to accommodate petabytes of structured and unstructured data
     Velocity
     Ingest and analyze real-time data streams with millisecond response times
     Variability
     Flexibly store and process heterogeneous data types without fixed schemas
     Agility
     Quickly adapt database structures to evolving business requirements and market
     demands
Business Driver: Volume
     2.5 EB: Daily Data Creation
     Generated globally each day, equivalent to 2.5 quintillion bytes
     40 ZB: Projected Data Volume
     Global data sphere will reach 40 zettabytes by 2025
     1B+: Active Users
     Generating content across major social platforms every minute
Business Driver: Velocity
     Real-time Processing
     Immediate analysis and action on streaming data without batching delays
     Low Latency
     Sub-millisecond response times enabling critical business decisions
     Continuous Ingestion
     Seamlessly handling billions of events per second across distributed
     systems
Business Driver: Variability
                         Structured
                         Rigidly organized JSON and XML documents with well-defined fields, consistent schemas, and predictable nested hierarchies.
                         Semi-structured
                         Flexible data formats with dynamic schemas, optional fields, and extensible attribute sets that adapt to changing business requirements.
                         Unstructured
                         Raw content like text documents, media files, and IoT sensor streams that lack inherent organization but contain valuable insights when properly analyzed.
Business Driver: Agility
 Rapid Development
 Schema-less architecture eliminates migration bottlenecks, accelerating development cycles.
 Development teams can implement new features with greater autonomy and reduced database
 dependencies.
    A/B Testing
    Seamlessly maintain multiple concurrent data models without structural conflicts.
    Evaluate performance metrics across different implementation strategies with minimal overhead.
        MVP Launches
        Accelerate time-to-market with flexible, evolution-ready data structures.
        Continuously refine products based on customer feedback without database architecture
        constraints.
Business Scalability with NoSQL
   Cloud-Native Architecture
   Engineered specifically for cloud environments with auto-scaling capabilities that
   respond to traffic demands in real-time.
   Horizontal Scaling
   Seamlessly distribute workloads across additional servers to achieve linear
   performance growth without application downtime.
   Cost-Effective Operations
   Optimize expenditure with precise resource allocation that automatically scales
   up during peak periods and down during low-demand intervals.
 NOT ONLY SQL
Advantage of NoSQL
•   Good resource scalability
•   Lower operational cost
•   Supports semi-structured data
•   No static schema
•   Supports distributed computing
•   Faster data processing
•   Relatively simple data model
Disadvantage of NoSQL
• Not a defined standard
• Limited query capabilities
 CAP / BREWER’S THEOREM
• It is very important to understand the limitations of
  NoSQL databases.
• A distributed NoSQL database cannot guarantee both consistency
  and high availability while the network is partitioned.
• This was first expressed by Eric Brewer in the CAP
  Theorem.
• The CAP theorem states that a database can achieve at most
  two out of three guarantees:
  Consistency, Availability and Partition Tolerance.
 CAP / BREWER’S THEOREM
• Consistency means that all nodes in the network see
  the same data at the same time.
• Availability is a guarantee that every request receives
  a response about whether it was successful or failed.
• Partition Tolerance is a guarantee that the system
  continues to operate despite arbitrary message loss or
  failure of part of the system.
• In other words, even if there is a network outage in
  the data center and some of the computers are
  unreachable, still the system continues to perform.
 CAP / BREWER’S THEOREM
• Out of these three guarantees, no system can provide
  more than two at the same time.
• Since network partitions cannot be ruled out in a distributed
  system, the tradeoff is
  always between consistency and availability.
• A single-machine RDBMS provides consistency but not
  partition tolerance.
• MongoDB, HBase and Redis are usually classified as providing
  Consistency and Partition tolerance.
• CouchDB, Cassandra and Dynamo favour availability and partition
  tolerance; they offer eventual rather than strong consistency.
 CAP / BREWER’S THEOREM
• Let us take a look at various scenarios or architectures of systems
  to better understand the CAP theorem.
• The first one is an RDBMS, where reading and writing of data
  happen on the same machine.
• Such systems are consistent but not partition tolerant because if
  this machine goes down, there is no backup.
• If one user is modifying a record, others have to wait, thus
  compromising high availability.
 CAP / BREWER’S THEOREM
• The second diagram is of a system which has two machines.
• Only one machine can accept modifications while the reads
  can be done from all machines.
• In such systems, the modifications flow from that one machine
  to the rest.
• Such systems are highly available as there are multiple
  machines to serve.
• Such systems are partition tolerant because if one machine
  goes down, there are other machines available to take up that
  responsibility.
 CAP / BREWER’S THEOREM
• Since it takes time for the data to reach the other machines from
  node A, the other machines may be serving older data.
• This causes inconsistency, though the data is eventually going
  to reach all machines and after a while things are going to
  be okay.
• Therefore we call such systems eventually consistent instead of
  strongly consistent.
• This kind of architecture is found in Zookeeper and
  MongoDB.
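The stale read described above can be sketched in a few lines. This is a toy in-memory model, not how any particular database is implemented; the Cluster class and its sync() step are assumptions made for illustration.

```python
class Node:
    """One machine holding a copy of the data."""
    def __init__(self):
        self.data = {}

class Cluster:
    """Node A (primary) accepts writes; replicas only catch up on sync()."""
    def __init__(self, replica_count):
        self.primary = Node()
        self.replicas = [Node() for _ in range(replica_count)]
        self.pending = []  # writes not yet propagated to the replicas

    def write(self, key, value):
        self.primary.data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        # Reads are served by a replica, which may lag behind the primary.
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        # Propagate all pending writes, in order, to every replica.
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()

cluster = Cluster(replica_count=2)
cluster.write("price", 10)
cluster.sync()
cluster.write("price", 12)
stale = cluster.read("price", 0)  # replica still serves the older value
cluster.sync()
fresh = cluster.read("price", 0)  # replica has caught up
```

Between the second write and the second sync() the replica returns the old value; once propagation finishes, all copies agree again.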
 CAP / BREWER’S THEOREM
• In the third design of any storage system, we have one machine
  similar to our first diagram along with its backup.
• Every new change or modification at A in the diagram is
  propagated to the backup machine B.
• There is only one machine which is interacting with the readers
  and writers.
• So, it is consistent but not highly available. If A goes down, B
  can take A's place. Therefore this system is partition tolerant.
• Examples of such systems are HDFS with its secondary
  Namenode and even relational databases with a regular
  backup.
 COMPARISON BETWEEN SQL AND
 NOSQL
NoSQL is a non-relational DBMS
It does not require a fixed schema, avoids joins, and
is easy to scale.
NoSQL database is used for distributed data stores
with huge data storage needs.
NoSQL is used for Big data and real-time web apps.
For example, companies like Twitter, Facebook,
Google that collect terabytes of user data every
single day.
 COMPARISON BETWEEN SQL AND
 NOSQL
Traditional RDBMS uses SQL syntax to store and
retrieve data for further insights.
NoSQL database system can store structured,
semi-structured, unstructured data.
NoSQL databases became popular with Internet
giants like Google, Facebook, Amazon, etc. who deal
with huge volumes of data.
The system response time becomes slow when you
use RDBMS for massive volumes of data.
 COMPARISON BETWEEN SQL AND
 NOSQL
To resolve this problem, we could "scale up" our
systems by upgrading our existing hardware. This
process is expensive.
The alternative for this issue is to distribute
database load on multiple hosts whenever the load
increases. This method is known as "scaling out."
  COMPARISON BETWEEN SQL AND NOSQL                                                         May 2015
SQL                                                    NoSQL
Full form: Structured Query Language                   Full form: Not only SQL
Relational database                                    Non-relational database
SQL is a declarative query language                    Non-declarative query language
SQL databases work on ACID properties:                 NoSQL follows the CAP theorem:
Atomicity, Consistency, Isolation, Durability          Consistency, Availability, Partition Tolerance
Structured and organized data                          Unstructured and replicable data
Relational database tables are used                    Key-value pair storage, Column Store, Document Store, Graph Database
Tightly consistent                                     Eventually consistent
MySQL, Oracle, MS SQL, PostgreSQL, SQLite, DB2         MongoDB, Big Table, Neo4j, CouchDB, Cassandra, HBase
 NOSQL DATABASES TIME LINE
• 1998- Carlo Strozzi uses the term NoSQL for his lightweight, open-source
  relational database
• 2000- Graph database Neo4j is launched
• 2004- Google BigTable is launched
• 2005- CouchDB is launched
• 2007- The research paper on Amazon Dynamo is released
• 2008- Facebook open sources the Cassandra project
• 2009- The term NoSQL was reintroduced
 FEATURES OF NOSQL
Non Relational
• NoSQL databases do not follow the
  relational model
• Do not provide tables with flat fixed-column
  records
• Work with self-contained aggregates or
  BLOBs (Binary Large Objects)
• Do not require object-relational mapping
  and data normalization
• Typically omit complex features like query languages,
  query planners, referential integrity, joins,
  ACID
 FEATURES OF NOSQL
Schema-Free
• NoSQL databases are either schema-free or
  have relaxed schemas
• Do not require any sort of definition of the
  schema of the data
• Offers heterogeneous structures of data in
  the same domain
 FEATURES OF NOSQL
There are mainly four categories of NoSQL
databases.
Each of these categories has its unique
attributes and limitations.
No specific database is better to solve all
problems.
You should select a database based on your
product needs.
DIFFERENT ARCHITECTURAL PATTERNS ON NOSQL
1. Key Value Pair Based
This is a very simple NoSQL database.
Data is stored in key/value pairs.
It is designed to store schema-free data.
Such data is stored along with an indexed key.
Example: Cassandra, DynamoDB, Azure Table Storage
                                                  IT DEC 2016
 DIFFERENT ARCHITECTURAL PATTERNS ON NOSQL
Use Case:
This type is generally used when you need
quick performance for basic Create, Read,
Update, Delete (CRUD) operations.
Example:
Storing and retrieving session information for
Web pages
Storing user profile and preferences
Storing shopping cart data for ecommerce
 DIFFERENT ARCHITECTURAL PATTERNS ON NOSQL
Limitations:
It may not work for complex queries
attempting to connect multiple relations.
If the data contains many-to-many relationships, a
key-value store is likely to show poor
performance.
 DIFFERENT ARCHITECTURAL PATTERNS ON NOSQL
Column-based Database
Column-oriented databases work on columns and
are based on the BigTable paper by Google.
They deliver high performance on aggregation
queries like SUM, COUNT, AVG, MIN etc. as the
data is readily available in a column.
Instead of storing data in relational tuples (rows),
it is stored in cells grouped in columns.
 DIFFERENT ARCHITECTURAL PATTERNS ON NOSQL
Example:
HBase, Hypertable, Big Table.
Use Cases:
It is used for storing blogs
 DIFFERENT ARCHITECTURAL PATTERNS ON NOSQL
Document-based database:
A document-oriented NoSQL DB builds on key-value storage, where each document
contains a lot of complex data.
The document is stored in JSON or XML format. The value is understood by the
DB and can be queried.
JSON (JavaScript Object Notation) is a lightweight data-interchange format.
Every document contains a unique key used to retrieve the document.
The key is used for storing, retrieving and managing document-oriented information.
  DIFFERENT ARCHITECTURAL PATTERNS ON NOSQL
Eg: Amazon SimpleDB, CouchDB, MongoDB and Lotus Notes are popular
document-oriented DBMS systems.
Use Cases:
Used for storing event logging information and online blogging.
Each document would contain information about the type of document, userid, post
content, timestamp etc.
Limitation:
It may be good for blogging but not good for aggregation.
 DIFFERENT ARCHITECTURAL PATTERNS ON NOSQL
Graph-Based
A graph database stores entities as well as the
relations amongst those entities.
An entity is stored as a node, with relationships as
edges.
An edge gives a relationship between nodes. Every
node and edge has a unique identifier.
Compared to a relational database, where tables are
loosely connected, a graph database is
multi-relational in nature.
 DIFFERENT ARCHITECTURAL PATTERNS ON NOSQL
Example:
Neo4J, Infinite Graph, OrientDB, FlockDB are some
popular graph-based databases.
Use cases:
A very important application is social networking sites,
where it can quickly locate friends and friends of friends.
Google Maps, useful for navigation and finding the
closest location.
 DIFFERENT ARCHITECTURAL PATTERNS ON NOSQL
              Database Model             Performance   Scalability Flexibility
              Key value store database   High          High        High
              Column store database      High          High        Moderate
              Document store database    High          Variable    High
              Graph database             Variable      Variable    High
     ADVANTAGES OF NOSQL
1. Growth of Big Data
   1. Big data is one of the main driving factors of
       NoSQL for business
   2. Web data has increased exponentially within the last
       two years
2. Continuous Availability of data
   1. Hardware failures are possible, but NoSQL is built
       on a distributed architecture which is robust.
   2. If a data node goes down we have the replication factor;
       if the name node goes down we have the secondary name
       node
3. Location Independent
   1. It is the ability to read from and write to the database from
       anywhere
     ADVANTAGES OF NOSQL
4. Flexible data models
   1. NoSQL has more flexible, schema-less data models
       compared to other databases
5. Better Architecture
   1. NoSQL has a more business-oriented architecture
       for a particular application
   2. Organizations migrate their data to NoSQL
       platforms, which allow them to maintain a very
       high volume of data
6. Analytics and Business Intelligence
   1. Extracting meaningful information from a very high
       volume of data is a very difficult task for an RDBMS
   2. Modern NoSQL provides integrated data analysis
       and better understanding of complex data sets,
       which facilitates flexible decision-making.
IMPLEMENTATION OF KEY
VALUE DATABASE
This is a very simple NoSQL database.
It is designed for storing schema-free data.
Each value is stored against an indexed key.
Key: 1     ID: 123                    First Name: Ganesh
Key: 2     Email: abc@gmail.com       Location: Mumbai    Pin: 401209
Key: 3     Facebook ID: xyz           Password: *******   Name: Tom
Working
The schema-less format of a key-value database suits storage needs where each record can have different fields.
The key can be auto-generated, while the value can be a string.
IMPLEMENTATION OF KEY
VALUE DATABASE
A key-value store uses a hash table in which there is a
unique key and a pointer to each data item.
A logical group of keys is known as a bucket.
Buckets improve performance because of the cache mechanism.
Read and write operations
• Get(key): returns the value associated with the key
• Multi-get(key1, key2, …, keyN): returns the list of values associated with
  the keys
• Put(key, value): associates the value with the key
• Delete(key): deletes the entry for the key from the data store
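The four operations above can be sketched with a Python dictionary, which is itself a hash table. The KeyValueStore class, its method names and the "users" bucket are hypothetical and only mirror the operations listed on the slide.

```python
class KeyValueStore:
    """Toy in-memory key-value store with named buckets."""

    def __init__(self):
        self._buckets = {}  # bucket name -> {key: value}

    def put(self, key, value, bucket="default"):
        # Associate the value with the key inside the given bucket.
        self._buckets.setdefault(bucket, {})[key] = value

    def get(self, key, bucket="default"):
        # Return the value for the key, or None if it is absent.
        return self._buckets.get(bucket, {}).get(key)

    def multi_get(self, keys, bucket="default"):
        # Return the values for several keys, in the order requested.
        return [self.get(k, bucket) for k in keys]

    def delete(self, key, bucket="default"):
        # Remove the entry for the key if it exists.
        self._buckets.get(bucket, {}).pop(key, None)

store = KeyValueStore()
store.put(1, {"ID": 123, "First Name": "Ganesh"}, bucket="users")
store.put(2, {"Email": "abc@gmail.com", "Location": "Mumbai", "Pin": 401209},
          bucket="users")
first = store.get(1, bucket="users")
both = store.multi_get([1, 2], bucket="users")
store.delete(1, bucket="users")
after_delete = store.get(1, bucket="users")
```

Note that the two values have different fields; the store never inspects them, which is what schema-free means here.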
Alex Holmes, Hadoop in Practice 2nd Edition Page 28
 COLUMN STORE DATA
Instead of storing the data in rows, it is stored in cells
grouped in columns.
It offers high performance and high scalability.
Working:
• In a column-oriented NoSQL database, data is stored
  in cells grouped in columns rather than rows
• Reads and writes are done using columns
• It offers fast search, access and aggregation
  of data.
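A minimal sketch of why a column layout helps aggregation: once the values of one column sit together, SUM, COUNT and AVG touch only that column. The sample records and the to_columns() helper are made up for illustration and do not reflect the storage format of any specific product.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "city": "Mumbai", "amount": 250},
    {"id": 2, "city": "Pune", "amount": 100},
    {"id": 3, "city": "Mumbai", "amount": 400},
]

def to_columns(records):
    """Regroup row records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# Aggregations read a single column and ignore the others.
total = sum(columns["amount"])
count = len(columns["amount"])
average = total / count
```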
 COLUMN STORE DATA
Data Model:
• Column Family: Single structure that can group
  Columns
• Example:
   • HBase, BigTable, Hypertable
 COLUMN STORE DATA
Document Based:
• It is based on the concept of a key-value store where
  “documents” contain a lot of complex data.
• Every document contains a unique key used to
  retrieve the document
• The key is used for managing, storing and retrieving
  document-oriented information
Working:
This type of database is a collection of key-value pairs where
the value is a compressed document
JSON and XML are commonly used document formats
Eg: MongoDB, CouchDB
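The idea above can be sketched as a key-value store whose values are JSON documents that the store can look inside. The DocumentStore class, the post keys and the field names are hypothetical; compression is left out to keep the sketch short.

```python
import json

class DocumentStore:
    """Toy document store: unique key -> JSON document."""

    def __init__(self):
        self._docs = {}  # key -> JSON text

    def insert(self, key, document):
        # Serialise the document to JSON and store it under its key.
        self._docs[key] = json.dumps(document)

    def get(self, key):
        # Retrieve a document by its unique key.
        return json.loads(self._docs[key])

    def find(self, field, value):
        # Unlike a plain key-value store, the contents can be queried.
        return [
            key
            for key, text in self._docs.items()
            if json.loads(text).get(field) == value
        ]

posts = DocumentStore()
posts.insert("post-1", {"type": "blog", "userid": "u42",
                        "content": "Hello", "tags": ["intro"]})
posts.insert("post-2", {"type": "event", "userid": "u42", "level": "info"})
blog_keys = posts.find("type", "blog")
doc = posts.get("post-1")
```

The two documents have different fields, yet both live in the same store and can be found by a field value.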
 COLUMN STORE DATA
Graph Database:
• Data is stored in a graph: relationships
  are stored as links (edges), whereas each
  entity acts as a node.
Working:
A flexible graphical notation is used,
with edges and nodes.
Data can be easily transformed from one
model to another by using a graph-based
NoSQL database.
Eg: Neo4j, Polyglot
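A small sketch of nodes and edges, using the friends-of-friends lookup that social networks rely on. The Graph class, the FRIEND relationship label and the names are invented for illustration; a real graph database would use its own query language instead.

```python
from collections import defaultdict

class Graph:
    """Toy graph: entities are nodes, relationships are labelled edges."""

    def __init__(self):
        self.edges = defaultdict(set)  # node -> {(relationship, other node)}

    def add_edge(self, a, relationship, b):
        # Friendship is mutual here, so the edge is stored in both directions.
        self.edges[a].add((relationship, b))
        self.edges[b].add((relationship, a))

    def neighbours(self, node, relationship):
        return {other for rel, other in self.edges[node] if rel == relationship}

    def friends_of_friends(self, person):
        # Follow FRIEND edges two hops out, excluding direct friends and self.
        direct = self.neighbours(person, "FRIEND")
        second = set()
        for friend in direct:
            second |= self.neighbours(friend, "FRIEND")
        return second - direct - {person}

g = Graph()
g.add_edge("Asha", "FRIEND", "Ravi")
g.add_edge("Ravi", "FRIEND", "Meera")
g.add_edge("Meera", "FRIEND", "Tom")
suggestions = g.friends_of_friends("Asha")
```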
SQL CASE STUDIES
Amazon DynamoDB
Amazon has the largest e-commerce operations in the world.
Customers across the globe shop 24*7.
Initially Amazon used RDBMS systems for its shopping and checkout systems.
Amazon DynamoDB, a NoSQL database, brought a turning point.
DynamoDB addresses the core problems of RDBMS scalability and partition
tolerance.
Developers can store an unlimited amount of data by creating a database table.
DynamoDB saves the table across multiple servers.
DynamoDB is a key-value store NoSQL database.
Salient features of key-value store are as follows:
Scalable: If the application requirement changes, AWS management console can
SQL CASE STUDIES
Automated storage scaling: More storage can be obtained whenever
it is required.
Built-in fault tolerance: DynamoDB automatically replicates data across various
nodes
Flexible: DynamoDB has a schema-free format. Multiple data types can be used
Efficient Indexing: Every item is identified by a primary key. It allows secondary
indexes on non-key attributes.
Strong Consistency: DynamoDB supports strongly consistent reads (a read returns
the latest value); by default, reads are eventually consistent
Secure: DynamoDB uses cryptography to authenticate users
SQL CASE STUDIES
Google Big Table:
The motivation for developing BigTable was to achieve massive scalability, better
performance and the ability to run on commodity hardware.
The volume of Google data is generally in petabytes and is distributed over
100,000 nodes.
BigTable is a column-based NoSQL database.
SQL CASE STUDIES
MongoDB
MongoDB was designed by Eliot Horowitz
MongoDB was designed for building large-scale, high-availability, robust systems
MongoDB replaced the relational data model with a document-based one to
provide speed, agility and schema-less databases
MongoDB uses a document data model that stores data in JSON-like documents
   DATATYPES USED IN MONGO DB
Data Type      Description
String         String is the most commonly used datatype. It is used to store text. A string must be valid UTF-8 in
               MongoDB.
Integer        Integer is used to store a numeric value. It can be 32 bit or 64 bit depending on the server you are
               using.
Boolean        This datatype is used to store boolean values (true/false).
Double         Double datatype stores floating point values.
Min/Max Keys   This datatype is used to compare a value against the lowest and highest BSON elements.
Arrays         This datatype is used to store a list of multiple values under a single key.
Object         Object datatype is used for embedded documents.
Null           It is used to store null values.
Symbol         It is used like a string, but is generally reserved for languages that use a specific symbol type.
Date           This datatype stores the current date or time in Unix time format. You can specify your
               own date and time by creating a Date object and passing the day, month and year into it.
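As a rough illustration, a single document can mix most of the data types in the table. The sketch below is a plain Python dictionary with made-up field names; it does not connect to MongoDB, and the comments give the data type a Python driver would be expected to store each value as.

```python
from datetime import datetime, timezone

# Hypothetical user document; field names are chosen for illustration only.
document = {
    "name": "Ganesh",                               # String
    "age": 30,                                      # Integer
    "active": True,                                 # Boolean
    "rating": 4.5,                                  # Double
    "tags": ["nosql", "mongodb"],                   # Array
    "address": {"city": "Mumbai", "pin": 401209},   # Object (embedded document)
    "nickname": None,                               # Null
    "created": datetime(2024, 1, 15, tzinfo=timezone.utc),  # Date
}

# Map each field to the Python type that carries it.
python_types = {field: type(value).__name__ for field, value in document.items()}
```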
SQL CASE STUDIES
Neo4j
Neo4j is open source and sponsored by Neo Technology
It is a graph-based NoSQL database implemented in Java and Scala
Its development was started in 2003 and was made public in 2007
Neo4j is used by many organizations for scientific research, routing, matchmaking,
network management, recommendations, social networks, software analytics and
project management
“SHARED NOTHING”
ARCHITECTURE
A Shared Nothing Architecture is one
in which you have a number of nodes.
These nodes do not share resources like
memory or storage with one another.
On the other hand, an alternative
architecture shares every resource
when requested.
“SHARED NOTHING”
ARCHITECTURE
Advantages of shared nothing
architecture:
• easier scaling
• non-disruptive upgrades
• elimination of a single point of
  failure
• self-healing capabilities.
“SHARED NOTHING”
ARCHITECTURE
Scaling becomes simpler when things
such as disks are not shared.
For example, scaling up a single shared
disk to get more storage space can lead
to enormous problems if things do not
go well.
On the other hand, if you are using
several nodes that do not share the
space, scaling up the disk space
becomes quite a bit easier.
“SHARED NOTHING”
ARCHITECTURE
If the scaling should fail on one of the
resources, the others will still continue
to do their work normally.
“This architecture is followed by
essentially  all    high-performance,
scalable, DBMSs, including Teradata,
Netezza, Greenplum, as well as several
Morpheus integrations.
It is also used by most of the high-end
e-commerce       platforms,   including
Amazon, Akamai, Yahoo, Google, and
Facebook.”
“SHARED NOTHING”
ARCHITECTURE
Enables Non-disruptive Upgrades
Similar to the scaling advantages, you
can use shared nothing architecture to
perform non-disruptive upgrades to
your services.
Instead of having a certain amount of
downtime while you are upgrading an
infrastructure with shared resources,
you can upgrade a node at a time.
The redundancy in the other nodes will
continue to run so that you do not need
to shut everything down for the
amount of time it takes to perform the
“SHARED NOTHING”
ARCHITECTURE
Eliminates Single Point of Failure
With shared systems, a single point of
failure can take down your site or app
entirely.
As noted, the ability to have separate
systems on separate nodes with
redundancy can make things much
easier while avoiding the disaster of a
single failure causing unexpected
downtime.
“SHARED NOTHING”
ARCHITECTURE
Avoids Unexpected Downtime
Shared Nothing architecture allows for
some amount of self-healing that can
be another line of defense against
unexpected downtime.
For example, when you have
redundant copies of data or databases
on different disks, a disk that loses data
may be able to recover it when the
redundancies are synced.
“SHARED NOTHING”
ARCHITECTURE
Had it instead been a single, shared
disk, the data would be lost and
downtime would be indefinite.
As you can see, shared nothing
architecture can be very helpful.
DISTRIBUTION MODELS
A key strength of NoSQL is its ability to run databases on a large cluster.
This gives the ability to process greater read or write traffic, or more availability in the
face of network slowdowns or breakages.
There are two paths to data distribution: Replication and Sharding.
• Replication: Replication takes the same data and copies it over multiple nodes.
• Sharding: Sharding is a method for storing different parts of the data across multiple machines.
Replication comes in two forms:
• Master-slave
• Peer-to-peer.
SHARDING
In a busy data store different people are accessing different
parts of the dataset.
In these circumstances we can support horizontal
scalability by putting different parts of the data onto
different servers—a technique that’s called sharding.
The load is balanced out nicely between servers—for
example, if we have ten servers, each one only has to handle
10% of the load.
In order to do it we have to ensure that data that’s accessed
together is clumped together on the same node and that
these clumps are arranged on the nodes to provide the best
data access.
SHARDING
The first part of this question is how to clump the data up
so that one user mostly gets her data from a single server.
When it comes to arranging the data on the nodes, there
are several factors that can help improve performance.
If you know that most accesses of certain aggregates are
based on a physical location, you can place the data close to
where it’s being accessed. If you have orders for someone
who lives in Boston, you can place that data in your eastern
US data center.
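One common way to decide which server holds which data is to hash a key and take the remainder by the number of servers. The sketch below assumes ten shards and keys of the form user-N, both chosen only for illustration; keying by user keeps one user's data on a single server.

```python
import hashlib
from collections import Counter

def shard_for(key, num_shards=10):
    """Map a key to a shard number in a stable, evenly spread way."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Simulate 10,000 users and count how many land on each shard.
keys = [f"user-{i}" for i in range(10_000)]
load = Counter(shard_for(k) for k in keys)

# Share of the total load carried by each shard, as a fraction.
share = {shard: n / len(keys) for shard, n in sorted(load.items())}
```

Each shard ends up with roughly a tenth of the keys. Hashing spreads load evenly but ignores location, so placing Boston orders in an eastern US data center would need a location-aware rule instead.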
MASTER-SLAVE REPLICATION
With master-slave distribution, you replicate data
across multiple nodes.
One node is designated as the master, or primary.
This master is the authoritative source for the data
and is usually responsible for processing any updates
to that data. The other nodes are slaves, or
secondaries.
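A toy model of the arrangement described above: the master takes every update and copies it to the slaves, while reads are spread over the slaves. The class names are hypothetical, and propagation is done immediately to keep the sketch simple; real systems often replicate with a delay.

```python
class Replica:
    """One node holding a full copy of the data."""
    def __init__(self, name):
        self.name = name
        self.data = {}

class MasterSlaveCluster:
    def __init__(self, slave_count):
        self.master = Replica("master")
        self.slaves = [Replica(f"slave-{i}") for i in range(slave_count)]
        self._next = 0  # round-robin position for reads

    def write(self, key, value):
        # All updates go through the master, the authoritative source.
        self.master.data[key] = value
        for slave in self.slaves:
            slave.data[key] = value  # propagate the change to every slave

    def read(self, key):
        # Reads rotate over the slaves, which is where read scaling comes from.
        slave = self.slaves[self._next % len(self.slaves)]
        self._next += 1
        return slave.name, slave.data.get(key)

cluster = MasterSlaveCluster(slave_count=2)
cluster.write("stock", 5)
first_read = cluster.read("stock")
second_read = cluster.read("stock")
```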
PEER-TO-PEER REPLICATION
Master-slave replication helps with read scalability
but doesn’t help with scalability of writes.
Peer-to-peer replication attacks these problems by not
having a master.
All the replicas have equal weight, they can all accept
writes, and the loss of any of them doesn’t prevent
access to the data store.
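A toy model of peer-to-peer replication: any live peer accepts a write and shares it with the others, so losing one peer does not block access. The classes and the single shared version counter are simplifying assumptions; real systems must also resolve conflicts between concurrent writes, which this sketch does not model.

```python
class Peer:
    """One replica; all peers have equal weight."""
    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)
        self.alive = True

class PeerCluster:
    def __init__(self, names):
        self.peers = [Peer(n) for n in names]
        self.clock = 0  # simplistic global version counter

    def write(self, peer_index, key, value):
        peer = self.peers[peer_index]
        if not peer.alive:
            raise RuntimeError(f"{peer.name} is down")
        self.clock += 1
        for p in self.peers:  # share the write with every reachable peer
            if p.alive:
                p.data[key] = (self.clock, value)

    def read(self, key):
        # Any live peer can answer the read.
        for p in self.peers:
            if p.alive:
                entry = p.data.get(key)
                return entry[1] if entry else None
        raise RuntimeError("no peers available")

cluster = PeerCluster(["p1", "p2", "p3"])
cluster.write(0, "cart", ["book"])
cluster.peers[0].alive = False            # lose one peer
cluster.write(1, "cart", ["book", "pen"])  # another peer still accepts writes
latest = cluster.read("cart")
```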