Principles of Distributed Database
Systems
                                           TS. Phan Thị Hà
                                           Khoa CNTT. PTIT
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                             1
Thị Hà_PTIT
Outline
◼    Introduction
◼    Distributed and Parallel Database Design
◼    Distributed Data Control
◼    Distributed Query Processing
◼    Distributed Transaction Processing
◼    Data Replication
◼    Database Integration – Multidatabase Systems
◼    Parallel Database Systems
◼    Peer-to-Peer Data Management
◼    Big Data Processing
◼    NoSQL, NewSQL and Polystores
◼    Web Data Management
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                    2
Thị Hà_PTIT
Outline
◼    Introduction
       ❑   What is a distributed DBMS
       ❑   History
       ❑   Distributed DBMS promises
       ❑   Design issues
       ❑   Distributed DBMS architecture
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           3
Thị Hà_PTIT
Distributed Computing
◼    A number of autonomous processing elements (not
     necessarily homogeneous) that are interconnected by a
     computer network and that cooperate in performing their
     assigned tasks.
◼    What is being distributed?
       ❑   Processing logic
       ❑   Function
       ❑   Data
       ❑   Control
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                               4
Thị Hà_PTIT
Current Distribution – Geographically
Distributed Data Centers
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           5
Thị Hà_PTIT
What is a Distributed Database System?
A distributed database is a collection of multiple, logically
interrelated databases distributed over a computer network
A distributed database management system (Distributed
DBMS) is the software that manages the DDB and provides
an access mechanism that makes this distribution
transparent to the users
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                                6
Thị Hà_PTIT
What is not a DDBS?
◼    A timesharing computer system
◼    A loosely or tightly coupled multiprocessor system
◼    A database system which resides at one of the nodes of
     a network of computers - this is a centralized database
     on a network node
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                               7
Thị Hà_PTIT
Distributed DBMS Environment
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           8
Thị Hà_PTIT
Implicit Assumptions
◼    Data stored at a number of sites → each site logically
     consists of a single processor
◼    Processors at different sites are interconnected by a
     computer network → not a multiprocessor system
       ❑   Parallel database systems
◼    Distributed database is a database, not a collection of
     files → data logically related as exhibited in the users’
     access patterns
       ❑   Relational data model
◼    Distributed DBMS is a full-fledged DBMS
       ❑   Not remote file system, not a TP system
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                                 9
Thị Hà_PTIT
Important Point
                                      Logically integrated
                                              but
                                     Physically distributed
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                              10
Thị Hà_PTIT
Outline
◼    Introduction
       ❑
       ❑   History
       ❑
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           11
Thị Hà_PTIT
History – File Systems
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           12
Thị Hà_PTIT
History – Database Management
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           13
Thị Hà_PTIT
History – Early Distribution
Peer-to-Peer (P2P)
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           14
Thị Hà_PTIT
History – Client/Server
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           15
Thị Hà_PTIT
History – Data Integration
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           16
Thị Hà_PTIT
History – Cloud Computing
On-demand, reliable services provided over the Internet in
a cost-efficient manner
◼ Cost savings: no need to maintain dedicated compute
   power
◼ Elasticity: better adaptivity to changing workload
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                             17
Thị Hà_PTIT
Data Delivery Alternatives
◼    Delivery modes
       ❑   Pull-only
       ❑   Push-only
       ❑   Hybrid
◼    Frequency
       ❑   Periodic
       ❑   Conditional
       ❑   Ad-hoc or irregular
◼    Communication Methods
       ❑   Unicast
       ❑   One-to-many
◼    Note: not all combinations make sense
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                             18
Thị Hà_PTIT
Outline
◼    Introduction
       ❑
       ❑   Distributed DBMS promises
       ❑
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           19
Thị Hà_PTIT
Distributed DBMS Promises
 Transparent management of distributed, fragmented,
 and replicated data
 Improved reliability/availability through distributed
 transactions
 Improved performance
 Easier and more economical system expansion
© 2020, M.T. Özsu & P. Valduriez TS. Pan
Thị Hà_PTIT
Transparency
◼    Transparency is the separation of the higher-level
     semantics of a system from the lower level
     implementation issues.
◼    Fundamental issue is to provide
                  data independence
     in the distributed environment
       ❑   Network (distribution) transparency
       ❑   Replication transparency
       ❑   Fragmentation transparency
            ◼ horizontal fragmentation: selection
            ◼ vertical fragmentation: projection
            ◼ hybrid
© 2020, M.T. Özsu & P. Valduriez TS. Pan
Thị Hà_PTIT
Example
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           22
Thị Hà_PTIT
  Transparent Access
                                                                     Tokyo
SELECT    ENAME,SAL
FROM      EMP,ASG,PAY                             Boston                         Paris
WHERE     DUR > 12                                                                    Paris projects
                                                                                      Paris employees
AND       EMP.ENO = ASG.ENO                                      Communication        Paris assignments
                                                                   Network            Boston employees
AND       PAY.TITLE = EMP.TITLE
                                            Boston projects
                                            Boston employees
                                            Boston assignments
                                                                                 Montreal
                                                          New
                                                                                 Montreal projects
                                                          York                   Paris projects
                                                    Boston projects              New York projects
                                                    New York employees             with budget > 200000
                                                    New York projects            Montreal employees
                                                    New York assignments         Montreal assignments
 © 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                                                                     23
 Thị Hà_PTIT
Distributed Database - User View
                                           Distributed Database
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                                  24
Thị Hà_PTIT
Distributed DBMS - Reality
                                             User
                                             Query
                                           User
                         DBMS
                                           Application
                         Software
                                                                      DBMS
                                                                      Software
                     DBMS                      Communication
                     Software                   Subsystem
                                                                    User
                         DBMS                 User                  Application
                         Software             Query
                                                         DBMS
                                                         Software
                 User
                 Query
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                                                  25
Thị Hà_PTIT
Types of Transparency
◼    Data independence
◼    Network transparency (or distribution transparency)
       ❑   Location transparency
       ❑   Fragmentation transparency
◼    Fragmentation transparency
◼    Replication transparency
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                           26
Thị Hà_PTIT
    Reliability Through Transactions
◼     Replicated components and data should make distributed
      DBMS more reliable.
◼     Distributed transactions provide
      Concurrency transparency
       ❑
   ❑ Failure atomicity
• Distributed transaction support requires implementation of
   ❑ Distributed concurrency control protocols
   ❑ Commit protocols
◼     Data replication
       ❑   Great for read-intensive workloads, problematic for updates
       ❑   Replication protocols
    © 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                                         27
    Thị Hà_PTIT
Potentially Improved Performance
◼    Proximity of data to its points of use
       ❑   Requires some support for fragmentation and replication
◼    Parallelism in execution
       ❑   Inter-query parallelism
       ❑   Intra-query parallelism
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                                     28
Thị Hà_PTIT
Scalability
◼    Issue is database scaling and workload scaling
◼    Adding processing and storage power
◼    Scale-out: add more servers
       ❑   Scale-up: increase the capacity of one server → has limits
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                                        29
Thị Hà_PTIT
Outline
◼    Introduction
       ❑
       ❑   Design issues
       ❑
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           30
Thị Hà_PTIT
Distributed DBMS Issues
◼    Distributed database design
       ❑   How to distribute the database
       ❑   Replicated & non-replicated database distribution
       ❑   A related problem in directory management
◼    Distributed query processing
       ❑   Convert user transactions to data manipulation instructions
       ❑   Optimization problem
             ◼   min{cost = data transmission + local processing}
       ❑   General formulation is NP-hard
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                                         31
Thị Hà_PTIT
Distributed DBMS Issues
◼    Distributed concurrency control
       ❑   Synchronization of concurrent accesses
       ❑   Consistency and isolation of transactions' effects
       ❑   Deadlock management
◼     Reliability
       ❑   How to make the system resilient to failures
       ❑   Atomicity and durability
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                                32
Thị Hà_PTIT
Distributed DBMS Issues
◼    Replication
       ❑   Mutual consistency
       ❑   Freshness of copies
       ❑   Eager vs lazy
       ❑   Centralized vs distributed
◼    Parallel DBMS
       ❑   Objectives: high scalability and performance
       ❑   Not geo-distributed
       ❑   Cluster computing
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                          33
Thị Hà_PTIT
Related Issues
◼    Alternative distribution approaches
       ❑   Modern P2P
       ❑   World Wide Web (WWW or Web)
◼    Big data processing
       ❑   4V: volume, variety, velocity, veracity
       ❑   MapReduce & Spark
       ❑   Stream data
       ❑   Graph analytics
       ❑   NoSQL
       ❑   NewSQL
       ❑   Polystores
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                     34
Thị Hà_PTIT
Outline
◼    Introduction
       ❑
       ❑   Distributed DBMS architecture
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           35
Thị Hà_PTIT
DBMS Implementation Alternatives
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           36
Thị Hà_PTIT
Dimensions of the Problem
◼    Distribution
       ❑   Whether the components of the system are located on the same machine or
           not
◼    Heterogeneity
       ❑   Various levels (hardware, communications, operating system)
       ❑   DBMS important one
             ◼   data model, query language,transaction management algorithms
◼    Autonomy
       ❑   Not well understood and most troublesome
       ❑   Various versions
             ◼   Design autonomy: Ability of a component DBMS to decide on issues related to its
                 own design.
             ◼   Communication autonomy: Ability of a component DBMS to decide whether and
                 how to communicate with other DBMSs.
             ◼   Execution autonomy: Ability of a component DBMS to execute local operations in
                 any manner it wants to.
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                                                               37
Thị Hà_PTIT
Client/Server Architecture
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           38
Thị Hà_PTIT
Advantages of Client-Server
Architectures
◼    More efficient division of labor
◼    Horizontal and vertical scaling of resources
◼    Better price/performance on client machines
◼    Ability to use familiar tools on client machines
◼    Client access to remote data (via standards)
◼    Full DBMS functionality provided to client workstations
◼    Overall better system price/performance
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                               39
Thị Hà_PTIT
Database Server
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           40
Thị Hà_PTIT
Distributed Database Servers
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           41
Thị Hà_PTIT
Peer-to-Peer Component Architecture
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           42
Thị Hà_PTIT
MDBS Components & Execution
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           43
Thị Hà_PTIT
Mediator/Wrapper Architecture
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           44
Thị Hà_PTIT
Cloud Computing
On-demand, reliable services provided over the Internet in
a cost-efficient manner
◼ IaaS – Infrastructure-as-a-Service
◼ PaaS – Platform-as-a-Service
◼ SaaS – Software-as-a-Service
◼ DaaS – Database-as-a-Service
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                                             45
Thị Hà_PTIT
Simplified Cloud Architecture
© 2020, M.T. Özsu & P. Valduriez TS. Pan
                                           46
Thị Hà_PTIT