Database Management System
Unit 9: Advanced Database
           Concepts
         Lecture 1
Unit 9: Advanced Database Concepts
9.1 Concept of Objet-Oriented and Distributed Database Model
9.2 Properties of Parallel and Distributed Databases
9.3 Concept of Data warehouse Database
9.4 Concept of Spatial Database
STONEBRAKER’S APPLICATION MATRIX
                     No Query                       Query
Complex Data
                       OODBMS                   ORDBMS
 Simple Data
                       File System                RDBMS
       Dr. Stonebraker claims that it is a matrix for classifying DBMS
        applications. Beautiful in its simplicity, the matrix has become
        part of the database culture and "knowledge.“
           Revised : STONEBRAKER’S DBMS MATRIX
• The primary error in Stonebraker's DBMS Matrix is the apparent
  assumption that ODBMSs do not have a query capability
                             MOTIVATION
• Relational model (70’s):
  • Clean andsimple.
  • Great for administrative and transactional data.
  • Not as good for other kinds of complex data (e.g., multimedia,
    networks,CAD).
• Object-Oriented models(80’s):
  • Complicated, but some influential ideas from Object Oriented
  • Complex data types.
• Idea: BuildDBMS based on OOmodel.
       Programming languages have evolved from Procedural to
              Object Oriented. So why not DBMSs ???
                 RELATIONAL MODEL
• Relations are the key concept, everything else isaround relations
• Primitivedata types, e.g., strings,integer, date, etc.
• Great normalization, query optimization, and theory
• Whatismissing??
   •   Handling of complexobjects
   •   Handling of complex data types
   •   Codeisnot coupled with data
   •   No inherence, encapsulation,etc.
      RELATIONAL MODEL OF A ‘CAT’
Relational database of a cat:
                                At query time, try toput
                                things together as you
                                        want !!!!
   OBJECT ORIENTED MODEL OF A ‘CAT’
Object-oriented database of a cat:   The first areas where ODBMS were widely used were:
                                         ! CASE :Computer aided software engineering
                                         ! CAD :Computer aided design
                                         ! CAM :Computer aided manufacture
                                     Increasingly now used in:
                                         ! telecommunications
                                         ! healthcare
                                         ! finance
                                         ! multimedia
                                         ! text/document/quality management
            TWO APPROACHES
• Object-Oriented Model (OODBMS)
 • Pure OO concepts
• Object-Relational Model (ORDBMS)
 • Extended relational model with OO concepts
 FIRST APPROACH: OBJECT-ORIENTED
                MODEL
• Relations are not the central concept, classes and objects are
  the main concept
• Object-Oriented DBMS(OODBMS) are DBMS based on an Object-
  Oriented Data Model inspired by OO programming languages
• Main Features:
  •   Powerful type system
  •   Classes
  •   Object Identity
  •   Inheritance
• OODBMS are capable of storing complex objects, I.e., objects
  that are composed of other objects, and/or multi-valued
  attributes.
 FEATURE 1: POWERFUL TYPE SYSTEM
• Primitive types
  • Integer, string, date, Boolean, float,etc.
• Structure type
  • Attribute can be a record with a schema
                              Struct {integer x, string y}
• Collection type
  • Attribute can be a Set, Bag, List, Array of other types
• Reference type
  • Attribute can be a Pointer to another object
              FEATURE 2: CLASSES
• A ‘class’ is in replacement of ‘relation’
• Same concept as in OO programming languages
  • All objects belonging to a same class share the same
    properties and behavior
• An ‘object’ can be thought of as ‘tuple’ (but richer
  content)
• Classes encapsulate data +methods +relationships
  • Unlike relations that contain data only
• In OODBMSs objects are persistency (unlikeOO
  programming languages)
     FEATURE 3: OBJECT IDENTITY
• OID is a unique identity of each object regardless of
  its content
  • Even if all attributes are the same, still objects have different
    OIDs
• Easier for references
• An object is made of twothings:
  • State: attributes (name, address, birthDate of a person)
  • Behaviour: operations (age of a person is computed from
    birthDate and currentdate)
          FEATURE 4: INHERITANCE
• A class can be defined in terms of
                                                  Person
  another one.                         name: {firstName: string,
                                         middleName: string,
                                         lastName: string}
                                       address: string
                                       birthDate: date
• Person is super-class and Student    age(): Integer
  is sub-class.
                                       changeAddress(newAdd: string)
• Student class inherits attributes               Student
  and operations of Person.            regNum: string {PK}
                                       major: string
                                       register(C: Course): boolean
 STANDARDS FOR OBJECT-ORIENTED
               MODEL
• ODMG: Object Data Management Group (1991)
  • provide a standard where previously there was none
  • support portability between products
  • standardize model, querying and programming issues
• Language of specifying the structure of object
  database
  • ODL: Object Definition Language
  • OQL: Object Query Language
• ODL is somehow similar to DDL (Data Definition
  Language) in SQL
      SECOND APPROACH: OBJECT-
          RELATIONAL MODEL
• Object-oriented model tries to bring the main
  concepts from relational model to the OO domain
 • The heart is OO concepts with some extensions
• Object-relational model tries to bring the main
  concepts from the OO domain to the relational
  model
 • The heart is the relational model with some extensions
 • Extensions through user-defined types
     CONCEPTUAL VIEW OF OBJECT-
         RELATIONAL MODEL
• Relation is still the fundamental structure
• Relational model extended with the following features
  • Type system with primitive and structure types (UDT)
    • Including set, bag, array, list collection types
    • Including structures like records
  • Methods
    • Special operations can be defined over the user-defined types
      (UDT)
    • Specialized operators for complex types, e.g., images, multimedia,
      etc.
  • Identifiers for tuples
    • Unique identifiers even for identical tuples
  • References
    • Several ways for references and de-references
             CONCEPTUAL VIEW OF OBJECT-
                 RELATIONAL MODEL
                                                  •    Allow of nestedrelations
                                                  •    Repeating movies inside the stars
                                                       records is redundancy
Star(name, address(street, city), birthdate,      •    To avoid redundancy, usepointers
     movies(title, year, length))                      (references)
                                               Movie
            Star                                                                           37
       SUPPORT FROM VENDORS
• Several major software companies including IBM,
  Informix, Microsoft, Oracle, and Sybase have all
  released object-relational versions of their products
• Extended SQL standards called SQL-99 or SQL3
                 Summary
• First Approach: Object-Oriented Model
 • Concepts from OO programming languages
 • ODL: Object Definition Language
   • OQL: Object Oriented Query Language
• Second Approach: Object-Relational Model
 • Conceptual view
 • Data Definition Language (Creating types, tables, and
   relationships)
 • Querying object-relational database (SQL-99)
                WHEN TO CONSIDER
                OODBMS OR ORDBMS
• Complex Relationships
  • A lot of many-to-many relationships, tree structures or network (graph)
    structures.
• Complex Data
  • Multi-dimensional arrays, nested structures, or binary data, images,
    multimedia, etc.
• Distributed Databases
  • Need for free objects without the rigid table structure.
• Repetitive use of Large Working Sets of Objects
  • To make use of inheritance and reusability
• Expensive Mapping Layer
  • Expensive decomposition of objects (normalization) and re-
    composition at querytime
          KEY BENEFITS OF ODBMS
• Persistence & Versioning
  • Created objects are maintained across different database runs
    (persistent)
  • Different evolving copies of the same object can be created over
    time (versioning)
                        PersistentObject Superclass
                        Approach
           •   Superclass encapsulates any class for
               storage and retrieval
           •   This superclass implements all
               functionalities of read/write
               operations
 KEY BENEFITS OF ODBMS (CONT’D)
• Sharing in highly distributed environment
  • Easier to share and distribute objects than tables
 KEY BENEFITS OF ODBMS (CONT’D)
• Better memory usage and less paging
 • Bringing only objects of interest
      ODBMS                       Relational DBMS
     OBJECT-ORIENTED VS. OBJECT-
             RELATIONAL
• Object-oriented DBMSs
 • Did not achieve much success (until now) in the market
   place
 • No query support(Indexing, optimization)
 • No security layer
• Object-relational DBMSs
 • Better support from big vendors
 • Tries to make use of all advances in RDBMSs
   • Indexes, views, triggers, query optimizations, security layer,etc.
   • Work in progress --- Long way to go
Distributed Database
Distributed Database System
• A distributed database (DDB) is a collection of multiple, logically
  interrelated databases distributed over a computer network.
• A distributed database management system (D–DBMS) is the
  software that manages the DDB and provides an access mechanism
  that makes this distribution transparent to the users.
• Distributed database system (DDBS) = DDB + D–DBMS
Centralized DBMS on a Network
Distributed DBMS Environment
Distributed Database
• Data is stored at several sites, each managed by a DBMS that can run
  independently
• Desired properties:
   • Distributed Data Independence
   • Distributed Transaction Atomicity
Distributed Data Independence
• Users should not have to know where data is located
   • – no need to know the locations of references relations, their copies or
     fragments
   • extends Physical and Logical Data Independence principles
• Queries spanning multiple sites should be optimized in a cost-based
  manner
   • taking into account communication costs and differences in local
     computation costs
Distributed Transaction Atomicity
• Users should be able to write transactions accessing multiple sites
  just like local transactions
• The effects of a transaction across sites should be atomic
   • all changes persist if transaction commits
   • none persist if transaction aborts
      Types of Distributed Databases
• Homogeneous:
  • Every site runs same type of DBMS
• Heterogeneous:
  – Different sites run different DBMSs
  – different RDBMSs or even non-relational DBMSs
              More on Heterogeneous
               Distributed Databases
• Database servers are accessed through well-accepted and
  standard Gateway protocols
   – masks the differences of DBMSs (capability, data format etc.)
   – e.g. ODBC, JDBC
• However, can be expensive and may not be able to hide all
  differences
   –    e.g. when a server is not capable of supporting distributed
       transaction management
                                                             Gateway
   DBMS1                   DBMS2                  DBMS3
  Distributed DBMS Architectures
• Three alternative approaches
1. Client-Server
2. Collaborating Server
3. Middleware
                   Client-Server Systems
• One or more client (e.g. personal computer) and one or more server processes
  (e.g. a mainframe)
    – A client process can ship a query to any server process
    – Clients are responsible for user interfaces
    – Server manages data and executes queries
• Advantages
    – clean separation and centralized server
    – expensive server machines are not underutilized by simple user interactions
    – users can run GUI on clients that they are familiar with
• Challenges
    – need to carefully handle communication costs
    – e.g. fetching tuples one at a time might be bad – need to do caching on client side
                        QUERY
                                   CLIENT               CLIENT
                                SERVER         SERVER            SERVER
    Collaborating Server Systems
• Queries can span multiple sites
  – not allowed in client-servers as the clients would have
    had to break queries and combine the results
• When a server receives a query that requires
  access to data at other servers
  – it generates appropriate subqueries
  – puts the result together
• Eliminates distinction between client and server
                                  SERVER
                    SERVER
                                  SERVER
                      QUERY
             Middleware Systems
• Allows a single query to span multiple servers
• But does not require all db servers to be capable of
  handling multi-site execution strategies
   – need just one db server capable of managing queries and
     transactions spanning multiple servers (called middleware)
   – the remaining servers can handle only the local queries and
     transactions
• The middleware layer is capable of executing joins and
  other operations on data obtained from other servers, but
  typically does not maintain any data
• Useful when trying to integrate several “legacy systems”
   – whose basic capabilities cannot be extended
Storing Data in a Distributed DBMS
• Relations are stored across several sites
• Accessing data at a remote site incurs message-
  passing costs
• To reduce this overhead, a single relation may be
  partitioned or fragmented across several sites
   – typically at sites where they are most often accessed
• The data can be replicated as well
   – when the relation is in high demand
                           Fragmentation
  • Break a relation into smaller relations or fragments
        – store them in different sites as needed
                                        TID
                                         t1
                                         t2
                                         t3
• Horizontal:                            t4
    – Usually disjoint
    – Can often be identified by a selection query (employees in a city – locality of
      reference)
    – To retrieve the full relation, need a union
• Vertical:
    –   Identified by projection queries
    –   Typically unique TIDs added to each tuple
    –   TIDs replicated in each fragments
    –   Ensures that we have a Lossless Join
                            Replication
• When we store several copies of a relation or relation fragments
    – can be replicated at one or more sites
    – e.g. R is fragmented into R1, R2, R3; one copy of R2, R3; but two copies
      at R1 at two sites
• Advantages
    – Gives increased availability – e.g. when a site or communication link goes
      down
    – Faster query evaluation – e.g. using a local copy
• Synchronous and Asynchronous (later)
    – Vary in how current different copies are when a relation is modified
                                       SITE A                          SITE B
                 R1         R3
                                                       R1         R2
   Distributed Catalog Management
• Must keep track of how data is fragmented and replicated across sites
    – in addition to usual schema, authorization, and statistical information
• Must be able to uniquely identify each replica of each fragment
    – Globally unique name may compromise autonomy of servers
    – To preserve local autonomy: Global relation name = <local-name, birth-
      site>
    – To identify a replica, add a replica-id field (now called global replica
      name)
• Site Catalog: Describes all objects (fragments, replicas) at a site +
  Keeps track of replicas of relations created at this site
    – To find a relation, look up its birth-site catalog
    – Birth-site never changes, even if relation is moved
        Peer-to-Peer Replication
• More than one of the copies of an object can be a
  master
• Changes to a master copy must be propagated to
  other copies somehow
• If two master copies are changed in a conflicting
  manner, conflict resolution needed
   – e.g., Site 1: Joe’s age changed to 35; Site 2: to 36
• Best used when conflicts do not arise:
   – E.g., Each master site owns a disjoint fragment
   – E.g., Updating rights held by one master at a time – then
     propagated to other sites
Advantages of DDBMS
• 1. Increased reliability and availability
   • A distributed database system is robust to failure to some extent.
     Hence, it is reliable when compared to a Centralized database system.
• 2. Local control
   • The data is distributed in such a way that every portion of it is local to
     some sites (servers). The site in which the portion of data is stored is
     the owner of the data.
• 3. Modular growth (resilient)
   • Growth is easier. We do not need to interrupt any of the functioning
     sites to introduce (add) a new site. Hence, the expansion of the whole
     system is easier. Removal of site is also does not cause much problems.
• 4. Lower communication costs (More Economical)
   • Data are distributed in such a way that they are available near to the
     location where they are needed more. This reduces the
     communication cost much more compared to a centralized system.
Advantages of DDBMS conti..
• 5. Faster response
   • – Most of the data are local and in close proximity to where they are
     needed. Hence, the requests can be answered quickly compared to a
     centralized system.
• 6. Reflects the organizational structure
   • Normally, database is fragmented into various locations wherever we
     have controls.
• 7. Secured management of distributed data
   • Various transparencies like network transparency, fragmentation
     transparency, and replication transparency are implemented to hide
     the actual implementation details of the whole distributed system. In
     such way, Distributed database provides security for data.
Advantages of DDBMS conti..
• 8. Robust
   • The system is continued to work in case of failures. For example,
     replicated distributed database performs in spite of failure of other
     sites.
• 9. Complied with ACID properties
   • Distributed transactions demands Atomicity, Consistency, Isolation,
     and Reliability.
• 10. Improved performance and Parallelism in executing
  transactions can be achieved.
Disadvantages of DDBMS
• 1. Complex Software
   • Complex implementation. Costs more in terms of software cost compared to a
     centralized system. Additional software might be needed in most of the cases over a
     centralized system.
• 2. Increased Processing overhead
   • It costs many messages to be shared between sites to complete a distributed
     transaction.
• 3. Data integrity
   • Data integrity becomes complex. Too much network resources may be used.
• 4. Different data formats might be used
   • This may cost time.
• 5. Deadlock is difficult to handle compared to a centralized system.
Disadvantages of DDBMS Conti..
    • 6. May cause much more network traffic in case of write operation
      in a replicated form of distributed database
    • 7. Distributed System supported Operating System is required to
      implement distributed database system
    • 8. The data shared between sites over networks are vulnerable to
      attack. Hence, network oriented security protocols to be used based
      on the sensitivity of data shared
    • 9. More complex in terms database design – According to various
      applications, we may need to fragment a database, or replicate a
      database or both
    • 10. Handling failures is a difficult task. In some cases, we may not
      distinguish site failure, network partition, and link failure.
Concepts of Data Warehouses
 Data Warehouses
• Stores static data that has been extracted from different
  databases (data sources) in an organization
    • Central source of data that has been cleaned,
      transformed, and cataloged
    • Data is used for data mining, analytical processing,
      analysis, research, decision support
• Data warehouses may be divided into data marts
    • Subsets of data that focus on specific aspects
      of a company (department or business process)
                             53
Data Warehouse Components
                  54
Applications and Data Marts
                     55
Data Warehouse : Characteristics
    •     Separate from operational databases
    •     Subject oriented: provides a simple, concise view on one or more
          selected areas, in support of the decision process
    •     Constructed by integrating multiple, heterogeneous data sources
    •     Contains historical data: spans a much longer time horizon than
          operational databases
    •     (Mostly) Read-Only access: periodic, infrequent updates
    •     Include metadata
        Data Warehousing is process of constructing and using data warehouses
        which requires data integration, data cleaning, and data consolidation.
                                                                                  56
Types of Data Warehouses
• Enterprise Warehouse: covers all areas of interest for an
  organization
• Data Mart: covers a subset of corporate-wide data that is of
  interest for a specific user group (e.g., marketing).
• Virtual Warehouse: offers a set of views constructed on demand on
  operational databases. Some of the views could be materialized
  (precomputed)
                                                                  57
Data Warehouse—Subject-Oriented
 • Organized around major subjects, such as customer, product, sales
 • Focusing on the modeling and analysis of data for decision makers, not on
   daily operations or transaction processing
 • Provide a simple and concise view around particular subject issues by
   excluding data that are not useful in the decision support process
                                                                               58
Data Warehouse—Integrated
 • Constructed by integrating multiple, heterogeneous data sources
    • relational databases, flat files, on-line transaction records
 • Data cleaning and data integration techniques are applied.
    • Ensure consistency in naming conventions, encoding structures, attribute measures, etc.
      among different data sources
        • E.g., Hotel price: currency, tax, breakfast covered, etc.
    • When data is moved to the warehouse, it is converted.
                                                                                            59
Data Warehouse—Time Variant
 • The time horizon for the data warehouse is significantly longer than that of
   operational systems
    • Operational database: current value data
    • Data warehouse data: provide information from a historical perspective (e.g., past 5-10
      years)
 • Every key structure in the data warehouse
    • Contains an element of time, explicitly or implicitly
    • But the key of operational data may or may not contain “time element”
                                                                                                60
Data Warehouse—Nonvolatile
 • A physically separate store of data transformed from the operational
   environment
 • Operational update of data does not occur in the data warehouse
   environment
    • Does not require transaction processing, recovery, and concurrency control mechanisms
    • Requires only two operations in data accessing:
        • initial loading of data and access of data
                                                                                          61
Data Mining
              62
Data Mining
• Process of discovering interesting patterns or knowledge from a
  (typically) large amount of data stored either in databases, data
  warehouses, or other information repositories
• Interesting: previously unknown, potentially useful
• Alternative names:knowledge discovery/extraction,
  information harvesting, business intelligence
• In fact, data mining is a step of the more general process of
  knowledge discovery in databases (KDD)
                                                                      63
Data Mining --2
    • Data in data warehouses are analyzed to reveal
      hidden patterns and trends
        • Market-basket analysis to identify new
          product bundles
        • Find root cause of qualify or manufacturing
          problems
        • Prevent customer attrition
        • Acquire new customers
        • Cross-sell to existing customers
        • Profile customers with more accuracy
                               64
65
Steps of a KDD Process
 • Learning the application domain:
    • relevant prior knowledge and goals of application
 • Creating a target data set: data selection
 • Data cleaning and preprocessing: (may take 60% of effort!)
 • Data reduction and transformation:
    • Find useful features, dimensionality/variable reduction, invariant
      representation.
 • Choosing functions of data mining
    • summarization, classification, regression, association, clustering.
 • Choosing the mining algorithm(s)
 • Data mining : search for patterns of interest
 • Pattern evaluation and knowledge presentation
    • visualization, transformation, removing redundant patterns, etc.
 • Use of discovered knowledge
                                                                            66
Why is Data Mining necessary?
• Make use of your data assets
• There is a big gap from stored data to knowledge; and the transition
  won’t occur automatically.
• Many interesting things you want to find cannot be found using
  database queries
   “find me people likely to buy my products”
   “Who are likely to respond to my promotion”
                                                                         67
Data Mining Applications
    • Credit ratings/targeted marketing:
       • Given a database of 100,000 names, which persons are the least likely
         to default on their credit cards?
       • Identify likely responders to sales promotions
    • Fraud detection:
       • Which types of transactions are likely to be fraudulent, given the
         demographics and transactional history of a particular customer?
    • Customer relationship management:
       • Which of my customers are likely to be the most loyal, and which are
         most likely to leave for a competitor?
       And Explore More!!!!
                                                                                 68
Self Study
• Parallel Database System
• Spatial Database
End of Unit 9
• Good Luck for your Examination!!