DDBMS Handbook (MSE 1)
DDBMS Handbook (MSE 1)
(2170714)
Hand Book
                        Year: 2020-21
CE & IT Department
        A database is an ordered collection of related data that is built for a specific purpose. A
        database may be organized as a collection of multiple tables, where a table represents a real
        world element or entity. Each table has several different fields that represent the characteristic
        features of the entity.
        For example, a company database may include tables for projects, employees, departments,
        products and financial records. The fields in the Employee table may be Name, Company_Id,
        Date_of_Joining, and so forth.
               MySQL
               Oracle
               SQL Server
               dBASE
               FoxPro
               PostgreSQL, etc.
        Database Schema
                A database schema is a description of the database which is specified during database
                design and subject to infrequent alterations.
               It defines the organizationof the data, the relationships among them, and the constraints
                associated with them.
               Databases are often represented through the three-schema architecture or ANSISPARC
                architecture. The goal of this architecture is to separate the user application from the
                physical database. The three levels are −
               Internal Level having Internal Schema − It describes the physical structure, details of
                internal storage and access paths for the database.
               Conceptual Level having Conceptual Schema − It describes the structure of the whole
                database while hiding the details of physical storage of data. This illustrates the entities,
                attributes with their data types and constraints, user operations and relationships.
               External or View Level having External Schemas or Views − It describes the portion of a
                database relevant to a particular user or a group of users while hiding the rest of database.
               Types of DBMS
       There are four types of DBMS.
           Hierarchical DBMS
       In hierarchical DBMS, the relationships among data in the database are established so that one
       data element exists as a subordinate of another. The data elements have parent-child relationships
       and are modelled using the “tree” data structure. These are very fast and simple.
           Relational DBMS
       In relational databases, the database is represented in the form of relations. Each relation models
       an entity and is represented as a table of values. In the relation or table, a row is called a tuple
       and denotes a single record. A column is called a field or an attribute and denotes a characteristic
       property of the entity. RDBMS is the most popular database management system.
               Distributed DBMS
           A distributed database is a set of interconnected databases that is distributed over the computer
           network or internet. A Distributed Database Management System (DDBMS) manages the
           distributed database and provides mechanisms so as to make the databases transparent to the
           users. In these systems, data is intentionally distributed among multiple nodes so that all
           computing resources of the organization can be optimally used.
M            operations on a DBMS
            The four basic operations on a database are Create, Retrieve, Update and Delete.
                   CREATE database structure and populate it with data − Creation of a database relation
                   involves specifying the data structures, data types and the constraints of the data to be
                   stored.
NAME VARCHAR2(25),
YEAR INTEGER,
               Once the data format is defined, the actual data is stored in accordance with the format in
               some storage medium.
Example SQL command to insert a single tuple into the student table −
               Example − To retrieve the names of all students of the Computer Science stream, the
               following SQL query needs to be executed −
               UPDATE information stored and modify database structure – Updating a table involves
               changing old values in the existing table’s rows with new values.
UPDATE STUDENT
               Modifying database means to change the structure of the table. However, modification of
               the table is subject to a number of restrictions.
               Example − To add a new field or column, say address to the Student table, we use the
               following SQL command −
               Example − To delete all students who are in 4th year currently when they are passing
               out, we use the SQL command −
Example − To remove the student table completely, the SQL command used is −
        This chapter introduces the concept of DDBMS. In a distributed database, there are a number of
        databases that may be geographically distributed all over the world. A distributed DBMS
        manages the distributed database in a manner so that it appears as one single database to users.
        In the later part of the chapter, we go on to study the factors that lead to distributed databases,
        its advantages and disadvantages.
Distributed DBMS
A distributed database is a collection of multiple interconnected databases, which are spread physically
across various locations that communicate via a computer network.
       1) Databases in the collection are logically interrelated with each other. Often they represent a
       single logical database.
       2) Data is physically stored across multiple sites. Data in each site can be managed by a DBMS
       independent of the other sites.
       3) The processors in the sites are connected via a network. They do not have any multiprocessor
       configuration.
A distributed database management system (DDBMS) is a centralized software system that manages a
distributed database in a manner as if it were all stored in a single location.
           Features
       1) It is used to create, retrieve, update and delete distributed databases.
       2) It synchronizes the database periodically and provides access mechanisms by the virtue of
       which the distribution becomes transparent to the users.
       4) It is used in application areas where large volumes of data are processed and accessed by
       numerous users simultaneously.
       Distributed Nature of Organizational Units − Most organizations in the current times are
       subdivided into multiple units that are physically distributed over the globe. Each unit requires
       its own set of local data. Thus, the overall database of the organization becomes distributed.
       Need for Sharing of Data − The multiple organizational units often need to communicate with
       each other and share their data and resources. This demands common databases or replicated
       databases that should be used in a synchronized manner.
       Support for Both OLTP and OLAP − Online Transaction Processing (OLTP) and Online
       Analytical Processing (OLAP) work upon diversified systems which may have common data.
       Distributed database systems aid both these processing by providing synchronized data.
       Database Recovery − One of the common techniques used in DDBMS is replication of data
       across different sites. Replication of data automatically helps in data recovery if database in any
       Support for Multiple Application Software − Most organizations use a variety of application
       software each with its specific database support. DDBMS provides a uniform functionality for
       using the same data among different platforms.
       Modular Development − If the system needs to be expanded to new locations or new units, in
       centralized database systems, the action requires substantial efforts and disruption in the existing
       functioning. However, in distributed databases, the work simply requires adding new computers
       and local data to the new site and finally connecting them to the distributed system, with no
       interruption in current functions.
       More Reliable − In case of database failures, the total system of centralized databases comes to a
       halt. However, in distributed systems, when a component fails, the functioning of the system
       continues may be at a reduced performance. Hence DDBMS is more reliable.
       Better Response − If data is distributed in an efficient manner, then user requests can be met
       from local data itself, thus providing faster response. On the other hand, in centralized systems,
       all queries have to pass through the central computer for processing, which increases the
       response time.
       Lower Communication Cost − In distributed database systems, if data is located locally where it
       is mostly used, then the communication costs for data manipulation can be minimized. This is
       not feasible in centralized systems.
              Need for complex and expensive software − DDBMS demands complex and often
              expensive software to provide data transparency and co-ordination across the several sites.
              Data integrity − The need for updating data in multiple sites pose problems of data
              integrity.
       In this part of the tutorial, we will study the different aspects that aid in designing distributed
       database environments. This chapter starts with the types of distributed databases. Distributed
       databases can be classified into homogeneous and heterogeneous databases having further
       divisions. The next section of this chapter discusses the distributed architectures namely client –
       server, peer – to – peer and multi – DBMS. Finally, the different design alternatives like
       replication and fragmentation are introduced.
2) The sites use identical DBMS or DBMS from the same vendor.
3) Each site is aware of all other sites and cooperates with other sites to process user requests.
       Autonomous − Each database is independent that functions on its own. They are integrated by a
       controlling application and use message passing to share data updates.
       Non-autonomous − Data is distributed across the homogeneous nodes and a central or master
       DBMS co-ordinates data updates across the sites.
       2) The system may be composed of a variety of DBMSs like relational, network, hierarchical or
       object oriented.
       5) A site may not be aware of other sites and so there is limited co-operation in processing user
       requests.
       Un-federated − The database systems employ a central coordinating module through which the
       databases are accessed.
Distribution − It states the physical distribution of data across the different sites.
       Autonomy − It indicates the distribution of control of the database system and the degree to
       which each constituent DBMS can operate independently.
       RDBMS stands for Relational Database Management System. RDBMS is the basis for SQL, and
       for all modern database systems like MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft
       Access.
       The data in an RDBMS is stored in database objects which are called as tables. This table is
       basically a collection of related data entries and it consists of numerous columns and rows.
       Remember, a table is the most common and simplest form of data storage in a relational
       database. The following program is an example of a EMPLOYEE table −
+----+----------+-----+-----------+----------+
+----+----------+-----+-----------+----------+
| 6 | Komal | 22 | MP | 4500.00 |
+----+----------+-----+-----------+----------+
        Every table is broken up into smaller entities called fields. The fields in the CUSTOMERS
        table consist of ID, NAME, AGE, ADDRESS and SALARY.
        A field is a column in a table that is designed to maintain specific information about every
        record in the table.
      For example, a column in the CUSTOMERS table is ADDRESS, which represents location
      description
        NULL value
       A NULL value in a table is a value in a field that appears to be blank, which means a field with a
       NULL value is a field with no value.
       It is very important to understand that a NULL value is different than a zero value or a field that
       contains spaces. A field with a NULL value is the one that has been left blank during a record
       creation.
       Constraints
       Constraints are the rules enforced on data columns on a table. These are used to limit the type of
       data that can go into a table. This ensures the accuracy and reliability of the data in the database.
       Constraints can either be column level or table level. Column level constraints are applied only
       to one column whereas, table level constraints are applied to the entire table.
Following are some of the most commonly used constraints available in SQL
NOT NULL Constraint − Ensures that a column cannot have a NULL value.
DEFAULT Constraint − Provides a default value for a column when none is specified.
UNIQUE Constraint − Ensures that all the values in a column are different.
              CHECK Constraint − The CHECK constraint ensures that all values in a column satisfy
              certain conditions.
INDEX − Used to create and retrieve data from the database very quickly.
       Domain Integrity − Enforces valid entries for a given column by restricting the type, the format,
       or the range of values.
Referential integrity − Rows cannot be deleted, which are used by other records.
       User-Defined Integrity − Enforces some specific business rules that do not fall into entity,
       domain or referential integrity.
Database normalization
       Database normalization is the process of efficiently organizing data in a database. There are two
       reasons of this normalization process −
Eliminating redundant data, for example, storing the same data in more than one table.
       Both these reasons are worthy goals as they reduce the amount of space a database consumes and
       ensures that data is logically stored. Normalization consists of a series of guidelines that help
       guide you in creating a good database structure.
       If a database design is not perfect, it may contain anomalies, which are like a bad dream for any
       database administrator. Managing a database with anomalies is next to impossible.
          anomalies
              Update anomalies − If data items are scattered and are not linked to each other properly,
              then it could lead to strange situations. For example, when we try to update one data item
              having its copies scattered over several places, a few instances get updated properly while
              a few others are left with old values. Such instances leave the database in an inconsistent
              state.
Insert anomalies − We tried to insert data in a record that does not exist at all.
Normalization
It is a method to remove all these anomalies and bring the database to a consistent state.
       Normalization guidelines are divided into normal forms; think of a form as the format or the way
       a database structure is laid out. The aim of normal forms is to organize the database structure, so
       that it complies with the rules of first normal form, then second normal form and finally the third
       normal form.
       It is your choice to take it further and go to the fourth normal form, fifth normal form and so on,
       but in general, the third normal form is more than enough.
       First Normal Form is defined in the definition of relations (tables) itself. This rule defines that all
       the attributes in a relation must have atomic domains. The values in an atomic domain are
       indivisible units.
Each attribute must contain only a single value from its pre-defined domain.
       If we follow second normal form, then every non-prime attribute should be fully functionally
       dependent on prime key attribute. That is, if X → A holds, then there should not be any proper
       subset Y of X, for which Y → A also holds true.
        We broke the relation in two as depicted in the above picture. So there exists no partial
        dependency.
        For a relation to be in Third Normal Form, it must be in Second Normal form and the following
        must satisfy −
             X is a superkey or,
             A is prime attribute.
To bring this relation into third normal form, we break the relation into two relations as follows −
       In the above image, Stu_ID is the super-key in the relation Student_Detail and Zip is the super-
       key in the relation ZipCodes. So,
and
Zip → City
       Relational algebra defines the basic set of operations of relational database model. A sequence
       of relational algebra operations forms a relational algebra expression. The result of this
       expression represents the result of a database query.
 Projection
 Selection
 Union
 Intersection
 Minus
               Join
       Projection
       Projection operation displays a subset of fields of a table. This gives a vertical partition of the
       table.
                                        π<AttributeList>(<TableName>)
                         For example, let us consider the following Student database –
STUDENT
                                        πName,Course(STUDENT)
       Selection
       Selection operation displays a subset of tuples of a table that satisfies certain conditions. This
       gives a horizontal partition of the table.
σ<Conditions>(<TableName>)
       For example, in the Student table, if we want to display the details of all students who have
       opted for MCA course, we will use the following relational algebra expression −
                                         σCourse="mCA"(STUDENT)
    Combination of Projection and Selection Operations
    For most queries, we need a combination of projection and selection operations. There are two
    ways to write these expressions −
For example, to display names of all female students of the BCA course −
πName(σGender="Female"ANDCourse="BCA"(STUDENT))
                   FemaleBCAStudent←σGender="Female"ANDCourse="BCA"(STUDENT)
                                Result←πName(FemaleBCAStudent)
       Union
       If P is a result of an operation and Q is a result of another operation, the union of P and Q
               (p∪ Qp∪ Q) is the set of all tuples that is either in P or in Q or in both without duplicates.
                                Sem1Student←σSemester=1(STUDENT)
                                BCAStudent←σCourse="BCA"(STUDENT)
                                    Result←Sem1Student∪BCAStudent
        Intersection
        If P is a result of an operation and Q is a result of another operation, the intersection of P and Q
        ( p∩ Qp∩ Q ) is the set of all tuples that are in P and Q both.
EMPLOYEE
PROJECT
To display the names of all cities where a project is located and also an employee resides −
                                       CityEmp←πCity(EMPLOYEE)
                                       CityProject←πCity(PROJECT)
                                      Result←CityEmp∩CityProject
      Minus
      If P is a result of an operation and Q is a result of another operation, P - Q is the set of all tuples
      that are in P and not in Q.
      For example, to list all the departments which do not have an ongoing project (projects with status
      = ongoing) –
                                    AllDept←πDepartment(EMPLOYEE)
      ProjectDept←πDepartment(σStatus="ongoing"(PROJECT)) Result←AllDept−ProjectDept
      Join
        Join operation combines related tuples of two different tables (results of queries) into a single
              table.
BRANCH
          SQL queries are translated into equivalent relational algebra expressions before optimization. A
          query is at first decomposed into smaller query blocks. These blocks are translated to equivalent
          relational algebra expressions. Optimization includes optimization of each block and then
          optimization of the query as a whole.
          Examples
          Let us consider the following schemas −
EMPLOYEE
PROJECT
WORKS
Example 1
            To display the details of all employees who earn a salary LESS than the average salary, we write
            the SQL query −
This query contains one nested sub-query. So, this can be broken down into two blocks.
AvgSal←IAVERAGE(Salary)EMPLOYEE
σSalary<AvgSal>EMPLOYEE
Example 2
       To display the project ID and status of all projects of employee 'Arun Kumar', we write the SQL
       query −
WHERE PID =( SELECT FROM WORKS WHERE EMPID =( SELECT EMPID FROM EMPLOYEE
     This query contains two nested sub-queries. Thus, can be broken down into three blocks, as      f
       follows −
               ArunEmpID←πEmpID(σName="ArunKumar"(EMPLOYEE))
 ArunPID←πPID(σEmpID="ArunEmpID"(WORKS)) Result←πPID,Status(σPID="ArunPID"(PROJECT))
           Multi-database View Level − Depicts multiple user views comprising of subsets of the
           integrated distributed database.
           Multi-database Internal Level − Depicts the data distribution across different sites and multi-
           database to local data mapping.
Local database Conceptual Level − Depicts local data organization at each site.
Local database Internal Level − Depicts physical data organization at each site.
The distribution design alternatives for the tables in a DDBMS are as follows −
 Fully replicated
 Partially replicated
 Fragmented
 Mixed
          Fully Replicated
      In this design alternative, at each site, one copy of all the database tables is stored. Since, each
      site has its own copy of the entire database, queries are very fast requiring negligible
      communication cost. On the contrary, the massive redundancy in data requires huge cost during
      update operations. Hence, this is suitable for systems where a large number of queries is
      required to be handled whereas the number of database updates is low.
          Partially Replicated
      Copies of tables or portions of tables are stored at different sites. The distribution of the tables is
      done in accordance to the frequency of access. This takes into consideration the fact that the
      frequency of accessing the tables vary considerably from site to site. The number of copies of the
      tables (or portions) depends on how frequently the access queries execute and the site which
      generate the access queries.
          Fragmented
      In this design, a table is divided into two or more pieces referred to as fragments or partitions,
      and each fragment can be stored at different sites. This considers the fact that it seldom
      happens that all data stored in a table is required at a given site. Moreover, fragmentation
      increases parallelism and provides better disaster recovery. Here, there is only one copy of each
      fragment in the system, i.e. no redundant data.
 Vertical fragmentation
 Horizontal fragmentation
 Hybrid fragmentation
      In the last chapter, we had introduced different design alternatives. In this chapter, we will study
      the strategies that aid in adopting the designs. The strategies can be broadly divided into
      replication and fragmentation. However, in most cases, a combination of the two is used.
     Replication
       Data replication is the process of storing separate copies of the database at two or more sites. It
       is a popular fault tolerance technique of distributed databases.
       Reduction in Network Load − Since local copies of data are available, query processing can be
       done with reduced network usage, particularly during prime hours. Data updating can be done
       at non-prime hours.
       Quicker Response − Availability of local copies of data ensures quick query processing and
       consequently quick response time.
       Simpler Transactions − Transactions require less number of joins of tables located at different
       sites and minimal coordination across the network. Thus, they become simpler in nature.
          Increased Cost and Complexity of Data Updating − Each time a data item is updated, the
          update needs to be reflected in all the copies of the data at the different sites. This requires
          complex synchronization techniques and protocols.
          Undesirable Application – Database coupling − If complex update mechanisms are not used,
          removing data inconsistency requires complex co-ordination at application level. This results
          in undesirable application – database coupling.
 Snapshot replication
 Near-real-time replication
 Pull replication
Fragmentation
Fragmentation is the task of dividing a table into a set of smaller tables. The subsets of the table are
called fragments.
horizontal,
vertical, and
Fragmentation should be done in a way so that the original table can be reconstructed from the
fragments. This is needed so that the original table can be reconstructed from the fragments whenever
required. This requirement is called “reconstructiveness.”
    Advantages of Fragmentation
       Since data is stored close to the site of usage, efficiency of the database system is increased.
Local query optimization techniques are sufficient for most queries since data is locally available.
       Since irrelevant data is not available at the sites, security and privacy of the database system
       can be maintained.
    Disadvantages of Fragmentation
       When data from different fragments are required, the access speeds may be very high.
In case of recursive fragmentations, the job of reconstruction will need expensive techniques.
       Lack of back-up copies of data in different sites may render the database ineffective in case of
       failure of a site.
       In vertical fragmentation, the fields or columns of a table are grouped into fragments. In order
       to maintain reconstructiveness, each fragment should contain the primary key field(s) of the
       table. Vertical fragmentation can be used to enforce privacy of data.
       For example, let us consider that a University database keeps records of all registered students
       in a Student table having the following schema.
STUDENT
       Now, the fees details are maintained in the accounts section. In this case, the designer will
       fragment the database as follows −
SELECT Regd_No,Fees
FROM STUDENT;
Horizontal fragmentation
       Horizontal fragmentation groups the tuples of a table in accordance to values of one or more
       fields. Horizontal fragmentation should also confirm to the rule of reconstructiveness. Each
       horizontal fragment must have all columns of the original base table.
       For example, in the student schema, if the details of all students of Computer Science Course
       needs to be maintained at the School of Computer Science, then the designer will horizontally
       fragment the database as follows −
CREATE COMP_STD AS
hybrid fragmentation
       At first, generate a set of horizontal fragments; then generate vertical fragments from one or
       more of the horizontal fragments.
       At first, generate a set of vertical fragments; then generate horizontal fragments from one or
       more of the vertical fragments.
Transparency
       Distribution transparency is the property of distributed databases by the virtue of which the
       internal details of the distribution are hidden from the users. The DDBMS designer may choose
       to fragment tables, replicate the fragments and store them at different sites. However, since
       users are oblivious of these details, they find the distributed database easy to use like any
       centralized database.
 Location transparency
 Fragmentation transparency
 Replication transparency
Location transparency
       Location transparency ensures that the user can query on any table(s) or fragment(s) of a table
       as if they were stored locally in the user’s site. The fact that the table or its fragments are stored
       at remote site in the distributed database system, should be completely oblivious to the end
       user. The address of the remote site(s) and the access mechanisms are completely hidden.
       In order to incorporate location transparency, DDBMS should have access to updated and
       accurate data dictionary and DDBMS directory which contains the details of locations of data.
       Fragmentation transparency enables users to query upon any table as if it were unfragmented.
       Thus, it hides the fact that the table the user is querying on is actually a fragment or union of
       some fragments. It also conceals the fact that the fragments are located at diverse sites.
       This is somewhat similar to users of SQL views, where the user may not know that they are
       using a view of a table instead of the table itself.
        Replication transparency
       Replication transparency ensures that replication of databases are hidden from the users. It
       enables users to query upon a table as if only a single copy of the table exists.