Copyright
Copyright 2002 by Gupta Technologies LLC. All rights reserved.
Advanced Topics Guide
20-2119-1002
January 2002
Contents
Audience. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-12
What is in this manual. . . . . . . . . . . . . . . . . . . . . 1-12
Notation conventions . . . . . . . . . . . . . . . . . . . . . 1-13
Other helpful resources . . . . . . . . . . . . . . . . . . . 1-14
Send comments to.... . . . . . . . . . . . . . . . . . . . . . 1-15
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
ANSI/SPARC three schema architecture . . . . . . . . . . 1-2
Structure of this section. . . . . . . . . . . . . . . . . . . . . . . . 1-4
5 Transaction Definition . . . . . . . . . . . . . . . . . . . .1
Benefits of defining transactions . . . . . . . . . . . . . . . . . 5-2
Physical database design . . . . . . . . . . . . . . . . . . . 5-2
Data integrity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-2
Defining transactions. . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
Transaction name, number, and description . . . . 5-3
Transaction type and complexity . . . . . . . . . . . . . 5-4
Transaction volumes. . . . . . . . . . . . . . . . . . . . . . . 5-4
Transaction performance requirements . . . . . . . . 5-5
Relative priority . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
Transaction SQL statements . . . . . . . . . . . . . . . . 5-6
8 Database Pages . . . . . . . . . . . . . . . . . . . . . . . . . .1
Pages and page types . . . . . . . . . . . . . . . . . . . . . . . . 8-2
Data pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-4
Row pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-4
Extent pages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
Long varchar pages . . . . . . . . . . . . . . . . . . . . . . 8-11
Overflow pages. . . . . . . . . . . . . . . . . . . . . . . . . . 8-12
Index pages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-14
Control pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-16
Database control block . . . . . . . . . . . . . . . . . . . . 8-16
Free extent list . . . . . . . . . . . . . . . . . . . . . . . . . . 8-17
Group free page list . . . . . . . . . . . . . . . . . . . . . . 8-17
Group extent map . . . . . . . . . . . . . . . . . . . . . . . . 8-17
Bitmap pages . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-17
Row count page . . . . . . . . . . . . . . . . . . . . . . . . . 8-18
Table data page . . . . . . . . . . . . . . . . . . . . . . . . . 8-18
Page allocation and garbage collection . . . . . . . . . . 8-18
9 B-Tree Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . .1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
Binary search trees . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-3
Multiway trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-4
B-Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-4
The order of a B-Tree . . . . . . . . . . . . . . . . . . . . . . 9-5
Sequencing within a node. . . . . . . . . . . . . . . . . . . 9-5
Balancing B-Trees . . . . . . . . . . . . . . . . . . . . . . . . 9-6
B-Tree operation costs . . . . . . . . . . . . . . . . . . . . . 9-9
B+-Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-10
Prefix B+-Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-12
SQLBase implementation . . . . . . . . . . . . . . . . . . . . . 9-12
Index node size. . . . . . . . . . . . . . . . . . . . . . . . . . 9-12
Performance considerations . . . . . . . . . . . . . . . . 9-13
Non-unique indexes . . . . . . . . . . . . . . . . . . . . . . 9-15
10 Hashed Indexes . . . . . . . . . . . . . . . . . . . . . . . . . .1
Hash table overview . . . . . . . . . . . . . . . . . . . . . . . . . 10-2
Row location in hashed tables . . . . . . . . . . . . . . 10-2
Performance benefits of hashed tables . . . . . . . 10-2
Potential pitfalls of hashing . . . . . . . . . . . . . . . . . 10-3
Hashing theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-4
The hash function . . . . . . . . . . . . . . . . . . . . . . . . 10-4
Bucket hashing . . . . . . . . . . . . . . . . . . . . . . . . . . 10-5
Collisions and overflows . . . . . . . . . . . . . . . . . . . 10-5
Collision avoidance techniques . . . . . . . . . . . . . 10-6
Hashing disk space utilization and performance 10-7
SQLBase hashing implementation . . . . . . . . . . . . . 10-11
Location of table rows. . . . . . . . . . . . . . . . . . . . 10-11
Hash function . . . . . . . . . . . . . . . . . . . . . . . . . . 10-11
Specifying the packing density . . . . . . . . . . . . . 10-13
12 Result Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
Characteristics of result sets . . . . . . . . . . . . . . . . . . . 12-2
Physical structure of result sets . . . . . . . . . . . . . 12-2
Logical contents of result sets . . . . . . . . . . . . . . 12-3
Population of result sets . . . . . . . . . . . . . . . . . . . 12-7
Flushing of result sets. . . . . . . . . . . . . . . . . . . . . 12-8
Processing result sets . . . . . . . . . . . . . . . . . . . . . . . . 12-9
Lock duration on result set rows . . . . . . . . . . . . 12-10
Concurrency and derived result sets . . . . . . . . 12-10
Concurrency and non-derived result sets . . . . . 12-11
13 Concurrency Control . . . . . . . . . . . . . . . . . . . . .1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-2
Transaction serialization . . . . . . . . . . . . . . . . . . . . . . 13-2
Lock types and scope . . . . . . . . . . . . . . . . . . . . . . . . 13-5
S-locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-6
X-locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-6
U-locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-6
Lock compatibilities. . . . . . . . . . . . . . . . . . . . . . . 13-7
Locking level . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-7
Locking scope. . . . . . . . . . . . . . . . . . . . . . . . . . . 13-7
Audience
The Advanced Topics Guide is written for anyone using the advanced features of
SQLBase. This includes:
• Application developers
Application developers build client applications that access databases using
frontend products like SQLWindows and the SQL/API.
• Database administrators (DBAs)
Database administrators perform day-to-day operation and maintenance of
the DBMS and an organization’s databases. They install maintenance
releases, load data, control access, perform backup and recovery, and monitor
performance.
• Database designers
Database designers, or senior database administrators, are responsible for the
logical and physical design of databases. Their responsibility is to ensure the
database meets the operational and performance requirements of the system.
This manual assumes you have:
• Knowledge of relational databases and SQL.
• Familiarity with the material in the other SQLBase manuals.
Database Design: Chapters 1-7 discuss logical and physical database design.
Notation conventions
The table below shows the notation conventions that this manual uses.

Notation      Explanation
bold type     Menu items, push buttons, and field names. Things that you select.
              Keyboard keys that you press.
Warning:      Precaution.
Important:    Vital information.
Note:         Supplemental information.
Alt+1         A plus sign between key names means to press and hold down the first
              key while you press the second key.
Other helpful resources
Gupta Books Online. The Gupta document suite is available online. This document
collection lets you perform full-text indexed searches across the entire document
suite, navigate the table of contents using the expandable/collapsible browser, or print
any chapter. Open the collection by selecting the Gupta Books Online icon from the
Start menu or by double-clicking on the launcher icon in the program group.
Online Help. This is an extensive context-sensitive online help system. The online
help offers a quick way to find information on topics including menu items,
functions, messages, and objects.
World Wide Web. Gupta’s World Wide Web site contains information about Gupta
Technologies LLC’s partners, products, sales, support, training, and users. The URL
is http://www.guptaworldwide.com.
To access Gupta technical services on the Web, go to
http://www.guptaworldwide.com/support. This section of our Web site is a valuable
resource for customers with technical support issues, and addresses a variety of topics
and services, including technical support case status, commonly asked questions,
access to Gupta’s Online Newsgroups, links to Shareware tools, product bulletins,
white papers, and downloadable product updates.
For information on training, including course descriptions, class schedules, and
Certified Training Partners, go to http://www.guptaworldwide.com/training.
Other Publications. Depending on your requirements, you may also want to use the
following publications:
• A Guide to the SQL Standard, C.J. Date
This book describes the SQL relational database language with an emphasis
on both the official standard version of SQL and on the use of SQL for
programmed (versus interactive) database access.
• SQL & Its Applications, Raymond A. Lorie and Jean-Jacques Daudenarde,
Prentice Hall, 1991.
• The Art of Computer Programming, D.E. Knuth, Addison-Wesley, 1973.
• The Ubiquitous B-Tree, D. Comer, ACM Computing Surveys, Vol. 11, No. 2
(June 1979).
• Files and Databases: An Introduction, P. Smith and G. Barnes, Addison-
Wesley, 1987.
• Algorithms + Data Structures = Programs, N. Wirth, Prentice-Hall, 1976.
• Relational Completeness of Data Base Sublanguages, E.F. Codd, in Data
Base Systems, Courant Computer Science Symposia Series, Vol. 6. Prentice-
Hall, 1972.
• Concurrency Control and Recovery in Database Systems, Philip Bernstein,
Vassos Hadzilacos, and Nathan Goodman, Addison-Wesley, 1987.
• An Introduction to Database Systems, C.J. Date, Addison-Wesley, 1990 (5th
edition).
• Relational Database: Writings 1985-1989, C.J. Date, Addison-Wesley,
1990.
• Relational Database: Writings 1989-1991, C.J. Date, Addison-Wesley,
1992.
• The Relational Model for Database Management: Version 2, E.F. Codd,
Addison-Wesley, 1990.
• Computer Data-Base Organization, James Martin, Prentice-Hall, 1977 (2nd
edition).
• Query Evaluation Techniques for Large Databases, Goetz Graefe, ACM
Computing Surveys, Vol. 25, No. 2 (June 1993).
• Join Processing in Relational Databases, Priti Mishra and Margaret H. Eich,
ACM Computing Surveys, Vol. 24, No. 1 (March 1992).
• On Optimizing an SQL-like Nested Query, Won Kim, ACM Transactions on
Database Systems, Vol. 7, No. 3 (September 1982).
• Join Processing in Database Systems with Large Main Memories, Leonard
Shapiro, ACM Transactions on Database Systems, Vol. 11, No. 3 (September
1986).
Chapter 1
Introduction
This chapter introduces the following chapters on database design:
• Chapter 2 - Conceptual Schema Creation
• Chapter 3 - Internal Schema Creation
• Chapter 4 - External Schema Creation
• Chapter 5 - Transaction Definition
• Chapter 6 - Refining the Physical Design
• Chapter 7 - Physical Design Control
Through the three schemas and the mapping between them, the ANSI/SPARC three
schema architecture provides a level of indirection which insulates user needs from
database performance and capacity concerns. The major advantage of using this
architecture is that as the physical characteristics of the data and/or the target DBMS
changes over the life of the database, database designers can adjust the physical
constructs of the internal schema without affecting existing program modules.
Similarly, the user’s views of the data can be changed as their requirements change
without affecting the manner in which the data is physically stored. Without the three
schema (or similar) architecture, changes to the physical database can require costly
effort wasted on maintenance of existing program modules.
Although no commercial DBMS now provides full support for the ANSI/SPARC
three schema architecture, this architecture provides an excellent goal for the database
administrator (DBA) to work towards. It also provides a good framework for
organizing database design activities.
Figure - The database design process. Conceptual Schema Creation (Chapter 2) produces the logical database design; Internal Schema Creation (Chapter 3) produces the first-cut database; External Schema Creation (Chapter 4) produces the user views; Transaction Definition (Chapter 5) produces the transaction definitions; Refining the Physical Design (Chapter 6) produces a better database; and Physical Design Control (Chapter 7) produces the optimized database.
tables in the internal schema. The risk associated with doing this is that alterations
made to the physical design have a greater chance of affecting existing DML and
causing some re-coding to keep the DML valid. Note that creating the external
schema does not totally eliminate this risk.
Chapter 2
Conceptual Schema Creation
The conceptual schema is the primary or base schema within the ANSI/SPARC three
schema architecture and is sometimes referred to as the logical data model. It models
an organization’s data resource and the relationships between data as they naturally
exist without regard to how the data may be used or physically stored. Of the three
schemas, this is the least likely to radically change over time.
This chapter introduces the process of logical data modeling and several related
topics. In this chapter we:
• Introduce the data administration function and responsibilities.
• Cover logical database design.
• Explain types of entity-relationship data models.
• Explain the data normalization technique.
Data administration
The data administration (DA) function plays a consulting role in the process of
logical database design. This contrasts with the traditional mission of the database
administration (DBA) function, which is mainly responsible for physical database
design and implementation tasks. The mission of data administration is to ensure that
an organization’s data resources are managed responsibly and recognized as an
important asset. A data administration department is responsible for formulating the
plans that guide the evolution of all corporate databases. By providing this master
plan, the DA function ensures that corporate computing systems will be compatible in
the future.
This compatibility guarantees that future system development will not adversely
affect existing systems and users in some way, and that information in separate
databases today can be combined in the future without undue cost or delay. Without
this assurance, a company can be exposed to financial and operational risk. This can
arise from either the disruption of mission critical computer systems or through the
company’s inability to meet future information requirements effectively.
Many corporations have not formally established a DA function. In the absence of
this formal mandate, it is important for those personnel responsible for operating and
maintaining the database environment to establish an agreement to “self-rule.” This
helps to avoid the pitfalls caused by a lack of standardization. Since database
professionals are often the first people to realize the benefit of establishing DA
practices, it behooves them to either attempt to persuade management to formally
establish the DA function, or to make it part of their regular procedures as best they
can.
In this chapter we intend to provide a starting point for database administrators who
are working without formal DA functions and who have not yet adopted any DA
practices. There are many books and articles that treat these and other DA subjects in
great detail, and we encourage you to search out those subjects that interest you.
The following list represents some of the areas in which standards are generally
created:
• Abbreviations
Standards for the abbreviations used as roots for object names are useful in
maintaining naming consistency.
• Data element (or attribute) names
These standards benefit users who access the database using tools such as
Quest. These users must find the information they need in a consistent way.
When several people are creating new elements, variations of the same name
can be created. Good element and abbreviation standards can solve this
problem. Element naming standards should also address the capitalization
and word separators of multi-name elements.
• Table names
The previous comments on data element names apply here as well. Names
must be simple and meaningful to users.
• View names
Many sites want naming conventions that allow easy differentiation between
views and tables. Some sites like to highlight views that result in join
operations since these may be time consuming and have restrictions on update
operations.
• Program names
Most installations try to indicate application ownership of programs by
mnemonics in the name. Mnemonics are short abbreviations, often two or
three letters, which contain enough of the original phrase to prompt an
association to the original. An example of a mnemonic is “AR” for accounts
receivable. Also, some sites like to be able to tell whether a program is read-only
from the name alone.
• File names
File names should bear some relationship to the database that owns the file.
• User names
These standards have to do with who a person is and where that person can
be found, either by department or physical location.
The definition and utilization of standardized templates is valuable in
performing security authorizations for users based on the data requirements
of their job. This procedure is very useful when an audit is being performed.
• Database names
Standardizing database names simplifies operational control of the databases
for day-to-day use. The same applies to the following areas or items:
• Server names
• Workstation names
• Printer names
Data dictionary
One of the key tools of data administration is a data dictionary or repository. This is a
special database which contains metadata (data about data such as element names,
meaning, data type, null characteristics, masks, and validations) describing the
business’s database environment. Within the dictionary is information about each
table, its columns, indexes, views, and relationships to other tables. Through the use
of the dictionary, users and programmers can find the data items, tables, or views they
need. Also, data administrators can monitor naming standards, locate data, and
eliminate inconsistencies in the ways various departments describe and use data.
Since the dictionary is the main tool needed for data administration, database
administrators who attempt to provide basic DA functions for their organizations can
design and implement a simple dictionary for their own use. A database describing
tables, views, and columns can be created and then populated with data extracted
from the SQLBase system catalog tables. These are included in every SQLBase
database, and have names which start with “SYSADM.SYS”, such as
“SYSADM.SYSTABLES.” By combining this information across all databases, the
DBA can locate and correct non-standard names, inconsistent key definitions, and
duplicate table or column definitions.
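As a starting point, a query along the following lines can pull basic column metadata out of a single database. This is only a sketch: it assumes the catalog exposes a SYSADM.SYSCOLUMNS table with DB2-style column names (TBNAME, NAME, COLTYPE, LENGTH, NULLS, and COLNO), so verify the actual catalog table and column names in the SQL Language Reference before relying on it.

SELECT TBNAME, NAME, COLTYPE, LENGTH, NULLS
FROM SYSADM.SYSCOLUMNS
ORDER BY TBNAME, COLNO;

Running the same query against each database and inserting the results into a central dictionary table gives the DBA one place to compare names, data types, and null rules across all databases.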
The Team Object Manager component of Team Developer contains a repository
which you may use to store your metadata. Similarly, many CASE tools include a
data dictionary and you may find it beneficial to use the repository of the CASE tool
you have selected to perform database design tasks.
Once an organization reaches this stage, new applications can be developed rapidly
since no new programs must be written to add, change, and delete persistent data.
Organizations that have a formal DA function and have completed a strategic data
plan and an enterprise data model often move on to performing data modeling on
certain key business areas before starting development work on applications. These
business area data models are usually fully normalized (normalization is covered later
in this chapter) and complete, with specifications of all keys and attributes in each
entity included within the model. These data models serve as good sources of
material for a database design and can often be used as is, but two or more may have
to be merged to obtain the complete set of data required by one application.
binary relationships. These are often broadly classified according to how many
occurrences of an entity type can appear on each side of the relationship, as shown in
Figure A.
Figure A - The three broad classifications of binary relationships: one-to-one, one-to-many, and many-to-many.
Figure B - Bachman ERD of the order entry example. CUSTOMER has ORDER / ORDER placed by CUSTOMER; ORDER has ITEM / ITEM within ORDER; PRODUCT has ITEM / ITEM for PRODUCT.
the use of arrows. In the case of minimum cardinality, the circle can either be hollow,
indicating zero, or filled in, indicating one. To describe maximum cardinality, the
absence of the arrow means a maximum of one, while the presence of the arrow
means a maximum of many (more than one).
In Bachman terminology, a relationship is called a partnership set, which is made up
of two partnerships. Each of these two partnerships is owned by one of the entities
involved in the relationship. In this example, CUSTOMER has a partnership with
ORDER, and ORDER also has a partnership with CUSTOMER. These two
partnerships together form the CUSTOMER-ORDER partnership set.
For CUSTOMER’s partnership with the ORDER entity, the minimum cardinality of
orders for a customer is specified by the hollow circle next to the CUSTOMER entity,
which is labelled “has.” Because the minimum cardinality is zero, this circle is
hollow. This means that a customer may exist without any orders. The maximum
cardinality of the ORDER entity from the perspective of the CUSTOMER is
described by the arrow on the ORDER side of the relationship, which specifies that a
customer may have many orders. If there were no arrow, then the maximum
cardinality would be one rather than many, and a customer could have only a single
order at any one time. The way to read this relationship in terms of the CUSTOMER
entity is to say, “A customer has from zero to many orders.”
From the ORDER perspective, the minimum cardinality of CUSTOMER is 1. This is
noted by the circle next to the ORDER entity which is labelled “placed by” and is
filled in. The maximum cardinality of the relationship is also one, since there is no
arrow on the CUSTOMER side of the relationship. This type of relationship, where
an entity has both minimum and maximum cardinality of one with another entity, is
often called a dependency. The ORDER entity is said to be dependent on
CUSTOMER, since it could not exist without a relationship to a single occurrence of
CUSTOMER. The reading from the ORDER entity’s perspective is, “An order is
placed by one, and only one, customer.”
In a similar manner, the other relationships read:
• An order has 1 or many items.
• An item is within 1, and only 1, order.
• A product has from zero to many items.
• An item is for 1, and only 1, product.
Figure C - Information Engineering ERD of the same model, showing the CUSTOMER, ORDER, ORDER_ITEM, and PRODUCT entities.
In the Information Engineering ERD, entities are also depicted by boxes and
relationships as lines. The significant difference from the Bachman ERD is the
graphic description of relationship cardinality. In this diagram, each entity’s minimum
and maximum cardinality in a relationship is described immediately adjacent to itself.
For instance, the CUSTOMER’s cardinality with respect to ORDER is shown in two
horizontal bars immediately below CUSTOMER. The symbol closest to an entity
describes the maximum cardinality and can be either a bar (indicating one), or a
crow’s foot (indicating many). Following the relationship line, the next symbol
describes the minimum cardinality and can be either a bar (indicating one), or a circle
(denoting zero). The meaning of each relationship in Figure C is the same as
described for Figure B.
Figure D - Chen ERD of the same model. The CUST. ORDER relationship connects CUSTOMER (1) and ORDER (N); the ORDER ITEM relationship connects ORDER (M) and PRODUCT (N).
In the Chen ERD, there is a more significant deviation from the previous examples.
Entities are still represented by boxes, and lines between entities still play a part in
representing relationships, but now there are also diamonds that appear on the
relationship lines which contain the name of the relationship. Also, relationships may
have attributes assigned to them directly. These are called information bearing
relationships. In the Chen diagram, attributes may appear directly on the diagram as
bubbles which contain the attribute’s name and are attached to the entity or
relationship to which they belong. Attributes have been omitted from this diagram in
order to keep it uncluttered and consistent with the prior diagrams. When a relationship
bears attributes, it is transformed into a table in the physical database design in the
same way that entities become physical tables.
The ORDER ITEM relationship is an example of this. In previous diagrams, ORDER
ITEM was an entity rather than a relationship. In the Chen diagram, the relationship
between ORDER and PRODUCT is indicated as having a maximum cardinality of
“many” on both sides (shown by the “M” and “N”), and the ORDER ITEM
relationship between them is information bearing and would appear as a table in the
resulting physical database design. The table that results from two entities having a
many-to-many relationship is said to resolve the relationship. The data contained in
the table is called junction data, because it is the information that is unique for each
occurrence of the relationship between the two entities. In this example, the junction
data would be attributes such as QUANTITY-ORDERED. Also, while the Chen
diagram indicates maximum cardinality by means of a number or symbol (the “1”,
“N”, and “M” in Figure D) next to the entity involved in a relation, minimum
cardinality is not depicted. In Figure D, “N” orders occur for each CUSTOMER
(where “N” indicates some number greater than 1), thereby depicting the customer to
order relationship as one-to-many. The “one” side of this relationship is indicated by
the “1” appearing next to the CUSTOMER entity.
The two critical characteristics of a primary key are uniqueness and non-redundancy.
Uniqueness refers to the fact that the value of a primary key for a given occurrence of
an entity must not be duplicated in any other occurrence of the same entity type. The
physical design implication is that a unique index must be able to be placed on a
primary key successfully. The non-redundancy characteristic of a primary key, which
applies to concatenated keys only, means that each of the attributes which comprise
the primary key must be essential to its uniqueness, so that if any attribute was
removed, then the key would no longer be unique. This last rule prevents extraneous
information from being placed into keys. These rules for forming primary keys are
important in the normalization process.
CUSTOMER (CUSTOMER-NO, CUSTOMER-NAME, CUSTOMER-ADDRESS)
ORDER (ORDER-NO, TOTAL-SALES-DOLLARS, ORDER-DATE, CUSTOMER-NO)
ORDER_ITEM (ORDER-NO, LINE-NO, QUANTITY-ORDERED, PRODUCT-NO)
PRODUCT (PRODUCT-NO, PRODUCT-DESC)
Figure 2E - Example of mapping between primary and foreign keys. The leading attributes of each relation form its primary key; CUSTOMER-NO in ORDER and PRODUCT-NO in ORDER_ITEM are foreign keys. Dependent entity ORDER_ITEM inherits part of its composite key from ORDER. The ORDER-NO attribute also serves as the foreign key to ORDER.
Foreign keys must be created in an entity type to indicate the specific occurrence of
another entity with which this entity has a relationship. The attributes of the foreign
key must match the primary key of the related entity exactly. The foreign key appears
in the entity that is on the many side of a one-to-many relationship, or may appear on
either side of a one-to-one relationship. Many logical data modeling diagram
techniques do not explicitly name foreign key attributes, since their existence can be
assumed through the relationships that the entity participates in. When the
relationship is many-to-many, an intermediate table will contain the foreign keys of
both entities involved in the relation. See Figure E for the relational notation of the
primary to foreign key mapping for the data model used in the previous examples.
Normalization
The normalization process is a logical design technique that is especially well suited
for use with SQLBase (or for that matter, any DBMS which similarly adheres to the
relational model of data) because of its capability to directly implement the results
without any translation. Performing normalization results in reduced data redundancy,
protects against update and deletion anomalies, and provides a more flexible, longer
lasting database design. The resulting database will be easier to maintain and will not
require major, high-cost structural changes as often as unnormalized designs. This is
due to the inherent stability of basing the data structures on the natural organization of
the basic business information flowing through the corporation rather than structuring
it with only a limited user or application perspective.
While the processes that a business performs on its data may change often, the data
items that are operated on by those processes can be structured to reflect the nature of
the company’s business. This means that the application processes, embodied in
program logic, can be more quickly changed to keep pace with company needs. The
underlying database remains the same, reflecting the basic information inherent in the
environment. Attaining this level of stability is the goal of data normalization.
One of the fundamental requirements of a designer who wishes to perform
normalization is to know the meaning of the data intimately and to have a thorough
understanding of its use in the real world of business. The way that a table’s attributes
interact with and depend on each other is the basis for normalization. The designer
performing the normalization process must be knowledgeable of the data at this
detailed level in order to successfully complete the process without undue risk of error
arising from misunderstandings or assumptions.
Normal forms
The normal form is a specific level of normalization achieved by a set of tables
comprising a database design. Each successive normal form represents a more stable
and controlled stage of design. Higher levels of normal forms always require the
design to be in each lower level of normalization as a prerequisite. As an example,
second normal form requires the design to be in first normal form, third normal form
requires both first and second forms, and so on.
Figure 2F - Summary of normal forms, rules, and steps needed to achieve the next
form
There are a number of normal forms defined and work is continuing so that more
forms may appear in the future. Most current texts on database design will address
normal forms one through five, although only the first three are needed for most
tables. For this reason, only the first three forms are summarized in Figure F. Consult
more detailed database design texts for a more in-depth discussion of additional
normal forms.
Figure 2I - Further normalizing our example to third normal form. The STUDENT
relation violated third normal form due to non-key attributes which are dependent on
other non-key attributes, as shown by the arrows. These attributes are broken out into
a new relation, and the original relation keeps a foreign key to the new relation. This
foreign key attribute is MAJOR_DEPT_CODE in the STUDENT relation.
One remaining transformation must be done on our data from Figure H in order to get
all the relations into third normal form. As shown in Figure I, the STUDENT relation
violates third normal form, and must be transformed into two tables. The two non-key
attributes Major_Dept_Name and Major_Dept_Location depend on another non-key
attribute, Major_Dept_Code. Even though they are currently in the STUDENT table,
they do not depend on the Student_Number primary key directly, but are dependent
through the Major_Dept_Code, which is dependent on the primary key. This is called
a transitive dependency, and is removed by creating a new table with a primary key of
the transitive attribute in the original relation. This transitive attribute then becomes a
foreign key to the new table. The attributes in the original relation that were
transitively dependent are now placed in the new table, where they are fully dependent
on the primary key only. The MAJOR relation in Figure I is formed through this
process. The data originally described in Figure G is now entirely in third normal
form.
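To make the decomposition concrete, the following DDL sketches the two third normal form tables in SQLBase terms. Only the column names discussed above are taken from the example; the data types, lengths, and the STUDENT_NAME column are assumptions chosen purely for illustration.

CREATE TABLE STUDENT (
STUDENT_NUMBER DECIMAL(9,0) NOT NULL,
STUDENT_NAME VARCHAR(45) NOT NULL,
MAJOR_DEPT_CODE CHAR(4) NOT NULL,
PRIMARY KEY (STUDENT_NUMBER));

CREATE TABLE MAJOR (
MAJOR_DEPT_CODE CHAR(4) NOT NULL,
MAJOR_DEPT_NAME VARCHAR(45) NOT NULL,
MAJOR_DEPT_LOCATION VARCHAR(35) NOT NULL,
PRIMARY KEY (MAJOR_DEPT_CODE));

MAJOR_DEPT_CODE remains in STUDENT as a foreign key to MAJOR, so the department name and location are now stored once per department rather than once per student.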
Denormalization
Now that we have looked at the benefits associated with data normalization, a few
words about denormalization are in order. Denormalization is the process of taking a
database design out of third normal form (or some higher form) for performance
reasons. One may denormalize by either redundantly storing prime attributes, or by
adding aggregate attributes to the database.
Some database designers routinely denormalize during logical database design.
However, the only reason to denormalize is to achieve greater performance. In
addition, because of the added complexities of a denormalized database (refer to
chapter 7 for further details), denormalization should only occur as a last resort after
all other means of increasing performance have failed. Consequently,
denormalization should only take place during physical design control. The logical
design should still be maintained in (at least) third normal form in order to facilitate
clear thinking about its structure and meaning.
Chapter 3
Internal Schema Creation
Creating tables
The first step in building the ANSI/SPARC internal schema is to identify the
relational tables which will be managed by SQLBase. The primary input to this task
is a completed ERD containing entities and relationships. On completion, a set of
CREATE TABLE skeleton statements may be coded that will serve as the starting
point for the addition of columns and other details.
A base table will be created for each entity on the ERD. Base tables are the major
information-bearing objects within a relational database, and contrast with views in
that no projection or selection operations may restrict the visibility of any columns or
rows.
For each base table identified, define a long identifier which uniquely names the table
within this database. This name should follow site standards for base table names
while also being as descriptive as possible about the business information contained
within the table. Also, use an explicit authorization-id instead of allowing the
authorization-id to default to the user name of the current user. Select an
authorization-id that meets your site standards, and can serve as a consistent owner
for all application objects within the database.
When you have finished, verify that the number of tables matches the count of entity
boxes in the E-R diagram.
• If the usage of a numeric data item requires it to be of a fixed width, use the
DECIMAL data type with the required precision and an appropriate scale. If
an integer is desired, use DECIMAL with a scale of zero. This helps to
communicate external formatting requirements more clearly.
• Avoid the use of INTEGER and SMALLINT data types in general. Use these
only for internal control fields such as cycle counters. Formatting these fields
for external display use can be problematic due to the indeterminate display
width.
• Avoid using LONG VARCHAR unless absolutely required. Each LONG
VARCHAR column must be stored on a separate page, apart from the rest of
the row to which it belongs. Also, each of these columns will occupy at least
an entire page by itself, regardless of how much free space on the page may
be wasted. Therefore, using LONG VARCHAR increases both the number
of I/O operations to read a row, as well as the amount of space required to
store the data. In addition, some @ functions cannot be used with LONG
VARCHAR columns.
• Avoid using CHAR or VARCHAR for data that will always be numeric.
Using a numeric data type will ensure that the field will always be populated
with valid numeric data.
• Avoid the use of the REAL (or FLOAT or DOUBLE PRECISION) data type
except for situations calling for scientific measurement information
containing a wide range of values. The exponent-mantissa format of these
data types lacks the exact precision associated with other numeric data types,
and also causes floating point operations to be performed for their
manipulation. On most CPUs, floating point arithmetic incurs greater cycle
times than integer operations.
• Use the DATE and TIME formats when expressing chronological data.
• Use DATETIME (or TIMESTAMP) for control purposes. Many designers
include a TIMESTAMP field for tracking row creation and updates.
Null usage
When defining data types for columns, you will also need to specify the column’s
ability to assume the null value. Note that null is a special value that indicates either
“unknown” or “not applicable”. Users and programmers not accustomed to the use of
nulls may have problems remembering to include host variable declarations and tests
for nulls in columns that allow them. Nulls can also cause problems when they are in
columns that are used for joining tables together. When the join is performed, any
rows that contain nulls in the join column of either table will not be returned to the
result set. This “missing data” may be disconcerting to a user of the query, who may
have expected to see all of the rows of at least one of the tables. Of course, an outer
join specification (using the “+” operator) may be used to force all rows of one of the
tables to be returned, but this requirement could be easily overlooked. Consider the
following factors when deciding whether to allow a null value in a column:
• Columns participating in primary keys should always be defined with NOT
NULL, since they are required to be populated with unique values for each
row.
• Foreign keys should generally be defined with NOT NULL. Whenever a
child table is dependent on a parent, the foreign key to the parent table cannot
be null, since the existence of a child occurrence without a parent would
violate the dependency rule of the relationship. Only foreign keys to tables
with optional relationships can be considered as candidates for the use of
nulls, to signify that no relationship exists. Also, foreign keys defined with a
delete rule of SET NULL must be defined as nullable.
• Use NOT NULL WITH DEFAULT for columns with DATE, TIME, or
DATETIME data types to allow current date and time data to be stored in the
field automatically. This is particularly useful for row creation timestamps.
• Allow nulls for those columns that will actually have an anticipated need for
the unknown or inapplicable meaning of the null value. The null value works
especially well in numeric fields that may be subjected to aggregation uses
such as SUM or AVERAGE.
• Use NOT NULL WITH DEFAULT for all columns not meeting one of the
above criteria.
Example 1
We will carry forward the example ERD used in the previous chapter to demonstrate
diagramming styles to show the transformation from external to internal schema. The
following SQL statements represent the current state of the design at this point. Note
that a site standard requires base table names to be suffixed with a “_T” to facilitate
easy identification.
CREATE TABLE OESYS.CUSTOMER_T (
CUSTOMER_NO DECIMAL(6,0) NOT NULL,
CUST_NAME VARCHAR(45) NOT NULL,
CUST_ADDR1 VARCHAR(35) NOT NULL,
CUST_ADDR2 VARCHAR(35) NOT NULL WITH DEFAULT,
CUST_ADDR3 VARCHAR(35) NOT NULL WITH DEFAULT,
CUST_CITY VARCHAR(45) NOT NULL,
CUST_STATE_CD CHAR(2) NOT NULL,
CUST_ZIP DECIMAL(5,0) NOT NULL WITH DEFAULT)
PCTFREE 10;
CREATE TABLE OESYS.ORDER_T
(ORD_SALES_CENTER_NO DECIMAL(2,0) NOT NULL,
ORD_SEQ_NO DECIMAL(6,0) NOT NULL,
could be performed more efficiently, in the vast majority of cases, the DBMS itself is
able to maintain a higher level of consistency with a lower cost. The two most
important factors contributing to this are its ability to perform integrity checks for all
database accesses, and its ability to always use the most efficient access technique
available to perform the referential checks.
Selection of appropriate referential integrity constraints requires an understanding of
the business rules governing the way applications can delete and update data. Use of
this knowledge allows the database designer to select the appropriate delete rules
from among the three options available in SQLBase. These options are:
• Restrict, which disallows deletion of parent rows if any child rows exist. This
is the default, and is generally the safest option.
• Set Null, which sets the foreign key values of child rows to null when parent
rows are deleted.
• Cascade, which deletes child rows when parent rows are deleted.
There are also implications of delete rules resulting from the transitive relationships
between three or more tables that can cause restrictions to be placed on the designer’s
selection. Furthermore, some DML coding options can be constrained through the use
of referential integrity. The designer must consider all these factors when selecting
appropriate referential integrity constraints for the database.
Defining keys
Name the columns comprising the primary key of each table in the PRIMARY KEY
clause of the CREATE TABLE statement. Since the primary key is also required to be
unique, the next step is required before the primary key definition is complete.
Following each table’s CREATE TABLE statement, code a CREATE UNIQUE
INDEX statement that builds a unique index on the primary key columns of the table.
Include only those columns that are in the primary key, and index them in the same
order as they are specified in the PRIMARY KEY definition clause of the CREATE
TABLE statement. Consider using some indicator (such as a preceding, or following,
“XPK”) in the index name to signify that this is a primary key index, since
inadvertently dropping this index would have detrimental effects on the desired
referential integrity for the database. It is important to remember that both the
PRIMARY KEY clause of the CREATE/ALTER TABLE statement and the CREATE
UNIQUE INDEX are required before SQLBase will consider the primary key
definition for the table to be complete.
At the end of the DDL file (following all CREATE TABLE/INDEX statements), code
an ALTER TABLE statement that adds the foreign key definitions needed for each
child table in the database. Adding foreign keys through this ALTER technique
guarantees that each table’s primary key definition is completed (with its associated
index) prior to being referenced in a foreign key clause by some other table.
Example 2
Continuing from example 1, we now incorporate the required referential integrity
definitions to define the relationships in the ERD. These modifications include the
PRIMARY KEY clause of the CREATE TABLE statement, the CREATE UNIQUE
index statement for each primary key, and finally, the ALTER TABLE statements
needed to define the foreign keys.
CREATE TABLE OESYS.CUSTOMER_T (
CUSTOMER_NO DECIMAL(6,0) NOT NULL,
CUST_NAME VARCHAR(45) NOT NULL,
CUST_ADDR1 VARCHAR(35) NOT NULL,
CUST_ADDR2 VARCHAR(35) NOT NULL WITH DEFAULT,
CUST_ADDR3 VARCHAR(35) NOT NULL WITH DEFAULT,
CUST_CITY VARCHAR(45) NOT NULL,
CUST_STATE_CD CHAR(2) NOT NULL,
CUST_ZIP DECIMAL(5,0) NOT NULL WITH DEFAULT,
PRIMARY KEY(CUSTOMER_NO))
PCTFREE 10;
CREATE UNIQUE INDEX OESYS.XPKCUSTOMER
ON OESYS.CUSTOMER_T
(CUSTOMER_NO)
PCTFREE 10;
CREATE TABLE OESYS.ORDER_T (
ORD_SALES_CENTER_NO DECIMAL(2,0) NOT NULL,
ORD_SEQ_NO DECIMAL(6,0) NOT NULL,
ORD_CUSTOMER_NO DECIMAL(6,0) NOT NULL,
ORD_PLACED_DATE DATE NOT NULL WITH DEFAULT,
ORD_CUST_PO VARCHAR(35) NOT NULL WITH DEFAULT,
ORD_CUST_PO_DATE DATE NOT NULL WITH DEFAULT,
ORD_STATUS CHAR(1) NOT NULL,
ORD_CHANGED_DATE DATE,
ORD_INVOICED_DATE DATE,
PRIMARY KEY(ORD_SALES_CENTER_NO,ORD_SEQ_NO))
PCTFREE 10;
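As an illustration of the final step, a foreign key definition for ORDER_T might look like the following sketch. The exact ALTER TABLE clause form shown here is an assumption (check the SQLBase SQL Language Reference for the syntax your release supports); the point is that the foreign key is added only after both tables and the primary key index on CUSTOMER_T exist, and that RESTRICT is chosen as the delete rule.

ALTER TABLE OESYS.ORDER_T
FOREIGN KEY (ORD_CUSTOMER_NO)
REFERENCES OESYS.CUSTOMER_T
ON DELETE RESTRICT;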
Chapter 4
External Schema Creation
In this chapter we cover the creation of the initial external schema. This requires
coding the data definition language necessary to build views on the underlying tables
of the internal schema, called base tables. It also means establishing the procedures
needed to allow the external schema to evolve as requirements for new user views of
the data become known. Security mechanisms may also be put into place to insure
that all data access is accomplished through the external schema, rather than by direct
access to base tables. In this way, the base tables are allowed to change while isolating
the user views from some of the impact of these modifications.
and rows that may be returned from a view with the result set that would exactly
satisfy the user’s request. For example, a rule of thumb might deem a view acceptable
when the desired result set makes up at least 75% of the columns and rows that the
view returns.
The security administration work involved in enforcing the use of views can be
minimized as well. Techniques such as adopting automated procedures for the
creation and execution of the data control language (DCL, the SQL GRANT and
REVOKE statements) can greatly improve the consistency of security authorizations
as well as ease the workload. Making routine use of these procedures at key points in
the application development life cycle lowers the requirement for unscheduled, on-
demand security administration.
No programs or queries should be written which access the base tables, either by the
DBAs or anyone else. In order to enforce the use of the external schema as the only
means of accessing the database, do not grant authorization on the base tables to
anyone other than DBAs. Even the DBAs should exercise restraint in their use of the
base tables, using them only for utilities which exclude views from being named. The
views themselves can either be granted to PUBLIC or to individual users, depending
on how restrictive the organization’s security policies are.
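For example, using the mirror views created in the example that follows, the grants might look like this sketch. OEAPP is a hypothetical application user id, and granting INSERT, UPDATE, and DELETE assumes the single-table mirror view is updatable; no grants at all are issued against the _T base tables.

GRANT SELECT ON OESYS.OPENLINES TO PUBLIC;
GRANT SELECT ON OESYS.CANCELED_ORDERS TO PUBLIC;
GRANT SELECT, INSERT, UPDATE, DELETE ON OESYS.CUSTOMER TO OEAPP;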
While these security restrictions may seem excessive to some, it is only through their
usage that the goal of using the external schema to isolate user data views from
internal implementation can be approached and eventually achieved.
Example
Carrying forward the example of the previous two chapters, we will now create the
mirror views of the base tables:
CREATE VIEW OESYS.CUSTOMER
AS SELECT
CUSTOMER_NO,
CUST_NAME,
CUST_ADDR1,
CUST_ADDR2,
CUST_ADDR3,
CUST_CITY,
CUST_STATE_CD,
CUST_ZIP
FROM OESYS.CUSTOMER_T;
CREATE VIEW OESYS.ORDERS
AS SELECT
ORD_SALES_CENTER_NO,
ORD_SEQ_NO,
ORD_CUSTOMER_NO,
ORD_PLACED_DATE,
ORD_CUST_PO,
ORD_CUST_PO_DATE,
ORD_STATUS,
ORD_CHANGED_DATE,
ORD_INVOICED_DATE
FROM OESYS.ORDER_T;
CREATE VIEW OESYS.ORDER_ITEM
AS SELECT
OITM_SALES_CENTER_NO,
OITM_SEQ_NO,
OITM_LINE_NO,
OITM_PRODUCT_NO,
OITM_ORDER_QTY,
OITM_SHIPPED_QTY,
OITM_MEASURE_UNIT,
OITM_COST_AMT,
OITM_SELL_PRICE_AMT
FROM OESYS.ORDER_ITEM_T;
CREATE VIEW OESYS.PRODUCT
AS SELECT
PRODUCT_NO,
PROD_DESC,
PROD_DESC_SHORT,
PROD_MEASURE_UNIT_SMALL,
PROD_COST_SMALL_AMT,
PROD_MEASURE_UNIT_MED,
PROD_COST_MED_AMT,
PROD_MEASURE_UNIT_LARGE,
PROD_COST_LARGE_AMT,
PROD_STATUS,
PROD_ON_HAND_QTY,
PROD_AVAIL_QTY,
PROD_ON_ORDER_QTY
FROM OESYS.PRODUCT_T;
Now we will create two views for known user requirements. The first view is for the
aged open order line inquiry and the second view is for the daily cancelled order
report:
CREATE VIEW OESYS.OPENLINES
AS SELECT
ORD_SALES_CENTER_NO,
ORD_SEQ_NO,
OITM_LINE_NO,
OITM_PRODUCT_NO,
OITM_ORDER_QTY - OITM_SHIPPED_QTY AS ORDER_QTY,
ORD_PLACED_DATE
FROM OESYS.ORDERS,OESYS.ORDER_ITEM
WHERE ORD_SALES_CENTER_NO = OITM_SALES_CENTER_NO
AND ORD_SEQ_NO = OITM_SEQ_NO
AND ORD_STATUS = 'O'
AND OITM_ORDER_QTY > OITM_SHIPPED_QTY;
CREATE VIEW OESYS.CANCELED_ORDERS
AS SELECT
CUSTOMER_NO,
CUST_NAME,
ORD_SALES_CENTER_NO,
ORD_SEQ_NO,
ORD_CUST_PO,
ORD_CUST_PO_DATE,
ORD_PLACED_DATE,
ORD_CHANGED_DATE,
OITM_SELL_PRICE_AMT,
OITM_ORDER_QTY - OITM_SHIPPED_QTY AS OPEN_ORDER_QTY,
OITM_PRODUCT_NO,
PROD_DESC
FROM OESYS.CUSTOMER, OESYS.ORDERS, OESYS.ORDER_ITEM, OESYS.PRODUCT
WHERE ORD_CHANGED_DATE = CURRENT DATE
AND CUSTOMER_NO = ORD_CUSTOMER_NO
AND ORD_SALES_CENTER_NO = OITM_SALES_CENTER_NO
AND ORD_SEQ_NO = OITM_SEQ_NO
AND ORD_STATUS = 'C'
AND OITM_ORDER_QTY > OITM_SHIPPED_QTY
AND OITM_PRODUCT_NO = PRODUCT_NO;
Chapter 5
Transaction Definition
This chapter:
• Explains the purpose of well-defined transactions:
• Database design benefits
• Data integrity considerations
• Covers the detailed information required for each transaction:
• Identifying information
• Categorical information
• Volume information
• Performance requirements
• Data requirements
Data integrity
Database transactions usually involve multiple database accesses. For example,
consider the banking transaction of withdrawing $1000 from a savings account and
depositing the funds into a checking account. This transaction entails two separate
database operations: subtracting $1000 from the savings table and adding $1000 to
the checking table. If either one of these database operations successfully completes
but the other does not, then the account balances would be wrong. Both of these
operations must complete successfully or neither of them should complete
successfully.
Thus, a database transaction is a logical unit of work which causes the database to
advance from one consistent state to another. In other words, a transaction moves the
database from one state which does not violate the data integrity requirements of the
database to another state which does not violate the data integrity requirements of the
database.
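Expressed in SQL, the funds transfer described above is a single unit of work. The table and column names and the bind variables below are hypothetical; the point is only that the COMMIT applies to both statements together:

update savings set balance = balance - 1000 where acct_nbr = :SavingsAcct;
update checking set balance = balance + 1000 where acct_nbr = :CheckingAcct;
commit;

If either update fails, the program issues a rollback instead, and the database returns to the consistent state it held before the transaction began.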
Defining transactions
Transaction definitions can take many forms. Some organizations may possess an
upper CASE tool with a transaction definition facility as part of a data repository.
Other organizations may choose to define transactions using paper-based or
automated forms. Regardless of the media, all good transaction definitions include
several key ingredients. Each of these is discussed below.
In Decision Support Systems (DSS), transactions are not known ahead of time. These
systems are characterized by ad hoc and exception reporting. Therefore, it is not
possible to describe each and every transaction. With these systems, the transaction
definitions are illustrative of the actual transactions which will be submitted to
SQLBase once the system is implemented. DSS systems require the database
designer to predict the type of transactions which the users will likely run.
Transaction volumes
Transaction definitions must also include volume information. This typically includes
the average and peak frequency of each transaction. For example, a branch bank
might estimate that the funds transfer transaction described in the introduction of this
chapter will occur 50 times an hour with peak loads reaching 65 transactions per
hour.
The transaction volume statistics are extremely important for physical database
design; consequently, care should be taken to be as accurate as possible. For example,
the database designer will handle a transaction which occurs 1000 times per hour
quite differently from one that will occur only once per hour.
If the new computer system is replacing an existing system, it may be possible to
derive transaction volumes from the existing system. On the other hand, new systems
rarely duplicate the transaction logic of their predecessors so these differences should
be considered.
Typically, the hardest situation in which to deduce accurate transaction volumes is one
in which no automated or manual predecessors exist. These situations occur when
organizations enter new lines of business. In this case, volume statistics may be no
more than educated guesses.
Relative priority
Transaction definitions also should rank each transaction in terms of relative priority,
which describes how important to the business one transaction is compared to all
other transactions. Physical database design ensures that each transaction meets or
exceeds its performance requirements. Consequently, the transactions must be
evaluated one at a time. The relative priority of each transaction provides the
sequence in which transactions are evaluated. This ensures that the most important
transactions receive the most attention.
In addition, during the process of physical database design, the database designer will
tailor the physical database design so that one given transaction executes faster.
Making one transaction go faster frequently negatively affects the performance of
other transactions. Relative priority ranking also provides the basis for any
transaction arbitration scheme.
The ranking can take many forms. For example, some organizations assign a number
from one (high) to ten (low) to each transaction.
Also keep in mind that the DML should be coded against the external, rather than the
internal, schema. The initial external schema created in chapter 4 will probably
contain many of the views needed to support the transactions. Other views may need
to be created in order to meet the tailored needs of some transactions. If the programs
are coded directly against the base tables contained in the internal schema, there will
be a greater risk of impacting them during the physical design control phase of the
database design.
Example: The table below shows how the SQL for the funds transfer
transaction might appear in a transaction definition.
SQL Statement (# Rows: 1):
select amt from account
where acct_nbr = :DebitAcct
Notes: This SQL command retrieves the balance of the debit account. If it is less than the transfer amount, the program should report an “insufficient funds” error to the teller. Also, the SQL statement should use either the RR isolation level or the SELECT FOR UPDATE command to ensure that no other transactions withdraw funds from the debit account while this transaction is in progress.

SQL Statement (# Rows: 1):
update account
set amt = amt - :TransferAmt
where acct_nbr = :DebitAcct
Notes: This SQL command decreases the funds in the debit account by the transfer amount.

SQL Statement (# Rows: 1):
update account
set amt = amt + :TransferAmt
where acct_nbr = :CreditAcct
Notes: This SQL command increases the funds in the credit account by the transfer amount.
Chapter 6
Refining the Physical Design
Introduction
In this chapter we present rules which may be used to refine the first cut physical
database design. These rules are typically applied in the following manner:
• The database designer recognizes a scenario in the database which is often
the subject of a particular design technique.
• The designer then applies that technique to the database on the assumption
that the benefits will be both substantial and required.
As such, these rules do not offer specifically proven benefits for every case, but
represent common database design rules that are usually incorporated into physical
designs by experienced DBAs. You should carefully evaluate every database as a
unique application, where some unusual characteristics may cause the assumptions
on which these rules are based to be violated. For these cases, as well as for further
refinement of any design, the more rigorous predictive techniques presented in
Chapter 7 may be applied.
The guidelines detailed in this chapter are discussed under the following general
categories:
• Splitting tables into two or more tables
• Clustering of table rows
• Indexing
• Database partitioning
• Freespace calculation
Splitting tables
One of the most common denormalization techniques that may be applied to a
physical database design is the vertical or horizontal splitting apart of tables. Vertical
splitting is the process of moving some of a table’s columns to a new table that has a
primary key identical to that of the original table. Horizontal splitting is the moving
of some of the rows of one table to another table that has a structure identical to the
first. The resulting tables can be managed in ways that tailor each to its particular
performance requirements. The basic motivation for performing a split is that either
the original table has some performance problem that must be solved, or one or more
subsets of the original table has significantly different performance requirements that
cannot be met in combination.
When the row length is larger than the database page size
Some entities in the ERD may contain a large number of attributes or a few unusually
long attributes. When these entities are converted to a SQLBase table, the size of a
row can be actually larger than the usable database page size. In this case, when a row
is inserted into the table, SQLBase will allocate one or more extent pages on which to
store the row (see chapter 8, Database Pages). Consequently, multiple physical I/O
will be required to retrieve the row, one for the base page and one for each extent page
(worst case scenario). These multiple I/O operations will degrade performance. To
resolve this situation, the table can be vertically split in order to bring the row size
down to a more manageable size.
SQLBase’s usable page size is approximately 938 bytes. If the average row length for
a table is greater than the usable page size and optimum performance is required when
accessing the table, the database designer should split the table into two or more
tables. The method used to perform the split is straightforward, since the split is
simply the physical manifestation of a relational projection on the table: some of the
columns in the original table are moved to a new table, which reduces the row length
of the original. The split does not change any inter-column relationships (each
column should still be fully dependent on the entire primary key), so each new table
should have an exact copy of the original table’s primary key as its own
primary key.
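For example, a minimal sketch of such a split (the CUSTOMER table shown here, its
column names, and data types are purely illustrative) gives both tables the original
primary key:
CREATE TABLE CUSTOMER
(CUST_ID INTEGER NOT NULL,
NAME VARCHAR(40),
ZIP_CODE CHAR(10),
PRIMARY KEY (CUST_ID));
CREATE TABLE CUSTOMER_DETAIL
(CUST_ID INTEGER NOT NULL,
CREDIT_HISTORY VARCHAR(254),
COMMENTS VARCHAR(254),
PRIMARY KEY (CUST_ID));
Transactions that read only the name and zip code now touch a much shorter row,
while the infrequently accessed columns are retrieved with a join on CUST_ID only
when they are needed.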
Determining which columns to place in each table is the most critical decision
affecting the successful application of this technique. This is because the split can
only result in a benefit to transactions that can satisfy their data requirements by
accessing only one of the resulting tables. Any transactions that must perform a join
to reassemble the original row will fail to benefit from the higher blocking factor due
to the additional I/O operations required to read the new table. To determine which
columns to place in each table, the database designer should analyze the access
patterns of all high priority transactions and place together in the same table those
columns which are accessed together the most frequently.
NumberOfRows = pagesize / rowsize
The purpose of splitting a table apart is to attempt to achieve a number of rows per
page that is at least as great as the number of rows that will be retrieved together by
common transactions. Since the size of the page is fixed by the implementation of
SQLBase on each hardware platform, the only way to influence the number of rows
per page is by adjusting the size of the row. Since there is usually no such thing as too
many rows per page, the direction this adjustment takes is to reduce the row size in an
attempt to increase the number of rows on a page. After eliminating any spurious
column definitions and reducing the maximum width of VARCHAR and CHAR
columns to a minimum, the only remaining technique is to split the table apart.
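As an illustration (the row sizes are hypothetical), with a usable page size of roughly
938 bytes, a table with a 450-byte average row stores 938 / 450, or about 2, rows per
page. Splitting the table so that the frequently accessed columns form a 150-byte row
raises this to 938 / 150, or about 6, rows per page, tripling the number of rows
returned by each physical I/O.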
size varies significantly over time, the benefits of creating the clustered
hashed index may be lost. A possible technique to get around this problem,
though, would be to drop and recreate the table and index often, with the row
occurrence estimate updated each time the index is recreated. This method
also requires an unload and load of the table as well, and for large tables may
be too time consuming.
In order to determine the average size of the foreign key grouping in terms of rows,
the following formula may be used:
grpsize = rows / fkeys
The numerator, which is the total number of rows in the table, may be determined
through estimation or through the use of the following query if the table already
exists:
SELECT COUNT(*) FROM tablename;
The denominator, which is the total number of unique foreign key values, may be
estimated based on the number of rows in the parent table where this foreign key is
the primary key. Alternatively, the following query may be run if the child table
already exists:
SELECT COUNT(DISTINCT ForeignKeyColumnName) FROM tablename;
This calculation yields the average number of rows for each foreign key group
occurrence.
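If the child table already exists, both counts can also be obtained with a single
statement (a sketch using the same placeholder names as above; note that integer
division may truncate the result, in which case the two counts can simply be retrieved
separately):
SELECT COUNT(*) / COUNT(DISTINCT ForeignKeyColumnName) FROM tablename;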
If the resulting group size is very large (a good rule of thumb is greater than 20), then
the table is probably not a good candidate for clustering. The group size may be offset
somewhat by the row length, however. The smaller the row length, the greater the
group size that can be tolerated. This is because the true measure of how quickly
SQLBase will be able to retrieve all the rows in a foreign key grouping is determined
by how many pages are required to store these rows. The following formula can be
applied to determine the number of pages which will typically be required (remember
to calculate the correct page size for the implementation platform on which you will
run SQLBase):
pages = (grpsize × rowlength) / pagesize
If this formula yields a page spread of more than one, overflow will become a
problem and this table is probably not a very good candidate for a hashed clustered
index based on this foreign key.
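For example (the figures are illustrative), a foreign key group of 15 rows averaging 60
bytes each requires (15 × 60) / 938, or about one page, and clustering remains
attractive. The same group size with 400-byte rows requires (15 × 400) / 938, or more
than six pages, and the table is a poor candidate for a clustered hashed index on this
key.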
Indexing
An index in a relational database is a data structure that allows rapid access of table
rows based on the value of data in one or more columns of those rows. (For a detailed
explanation of SQLBase’s implementation of BTree indexes, see chapter 9). There
are only two reasons to build an index on a relational table. The first is to ensure the
uniqueness of the values being indexed, which is why all primary key columns must
have a unique index defined on them. The other reason for creating indexes (and the
only reason to create a nonunique index) is to enhance the performance of the
database.
The drawback of indexes is the amount of resources that they consume. Every index
(other than a clustered hashed index) requires disk space for the storage of the index
tree structure. Each of these indexes also requires CPU cycles in order to be created
and maintained. In particular, indexes consume processing time whenever a
transaction performs an INSERT or DELETE operation against an indexed table, or
performs an UPDATE of a table row column that participates in one or more indexes.
Exactly how much overhead is created depends on the characteristics of the index and
subject table, but the overhead is not trivial in any case.
The challenge for the database designer is to create indexes only for those cases
where the benefits realized from the access time improvements outweigh the
penalties suffered from the overhead of building and maintaining the index. While
some scenarios can be identified as potentially good (or bad) candidates for an index,
no irrevocable ‘rules of thumb’ exist. For each index that a database designer creates,
the costs and benefits should be measured in order to justify the index’s continued
existence. If the designer discovers that an index is no longer worth its cost, then it
should be dropped.
As complex as this cost/benefit analysis is, it is further complicated by the fact that
the benefits are not actually within the control of the DBA. In SQLBase (as with any
DBMS implementation using the SQL standard), the decisions controlling how the
database is accessed are not made by the programmer, the user, or the DBA. Rather,
they are made by a software module within the DBMS called the optimizer.
Therefore, the indexes which are finally selected by the DBA are subsequently
reviewed through a pre-determined statistical analysis by the optimizer, where they
may be found devoid of value and not used at all. Actually, the optimizer will select
some indexes for some DML statements, other indexes for other statements, and no
indexes for still other statements. For this reason, an accurate evaluation of the
benefits side of the index problem requires a good understanding of the optimizer and
how to monitor the decisions made by it.
The purpose of this section is to help the designer select possible indexes that are
worth considering within a physical database design. These must then be evaluated
for their effectiveness with particular DML statements through an analysis of the
decisions made by the optimizer.
The way that the optimizer considers the effect of cardinality is through the use of a
statistic called the selectivity factor. The selectivity factor of an index is simply the
inverse of the combined cardinality of all index columns:
SelectivityFactor = 1 / Cardinality
The selectivity factor is critical because it is the criterion used by the SQLBase
optimizer to decide whether an index is acceptable for use in the access path that is
built to satisfy the data requirements of a given DML statement. The optimizer uses a
formula that combines the selectivity factor of the chosen index, along with
information about the index’s width and depth, in order to compute an estimate of the
amount of I/O needed for a particular access path. The lower the selectivity factor, the
lower the resulting I/O estimate will be, which increases the possibility of the index
being selected for use.
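For example (the figures are illustrative), a unique index on a 50,000-row table has a
cardinality of 50,000 and therefore a selectivity factor of 1 / 50,000, or 0.00002. By
contrast, an index on a column holding only two distinct values has a selectivity
factor of 0.5, which makes it far less attractive to the optimizer.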
These characteristics of columns and indexes are used in the descriptions of good and
poor index candidates that follow.
Composite indexes
When an index is created on two or more columns, it is called a composite index.
When creating composite indexes, consider the effect that the order of columns
within the index has on performance. The best performance is achieved by placing the
column that usually has the most restrictive selection criteria used against it in the
high order index position. If you do not know which column meets this criterion, then
place the column with the highest cardinality first. Following these guidelines results
in the lowest possible selectivity factor for the index. Of course,
if the intended use of the index is to provide a generic search capability, then the
columns which will always be fully specified must be placed first in the index or the
searching capability will be lost.
One good reason to use composite indexes is to achieve index only access. This is
when all of the data which is required by a query can be found within the index itself.
A good example is a lookup table that finds tax rates for various county codes. If a
composite index is placed on the county code followed by the tax rate, then a query
that specifies an equality predicate on the county code within the WHERE clause, and
requests retrieval of the associated tax rate may be completed with an access to the
index entry without any access to the page containing the actual row.
Example:
A table is used by transactions to perform a lookup of county tax rates. Since
this is done often, the DBA wishes to allow the operation to be performed
with a minimum number of I/O operations. This is the SQL that is the most
efficient in this case.
• This index was required to complete the primary key definition of the
table:
CREATE UNIQUE INDEX XPKCNTYTAX
ON COUNTY_TAX (COUNTY_CODE);
• The DBA realized that adding a new index which includes the
TAX_RATE column, in the low order of the key, would allow the lookup
to perform index only access, resulting in an I/O savings:
CREATE UNIQUE INDEX XCNTYTAX
ON COUNTY_TAX (COUNTY_CODE, TAX_RATE);
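With the XCNTYTAX index in place, a lookup such as the following (the bind
variable name is illustrative) can be satisfied entirely from the index, since both
COUNTY_CODE and TAX_RATE appear in the index entry:
SELECT TAX_RATE FROM COUNTY_TAX
WHERE COUNTY_CODE = :CountyCode;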
An alternative to composite indexes is creating separate indexes for each column. One
advantage is that these separate indexes may be used when processing the indexed
columns in a WHERE clause that specifies predicates that are OR’d together, since a
composite index cannot be used in this case. Another advantage is that if all the
columns are indexed separately, then you do not have to concern yourself with having
the leading columns of a composite index always being specified in WHERE
predicates.
Example:
Suppose that the following set of queries access the STUDENT table, and
each query is contained within a number of transactions whose response time
is critical. These are the SQL query statements:
SELECT NAME FROM STUDENT
WHERE ID = :id;
SELECT ID FROM STUDENT
WHERE NAME = :name
OR HOME_CITY = :home_city;
SELECT NAME FROM STUDENT
WHERE HOME_CITY = :home_city
AND MAJOR = :major;
SELECT AVG(GPA) FROM STUDENT
WHERE MAJOR = :major;
• Since ID is the primary key of the STUDENT table, query number one will
never be a problem. A unique index must exist on ID in order for the primary
key definition to be complete.
• A composite index on NAME and HOME_CITY can never be used by query
number 2, regardless of the column order, since the logical connector is OR.
This statement can only be satisfied by a table scan, or an index merge
between two indexes where the specified predicates occupy the high order
position.
• A composite index built on HOME_CITY followed by MAJOR could be
used by query number 3, but would not be acceptable for query number 4,
since MAJOR is not in the most significant position.
• The following set of indexes could be considered for use by every one of the
above queries, and represents the most efficient way to satisfy the above
queries:
CREATE UNIQUE INDEX XPKSTUDENT
ON STUDENT (ID);
CREATE INDEX XNKSTUNAME
ON STUDENT (NAME);
CREATE INDEX XNKSTUHOME
ON STUDENT (HOME_CITY);
CREATE INDEX XFKSTUMAJOR
ON STUDENT (MAJOR);
• Columns that are often the subject of aggregate functions (involving SUM,
AVG, MIN, and MAX).
• Columns often used for validity editing in programs, where a value entered
by the operator is checked for its existence in some table.
Skewness of keys
Another index pitfall is when the index’s key value distribution is significantly
skewed. In this situation, the cardinality of the index will appear to be quite high,
allowing the optimizer to compute a sufficiently low selectivity factor to make use of
the index often. In spite of this, the performance results of queries that utilize the
index are often unacceptable. This is because when there is a large percentage of rows
that have the same value for the index key, the index performance when accessing
these rows is much worse than when retrieving rows whose index key value occurs
less often. The following query will show the distribution of index values in a subject
table:
SELECT key1, key2, ..., keyn, COUNT(*) FROM tablename
GROUP BY key1, key2, ..., keyn ORDER BY LastColumnNo;
This query creates a report of index keys ordered from those occurring least often to
those occurring most frequently. If the difference between the occurrences at the top
of the list and those at the bottom is several orders of magnitude or greater, then this
index is probably going to cause performance problems when the rows being retrieved
fall into the more often occurring values. The fact that the total cardinality of this
index appears to be acceptable, and therefore its use is considered desirable by the
optimizer, makes this an insidious situation. It is only when the skewness of the keys
themselves is examined that the database designer discovers the performance
penalties that will accrue from the creation of this index.
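As a concrete illustration using the STUDENT table from the earlier example (the
column choice is arbitrary), the distribution of an index key on MAJOR could be
examined with:
SELECT MAJOR, COUNT(*) FROM STUDENT
GROUP BY MAJOR ORDER BY 2;
If one or two majors account for the bulk of the rows while the rest occur only a
handful of times, an index on MAJOR will perform poorly whenever the popular
values are retrieved, even though its overall cardinality appears acceptable.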
Database partitioning
The ability to partition a database across a number of database areas allows for the
throughput of the database to be greatly increased. Much of the benefit depends on
the way the operating system environment allows SQLBase to manage these database
areas. This varies depending on the platform on which the server is implemented
(such as NetWare or OS/2). In general, SQLBase creates a separate I/O thread for
each database area, which allows for a greater level of parallelism to be achieved in
the disk subsystem. Whether or not this will create bottlenecks at the disk controllers
or the drives themselves depends on the physical configuration of the I/O subsystem.
The other benefit realized from database partitioning is the ability to have databases
which are greater than the physical size of any individual disk drive. For very large
databases, there is no alternative to database partitioning since there may be no disk
drive available which is sufficient to store the entire database on a single drive.
The drawback to partitioned databases is that the physical disk location of a database
may be harder to identify. This becomes even more true if the DBA decides to allow
database areas to be shared by different storage groups, which are in turn used by
different databases. Monitoring free space in database areas is more difficult than
monitoring free space on a physical device, and determining the total free space
available to a storage group may involve the examination of a number of database
areas. Also, the DBA must be careful to maintain synchronization between the MAIN
database, which stores all of SQLBase’s database area extent information, and all of
the databases which are partitioned.
The bottom line to partitioning is that it is the best way to enhance throughput and
capacity in an appropriate operating environment, but it should only be undertaken
when the benefits are truly needed. Those DBAs who decide to implement database
partitioning must be prepared to spend time ensuring that their monitoring, backup,
and recovery procedures are all up to the task of handling the more complex details
associated with this environment.
When deciding how to partition a database consider the following factors:
• Do not share database areas between databases and logs. Always specify
different storage groups for the data and for the log files, and ensure that no
database areas are shared between these two storage groups.
• Place database areas intended for logs on a different physical device than any
of the database areas which are used for the data storage of tables. If a
database’s data files share a disk drive with their own log files, then the log
files may be damaged along with the database in a disk drive failure, and will
not exist to perform a full recovery of the database.
• Consider not sharing storage groups between different databases. This
prevents databases from being interspersed in the same database area files, so
that if a failure occurs, the affected database will be more easily isolated.
• If you are implementing database partitions in UNIX raw partitions, make
sure you have a technique developed to monitor space in the partition.
• Do not set a default storage group unless you are absolutely sure that you are
prepared to manage all databases as partitioned.
• The following query may be executed against the MAIN database using the
guest/guest account in order to ascertain available free extents in a database’s
database areas:
select a.name,b.stogroup,b.areaname,
@trim(c.pathname), c.areasize,
@nullvalue(100*(d.extsize/c.areasize),0)
from syssql.databases a, syssql.stoareas b,
syssql.areas c, syssql.freeexts d
where b.areaname = c.name
and c.name = d.name(+)
and ((a.stogroup = b.stogroup)
or (a.logstogroup = b.stogroup))
order by 1,2
Freespace calculation
When determining the appropriate values for the PCTFREE specification, consider
the following factors:
• Tables which are read-only require no free space, and neither do indexes on
these tables. Specify PCTFREE 0.
• If many table columns are often increased in length, specify a high (30-50)
PCTFREE value to accommodate this growth.
• If a table is expected to have columns added to it after its initial creation,
specify a high PCTFREE to avoid the creation of many extent pages when the
new columns are added. However, if the new column will be populated with
non-null values only gradually, then specify a medium PCTFREE value (10-
30).
• If transactions are experiencing problems with waiting for locks or deadlocks
due to a large number of active rows being located on the same page, the
concurrency for this table may be increased through the PCTFREE value.
Alter the table definition with a high PCTFREE value (80-90), then perform
a reorganization. The table will now have very few rows on each page
(possibly as few as one), which will decrease the likelihood of page locks
contending with the access of other rows. The downside is that the table will
occupy more room on the disk, and any table scan operations will take longer.
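As a sketch of how these choices might be expressed in the DDL (the table and
column names are illustrative, and the exact placement of the PCTFREE clause
should be confirmed against the SQL reference for your SQLBase release):
CREATE TABLE STATE_CODES
(STATE_CODE CHAR(2) NOT NULL,
STATE_NAME VARCHAR(30))
PCTFREE 0;
CREATE TABLE ACCOUNT_ACTIVITY
(ACCT_NBR INTEGER NOT NULL,
AMT DECIMAL(11,2))
PCTFREE 80;
The first table is read-only reference data and reserves no free space; the second is a
heavily updated table whose pages are kept nearly empty to reduce lock contention,
at the cost of more disk space and longer table scans.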
Following the implementation of the database, the DBA needs to observe its
performance over time to determine if any degradation is being realized prior to
reorganizations. If there is a significant performance drop due to excessive extent
pages being created, then appropriate action should be taken. The reorganizations
could be done on a more frequent basis. Alternatively, the PCTFREE specifications
could be increased prior to the next reorganization to stretch the effective performance
life of the database.
Chapter 7
Introduction
As explained in chapter 1 of this section, database design is a six step process. These
six steps, which correspond to chapters two through seven of this section, are:
1. Build the logical data model, or the conceptual schema, of the ANSI/SPARC
three schema architecture.
2. Convert the logical model to a “first cut” physical database design which is the
internal schema in the ANSI/SPARC three schema architecture. This conversion
is accomplished by applying a few simple rules to the standard Entity
Relationship diagram.
3. Define the external schema, or DBMS views.
4. Define the database transactions.
5. Refine the “first cut” physical database design by applying some heuristics to
common situations.
6. Perform physical design control.
Physical design control is the iterative process whereby the physical database design
is adjusted, or tuned, so that the execution time of a database transaction, which is not
meeting its performance objective, is decreased. The goal of physical database design
is to ensure that every database transaction meets or exceeds its performance
objectives. (In reality, it is sometimes prohibitively costly for every transaction to
meet or exceed its performance objectives. In these cases, the goal of physical design
is to ensure that every high priority transaction meets its performance requirements,
and all other transactions do not deviate unacceptably from their performance
requirements.) Consequently, physical design control is complete when every
transaction meets or exceeds its performance objectives.
This chapter presents the procedure of physical design control. In addition, common
tuning scenarios are discussed along with tips and techniques which may be used to
speed up specific types of transactions.
The process
The process of physical design control is accomplished in three steps:
1. Determine the execution time of each transaction and locate the highest priority
transaction which is not meeting its performance objectives.
2. Adjust the physical constructs of the database so that the transaction takes less
time to execute.
3. Go back to step one, unless all transactions are now performing within their
required response times.
Each of these steps is discussed in the following sections.
Step one
The execution time of individual transactions may be determined by actually
executing the transactions or by estimating the execution time using some forecasting
model. The former method is referred to as the “brute force method” and the latter as
the “forecast method.”
time. For example, several different indexes may have to be tried before a specific
transaction meets its performance objectives. Similarly, the database designer may
choose to cluster a table on a different key to see if performance increases. To
accomplish this, the table must be dropped, the data reloaded and all the indexes
rebuilt. Sometimes a design iteration may require the data load programs to be
altered. This can occur, for example, if the database was denormalized in an attempt
to increase the performance of a transaction. Of course, re-coding application
programs may involve an enormous amount of effort.
Consequently, each design iteration in the physical design control process potentially
requires the data load procedures to be repeated. On a complex database, ten or more
design iterations may be required before all the transactions are meeting or exceeding
their performance objectives. Hence, the database tables and indexes may have to be
built ten or more times. Therefore, as stated above, the usefulness of the brute force
method is limited to relatively small databases.
Forecast method
The forecast method is ideally suited for medium to large databases since many
different design strategies may be tried without having to actually load data into the
database. This technique involves estimating the execution time of a particular
transaction through the use of some forecast model.
Forecast models typically try to predict how many physical block reads or writes and
CPU cycles a specific transaction will consume. These numbers are then multiplied
by the average execution time of a physical I/O or CPU cycle. For example, most disk
drives commonly used today in Intel-based database servers have a total data access
time of less than 20 milliseconds. This is the average time required to position the
read/write head on the desired track plus the average rotational delay time (the time
required for the desired sector to spin around to the read/write head). According to
this model, if a given transaction requires 1000 physical I/Os, on average the
transaction will execute in twenty seconds (1000 * .02). We are ignoring CPU time
here for the sake of simplicity. In fact, CPU time is often omitted from forecast
models because CPU time is usually a small percentage of overall transaction
execution time.
Of course, these forecast models are an over-simplification of true performance and
overlook many factors such as disk controller caches, operating system file buffers,
and database page buffers. However, the measurements are usually sufficiently
accurate for the desired purpose. It is not necessary to know exactly the execution
time of a given transaction. If the forecast is only 50% to 75% accurate, we still have
an adequate measuring rod to perform physical design control. The estimate of
execution time provides a relative measurement, as long as the estimating algorithm
is applied evenly. This relative measure can then be used to identify those transactions
which are expected to grossly exceed their performance objectives.
Before the relational model and SQL became the de facto standard, forecast models
were used almost universally by database designers. This is because most DBMS of
that time employed a navigational database manipulation language (DML). With
these navigational DMLs, the programmer determined the access path for each
database access. Once the access path is known, it is relatively straightforward to
estimate physical I/O. Hence, forecast models were relatively easy to develop.
However, with the rise of the relational DBMS and SQL, forecast models have
become much more difficult to build. This is because SQL is a declarative DML. With
SQL, the programmer does not determine the access path. This task is delegated to the
query optimizer. So, in order to accurately forecast the number of physical I/O, the
forecaster must first “guess” which access path the query optimizer will employ for a
given database operation. The cost models of query optimizers are extremely complex
and beyond the reach of most database designers.
In SQLBase, all statistics used by the query optimizer are now stored in the catalog as
columns of the various system tables automatically created and maintained by
SQLBase. In addition, the UPDATE STATISTICS command may be used to store
user-desired statistics in these columns in order to influence the choices made by the
optimizer.
Consequently, the procedure which the database designer should follow when
attempting to use the query optimizer as a forecasting tool is:
• Compile the physical schema DDL.
• For each table and index, estimate the values for the optimizer statistics
which will be stored in the catalog once production data volumes are loaded.
• Store these user-determined statistics in the catalog using the UPDATE
STATISTICS command.
• Put the database in explain mode using either the sqlexp function or the
SQLTalk command, SET PLANONLY ON.
• Submit SQL commands to SQLBase in the usual way and record the access
path and cost returned by the query optimizer.
• Calculate the total forecasted cost of the transaction by summing the cost of
each SQL command in the transaction.
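As a minimal SQLTalk sketch of the last three steps (the query is borrowed from the
denormalization example later in this chapter; verify the exact command syntax
against your SQLBase documentation):
SET PLANONLY ON;
SELECT H.SALESMAN_ID, SUM(D.AMOUNT)
FROM ORDER_HEADER H, ORDER_DETAIL D
WHERE H.ORDER_ID = D.ORDER_ID
GROUP BY H.SALESMAN_ID
ORDER BY 2 DESC;
SET PLANONLY OFF;
With PLANONLY on, SQLBase returns the chosen access path and its estimated cost
instead of executing the statement; summing these costs over all of a transaction’s
SQL commands gives the forecasted cost of the transaction.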
Step two
During step one of physical design control, the execution time of each transaction is
determined, and the highest priority transaction that does not meet its performance
objectives is located. (Recall that during transaction definition, each transaction is
ranked by relative priority and its performance requirements are specified.) Thus, the input
to step two is one transaction. The objective of step two is to adjust the physical
constructs of the database so that this transaction’s execution time is decreased. This
is often referred to as “tuning the schema” for a transaction.
Most database transactions consist of many SQL commands. Consequently, the first
task in step two is to determine which of the SQL commands that make up the
transaction are contributing the most to the overall cost of the transaction. The
designer should then focus his attention on these commands first.
Once a specific SQL command is targeted, the database designer may then begin to
consider alterations to the physical schema or the SQL syntax which may increase
performance. The general design goal is to reduce the number of physical I/Os
required to satisfy the SQL command. Three fundamentally different alterations may
be made to the physical schema:
• Add or change row clustering, primarily via hashing
• Add or change an index
• Denormalize a table
Similarly, the SQL syntax may be changed in two fundamental ways:
• Restrict the selection criteria
• Replace the SQL command with an equivalent one
Denormalize a table
Denormalizing a database is a common technique to improve the performance of a
SQL command. Denormalization refers to the process of converting a database which
is in third normal form to some lower level of normalization for performance reasons.
The concepts of first, second, and third normal form are presented in chapter two and
you are encouraged to review this material.
While in the majority of situations, the benefits of direct implementation of
normalized data structures outweigh any disadvantages, in a few cases there may be
sufficient justification to perform denormalization in a controlled, deliberate manner.
For example, whenever many joins will be required across large tables on a frequent
basis, the performance penalty of these join operations may be lessened by combining
some of the tables or introducing selective data redundancy among the tables. This
denormalization must be done with careful consideration to the costs involved in
working around update and deletion anomalies, performing processing to update
redundant data, and increasing the recurring maintenance costs for the database. Of
course, the benefits realized from the denormalization include fewer join operations
and fewer foreign keys along with their usual indexes. When these benefits are more
critical to the success of the system than the drawbacks of a non-normal database,
then denormalization is done as a part of the physical database design process.
Denormalizing with prime attributes is one common form of denormalization.
This involves redundantly storing a “prime” (meaning not derived or calculated) data
element in both a primary table and a secondary table. The primary table is the table
where the data element “belongs” in terms of the normalization rules. By storing the
data element on the secondary table, the normalization rules are violated.
For example, consider the query, “display a list of all customer zip codes for which
there currently exist open leads, and rank the zip codes in descending order by the
number of open leads.” The SQL command for this query might be:
SELECT C.ZIP_CODE, COUNT(C.ZIP_CODE)
FROM CUSTOMER C, LEAD L
WHERE L.CUST_ID = C.CUST_ID
AND L.LEAD_STATUS = ‘OPEN’
GROUP BY C.ZIP_CODE
ORDER BY 2 DESC;
Note that the ZIP_CODE of the customer is stored in the CUSTOMER table because
that is where the attribute belongs. For any one CUST_ID, there exists only one
ZIP_CODE. (We are ignoring here the fact that a company may have many locations
and each location may have a postal and delivery zip code.) However, to execute this
query, the LEAD table must be joined with the CUSTOMER table (see the WHERE
clause above). If the CUSTOMER table contains several hundred thousand rows and
at any one point in time there exists only a relatively small number of open leads, then
the performance of this query could be significantly increased by denormalizing the
LEAD table.
To denormalize the LEAD table to speed up this query, the customer ZIP_CODE
field is redundantly stored in the LEAD table. The new SQL command to produce the
same result is:
SELECT ZIP_CODE, COUNT(ZIP_CODE)
FROM LEAD L
WHERE LEAD_STATUS = ‘OPEN’
GROUP BY ZIP_CODE
ORDER BY 2 DESC;
Thus, the join of the CUSTOMER table with the LEAD table is avoided resulting in
increased performance of the target query. However, the LEAD table is now not in
third normal form because the ZIP_CODE attribute is not dependent on the entire
primary key of the LEAD table. Presumably, the primary key of the LEAD table is
(CUST_ID, LEAD_NBR). Given this, the ZIP_CODE attribute depends only on the
CUST_ID; regardless of the LEAD_NBR, the customer’s ZIP_CODE remains the
same. Thus, the LEAD table now violates the second normal form rule.
Violating the second normal form rule, in this situation, does increase performance of
the target query. However, this increased performance comes at the price of an update
anomaly. We cannot now change the ZIP_CODE of a customer in the CUSTOMER
table without changing every lead for that customer. Hence, the update customer
transaction has to be written so that any change of a customer’s ZIP_CODE is
reflected in all leads (otherwise the data in the LEAD and CUSTOMER tables would
be inconsistent). Denormalizing the LEAD table to speed up the target transaction
also resulted in:
• Decreased performance of the update customer transaction.
• Placing the data integrity of the CUSTOMER and LEAD tables in the hands
of the application.
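For example, the update customer transaction might keep the redundant column
synchronized with a pair of statements such as these (the bind variable names are
illustrative), both of which must be committed in the same transaction:
UPDATE CUSTOMER SET ZIP_CODE = :NewZipCode
WHERE CUST_ID = :CustId;
UPDATE LEAD SET ZIP_CODE = :NewZipCode
WHERE CUST_ID = :CustId;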
This particular example of denormalization results in an update anomaly. Insertion
and deletion anomalies are also possible. (See C.J. Date, An Introduction to
Database Systems, Addison-Wesley for a detailed explanation of update, insertion
and deletion anomalies.) Every time a table is denormalized, the database designer
must be aware of, and devise a strategy to deal with, update, insertion, and deletion
anomalies. In addition, one must rely just a little bit more on the application programs
to enforce the data integrity of the database. If the programs contain bugs relating to
their handling of these anomalies, data integrity will be compromised. For these
reasons, denormalization should only be considered as a last resort. The database
designer should try all other alternatives to speed up the query before resorting to
denormalization.
Further, one should resort to denormalization only in the case when a specific
transaction cannot otherwise meet its performance objectives. Some database design
methodologies suggest that the designer denormalize the database as a general rule. In
other words, certain tables are denormalized based on broad guidelines before any
transactions are timed. The table is thus denormalized whether it is needed or not. The
only valid reason to denormalize a table is to achieve increased performance of a
given transaction that cannot otherwise be obtained. To denormalize a table before it
has been demonstrated that the transaction performance objectives cannot otherwise
be met is reckless given the cost of dealing with update, insertion, and deletion
anomalies or a loss in data integrity.
Denormalizing with aggregate data is another common form of
denormalization. This deals with adding aggregate or derived data attributes to
database tables. Aggregate data are data items derived from other lower level data and
are frequently required in decision support and executive information systems in
order to satisfy the large degree of summary and exception reporting these types of
systems provide.
For example, consider the query “produce a list of salesmen and their total year-to-
date sales in descending order from the top selling salesman to the salesman with the
least sales.” To produce this report, all orders for the current year must be read. Most
order entry systems store order data in two tables: the ORDER_HEADER and
ORDER_DETAIL. The ORDER_HEADER table contains data germane to the entire
order, such as SALESMAN_ID, SHIPTO_ADDRESS and CUST_ID. The
ORDER_DETAIL table stores information pertaining to one order line item.
Consequently, to produce this report, the ORDER_HEADER table must be joined
with the ORDER_DETAIL table. The SQL statement to produce this report is:
SELECT H.SALESMAN_ID, SUM(D.AMOUNT)
FROM ORDER_HEADER H, ORDER_DETAIL D
WHERE H.ORDER_ID = D.ORDER_ID
GROUP BY H.SALESMAN_ID
ORDER BY 2 DESC;
If there are 50,000 rows in the ORDER_HEADER table and 300,000 rows in the
ORDER_DETAIL table (on average each order has six line items), the performance
of this query can be significantly improved by adding an aggregate data element,
TOTAL_ORDER_AMOUNT, to the ORDER_HEADER table. Whenever a row in
the ORDER_DETAIL table is inserted, updated or deleted, the
TOTAL_ORDER_AMOUNT is decremented or incremented appropriately so that
this summary field contains the total sales amount of the entire order (which is the
sum of all the line items). The SQL command can then be written to avoid the join:
SELECT SALESMAN_ID, SUM(TOTAL_ORDER_AMOUNT)
FROM ORDER_HEADER
GROUP BY SALESMAN_ID
ORDER BY 2 DESC;
Technically, the ORDER_HEADER table is now in violation of the normalization
rules and update, insert, and delete anomalies exist. The create/modify order
transaction must be written so that this total field in the ORDER_HEADER table is
maintained as order lines are added, changed, or deleted.
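For example, the path that adds an order line might look like the following sketch (the
LINE_NBR column and the bind variable names are illustrative); the delete and
update paths apply the corresponding negative or net adjustment:
INSERT INTO ORDER_DETAIL (ORDER_ID, LINE_NBR, AMOUNT)
VALUES (:OrderId, :LineNbr, :Amount);
UPDATE ORDER_HEADER
SET TOTAL_ORDER_AMOUNT = TOTAL_ORDER_AMOUNT + :Amount
WHERE ORDER_ID = :OrderId;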
When denormalizing by adding aggregate data to the database, the decision the
designer must make is whether to allow SQLBase or programs to calculate these
values each time they are needed, or to actually store them in the database so that they
can be queried without the overhead incurred from their derivation. The key factors
involved in this decision are the frequency of access for these data items, the response
time required, and the frequency and types of update operations that are performed on
the lower level data from which the aggregate data is derived.
Consider the case of the student’s GPA. This information is probably needed for each
academic inquiry transaction that the system performs, and these transactions may be
executed quite frequently and require a reasonably low response time (probably two
seconds or so). The lower level data, which are the grades a student has received in
each course he has taken, could be quite large for a student with a long academic
career. Not only would the on-line inquiry program have to retrieve all this data to
perform the calculation, but the logic needed is also quite complex.
Furthermore, this grade data is updated infrequently, probably only at the end of the
semester. This update could be performed with a mass insert program that can be
designed to perform GPA calculations for students in an efficient manner. On-line
transactions which perform grade changes are not highly used and can suffer from
some slower response time while the program is calculating and updating the GPA.
This is an excellent scenario for allowing aggregate data to be placed in the design,
since the benefits realized will outweigh the costs.
The example of a customer’s total sales for a year probably falls on the other side of
the trade-off. This data is usually not used for on-line inquiry transactions, but more
often appears on reports for sales or marketing departments. As such, there is little
benefit from speeding up the derivation of this data. Also, it is easily computable by a
SQL query which performs a GROUP BY operation. The values from which the
information is derived are the sum of all customer order total sales, which could be
frequently updated for active customers. Every order transaction that is processed
through the system would have to perform the updating of the customer’s total sales,
which would cause overhead in a transaction which probably has stringent response
time and throughput requirements. When everything is considered, this data item
should not be placed in the customer entity and those programs needing it will have to
pay the performance penalty of performing the aggregation query needed to derive its
value.
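For instance, assuming the ORDER_HEADER and ORDER_DETAIL tables described
earlier (a date predicate would be added to restrict the result to the current year), the
report can be derived on demand with:
SELECT H.CUST_ID, SUM(D.AMOUNT)
FROM ORDER_HEADER H, ORDER_DETAIL D
WHERE H.ORDER_ID = D.ORDER_ID
GROUP BY H.CUST_ID
ORDER BY 2 DESC;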
Each case of aggregate data should be subject to the same level of analysis as
demonstrated previously. Only in those cases where there are clearly significant
benefits from actually storing the derived data should it be maintained in the database.
The general rule of thumb should be to eliminate derived data and, as with the
denormalization trade-off, whenever the rule is broken, the results of the analysis that
led to that decision should be thoroughly documented within the database design.
Thus, this form of denormalization has the same pitfalls as redundantly storing prime
attributes. However, unlike prime attributes, derived attributes add information to the
database. Consequently, denormalizing with aggregate data is perhaps preferable to
redundantly storing prime attributes.
Chapter 8
Database Pages
In this chapter we:
• Introduce the concept of the database page, which is the basic unit of disk
storage used by SQLBase.
• Cover the following database page types in detail:
• Data row pages
• Data extent pages
• Long varchar pages
• Overflow pages
• Index pages
• Control pages
• Discuss SQLBase’s page allocation and garbage collection algorithms.
pages from the operating system by specifying logical file offsets. With the exception
of Unix raw devices, it is the job of the file system and disk controller to convert this
number into a physical storage device and the actual track and sector (block) where
the database page is stored, using the specific device geometry information coded into
the driver module supplied by the manufacturer of the disk subsystem.
Database pages are also linked to form lists of pages. These lists are referred to as
linked lists. Each page contains a pointer to the next, and sometimes prior, page in the
list (see Figure B). The pointer consists of the physical page number and facilitates
the sequential access of all the pages within a specific list.
Figure 8B - Database pages are linked together in lists known as linked lists.
Each database page is a member of one page type. The basic page types are data
pages, index pages and control pages. Data pages contain actual table row data.
SQLBase has four types of data pages: row pages, extent pages, long varchar pages
and overflow pages. Index pages are used to store B+Tree indexes. Control pages
maintain information for internal purposes such as which pages are empty or full.
Each of these page types is discussed in detail in the following sections.
A database page contains data only for one page type. For example, BTree pages only
contain data (consisting of symbolic keys) for a particular BTree index. Likewise, row
pages contain only table row data.
A database page also belongs to one and only one group and contains data for only
one group. A group is a database object such as a table or index, and a collection of
pages of different page types. For example, a table is made up of some number of row
pages, extent pages, LONG VARCHAR pages, and control pages.
Each database page is also assigned a logical page number. The logical page number
is assigned sequentially within the group: the first page within the group has the page
number of 1, the second page in the group 2, and so on.
Figure 8C - All the pages of a specific group form a separate linked list. Each group
has one page type and stores data only for that page type. The figure shows the first
three row pages of the customer table, chained together by their prior and next page
pointers.
Data pages
As mentioned previously, SQLBase contains four basic types of data pages: row
pages, extent pages, long varchar pages, and overflow pages. Each of these page
types is discussed in the following sections.
Row pages
When a table row is originally stored in the database, it is stored on a row page. The
specific location occupied by a database row on the page is called a slot. The number
of slots that may reside on a page is a function of the size of the rows and the page
size. SQLBase keeps track of the location and size of each slot within a database page
through the use of a slot table. The slot table serves as an index to all the slots
contained within a database page.
Each row page contains a generic page header, row page header, slot table, variable
number of slots and a page footer. The slot table grows down the page from the row
page header and the slots grow up from the page footer. The freespace within the page
is between the slot table and the last slot added (see Figure D).
Page header
The page header is 53 bytes long and contains the physical page number, page type,
group number, logical page number, next and previous page pointers, and a log record
pointer (see Figure E). The log record pointer points to the record in the log that
contains the before and after image of the page for the last SQL operation that
modified the page.
Figure 8D - Row pages contain a page header, row page header, slot table, variable
number of slots and a page footer. The free space within a row page is between the
last entry in the slot table and the last slot.
the slot is empty. The remaining bits contain the byte offset of the slot within the page
(see Figure E).
Figure 8E - Each table row is stored in a slot. The slot contains the slot header and
the set {column id, length, data value} for each non-null (and non-long varchar)
column.
The row serial number is a unique identifier assigned to each row in a table. The row
serial number is assigned sequentially: the first row inserted into the table is assigned
the row serial number of one, the second row is assigned two, and so on. The update
serial number is an identifier shared by all the rows modified by the same UPDATE or
INSERT command. The update serial number is also assigned sequentially: all the
rows updated/inserted with the first UPDATE/INSERT command are assigned an
update serial number of one, all the rows updated/inserted with the second UPDATE/
INSERT command two, and so on. The update serial number is used to prevent lost
updates with the Cursor Stability and Release Lock isolation levels (see chapter 13)
and infinite looping with UPDATE and INSERT commands.
Consider the SQL UPDATE command:
UPDATE Y SET X = X + 1 WHERE X > 10;
With this command, if the DBMS did not flag the rows in table Y that met the WHERE
clause condition before the UPDATE command began, the command could repeatedly
revisit rows it had already updated (they still satisfy X > 10 after the increment),
resulting in an infinite loop. The update serial number is used for this purpose in
SQLBase, which contrasts with some other RDBMSs which disallow the placement
of columns within the WHERE clause of an UPDATE statement which affects their
contents.
Page footer
The page footer redundantly stores the log page pointer. This field serves as a
consistency check and is used during transaction recovery to protect against partial
page writes. Many Intel-based computers traditionally block hard disks with 512-byte
sectors. This means the hard disk controller reads and writes in units of 512 bytes.
However, the SQLBase page size is 1024 bytes. Thus, it is theoretically possible on
such a computer for only a portion of a database page to be written to disk (when an
undetected disk error occurs after the first sector is written but before the second
512-byte sector is written). Thus, by storing the log page pointer in both the page header
and footer, SQLBase can detect when this situation occurs and respond accordingly.
Rowid
SQLBase uses a rowid for all external references to table rows. A rowid is composed
of the physical page number, slot number, row serial number and the update serial
number. The physical page number is used to locate the row page containing the row.
The slot number is used to locate the specific entry in the slot table that contains the
offset of the slot (see Figure E).
The slot table/slot data structure is an indirection technique that allows a specific slot
to move within a page without affecting external references (namely rowids). It is
necessary to relocate slots within a page after the deletion of rows to reclaim
freespace and avoid fragmentation of the page (see page garbage collection later in
this chapter).
The row serial number is required to verify the row is actually stored in the specified
slot. Since slots are reused, the row could have been deleted and another row stored in
the same slot. The row serial number allows SQLBase to determine when a row
pointed to by a rowid has been deleted. Likewise, the update serial number allows
SQLBase to determine when a row referenced by a rowid has been updated by
another user.
Extent pages
Extent pages are extensions to row pages and are used to store some of a table row’s
columns when all of the columns will not fit in one row page. Columns can become
displaced in two circumstances, the most common of which occurs when column
values are updated such that the length of the columns increases significantly.
As mentioned previously, when a table row is originally inserted into a table, all
columns (except LONG VARCHAR columns) are stored in a row page (assuming the
row will fit into one page). However, later the row may be updated resulting in an
increased row length. This can occur, for example, when a column that was null at the
time of insertion is updated with a non-null value. When the row is updated such that
its length is increased, sufficient freespace may not exist on the row page (the current
base page) to store the new column values. When this situation occurs, SQLBase
stores the new column values on an extent page.
The second situation which can cause columns to be displaced from their base page
occurs when you attempt to insert a row with a row size greater than the usable row
page size. Usable page size is the page size minus the sum of the page size reserved
for later row expansions (through the use of the PCTFREE specification on the
CREATE TABLE statement) and the page overhead. For example, for the NLM
version of SQLBase, the page size is 1024 bytes and the page overhead is 86 bytes.
Consequently, some columns will be displaced from the base page and stored on
extent pages when you attempt to store a row with a row width of greater than 938
bytes, assuming the PCTFREE parameter is zero.
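For example, with the same 1024-byte page and 86 bytes of overhead, specifying
PCTFREE 10 (an illustrative value) reserves roughly 102 bytes for row expansion and
reduces the usable row space to approximately 1024 - 102 - 86 = 836 bytes.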
Extent pages, like row pages, contain a page header, row page header, slot table,
variable number of slots and a page footer. However, the next and previous page
pointers in the page header are not used; extent pages are linked together and to the
base page in a separate linked list using the extent page pointer in the row page header
(see Figure F).
Figure 8F - Every row page may have up to ten extent pages which are linked to the
base row page. Extent pages share the same group number with the base page.
Displaced columns are stored in the same slot as the columns on the base page.
When an update or insert occurs such that the entire row cannot be stored on a base
row page, SQLBase searches for the first extent page with sufficient freespace to
contain all of the displaced columns. If no extent page is found that meets this
criterion, SQLBase allocates one new extent page (up to a maximum of ten).
A given column, consisting of the set {column identifier, length, value}, is never split
across pages. The slot in the base page contains no reference to the extent page or the
displaced columns. The displaced columns are stored in the same position in the slot
table in the extent page as the row to which they belong is stored in the base page.
This means that a row’s slot number remains unchanged from the base page value
regardless of how many extent pages it may span. When retrieving a row, SQLBase
first retrieves the base page and accesses all columns stored in the slot. If some
columns are displaced, SQLBase searches sequentially through the chain of extent
pages until all displaced columns are found.
Effect on performance
Extent pages negatively affect performance by requiring additional page I/O to access
row data. If a row is stored without the benefit of extent pages (fitting completely on
one row page), then the row can be accessed with one physical I/O. However, if the
row is spread over one row page and one or more extent pages then multiple I/Os may
be required. Since extent pages are allocated on an as needed basis, it is unlikely the
extent pages for a given row page would be stored in the same physical block (except
following a reorganization). However, it is possible several extent pages all belonging
to the same row page are stored in the same physical block.
For example, assume a “narrow” row (meaning one with a row width less than the
usable row size minus reserved space) is inserted into the database and stored either
on an existing page with sufficient freespace or a new page which has no extent pages
allocated at the time of the insert. Now assume that at some later point in time the row
is updated with increased column widths which must be stored on an extent page and
that SQLBase allocates a new extent page for this purpose. At this time, it is very
likely that the pages physically adjacent to the base page have already been allocated
for other purposes (such as additional row pages or extent pages for other base
pages). Consequently, the page which SQLBase does allocate for the extent page will
not be physically adjacent to its base page.
On the other hand, consider the following. Assume a “wide” row (meaning one with a
row width greater than the usable row size minus reserved space) is inserted into the
database. Also assume the row can be stored on one base page and one extent page.
Since the row width is greater than the usable row size minus the reserved space,
SQLBase will be forced to store the row on a new page (this is because the existing
page will not contain sufficient freespace to store the row). Consequently, at the time
of the insert, both the base page and extent page will be allocated. If the first two
pages listed in the group’s free extent list are physically adjacent, then the base page
and extent page will be collocated. Otherwise, the base page and extent page will be
physically stored in separate blocks on the I/O device and two physical I/Os will be
required to retrieve the entire row.
One way that this latter situation can occur is if the group’s free extent list does not
contain two free pages and SQLBase must allocate additional pages to the group, or if
some pages have been de-allocated and placed back in the group’s free extent list.
Conversely, this latter situation cannot occur during a database reorganization since
no other users are connected to the database and thus are not competing for page
allocations. During a reorganization, SQLBase collocates all row pages with their
respective extent pages. However, it is possible that the row page and its extent page
are collocated but not stored in the same physical block. This situation would occur
when the base page was physically adjacent to the extent page but also crossed a
block boundary.
In any event, a database with too many allocated extent pages usually results in
decreased performance. A general design goal should be to minimize the number of
extent pages. To accomplish this, the DBA is encouraged to reorganize the database
frequently. In addition, if it is known that most of the rows in a given table will
significantly grow in width over time, then the CREATE TABLE’s PCTFREE
parameter should be adjusted accordingly. Further, if the average row width is close to
the usable page size, the database designer should consider splitting the table into two
tables.
Figure 8G - Unlike EXTENT pages, LONG VARCHAR pages are linked to their base
page by pointers in the slots. LONG VARCHAR pages are also linked together with
next and prior pointers. Please note in the enlargement of the base slot above that
displaced columns, those stored on EXTENT pages, are not referenced in the base slot
but LONG columns are referenced.
The purpose of long varchar pages is to separate and isolate long varchar data from
columns of other data types so that the storage of this type of data does not negatively
affect the access of the other columns. Long varchar columns are used to store large
quantities of unstructured data such as textual comments, bitmaps, video clips, and so on. Consequently, a single long varchar column, because of its size, will likely span
many pages. If these columns were stored in row pages, the result would be heavy use
of extent pages that would decrease performance to columns of all data types. By
storing long varchar columns on separate pages, access to non-long varchar columns
is unaffected.
On the other hand, a database access to a long varchar column as well as other
columns requires at least two physical I/Os (one for the row page and one for the long
varchar page). In addition, SQLBase stores only one column value on each long varchar page. If
a column is defined as LONG or LONGVAR but its value is only one byte, an entire
long varchar page is used to store this one byte of data. Consequently, the LONG data
type should only be used when the values of the attribute are known to exceed 254
characters; otherwise, the VARCHAR or CHAR data type should be used.
Overflow pages
Overflow pages store rows when an overflow condition occurs in conjunction with a
clustered hashed table (see chapter 10). When a row hashes to a page which is already
full, SQLBase stores this row on an overflow page. There are two types of overflow
pages, designated overflow pages and dynamically created overflow pages. All
overflow pages, regardless of type, participate in the same linked list as base row
pages (see Figure C).
SQLBase creates one designated overflow page for every N row pages: if the clustered hashed index is created on a single number field, N is 100; otherwise, N is 10.
Figure 8H - Designated overflow pages are created for every N number of row pages
and participate in the same linked list as the base row pages.
Dynamically created overflow pages are inserted in the linked list between the designated overflow page
and the next base row page.
SQLBase follows these steps when attempting to retrieve a row from a hashed table:
1. Hashes the key to obtain the target row page number;
2. Retrieves the target row page and searches the slots for the desired row;
3. If the desired row is not found on the target row page, SQLBase then retrieves the
designated overflow page (for that page range) and again searches the slots for the
desired row;
4. If the desired row is not found in the designated overflow page, SQLBase searches
in turn the dynamically created overflow pages.
Hence, too many overflow pages degrade performance. Care must be taken when
choosing a clustered hashed index key to ensure too many overflow pages will not be
required for any specific page range (see chapter 10).
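A sketch of this retrieval flow is shown below (in Python); the page objects and helper functions are simplified stand-ins for illustration, not SQLBase internals.

def fetch_hashed_row(key, hash_to_page_no, read_page):
    page = read_page(hash_to_page_no(key))         # 1. hash the key and read the target row page
    row = page.find_slot(key)                      # 2. search the slots for the desired row
    if row is None and page.designated_overflow_page is not None:
        page = read_page(page.designated_overflow_page)
        row = page.find_slot(key)                  # 3. search the designated overflow page
    while row is None and page.next_overflow_page is not None:
        page = read_page(page.next_overflow_page)  # 4. walk the dynamically created overflow pages
        row = page.find_slot(key)
    return row                                     # None: the row does not exist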
Index pages
Index pages contain B+-tree indexes (refer to chapter 9, B-Tree Indexes, for a detailed description of B+-trees). Index pages have the same basic organization as all other page types. Each index page contains a page header, index page header, slot table, slots, and page footer. The page header, slot table, and page footer are identical to those of data pages (see Figure J).
Figure 8J - Index pages have the same basic layout as row pages: each page contains
a page header, index page header, slot table, and multiple slots.
The index page header and slots within an index page differ from data pages. The
index page header contains the slot size, free space and the level in the index (see
Figure J). The slot contains the slot overhead, one page pointer and one or more key
attributes depending on whether the index is created on one or more table columns. In
the leaf pages, the page pointer points to a data page where the key values may be
found. Otherwise, the page pointer points to the next level in the index.
All index pages belong to the same group; however, non-leaf pages are not linked
together. Therefore, the next and prior page pointers in the page header are not used.
Leaf pages are linked together in the usual way and form what is known as the
sequence set.
Control pages
Each SQLBase database contains a number of control pages for a variety of
housekeeping chores. Some of these control pages store various information about
the database as a whole. For example, the database control block is used to store
database parameters. Each SQLBase database also has one free extent list which is
used to keep track of un-allocated pages. Other control pages are used for groups and
tables. The group extent map is used to track pages allocated to a specific group.
SQLBase also has several types of control pages for each table defined in the
database. Each of these types of control pages is discussed in the following sections.
Bitmap pages
SQLBase uses bitmap pages to track which row pages have freespace (remember the
row page header contains the amount of freespace. See Figure E). Each table within
the database has a set of bitmap pages for this purpose. Each of the 8192 bits (1024
bytes * 8 bits per byte) of the bitmap page (less overhead) maps to one row page.
When the bit is off, this signals the page has room for additional rows. When the bit is
on, the page does not have room for additional rows. There is one bitmap page for approximately every 8192 row pages allocated to the table’s group (a bitmap page actually cannot track the freespace of a full 8192 pages since a portion of the bitmap page is used for overhead; these details are ignored in this example). The
first logical page within the group is a bitmap page and every 8192 pages thereafter is
a bitmap page. As SQLBase inserts, updates and deletes rows from a specific row
page, the freespace field in the row page header is adjusted accordingly. In addition,
when the page is 80% full, the corresponding bit in the bitmap page is
turned on. When allocated space drops to 50% or below, the bit is turned off.
Page allocation
When SQLBase receives a command to insert a row into a table, SQLBase accesses
the corresponding table data page in order to retrieve the page number of the last row
page in the linked list. SQLBase then retrieves this page and checks the freespace
field in the row page header to determine if room exists to store this specific row. If
so, the new row is inserted into the last page in the linked list and the freespace field
in the row page header and corresponding bitmap page are updated accordingly.
If the last page in the linked list does not contain sufficient freespace to accommodate
the new row, SQLBase begins searching each bitmap page in turn, starting with the
last bitmap page accessed and wrapping around to the first bitmap page if necessary,
for an existing row page with its bit turned off. If one is found, SQLBase retrieves this
page and checks the freespace remaining in the row. If sufficient space exists, then the
row is stored on the page. If sufficient space does not exist, SQLBase continues its
search of the bitmap pages. It is also possible for SQLBase to encounter a fragmented
page; that is, a page with enough total free space to store the incoming data, but not
enough contiguous free space. In this case, garbage collection is performed to
compact the page and coalesce all of the free space into one chunk (see “Garbage
collection” below for details).
If all bitmap pages are searched and no page is found with sufficient freespace,
SQLBase then checks the free extent list. If un-allocated pages exist in the free extent
list, SQLBase allocates a set of pages, known as an extent, to the table’s group by
removing them from the free extent list and adding them to the group extent map. At
this time, SQLBase also updates the appropriate bitmap page and/or creates new
bitmap pages if necessary. Groups grow in increments of 10% of their current size or five pages, whichever is larger. For example, if the group currently consists of 100 pages, 10 new pages are allocated to the group. On the other hand, if the group only contained 40 pages, then
the extent would consist of five pages.
If no unallocated pages exist in the free extent list, SQLBase performs the following
steps:
1. Obtains additional space from the file system. The SQLBase database file is not
fixed to a certain size; it grows as additional space is required. When a SQLBase
database requires additional storage space, additional pages are added to the
database file. This process of adding additional pages to the database file is
resource intensive, increasing response time for currently active transactions.
Consequently, pages are not added one at a time but in groups of pages called
extents. The extent size may be set by the user with the SET EXTENSION
command. For non-partitioned databases, the default extent size is 100 pages.
2. Adds the newly acquired pages to the free extent list.
3. Removes pages from the free extent list and adds them to the group extent map.
4. Stores the row on the new page, and updates the freespace field in the page header and the corresponding bit in the bitmap page.
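The search for a page with room, as described above, can be summarized in the following sketch (in Python); the table, group, and page objects are simplified stand-ins, not SQLBase internals.

def find_page_for_insert(table, row_size):
    page = table.last_row_page()                     # last page in the table's linked list
    if page.freespace >= row_size:
        return page
    for page in table.pages_with_bitmap_bit_off():   # scan bitmap pages, wrapping around
        if page.freespace >= row_size:
            if page.contiguous_freespace() < row_size:
                page.compact()                       # garbage collection: coalesce free space
            return page
    if not table.group.free_extent_list_is_empty():
        return table.group.allocate_extent().first_page()   # grow group by max(5, 10%)
    table.database.extend_file()                     # add SET EXTENSION pages to the file
    return table.group.allocate_extent().first_page()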
Garbage collection
When a row is deleted from a page:
• The utilization bit in the slot table is turned off, indicating that the slot is
available for re-use.
• The freespace value in the row page header is increased by the row size.
• If freespace is now greater than 50% of the usable page size minus reserved
space, the corresponding bit in the bitmap control page is turned off.
At the time of the deletion, the slot data is not removed and the page is not
compacted. Slots are reshuffled to free up space when an insert or update (which
increases the size of the subject row) occurs. In this way, garbage collection is only
performed when needed.
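A sketch of this delete-time bookkeeping follows (in Python); the page, slot, and bitmap objects are simplified stand-ins, not SQLBase internals.

def delete_row(page, slot_no, bitmap, usable_page_size, reserved_space):
    slot = page.slots[slot_no]
    slot.in_use = False                          # the slot becomes available for re-use
    page.freespace += slot.row_size              # credit the row's space back to the page
    if page.freespace > 0.5 * (usable_page_size - reserved_space):
        bitmap.turn_bit_off(page.page_no)        # the page once again has room for rows
    # The slot data is left in place; the page is compacted (garbage collected)
    # only when a later insert or update actually needs the space.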
Chapter 9
B-Tree Indexes
In this chapter we:
• Introduce the concept of tree structured indexes, particularly the B-Tree
index, a variation of which is used by SQLBase.
• Discuss the general algorithms used to maintain B-Tree indexes.
• Introduce the SQLBase variation of B-Tree indexes, the B+-Tree index, by
contrasting it to the B-Tree index.
• Explain the variations between the algorithms used by SQLBase to maintain
B+-Trees and the basic algorithms already covered.
• Discuss the unique considerations that apply to SQLBase indexes.
Introduction
B-Tree indexes have become the de facto standard for random access to records (or
rows) within a database file. All major RDBMS products use one variety of B-Trees
or another for general purpose indexing. SQLBase uses the B+-Tree variation first
proposed by Knuth in 1973.
Concerning the history of B-Tree indexes, Douglas Comer documents:
The B-tree has a short but important history. In the late 1960s computer
manufacturers and independent research groups competitively
developed general purpose file systems and so-called “access methods”
for their machines. At Sperry Univac Corporation H. Chiat, M.
Schwartz, and others developed and implemented a system which
carried out insert and find operations in a manner related to the B-tree
method. . . . Independently, B. Cole, M. Kaufman, S. Radcliffe, and
others developed a similar system at Control Data Corporation . . .
R. Bayer and E. McCreight, then at Boeing Scientific Research Labs,
proposed an external index mechanism with relatively low cost for most
[file] operations; they called it a B-tree. The origin of “B-tree” has
never been explained by the authors. As we shall see, “balanced,”
“broad,” or “bushy” might apply. Others have suggested that the “B”
stands for Boeing. Because of his contribution, however, it seems
appropriate to think of B-trees as “Bayer”-trees.
Before presenting the details of the B+-Tree variation used by SQLBase, it is necessary for the reader to understand simple binary search trees, multiway trees, and B-Trees in general. This chapter presents a general introduction to binary search trees, multiway trees, and B-Trees. Following this general introduction, B+-Trees are presented along with the SQLBase implementation.
Binary search trees
Figure 9A - This figure shows a portion of a binary search operation. The path
followed for the target key ‘15’ is illustrated with dotted lines.
Each node in a binary search tree contains exactly one key and has a branching factor
of two: every node contains a maximum of two children. Thus, binary search trees are
“tall” trees. The minimum height of a tree containing N records is ⌊log2N⌋ + 1. For example, a tree with only 100 records will have at least seven levels.
When trees are used for index structures, “short” trees are preferable since they
typically exhibit better performance characteristics. This is because typically one
physical I/O is required for every level traversed in the index, and in order to speed
database access times, physical I/O must be kept to a minimum.
Note also that in index structures, a record in a node contains the set {Key Value,
Pointer}. The pointer is the physical address in the main file where the key value can
be found. Figure A above contains only the key values in the nodes of the tree in order
to simplify the illustration. In the remainder of this chapter we continue to follow this
same practice.
Multiway trees
Multiway trees are a generalization of the binary search tree data structure. With
multiway trees, each node contains N keys. For example, consider Figure
B which illustrates a multiway tree with a node size of 2.
Figure 9B - This figure shows a search tree with two keys and three branches per
node. The path followed for the target key “15” is illustrated with dotted lines.
Consider the path followed to locate the target key “15.” First, node a, or the root of
the tree, is examined. The target key of “15” is less than “42”, the first key in node a,
so the left most branch is followed. Next, node b is examined. The target key is
greater than “7”, the first key in node b, so the target key is compared to the second
key in node b. Since “15” is less than “24” the middle branch is followed. This
algorithm is repeated until either an exact match is found or a leaf node is
encountered (a leaf node is any node which is the furthest from the root, in other
words, a leaf node has no child). If a leaf node is examined with no exact match
found, then the target key is not contained within the index.
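A sketch of this find algorithm is shown below (in Python), with a node represented simply as a (keys, children) pair; leaf nodes have no children.

def find(node, target):
    keys, children = node
    for i, key in enumerate(keys):
        if target == key:
            return True                  # exact match found
        if target < key:
            break                        # follow the branch to the left of this key
    else:
        i = len(keys)                    # target exceeds every key: follow the right-most branch
    if children is None:
        return False                     # leaf reached with no match: the key is not in the index
    return find(children[i], target)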
Multiway trees are generally “shorter” than binary trees since their branching factor
is greater. However, complex algorithms are required to keep multiway trees
“balanced.” In a balanced tree, all leaf nodes appear at the same level, and the tree has
the same height regardless of which leaf node is measured. A balanced tree is
desirable since it results in uniform performance.
B-Trees
B-Trees are balanced multiway trees with the following additional key property: each node need not contain exactly N keys. Nodes are allowed to “grow” from
half full to full. This additional property greatly simplifies the balancing algorithms
and significantly reduces the cost. B-Trees are a classic example of the disk space
versus performance trade off. Often, as in the case of B-Trees, additional disk space
can significantly increase performance. The rationale used in these cases is that,
presumably, additional disk space is relatively inexpensive and justified for the
desired increase in performance.
A B-Tree of order d can be defined formally as a tree with the following properties:
• Intermediate nodes (meaning nodes which are not leaves or the root) have:
• At least d records (keys) and no more than 2d records (keys);
• At least d+1 children (pointers) and no more than 2d+1.
• The root, unless the tree has only one node, has at least two children.
• All leaf nodes appear at the same level, that is, are the same distance from the
root (meaning the tree is balanced).
P0 R1 P1 ... RK PK
Figure 9C- This figure illustrates a typical node in a BTree index. Note that each node
has K records but K+1 pointers to child nodes.
• Records in the child node pointed to by P0 have key values less than the key
of R1. Records in the child node pointed to by PK have key values greater than
the key of RK. Records in the child node pointed to by Pi (where 0 < i < K)
have key values greater than the key of Ri and less than the key of Ri+1.
Balancing B-Trees
The popularity of B-Tree indexes results from their performance characteristics
which is derived from their balancing algorithms. Records are usually inserted into
and deleted from a database (and therefore an index) in an arbitrary fashion. Random
insertions and deletions can cause trees to become unbalanced (see Figure D).
Figure 9D - Unbalanced trees have some short and some long paths from the root to
the leaf nodes.
With an unbalanced tree, some leaf nodes are further from the root than others;
consequently, path lengths and ultimately retrieval times vary. Some key values may
be located by only visiting a few nodes; other key values may require accessing a
great many nodes. Balanced trees, as mentioned previously, are trees whose leaves
are all the same distance from the root. This characteristic results in uniform search
paths and thus consistent retrieval times. However, in order to maintain a balanced
tree through random insertions and deletions special balancing algorithms must be
employed.
B-Tree balancing algorithms are specially derived to minimize balancing costs. First,
B-Tree balancing algorithms confine node changes to a single path from the root to a
single leaf. In addition, by utilizing extra storage (where some nodes are not 100%
full), balancing costs are also lowered. Since nodes are sometimes not full at the time
of an insert, the new record can simply be inserted into the node (therefore no
balancing action must take place). Similarly, when deletes occur, sometimes the
target node contains more than the minimum number of records so, again, the record
is simply deleted. Only when the node contains 2d records (inserts) or d records
(deletes) are balancing steps required.
Insertions
Figures E, F and G graphically illustrate the classic B-Tree insertion algorithm. Figure
E is the base tree before any insertions. To insert the key “Q” into the B-Tree of
Figure E, the find algorithm is followed to locate the leaf node where the new key
belongs. If we follow this algorithm we discover that node 7 is the node where the
new key belongs. Since this node has only two keys, the new key value of “Q” can
simply be inserted in the third record.
Figure 9E - This is a B-Tree of order 2, meaning each node holds between two and four keys.
Figure F is the same tree as Figure E after the key of “Q” has been inserted. This
insertion represents the usual case: no balancing steps are required because the node
is not full. B-Trees trade extra storage space (when some nodes are not full) for added
performance.
Figure 9F - This is the same B-Tree as Figure E after the key “Q” is inserted.
Let’s examine a more complex example. Suppose we insert the key value of “U” into
the same tree. In this case, “U” belongs in node 8; however, this node already has four
keys. In this situation, a “split” must occur. To split a node, the keys with the lowest
values are placed in one node and the keys with the highest values are stored in a new
node. In addition, the “middle” key is propagated to the parent node and serves as a
separator. Figure G contains the same tree as Figure F after “U” is inserted. In this
example, the parent node contains free records for additional keys. If the “middle”
key cannot be placed in the parent because the parent node is also full, then the parent
node is also split. In the worst case, this splitting occurs all the way to the root and the
B-Tree grows by one level.
Figure 9G - A representation of the same B-Tree as Figure E after the key “U” is
inserted.
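The split step itself can be sketched as follows (in Python, for a B-Tree of order d); the keys in the usage line are hypothetical, since the full contents of node 8 are not reproduced here. Propagating the separator into a full parent simply repeats this step one level up.

def split_full_node(keys, new_key, d):
    # keys: a sorted list of 2*d keys from the full node being split.
    # Returns (left_keys, separator, right_keys).
    assert len(keys) == 2 * d
    keys = sorted(keys + [new_key])      # 2d + 1 keys in order
    left = keys[:d]                      # the lowest keys stay in the existing node
    separator = keys[d]                  # the middle key is propagated to the parent
    right = keys[d + 1:]                 # the highest keys move to the new node
    return left, separator, right

# Hypothetical order-2 example: inserting "U" into a node that is already full
print(split_full_node(["R", "S", "V", "W"], "U", 2))   # (['R', 'S'], 'U', ['V', 'W'])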
Deletions
The delete algorithm, like the insert, begins with the find or retrieval operation in
order to locate the node which contains the target key. The target node may be found
either at the leaf level or at any other level within the tree.
If the target node is a leaf node, the key is simply removed. Since the majority of the
keys are stored at the leaf level (because there are many more leaf nodes than non-leaf
nodes), most of the time the target node will be found at the leaf level.
If the target node is not a leaf node then the key cannot simply be removed. The key
must be replaced with a new “partitioning” key so that future find operations are
guided properly to the child nodes. The general procedure is to replace the target key
with the next highest key value, or what is known as the “successor” key. The
successor key will always be found at the leaf level, by following the branch immediately to the right of the target key and then descending the left-most branches of that subtree. Once the successor key is located, it is simply moved from the leaf
node to the target node. In other words, the successor key replaces the target key (the
one to be deleted).
In either case, whether the target node is a leaf or not, the above step causes one leaf
node to shrink in size by 1 record. When this occurs, the affected leaf node may now
have too few keys, meaning it is less than half full. This situation is termed an
underflow. Two techniques are then used to deal with the underflow condition:
redistribution and concatenation.
If either one of the adjacent siblings of the affected leaf node contain excess records
(where the nodes are more than half full) then the redistribution technique may be
utilized. Redistribution simply moves keys among one sibling node, the affected node
and the parent such that the underflow condition is remedied.
On the other hand, if neither of the affected leaf node’s adjacent nodes have excess
records, then the concatenation technique must be applied. Concatenation involves
merging nodes and is, in effect, the opposite of the splitting operation used in the
insertion algorithm. Like splits, concatenations may propagate all the way to the root.
However, in this case, if the propagation travels to the root, the tree shrinks by one
level (rather than growing by one level as with splits).
Detailed examples of the deletion algorithm may be found in Comer, Smith and Barnes, Knuth, as well as any introductory text on index organizations. Wirth provides sample routines programmed in Pascal.
Retrieval costs
If we assume each node in a B-Tree is mapped to a block on a physical I/O device,
then each node visited in a find operation requires one physical I/O. Consequently,
retrieval cost is proportionate to the number of nodes visited to locate the target node.
The number of nodes visited during the find operation is directly related to the height
of the B-Tree since most nodes will exist at the leaf level. In other words, most of the
time, the find operation will traverse the tree all the way to the leaves before the target
key is located. Hence, the number of physical I/Os required to locate the target key is
approximately equal to the height of the tree. Knuth and others have shown that for a
B-Tree of order d with n total keys:
height ≤ log_d((n + 1) / 2)

Thus, the cost of the find operation grows as the logarithm of the total number of keys. The following table shows how efficient B-Trees are in terms of I/O due to their logarithmic cost function, since the height of the index is generally equivalent to the I/Os required to locate a leaf entry:

Order (d)   1,000 keys   10,000 keys   100,000 keys   1,000,000 keys   10,000,000 keys
    10           3             4              5               6                 7
    50           2             3              3               4                 4
   100           2             2              3               3                 4
   150           2             2              3               3                 4
Even with 1 million keys, one can locate a target key within a B-Tree of order 50 in
only four physical I/Os. In fact, Knuth and others show that this cost can be improved
upon in practice by some simple implementation adjustments. The average cost is
somewhat less than three physical I/Os.
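The I/O figures in the table follow directly from this height bound; the short sketch below (in Python) reproduces them, taking the key counts from the table's column headings.

import math

def max_height(d, n):
    # Worst-case height of an order-d B-Tree holding n keys: ceil(log_d((n + 1) / 2))
    return math.ceil(math.log((n + 1) / 2, d))

for d in (10, 50, 100, 150):
    print(d, [max_height(d, n) for n in (10**3, 10**4, 10**5, 10**6, 10**7)])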
B+-Trees
As we have shown, B-Trees offer near optimal performance for random access to one
record. However, sequential processing using a standard B-Tree is both inefficient
and difficult. After locating the first record using the find operation, to locate the next
sequential record may require traversing from the leaf to the root and back down to
another leaf. Thus, in the worst case the cost of the next operation is:
MaxCostOfNext = log_d n
B+-Trees are derived from the standard B-Trees. They add efficient sequential
processing capabilities to the standard B-Tree yet retain the excellent random access
performance.
Before defining B+-Trees, it is necessary to define two terms:
• “Index nodes” refers to all nodes not at the leaf level, meaning the root and
all subsequent levels excluding the leaf level;
• “Leaf nodes” or “leaves” are all nodes at the leaf level.
B+-Trees differ from B-Trees in that (see Figure H):
• All pointers to the main file are removed from the index nodes;
• All key values are stored in leaves regardless of whether they exist in index
nodes;
• Leaf nodes are chained together into a linked list known as the sequence set.
Figure 9H - A B+-Tree is composed of index nodes and leaf nodes that are linked
together in a linked list known as a sequence set.
These structural differences result in several key implications:
• All find operations must travel to the leaf level to obtain the pointer to the
main file; thus, the index nodes serve as a “road map” to the appropriate entry
point to the sequence set.
• Since pointers to the main file are removed from the records in the index
nodes, more records can be stored in the index nodes; thus, the performance
is on a par with that of B-Trees.
• The sequence set provides an efficient mechanism for sequential processing: the cost of the next operation is at most one I/O.
Prefix B+-Trees
Prefix B+-Trees are a slight variation of B+-Trees. Since the index nodes only serve as
a road map to the leaves and all operations must travel to the leaf nodes, the index
nodes do not have to store the full key values. The index nodes must store only
enough of the key value to distinguish one branch from another and direct the find
operation to the appropriate leaf (see Figure I).
(Left fragment index node: “beta”; right fragment index node: “b”.)
Figure 9I - The tree fragment on the left contains the full key in the index node. The
tree fragment on the right contains only the “minimum distinguishing key.”
Prefix B+-Trees store the “minimum distinguishing key” in the index nodes. As a
result, many more index records may be packed into the index nodes thus increasing
the performance of the B+-Tree by decreasing the height of the tree.
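For character keys, the minimum distinguishing key can be computed as the shortest prefix of the higher key that still sorts above the lower key; a sketch follows (in Python, with hypothetical key values echoing Figure I).

def min_distinguishing_key(lower_key, higher_key):
    # Shortest prefix of higher_key that still sorts above lower_key; the prefix
    # can then stand in for the full key in an index node.
    for length in range(1, len(higher_key) + 1):
        prefix = higher_key[:length]
        if prefix > lower_key:
            return prefix
    return higher_key                    # fall back to the full key

print(min_distinguishing_key("alpha", "beta"))   # "b", as in the figure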
SQLBase implementation
SQLBase B-Tree indexes are based on the B-Tree variant described above as the prefix B+-Tree. This variant provides excellent random and sequential
performance. Various aspects concerning SQLBase’s implementation of prefix B+-
Trees are discussed below.
Node size is determined by the index page size, which is 1,024 bytes in most implementations of SQLBase (in some implementations the page size is 2048 bytes). Index page overhead is 74 bytes. The node size of leaf nodes is calculated as:

NodeSize = UsableIndexPageSize / IndexEntryLength

where

UsableIndexPageSize = (1024 - 74) × (100 - PercentFree) / 100

and:

IndexEntryLength = 9 + NbrColumns + KeyLength

For example, if an index contains three attributes, each of which contains, on average, 20 characters, and PCTFREE is 5, then the B-Tree node size is:

NodeSize = (950 × (100 - 5) / 100) / (9 + 3 + 60) = 902.5 / 72 ≈ 12

Hence, in this example, the node size of leaf nodes is twelve, and one would generally expect the node size of non-leaf nodes to exceed 24.
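A sketch of this calculation (in Python, reusing the example's figures and assuming a 1,024-byte page) is shown below.

def leaf_node_size(nbr_columns, key_length, pct_free, page_size=1024):
    usable_index_page_size = (page_size - 74) * (100 - pct_free) / 100
    index_entry_length = 9 + nbr_columns + key_length
    return int(usable_index_page_size // index_entry_length)

# The example above: three key columns averaging 20 characters each, PCTFREE 5
print(leaf_node_size(3, 60, 5))   # 12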
Performance considerations
In an ideal situation, a database designer could set up an index and then rarely have to
perform any maintenance on it. However, index performance degrades over time due
to a number of factors, foremost of which is the pattern in which insertions and
deletions occur. When this happens, index reorganization becomes necessary. This
reorganization can take one of two possible forms:
• Reorganization of the entire database will rebuild all indexes within the
database as a subset of the process. This is overkill unless the entire database
has reached the point where total reorganization is required.
• Dropping and then re-creating an index will cause the index to be built fresh.
This will result in an optimal, balanced tree structure for the index, and is the
simplest and quickest way to return a degraded index to peak performance.
Deletions can cause balancing problems in the tree structure and fragmentation of the
index blocks. SQLBase does not perform redistributions or concatenations when
underflows occur in delete processing, based on the philosophy that these operations
would take too much time during regular transaction execution. Rather, it is left to the
DBA to rebuild the index when the accumulated underflows are sufficient to impact
performance.
Insertions are not always spread uniformly across the index tree structure. When
many insertions occur within a small range of symbolic key values within the tree, the
freespace within this area is rapidly used up. Once this occurs, there can be frequent
splits within this portion of the tree, sometimes reaching all the way back to the root.
While these operations are taking place, many exclusive locks will be held throughout
the index structure for the duration of the split operation. The locks may cause
contention problems for other transactions wishing to use the index structure during
this time; see chapter 13, Concurrency Control, for more details on the locking implications of B-Tree indexes in SQLBase.
This imbalance is the primary reason why indexes placed on columns whose values
are incremental in nature, such as a timestamp, can cause performance problems
within an application. These indexes grow only along one edge. As rows are inserted
along this edge splits occur frequently, so that transactions updating this index find
themselves competing for locks within a small portion of the overall index structure.
One possible solution to this problem is to add columns to the index in order to make
the insertion pattern more evenly distributed throughout the tree.
4. If the value calculated above is greater than 50%, consider rebuilding the index
more frequently to lower the PCTFREE value.
This procedure can also serve as a starting point for an index that is known to have a
skewed insertion pattern. After obtaining a PCTFREE value from this procedure,
either increase the value by a factor of between 1.5 to 3, or simply plan on rebuilding
the index more frequently than the basis of your future row estimate would indicate.
The more skewed the insertion pattern, the more dramatic the adjustment should be to
ward off potential performance problems.
B-Tree locking
This topic is covered extensively in Chapter 13, Concurrency Control. To summarize,
indexes can often become points of lock contention when they are placed on an active
table, and should therefore be used sparingly in these cases. When many transactions
are being executed simultaneously, and rapid response time is critical, any insertion or
deletion activity in index structures will have a significant adverse impact.
Non-unique indexes
Indexes which lack the “UNIQUE” clause on the CREATE INDEX statement can
contain more than one row with identical symbolic key values. SQLBase handles
these entries as if they were unique anyway, in that each entry exists independently of
any other entry. This is in contrast to some DBMSs which build synonym chains for
duplicate entries, and then incur significant overhead while maintaining these long
synonym chains. For this reason, SQLBase can handle any number of duplicates
within an index without incurring any significant performance degradation, and in
fact some space savings will be realized due to the minimum distinguishing key
technique.
One potential drawback to non-unique indexes, though, is their lower selectivity
factor. This is a statistic that is kept for each index by SQLBase, and is used by the
query optimizer (covered in detail in Chapters 15 through 18) when selecting indexes
to use when building an access path to satisfy a query’s data requirements. If an index
has many duplicates as a percentage of its total rows, then there is a good chance that
the index may rarely, if ever, be selected for use by the optimizer. Such indexes are of
no benefit whatsoever to the system, and should be deleted.
An alternative action that will increase the selectivity factor of the index is to add one
or more additional columns to the low order end of the index which will enhance its
uniqueness. In fact, one technique that is sometimes used is to add the primary key of
the table to the low order of an index, which can then be created as a unique index.
Note, though, that this technique will not affect the optimizer’s selection of the index
unless these additional columns can actually be specified in the WHERE clause of the
SQL referencing the table. This is because the optimizer considers the selectivity
factor only for those index columns that are actually specified in the statement being
prepared for execution, as opposed to the entire composite index key.
Chapter 10
Hashed Indexes
In this chapter we:
• Introduce the concept of hashing, and contrast its effect on the location of
rows within a table’s storage space with that of non-hashed tables.
• Analyze the theory behind the performance benefits that can be expected
when hashing a table.
• Explain the details of SQLBase’s implementation of hashing including the
steps you need to take to ensure that your hashed tables will meet your
performance expectations.
needed for the file is ⌈50,000/10⌉ = 5,000 blocks, where ⌈x⌉ is the ceiling of x, which rounds up the fraction part of x (for example, ⌈3.1⌉ = 4). Access of the file through a BTree index would require one I/O for the leaf page, one for the actual data block, plus another I/O for each level in the BTree structure (probably at least another two I/Os, although this would probably be improved through caching). This would make a total of approximately four I/Os for the BTree access.
However, hashing makes it possible to find a record with very few
disk accesses, typically one disk I/O. This means that the hash access
would require only one-fourth of the I/O time that the BTree access
would.
Hashing theory
We will now examine the characteristics of the hashing storage technique in general.
We will cover the hash function, which performs the transformation of symbolic hash
key values into storage addresses, the concept of bucket hashing, where more than
one record is held at the same hash address, and provide a detailed analysis of the key
performance criteria used to evaluate the success of a hashed file implementation.
These criteria all relate to a hashed file’s disk space allocation, and the probabilities
of the hashing algorithm being able to accomplish its objective of storing and
retrieving records with no more than one I/O.
Bucket hashing
In practice, the I/O system retrieves a group of records from a disk, instead of a single
record, because records are grouped together in a block. We can extend the idea of a
record address in a file to the address of a block containing a set of records, and call
each block a hash bucket. In a bucket scheme, the hashing function generates a bucket
number instead of a record address as the hashing address for a given hash field. The
bucket number can usually be directly transformed into a relative block number in a
hashed file. When this relative block number is added to the starting disk address for
the file, any block in the file may be accessed directly via its disk address. Once the
bucket’s disk address is resolved, we need only a single block access to retrieve the
required record. The cost to search for the record within the block is minimal since it
can be carried out in a main memory buffer.
Bucket size
Bucket size is the number of records that can be held in a bucket. For instance, since a
bucket of 1,024 bytes can store 10 records each of which is 100 bytes long, the bucket
size is 10. Alternatively, if the record length is doubled from 100 bytes to 200 bytes,
the bucket size is decreased from 10 to 5. As bucket size increases, the probability of
overflow decreases but the time taken to search for a record in the bucket may
increase. However, the cost of accessing a record once the bucket is in main memory
is very small compared with the time required to read a bucket in secondary storage.
Packing density
One characteristic of hashing is that the number of I/Os required to perform an
insertion or retrieval tends to increase as the file fills. This seems intuitive since when
most of the buckets are full, it is more likely that the hashing function will generate
the address of a bucket which is already full. Overallocation of the file is an obvious way to lessen this problem; however, it is not desirable to allocate space for a lot of empty buckets. A measurement called packing density is the ratio of the number of records (R) in a file to the number of available record slots (N):

PackingDensity = Records / NumberOfSlots

We know that N = B × S, where B is the number of home buckets (which equals the number of addresses that can be generated by the hashing function), and S is the bucket size (the maximum number of rows which may be stored in each bucket). Hence:

PD = R / (B × S)

Since the bucket size S is a function of the page size and the row length, we can calculate the number of buckets (B) required to store a particular hash table, given the number of rows R and a chosen packing density PD, by solving the previous formula for B:

B = (R / S) / PD
This concept is important because in SQLBase bucket size is a function of the page
size (which is fixed for a specific implementation platform), the row length, and the
PCTFREE specification on the CREATE TABLE statement. Most of these
parameters cannot be easily changed for hashing performance reasons, but are
determined by other factors. Your main parameter for tuning the performance of a
hash table in SQLBase is by altering the number of buckets, B, in the table’s storage
space. The number of home buckets (B) is the most easily adjusted, and therefore the
most important, parameter which affects packing density. By increasing the number
of home buckets, packing density drops and hashing performance improves due to a
decrease in the number of overflows.
Packing density itself is one of the most important parameters affecting hashing
performance. As a general rule of thumb, collisions are unacceptably frequent if the
packing density exceeds 70%. This rule can be greatly impacted by the bucket size,
which is determined by row and page size, however. This is why you should
understand how to determine a packing density figure that is appropriate for your
particular table.
Next we wish to calculate the probability of the remaining (R-r) records being inserted into home buckets other than b. Let (1 - 1/B) be the probability that a record will not be inserted into a particular bucket (called b). Hence, (1 - 1/B)^(R-r) is the probability that (R-r) records are not inserted in bucket b.
We have now introduced the probability that r of R records will be inserted in a home bucket (b) as well as the probability that the remaining (R-r) records are not inserted in bucket b. Therefore, the probability of r records being inserted into bucket b and the remaining (R-r) records not being inserted into bucket b is the product of (1/B)^r and (1 - 1/B)^(R-r), or (1/B)^r × (1 - 1/B)^(R-r).
However, there are different combinations in which R records could be taken r at a time. The number of different ways is:

R! / (r! × (R - r)!)

Thus the probability of a given bucket b having exactly r records (where r ≤ R) is:

p(r) = [R! / (r! × (R - r)!)] × (1/B)^r × (1 - 1/B)^(R-r)

Given p(r), the probability of exactly j overflow records in a specific bucket with bucket size S is p(S+j). Therefore the expected number of overflows for any given bucket is:

p(S+1) + 2p(S+2) + 3p(S+3) + ... + (R-S)p(R) = Σ(j=1 to R-S) j × p(S+j)

Using this formula, we see that the expected number of overflows for all B home buckets is:

B × Σ(j=1 to R-S) j × p(S+j)

We can express the above formula as a percentage of the records in the file:

OverflowRatio = [B × Σ(j=1 to R-S) j × p(S+j)] / R

Since we assume a uniform distribution for any given packing density (PD = R / (B × S)), any bias toward certain addresses in the hash function is likely to
increase the rate of overflow. Be aware that a figure produced with this model is a
minimum percentage. Given the formulas above, we can derive an expected overflow
against packing density for a particular bucket size. The following table shows the
expected percentage of overflow for different combinations of bucket size and
packing density.
The following example illustrates the use of this table in the design of a hash table.
Bucket    Packing density
size      50%     55%     60%     65%     70%     75%     80%     85%     90%     95%
  2      10.27   11.90   13.56   15.25   16.94   18.65   20.35   22.04   23.71   25.37
  3       5.92    7.29    8.75   10.30   11.92   13.59   15.30   17.05   18.81   20.58
  5       2.44    3.36    4.45    5.68    7.07    8.58   10.21   11.93   13.73   15.60
  8       0.83    1.33    2.01    2.87    3.94    5.20    6.65    8.26   10.03   11.93
 12       0.24    0.47    0.84    1.38    2.13    3.11    4.33    5.79    7.48    9.36
 17       0.06    0.15    0.32    0.63    1.12    1.85    2.84    4.12    5.69    7.53
 23       0.01    0.04    0.12    0.28    0.58    1.08    1.86    2.96    4.40    6.18
 30      <0.01    0.01    0.04    0.11    0.29    0.63    1.22    2.14    3.45    5.16
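The table values can be approximated with a short sketch of the model derived above (in Python). The number of home buckets used below, 1,000, is an arbitrary assumption; for any reasonably large B the result depends almost entirely on bucket size and packing density.

import math

def expected_overflow_pct(bucket_size, packing_density, n_buckets=1000):
    # Expected overflow records as a percentage of all records, for B home
    # buckets of size S loaded to the given packing density, assuming a
    # uniform hash function.
    B, S = n_buckets, bucket_size
    R = round(packing_density * B * S)              # total records in the file

    def p(r):                                       # probability a given bucket gets exactly r records
        return math.comb(R, r) * (1.0 / B) ** r * (1.0 - 1.0 / B) ** (R - r)

    expected_per_bucket = 0.0
    for j in range(1, R - S + 1):
        term = j * p(S + j)
        expected_per_bucket += term
        if term < 1e-12:                            # remaining terms are negligible
            break
    return 100.0 * B * expected_per_bucket / R

# Bucket size 8: roughly 3.9% overflow at 70% packing density, roughly 6.6% at 80%
print(round(expected_overflow_pct(8, 0.70), 2), round(expected_overflow_pct(8, 0.80), 2))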
Hash function
SQLBase contains a pre-programmed hash function which will provide excellent
results for the vast majority of symbolic keys. This function is internal to SQLBase
and cannot be changed by the database designer in any way. For efficiency, SQLBase
implements much of the hash function using bit operations, such as exclusive-or
(XOR), and bitwise complement (1’s complement). The following procedure is
followed by the SQLBase hash function when transforming a symbolic key into a
hash address.
~(0111,1100,0111,1001,0110,0011,0111,1000) =
1000,0011,1000,0110,1001,1100,1000,0111
(131,134,156,135 in ASCII)
2,206,637,191 when treated as a fullword binary integer and
converted to decimal.
2. Transform the binary fullword into an intermediate integer that has the same range
as the number of page addresses in the table’s space allocation. In this step, the
method used is based on performing integer division, then finding the remainder,
through the use of the MOD operation.
This integer is derived by performing a MOD of the output of step one with a
number A, where A is the number of hash buckets that are allocated. SQLBase
always sets A to a prime number to avoid the possibility of divisors producing a
zero result from the MOD operation.
Example 5: To continue from example 4, assume that 1,753 hash buckets have been allocated for the employee table. This means that allowable values for the results of step 2 would lie in the range of 0-1752. Dividing 2,206,637,191 by 1,753 gives a result of 1,258,777 with a remainder of 1,110. Therefore, the result of step 2 is 1,110.
3. Transform the output of step two into the physical page address. This is done by
adding a starting page number, which yields the final hash address of the page
containing the row.
Example 6: Suppose that the first page number used for data storage of the employee table rows is page 60. This means that 60 must be added to the target bucket obtained in example 5 in order to get the physical target page number of this row. Since 1,110 + 60 = 1,170, the row will be stored on page 1,170. Similarly, if a SELECT ... WHERE keyname = “9305EJSM” was performed against the employee table, SQLBase would look for the row on page 1,170.
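Steps 2 and 3 can be illustrated with a short sketch (in Python) using the numbers from examples 5 and 6; step 1 is internal to SQLBase, so its output is simply taken as given.

fullword = 2_206_637_191     # output of step 1 (the complemented fullword shown above)
n_buckets = 1753             # hash buckets allocated for the employee table (a prime)
start_page = 60              # first page used for data storage of the table's rows

bucket = fullword % n_buckets     # step 2: remainder of the division, 1,110
page_no = bucket + start_page     # step 3: add the starting page number, 1,170
print(bucket, page_no)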
When determining what number to place in this DDL statement, an optimal choice of
packing density becomes critical in order to reduce the number of overflows while
still utilizing disk space efficiently. While the 70% rule of thumb figure is often used,
we encourage you to look in the expected overflow table explained earlier to find the
packing density that is most appropriate for the bucket size that will be in effect for
each of your hashed tables. By looking at this table, you can decide what packing
density is appropriate in order to obtain an expected low number of overflows. This
packing density can then be used as input to the calculation of the number of buckets
estimate for the CREATE INDEX statement. Remember that it is when the CREATE
INDEX statement (not CREATE TABLE) is executed that a fixed number of buckets
will be allocated for storing the hashed table. Also, SQLBase will round up the
number you specify in the BUCKETS form, or the number it calculates in the ROWS
form, of the CREATE INDEX statement to the next higher prime number.
Bucket size (S) can be calculated with the following formula (for information on how to find the length of each column, refer to Chapter 11, Estimating Database Size):

S = (PageSize - 86) / [ (24 + (2 × NumCols) + ΣColLengths) × 100 / (100 - PCTFREE) ]
Example 7: Suppose a table has 7 columns, an average data width of 66 bytes for all columns, and is projected to contain 60,000 rows. The database is running on the SQLBase NLM server, which has a page size of 1,024 bytes. The PCTFREE parameter of the CREATE TABLE statement has been allowed to default to 10%. The designer has decided on a packing density of 70%. Bucket size is

S = (1024 - 86) / [ (24 + (2 × 7) + 66) × 100 / (100 - 10) ] = 8

From the expected overflow table we see that the corresponding expected overflow is 3.94%. Using the formula given above, we can compute the number of buckets required in the CREATE INDEX statement as (60,000 / 8) / 70% = 10,715 buckets. In other words, SQLBase will allocate 10,715 pages for storing hashed data. Alternatively, the designer could choose an 80% packing density, increasing the expected overflow to 6.65%. The number of buckets in the CREATE INDEX statement would then be (60,000 / 8) / 80% = 9,375 pages. Note that the use of a higher packing density saves space but that the overflow rate also increases, in this case from 3.94% to 6.65%, while the space saved is 10,715 - 9,375 = 1,340 pages, or approximately 1.3 megabytes.
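The arithmetic of example 7 can be reproduced with a short sketch (in Python):

import math

page_size = 1024        # bytes (SQLBase NLM server)
num_cols = 7
sum_col_lengths = 66    # average data width of all columns, in bytes
pct_free = 10           # CREATE TABLE PCTFREE (defaulted)
rows = 60_000           # projected number of rows
packing_density = 0.70  # chosen by the designer

row_length_with_slack = (24 + 2 * num_cols + sum_col_lengths) * 100 / (100 - pct_free)
bucket_size = int((page_size - 86) / row_length_with_slack)       # 8 rows per bucket
buckets = math.ceil(rows / bucket_size / packing_density)         # 10,715 buckets
print(bucket_size, buckets)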
Chapter 11
Estimating Database Size
Introduction
The purpose of this chapter is to enable a DBA to estimate the total size of a complete
database. The raw data needed to perform this task are the detailed database design,
including tables, columns, indexes, and views, as well as row occurrence estimates
for each table in the database.
An accurate projection of a database’s ultimate size is more critical than ever before,
since gigabyte database systems are becoming increasingly common in today’s
market. A database designer can base his estimates of the growth of the database on
observable and measurable business metrics such as the number of new order
transactions processed daily. This information may be extrapolated to row count
estimates for the various database tables involved in each transaction. Following this,
the required disk capacity can be pre-determined by using the formulas provided in
this chapter. The results can then be used to determine appropriate hardware
configurations, both for the initial implementation of the application as well as for
periodic upgrades in capacity.
The starting point for this estimation is to calculate the total size for every table. To
do so, we need to (1) calculate each column width in a table; and (2) calculate the
table size based on the information derived from (1). Table size is calculated
differently for tables which are the subject of a clustered hashed index, due to the
change this has on the table’s space allocation technique:
• If the index is created by specifying the option of CLUSTERED HASHED
in the CREATE INDEX statement, we estimate the size of the hashed file as
the table size. This hashed file is allocated at full size immediately when the
CREATE INDEX statement is executed.
• Otherwise, we need to estimate the non-hashed table size. This is the size that
we estimate the table will grow to when fully populated with data. Data pages
for the table will be allocated as they are needed over time, as opposed to all
at once as is the case with hashed tables.
If the table is associated with one or more BTree indexes (indexes which do not
specify the option of CLUSTERED HASHED), then the calculation of the table size
should also consider the space required for the indexes. The total space required for
the table is then the sum of the table itself and all associated BTree indexes.
The database size can then be obtained by summing up all the table sizes, the
overhead of all views defined against these tables, and the overhead of the system
catalog (or data dictionary) tables in the database.
Spreadsheet usage
A number of spreadsheets are available to help the DBA in estimating table and index
sizes. These spreadsheets are available for download from the Gupta Web site. Users
of the hashed table spreadsheet, for example, can specify the percent free parameter
(PCTFREE), number of rows projected to be held in the table, and packing density.
Other inputs include column names, data types, average data length for each column,
and declared data length of each column. The spreadsheet calculates the number of
rows contained in a page, number of row pages required, and number of rows for the
CREATE CLUSTERED HASHED INDEX statement. With the availability of these
spreadsheets, the task of estimating the size of a database is largely automated. The
following example in Figure A illustrates one of the spreadsheets.
Three specific spreadsheets are available:
• Calculation of space requirements for a standard, non-hashed table.
• Calculation of space requirements for a hashed table (one which has a
CLUSTERED HASHED index placed on it). This is the example in Figure A.
• Calculation of space requirements for a BTree index.
Note that there is no spreadsheet available for calculating space requirements for a
CLUSTERED HASHED index itself. This is because there is no physical space
allocated uniquely to this type of index, rather the effect of the index is to change the
storage allocation method used for the subject table.
Figure 11A - An example of a spreadsheet which calculates the estimated size of a table.
Note that this table is the subject of a CLUSTERED HASHED index, therefore one of the
calculations made is for the row count to be used in the CREATE INDEX statement.
as ⌊3.1⌋ = 3). Also, ⌈x⌉ is the ceiling of x, which rounds up the fraction part of x (such as ⌈3.1⌉ = 4).
Also, hashed tables will need another calculation done using the maximum width for
each column in the table. This will be needed later in order to accurately calculate the
packing density for the hashed file. Since SQLBase reserves space for hashed tables
based on the row estimate in the CREATE CLUSTERED HASHED INDEX
statement, multiplied by the maximum possible row width (which is the sum of the
maximum widths of all non-LONG VARCHAR columns), the calculation of this
maximum row width allows the DBA the highest level of control over this space
allocation process. This calculation does not need to be performed for non-hashed
tables.
Character: Number of characters in the column (maximum 254 bytes). For example, the character data “Atlanta” is 7 bytes long.
(Note: Physically, SQLBase stores both CHAR and VARCHAR columns the same way. However, you should continue to use both data types to explicitly document the logical data type of the attribute. For example, CHAR(8) implies the attribute always contains eight characters, whereas VARCHAR(8) means the attribute may contain up to eight characters.)
Date: 5 bytes.
Date/Time: 12 bytes.
Long varchar: 12 bytes, consisting of the first physical page number, the last physical page number, and the column size, are stored in a base row page. These 12 bytes are considered as the column width stored in a base row page. The actual data resides in long varchar pages. SQLBase stores only one column on each long varchar page. For example, if a column is declared as LONGVAR but its length is only one byte, an entire long varchar page is allocated to store this one byte of long varchar data. A general formula for the number of long varchar pages required for the data is (for each long varchar column):
Nbr of Long Pages per Long Column = ⌈size of long varchar in bytes / (1024 - 61)⌉
The number of long varchar pages is included in the calculation of a table size.
(Note: SQLBase stores zero-length strings defined as either CHAR or VARCHAR as NULL (i.e., they require no physical storage); on the other hand, columns defined as LONG VARCHAR require a minimum of one LONG VARCHAR page (i.e., 1024 or 2048 bytes, depending on the page size).)
We define DataLength as the sum of all column widths in a row, which is used for the
table size calculation.
Row Length: Includes the row overhead, such as the row header and slot table entry. Row Length = 18 + (2 × number of columns) + DataLength.
Row Length with Slack: Considers the PCTFREE parameter specified in the CREATE TABLE statement (which defaults to 10%). Row Length with Slack = Row Length × 100 / (100 - PCTFREE).
Rows per Page: ⌊Usable Row Page Size / Row Length with Slack⌋.
Nbr Row Pages: ⌈NbrOfRows / Rows per Page⌉, where NbrOfRows is the projected number of rows in the table.
Nbr Long Pages: NbrOfRows × Nbr Long Pages per Long Column.

Rows per Page: ⌊Usable Row Page Size / Row Length with Slack⌋.
Nbr Long Pages: NbrOfRows × Nbr Long Pages per Long Column.
Nbr Hashed Table Pages: Nbr Row Pages / packing density (see the packing density discussion above for a detailed explanation).
Total Data Pages: Nbr Hashed Table Pages + Nbr Long Pages.

Key Length: Σ average data lengths of the columns in the key.
Usable Index Page Size: (1024 - 74) × (100 - index PCTFREE) / 100 (note that the default value for PCTFREE is 10%).
Index Entries per Page: ⌊Usable Index Page Size / Index Entry Length⌋.

Variable Overhead of other views: The sum of the variable overhead of all views referenced by this view: Σ Variable Overhead for all views.
Total View Overhead in Pages: (Fixed Overhead + Variable Overhead + Variable Overhead of other views) / 1024 pages.
There is a fixed amount of overhead for various physical objects. If the object is either
a regular or a hashed table, the overhead is 12 pages. For non-hash indexes, the
overhead is 2 pages. Therefore, the formula for the total amount of system overhead
for a database is:
Total Fixed Overhead Pages = (Nbr Tables x 12) + (Nbr non-hash
Indexes x 2) + (Size of START.DBS / 1024).
In actuality, there is also some variable overhead associated with each type of object
which depends on the current size of the object. As the size of the object increases, so
will this overhead rise slightly. For example, the number of bitmap pages which
SQLBase uses to manage page allocation for data rows will increase with the number
of rows, however this increase is very small, at approximately 1K for every 8MB of
data. Because of the tiny incremental nature of this overhead, formulas are not
presented here to accommodate their inclusion in the overall estimation.
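The parameter formulas above can be combined into a small estimating routine; a sketch follows (in Python, covering only the row page counts and ignoring long varchar pages, index space, and fixed overhead). The usable row page size depends on the platform page size and page overhead, so it is passed in rather than assumed.

import math

def row_pages(nbr_rows, nbr_cols, data_length, pct_free, usable_row_page_size):
    row_length = 18 + 2 * nbr_cols + data_length
    row_length_with_slack = row_length * 100 / (100 - pct_free)
    rows_per_page = math.floor(usable_row_page_size / row_length_with_slack)
    return math.ceil(nbr_rows / rows_per_page)

def hashed_table_pages(nbr_row_pages, packing_density):
    return math.ceil(nbr_row_pages / packing_density)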
[Figure: diagram of the example database design. The CUSTOMER table (50,000 occurrences, rec len 50, Pct free 15) carries the unique clustered hashed index XPK_CUSTOMER on CUSTOMER_ID; the CONTACT table (175,000 occurrences, rec len 130, Pct free 25) carries the unique index XPK_CONTACT on CUSTOMER_ID and CONTACT_NAME, plus a foreign key on CUSTOMER_ID. The diagram's syntax key shows, for each table, the table name, occurrences, total bytes, rec len, Pct free, and number of pages, and for each index, the index name, key length, cardinality, key columns, and whether it is Unique, Non-unique, or Hashed.]
The SQLBase Data Definition Language (DDL) statements used to create this
database, and associated views, follows:
CREATE TABLE CUSTOMER
(CUSTOMER_ID CHAR(5) NOT NULL,
CUSTOMER_NAME VARCHAR(25),
CUSTOMER_ADDR VARCHAR(50),
CUSTOMER_RATING CHARACTER(10),
PRIMARY KEY(CUSTOMER_ID))
PCTFREE 15;
CREATE TABLE CONTACT
(CUSTOMER_ID CHAR(5) NOT NULL,
CONTACT_NAME VARCHAR(25) NOT NULL,
CONTACT_PHONE DECIMAL(10,0),
CONTACT_TEXT LONG VARCHAR,
PRIMARY KEY(CUSTOMER_ID, CONTACT_NAME),
FOREIGN KEY CUSTKEY (CUSTOMER_ID) REFERENCES CUSTOMER
ON DELETE RESTRICT)
PCTFREE 25;
CREATE UNIQUE CLUSTERED HASHED INDEX XPK_CUSTOMER
ON CUSTOMER (CUSTOMER_ID)
SIZE 47628 ROWS;
CREATE UNIQUE INDEX XPK_CONTACT
ON CONTACT (CUSTOMER_ID, CONTACT_NAME)
PCTFREE 10;
CREATE VIEW BAD_CUSTOMER AS
SELECT CUSTOMER_NAME, CUSTOMER_ADDR
FROM CUSTOMER
WHERE CUSTOMER_RATING = 'POOR';
CREATE VIEW BAD_CONTACTS AS
SELECT C.CUSTOMER_NAME, CN.CONTACT_NAME, CN.CONTACT_PHONE
FROM BAD_CUSTOMER C, CONTACT CN
WHERE C.CUSTOMER_ID = CN.CUSTOMER_ID;
An analysis of the data that will be loaded into the database yielded the following average column widths:
Table      Column            Maximum Width   Average Width
Customer   Customer_id              5               5
           Customer_name           25              10
           Customer_addr           50              30
           Customer_rating         10               5
Contact    Customer_id              5               5
           Contact_name            25              15
           Contact_phone           10              10
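As a partial worked example using the CUSTOMER table and its average widths (the remaining steps also require the Usable Row Page Size figure referenced in the formulas above):
DataLength = 5 + 10 + 30 + 5 = 50 bytes.
Row Length = 18 + (2 x 4) + 50 = 76 bytes.
Row Length with Slack = 76 x 100 / (100 - 15) = approximately 89.4 bytes (CUSTOMER was created with PCTFREE 15).
Rows per Page and Nbr Row Pages then follow from the Usable Row Page Size, and the hashed table pages from the packing density, as shown in the formulas above.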
Chapter 12
Result Sets
In this chapter we:
• Introduce the concept of the result set, which is the intermediate storage area
used by SQLBase when processing a SELECT statement which retrieves
multiple rows from one or more database tables.
• Cover the following detailed characteristics of result sets:
• Structure
• Contents
• Life cycle
• Evaluate the implications that SQLBase’s result sets have for program
processing requirements, especially in the areas of locking and concurrency.
Introduction
A result set is a temporary index, possibly associated with a temporary table for
certain complex queries, which contains a set of data that represents the result of a
SELECT statement. Since most SELECT statement executions cause a number of
rows to be returned from SQLBase, the result set typically consists of many rows.
The contents of the result set’s leaf pages (the bottom level in the B+ index tree
structure) are pointers (ROWIDs) to the rows of the table (or tables) involved in the
SELECT. Alternatively, for some complex queries these ROWIDs point to the rows of
a temporary table containing the data values needed to satisfy the SELECT statement.
The result set is stored on disk in a temporary file, and one or more rows are
transmitted from it to the client workstation when requested via the FETCH
command.
The utilization of result sets is an important feature of SQLBase. Through the use of
result sets, SQLBase decreases response times for client workstation users while
increasing their flexibility to position themselves within the result set when
performing browse operations. SQLBase also offers a number of options relating to
the processing of result sets which allow the system designer to balance the flexibility
of result set processing with the concurrency requirements of his application.
Query
Select Acctnum, Acctname from Customer;
[Result set leaf entries: ROWID of row 1, ROWID of row 2, ROWID of row 3, ..., ROWID of row n]
Figure 12A - Result set from a single table query with no derived data. The only entry
in the leaf pages of the index tree is the ROWID of each of the Customer table’s rows
which qualifies for inclusion (in this case, all rows).
In Figure A the result set of a simple one-table query is shown. For SELECT
statements such as this, the result set consists of the ROWIDs of the selected rows in
the Customer table. When the client performs a fetch against the result set, the row or
rows returned are identified by the ROWID at the cursor position. As the data rows
are accessed, the columns requested by the query are retrieved and formatted for
return to the client.
Figure 12B - Result set from a two table join. ROWIDs identify the qualifying rows of
each table, and allow SQLBase to retrieve the appropriate columns from the selected
rows of the base tables at FETCH time.
In figure B, the result set of a simple two-table join is shown. Since all of the data
columns requested by the query are present in the rows of the two tables, the only
columns in the result set are the ROWID pointers for the joined rows. Each ROWID
pair in the result set satisfies the join condition, which is that the account number in
the Customer row matches the account number in the Invoice row. When the actual
rows are fetched by the client, SQLBase will use the ROWIDs to retrieve the actual
data values requested by the query from both tables involved in the join. Note that this
query will result in one row for each invoice a customer has on file; in other words, each customer number will be repeated across multiple rows, one for each invoice that customer has.
Query
Select C.Acctnum, C.Acctname, SUM(I.Invoice_amt)
from Customer C, Invoice I
where C.Acctnum = I.Acctnum
group by C.Acctnum, C.Acctname;
[Temporary table rows: each contains the ROWID of row n plus the VALUE columns for row n]
Figure 12C - Result set from a two table join aggregate query. Since this query
contains aggregate data in the form of the SUM (I.Invoice_amt) column, a temporary
table is built to hold the result set data. The result set itself points to the rows of this
temporary table. This temporary table is also sometimes called a ‘virtual result set.’
The aggregate query in Figure C has a significant difference from the previous
queries we have examined, in that its result set contains the ROWIDs of a temporary
table. This temporary table contains the actual row values to be sent back to the client
in response to fetch statements. These values must be stored in a temporary table; otherwise, the entire aggregate set of data would have to be retrieved again in order to
recalculate the result of the function. The query in figure C contrasts with the query in
figure B in that it returns a single row for each customer. This row will contain the
total value of all invoices for that customer, calculated by the SUM function which is
applied to the invoice amount field. SQLBase creates a temporary table for any query
that includes aggregate functions or complex views, or that contains the
DISTINCT, GROUP BY, HAVING, UNION, or ORDER BY clauses.
matches the sequence of the ORDER BY clause, then the result set may be built
incrementally, as outlined above. However, if there is no index that can be used to
satisfy the ORDER BY clause, then SQLBase will have to perform a scan of the
entire table, followed by a sort operation, in order to put the result set in the correct
sequence. Only following this sort may the first row of the result set be returned to the
client.
set either. The only way to make the results of the update appear in the virtual table is
to re-execute the SELECT statement which built it.
5. Compare the row serial number component in the ROWID to the row serial
number actually stored in the slot. If these are different, then the original row has
been deleted and the slot has been reused by another row. If this has occurred, then
SQL return code 00003 (FET NF: Row has not been found) is returned to the
client.
6. Compare the update serial number component in the ROWID to the row update
serial number actually stored in the row. If these are different, then the row has
been updated since the SELECT statement that built the result set was processed.
If this has occurred, then SQL return code 00002 (FET UPD: Row has been
updated) is returned to the client, indicating the row has been updated. The newly
updated row is returned to the client.
This procedure ensures that transactions processing result sets in high concurrency
isolation modes can still detect modifications to the rows in their result set which
occur after the result set is built.
Chapter 13
Concurrency Control
In this chapter we:
• Introduce the topic of concurrency control, which is the way a SQLBase
server makes each client workstation appear as if it were in sole control of the
database.
• Discuss the concept of transaction serialization, which dictates the
requirements for inter-leaving transaction executions in a manner which
eliminates interference between transactions and makes each transaction
individually recoverable.
• Explain the various types of locks SQLBase uses to control transaction
executions, and the characteristics of these locks.
• Delineate the various isolation levels available in SQLBase, which determine
how long certain types of locks are held during a transaction’s execution.
• Explain how to select an appropriate isolation level for a program, balancing
the trade-offs of data integrity and maximum performance for all executing
transactions.
Introduction
The topic of concurrency control addresses the problems faced by SQLBase in
supporting a multi-user environment. When users are at their workstations using
some type of client software, which could be a SQLWindows application, Quest,
WinTalk, or some other program, and are connected to a SQLBase database server,
they expect the results of their programs to be exactly the same as if they were
executing against a local copy of the database contained on their own workstation.
Users want to remain unaware that many other users are also connected to the same
database and are using it in a variety of ways. Unfortunately, there are problems
inherent in multi-user database access that must be considered by both SQLBase and
the application developer in order to realize the users’ objective. These problems are
caused by the pitfalls inherent in allowing many users to access the same database
concurrently. The topic of concurrency control addresses the software mechanisms
used by SQLBase to overcome these difficulties.
The unit of work that is the subject of the concurrency problem is the transaction
(which is discussed at length in Chapter 5, Transaction Definition). One of the key
goals of concurrency control is that either all updates of a transaction are made
permanent if its execution is successful, or it otherwise has no effect at all on the
shared database. The other goal of concurrency control is that transactions not be
allowed to interfere with each other. Each transaction should appear to have sole
control over the shared database without concern for other transactions whose
execution is being conducted simultaneously by SQLBase.
Transaction serialization
Transactions are executions of programs against the shared database which are
delineated by an implicit start when the program connects to the database, and are
ended through either a commit or a rollback command. Transactions which complete
successfully signal this event to SQLBase through the COMMIT command, at which
time the changes made by the transaction are made permanent within the database.
Unsuccessful transactions terminate with the ROLLBACK command, which causes
SQLBase to back-out all changes to the database that were previously performed on
behalf of the transaction. If the program continues to perform database calls after the
execution of a COMMIT or ROLLBACK, a new transaction is considered to have
begun. A rollback of the transaction may occur for a variety of reasons besides
program request (as indicated through the program issuing the ROLLBACK
command) including program failure (abort), system failure (OS abend), or
environmental failure (power failure). If any of these events occur, then SQLBase
must ensure that all non-committed transactions are rolled back so that the database
never appears to have been affected by them.
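For example, the following sequence of statements (the table and column names are illustrative only) constitutes two transactions; the first ends at the COMMIT, and the second, consisting of the single DELETE, is undone by the ROLLBACK:
UPDATE ORDERS SET STATUS = 'S' WHERE ORDNO = 1;
INSERT INTO ORDER_LOG (ORDNO, STATUS) VALUES (1, 'S');
COMMIT;
DELETE FROM ORDER_LOG WHERE ORDNO = 1;
ROLLBACK;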
[Figure 13A - Two transactions executing against the shared database without locks. Transaction 1 updates order number 1, setting its status to 's', and later issues a ROLLBACK. Before the rollback, Transaction 2 reads the same order, sees the uncommitted status of 's', and updates the customer account balance by the order total, then attempts to COMMIT (which should not be allowed). The correct database image is the one restored after both transactions abort.]
[Figure 13B diagram - Transaction 1 updates order 1 and holds an X-lock on its page; Transaction 2's read of the same row waits until the lock is released.]
Figure 13B - The same as Figure A, except with locks. Transaction 1 places an
exclusive lock on the row it updates, which causes Transaction 2 to wait until
Transaction 1 completes prior to reading the record. This is a serialized execution,
and both transactions are fully recoverable through rollback techniques.
Figure B shows the same two transactions executing with database locking
performed. Following the update of the ORDER row by transaction 1, SQLBase
places an exclusive lock on the page containing the row. This type of lock can only be
held by one transaction at a time and blocks all other transactions from accessing the
page (more detail on locks follows later in this chapter). When transaction 2 attempts
to read the ORDER row, via a SELECT statement, it is placed into a wait state until
the exclusive lock is released. During this time, transaction 1 requests a rollback,
which is performed. Following the rollback, the exclusive lock is lifted on the
ORDER row, and transaction 2 is allowed to continue. Notice that the row image
returned to transaction 2 is not the one that existed at the time of its request, as in
Figure A, but rather the restored image which existed prior to modification by
transaction 1. Since the status value is no longer ‘s’, transaction 2 no longer modifies
the CUSTOMER row as in Figure A, but terminates with a COMMIT anyway. These
transactions are now serialized; they no longer execute in an inter-leaved fashion that allows them to affect each other’s execution.
This example is one of a number of ways in which transactions may interfere with
each other’s correct execution through their interaction within a shared database.
Other common problems that may occur due to lack of concurrency control include:
• Lost Updates
Lost updates occur when two or more transactions retrieve the same record
and make successive updates of that same record. If no method of locking is
used, transactions can overwrite recent modifications without the benefit of
seeing the newly updated record.
• Uncommitted Dependency Problem
Rollbacks issued by one transaction can also cause another transaction to lose
updated data if no locking protocol is used. If a record is updated by one
transaction but not committed, then subsequently updated by a second
transaction with no commit performed, a rollback by the first transaction will
also rollback the second transaction's updated data.
In order to eliminate these problems, SQLBase ensures that transactions execute in a
strict serialized fashion through the use of database locks. These serialized executions
achieve the same effect as executing each transaction in serial fashion, with respect to
all other transactions, such as would happen if they were dispatched through a FIFO
(First In First Out) queue. Of course, SQLBase does not accomplish this by FIFO,
since only through concurrent executions can the speed that is required for an
effective RDBMS be achieved. When viewed conceptually, however, for any single
transaction the effect of its execution is exactly the same as if it were executed
serially, without other transactions present in the system.
S-locks
With an S-lock, more than one simultaneous transaction can lock the record;
therefore, more than one transaction can access the record. S-locks are applied when
a transaction reads data via a SELECT statement. When data is held by an S-lock,
inserts, updates, and deletes, which require X-locks, are not allowed. Once the S-lock
is removed, that is, once there is again only one transaction accessing the data,
inserts, updates, and deletes are permissible (X-locks are allowed). A transaction may
upgrade an S-lock to an X-lock only if there are no other transactions holding S-locks
on the same data.
X-locks
With an X-lock, only one transaction at a time can use or lock the data. X-Locks are
applied to data on inserts, updates, and deletes. While the X-lock is in effect, other
transactions cannot access the data for any purpose. Transactions cannot place an X-
lock on data while an S-lock on the data is in effect for some other transaction.
U-locks
Transactions place U-locks when they have retrieved data with a clear intent to
update. This is usually done internally by SQLBase in order to avoid deadlock
situations, such as when building a result set for an UPDATE ... WHERE command
that will affect multiple rows. Only one U-lock can be held on a page at one time,
similar to X-locks, but this single U-lock can coexist with S-locks, unlike X-locks.
However, when the transaction holding the U-lock wishes to perform the update,
which requires upgrade of the U-lock to an X-lock, then the rules for X-locks apply.
This means that any S-locks that may have been coexisting (placed either before or
after the U-lock was set) with the U-lock must be released prior to the upgrade to an
X-lock taking place. These characteristics of U-locks make them less susceptible to
deadlock situations, as will be seen later in the discussions on isolation levels and
index locking.
The U-lock is the one lock in SQLBase that can be set explicitly. Whenever a
transaction issues a SELECT ... FOR UPDATE OF command, U-locks are placed on
the pages of the result set. If the program later issues an UPDATE ... WHERE
CURRENT OF command referring to the cursor of the SELECT, these U-locks are
upgraded to X-locks for the remainder of the transaction. Any U-locks that are still in
place are allowed to downgrade to S-locks when the cursor under which the SELECT
command was executed is used to prepare and execute a new statement (depending on
the isolation level in effect).
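For example, a program might issue statements of the following form (the table, column, and cursor names are illustrative only; the cursor named in the UPDATE is the one under which the SELECT was compiled):
SELECT ORDNO, STATUS FROM ORDERS
   WHERE ORDNO = 1
   FOR UPDATE OF STATUS;
UPDATE ORDERS SET STATUS = 'S'
   WHERE CURRENT OF CURSOR1;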
Lock compatibilities
The following table summarizes the capabilities of each particular type of lock to
coexist with the other types.
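Based on the lock descriptions above, the compatibilities can be summarized approximately as follows (Y means the two lock types may coexist on the same page, N means they may not):

          S-lock   U-lock   X-lock
S-lock      Y        Y*       N
U-lock      Y*       N        N
X-lock      N        N        N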
Note: *While S-locks and U-locks may co-exist when they are placed, the S-locks will have to
be released before the U-lock may be upgraded to an X-lock.
Locking level
SQLBase places locks at the page level. If the record being accessed fills most of the page, locking the page effectively locks only that record. Otherwise, the lock covers a group composed of the record and its adjacent records, which together make up a page. A page can
contain any sort of data, including indexes and system catalog data (detailed
descriptions of the various SQLBase page types can be found in Chapter 8, Database
Pages).
Locking scope
The scope of the locks implicitly placed by SQLBase cover all the data structures that
the optimizer determines are required to execute each command within a transaction.
When processing a SELECT statement, the optimizer will lock those index leaf pages
that are required to build the result set, if an index is chosen as a component of the
access path. Also, all data pages read during the processing of the SELECT statement
will have S-locks placed upon them. Often, contention between transactions occurs
not on the data pages but within the index structures due to the potential proliferation
of X-locks within the index. When S-locks are released is a function of the
transaction’s isolation level, but is also affected when transactions use cursor context
preservation. On the other hand, X-locks are always held until either a commit or
rollback is issued, regardless of the transaction’s isolation level or the use of cursor
context preservation.
Index locking
In the case of an INSERT or DELETE, X-locks will be placed not only on the page
containing the row which is the subject of the command, but also on at least one page
for every index that exists on the table. This same case holds true for UPDATE
commands specifying columns which are the subject of one or more indexes.
The index page that will definitely be X-locked is the page containing the leaf entry
pointing to the subject record. This is the only page in the index which will have a
permanent X-lock placed on it; this X-lock will not be released until the transaction
commits or ends. If index structure adjustments need to be made, such as splitting
index records, then additional X-locks will be taken as well. These additional locks
will be taken on those nodes within the index structure that have to be updated to
perform a split or the movement of a successor key up the tree. These X-locks will be
released once SQLBase completes the operation on the index tree; they will not be
held through the life of the transaction (in contrast to the X-lock on the affected leaf
page).
SQLBase performs the following steps when performing a DML INSERT operation
which must modify an index (both DELETE and UPDATE function similarly and
therefore are not explained):
Lock implementation
SQLBase implements locks through an in-storage hash table. This structure is used because it is efficient for storing and retrieving
tables with a large number of entries. When a new lock is needed, a previously
allocated, but currently free lock entry may be re-used. If no lock entries are available
for reuse, then additional lock entries may be dynamically added to the table in
groups of 100 locks at a time. These allocations can be added to the table until the
maximum number of locks specified in the LOCKS keyword of the SQL.INI file is
reached, or until the server has no more memory to allocate. Each lock entry in the
table represents either a single page or a range of pages, and contains the following
data:
• The group number owning the page, or page range, that is locked. As
explained in Chapter 8, this number uniquely identifies one object within the
database, such as a table or index.
• The logical page number of the page that is locked, or the starting logical
page number of a page range that is locked. Logical page numbers are
assigned sequentially within a group starting from one.
• The logical page number of the end of the range which is locked.
• The type of lock that is held, either S, U, or X. This is prefixed with a ‘T’ if
the lock is temporary, such as TS, TU, or TX.
• The transaction identifier of the owner of the lock.
Whenever SQLBase is performing a read or a write operation, the lock table is
checked for the group and page on which the operation is performed. If no conflicting
lock type exists in the table, then a new lock is added on behalf of the current
transaction before the operation proceeds. In this way, SQLBase ensures that locks
are set prior to performing the operations they control, and that no lock conflicts exist
that could undermine the integrity of the system.
Isolation levels
The protocols that are used by SQLBase to perform locking operations on behalf of a
transaction are called isolation levels. These define how long locks will be held and
how SQLBase will pass data to the transaction via the message buffer. The four
available isolation levels are: Read Only (RO), Repeatable Read (RR), Cursor
Stability (CS), and Release Locks (RL). These are often referred to by their
abbreviations. The isolation level is selected by a transaction when it first opens a
cursor to SQLBase, and remains in effect for the life of the transaction. The isolation
level that is selected by the transaction applies to all cursors that will be used by the
transaction. Changing isolation levels in the middle of a transaction has the effect of
causing an implicit COMMIT and starting a new transaction with the new isolation
level. All of the transaction’s cursors will now operate under this new isolation level.
Also, having cursor context preservation on will not preserve the result sets of any
cursors, nor their locks, when changing a transaction’s isolation level in midstream.
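In SQLTalk, for example, the isolation level is typically selected with the SET ISOLATION command (shown here for Release Locks; RR, CS, and RO are specified the same way):
SET ISOLATION RL;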
page) as soon as a transaction reads the data. At this point, SQLBase will not allow
any other transaction to update the row so that lost updates are guaranteed not to
occur. However, even though multiple transactions may read the same database row/page, only one of them may actually intend to update the row. That transaction is nevertheless prevented from doing so, because SQLBase must assume that any of the other readers might also update the row.
SQLBase offers an advanced “optimistic” isolation level which most multi-user
applications will choose to decrease the number of SQL timeout and deadlock errors.
This isolation level, Release Locks, allows conditions to develop
where lost updates and uncommitted dependency problems might occur but provides
the programmer with features to detect and avoid these problems when they actually
occur. Each database row in SQLBase is assigned a unique ROWID when the row is
initially inserted and this field is changed every time the row is updated and
committed. The programmer can track the value of ROWID and determine when a
row has actually been deleted or updated by another user.
Warning: Using Release Locks (RL) isolation level with programs which perform database
updating requires each program using the RL feature to conform to the same method of
detecting and handling cross transaction updates, such as the technique shown here. Failure of
all programs to conform will likely result in lost updates and uncommitted dependency
problems that will proliferate throughout the database.
For example, with the Release Locks isolation level, the programmer would issue the
following command to select a row from a customer table:
SELECT ROWID, COL1, COL2, ... COLN FROM CUSTOMER
INTO :SavedRowID, :IntoVar1, :IntoVar2, ... :IntoVarN
WHERE CUST_NAME = 'SMITH AND SONS, INC.'
The following SQL update command would be used when the programmer needed to
update this row:
UPDATE CUSTOMER SET COL1 = '<value>' ...
WHERE ROWID = :SavedRowID
If another user had actually updated or deleted this row since this program selected it,
the ROWID would change such that the SQL update command would fail and the
data base server would return the -806 SQL error code. The programmer can then test
for the -806 error and take appropriate action (such as advising the user that some
other user has either updated or deleted this customer since they read it, and ask if the
user wants the program to re-read the customer record in order to get a fresh copy).
The highest degree of data consistency occurs at the expense of user concurrency. For
example, one locking protocol may ensure that lost updates or uncommitted
dependency problems will not occur. However, this may result in many timeout and
deadlock errors. A SQL timeout error occurs when a given SQL command waits a
program-defined number of seconds (the default is 360 seconds, which is 6 minutes)
for a lock to be released by another program so that it may access the database page.
A deadlock occurs when user A is waiting for row 1 which user B has, and user B is
waiting for row 2 which user A has. In this situation, SQLBase will sense this
deadlock and abort one of the requests thereby allowing one of the users to continue.
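In SQLTalk, the lock wait interval is typically adjusted with the SET TIMEOUT command; the command name, value range, and default should be verified against the SQLTalk reference for your release. A sketch:
SET TIMEOUT 30;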
RO RR CS RL RO
Figure 13C - An ordered approximation of isolation levels along the data consistency
vs. user concurrency continuum. Note that Read Only (RO) occupies both ends of the
spectrum only because it disallows any update operations and thereby avoids X-locks
completely.
same values until the transaction ends, then select an isolation level that holds
read locks until commit or end.
The following table may be used to help determine which isolation level is
appropriate for a specific database application based on responses to the previous
criteria description.
Remember, when going from a restrictive isolation mode to a more relaxed mode
(such as from RR to RL), there may be programming modifications required to make
the transaction behave properly in the new isolation level. An example is checking
ROWIDs for concurrent updates by transactions that are executing in RL mode.
Application designers should be cautious when migrating to more relaxed isolation
levels to ensure that data consistency is not sacrificed to the point where the
application can no longer be considered reliable by its users.
Chapter 14
Introduction
There are many kinds of failures to which computer systems are susceptible. In order
to protect the database against the effects of these failures, SQLBase contains a
Recovery Manager (RM) software component which is responsible for ensuring that
the database is always in a consistent and valid state. This consistent state is defined
simply as having all of the effects of committed transactions applied and none of the
effects of uncommitted transactions applied. The RM achieves its goals primarily
through the use of a log file. This file contains the physical images, or history, of
database changes made by transactions over time. For this reason, the log file used by
SQLBase is a physical log file, hereafter called the log file. Since the RM component
of the database server provides the recovery method for all database transactions,
regardless of their program characteristics, SQLBase is said to contain centralized
recovery.
This chapter will first discuss the broad types of failures which SQLBase protects
against, then the mechanisms used to provide this protection.
Failure types
There are many possible failures that a computer system is susceptible to. Some of
these failures are caused by faulty programs that place erroneous data into the
database (but which commit nonetheless), or the entry of flawed data into the system
by application users who are unaware of the damage they are doing. Unfortunately,
these are examples of errors which cannot be recovered automatically, since any
DBMS must trust that the execution of program requests which are semantically
correct will place correct results in the database. As will be seen, though, some of
these types of errors can be recovered from by using the same techniques as are
applicable for media failures. The three basic types of system failures which the RM
provides centralized recovery from are transaction failures, system failures, and
media failures.
Transaction failures
Transaction failures are aborts of transactions that result in a rollback being performed for
the transaction. The transaction may abort at its own request (by issuing the
ROLLBACK command), or due to some other cause, such as a client workstation
failure or a network failure. Transaction aborts may also be issued by the SQLBase
server in some situations, such as to break a deadlock situation between two
transactions, one of which must be aborted in order for the other to proceed. Also, a
transaction failure may affect one transaction only, or some number of transactions at
once.
SQLBase provides transaction integrity, which prevents these types of failures from
corrupting the database. The mechanism used to accomplish this is undo
processing from the log file. Following the undo, all database updates performed by
the transaction are no longer present.
System failures
System failures cause loss of the server’s volatile storage (RAM memory). The term
“system” as applied to this type of failure refers to the program processing
environment that is created by the operating system (OS/2 or other SQLBase-
supported environment) and consists of the resources and control blocks that allow for
task ownership and dispatching functions. These failures may be due to causes such
as an abnormal end (abend) in the operating system environment (such as NetWare),
an operator inadvertently cancelling the server program, or power failures, among
others. The net effect of these types of failures is that either the SQLBase server task
suddenly disappears, or it discovers that it can no longer continue processing and
must request an abend. When the SQLBase server is shut down normally (when no
users are connected to the database), or a database is de-installed from the server
through the DEINSTALL command, the Database Control Block (DCB) page for the
database is flagged to indicate a normal shutdown took place. This is called a
“graceful shutdown”.
SQLBase provides for system integrity through the implementation of a restart feature
which prevents these failures from affecting the database. Restart is invoked every
time a database is first connected to after the server is started, where it examines the
DCB page of the database to determine what led to the prior shutdown. If it finds that
the flag in the DCB indicates no recovery is required, then it knows that the system
had been terminated gracefully on its last execution. In this case, restart can be
bypassed and the connection established immediately. If the flag in the DCB indicates
that recovery is required, then restart must perform its processing prior to allowing the
connection to the database to be established. Because of the dependency of the restart
logic on the contents of checkpoint records, this process is explained in detail in the
checkpoint section later in this chapter.
Media failures
Media failures cause the loss of some portion of the server’s non-volatile storage (disk
drive). These failures may be caused by environmental effects on the disk storage
environment or through defects in the storage device. Also, magnetic media are
inherently subject to a limited life because eventually the bonding material holding
the oxide on the platter (in the case of a disk) or the film (in the case of tape) will fail
and the oxide will begin to wear away. When this happens, minor errors will begin to
occur. If these errors are observed, the data may still be sufficiently readable to be
copied successfully to another device. Otherwise, the oxide will eventually degrade to
the point that the device will become unusable. This generally occurs to one isolated
storage device at a time. Multiple disk drives rarely fail together, unless some drastic
environmental problem occurs, such as a power surge or shock damage. These
potential disasters are the justification for developing and maintaining a disaster
recovery plan, which involves backing critical data up onto tape or disks that may be
stored off-site from the computer hardware (which is highly recommended for critical
computer installations, but is beyond the scope of this chapter).
Sites can establish procedures using SQLBase’s backup and recovery features to minimize the impact of media failures on the database. Periodic backups of the database and log files provide the ability to restore to any desired point in time. To recover to a specific point in time, the required backup is simply restored; any log images that must be applied to the restored database in order to reach that point in time are then processed through the SQLTalk ROLLFORWARD command. The most important
characteristic of this process is that the application of redo log images in the
rollforward is an “all or nothing” choice. There is no option to bypass either some
transactions on a log file, or some log files, or skip over a particular time gap. For
whatever point in time is chosen to recover to, all log images prior to that point in
time must be applied. SQLBase will not allow you to attempt to do otherwise, since
this would probably cause severe database corruption which would most likely be
irreparable.
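As an illustrative sketch only (the database name DEMO, the backup directory, and the exact option keywords are assumptions; consult the SQLTalk reference for the precise syntax), a recovery from a snapshot backup followed by application of all subsequent log images might look like:
RESTORE SNAPSHOT FROM C:\BACKUP\DEMO TO DEMO;
ROLLFORWARD DEMO TO END;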
Recovery mechanisms
The SQLBase recovery manager relies primarily on two mechanisms for
accomplishing its centralized recovery objectives. The first of these is the log file,
which provides the information required for undo, redo, and checkpoint processing.
The other is the set of utilities which provide for backup and recovery of databases.
Figure 14A - There are two basic types of logs, active and archive. Archive logs may
be moved to tape or other media through use of the BACKUP LOGS SQLTalk
command when LOGBACKUP is on.
The log files used by SQLBase can be divided into two subtypes, the active and
archive logs. Together, these two log types make up the complete database log file.
The active log is the log that is currently being written to by the server; this is a disk
file whose status is “open”. For this reason, the active log is pinned to the disk and can
only be processed by the server. Only one active log can exist at any single point in
time. Whenever the server is shutdown gracefully, or a database is de-installed, all
“after images” on the log that have not yet been externalized to the database are
written (they are contained in dirty cache buffers), and the active log becomes an
archive log. When the server restarts, or the database is re-installed, a new active log
file is allocated.
There are usually many archive logs, however. These are simply old active logs that
have filled up and are no longer needed by the server. When LOGBACKUP is
enabled, these logs should be backed up to some stable storage media that is separate
from where active log and database files are stored in order to safeguard their
integrity. Of course, prior to this backup, some of the archive logs will still exist as
files in the same directory or DBAREA as the active log, while waiting for the next
log backup.
Note that log file names are formed by using a sequential number for the first part of
the name, allowing for log files from 1 to 99,999,999 to exist. Since this number is
reset to 1 whenever the database is reorganized, there is little likelihood of it ever
rolling over. Appended to this number is an extension type of LOG, which identifies
the file as a log file.
Log files are stored in a subdirectory of the LOGDIR directory that is named the same
as the database which owns the log file. This convention allows for identification of
the owning database only so long as the log file is in a subdirectory that applies only
to that database. Merging the log files of multiple databases into one subdirectory will
eliminate the ability to ascertain which log files belong to which databases. If these
log files are subsequently needed to ROLLFORWARD, neither the DBA nor
SQLBase will be able to ascertain which log files to make available to the rollforward
process. For this reason, DBAs should take care not to allow the log files of multiple
databases to become merged in a single directory.
[Figure 14B diagram - A sequence of log files, the rightmost being the active log. The files contain START and COMMIT records for transactions 45, 47, 48, and 49, interspersed with periodic checkpoint (CKPT) records.]
Figure 14B - The active log, 37.log in this case, is always pinned. The prior log,
36.log, is pinned because it contains log images for active task 49. The 35.log is
pinned because it contains a checkpoint record that could be needed for restart. Note
that 36.log would be pinned even without transaction 49, due to 35.log being pinned.
SQLBase periodically reevaluates the status of all pinned log files and deletes those
that no longer meet one of these criteria. Log file status evaluations are performed
whenever one of the following events occurs:
• At log rollover, usually when the current active log fills up and a new log
must be created. This action can also be forced to occur even when the current
log is not full through execution of the SQLTalk RELEASE LOG command.
Note that some SQLBase commands issue a RELEASE LOG implicitly, such
as the BACKUP SNAPSHOT command.
• When log files are backed up through the SQLTalk BACKUP LOG
command, or the equivalent API call.
• When the system’s “next log to backup” parameter is reset manually through
the use of the SQLTalk SET NEXTLOG command, or the equivalent API
call.
• When logging for the database stops because either the server was shutdown
or the SQLTalk SET RECOVERY OFF command is executed.
These are the only events that cause SQLBase to consider deleting pinned log files.
Performing an evaluation of the logs’ pinned status whenever a transaction performs a
commit or rollback would cause excessive overhead and slow down transactions, and
is therefore not done by SQLBase.
Record types
The SQLBase logging manager is an integral part of the SQLBase DBMS. Whenever
a database program executes, it creates before and after images for those portions of
database rows that are modified in some manner. These modifications are performed
through execution of the INSERT, DELETE, or UPDATE DML commands. Other
commands which update the system tables in the database, including DDL and DCL
commands, also cause the before and after row images of the catalog tables they
update to be logged. Also written to the log at the appropriate times are control
records indicating when the program connected to the database (start record), and
when it issued a COMMIT or ROLLBACK command (commit and abort records).
These log records allow SQLBase to maintain transaction integrity and perform point
in time recovery operations. In addition, SQLBase writes a number of system control
records to the log file that allow it to maintain the integrity of the page allocation
mechanism and enhance the ability to perform system restarts.
Figure 14C - The composition of the log file. Each log file contains many variable
length records such as LR1 through LR5, above. Each record may be one of several
types. An expansion of the image fragments for the data page log record type is
shown.
As shown in figure 14C, each log file physically contains many variable length
records. Each record consists of a header, a variable portion, and a trailer. The log
record header is 34 bytes in length, and contains the following fields:
• Type: the format type of the log record, such as those explained below.
• Length: the total length of the log record.
• Timestamp: the point in time that the log record was written.
• Transaction: the ID of the transaction which caused the log record to be
written.
• Chaining information: a pointer connecting two related log records.
Each log record is of one particular type, such as:
• Start transaction
This log record is written whenever a program first connects to the database,
thereby starting a transaction. There is no variable portion to the start
transaction log record type.
• Commit transaction
This log record is written whenever a program issues a COMMIT command.
There is no variable portion to the commit transaction log record type.
• Abort transaction
This log record is written whenever a program issues a ROLLBACK
command, following completion of the undo operation. There is no variable
portion to the abort transaction log record type.
• Data page modification
Any change made to a table’s data rows will cause this record to be written.
Note that only those columns that actually change in the row will be logged.
These changes are made through the execution of the DML commands
INSERT, UPDATE, and DELETE (an exception to this case is system
catalog tables, which have changes made to them through the execution of
DDL commands). The variable portion of this record type has two major
components:
• Before fragment
The before fragment is written whenever a previously existing row is
modified through execution of the UPDATE or DELETE DML
commands. The fragment has a 5 byte header, followed by the before
images of the changed row columns, and ends with a 2 byte trailer.
• After fragment
The after fragment is written whenever a DML command such as
INSERT or UPDATE causes changed data to remain on a database page.
This fragment has a 5 byte header, followed by the after images of the
changed row columns, and ends with a 2 byte trailer.
• Index page modification
Whenever a program issues a DML command that INSERTs or DELETEs a
row in a table that has an index associated with it, or UPDATEs columns which
are the subjects of an index, then some change will be required to be made to
the index structure. These changes are logged as index page modifications,
with a format similar to the database page change log record type. Each time
a node of the index tree is modified, an index page modification log record is
written which contains the before and after fragments of the changed portion
of the index node. Note that CLUSTERED HASHED indexes are not subject
to this logging activity, since they have no index tree structure.
• Checkpoint
A checkpoint log record is written during each system checkpoint. The
variable portion of the record contains a 22 byte entry for each active
transaction connected to the database at the time the checkpoint was
processed.
• Page allocation
This log record type is written whenever pages are allocated from the
database. There are usually three of these log records written for each page
allocation action, the largest of which contains the after image of the
allocated page’s header and footer, which is about 60 bytes long. The other
two log records record the changes made to other pages which are chained to
the newly allocated page; each of these log records contains about 8 bytes of
variable information.
• Long VARCHAR object
These log records are written whenever a column whose type is LONG
VARCHAR is inserted, modified, or deleted. The log record contains one
page that has been allocated to store the LONG VARCHAR column in the
variable portion of the log record. If the column is longer than one page, then
multiple log records will be written, each containing one of the pages
allocated for that column.
each after image of a database change is checked against the database to ensure that it
was successfully written, and if not, then the redo process writes it to the database. If
the end of the log is encountered, any active transactions that remain (those for which
no termination record has been found) are the subject of undo processing, as if they
had been the subjects of a transaction failure.
One important characteristic of the undo and redo processes in SQLBase is that they
must be capable of recovering the database to its committed state regardless of the
status of updated pages in cache. Cache is the volatile system memory used as a buffer
area for database pages. A large quantity of cache memory can greatly accelerate the
performance of SQLBase by decreasing the number of read and write operations that
must wait for external disk accesses to complete. The maintenance of the cache
buffers is the responsibility of the Cache Manager (CM), which performs fetch
operations to return the contents of cache to SQLBase, and flush operations to dispose
of the contents of cache pages. In SQLBase, system efficiency is enhanced by
allowing the CM to make the majority of decisions regarding the timing of cache
operations, such as when to flush pages and when to write “dirty” pages to disk
(“dirty” cache pages are those pages which have been updated in the buffer but not yet
written to disk). In order for the undo and redo processes to work successfully in this
environment, SQLBase must follow two rules regarding the interplay between the
CM and external storage:
• Undo rule (also called the write ahead log protocol). This rule states that
before an uncommitted value may be written to the external database
(DASD), its prior value must be saved in stable storage (the log). This rule
ensures that no previously committed row can be overwritten by an
uncommitted row without its contents being saved in stable storage for
possible undo processing. This means that the log record containing a before
image of an updated row must be written prior to the CM’s flushing of the
dirty cache buffer containing the updated page. The CM enforces this rule by
placing the log record pointer within the dirty cache buffer, and not allowing
flushes of cache buffers whose associated log records have not yet been
written.
• Redo rule. When processing a transaction’s commit, all write operations must
be completed to stable storage. This can consist of either the log file or the
external database. This rule ensures that committed values are present in
some form of stable storage, and not just within the contents of dirty cache
buffers. SQLBase flushes log buffers at commit time to ensure compliance
with this rule.
pinned to disk to ensure their availability. The following tasks are performed by
SQLBase when creating a checkpoint:
• Write to the log a list of transactions which are currently active, or processing.
• Flush all dirty cache buffers which have not been written since before the
prior checkpoint. If the server has been busy, many of these cache buffers will
have been written during the time interval since the last checkpoint because
the server needs to re-use cache pages based on the least recently used (LRU)
algorithm. If this is the case, then few cache pages should require flushing at
this checkpoint, allowing the checkpoint to be processed faster. Alternatively,
if the server has been largely inactive since the last checkpoint, there may be
numerous pages containing data updated prior to that checkpoint that must
now be written to disk. If this is the case, however, the server will likely still
be relatively inactive and therefore the work associated with writing these
buffers should not have a significant impact on current transactions. The use
of this fuzzy checkpointing feature allows SQLBase to perform checkpoint
processing with a low likelihood of negatively affecting performance.
When SQLBase starts, and discovers that a restart is required, the following steps are
performed:
1. The next to the latest checkpoint record in the active log is located. This is called
the penultimate checkpoint record.
2. Redo processes all transactions following the penultimate checkpoint marker.
3. During the redo processing, the last checkpoint record in the log will be
encountered. This checkpoint contains a list of all active transactions at that point
in time. Redo uses this to track all transactions from this point on in the log.
4. When a transaction ends during redo processing, it is removed from the list of
active transactions.
5. Eventually, redo will encounter the end of the log file.
6. Undo now reads the log backwards from the end to process all entries in the active
transaction list. Since these transactions had not been committed prior to the end
of the log, they must be rolled back.
7. When undo encounters the start control record of the last transaction entry in the
active list, recovery is complete and the database is available for new connections.
This process ensures that the database is returned to a consistent state, both in terms
of containing data only from committed transactions and not losing the contents of
any write-delayed cache buffers.
for creation of the backup image file or the operating system has to provide a
technique to redirect the output directly to a tape device.
When using off-line backup techniques to backup partitioned databases, extra care
must be taken to ensure that the MAIN database remains synchronized with the
partitioned database. This means that during backup, the server must be shut down,
and the partitioned database and the MAIN database must both be backed up prior to
restarting the server. When performing a restore in this scenario, the server must be shut down, and both the partitioned and MAIN databases restored from the matching set of backups. Failure to maintain this parallelism between off-line
backup and restore techniques will cause SQLBase to be unable to determine the
proper location of the database suballocations within the DBAREAs assigned to the
partitioned database, which will result in unrecoverable database corruption.
When database backup and recovery procedures are implemented using the SQLTalk
BACKUP, RESTORE, and ROLLFORWARD commands, SQLBase modifies the
contents of the MAIN database as necessary to ensure that they accurately reflect the
current state of the suballocations within the partitioned database’s DBAREAs.
Therefore, the DBA can be unconcerned about explicitly managing the backup and
recovery of the MAIN database in this situation. Of course, backing up the MAIN
database might still be considered as a part of a disaster recovery plan that allows a
site to restore server operations on another machine that does not have an initial
installation of SQLBase software.
Chapter 15
Introduction to Query
Optimization
In this chapter we:
• Follow the evolution of data access languages from their procedural roots to
their modern declarative implementation represented by the SQL language
used by SQLBase.
• Examine the various categories of query optimizers that exist in relational
database management systems.
Query optimization
The query optimization process determines the access path for all types of SQL DML
statements. However, the SQL SELECT statement (a query) presents the biggest
challenge to access path selection; therefore, the process is commonly referred to as
query optimization rather than data access optimization. Further, it is worth noting
that the term query optimization is a misnomer in that there is no guarantee that the
query optimization process will produce an optimal access path. A more appropriate
term might be query improvement: finding the best possible access path given the known costs. However, we use the standard term to conform to convention.
Early relational database systems processed queries in a simple and direct manner
without attempting to optimize either the query itself or the access path. This resulted
in unacceptably long query processing times, and led to the belief among some
application programmers that relational database systems were not practical for real
world applications. To speed query processing times, a great deal of research and
testing was carried out to discover ways to increase the efficiency of query
processing. Query optimization can thus be defined as the sum of all techniques that
are employed to improve the efficiency of processing queries.
Syntax optimization
The first progress in optimizing queries was in finding ways to restate the query so
that it produced the same result, but was more efficient to process. For example,
consider the following query:
SELECT VENDOR_CODE, PRODUCT_CODE, PRODUCT_DESC
FROM VENDOR, PRODUCT
WHERE VENDOR.VENDOR_CODE = PRODUCT.VENDOR_CODE AND
VENDOR.VENDOR_CODE = '100';
The most direct way to process this query is:
1. Form the Cartesian product of the vendor and product tables.
2. Restrict the resulting table to rows that meet the WHERE condition.
3. Project the three columns named in the SELECT clause.
If the vendor table has 50 rows and the product table has 1000 rows, then forming the
Cartesian product would require 50,050 rows to be read and written. Restricting the
result would require 50,000 more rows to be read from the resulting product, and if
20 rows meet the WHERE condition, 20 rows would be written. Projecting the result
would then require 20 additional reads and writes. Processing the query in this
manner would require 100,090 rows to be read and written.
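By contrast, if the query is restated so that the restriction on VENDOR_CODE is applied before the join (an illustrative calculation, assuming vendor codes are unique), the VENDOR table is read once (50 rows) to produce the single qualifying vendor row, the PRODUCT table is read once (1,000 rows) to find the 20 matching rows, and the projection again requires 20 reads and writes. The whole query is then processed with roughly 1,100 row operations instead of 100,090. Syntax optimization seeks exactly this kind of restatement.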
Rule-based optimization
As progress was being made in improving the processing of queries, other efforts to
improve table access were taking place. These improvements included indexing
techniques and hashing. However, these improvements added complexity to query
processing. For example, if a table had indexes on three different columns, then any
one of these indexes could be used to access the table (along with sequential table
access).
In addition, many new algorithms to join tables began to appear. The two most basic
join algorithms are:
• Nested Loop Join - A row is read from the first table, called the outer table,
and then each row in the second table, called the inner table, is read as a
candidate for the join. Then the second row in the first table is read, and again
each of the rows in the inner table is read again, and so on until all rows in the
first table are read. If the first table has M rows, and the second table has N
rows, then M x N rows are read.
• Merge Join - This join method assumes that the two tables are sorted (or
indexed) so that rows are read in order of the column (or columns) on which
they will be joined. This allows the join to be performed by reading rows
from each table and comparing the join column values until a match is found.
In this way, the join is completed with one pass through each table.
Further, join operations are both commutative and associative; consequently, it is
theoretically possible to perform joins in any order. For example, all of the following
statements are equivalent:
(table_a JOIN table_b) JOIN table_c
table_a JOIN (table_b JOIN table_c)
(table_a JOIN table_c) JOIN table_b
However, different access paths, join algorithms, and join orders may result in
significantly different performance. Consequently, when joining several tables each
with multiple indexes, there exist literally hundreds of possible combinations of join orders, join algorithms, and access paths to select from. Each produces the same result
but with varying performance characteristics.
The first method of dealing with this complexity was to establish heuristic rules for
selecting between access paths and join methods; this approach is called rule-based
optimization. With this approach, weights and preferences are assigned to the
alternatives based on principles that are generally true. Using these weights and
preferences, the query optimizer works through the possible execution plans until it
arrives at the best plan according to its rules. Some of the rules used by this type of
optimizer are based on the placement of variable tokens, such as table or column
names, within the query's syntax structure; moving these names can sometimes
produce a dramatic difference in the performance of the query. For this reason,
rule-based optimizers are said to be syntax sensitive, and one of the methods for tuning
this type of database system involves moving tokens to different positions inside a
statement.
Rule-based optimization provides satisfactory system performance in those situations
when the heuristics are accurate. However, often the general rules are inaccurate. To
detect these situations, the query optimizer must consider the characteristics of the
data such as:
• The number of rows in the table
• The range and distribution of values for a given column
• The row width and consequently the number of rows per physical page on
disk
• The height of an index
• The number of leaf pages in an index
These data characteristics can greatly affect the efficiency of processing a query. This
resulted in the next type of optimization.
Cost-based optimization
Cost-based optimization is similar to rule-based optimization, except that cost-based
optimizers use statistical information to select the most efficient execution plan. The
cost of each alternative execution plan is estimated using statistics such as the number
of rows in a table and the number and distribution of the values in a table column. The
cost formulas typically consider the amount of I/O and CPU instructions required to
perform the execution plan. The statistics are stored in the system catalog and
maintained by the database system.
To understand how statistics can make a difference, consider the following query:
SELECT CUST_NBR, CUST_NAME
FROM CUSTOMER
WHERE STATE = “FL”;
If an index exists on the STATE column, a rule-based optimizer would use this index
to process the query. However, if ninety percent of the rows in the customer table have
FL in the STATE column, then using the index is actually slower than simply
processing the table sequentially. A cost-based optimizer, on the other hand, would
detect that using the index would not be advantageous.
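A rough calculation shows why (the row and page counts here are assumed for illustration only). If the customer table holds 10,000 rows stored on 500 data pages, a table scan costs about 500 page reads. Retrieving the roughly 9,000 rows whose STATE value is FL through the index could cost up to one data page read per qualifying row, plus the reads of the index pages themselves, which is far more expensive. A cost-based optimizer that knows the table's page count and the distribution of STATE values can make this comparison; a rule-based optimizer cannot.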
The cost-based optimization approach today represents the state-of-the-art in query
optimization techniques. Most high quality relational database systems, including
SQLBase, employ a cost-based optimizer.
Because steps 1 and 2 take place without regard to the actual data in the tables used, it
is not necessary to repeat these steps if the query needs to be recompiled. Therefore,
most optimizers will store the result of step 2 and use it again when it reoptimizes the
query at a later time.
Chapter 16
Overview of the SQLBase Query Optimizer
In this chapter we:
• Review basic relational operators implemented within the SQL data access
language.
• Examine the physical operations used by SQLBase to perform the relational
operators.
• Learn how to view SQL statements as query trees, which simplifies the
processing steps required for their execution.
Introduction
In the next three chapters we examine the query optimizer contained within
SQLBase. The SQLBase query optimizer (often simply called the optimizer) is a
specific implementation of a cost-based optimizer, meaning its decisions on access
path selections are based on the estimates of potential costs associated with executing
a particular access plan.
The basis of these cost estimates is the statistics contained within the SQLBase
catalog and control areas within the database. The advantage of this approach is that
SQLBase can make its access plan decisions using very recent information about the
actual database contents. Just how recent and accurate this information is depends on
the procedures you follow in administering your databases.
This chapter reviews the basic relational operators implemented by SQL followed by
an overview of the techniques that SQLBase may use when actually executing these
operations. This set of techniques, called physical operators, is the set of actions the
optimizer considers when building access plans.
While we summarize the actions taken to perform each physical operator, we will
also briefly summarize the major factors comprising the cost of the operator (which is
usually I/O). Finally, the structure of an access plan is explained, with an emphasis on
the heuristics used by the optimizer when building a set of access plans on which to
perform cost analysis.
Throughout the next three chapters, examples that illustrate optimizer operations
generally use the SQL SELECT statement. You should keep in mind, however, that
all SQL Data Manipulation Statements (DML) are processed by the optimizer. This
includes not just the SELECT statement, but the INSERT, UPDATE, and DELETE
statements as well.
The steps involved in optimizing these other statements are virtually identical to those
of the SELECT, the only difference being the operations used to perform locking and
logging during update operations. These exceptions are covered in the Concurrency
Control and Logging and Recovery chapters. The benefit of using the SELECT
statement for examples is that the SELECT offers many more relational operations,
such as the join, which exercise a much wider variety of optimizer processes than the
other DML statements.
A secondary advantage of using SELECT statements for examples is that the
execution of a SELECT statement creates a result set which visually shows the final
output of SQLBase's processing. This contrasts with the other DML statements,
which change the state of the database but whose changes are not visible externally
except through the return of error codes or other environment variables. Only a
subsequent SELECT statement will reveal the changes made to the contents of the
database by the execution of one of these DML commands.
Relational operations
The basic operators of the relational algebra were defined by E. F. Codd when he
published his landmark paper delineating the relational model access language in
1972. The operations defined in this relational algebra form the logical operators that
must be processed by the optimizer. These logical operators are implicit in the syntax
of SQL DML statements submitted to SQLBase for processing.
The following relational algebra operations have one or more relations as inputs, and
one relation as output. A relation is simply a table of some degree n, where n is the
number of attributes (or columns) in the table. These operations are broken down into
two classes: traditional set operations and special relational operators. While the
traditional set operations are sufficient to process most multiple relation queries, the
special relational operators specify the desired results more efficiently. This last class
of operators is the basis of the SQL language, and is examined in more detail
throughout the remainder of this chapter. The traditional set operations are defined
here to provide background information.
• Traditional set operations
An important distinction of the traditional set operators is that (with the
exception of the Cartesian product) the relations used as input to these
operations must have the same number of columns, with each matching
attribute defined on the same domain. This means that each table must have
the same number of columns, and that each pair of columns in the same
position (for example, column 1 of each table) must be defined with exactly
the same data type and scale.
• Union
The union of two relations consists of all rows that belong to either or both
relations. It can be formed by appending one table to the other and
eliminating duplicate rows.
• Intersection
The intersection of two relations consists of the rows which appear in
both relations. Any row which appears in only one table, but not the
other, is not a member of the intersection set.
• Difference
The difference between two relations is all rows which belong to the first
table but not to the second. Note that this operation is not commutative
like the other traditional set operators, meaning that A – B ≠ B – A .
• Cartesian product
The Cartesian product of two relations is the table resulting from
concatenating each row of one table with every row of the other table.
The table resulting from this operation contains as many rows as the
product of the rows in the input tables. This means that the Cartesian
product of a 15 row table and a 50 row table contains 750 rows. As
mentioned previously, this is the only traditional set operation where the
format of the two tables can differ.
• Special relational operations
SQL generally uses these operations more than the traditional set operations:
• Projection
The projection operation limits the columns of a table that are referenced
in a SQL statement.
• Selection (restriction)
A select limits the result set to only those rows that meet the
qualifications. The selection criteria compare one or more columns of
the table to one or a set of constants or expressions.
• Join
The join operation creates a result set by concatenating rows together
from multiple tables. These concatenated rows must also meet the
specified join criteria, which is a comparison between one or more
columns of the tables involved. These comparisons differ from the select
operation in that they compare columns of different tables.
Because of the central role played by these special relational operators in SQL
statements, we will now examine them in more detail.
Projection
When a SELECT references specific columns of one or more tables, only those
columns are returned:
SELECT NAME, PHONE FROM CUSTOMER;
The alternative to performing a projection is to use the asterisk (*) wild card notation
which causes all columns to be included:
SELECT * FROM CUSTOMER;
Note that many sites have standards prohibiting this notation from being used in code,
since changing the number of columns in the table can have an adverse effect on the
program containing the statement.
Selection
The select is the row equivalent of the column projection operation. In SQL, the
WHERE clause specifies a selection by naming columns in comparison expressions
with either constants, consisting of one value or a set of values, or expressions that
evaluate to a single value or a set of values. The following are examples of the select
operation:
SELECT * FROM CUSTOMER
WHERE NAME = ‘JONES’;
SELECT * FROM PRODUCT
WHERE STOCK_QTY <= REORDER_QTY;
SELECT * FROM ORDERS
WHERE (STATUS IN (‘C’,’P’,’S’)) AND
(TOTAL_AMT > 1000);
Each of the comparisons contained within the WHERE clause is called a predicate.
Some SQL statements, such as the last example shown here, contain more than one
predicate. When the operation specified in a predicate is performed on a row, it is
called applying the predicate. If the predicate evaluates to TRUE, it is said to be
satisfied, otherwise it is not satisfied. When a statement has more than one predicate,
they must be connected by one of the Boolean operators AND or OR. When all of the
predicates of a statement are connected by AND, they are said to be AND-connected,
or conjunctive. When they are all connected by OR, they are OR-connected, or
disjunctive. This terminology of predicates plays an important role in how they are
used by the query optimizer, as we shall see later.
Note that the form of the WHERE clause that compares columns of two different
tables is not a select operation, but rather a specification for a join operation, which is
covered next.
Join
A join creates its result set from two or more tables in a similar manner as the
Cartesian product mentioned earlier; it concatenates each row of one table with every
row of the other table. The difference between the Cartesian product and the join is
that only rows meeting some specified join criteria are eligible for inclusion in the
result set. This is logically equivalent to building the Cartesian product, then
performing a select operation against it. However, the join allows for a more efficient
operation with most query optimizers because they can eliminate rows from the result
set without first forming the entire Cartesian product as an intermediate result.
To perform a join, use the SQL FROM and WHERE clauses. The FROM clause must
name each input table to the join. The WHERE clause specifies the join criteria, by
comparing one or more columns of one table with columns of another table. The
predicates used as join criteria determine how many rows of the Cartesian product
will be bypassed in processing the join. The more restrictive the join criteria are, the
more efficient the join will be when compared to the Cartesian product (note that the
Cartesian product is formed by not specifying any join criteria at all). Here are some
examples of joins:
SELECT NAME, AMOUNT
FROM CUSTOMER, RECEIVABLE
WHERE CUSTOMER.CUST_NO = RECEIVABLE.CUST_NO;
SELECT NAME, QTY, DESC
FROM CUSTOMER C, ORDER O, PRODUCT P
WHERE (C.CUST_NO = O.CUST_NO)
AND (O.PROD_NO = P.PROD_NO);
In the second example, the ‘C’, ‘O’, and ‘P’ letters allow you to qualify column
names with their correct tables without having to spell out the entire table name. In
SQL, these are called correlation variables. Correlation variables are often used in
joins because the presence of multiple table names in the FROM clause means that
most column names must be qualified.
Two important characteristics of the join operation are that it is both commutative and
associative in nature:
A JOIN B = B JOIN A (commutative)
A JOIN (B JOIN C) = (A JOIN B) JOIN C (associative)
(A JOIN B) JOIN C = B JOIN (A JOIN C) (both)
The effect of these properties is that when processing a join of more than two tables,
the optimizer may select any sequence of two table joins to complete the operation.
This allows the optimizer to treat a join of any number of tables as a series of two
table joins, which avoids having to have special physical operations available to join
any specific number of tables.
The type of comparison performed is often used to further refine the name of the join.
These sub-classifications of the join are briefly covered below.
Equijoin
The most common example of the join is the equijoin, which uses the ‘=’ operator.
This version of the join operation is used whenever business relationships are used to
combine information from two tables. These joins occur by setting a primary key in
the parent table equal to the foreign key in the child table.
SELECT C.CUST_NO, C.CUST_NAME, O.ITEM_NO, I.DESC
FROM CUST C, ORDER O, ITEM I
WHERE (C.CUST_NO = O.CUST_NO) AND
(O.ITEM_NO = I.ITEM_NO);
Natural join
This is a particular form of equijoin where all columns of the two tables are included,
with the exception of one set of the key columns, which are duplicated across the
tables. This is a full equijoin between two tables without any projection except for the
elimination of the duplication of the key columns.
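As an illustration, using the CUSTOMER and RECEIVABLE tables from the join examples above, and assuming for the sake of the example that their only other columns are NAME and AMOUNT, a natural join could be expressed as an equijoin that simply does not repeat the shared key column:
SELECT C.CUST_NO, C.NAME, R.AMOUNT
FROM CUSTOMER C, RECEIVABLE R
WHERE C.CUST_NO = R.CUST_NO;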
Semijoin
The semijoin operation is equivalent to an equijoin with a subsequent projection that
only includes the columns of one of the join tables. This is useful when rows of a table
are needed that meet a selection criteria that is defined in terms of membership within
another table’s foreign key column. An example would be the set of all products
which have been ordered during January of 1993:
SELECT P.PROD_NO, P.PROD_DESC
FROM PRODUCT P, ORDER O
WHERE (O.PROD_NO = P.PROD_NO) AND
(O.ORD_DATE BETWEEN JAN-1-1993 AND JAN-31-1993);
The semijoin is also useful in a distributed environment when the two tables to be
joined are at different nodes on the network.
Outerjoin
The outerjoin allows the non-participating rows of one of the joined tables to be
included in the result set. These rows are termed non-participating because they
contain key values which are not referenced by any rows of the other
table. For example, a row in the product set which contains a product number that has
never been ordered by any customer would be a non-participating row in a join of the
product table with the order table. In SQLBase, these rows could be included in the
result set anyway through the specification of the outerjoin operator, ‘+’, on the order
table key as in the following example:
SELECT *
FROM PRODUCT P, ORDER O
WHERE P.PROD_NO = O.PROD_NO(+);
Selfjoin
A selfjoin is an equijoin of a table with itself. This is also called a ‘recursive join’.
Note that in this case, you must assign correlation variables in order to avoid
ambiguous column references, as in the following example which lists the names of
all employees and their assigned managers:
SELECT E.NAME, M.NAME
FROM EMPLOYEE E, EMPLOYEE M
WHERE E.MNGR_NO = M.EMPLOYEE_NO;
Aggregation
While aggregation was not originally specified as a relational operation, its inclusion
as a standardized capability in SQL served to make it a commonplace operation. The
purpose of aggregation is to represent a table of information through some derived
statistic, such as the sum or mean (average) of a set of numbers.
SQL supports two types of aggregation. The simplest is scalar aggregation, which
derives a single output value from a column of a relation. Scalar aggregation in SQL
is performed through the inclusion of an aggregate function in a query that does not
have a GROUP BY clause. Examples of scalar aggregation include:
SELECT SUM(SALARY) FROM EMPLOYEE;
SELECT COUNT(*)
FROM EMPLOYEE
WHERE NAME = ‘DAVE’;
The first query produces the total salary paid to all employees, while the second query
tells how many employees are named ‘DAVE’. The distinguishing characteristic of
these queries is the presence of some aggregate function in the SELECT list which is
allowed to range over the entire result set of the table access (after performing the
project and select operations).
The second type of aggregation supported by SQL is function aggregation. This type
of aggregation uses the same aggregate functions as scalar aggregation. The
difference between the two is that function aggregation always creates another table
as output. This contrasts with scalar aggregation, which simply returns a single value
as output.
The SQL syntax of a statement performing function aggregation always includes the
GROUP BY clause. The columns named in the GROUP BY clause become, in effect,
the key to the new table that is output from the statement. Examples of function
aggregation are:
SELECT DEPT, AVG(SALARY)
FROM EMPLOYEE
GROUP BY DEPT;
SELECT @MONTH(BIRTH_DATE), COUNT(*)
FROM EMPLOYEE
GROUP BY @MONTH(BIRTH_DATE);
Here the first query lists the department and average salary for each department, while
the second lists the birth month and number of employee birthdays in each month
throughout the year. In each of these statements, the aggregation function only ranges
over the rows of the input table that have equal values in the columns specified in the
GROUP BY clause.
Sort
The sort operation is performed whenever a table must be placed into a specific
sequence according to one or more columns, which are collectively called the sort key.
This physical operator has one table as input, and one table as output. The basic
technique SQLBase uses for sorting is the postman sort, which is a radix sort.
Aggregation
The aggregation operator also has one table as input, and one table as output. The
SQLBase physical aggregation operator performs a serial scan of the qualifying data
rows, with the aggregate function calculation being performed for each row. When
SQLBase performs simple scalar aggregation, the input rows can be in any sequence.
However, when aggregate functions are being evaluated through SQL statements
containing GROUP BY clauses, the input table to the aggregation operation must be
in sequence by the columns specified in the GROUP BY. This prerequisite is satisfied
by the query optimizer through the placement of either a sort operator prior to the
aggregation, or through the use of a table access technique which returns table rows
in a specific sequence (discussed later in this chapter).
Table scan
This is the simplest technique for accessing a base table. Each data page associated
with the table is read in succession. For each page, the rows on that page are extracted
for processing. Note that this extraction may require accessing additional pages in the
form of either extent pages or overflow pages (in the case of hashed tables). In the
absence of any extent or overflow pages, the number of disk accesses required for a
table scan is equal to the number of data pages assigned to the table, including pages
required for any LONG VARCHAR columns that may be specified.
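For example, if a table occupies 500 data pages and has no extent, overflow, or LONG VARCHAR pages (a purely illustrative figure), a table scan costs roughly 500 page reads, regardless of how many of the rows ultimately satisfy the query's predicates.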
restrictive enough to make the extra I/Os for the index worthwhile. Also, if an
inequality predicate is used for some index column, this technique uses the index tree
to find the starting leaf, and then scans the sequence set forward or backward from
that point.
The number of I/Os incurred by this access technique is one for each level in the
index, plus one for each index node that is accessed, plus one for every row that must
be retrieved. The optimizer estimates costs associated with this access technique
using an index statistic called the selectivity factor, which is explained in detail in the
next chapter. Briefly, the selectivity factor describes the expected number of rows in
the table that would satisfy a predicate.
Hash access
A table which has a clustered hashed index placed on it may be accessed via this
index whenever all of the columns contained within the index are used in equality
predicates in an SQL statement. As explained in Chapter 10, Hashed Indexes, the
hashing technique lacks any generic or scanning capabilities. This is why the hashed
index key must be fully specified with the equality predicate in the SQL statement.
Whenever the restrictive conditions placed on the use of the hash access technique are
met, this is probably the quickest possible alternative for table access and will likely
be selected by the optimizer. The I/O cost of this technique is one I/O for each row, or
set of rows, which has the key specified in the predicates. Extra I/O costs may be
incurred for searching a chain of overflow pages if the hashed table has experienced a
large number of collisions, and this condition may cause the optimizer to reject the
hash access technique.
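For example, assuming a clustered hashed index has been created on the CUST_NO column of the CUSTOMER table (the index and key used here are illustrative), the following statement fully specifies the hash key with an equality predicate and is therefore a candidate for hash access:
SELECT NAME, PHONE
FROM CUSTOMER
WHERE CUST_NO = 12345;
A predicate such as CUST_NO > 12345 could not use the hashed index, because hashing provides no scanning capability.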
Join operations
The relational join is one of the most frequently executed operations of any relational
database. It can also be extremely time consuming due to its data intensive nature.
This is because the most simplistic implementation of a join requires every row of
one table to be compared to every row of another table. In order to execute joins as
quickly and efficiently as possible, SQLBase selects from among a number of
alternative techniques. All of these operators accept two tables as input, and produce a
single table as output. These input and output tables can be any combination of
database tables, temporary tables, or final result sets.
Terminology that is common to all join algorithms is the naming of the two tables
involved. Since one row of a table must be compared to all rows of another table, the
table for which a row is held is called the outer table. The table which is scanned
while a row is held from the outer table is called the inner table. These terms first
originated with the loop join algorithm, where they are especially appropriate, as we
shall see next.
the number of data pages in the outer table plus the number of rows in the inner table.
This only occurs if every row in the inner table is referenced by the outer table. When
not all rows of the inner table are referenced in the outer table, there is a reduction in
the number of I/Os.
Hash join
This technique can be used for equijoins only, and does not require any indexes on the
join criteria for the two tables. The algorithm first performs the setup phase, when it
scans the outer table and places each row into a hash table at a location determined by
applying a hash function to the join criteria. The smaller table is always chosen as the
outer table, in order to minimize the memory demands of the hash table.
In the second phase, called the probe, the inner table is scanned and the same hash
function used in the setup is applied to the join columns in order to locate a possible
match. Any rows found in the hash bucket from the first table are then examined. The
rows that satisfy the join criteria are appended to the inner table row and written to the
output table.
This can be a very rapid join technique, especially when the smaller table can be
hashed into a table fitting in memory. When this is the case, the I/O required is the
sum of the data pages of the two tables. Otherwise, additional I/O is required to
partition the hash table into segments, and manage those segments in a temporary file.
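As a rough example (the page counts are assumed for illustration), a hash join of an outer table occupying 40 data pages with an inner table occupying 400 data pages costs about 440 page reads when the hash table built from the outer table fits entirely in memory. If it does not fit, the additional I/O needed to partition the hash table and re-read its segments from a temporary file must be added to this figure.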
Imbedded operators
The two most common relational operators, select and project, are implemented
within the other physical operators. This allows the optimizer to perform these
operations at the earliest possible time in an access plan, which immediately reduces
the amount of data that the remainder of the plan must handle. The implementation of
the project operator is further enhanced by performing index-only access when
appropriate, as explained in the following paragraphs.
Select
The select operation filters out rows that do not satisfy the predicate logic of
the query. These predicates are evaluated as early as possible when executing the
query, which is usually the first time the row containing the column named in the
predicate is accessed. Each file access physical operator in SQLBase accepts
predicates to perform this row filtering when accessing a table. If the predicates in the
query are conjunctive, or and-connected, a form of logic called short-circuit testing
may be performed. This technique allows the row to be discarded as soon as the first
unsatisfied predicate is found.
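For instance, in the conjunctive query below (which reuses the ORDERS table from the earlier select examples), a row whose STATUS is not 'C' can be discarded as soon as that first predicate fails, without ever evaluating the TOTAL_AMT comparison:
SELECT * FROM ORDERS
WHERE (STATUS = 'C') AND
(TOTAL_AMT > 1000);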
When the SQLBase optimizer estimates the cost of a query, the effect of the
predicates used for the select operation is taken into account through a statistic called
the selectivity factor. This figure represents the proportion of rows in a table that
would be expected to satisfy a particular predicate. The optimizer makes its estimate
of selectivity factor based on statistics associated with the table’s indexes and certain
heuristics concerning the predicate’s comparison operator. We will examine this
process in more detail in the next chapter.
Project
The project operation is implemented within other physical operators by passing the
list of columns to be retrieved to any file access operator. That operator then only
includes those columns in its output table. Performing the project at the earliest
possible time in the query plan reduces the width of intermediate tables and result
sets, thereby saving space both in memory and temporary disk files. Note that one
side effect of performing the project at page access time is a possible reduction in
I/Os due to avoiding access to extent or LONG VARCHAR pages which do not
contain needed columns.
can be satisfied without access to the table’s data pages. This can result in significant
I/O savings, depending on the number of rows which satisfy the predicates of the
query.
Query trees
One way to view a query that makes it easier to understand is as a query tree. This is
a tree structure where the final result of the query is at the top of the tree
(the root), and the database tables involved in the query are at the leaves. The
intermediate nodes of the tree, between the leaves and the root, represent physical
operations that are performed during the query execution. When the query plan is
executed, the sequence of node traversal is from bottom to top, and from left to right.
This is the sequence in which the operations are printed when the ‘SET PLANONLY’
option of SQLTalk is executed to print out a query plan for a given SQL statement.
The following example shows a query tree for a SQL statement that gets the names of
authors who published books in 1993:
SELECT A.ANAME
FROM AUTHORS A, BOOKS B
WHERE (B.DATE_PUBLISHED = ‘1993’) AND
(A.ANAME = B.ANAME);
project (A.ANAME)
  select (B.DATE_PUBLISHED = '1993')
    join (A.ANAME = B.ANAME)
      A    B
Figure 16A - A query tree for: Get names of authors who published books in 1993.
An improved query tree can be derived from the tree of Figure 16A by applying the
select first, which reduces the number of rows that are input to the equijoin operation:
project (A.ANAME)
  join (A.ANAME = B.ANAME)
    A
    select (B.DATE_PUBLISHED = '1993')
      B
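To see which access plan SQLBase actually selects for this statement, you can compile it in SQLTalk with the PLANONLY option turned on (a sketch; the exact SET syntax and the format of the plan output may vary between releases):
SET PLANONLY ON;
SELECT A.ANAME
FROM AUTHORS A, BOOKS B
WHERE (B.DATE_PUBLISHED = '1993') AND
(A.ANAME = B.ANAME);
SET PLANONLY OFF;
While PLANONLY is on, the chosen plan is displayed rather than a result set.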
Chapter 17
SQLBase Optimizer Implementation
In this chapter we:
• Discuss the details of how the SQLBase query optimizer performs access
plan analysis.
• Summarize the statistical information available to the SQLBase query
optimizer to develop access plan cost estimates.
• Explain how the query optimizer identifies possible access plans for
processing an SQL statement.
• Review the two cost components SQLBase estimates for each access plan
using statistics about the database:
• Total number of I/O operations.
• Total CPU time required.
Database statistics
When the SQLBase query optimizer performs access plan selection, it uses statistics
about the database to estimate the costs associated with each candidate access plan.
These statistics, which are stored in the SQLBase system catalog tables, are
associated with the various tables and indexes contained within the database. Some of
these statistics are maintained dynamically, meaning that SQLBase keeps them
internally updated during normal server operations. The majority of statistics,
however, are maintained statically, meaning that they are only calculated and stored
when specific events occur.
The event that usually causes these statistics to be refreshed is the ‘UPDATE
STATISTICS’ SQL statement. This command examines a database’s tables and
indexes, calculates the required statistics, and then stores them in the appropriate
tables. Statistics are also created for an index or a table when it is initially created. If
no rows are present at this time, the statistics are set to default values.
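For example, the statistics for a single table, or for the database as a whole, can be refreshed with statements such as the following (a sketch; consult the SQL language documentation for your release for the exact clauses supported):
UPDATE STATISTICS ON TABLE CUSTOMER;
UPDATE STATISTICS ON DATABASE;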
Starting with SQLBase version 6.0, statistics can also be updated directly by the
DBA using the user modifiable statistics feature. With this feature, the access paths
chosen by the optimizer reflect the cost calculations that would be performed if the
database were populated in a manner that would yield the statistics placed in the
catalog by the DBA. In this way, the DBA can influence the optimizer into making
decisions that are based not on the current database contents, but rather on the
database as the DBA projects it will be populated at some point in the future. In order
to calculate and store these statistics accurately, the DBA must have a thorough
understanding of what these statistics mean.
Statistics that are dynamically maintained by SQLBase are used by the optimizer
whenever the DBA has not overridden the catalog statistics using the user modifiable
statistics feature. When you override the catalog statistics with your own calculated
values, the optimizer will then ignore the dynamic internal statistics in favor of your
catalog statistics.
Table statistics
The statistics associated with database tables are maintained in two separate system
catalog tables, each of which is described in the following paragraphs. All columns
not specifically mentioned as being maintained dynamically are static, and are
populated by running UPDATE STATISTICS or at CREATE time.
Statistics in SYSTABLES
The following statistical information is maintained in these columns of the
SYSADM.SYSTABLES table in the system catalog:
• ROWCOUNT
This is the number of rows in the table. This statistic is maintained
dynamically in the table’s control page, but is only externalized to the
SYSTABLES column when the UPDATE STATISTICS command is
executed.
• PAGECOUNT
This is the total number of data pages in the table. This is also a dynamic
statistic that is externalized when UPDATE STATISTICS is run. Also, when
this column is maintained by the system, it will always be the sum of the next
two columns, ROWPAGECOUNT and LONGPAGECOUNT. If, on the
other hand, the DBA has set these statistics explicitly, then this relationship
may not hold true.
• ROWPAGECOUNT
This is the number of base row pages occupied by the table, plus any extent
pages that may also be allocated. This is another dynamic statistic which is
only externalized on command.
• LONGPAGECOUNT
This is the number of pages allocated to the table for the storage of columns
defined with the LONG VARCHAR data type. This is a dynamic statistic
which is only externalized on command.
• EXTENT_PAGECOUNT
This is the average contribution of data stored in base and extent row pages.
It excludes contribution from long pages.
• AVGROWLEN
This is the actual average length of rows in the table. This can differ
significantly from the defined row length since SQLBase stores all columns
as variable data regardless of the data type used in the column definition. Note
that this row length is only the length of the row as stored on base table and
extent pages, and therefore excludes all LONG VARCHAR columns.
• AVGROWLONGLEN
This is the actual average length of all LONG VARCHAR columns stored in
the table. This column will contain zero if there are no columns of this data
type stored.
Statistics in SYSCOLUMNS
The following statistical information is maintained in these columns of the
SYSADM.SYSCOLUMNS table in the system catalog:
• AVGCOLLEN
This is the actual average length of this column across all rows in the table.
This can differ from the defined column length since SQLBase stores all
columns as variable data. A base row page stores “a long data description”
for every non-NULL long column.
• AVGCOLLONGLEN
This is the actual average length of a LONG VARCHAR column across all
rows of the table. This value is zero if this is not a LONG VARCHAR
column.
Index statistics
The statistics associated with indexes on tables are also maintained in two separate
system catalog tables, as detailed in the following paragraphs. All columns (except
one, the HEIGHT column in the SYSINDEXES table) are maintained statically, and
are populated when UPDATE STATISTICS is run or the index is initially created.
Statistics in SYSINDEXES
The following statistical information is maintained in these columns of the
SYSADM.SYSINDEXES table in the system catalog:
• HEIGHT
The height of the index tree is the number of nodes that have to be read from
the root to the leaf level inclusive. Also, it is called the depth of the tree. The
statistic is also maintained dynamically in the index’s control page. This field
is null for a hash index.
• LEAFCOUNT
The total number of nodes in the bottom level (leaves) of the index is the
leafcount. Also, it is the number of pages in the index’s sequence set. This
field is null for a hash index.
• CLUSTERCOUNT
The total number of page changes that would occur if the entire table was
read through the index’s sequence set. For a perfectly clustered index, this
column is equal to the number of base data pages in the table. At the other
extreme, the highest value for a totally unclustered index is equal to the
number of rows in the table. This field is null for a hash index.
• PRIMPAGECOUNT
The number of primary pages that have been allocated in the base table for
the subject table of a hashed index. It is the same as the number of hash slots
available for distribution of the table’s rows (see Chapter 10, Hashed Indexes,
for more information on the significance of this figure). This field is null for
a B+Tree index.
• OVFLPAGECOUNT
This is the number of overflow pages that SQLBase allocates for the subject
table of a hashed index. When the table and index are initially created, this
number reflects the number of overflow pages that SQLBase pre-allocates.
After that, this number increases as additional overflow pages are required to
handle hashing collisions. This field is null for a B+Tree index.
• AVGKEYLEN
For B+Tree indexes, this is the average length of the key across all entries in
the index. This number is needed because SQLBase maintains all index
entries as variable length, minimum prefix only fields (see Chapter 9, B-Tree
Indexes, for more information on these index entries). This field is null for a
hash index.
Statistics in SYSKEYS
The following statistical information is maintained in this column of the
SYSADM.SYSKEYS table in the system catalog:
• DISTINCTCOUNT
This is the number of distinct keys in the index for that portion of the key from
the first column up to the COLNO of this entry. If every possible combination
of values for a composite key exists, this number is the product of each
successive column’s cardinality. Cardinality is the number of distinct values
that a column has in a table. The simplest example is a column called
SEX_CODE, which has a cardinality of two, for the values ‘M’ and ‘F’.
Consider the following index:
CREATE INDEX XNKSALS
ON EMPLOYEE (DEPT,
SALARY,
YEARS_SERVICE);
The employee table contains 250 rows, one for every employee in the
company. Also, the cardinality characteristics of the columns are as follows:
• DEPT - There are 10 departments.
Selectivity factor
The selectivity factor is a statistic which is computed by the optimizer for predicates
used by the relational select operator (recall that the select operation is implemented
internally in all SQLBase file access operators). In addition, the selectivity factor is
used by the optimizer to determine if an index access path is more efficient than a
table scan. Consequently, SQLBase only calculates the selectivity factor for those
predicates which contain attributes for which an appropriate index exists.
The selectivity factor estimates the number of rows which are returned following the
application of the select predicates on a table of known size. This is critical when the
optimizer is evaluating an index in an access plan. The index will be more efficient
than a table scan only if the selectivity factor multiplied by the number of rows in the
table is less than the number of database pages occupied by the table.
In other words, if most of the table rows have to be retrieved anyway, using the index
only incurs the overhead of reading through the index structure without saving the
need to read the same number of data pages. Since this is the case, the selectivity
factor appears in virtually every I/O costing formula associated with an index access
operation that is used by the optimizer.
Numerically, the selectivity factor is a probability, which means it ranges from zero to
one. Multiplying the number of rows in a table by the selectivity factor associated
with that table’s predicates will directly yield the number of rows which would
survive the select operation, based on the assumption that the values of the table are
evenly distributed among its rows.
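As a worked example (the figures are illustrative), suppose a table has 10,000 rows stored on 500 data pages, and a predicate has a selectivity factor of 0.02. About 10,000 x 0.02 = 200 rows are expected to qualify; since 200 is less than the 500 pages a table scan would read, index access is likely to be cheaper. With a selectivity factor of 0.10, roughly 1,000 rows qualify, and the table scan becomes the better choice.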
Single predicates
Determining the selectivity factor of a single predicate, such as EMPLOYEE_NO =
65, depends on what operator is used in the predicate. The operator affects the
selectivity factor because it determines the relationship between the rows that satisfy
the predicate and the other operand.
The more restrictive the operator, the lower the selectivity factor of the predicate. The
most restrictive operator is equality (=), since only a single column value can satisfy
the predicate. In this case, the selectivity factor is simply the inverse of the cardinality
of the column, which is stored in the DISTINCTCOUNT column of the
SYSADM.SYSKEYS table entry for this index's key.
However, it is not as easy to calculate the selectivity factor for inequality operators.
Consider the following two predicates:
ORDER_DATE > JAN-01-1900
ORDER_DATE > SYSDATE - 1 DAY
Obviously, the first predicate could be expected to return many more rows than the
second when applied to the ORDER table. But how can the SQLBase optimizer
determine this? It turns out that the optimizer calculates the selectivity factor by
accessing the upper levels of the index structure which contains the column in
question (in this example, an index on ORDER_DATE). It then estimates the number
of rows that satisfies the predicate by extrapolating the proportion of index keys at this
index level. This technique, called the BTree scan, allows the SQLBase optimizer to
make reasonably accurate estimations about the results of inequality predicates.
The other factor that plays a role in determining the selectivity factor of a single
inequality predicate is whether the column is compared to a constant or a bind
variable. This is not critical for equality predicates because the optimizer can
calculate the selectivity factor from the DISTINCTCOUNT column with the
assumption of equal value distribution throughout the table. This assumption allows
the optimizer to be insensitive to the actual value of the operand on the other side of
the equality operator. However, for the optimizer to use the BTree scan technique, it
must be able to find a value for the other operand. When the operand is a bind
variable, the optimizer must fall back on an assumed selectivity factor, which is a
hard-coded value.
The following table summarizes the determination of the selectivity factor of a single
predicate, depending on the operator and whether a constant or bind variable is used
as the other operand. Note that where ‘transformation’ is listed, the predicate is
eliminated from the query through semantic transformation to another form.
Operator    Constant          Bind variable
=           1/cardinality     1/cardinality
IN          transformation    transformation
Multiple predicates
When multiple predicates are placed on a single table, the SQLBase optimizer
follows some fixed rules when combining their selectivity factors into a total
selectivity factor for the entire relational select operation. The first rule regards how
many predicates are considered when calculating a total selectivity factor for
predicates containing a mix of equality and inequality comparison operators.
SQLBase follows these rules in determining which predicates to evaluate a selectivity
factor for:
• When the high order columns of the index key are contained within equality
predicates, the optimizer can find the selectivity factor for the multiple
columns by reading the DISTINCTCOUNT column of the lowest order key’s
SYSKEYS row. When the first inequality predicate is encountered, it is
ignored along with any remaining predicates containing columns that appear
in the index. An example of this is:
EMPLOYEE_NO = 45 AND
DEPT = 50 AND
SALARY > 25000
Assume the index is defined on a composite key consisting of
EMPLOYEE_NO, then DEPT, followed by SALARY. The combined
selectivity factor for both the EMPLOYEE_NO and DEPT columns, which
appears in the DISTINCTCOUNT column of the DEPT row of the
SYSKEYS catalog table, is used by the optimizer to estimate the number of
rows that will be returned from the query. The effect of the inequality
predicate on the SALARY column is ignored. For an example of how the
combined DISTINCTCOUNT is calculated for multiple columns, see the
database statistics section earlier in this chapter.
• If the high order column of the index key is the subject of an inequality
predicate, only the selectivity factor of this predicate is evaluated, and any
remaining predicates are ignored (regardless of whether they are inequality or
equality predicates). An example of this is:
SALARY > 25000 AND
DEPT = 50 AND
YEARS_SERVICE > 3
Chapter 18
Introduction
Section 1, Database Design, presents a methodology for physical database design.
This design method applies general principles and heuristics to create an
internal schema that exhibits good performance characteristics for most SQL queries.
However, the physical design method also acknowledges that some individual queries
may not meet their performance requirements without further tuning (see chapter 7,
Physical Design Control).
Database designers face a myriad of tuning options for increasing the performance
of individual queries. One of the primary ingredients for success is knowing which
option to use in which situation. One primary option is working
with the SQLBase query optimizer to ensure that the correct access paths are
available to it. This chapter addresses this issue.
When working with the SQLBase optimizer to increase the performance of a specific
SQL SELECT statement, the database designer performs three distinct activities:
update statistics, determine the optimal index set, and override the query plan(s)
chosen by the SQLBase optimizer.
Update statistics
As described earlier in this section, SQLBase maintains several statistics regarding
tables and indexes. The query optimizer uses these statistics to calculate the cost of
various access paths. Therefore, the first aspect of working with the query optimizer
to improve the performance of a given query is to make sure these statistics are
correct and up-to-date.
limited to the slower join algorithms, the hash join or one of the nested loop
techniques. However, for most SELECT commands, the access paths available to the
optimizer with the initial index set provide adequate performance.
On the other hand, the initial index set may not meet the performance requirements of
all queries. For example, some queries containing join columns that are neither
primary nor foreign keys may perform poorly. For those queries not meeting their
performance criteria, you must add additional indexes to increase the speed of the
join. However, you do not want to create one or more specific indexes for every
SELECT command performing poorly since:
• Indexes slow update DML commands, specifically all INSERT and DELETE
statements, as well as any UPDATE statements that modify a column that is
contained within one or more indexes.
• Index management is both time-consuming and costly.
• Indexes take additional disk space.
• Indexes can also cause locking contention because many rows may be stored
on a single index node page.
Consequently, the general design goal is to create an index set that contains the fewest
possible indexes while still meeting the performance requirements of all database
transactions. This is the optimal index set.
query was executed). If so, execute the query to determine if performance has
improved significantly.
these indexes are only helpful in queries involving the LIKE operator when the wild
card characters do not occur in the beginning. Therefore, the SQLBase optimizer
does not use an index if a LIKE operator is used and a wild card is in the beginning
position.
Furthermore, the SQLBase optimizer also does not use an index when bind variables
are used with the LIKE operator. This is because, at the time the statement is
compiled, the optimizer does not know where the wild card will occur. Therefore, the
optimizer chooses an access path (a table scan) which satisfies the query with the
wild card in any position, which can significantly decrease performance. Hence,
convert all bind variables to constants in LIKE predicates whenever possible.
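The following predicates on the CUSTOMER table illustrate the distinction (the column and bind variable are used here only as examples):
SELECT * FROM CUSTOMER WHERE NAME LIKE 'JON%';
SELECT * FROM CUSTOMER WHERE NAME LIKE '%SON';
SELECT * FROM CUSTOMER WHERE NAME LIKE :1;
An index on NAME can help the first statement, but not the second (leading wild card) or the third (bind variable, whose wild card position is unknown when the plan is built).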
• If the query contains an ORDER BY clause, comment it out and see if the
query plan changes. If the query plan changes, execute the query to determine
if performance improves significantly.
• If the query contains several joins, locate the one that is taking the longest to
perform. Successively comment out all but one join and execute the query.
Determine if one join is the bottleneck.
• Eliminate any AT (@) functions (such as @UPPER) being performed on
WHERE clause predicates and see if performance increases. It may be that an
index is not being used because of a @function. If this is the case, the index
may be altered to contain the function if the query and function are
sufficiently critical. Of course, this alternative may then make the index
unusable for other queries whose performance had been acceptable before the
modification.
with the PLANONLY parameter ON. If it does, execute the query to determine if
adding this index significantly increases the performance of the query.
If adding the index does not increase performance significantly, create another index
on another table column identified above. Continue until you have tried all columns
identified above. If you are unable to increase the performance of the query by adding
a single column index, proceed to the next step.
general, list equality columns before other columns, and list all equality
predicate columns in increasing selectivity factor order. After listing all
equality predicates in the CREATE INDEX statement, list the inequality
predicate which has the lowest selectivity factor as the last column in the
index. Any other inequality predicates present in the WHERE clause will
probably not benefit the performance of the statement, nor will they be
considered by the SQLBase query optimizer when selecting access paths.
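For example, for the WHERE clause shown earlier (DEPT = 50 AND SALARY > 25000 AND YEARS_SERVICE > 3), and assuming the SALARY predicate has a lower selectivity factor than the YEARS_SERVICE predicate, these guidelines suggest an index such as the following (the index name is illustrative):
CREATE INDEX XDEPTSAL
ON EMPLOYEE (DEPT,
SALARY);
The equality column DEPT is listed first, the most selective inequality column is listed last, and YEARS_SERVICE is omitted because a second inequality column would probably not be used by the optimizer.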