0% found this document useful (0 votes)

80 views255 pages

Dbms

This document provides an overview of CS520: Introduction to Database Design & Engineering, a graduate-level course offered at an unnamed university in Fall 2002. The course focuses on database design and development of database management systems, with an emphasis on projects involving DBMS development. It covers topics such as SQL, database design, query optimization, recovery and concurrency control, and distributed database systems.

Uploaded by

annu1987

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

80 views255 pages

Dbms

Uploaded by

annu1987

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 255

CS520: Introduction to Database

Design & Engineering

Fall 2002

(C) Frieder, Grossman, & Goharian 1996, 2002 1

Database Course Differentiation
Credit not given for both classes since they are similar in content

CS 425 CS 520
Prerequisite: CS401 Prerequisite: CS402
Emphasis: Emphasis:
» Database Design & Use » Database Design & Eng.
» Application Development » DBMS Development
Project: Project:
» Application Development » DBMS Development

(C) Frieder, Grossman, & Goharian 1996, 2002 2

Contents
Introduction 1 - 44
SQL 45 - 122
Database Design 123 - 188
Query Optimization 189 - 211
Recovery and Concurrency Control 212 - 233
Integration of Structured Data and Text 234 - 241
Distributed Database Systems 242 - end

(C) Frieder, Grossman, & Goharian 1996, 2002 3

Background
Initially, installations wrote separate applications
with large amounts of repeated code to implement
concurrency control, security, and recovery.

This was a lot of wasted effort that was also very

much error prone

(C) Frieder, Grossman, & Goharian 1996, 2002 4

Example of Common Functionality

Payroll Inventory Marketing

User Interface User Interface User Interface

Business Logic Business Logic Business Logic

Concurrency Concurrency Concurrency

Control Control Control

Security Security Security

Recovery Recovery Recovery

(C) Frieder, Grossman, & Goharian 1996, 2002 5

Database System Example
Payroll Inventory Marketing

User Interface User Interface User Interface

Business Logic Business Logic Business Logic

Database System

Concurrency
Control

Security

Recovery
(C) Frieder, Grossman, & Goharian 1996, 2002 6
Definitions
DBMS - a Database Management System is a set of
routines that is capable of providing the following basic
functions:
» Add
» Delete
» Update
» Retrieve

(C) Frieder, Grossman, & Goharian 1996, 2002 7

Primitive Functionality

Add (X) =
Find (X);
If not found then insert (X)
else return (error_code)

Delete (X) =
Find (X);
If found then remove (X)
else return (error_code)
(C) Frieder, Grossman, & Goharian 1996, 2002 8
Primitive Functionality (cont)

Update (X, Y) =
Delete (X);
If not error_code then Add (Y)

Retrieve (X) =
Delete (X);
If not error_code then Add (X)

(C) Frieder, Grossman, & Goharian 1996, 2002 9

Database Models
Network
» Any links supporting quick access

Hierarchical
» Links but no cycles (hierarchy)

Relational
» Data Independence

Object – Oriented
» Entity Abstraction

(C) Frieder, Grossman, & Goharian 1996, 2002 10

Inverted Index

I
N
D
E
X

Posting List

(C) Frieder, Grossman, & Goharian 1996, 2002 11

Inverted List Example

Record Record Record

Hank
1 3 5

Query:
find all occurrences of the name (value of attribute) is
‘Hank’ in the database:

Hash to the value Hank in the “index”

Scan the posting list for all occurrences

(C) Frieder, Grossman, & Goharian 1996, 2002 12

Inverted Index

Associates a posting list with each attribute value

abacus: (F3AB, 873A, FF32)

abatement: (6A15)
…
zoology: (D381, DA32)
83: (F623, B001, 879D, 76AA)
2002: (AAAA, BBBB, CCCC)

Inverted because it lists for each attribute the

location on disk where the value is stored.

(C) Frieder, Grossman, & Goharian 1996, 2002 13

Building an Inverted Index
For each relation r in the collection
» For each attribute t in relation r
– Find attribute t in the item dictionary
– If term t exists, add a new disk location to its posting list
– Otherwise,
Add attribute value t to the item dictionary
Add a node to the posting list

(C) Frieder, Grossman, & Goharian 1996, 2002 14

File Organizations
Unsorted files (Heap Files)
– Good storage efficiency, fast insertion, deletion.
– Slow searches
Sorted Files
– Good storage efficiency and search of range
– Slow insertion and deletion
– not practical (files never sorted) => B+ tree data structure
Hashed-Based Files
– Not efficient storage
– Fast insertion, deletion, and equality searches
(C) Frieder, Grossman, & Goharian 1996, 2002 15
Sample Database
Hank graduated:
» Michigan
» IIT
» MIT

Hank worked:
» IBM
» Intel
» Bellcore
» Harris
(C) Frieder, Grossman, & Goharian 1996, 2002 16
Network Model

MIT IIT Michigan

Graduated

Hank

Worked

IBM Intel Bellcore Harris

(C) Frieder, Grossman, & Goharian 1996, 2002 17

Hierarchical Model

Hank

Graduated Worked

IBM Intel
MIT IIT Michigan

Bellcore Harris

(C) Frieder, Grossman, & Goharian 1996, 2002 18

Relational Model

Person Graduated Person Worked

Hank IIT Hank IBM
Hank MIT Hank Intel
Hank Michigan Hank Harris
Hank Bellcore

Key: (Person, Graduated) Key: (Person, Worked)

(C) Frieder, Grossman, & Goharian 1996, 2002 19

Object Oriented Model

Type EMPLOYEE begin

name : char (20);
graduated : SETOF (schools);
worked : SETOF (companies);
end;

INSERT ( 1,Hank, {IIT, Michigan, MIT},

{IBM, Intel, Bellcore, Harris} )

(C) Frieder, Grossman, & Goharian 1996, 2002 20

Relational Model
The initial paper on the relational model was written in
1969 by Codd.
System R was a relational prototype implemented in the
mid 70’s by IBM.
Ingres was a relational prototype implemented at UC
Berkeley in the mid 70’s.
Finally, commercial offerings of relational systems started
with Oracle in 1979 and was quickly followed by SQL/DS
and DB2 by IBM in the mid 80’s.

(C) Frieder, Grossman, & Goharian 1996, 2002 21

Data Independence
The relational model allows users to simply specify what
data they require, not how to get them. This is referred to
as data independence and is a key contribution of the
relational model.
Older models are referred to as navigational as users must
navigate through the data and follow pointers from one
datum to another.

(C) Frieder, Grossman, & Goharian 1996, 2002 22

Structure of RDBMS

User Query

Query Optimizer

Operators

File Manager
Concurrency Control &
Crash Recovery:
Buffer Manager
Transaction Manager
Disk Space Manager Lock Manager
Recovery Manager

(C) Frieder, Grossman, & Goharian 1996, 2002 23

Relational Model
Relational algebra is used to specify the operations allowed
within the relational model.
Relational algebra is theoretically based on set theory.
A relation can be illustrated as a table of rows and columns.
The table is referred to as a relation, rows are referred to as
tuples, and columns are attributes.
Relational operators are closed. A relational operation
applied to a relation always results in a relation.

(C) Frieder, Grossman, & Goharian 1996, 2002 24

Example Relation:
EMPLOYEE(emp#, name, department, salary)

Emp# Name Department Salary

002 Jones Marketing 300.00

004 Smith Sales 150.00

007 Bond Diplomacy 999.00

Table = Relation
Row = Tuple
Column = Attribute
(C) Frieder, Grossman, & Goharian 1996, 2002 25
Formal Definition
IF R is the set of attributes (columns), commonly
referred to as schema, then
r(R) is a mapping of a set of tuples (rows ),
commonly referred to as instance.

Since each attribute is restricted to a limited domain,

a relation is actually a subset of:

dom(A1) x dom(A2) x dom(A3) where dom(X)

indicates the domain or set of valid values for
attribute X.
(C) Frieder, Grossman, & Goharian 1996, 2002 26
Characteristics of Relations
No inherent ordering of tuples.
Conceptually, no inherent ordering of attributes. In
practice, attributes are ordered based on the initial schema
definition.
All attribute values within a tuple should be atomic (1NF).

(C) Frieder, Grossman, & Goharian 1996, 2002 27

Key Attributes
Since a relation is merely just a set of tuples, NO
DUPLICATE ELEMENTS are theoretically possible.

(Unfortunately, some implementations violate this

uniqueness definition)
A column or set of columns must uniquely identify a
tuple.
Superkey - any combination of attributes that uniquely
identify a tuple.

(C) Frieder, Grossman, & Goharian 1996, 2002 28

Keys (continued)
Key - a superkey from R such that the removal of
any attribute results in a set of attributes that is
NOT a superkey. Hence, a key is a:
MINIMAL SUPERKEY

Ex: (emp#, name, department, salary) is a

superkey but not a key, because name, department,
or salary could be removed and we would still
have the superkey, emp#.

(C) Frieder, Grossman, & Goharian 1996, 2002 29

Candidate Keys
It is possible that more than one set of attributes qualify as
a key. These are called candidate keys.
Typically, one is chosen and referred to as the primary
key. This key is underlined in the description of the
relation.

EMPLOYEE(emp#, name, department, salary)

(C) Frieder, Grossman, & Goharian 1996, 2002 30

Student-Grade Database
Student # GPA Degree
1 4.0 Bachelors
1 4.0 Masters
1 4.0 Doctorate
2 3.8 Bachelors
2 3.0 Masters
3 2.1 Bachelors
4 3.5 Bachelors
4 3.4 Masters
4 3.9 Doctorate
5 3.6 Bachelors
6 4.0 Associates
6 3.1 Bachelors

Key: (Student#, Degree) – Assumes only one degree type per person
(C) Frieder, Grossman, & Goharian 1996, 2002 31
SELECT
SELECT - extract tuples from a relation

Syntax: σ<selection condition> (Relation Name)

σGPA=4.0 (SG) - obtains all tuples from the Student-Grade

(SG) relation where the GPA is 4.0.
Student # GPA Degree
1 4.0 Bachelors
1 4.0 Masters
1 4.0 Doctorate
6 4.0 Associates
(C) Frieder, Grossman, & Goharian 1996, 2002 32
Project
Project retrieves columns from a table

Syntax: π<Attribute List> (Relation Name)

πstudent#, degree(SG)

Retrieves student number and degree from the Student-

Grade relation.

(C) Frieder, Grossman, & Goharian 1996, 2002 33

Single – Scan Project
πstudent#, degree(SG)

Student # Degree
1 Bachelors
1 Masters
1 Doctorate
2 Bachelors
2 Masters
3 Bachelors
4 Bachelors
4 Masters
4 Doctorate
5 Bachelors
6 Associates
6 Bachelors

(C) Frieder, Grossman, & Goharian 1996, 2002 34

Multiple – Scan Project
πstudent#, GPA(SG)

Student # GPA
1 4.0
1 4.0
1 4.0
2 3.8
2 3.0
3 2.1
4 3.5
4 3.4
4 3.9
5 3.6
6 4.0
6 3.1

(C) Frieder, Grossman, & Goharian 1996, 2002 35

Undergrad & Graduate
Student Grade Database
Student # GPA Degree
1 4.0 Masters
Graduate 1 4.0 Doctorate
2 3.0 Masters
(GSG) 4 3.4 Masters
4 3.9 Doctorate

Key:
(Student#, Degree) Student # GPA Degree
1 4.0 Bachelors
2 3.8 Bachelors
3 2.1 Bachelors
Undergrad 4 3.5 Bachelors
(USG) 5 3.6 Bachelors
6 4.0 Associates
6 3.1 Bachelors
(C) Frieder, Grossman, & Goharian 1996, 2002 36
Set Theoretic Operations:
USG UNION GSG
List all information on all students

Student # GPA Degree

1 4.0 Bachelors
1 4.0 Masters
1 4.0 Doctorate
2 3.8 Bachelors
2 3.0 Masters
3 2.1 Bachelors
4 3.5 Bachelors
4 3.4 Masters
4 3.9 Doctorate
5 3.6 Bachelors
6 4.0 Associates
6 3.1 Bachelors
(C) Frieder, Grossman, & Goharian 1996, 2002 37
Set Theoretic Operations:
πstudent#(USG) INTERSECTION πstudent#(GSG)

List all students who were both graduate and undergraduate students

Grad & Undergrad Student #

1
(GU) 2
4

(C) Frieder, Grossman, & Goharian 1996, 2002 38

Set Theoretic Operations:
πstudent#(USG) DIFFERENCE πstudent#(GSG)

List all students who were undergraduate but not graduate students

Student #
Only Undergrads 3
(OU) 5
6

(C) Frieder, Grossman, & Goharian 1996, 2002 39

Cartesian Product
Relational Algebra includes: GU x OU
For the sets:

1 2 3 5
GU OU
4 6

<1,3> <1,5> <1,6>

GU x OU = <2,3> <2,5> <2,6>
<4,3> <4,5> <4,6>
(C) Frieder, Grossman, & Goharian 1996, 2002 40
θ - Join
θ - Join is a Cartesian product with the addition of a
condition that determines which tuples are
selected.
Can be logically viewed as:
» Step 1: Cartesian Product
» Step 2: Select from result of Step 1

(C) Frieder, Grossman, & Goharian 1996, 2002 41

Additional Join Types
In a θ - Join, the joining attributes are explicitly specified.
An Equi - Join is the most common θ - Join with an equality as the
condition.
A Natural - Join is an Equi-Join where the joining attributes
are all those attributes with a common name (implicitly
specified)

(C) Frieder, Grossman, & Goharian 1996, 2002 42

πstudent#,degree(GSG) Equi-Join πstudent#,GPA(USG)

Student # GPA Degree

1 4.0 Masters
1 4.0 Doctorate
2 3.8 Masters
4 3.5 Masters
4 3.5 Doctorate

(C) Frieder, Grossman, & Goharian 1996, 2002 43

Relational Algebra Summary
Relations are viewed as sets.
Relational operations are closed. Any operation on one or
more relations yields a relation.
No inherent ordering.
Unary Operators: SELECT, PROJECT
Binary Operators: UNION, INTERSECTION, SET
DIFFERENCE, CARTESIAN PRODUCT, JOIN
Single Scan: SELECT, PROJECT including key
Multiple Scan: All others

(C) Frieder, Grossman, & Goharian 1996, 2002 44

Structured Query Language
(SQL)

(C) Frieder, Grossman, & Goharian 1996, 2002 45

Structured Query Language
First relational query language, SQUARE was
implemented in System R (1975).
SQUARE was followed by SEQUEL.
First commercial implementation, Oracle (1979), followed
closely by SQL/DS in 1982.
Major SQL DBMS vendors: Oracle, IBM (DB2), Sybase,
Informix, Computer Associates, and Microsoft.

(C) Frieder, Grossman, & Goharian 1996, 2002 46

SQL Overview
Data Manipulation Language (DML)
» SELECT
» INSERT
» UPDATE
» DELETE
Data Definition Language (DDL)
» CREATE TABLE
» CREATE INDEX
» GRANT

(C) Frieder, Grossman, & Goharian 1996, 2002 47

SELECT Overview
Single Relation
» SELECT
» Boolean Operators
» IN
» BETWEEN
» Aggregate Operators
» Calculated Attributes
» Sorting
» Wildcard Searches
» GROUP BY
» HAVING
» NULLS
» Varchar
Multiple Relations

(C) Frieder, Grossman, & Goharian 1996, 2002 48

Syntax:
SELECT <list of columns>
FROM <list of tables>
PARTS

Ex: SELECT p#,name,qty p# name qty

FROM PARTS 1 Nut 42

2 Bolt 25
3 Wheel 15
Ex: SELECT *
FROM PARTS

(C) Frieder, Grossman, & Goharian 1996, 2002 49

Select with WHERE
SELECT <list of columns>
FROM <list of tables>
WHERE <list of conditions>

<list of conditions> is of the form:

EX: SELECT * p# name qty

FROM PARTS 3 Wheel 15
WHERE p# = 3

(C) Frieder, Grossman, & Goharian 1996, 2002 50

Use of Boolean Operators
Conditions can be separated by Boolean operators:
AND, OR, NOT

EX: “List information about parts 1 and 2”

SELECT *
p# name qty
FROM PARTS
WHERE p# = 1 OR p# = 2 1 Nut 42
2 Bolt 25
EX: “LIST information about all wheels that contain more
than 20 in stock”
SELECT *
FROM PARTS
WHERE name = ‘Wheel’ and qty > 20

(C) Frieder, Grossman, & Goharian 1996, 2002 51

Shortcut Number 1: IN
To find information for a list of values, the IN operator
may be used:

Ex: “List the name of all parts whose part # is 1,2, or 5”

SELECT name Name

FROM PARTS Nut
WHERE p# IN (1, 2, 5) Bolt

instead of: WHERE (p# = 1) OR (p# = 2) OR (p# = 5)

(C) Frieder, Grossman, & Goharian 1996, 2002 52

Shortcut Number 2:
BETWEEN
To find values within a range, it is often easier to use
BETWEEN.
Ex: “Find all parts where the quantity on hand is
greater than or equal to twenty parts,
but less than or equal to fifty.”
SELECT * p# name qty
1 Nut 42
FROM PARTS 2 Bolt 25
WHERE qty BETWEEN 20 and 50

instead of: WHERE qty >= 20 AND qty <= 50

(C) Frieder, Grossman, & Goharian 1996, 2002 53
Aggregate Operators
A common requirement is to compute statistics such as MIN,
MAX, and AVERAGE on a range of data.

Ex: Find the min, max, and average quantity of all wheels.

SELECT MIN(qty), MAX(qty), AVG(qty)

FROM PARTS min(qty) max(qty) avg(qty)
WHERE name = ‘Wheel’ 15 15 15

(C) Frieder, Grossman, & Goharian 1996, 2002 54

Calculated Attributes
A new attribute is obtained by using arithmetic operators
(+,-, *, /) on other numeric attributes.

All operators follow standard precedence.

Ex: List all parts and their quantity given an

increase of 20%
p# name qty
1 Nut 50
SELECT p#,name,(qty+(qty*.2)) ‘qty’ 2 Bolt 30
FROM PARTS 3 Wheel 18

(C) Frieder, Grossman, & Goharian 1996, 2002 55

Sorting
ORDER BY <list of attributes> [DESC]
can be added to SELECT to obtain sorted output.

Ex: List all part names in ascending order:

SELECT p#, name, qty p# name qty

2 Bolt 25
FROM PARTS 1 Nut 42
ORDER BY name 3 Wheel 15

For descending order change to: ORDER BY name DESC

(C) Frieder, Grossman, & Goharian 1996, 2002 56
Sorting Calculated Attributes
To refer to a computed attribute in the ORDER BY, use the
position in the list of columns following SELECT.

Ex: “List all part information in descending order of a

projected 20 percent reduction in quantity”

SELECT p#,name,(qty-(qty*.2)) ‘qty’ p# name qty

FROM PARTS 1 Nut 34
2 Bolt 20
ORDER BY 3 DESC 3 Wheel 12

(C) Frieder, Grossman, & Goharian 1996, 2002 57

Wildcard Searches of Strings
The LIKE operator is used to search parts of a string.

The following wildcard characters are used:

% - zero or more characters
_ - exactly one character

Ex: List all parts whose name starts with a ‘W’

SELECT * p# name qty

3 Wheel 15
FROM PARTS
WHERE name LIKE ‘W%’

(C) Frieder, Grossman, & Goharian 1996, 2002 58

More LIKE Examples
Ex: List all parts whose name starts with a
‘W’ and ends with an ‘L’ p# name qty
WHERE name LIKE ‘W%L’ 3 Wheel 15

Ex: List all parts whose name is three characters

long and starts with a ‘N’
WHERE name LIKE ‘N__’ p# name qty
1 Nut 42

(C) Frieder, Grossman, & Goharian 1996, 2002 59

Review
List all parts whose name starts with a ‘W’ or whose part
number is either 2,4,8,11,12,13,14,15. Sort the list in
descending order by quantity.

SELECT *
FROM PARTS
WHERE _____ LIKE _______ OR
_____ IN (_,_,_) OR
______ BETWEEN __ AND __
ORDER BY ____ DESC

(C) Frieder, Grossman, & Goharian 1996, 2002 60

Review (Answer)
List all parts whose name starts with a ‘W’ or whose part
number is either 2,4,8,11,12,13,14,15. Sort the list in
descending order by quantity.

SELECT *
FROM PARTS
WHERE name LIKE ‘W%’ OR
p# IN (2,4,8) OR
p# BETWEEN 11 AND 15
ORDER BY qty DESC

(C) Frieder, Grossman, & Goharian 1996, 2002 61

More Review
Ex: List the total number of parts that would exist if
quantity was increased by 25 percent.

Ex: List all parts that would have a quantity greater than
50, if the quantity was increased by 25 percent.

(C) Frieder, Grossman, & Goharian 1996, 2002 62

More Review (Answer)
Ex: List the total number of parts that would exist if
the quantity was increased by 25 percent.
SELECT SUM(qty + qty * .25)
FROM PARTS Sum(qty + qty * .25)
103
Ex: List all parts that would have a quantity greater
than 50, if the quantity was increased by 25 percent.
SELECT * p# name qty
FROM PARTS 1 Nut 42
WHERE qty + qty * .25 > 50
(C) Frieder, Grossman, & Goharian 1996, 2002 63
GROUP BY
It is often necessary to review data about groups of related
tuples. Consider an employee relation that contains the
“Department” attribute. Assume one employee may work
in only a single department.

DEPARTMENT partitions the EMP set into subsets:

Sales

Marketing
Service
Finance

(C) Frieder, Grossman, & Goharian 1996, 2002 64

GROUP BY (continued)
EMPLOYEE (emp#, name, salary, department)
emp# name salary department
1 Fred 200 Sales
2 Mike 300 Sales
3 Sam 400 Sales

4 Martha 350 Marketing

5 Juanita 500 Marketing

6 Steve 800 Finance

7 Tom 200 Service

8 Sue 900 Service
(C) Frieder, Grossman, & Goharian 1996, 2002 65
GROUP BY (continued)
For each department, list the average salary.
SELECT department, AVG(salary)
FROM EMPLOYEE
GROUP BY department

department AVG(salary)
Sales 300
Marketing 425
Finance 800
Service 550
(C) Frieder, Grossman, & Goharian 1996, 2002 66
Group By (continued)
If a WHERE clause exists, it is executed as well.

Ex: For each department, list the highest salary, but

exclude all employees whose name starts with a ‘S’

SELECT department, MAX(salary)

FROM EMPLOYEE
WHERE name NOT LIKE ‘S%’
GROUP BY department

(C) Frieder, Grossman, & Goharian 1996, 2002 67

GROUP BY (continued)
department max(salary)
Sales 300
Marketing 500
Service 200

(C) Frieder, Grossman, & Goharian 1996, 2002 68

GROUP BY (continued)
More refined groups are obtained by using multiple
attributes in GROUP BY.
Add the attribute “REGION” to the employee relation. Now
the department partition may be partitioned into different
regions.

North

West MARKETING East

South

(C) Frieder, Grossman, & Goharian 1996, 2002 69

GROUP BY (continued)
EMPLOYEE (emp#, name, salary, department,rgn)
emp# name salary department rgn
1 Fred 200 Sales north
2 Mike 300 Sales north
3 Sam 400 Sales east

4 Martha 350 Marketing west

5 Juanita 500 Marketing west

6 Steve 800 Finance south

7 Tom 200 Service north

8 Sue 900 Service south
(C) Frieder, Grossman, & Goharian 1996, 2002 70
GROUP BY
(multiple partitions)

To specify more than one partition, simply add more

attributes after the GROUP BY:

Ex: Compute the average salary for each region within

each department.

SELECT department, region, AVG(salary)

FROM EMPLOYEE
GROUP BY department, region

(C) Frieder, Grossman, & Goharian 1996, 2002 71

GROUP BY (continued)
department reg avg(salary)
Sales north 250
Sales east 400
Marketing west 425
Finance south 800
Service north 200
Service south 900

(C) Frieder, Grossman, & Goharian 1996, 2002 72

HAVING (restricts groups)
Syntax: HAVING (list of conditions)
Aggregate functions used (SUM, MIN, MAX, COUNT).
Ex: List the average salary for all departments that have
more than two employees.
SELECT department, AVG(salary)
FROM EMPLOYEE
GROUP BY department department avg(salary)
Sales 300
HAVING COUNT(*) > 2

(C) Frieder, Grossman, & Goharian 1996, 2002 73

HAVING (continued)
Ex: List the minimum salary in departments sales,
marketing and service as long as the department has an
average salary greater than 400.
SELECT department, MIN(salary)
FROM EMPLOYEE
WHERE department IN (‘sales’,’marketing’,’service’)
GROUP BY department
HAVING AVG(salary) > 400
department min(salary)
Marketing 350
(C) Frieder, Grossman, & Goharian 1996, 2002 74
Multiple Entity Relationships
Multiple tables needed to store multi-entity relationships.

One to many:
One parent may have many children

Many to one:
Many people may attend a single meeting

Many to Many:
Students graduate from multiple colleges
Each college graduates by many students
(C) Frieder, Grossman, & Goharian 1996, 2002 75
Single Relation Design
It is tempting to try and stuff all multi-valued relationships
into a single relation:

emp# name salary college1 college2 college3

1 Fred 200 Harvard Unused Unused

2 Ethel 300 IIT Michigan Unused

3 Mike 400 MIT Stanford IIT

(C) Frieder, Grossman, & Goharian 1996, 2002 76

Problems with Poor Design
Incompleteness
» Unable to store more than three colleges per individual.
Many to many relationships can have an infinite
number of values.
Query Complexity
» Queries such as “list all colleges attended by Mike”
become substantially more difficult.
Wasted Storage
» Many entries are not used.

(C) Frieder, Grossman, & Goharian 1996, 2002 77

Multiple Relations:
2 Relations for Many-Many

EMPLOYEE COLLEGE
emp# name salary emp# name
1 Fred 200 1 Harvard

2 Ethel 300 2 IIT

2 Michigan
3 Mike 400
3 MIT
3 Stanford
3 IIT

emp# in EMPLOYEE is a primary key

emp# in COLLEGE is a foreign key
emp#,name in COLLEGE is a primary key
(C) Frieder, Grossman, & Goharian 1996, 2002 78
Preserving Relationships
Joining the two relations restores the original relation
assuming a key is part of the partitioning.

Poor partitioning may result in additional spurious tuples

being introduced.

(C) Frieder, Grossman, & Goharian 1996, 2002 79

Equi-Join
Explicit indication of the join attribute conditions.
Typically involves joining on foreign key attributes.
List all colleges attended by “Mike”
SELECT b.name
FROM EMPLOYEE a, COLLEGE b
WHERE a.emp# = b.emp#
AND a.name = ‘Mike’ name
MIT
Stanford
IIT
(C) Frieder, Grossman, & Goharian 1996, 2002 80
Problem with only using Two
Relations for Many-Many

Suppose we need to maintain the college location as well.

location must be replicated many times.

EMPLOYEE COLLEGE
emp# name salary emp# name location
1 Fred 200 1 Harvard Boston
2 IIT Chicago
2 Ethel 300
2 Michigan Ann Arbor
3 Mike 400 3 MIT Boston
3 Stanford Stanford
3 IIT Chicago

(C) Frieder, Grossman, & Goharian 1996, 2002 81

Use of 3 Relations
To avoid needless repetition, we create a separate table for each of
the two entities involved in a many-many relationship, and then a
third “linking” relation to contain data about the relationship
between the entities.

All 1-1 information about employees:

EMPLOYEE(emp#, name, salary)
All 1-1 information about colleges:
COLLEGE(col#, name, location)
All data pertaining to a single employee attending a single college:
ATTENDS(emp#, col#, gpa)

(C) Frieder, Grossman, & Goharian 1996, 2002 82

Use of Three Relations (continued)
EMPLOYEE COLLEGE
emp# name salary col# name location
1 Fred 200 11 Harvard Boston
2 Ethel 300 22 IIT Chicago
3 Mike 400 33 Michigan Ann Arbor
44 MIT Boston
55 Stanford Stanford
ATTENDS
emp# col# gpa
1 11 2.45
2 22 3.79
2 33 3.65
3 44 2.85
3 55 2.65
3 22 4.0
(C) Frieder, Grossman, & Goharian 1996, 2002 83
Sample Query (3 Relations)
Ex: List the names of all colleges attended by
“Mike.”

SELECT b.name
FROM EMPLOYEE a, COLLEGE b, ATTENDS c
WHERE a.emp# = c.emp# AND name
b.col# = c.col# AND Michigan
a.name = ‘Mike’ MIT
Stanford

(C) Frieder, Grossman, & Goharian 1996, 2002 84

Sample Query (3 Relations)
Ex: List the names of all employees who
attended Harvard.
SELECT a.name
FROM EMPLOYEE a, COLLEGE b, ATTENDS c
WHERE a.emp# = c.emp# AND
b.col# = c.col# AND name
Fred
b.name = ‘Harvard’

(C) Frieder, Grossman, & Goharian 1996, 2002 85

Subqueries
Instead of hard-coding the list that is used by IN, it is possible to
dynamically generate the list using a subquery.

Ex: SELECT *
FROM EMPLOYEE
WHERE emp# IN (1,2,3,4,5,7)

could be rewritten as;

Ex: SELECT *
FROM EMPLOYEE
WHERE emp# IN (SELECT num FROM SAMPLE)

Assuming SAMPLE(num) is a relation with tuples: 1,2,3,4,5, and 7.

(C) Frieder, Grossman, & Goharian 1996, 2002 86

EXISTS
EXISTS prefaces a subquery and evaluate to TRUE if one
or more tuples are present in the result set of the subquery.

Ex: List all employees who attended at least one college.

SELECT *
FROM EMPLOYEE a
WHERE EXISTS (SELECT c.emp#
FROM ATTENDS c
WHERE c.emp# = a.emp#)

(C) Frieder, Grossman, & Goharian 1996, 2002 87

DISTINCT
DISTINCT is used to remove duplicates.
Ex: List all distinct salaries.
SELECT DISTINCT(salary) distinct (salary)
200
FROM EMPLOYEE 300
400

(C) Frieder, Grossman, & Goharian 1996, 2002 88

UNION
The union of two result sets (of the same data type) is obtained via the
UNION operator (duplicates are removed). The OR query can be
written using UNION:

Syntax: <SQL select statement> UNION

<SQL select statement>
Ex: Obtain billing from 2000 and 2001.
SELECT *
FROM BILL2000
UNION
SELECT *
FROM BILL2001

(C) Frieder, Grossman, & Goharian 1996, 2002 89

EXCEPT
Syntax: <SQL select statement> EXCEPT
<SQL select statement>
Ex: Find employees, who attended IIT but not MIT.
SELECT e.emp#
FROM EMPLOYEE a, COLLEGE b, ATTENDS c
WHERE a.emp# = c.emp# AND
b.col# = c.col# AND
b.name = ‘IIT’
EXCEPT
SELECT e.emp#
FROM EMPLOYEE a, COLLEGE b, ATTENDS c
WHERE a.emp# = c.emp# AND
b.col# = c.col# AND
b.name = ‘MIT’
(C) Frieder, Grossman, & Goharian 1996, 2002 90
Other Data Manipulation
UPDATE
» Modify tuples in a single relation
DELETE
» Remove tuples from a single relation
INSERT
» Add tuples to a single relation

(C) Frieder, Grossman, & Goharian 1996, 2002 91

INSERT
Syntax:
INSERT INTO <table name> [list of columns]
VALUES (<list of values>)

Ex. For the relation: EMPLOYEE (emp#, name, salary)

INSERT INTO EMPLOYEE (emp#, name, salary)

VALUES (5, ‘Herbert’, 200)

Note: If optional column list is not found, the values must be listed in
the order of their initial definition

(C) Frieder, Grossman, & Goharian 1996, 2002 92

INSERT - Format 2
Syntax: INSERT INTO <table name>
(select statement)

Ex: Copy all tuples in the EMPLOYEE relation and place

them in NEW_EMPLOYEE

INSERT INTO NEW_EMPLOYEE

SELECT *
FROM EMPLOYEE

(C) Frieder, Grossman, & Goharian 1996, 2002 93

UPDATE
Syntax:
UPDATE <table name>
<list of assignments separated by commas>
WHERE <any SELECT statement>
where <assignment> is of the form:
SET <column_name> = <value>
Ex: Modify John’s salary to 150
UPDATE EMPLOYEE
SET salary = 150.00
WHERE name = ‘John’

(C) Frieder, Grossman, & Goharian 1996, 2002 94

UPDATE (continued)
An assignment statement may contain a numeric expression:

Ex: Give all employees a ten percent raise.

UPDATE EMPLOYEE
SET salary = salary * 1.10

(C) Frieder, Grossman, & Goharian 1996, 2002 95

DELETE
Syntax:
DELETE FROM <table name>
WHERE <any SELECT statement>

Ex: Remove all employees who work in department 5

DELETE FROM EMPLOYEE

WHERE dept = 5

To remove all employees:

DELETE FROM EMPLOYEE

(C) Frieder, Grossman, & Goharian 1996, 2002 96

Data Definition Language (DDL)

» Create Table
» Drop Table
» Create Index
» Drop Index
» GRANT
» REVOKE
» ALTER TABLE

(C) Frieder, Grossman, & Goharian 1996, 2002 97

CREATE TABLE
CREATE TABLE <table name> (
[<colname1> <datatype1>],
....
[<colnameN> <datatypeN>]
)
typical data types are:
CHAR(x), VARCHAR(x), SMALLINT, INTEGER,
DATE, TIME, DECIMAL (x,y)

(C) Frieder, Grossman, & Goharian 1996, 2002 98

CREATE TABLE (example)
CREATE TABLE EMPLOYEE
(emp# SMALLINT,
name CHAR(20),
salary DECIMAL(5,2))

(C) Frieder, Grossman, & Goharian 1996, 2002 99

Varying Length Character

VARCHAR(x) indicates that a string will be no

longer than x characters.
Fixed length strings are padded to fill fixed space.
Varying length strings have a Length Indicator.

FIXED VARYING
200 Hank FILL 252.35 200 4 Hank 252.35

(C) Frieder, Grossman, & Goharian 1996, 2002 100

Effect of Varying Length Columns
on Performance

FIXED VARCHAR
tuple 1 tuple 1 tuple 2
tuple 2 tuple 2 (cont) tuple 3
tuple 3 tuple 3 (cont)
tuple 4 tuple 4
tuple 5 tuple 5

A modification to the FIXED table only affects one tuple.

A modification to VARCHAR might result in the reshuffling or

copying to OVERFLOW of other tuples so that they fit on a single
page.

(C) Frieder, Grossman, & Goharian 1996, 2002 101

Rule of Thumb (VARCHAR)
Avoid VARCHAR when it is not necessary. One
“rule of thumb” is to avoid VARCHAR when the
maximum savings is less then thirty characters.

Advantage: Save storage

Disadvantage: Degrades performance of UPDATE

(C) Frieder, Grossman, & Goharian 1996, 2002 102

Nulls
An attribute may be defined as null.
This indicates that the value is unknown and avoids the
need for user-defined special indicators.

CREATE TABLE EMPLOYEE

(emp# SMALLINT,
name CHAR(20),
salary DECIMAL(5,2) NULL)

(C) Frieder, Grossman, & Goharian 1996, 2002 103

Effect of Nulls on Performance

Any data in a tuple that allows nulls is prefaced by

a null indicator.

Null Indicator No Null

200 Hank Null 252.35 200 Hank 252.35
Indicator
(1 byte)

(C) Frieder, Grossman, & Goharian 1996, 2002 104

Effect of Nulls on Performance

A table that specifies about NULLS results in one byte

added for each tuple stored in the relation.
This can be a tremendous waste of storage if no data are
ever NULL.
Most DBMS default to allow NULLS, while many real
world applications do not require NULLS.
Performance of retrieval and update is slightly degraded
because the null indicator must be examined before
checking tuple content.

(C) Frieder, Grossman, & Goharian 1996, 2002 105

Syntax Modifications for Nulls

Allow NULL specification in INSERT

» To add an employee whose salary is unknown:
– INSERT INTO EMPLOYEE (3,’Hank’, null)
Use of NULL in select.

SELECT *
FROM EMPLOYEE
WHERE salary IS NULL

(C) Frieder, Grossman, & Goharian 1996, 2002 106

CREATE INDEX
The relational model does not specify how data should be
accessed.
To create a separate access path, SQL allows users to use
CREATE INDEX to create a separate structure, called
access method. Usually B+tree is used.

Ex: CREATE UNIQUE INDEX I1 ON EMPLOYEE (num)

SELECT *
FROM EMPLOYEE
WHERE num = 25
will use a B-tree instead of a sequential scan.
(C) Frieder, Grossman, & Goharian 1996, 2002 107
DROP INDEX / TABLE
To remove an index use:

DROP INDEX <index name>

To remove a table use:

DROP TABLE <table name>

(C) Frieder, Grossman, & Goharian 1996, 2002 108

Referential Integrity
When a primary key is modified, it is often necessary to
delete the corresponding foreign keys.
A single employee id might be a foreign key in many
tables.

Employee

Colleges Projects Dependents

(C) Frieder, Grossman, & Goharian 1996, 2002 109

Specification of Primary and
Foreign Key
EMPLOYEE(emp#, name, salary) and
COLLEGE (emp#, col#, col_name)
emp# is a primary key in the EMPLOYEE relation and
emp# is a foreign key of the COLLEGE relation.
CREATE TABLE EMPLOYEE CREATE TABLE COLLEGE
(emp# SMALLINT, (emp# SMALLINT,
name CHAR(20), col# SMALLINT,
salary DECIMAL(5,2), col_name CHAR(20),
PRIMARY KEY (emp#)) FOREIGN KEY K1 (emp#)
REFERENCES EMPLOYEE
ON DELETE CASCADE,
PRIMARY KEY (emp#, col#))
(C) Frieder, Grossman, & Goharian 1996, 2002 110
Referential Integrity (continued)
ON DELETE:
» [CASCADE, SET NULL, RESTRICT]
CASCADE
» A delete to a primary key results in a delete of all corresponding
tuples that contain the foreign key.
SET NULL
» A delete to a primary key results in null values placed in all
corresponding foreign keys.
RESTRICT
» A delete to primary key results in an error if a matching foreign
key exists.

(C) Frieder, Grossman, & Goharian 1996, 2002 111

Views
A view is a logical relation. It is defined as a subset of tuples and
attributes of a physical (base) relation.

Syntax: CREATE VIEW <view name> as <sql select>

Ex: Create a view on the EMPLOYEE relation such that the salary
attribute is omitted.

CREATE VIEW V1 AS
(SELECT num, name FROM EMPLOYEE)

Now a user may be given access to only V1 without access to the

base relation: EMPLOYEE.

(C) Frieder, Grossman, & Goharian 1996, 2002 112

Views (continued)
The tuples in the base relation may be restricted by adding a
WHERE condition to the view definition.

EX: Create a view that contains information only about employees

in named ‘Steve’

CREATE VIEW V2 as (SELECT *

FROM EMPLOYEE
WHERE name = ‘Steve’)

Any queries against V2 are executed by merging the view definition

with the query to ensure the result set only accesses data allowed by
the view.

(C) Frieder, Grossman, & Goharian 1996, 2002 113

View Insert
An insert to a view that does not contain all of the attributes in
the base relation results in the additional attributes being set to
NULL. This is only valid if nulls are permitted.

Ex: Consider the view V1 that omits the SALARY attribute.

INSERT INTO V1 (4, ‘Hank’) is equivalent to:

INSERT INTO EMPLOYEE (4,’Hank’, null)

A null value will be placed in the salary attribute.

(C) Frieder, Grossman, & Goharian 1996, 2002 114

Security and Authorization
Access to relations and views is controlled by the GRANT and
REVOKE statements.

GRANT [ALL, SELECT, INSERT, UPDATE, DELETE]

ON <object name> to <list of userid>

Ex: Give John access to all EMPLOYEE data and ensure that Mary
and Sue may not look at employee salaries.

GRANT ALL ON EMPLOYEE TO JOHN

GRANT SELECT ON V1 TO MARY, SUE

(C) Frieder, Grossman, & Goharian 1996, 2002 115

GRANT (continued)
Optionally, “with grant option” may be specified to allow
the recipient to grant access to the privileges which are
being given.
If user1 issues:
» GRANT SELECT ON T1 TO USER2
WITH GRANT OPTION
User 2 may now:
» GRANT SELECT ON USER1.T1 to USER3

(C) Frieder, Grossman, & Goharian 1996, 2002 116

GRANT (continued)
A GRANT of the UPDATE operation may be restricted to
only certain columns of the object.
Ex:
» Give John the ability to update only the SALARY
attribute.
» GRANT UPDATE(salary) on EMPLOYEE to JOHN
» “PUBLIC” is a special user name that implies all users
» GRANT ALL ON SUPPLIER TO PUBLIC gives all
users access to all of the tuples in the SUPPLIER
relation.
(C) Frieder, Grossman, & Goharian 1996, 2002 117
Revoke
Removes access from a user

REVOKE <access> ON <object name>

FROM <list of userids>

Ex: remove Mary’s access to look at employee data.

REVOKE SELECT ON EMPLOYEE
FROM MARY

(C) Frieder, Grossman, & Goharian 1996, 2002 118

System Catalogs
System catalogs contain metadata (data about data). These
catalogs can be queried with any valid SQL SELECT.

When a user issues a CREATE TABLE statement, the

following catalogs are updated:
» A tuple is added to the TABLES relation that indicates the name
of the table and who created, time of creation, etc.
» One tuple is added to the COLUMNS relation for each column in
the CREATE TABLE statement to indicate the name and the data
type of each column.
» A tuple is added to the TABLE_AUTH to indicate that the creator
has access to the relation.
(C) Frieder, Grossman, & Goharian 1996, 2002 119
Performance Aspects of Catalogs
Catalogs often are a “hot spot” in that many users issuing
DDL against the catalogs will result in contention.
Some catalogs store information about when applications
have been processed.
Most installations support separate development and
production systems.

(C) Frieder, Grossman, & Goharian 1996, 2002 120

Embedded SQL
Some queries can not be answered conveniently by only
SQL commands.
The use of SQL commands within a program in a host
language is called embedded SQL.
The data types of SQL might not be recognized by host
language, and vise versa. Thus, casting the data values.
Programming languages typically do not support set of
rows. Thus use of Cursors.

(C) Frieder, Grossman, & Goharian 1996, 2002 121

Trigger
Trigger is a procedure that is invoked by DBMS as the
result of a transaction to perform the following:
Maintaining database integrity
Another database transaction
Alerting users
Supporting auditing and security checks by creating logs
Collecting Statistics

(C) Frieder, Grossman, & Goharian 1996, 2002 122

Database Design

(C) Frieder, Grossman, & Goharian 1996, 2002 123

Problem
Given some body of data to be represented in the database,
what is the best logical structure for the data?
» Identify Entities
» Identify Relationships

Focus of this effort is on logical design, not physical

(C) Frieder, Grossman, & Goharian 1996, 2002 124

Design Approaches
Entity Relationship Model
» Identify the general entities and relationships

Normalization
» Refine the design

(C) Frieder, Grossman, & Goharian 1996, 2002 125

Entity Relationship Model
Developed by Peter Chen in 1976
Entity
» A Distinguishable Object
» Regular Entity
» Weak Entity
» Entities have properties
Relationship

(C) Frieder, Grossman, & Goharian 1996, 2002 126

ER Diagram
Each entity is shown as a rectangle
Weak entities have a double rectangle (not shown here)
Properties are shown as ellipses with the name of the
property in question.
Properties are attached to the specific entity.

SSN

Name
Employee
Salary

(C) Frieder, Grossman, & Goharian 1996, 2002 127

Drawing Relationships
Each relationship is labeled as a diamond.

The participants are connected to the relevant relationship

with a line. Each line is labeled 1 or M to indicate the type
of relationship: (1-1, 1-M, M-1 or M-M).

(C) Frieder, Grossman, & Goharian 1996, 2002 128

Relationship Examples

DEPARTMENT 1 M
Dept-Emp EMPLOYEE

EMPLOYEE 1 1
Emp-Sp SPOUSE

SUPPLIER M M
Sup-part PARTS

(C) Frieder, Grossman, & Goharian 1996, 2002 129

ER Diagram
Weak Entity
» has a discriminator attribute (ex: p#)
» does not exist without the strong entity (ex: no payment exists
without loan)
Weak entity requires the primary key of the strong entity
along with the discriminator to be identified.
P-date
L# ssn L-amount P# amount

1 loan- M
Loan payment Payment

(C) Frieder, Grossman, & Goharian 1996, 2002 130

DDL From E-R diagrams
Each entity may result in a relation whose attributes are
the properties for the entity

Each relationship may result in a relation whose attributes

link the entities described in the relationship

(C) Frieder, Grossman, & Goharian 1996, 2002 131

Example (1-M, M-1)
city p# pname
s# sname color

1 M
Supplier Supl-part Parts

Supplier (s#, sname, city)

Parts ( p#, pname, color)
Sup-part (s#, p#)
OR:
Supplier (s#, sname, city)
Parts ( p#, pname, color,s#)
(C) Frieder, Grossman, & Goharian 1996, 2002 132
Example (1-M, M-1)
city p# pname
s# sname color

1 M
Supplier Supl-part Parts

qty
date

Supplier (s#, sname, city)

Parts ( p#, pname, color)
Sup-part (s#, p#, qty,date)

(C) Frieder, Grossman, & Goharian 1996, 2002 133

Example (1-1)
salary s-ssn name
ssn name DOB

1 1
Employee Emp-Sp Spouse

Employee (ssn, name, salary)

Spouse (s-ssn, name, DOB) OR:
Emp-Sp (ssn, s-ssn) Employee (ssn, name, salary,s-ssn)
OR: Spouse (s-ssn, name, DOB)
Employee (ssn, name, salary)
Spouse (s-ssn, name, DOB,ssn)
(C) Frieder, Grossman, & Goharian 1996, 2002 134
Example (M-M)
salary dname city
ssn name

M M
Employee Emp-Dept Department

Employee (ssn, name, salary)

Department (dname, city)
Emp-Dept (ssn, dname)

(C) Frieder, Grossman, & Goharian 1996, 2002 135

Example (Weak Entity)
P-date
L# ssn L-amount P# amount

1 loan- M
Loan payment Payment

Loan (L#, ssn, L-amount)

Payment (L#, P#, amount, P-date)

(C) Frieder, Grossman, & Goharian 1996, 2002 136

Example (Class Hierarchies)
ssn name Staff (ssn, name,phone)
Full Time (ssn,salary,start-date)
phone Part Time (ssn, hourly-rate)
Staff

OR:
IS-A
Full Time (ssn,salary,start-date,
Full Time Part Time
name,phone)
Part Time (ssn, hourly-rate,
Start-date name, phone)
Salary Hourly-rate

(C) Frieder, Grossman, & Goharian 1996, 2002 137

Example
(Ternary Relationship)
salary dname city
ssn name

M M
Employee E-D-P Department

M
Project pname

Employee (ssn, name, salary)

pno
Department (dname, city)
Project (p-no,pname)
E-D-P (ssn, dname,pno)

(C) Frieder, Grossman, & Goharian 1996, 2002 138

Example
(Aggregation)
salary dname city
ssn name

M M
Employee Emp-Dept Department

M
active date
Employee (ssn, name, salary) 1
Department (dname, city) pname
Project
Project (p-no,pname)
Emp-Dept (ssn, dname) pno
Active (ssn, dname,pno,date)

(C) Frieder, Grossman, & Goharian 1996, 2002 139

Design Scenario
Some Institute of Technology (SIT) is considering to
modernize its administrative functions including is
campus-wide information database. Since SIT is in the
same financial shape as all universities are, it was decided
to collect many free designs and evaluate them with the
hope that some would be perfectly developed. That is
were you come in. You have been graciously volunteered
to design their database. You must state all of your
assumptions. Failure to do so will lead to “general”
assumptions and might violate your design. Your design
must be in at least 3NF and must be able to store and
query data on the following topics:

(C) Frieder, Grossman, & Goharian 1996, 2002 140

Design Scenario Requirements

• Academic Structure:
• Colleges – containing many departments, head by a Dean, etc.
• Departments – containing many labs, faculty, courses,
students, head by a Chair, etc.
• Classes:
• Locations, prerequisites, offerings, professors
• Personnel:
• Names, social security numbers, children, offices, phone
numbers, email addresses, salary, etc.
• Cafeteria Offerings:
• Prices, item selection based on meal (breakfast, lunch, and
dinner), purchases (date & cost), location, manager, etc.

(C) Frieder, Grossman, & Goharian 1996, 2002 141

ERD for Design Scenario

Lab
Children
College 1 M
M
DC DL

M FC
1
DS
1 Department M M
DF
M M
1 Faculty
Student
M CS CD M
term loc
Sec# M M
M CF
M Course
CP term Sec#
M
(C) Frieder, Grossman, & Goharian 1996, 2002 142
Design Relation Definition

Relations (Academic)
» College( college#, dean, college_nm)
» Dept( dept#, chair, dept_nm)
» Lab (lab_nm, bld#, room#, capacity)
» Student (ssn, first_nm, last_nm)
» Faculty (f_ssn, first_nm, last_nm, salary, phone,email)
» Course (c#,course_nm,credit_hrs)

(C) Frieder, Grossman, & Goharian 1996, 2002 143

Design Relation Definition

Relationships (Academic)
» DC (dept#, college#)
» DL (lab_nm, dept#)
» DS (ssn, dept#) /* ssn of student */
» DF (dept#, f_ssn) /* ssn of faculty */
» CF (c#, f_ssn, term, sec#, loc)
» CD (c#, dept#)
» CS (c#, ssn, term, sec#)
» CP( course#, prereq# ) /* Prereqs of Courses /*

(C) Frieder, Grossman, & Goharian 1996, 2002 144

Design Relation Definition

Relations (Personnel)
» Children (ssn, f_nm,l_nm, age)

Relationship (Personnel)
» FC (f_ssn, ssn) /* ssn of faculty and child */

Cafeteria
» Meal( meal#, time )
» Café( café#, address, phone# )
» … etc.
(C) Frieder, Grossman, & Goharian 1996, 2002 145
Sample Query 1

Who is teaching CS 425 Section 051 during the fall 1999

term and where is it taught?

SELECT f_ssn, loc

FROM CF
WHERE (c# = ‘CS425’
AND Sec# = ‘051’
AND Term = ‘Fall99’)

(C) Frieder, Grossman, & Goharian 1996, 2002 146

Sample Query 2

What are the names of the children of the faculty who

taught classes offered at the Stuart Building during the
fall 1998 term?

SELECT Children.f_nm, Children.l_nm

FROM Children, Faculty, FC , CF
WHERE (loc = “Stuart Building”
AND term = “Fall98”
AND CF.f_ssn = Faculty.f_ssn
AND FC.f_ssn = Faculty.f_ssn
AND FC.ssn# = Children.ssn)

(C) Frieder, Grossman, & Goharian 1996, 2002 147

Sample Query 3

What are the common meal offerings that are available for
both lunch and dinner?

SELECT A.meal#
FROM Meal A, Meal B
WHERE (A.meal# = B.meal#
AND A.time = “Lunch”
AND B.Time = “Dinner”)

(C) Frieder, Grossman, & Goharian 1996, 2002 148

Normalization
Different normal forms have been defined to characterize
a database design. Each NF is progressively more
restrictive.
U

3NF

2NF

1NF

(C) Frieder, Grossman, & Goharian 1996, 2002 149

Functional Dependency
Functional Dependency (FD)
» A many-one or many-many relationship from one set of
attributes to another within a given relation
– ex: Supplier (S#), Part (P#) --> QTY
» Functional dependency still holds even though for
many combinations of S# and P#, there is only one
quantity

(C) Frieder, Grossman, & Goharian 1996, 2002 150

Broader FD
A functional dependency can be defined if the attribute
values for x uniquely determine those in y.
A functional dependency is a statement about a relational
scheme (i.e., all possible relations), and cannot be deduced
from a particular relation.

(C) Frieder, Grossman, & Goharian 1996, 2002 151

Example
S# CITY P# QTY
S1 London P1 100
S1 London P2 100
S2 Paris P1 200
S2 Paris P2 200
S3 Paris P2 300
S4 London P4 400
S4 London P5 400

Functional Dependency:
Whenever two tuples have the same value
of x, they will also have the same value of y.

Ex: S# -> CITY

(C) Frieder, Grossman, & Goharian 1996, 2002 152

FD Diagrams
Draw a line to represent irreducible FD’s

SNAME

S# STATUS

CITY

An arrow will always exist from the primary key to the non-key
attributes. Problems usually exist when other arrows are
present. Normalization may be informally defined as the
process by which extra arrows are removed.

(C) Frieder, Grossman, & Goharian 1996, 2002 153

Closure Of A Set Of Dependencies

Question: What is F+?

Answer: F+ is the closure of F - the set of all

functional dependencies derivable
from F.

Equivalently, F+ is the set of all dependencies

that follow from F by Armstrong’s axioms.

(C) Frieder, Grossman, & Goharian 1996, 2002 154

Armstrong’s Axioms
Let R be a relational scheme with attributes U and
functional dependencies FD.

» Reflexivity - If Y is a subset of X, then X=>Y.

» Augmentation - If X=>Y, and Z is a subset of U, then XZ=>YZ.

» Transitivity - If X=>Y and Y=>Z, then X=>Z.

(C) Frieder, Grossman, & Goharian 1996, 2002 155

Completeness Proof

Given that Armstrong’s axioms are complete prove

that the FD set {Reflexivity, Augmentation, and
Pseudotransitivity} is complete.

Reflexivity - If Y is a subset of X, then X=>Y.

Augmentation - If X=>Y then XZ=>YZ.
Transitivity - If X=>Y and Y=>Z, then X=>Z.
Pseudotransitivity - If X=>Y and WY=>Z, then
XW=>Z.

(C) Frieder, Grossman, & Goharian 1996, 2002 156

Completeness Proof

To demonstrate a FD set A is complete given a

complete FD set B, must demonstrate that FD
set A derives every FD in FD set B.
FD Set A FD Set B (Armstrong’s Axioms)
Reflexivity => Reflexivity
Augmentation => Augmentation

To derive Transitivity, set W = {} in Pseudotransitivity, hence

Pseudotransitivity simplifies to Transitivity.

(C) Frieder, Grossman, & Goharian 1996, 2002 157

Armstrong’s Axioms Extended
(Projectivity)

Projectivity - If X=>YZ, then X=>Y and X=>Z.

1. X => YZ Given
2. YZ => Y Reflexivity
3. X => Y Transitivity 1, 2
4. YZ => Z Reflexivity
5. X => Z Transitivity 1, 4

(C) Frieder, Grossman, & Goharian 1996, 2002 158

Armstrong’s Axioms Extended
(Additivity)
Additivity - If X=>Y and X=>Z, then X=>YZ.

1. X => Y Given
2. X => Z Given
3. XY => YZ Augment Y on 2
4. X => XY Augment X on 1
5. X => YZ Transitivity 4, 3

(C) Frieder, Grossman, & Goharian 1996, 2002 159

Armstrong’s Axioms Extended
(Pseudotransitivity)

Pseudotransitivity - If X=>Y and WY=>Z, then XW=>Z.

1. X => Y Given
2. WY => Z Given
3. XW => YW Augment W on 1
4. XW => Z Transitivity 3, 2

(C) Frieder, Grossman, & Goharian 1996, 2002 160

Functional Dependency Derivation
Using the Extended Armstrong’s Axioms

Given the following functional dependency set,

FD = { AB=>E,
AG=>J,
BE=>I,
E=>G,
GI=>H },
prove
AB=>GH.
(C) Frieder, Grossman, & Goharian 1996, 2002 161
Derivation Proof

1. AB => E Given
2. BE => I Given
3. E => G Given
4. GI => H Given
5. AB => G Transitivity 1, 3
6. AB => BE Augment B to 1
7. AB => I Transitivity 6, 2
8. AB => GI Additivity 5, 7
9. AB => H Transitivity 8, 4
10. AB => GH Additivity 5, 9

(C) Frieder, Grossman, & Goharian 1996, 2002 162

Closure of Attributes

Without computing F+, can identify if a given FD

(X => Y) is in F+.
Obtaining candidate keys
Algorithm:
closure = a;
while there are changes to closure do
if there is a FD x=>y such that x is subset of closure
then closure = closure U y;
end;

(C) Frieder, Grossman, & Goharian 1996, 2002 163

Closure of Attributes (Example)

R = (A, B, C, D)
FD {A=>B, C=>A, C=>D}

A+ = A, B
B+ = B
C+ = C, A, D, B
D+ = D

(C) Frieder, Grossman, & Goharian 1996, 2002 164

1NF
A relation is in 1NF if and only if all
underlying domains contain scalar values.
No multi-valued attributes within a single
“cell” of a relation.

(C) Frieder, Grossman, & Goharian 1996, 2002 165

Example of 1NF
The following relation is NOT in 1NF
Student # Courses
1 {CS425, CS595, CS100} Key: <Student#>
2 {CS525, CS548}
...

The following relation is in 1NF

Student # Courses
1 CS425
1 CS595
1 CS100
Key: <Student#, Courses>
2 CS525
2 CS548
...

(C) Frieder, Grossman, & Goharian 1996, 2002 166

Problems with 1NF
Consider the relation: (relation in 1NF but not 2NF)
Professor# Student# Course Goal
P1 S1 425 M.S.
P1 S2 100 Ph.D.
P1 S4 595 Ph.D.
P2 S1 525 M.S.
P2 S2 525 Ph.D.
P2 S3 548 M.S.
P3 S2 548 Ph.D.
P3 S4 425 Ph.D.
P4 S4 525 Ph.D.
P5 S5 548 M.S.

FD: {Professor#, Student# -> Course, Student# -> Goal}

Key: < Professor#, Student# >

(C) Frieder, Grossman, & Goharian 1996, 2002 167

1NF Problems
Insertion Anomaly
» Can not insert new professors without them teaching courses

Deletion Anomaly
» Deletion of a tuple describing a particular student (S5) eliminates
additional valid information (P5 exists)

Update Anomaly
» The Goal value appears many times for the same student and is
redundant.

(C) Frieder, Grossman, & Goharian 1996, 2002 168

Fixing the problems with 1NF

To fix the problems, place information that is logically

separate in separate relations.

STUDENTS
» Facts about the individual students

PROFESSOR-STUDENTS
» Facts about where a professor and a student first met

(C) Frieder, Grossman, & Goharian 1996, 2002 169

2NF
Consider the following two relations:
STUDENTS PROFESSOR-STUDENTS
Professor# Student# Course
Student# Goal
P1 S1 425
S1 M.S.
P1 S2 100
S2 Ph.D.
P1 S4 595
S3 M.S.
P2 S1 525
S4 Ph.D.
P2 S2 525
S5 M.S.
P2 S3 548
P3 S2 548
Key: < Student# > P3 S4 425
P4 S4 525
And by shear luck… P5 S5 548
also 3NF Key: < Professor#, Student# >

Note: Database valid for student information only!!!

(C) Frieder, Grossman, & Goharian 1996, 2002 170
2NF
A relation is in 2NF if and only if it is in 1NF and
every non-key attribute is fully dependent on the
primary key. That is, no non-key attribute is
dependent on only part of the key.

<Professor#, Student#> is the key, but Goal is

dependent on only Student#

(C) Frieder, Grossman, & Goharian 1996, 2002 171

Problems with 2NF
Consider the relation in 2NF but not in 3NF:
Emp# Age City Experience
1 30 Chicago Accounting
2 45 Washington Computer Science
3 52 New York Physics
4 44 Washington Computer Science
5 50 Chicago Accounting
6 52 London Accounting
7 49 Washington Computer Science

FD: {Emp# ->Age, City, Experience, City ->Experience}

Key: < Emp# >

(C) Frieder, Grossman, & Goharian 1996, 2002 172

Problems with 2NF
Insertion Anomaly
» Can not insert fact that a city has a particular expertise until we
have an employee located in the particular city.

Deletion Anomaly
» Some deletes will not only eliminate the fact that an employee
exists in a given location (employee number 6), but will also
remove the information about the expertise in a city (London and
accounting).
Update Anomaly
» Experience is redundant

(C) Frieder, Grossman, & Goharian 1996, 2002 173

Fixing problems with 2NF
Replace with information about
» EMPLOYEE
– Facts about employees
» OFFICE
– Facts about company offices

That is, create a 3NF version of the design

(C) Frieder, Grossman, & Goharian 1996, 2002 174

3NF
A relation is in 3NF if and only if it is in 2NF
and every non-key attribute is non-
transitively dependent on the primary key.
Or in English:
» Every non-key attribute is dependent on the key
and nothing but the key.

(C) Frieder, Grossman, & Goharian 1996, 2002 175

FD’s for Example

Emp# -> City

City Emp# -> Experience
Emp# -> Age
City -> Experience
Emp#
Note: There is a transitive dependency
Emp# -> City and City -> Experience
Experience implying Emp# -> Experience

Age

(C) Frieder, Grossman, & Goharian 1996, 2002 176

3NF

OFFICE EMPLOYEE
City Experience Emp# Age City
Chicago Accounting 1 30 Chicago
Washington Computer Science 2 45 Washington
New York Physics 3 52 New York
London Accounting 4 44 Washington
5 50 Chicago
6 52 London
Key: <City >
7 49 Washington

Key: < Emp# >

No anomalies exist!

(C) Frieder, Grossman, & Goharian 1996, 2002 177

3NF – Unfortunately!!!
Consider the following two relations:
STUDENTS PROFESSOR-STUDENTS
Professor# Student# Course
Student# Goal
P1 S1 425
S1 M.S.
P1 S2 100
S2 Ph.D.
P1 S4 595
S3 M.S.
P2 S1 525
S4 Ph.D.
P2 S2 525
S5 M.S.
P2 S3 548
P3 S2 548
Key: < Student# > P3 S4 425
P4 S4 525
Relations are in 3NF, P5 S5 548
but anomalies exist !!! Key: < Professor#, Student# >

(C) Frieder, Grossman, & Goharian 1996, 2002 178

Anomalies Dependency
For student information, no anomalies exist.

For the professorial information, anomalies exist !!!

» Insertion Anomaly:
– Can not insert P6 if never taught a student

» Deletion Anomaly:
– If S5 terminates the university, P5 is dismissed!!!
– ( You may like it, but as a professor, I do not!!! )

(C) Frieder, Grossman, & Goharian 1996, 2002 179

3NF Example
FD: xy {trivial FD; x is a super key; y part of a key}

FD: {ssnsid, nm , sidssn} CK: {ssn;sid}

R: ssn nm sid
1 Mary 10
3 Joe 30
2 Mary 20
What is the normal form of relation R?
Do you see any anomaly?

(C) Frieder, Grossman, & Goharian 1996, 2002 180

3NF Example
FD:{sid,programdegree; degreeprogram}
Key: sid, program
R:
sid program degree
10 U BS
10 M MS
20 U BS
30 U BA
What is the normal form of relation R? why?
Do you see any anomaly?

(C) Frieder, Grossman, & Goharian 1996, 2002 181

BCNF Example
FD: xy {trivial FD; x is a super key}

R: (sid, program, degree)

FD:{sid,programdegree; degreeprogram} Key:
sid, program
sid degree degree program
10 BS BS U
R1: 10 MS R2:
MS M
20 BS BA U
30 BA
What is the normal form of relation R? why?
Do you see any redundancy?
(C) Frieder, Grossman, & Goharian 1996, 2002 182
Boyce-Codd Normal Form
(BCNF)

A relation is in BCNF if and only if it is in

3NF and every attribute is non-transitively
dependent on the primary key.
Or in English:
» Every attribute is dependent on the key and
nothing but the key.

(C) Frieder, Grossman, & Goharian 1996, 2002 183

Loss-Less Decomposition
R1 ∩ R2 R1 or R1 ∩ R2R2

R1 (sid,degree) and R2 (degree,program)

R1 ∩ R2 R2 as we have FD: degreeprogram

Thus: loss-less.
sid program degree
10 U BS
R1 natural join R2: 10 M MS
20 U BS
30 U BA
(C) Frieder, Grossman, & Goharian 1996, 2002 184
Lossy Decomposition
R1 (sid,program) and R2 (degree,program)
R1 ∩ R2 : program ≠> R1 or R2
Thus: lossy decomposition.
R1 natural join R2:
sid program degree program sid program degree
10 U 10 U BS
BS U
R1: 10 M R2: 10 U BS
MS M
20 U 30 U BS
BA U
30 U 10 M MS
10 U BA
20 U BA
(C) Frieder, Grossman, & Goharian 1996, 2002
30 U BA 185
Dependency Preservation
F+ = (Fx ∪ Fy)+
R: (sid, program, degree)
FD:{sid,programdegree; degreeprogram}
Key: sid, program

R1 (sid,degree) and R2 (degree,program)

FD: {degreeprogram}

F+ ≠ (Fx ∪ Fy)+
Have to join to verify FD: sid,programdegree
(C) Frieder, Grossman, & Goharian 1996, 2002 186
Minimal Cover
Create the smallest set of functional dependencies by:
•Removing transitive dependencies from non-key attributes to key.
ssnprogram, degree => ssnprogram
program degree
•Union all functional dependencies that the left hand side is the same.
ssn DOB
ssn name
=> ssnDOB, name
•Remove the extra attribute from the left hand-side of FD if the
FD is valid after removal of that attribute.
ssnaddress
ssn, name address
=> ssnaddress
(C) Frieder, Grossman, & Goharian 1996, 2002 187
Decomposition to 3NF using
Minimal Cover
IF relation R is not in 3NF then
Create the Minimal Cover of FD on R.
Create a Relation Ri for each FD in Minimal Cover.
If the key of relation R is not in its entirety included in
any of the relations Ri, then create one more relation
with that key.

This decomposition is loss-less and preserves

dependencies.

(C) Frieder, Grossman, & Goharian 1996, 2002 188

Query Optimization

(C) Frieder, Grossman, & Goharian 1996, 2002 189

Query Optimization
A key difference in the relational model and other models
is that it is entirely up to the DBMS how data are
retrieved.

(C) Frieder, Grossman, & Goharian 1996, 2002 190

Query Processing Components
Each SQL statement is implemented with the following
steps:

» Parse - Identify tokens in the query

» Develop internal representation of the query (attempt to
have an internal form such that two queries with
different syntax, but similar functionality will have a
uniform internal representation)
» Execute the optimizer to choose the best access path to
the data.
(C) Frieder, Grossman, & Goharian 1996, 2002 191
Access Paths
Typical choices the optimizer must make are:

» Is it better to implement a sequential scan than to use a

b-tree?
» Which b-tree should be used?
» Given five relations that must be joined, what order
should be implemented?

(C) Frieder, Grossman, & Goharian 1996, 2002 192

Selectivity
Selectivity of a condition refers to the ratio of the number
of tuples that satisfy a condition to the total number of
tuples in the relation.
Selectivity of a primary key is 1/N where N is the number
of tuples
An attribute with i distinct values will have a selectivity of
(N / i) / N, assuming a uniform distribution.
Typically, an index is used for terms with “good”
selectivity (i.e., emp#) and a scan is used for terms with
“bad” selectivity (i.e., gender)

(C) Frieder, Grossman, & Goharian 1996, 2002 193

Optimization with Boolean Logic
(Conjunctive Expressions)

A conjunctive (only AND) query, e.g., A and B and C, is

easy to optimize since any false match terminates the
search.

In a conjunctive query, order the conditions according to

selectivity, the lower the selectivity, the earlier is the
condition evaluated.

Lower selectivity values (primary keys being the lowest)

are typically indexed.
(C) Frieder, Grossman, & Goharian 1996, 2002 194
Optimization with Boolean Logic
(Disjunctive Expressions)

A disjunctive (only OR) query, e.g., A or B or C, is more

difficult to optimize.

In a disjunctive query, all conditions must be evaluated

since truth in any of the conditions validates the
expression.

Poor access paths, if they exist, none-the-less must be

evaluated.

(C) Frieder, Grossman, & Goharian 1996, 2002 195

Join Algorithms
Nested (inner-outer) Loop. For each tuple in the outer
relation R, retrieve every tuple in S (inner) and test
whether or not the join condition is satisfied.

Merge-Scan. Sort R and S by the join attributes. Scan the

tuples matching those that have match conditions that
restrict R and S.

(C) Frieder, Grossman, & Goharian 1996, 2002 196

Join Order Optimization
(Cost Based)

Consider a join of EMPLOYEE with DEPARTMENT

Assume EMPLOYEE requires e data blocks to store.
Assume DEPARTMENT requires d data blocks to store.

Nested Loop Join (with n+1 blocks of memory available)

I/O demands are computed as:
– There are two potential algorithms:
EMPLOYEE as the outer, DEPARTMENT as inner
DEPARTMENT as outer, EMPLOYEE as inner

(C) Frieder, Grossman, & Goharian 1996, 2002 197

Join Optimization (continued)
EMPLOYEE as outer, DEPARTMENT as inner
» EMPLOYEE is read into the n+1st buffer one block at a time, n
blocks at a time of DEPARTMENT are read and then scanned:
– Reading EMPLOYEE requires e block reads
– The nested join, requires e/n iterations as only n blocks may be placed in
memory at one time.
– For each of the e/n iterations, a full scan of d blocks is required.
» Total I/O = e + (e) (d / n)

Reversing the join order yields:

» Total I/O = d + (d) (e / n)

(C) Frieder, Grossman, & Goharian 1996, 2002 198

Join Order (continued)
The importance of join order is seen when values are
substituted into the equations.
For d = 10 and e = 50 and n = 5:
» EMPLOYEE as outer, DEPARTMENT as inner
» Total I/O = e + (e) (d / n)
» Total I/O = 50 + (50)(10 / 5) = 150
Reversing the join order yields:
» Total I/O = d + (d ) (e / n)
» Total I/O = 10 + (10)(50 / 5) = 110
Hence, for this situation, it is better to use
DEPARTMENT as the outer relation.
(C) Frieder, Grossman, & Goharian 1996, 2002 199
Join Optimization (continued)

Rule of Thumb for Nested Loop Join Computations

The smaller number of relevant tuples

should be the outer relation

(Many other optimization techniques are available)

(C) Frieder, Grossman, & Goharian 1996, 2002 200

Cost Based Optimization
Obtain cost estimates for each execution strategy.
The cost depends on:
» cost to access secondary storage
» storage cost
– need for temporary files
» computational cost
– cost of in-memory operations
» communication cost
– cost of shipping the query from the server to the client

(C) Frieder, Grossman, & Goharian 1996, 2002 201

Cost-Based Optimization
(Metadata Use)

System catalogs contain data that are used to estimate the

cost for each access path.
Metadata used includes:
» number of tuples
» number of blocks
» existence of indexes
» number of levels in a B-tree index
» number of distinct values found in the index
(selectivity) of an attribute.

(C) Frieder, Grossman, & Goharian 1996, 2002 202

Query Trees
(Rule Based Optimization)

Query Tree
» Structure that corresponds to a relational algebra
expression by representing relations as leaf nodes.

(C) Frieder, Grossman, & Goharian 1996, 2002 203

Query Tree Transformation
Ex: Find the last names of employees born after 1957 who work on
a project named Aquarius
EMPLOYEE (ssn, lname, fname, bdate) - e tuples
WORK_ON (essn, p#) - w tuples
PROJECTS (pno, pname) - p tuples
SELECT lname
FROM EMPLOYEE, WORKS_ON, PROJECTS
WHERE (essn = ssn) and (p# = pno) and
(pname = ‘Aquarius’) and (bdate > 1957)

Initial query tree will JOIN the three relations first and then perform
the selections and projections. O(e w p) tuples will be accessed.

(C) Frieder, Grossman, & Goharian 1996, 2002 204

Initial Tree
PROJECT (lname)

SELECT (pname = ‘Aquarius’, bdate > 1957)

Join
(p# = pno)

PROJECTS Join
(essn = ssn)

WORKS-ON EMPLOYEE
(C) Frieder, Grossman, & Goharian 1996, 2002 205
Migrate SELECT
PROJECT (lname)

Step 1: Join
Reduce size of (p# = pno)
join by computing SELECT
early in the process. Join
(essn = ssn)

SELECT (pname = ‘Aquarius’) SELECT (bdate > 1957)

PROJECTS WORKS-ON EMPLOYEE

Both pno and ssn are key attributes. Thus, they both will
yield only one tuple to be retrieved.

The cardinality of PROJECT is less than EMPLOYEE,

thus join PROJECT first.

(C) Frieder, Grossman, & Goharian 1996, 2002 207

Reorder the Joins
PROJECT (lname)

Join
Step 2:
(essn = ssn)
Order joins according
to lower input and result
Join
sizes
(p# = pno)

SELECT (pname = ‘Aquarius’) SELECT (bdate > 1957)

PROJECTS WORKS-ON EMPLOYEE

(C) Frieder, Grossman, & Goharian 1996, 2002 208

Final Query Tree
PROJECT (lname)
Step 3:
Move Projects
early to reduce Join (essn = ssn)

temporary
relation sizes PROJECT (essn) PROJECT (ssn, lname)

Join (p# = pno) SELECT (bdate > 1957)

PROJECT (pno) PROJECT (essn,p#) EMPLOYEE

SELECT (Aquarius) WORKS-ON

PROJECTS
(C) Frieder, Grossman, & Goharian 1996, 2002 209
Overview of Key Rules
Partition each select of (A and B and C) to SELECT (A),
SELECT (B), and SELECT(C)
Move each select as far down the tree as possible
Rearrange leaf nodes so that the small answer sets are
processed first. Typically, use smallest selectivity to
estimate this (found in system catalog).
Combine a Cartesian product and a select of joining
conditions in a join
Move projection as far down the tree as possible
Identify sub-trees that represent groups of operations that
may be executed by a single access routine.
(C) Frieder, Grossman, & Goharian 1996, 2002 210
Explain
Many commercial systems provide a utility to identify the
path the optimizer will choose for a given query.
Typically the utility is referred to as EXPLAIN and the
syntax is:

» EXPLAIN <sql statement>

Results are placed into relations that may be queried.

Performance tuning is often done by changing an index
and examining the effect on queries by repeating the
EXPLAIN statement.

(C) Frieder, Grossman, & Goharian 1996, 2002 211

Recovery
and
Concurrency Control

(C) Frieder, Grossman, & Goharian 1996, 2002 212

Recovery and Concurrency
Transaction
» Logical unit of work
» Composed of one or more SQL statements
» Either all of the transaction completes or none of the transaction
is executed.

Ex: Transfer money from savings account to checking.

– Step 1: Subtract from savings
– Step 2: Add to checking

It is critical that either both steps complete or neither.

(C) Frieder, Grossman, & Goharian 1996, 2002 213

Commit and Rollback
Commit
» Indicates that a transaction completed. There is no
BEGIN and END TRANSACTION necessary as
COMMIT marks the END of the current transaction
and the start of the next transaction.
Rollback
» Undo all work completed by the transaction that is
currently in progress. This is implemented by saving
all “in progress” work to a transaction log.

(C) Frieder, Grossman, & Goharian 1996, 2002 214

Recovery
After a failure, a DBMS checks the log to determine:
» The transactions that were in process during the time of failure.
These transactions are rolled back (UNDO).
To avoid lengthy restart times, the system periodically
writes all contents of main memory to disk. This is
referred to as a checkpoint.
After a failure, all transactions that have completed after
the last checkpoint must be redone because data have not
been written from memory to disk.

(C) Frieder, Grossman, & Goharian 1996, 2002 215

Recovery (continued)
In summary, after a failure a DBMS enters an UNDO phase to
rollback all transactions that were in progress and a REDO phase to
again execute the transactions that occurred after the last checkpoint
and before the failure.
Example below
» Undo t3
» Redo t2
t1 t2 t3

Checkpoint Failure
(C) Frieder, Grossman, & Goharian 1996, 2002 216
Resilience to Failure
Power Failure (only)
» Undo and Redo processing
Data Disk Failure
» The last good data archive is restored and all logs since the point
of failure are re-processed to REDO all transactions lost on the
bad disk.
Log Disk Failure
» This is rare, but the only good means of avoiding this problem is
to use dual logs in which all log writes are duplicated.
» Without dual logging it is necessary to restore back to the last
good data archive and all transactions since then are lost.

(C) Frieder, Grossman, & Goharian 1996, 2002 217

Concurrency Problems
Problems occur when two transactions executing
simultaneously become interleaved.
All proposed means of ensuring concurrency must be
shown to be serializable. These algorithms must produce
a result that is equivalent to some (arbitrary) serial
execution of the transactions that they manage.
In summary, it is correct for transaction t1 to precede t2,
or follow t2, but it is not correct for transactions t1
statements to be interleaved with t2.

(C) Frieder, Grossman, & Goharian 1996, 2002 218

Lost Update Problem
Consider the following two transactions: The typical example is a
husband and wife both withdrawing money from the same bank
account.
Husband: Withdraw $10 Wife: Withdraw $10
s1: Read Savings s2: Read Savings
s3: Subtract 10 s4: Subtract 10
s5: Write Savings s6: Write Savings
s7: COMMIT s8: COMMIT
Consider the execution of: s1, s2, s3, s4, s5, s6, s7, s8. If savings
starts at 100, after s1 and s2, each user determines savings is equal
to 100, so after s8, it ends up at 90 even though 20 has been
subtracted. Hence, one update has been lost.

(C) Frieder, Grossman, & Goharian 1996, 2002 219

Uncommitted Dependency
This occurs when a transaction relies upon a value that may be
rolled back by another transaction.

Consider T1: Consider T2:

s1: Add 5 to X s3: Read X
s2: COMMIT s4: Add X to Y
s5: Write Y
s6: COMMIT

If s1, s3, s4, s5, s6 execute, Y now has a value that will be incorrect if T1 does
not execute s2, but instead rolls back.

(C) Frieder, Grossman, & Goharian 1996, 2002 220

Inconsistent Analysis
Consider a transaction T1 that is computing the
average salary of all employees; it scans all
employee records.
Consider a transaction T2 that updates Hank’s
salary.
If transaction T2 is executed and completes before
T1 is finished, but after T1 has examined Hank’s
salary, the result of T1 will be incorrect.

(C) Frieder, Grossman, & Goharian 1996, 2002 221

Concurrency Control
Two different themes surround concurrency
control algorithms:
» pessimistic
– These algorithms are based on the premise that conflict will
occur, they use locks to force conflicting requests to wait until
a lock is released.
» optimistic
– These algorithms assume that conflicts are rare. Hence, a
transaction executes without any waiting on others, but at the
end, a check is made to determine if a conflict occurred (via
timestamps). If so, the entire transaction is rolled back.
(C) Frieder, Grossman, & Goharian 1996, 2002 222
Locking
Two Phase Locking (2 PL) algorithms have two
phases, growing and shrinking, and are
serializable.
As a transaction progresses, it only acquires new
locks, growing.
Upon commit, the transaction releases all locks.
A transaction that adds locks, releases some, and
adds again is not 2 PL and may not be serializable.

(C) Frieder, Grossman, & Goharian 1996, 2002 223

Types of Locks
Share Lock
» This is a read lock. Other users may read data have a
share lock, but no users may write to data that contain a
share lock.
Exclusive Lock
» This is acquired during a write. No users may read data
that have an exclusive lock. For the lost update
problem, the UPDATE statements result in exclusive
locks which preclude the interleaving seen in the
example.
(C) Frieder, Grossman, & Goharian 1996, 2002 224
Lock Granularity
Locks may be held at the row, page, or table level.
A page or table level lock saves space in the
system lock table as only one lock may serve to
lock millions of tuples, but concurrency is
sacrificed.

(C) Frieder, Grossman, & Goharian 1996, 2002 225

Lock Hierarchy
When a request for a lock on a single tuple is issued, an
INTENT lock is acquired for each level of the hierarchy
above it.
This precludes a user issuing a page level lock and
acquiring access to data already held by a row level lock.
Table

Page 1 Page 2

row 1 row2 row 3 row 4

(C) Frieder, Grossman, & Goharian 1996, 2002 226
Lock Hierarchy (continued)
When user 1 requests a share lock on row1, an intent share lock is
placed first on the table, and then on the page, and finally a share
lock is placed on row 1.
Table IS - first

Page 1 IS - second Page 2

row 1 IS row2 row 3 row 4

third
At this point, a request for an exclusive lock on page 1
will be denied as an intent share lock exists.
(C) Frieder, Grossman, & Goharian 1996, 2002 227
Lock Wait
When a user requests data that are currently
locked, a wait ensues. Some DBMS allow a
timeout to be specified that governs the longest a
transaction must wait before a timeout.
This timeout must be balanced with the need to run
many, valid transactions.
All DBMS provide a means of viewing the lock
table, identifying the resources users are waiting on
since this information is critical for resolving
concurrency control problems.
(C) Frieder, Grossman, & Goharian 1996, 2002 228
Lock Levels
Due to the need for improved concurrency DBMS, allow
users to sacrifice integrity for run-time performance. This
is done via different lock levels which may be set
dynamically. Some commercial systems support up to five
different lock levels. We describe the two most common.
Repeatable Read (RR) refers to the strongest lock level. It
is the type of lock we have discussed in which all locks are
held until the end of a transaction.
The key to repeatable read is that if a transaction reads the
same datum twice, it is ensured that it will have the same
value both times.

(C) Frieder, Grossman, & Goharian 1996, 2002 229

Cursor Stability (CS)
For many applications, it not necessary to ensure
maximum integrity.
A transaction using repeatable read that is scanning a one
billion row table effectively causes all users who are
updating rows in the table to enter a lengthy lock wait.
Cursor stability states that only the current rows being
scanned will be locked. Immediately after the row is
scanned the lock is released.
For the 1 billion row table, under repeatable read, the
number of locks could grow to 1 billion, with cursor
stability, only one row is locked throughout the
transaction.
(C) Frieder, Grossman, & Goharian 1996, 2002 230
Example
RR after one row is read CS after one row is read.
Row 1 (lock) Row 1 (lock)
Row 2 Row 2
Row 3 Row 3
... ...
Row 1,000,000 Row 1,000,000

RR after one million CS after one million

Row 1 (lock) Row 1
Row 2 (lock) Row 2
Row 3 (lock) Row 3
... ...
Row 1,000,000 (lock) Row 1,000,000 (lock)
(C) Frieder, Grossman, & Goharian 1996, 2002 231
Deadlock
A deadlock occurs when two users are waiting on each
other to release resources.
User 1: User 2:
s1: read A s2: read B
s3: write B s4: write A
After s1 and s2 execute, user 1 and 2 each hold a share
lock on A and B. With s3, user 1 begins a wait on user 2.
With s4, user 2 begins a wait on user 1.
All commercial systems implement deadlock detection
periodically, and once detected, a victim is chosen and the
transaction is rolled back.
(C) Frieder, Grossman, & Goharian 1996, 2002 232
Performance Tuning
Concurrency can often be improved by using cursor
stability instead of repeatable read. Many applications do
not have a requirement for repeatable read, but it is the
default for most systems.
Many developers do not test an application under expected
workloads. A single user test does not test any
concurrency. It is essential to test multiple users prior to
delivering an application.
Applications should include code to test for the presence
of a deadlock (SQLCODE = 911) or they will abnormally
terminate when one occurs.

(C) Frieder, Grossman, & Goharian 1996, 2002 233

Integrating Structured Data and
Text:
A Relational Approach

(C) Frieder, Grossman, & Goharian 1996, 2002 234

Mapping Text onto Relations
Relation definition:
DOCUMENT (DocID, Docname, Headline, Dateline)
TERM (Term, df, idf)
INDEX (DocID, Termcnt, Term)

All inverted index entries

e.g., vehicle D1, D3, D4 results in:

term docID
vehicle D1
vehicle D3
vehicle D4
(C) Frieder, Grossman, & Goharian 1996, 2002 235
Text Retrieval Conference (TREC)
Sample Document

<DOC>
<DOCNO> AP881214-0028 </DOCNO>
<FILEID>AP-NR-12-14-88 0117EST</FILEID>
<FIRST>u i BC-Japan-Stocks 12-14 0027</FIRST>
<SECOND>BC-Japan-Stocks,0026</SECOND>
<HEAD>Stocks Up In Tokyo</HEAD>
<DATELINE>TOKYO (AP) </DATELINE>
<TEXT>
The Nikkei Stock Average closed at 29,754.73 points
up 156.92 points on the Tokyo Stock Exchange Wednesday.
</TEXT>
</DOC>

(C) Frieder, Grossman, & Goharian 1996, 2002 236

Relational Document Representation
DOCUMENT
DocID Docname Headline Dateline
28 AP881214-0028 Stocks Up In Tokyo TOKYO (AP)

INDEX TERM
DocID Termcnt Term Term df idf
28 1 nikkei average 2265 1.08
28 2 stock closed 2208 1.08
28 1 average exchange 2790 1.00
28 1 closed nikkei 234 2.07
28 2 points points 1627 1.23
28 1 up stock 2674 1.00
28 1 tokyo tokyo 725 1.58
28 1 exchange up 12746 0.30
28 1 wednesday wednesday 6417 0.60

(C) Frieder, Grossman, & Goharian 1996, 2002 237

Relational Query Representation

QUERY
TERM TERMCNT
nikkei 1 ORIGINAL QUERY:
stock 2 “nikkei stock exchange
american stock exchange”
exchange 2
american 1

(Query Weight * Document Weight)

SQL: SELECT d.docname, SUM(a.termcnt * c.idf * b.termcnt * c.idf)
FROM QUERY a, INDEX b, TERM c, DOCUMENT d
WHERE a.term = b.term AND
a.term = c.term AND
b.docid = d.docid
GROUP BY d.docname
ORDER BY 2 DESC

(C) Frieder, Grossman, & Goharian 1996, 2002 238

Sample Term Query Result
(Inner/Dot Product)

Term Q-Termcnt Q-Weight D-Termcnt D-Weight Q-Wt * D-Wt

nikkei 1 2.07 1 2.07 4.28

stock 2 2.00 2 2.00 4.00
exchange 2 2.00 1 1.00 2.00
american 1 0.60 0 0.00 0.00
_____
Similarity Coefficient 10.28

(C) Frieder, Grossman, & Goharian 1996, 2002 239

Similarity Coefficients
Several similarity coefficients based on the query vector X and the
document vector Y are defined:

t
Inner Prod uct ∑ xi ⋅ yi
i=1

t
∑ xiyi
Cosine Coefficient i=1
t t
∑ i •
x 2
∑ i
y 2
i=1 i=1

(C) Frieder, Grossman, & Goharian 1996, 2002 240

Sample Relevance Ranking Query

SELECT c.qryid, b.docid, SUM(((1+LOG(a.termcnt))/((b.logavgtf)*

(229.50439 + (.20*b.disterm))))*(c.nidf*((1+LOG(c.termcnt))/(d.logavgtf))))
FROM trec6$d5$idx a, trec6$d5$docavgtf b, trec6$q6$qrynidf c, trec6$q6$qryavgtf d
WHERE a.docid = b.docid AND c.qryid = d.qryid AND a.term = c.term
AND c.qryid = 301
GROUP BY c.qryid, b.docid
UNION
SELECT c.qryid, b.docid, SUM(((1+LOG(a.termcnt))/((b.logavgtf)*
(229.50439 + (.20*b.disterm))))*(c.nidf*((1+LOG(c.termcnt))/(d.logavgtf))))
FROM trec6$d4$idx a, trec6$d4$docavgtf b, trec6$q6$qrynidf c, trec6$q6$qryavgtf d
WHERE a.docid = b.docid AND c.qryid = d.qryid AND a.term = c.term
AND c.qryid = 301
GROUP BY c.qryid, b.docid
ORDER BY 3 DESC;

(C) Frieder, Grossman, & Goharian 1996, 2002 241

Distributed Database Systems

(C) Frieder, Grossman, & Goharian 1996, 2002 242

Overview
Distributed DBMS allow data stored at multiple sites to be
accessed from a single site.
One query may join data from two different sites.
The key motivation for a distributed DBMS is to move
data closer to the users.
California Chicago New York
DBMS 1 DBMS 2 DBMS 3

Data frequently accessed Data frequently accessed Data frequently accessed

by California users by Chicago users by New York users

(C) Frieder, Grossman, & Goharian 1996, 2002 243

Overview (continued)
Homogeneous DBMS
» Each site contains the same implementation of a
DBMS, (e.g., all sites are running Oracle)
Heterogeneous DBMS
» Different DBMS are used at different sites (e.g, some
sites are Oracle, some are IBM, and some are Informix)

(C) Frieder, Grossman, & Goharian 1996, 2002 244

Overview (continued)
For the remainder of this discussion, assume that a
homogeneous distributed DBMS is used.
In practice, heterogeneous DBMS exist, but they
require an additional layer of software that serves
as a “global coordinator” or “mediator” of all the
different DBMS.
Today, many research efforts focus on both
internet and intranet mediators

(C) Frieder, Grossman, & Goharian 1996, 2002 245

Distributed Concurrency Control

An update over more than one site requires careful

concurrency as it is possible that one site may fail
while others have committed.
To avoid, this a Two Phased Commit (2PC)
protocol is used.
A single site from which the update originates is
the controlling site.

(C) Frieder, Grossman, & Goharian 1996, 2002 246

Two-Phased Commit
Phase 1
» Send update to all sites
» Receive acknowledgments

Phase 2
» After receiving all acknowledgments, send COMMIT to all sites.
» If all acknowledgments are not received in a certain
pre-defined time period, a ROLLBACK is sent to all sites.
» Once, the COMMIT is sent, all sites commit the data. If a site
fails before it receives the COMMIT, it will receive the
COMMIT upon restart.

(C) Frieder, Grossman, & Goharian 1996, 2002 247

Two-Phased Commit (example)
Assume EMP exists on site A and DEPT exists on site B.
Transaction T1, adds a new department and several employees
who work in that department. Assume T1 originates at site C, so
site C will be the controlling site.
Update EMP A
Phase 1: C (EMP)
(start)
Update Dept B
(DEPT)

Ack A
C (EMP)
(start)
Ack B
(DEPT)
(C) Frieder, Grossman, & Goharian 1996, 2002 248
Two-Phased Commit (example)
After Phase one, site C has received all acknowledgements
and is now ready to send final commit.

COMMIT A
C (EMP)
(start)
Phase 2:
COMMIT B
(DEPT)

Ack A
C (EMP)
(start)
Ack B
(DEPT)
(C) Frieder, Grossman, & Goharian 1996, 2002 249
Two-Phased Commit
Failure During Phase 1
Consider a case where site B fails after receiving the request
for update but site A succeeds:
Update EMP A
Phase 1: C (EMP)
(start)
Update DEPT B
(DEPT)

Ack A
C (EMP)
(start)
Site C receives only one acknowledgment but was
waiting for two, so a rollback is sent to all sites.
(C) Frieder, Grossman, & Goharian 1996, 2002 250
Two-Phased Commit
Failure During Phase 2
Consider a case where site B fails after sending the
acknowledgment in Phase 1.
COMMIT A
Phase 2: C (EMP)
(start)
COMMIT B
(DEPT)
Site B will eventually restart,
receive the COMMIT and phase two will complete.
Ack A
C (EMP)
(start)
Ack B
(DEPT)
(C) Frieder, Grossman, & Goharian 1996, 2002 251
Replication
Since 2PC processing is expensive, a cheaper alternative is
to replicate data so that they are at each site.
Many replication algorithms exist, the goal of which is to
propagate an update to all replicas.

California Chicago New York

DBMS 1 DBMS 2 DBMS 3

EMP
Replica of EMP Replica of EMP
Source

(C) Frieder, Grossman, & Goharian 1996, 2002 252

Write-All Copies
The Write-all copies takes an update to a table at
site A and before commit sends the update to all
sites that contain the replica.
If any site is down, a commit does not occur.

(C) Frieder, Grossman, & Goharian 1996, 2002 253

Shadow Copies
A master copy is defined and one or more read-only
replicas are created based on changes found in the log at
the mater site.

DBMS Log Site A

Master
Site A
Copy

Read-Only
Read-Only DBMS Replica
DBMS
Replica Site C Site C
Site B
Site B

(C) Frieder, Grossman, & Goharian 1996, 2002 254

Shadow Copies
Many replica algorithms exist that allow updates to
the replicas, but they are not widely used in
commercial products.
All commercial vendors provide some form of
read-only replicas based on changes found in the
log at the master site.

(C) Frieder, Grossman, & Goharian 1996, 2002 255

Test Bank For Database Systems: Design, Implementation, and Management 13th Edition Coronel - Read Now With The Full Version of All Chapters
100% (17)
Test Bank For Database Systems: Design, Implementation, and Management 13th Edition Coronel - Read Now With The Full Version of All Chapters
54 pages
Invensys Controls Courses Catalogue
100% (1)
Invensys Controls Courses Catalogue
103 pages
DBMS Practical List
No ratings yet
DBMS Practical List
2 pages
DCS 309
No ratings yet
DCS 309
10 pages
Course Outline For FDS
No ratings yet
Course Outline For FDS
3 pages
Tutorial
No ratings yet
Tutorial
41 pages
SQL Output
No ratings yet
SQL Output
24 pages
Lab # 7 View, Sequence, Synonym and Trigger Eng. Alaa O Shama
No ratings yet
Lab # 7 View, Sequence, Synonym and Trigger Eng. Alaa O Shama
11 pages
Invoking Oracle 6i Reports From Oracle 6i Forms Using RUN
No ratings yet
Invoking Oracle 6i Reports From Oracle 6i Forms Using RUN
8 pages
CLASS12-Report On Bank Management
No ratings yet
CLASS12-Report On Bank Management
37 pages
Resume - Saraswati Kumari 2+ Years Exp in QA
No ratings yet
Resume - Saraswati Kumari 2+ Years Exp in QA
3 pages
Hakin9 On Demand - 201202 - Hakin9 SQL Injection
100% (1)
Hakin9 On Demand - 201202 - Hakin9 SQL Injection
41 pages
Referential Integrity in Databases
No ratings yet
Referential Integrity in Databases
16 pages
Oracle ODI Variable Setup Guide
No ratings yet
Oracle ODI Variable Setup Guide
7 pages
Quillbot Cookie (Second Cookies)
No ratings yet
Quillbot Cookie (Second Cookies)
4 pages
PL SQL Exercise by Unsw
No ratings yet
PL SQL Exercise by Unsw
5 pages
1Z0 006 Oracle Database Foundations
67% (3)
1Z0 006 Oracle Database Foundations
159 pages
Pavani Pingili
No ratings yet
Pavani Pingili
5 pages
Ankita
No ratings yet
Ankita
5 pages
DBMS Study Guide for Diploma Students
No ratings yet
DBMS Study Guide for Diploma Students
28 pages
DMS Question Bank Answers
No ratings yet
DMS Question Bank Answers
22 pages
InformaticsPractices SQP 2025 26
No ratings yet
InformaticsPractices SQP 2025 26
13 pages
DBMS Overview: Types, Evolution, Pros & Cons
No ratings yet
DBMS Overview: Types, Evolution, Pros & Cons
24 pages
SQL Notes
No ratings yet
SQL Notes
8 pages
Section 6
0% (2)
Section 6
3 pages
SQL Server 2008 Exam Prep Guide
No ratings yet
SQL Server 2008 Exam Prep Guide
5 pages
Review: IOE 373 - Data Processing
No ratings yet
Review: IOE 373 - Data Processing
19 pages
Azure SQL
No ratings yet
Azure SQL
3,323 pages
Koushik Sarkar PDF
No ratings yet
Koushik Sarkar PDF
1 page
Cloud Digital Leader Exam Guide English
No ratings yet
Cloud Digital Leader Exam Guide English
8 pages