Database Management System Basics
Unit – I
Basic Concepts: Database and Database users – Database system concepts and architecture – Data
modeling using Entity Relationship model – Enhanced entity relationship and object modeling – Record
storage and file organizations – Index Structures for files.
Basic Concepts
DATA
Data is a collection of facts, such as numbers, words, measurements, observations or just descriptions of
things. Data can be qualitative or quantitative. Qualitative data is descriptive
information (it describes something). Quantitative data is numerical information (numbers).
QUANTITATIVE DATA
• Quantitative data can be Discrete or Continuous. Discrete data can only take certain values (like whole
numbers); continuous data can take any value within a range.
QUALITATIVE DATA
• Qualitative data is descriptive, e.g. "the car is blue" or "the student is tall".
DATABASE
Data means known facts or raw facts. E.g. names, telephone numbers.
Information means processed data.
Database is a collection of related data. E.g. student table consists of name, regno, marks.
Database management system (DBMS) is collection of programs that enables user to create and maintain a
database. A general-purpose software system facilitates process of defining, constructing, and manipulating
database for various applications.
• Database system includes database and DBMS software.
• A Database Management System (DBMS) is software designed to store, retrieve, define, and manage
data in a database.
DATABASE USERS
Database users are categorized based upon their interaction with the database.
There are seven types of database users in DBMS.
1. Database Administrator (DBA) :
Database Administrator (DBA) is a person/team who defines the schema and also controls the 3 levels of
database.
The DBA creates a new account id and password for a user if he/she needs to access the database.
The DBA is also responsible for providing security to the database and allows only authorized users to
access/modify the database.
The DBA also monitors recovery and backup and provides technical support.
The DBA has a DBA account in the DBMS, which is called a system or superuser account.
The DBA repairs damage caused by hardware and/or software failures.
2. Naive / Parametric End Users :
Parametric end users are unsophisticated users who do not have any DBMS knowledge but
frequently use database applications in their daily life to get the desired results.
For example, railway ticket-booking users are naive users. Clerks in a bank are naive users because
they do not have any DBMS knowledge but still use the database to perform their given tasks.
3. System Analyst :
System Analyst is a user who analyzes the requirements of parametric end users. They check whether
all the requirements of end users are satisfied.
4. Sophisticated Users: Sophisticated users can be engineers, scientists or business analysts who are
familiar with the database. They can develop their own database applications according to their
requirements. They do not write program code, but they interact with the database by writing SQL
queries directly through the query processor.
5. Data Base Designers: Database designers are the users who design the structure of the database, which
includes tables, indexes, views, constraints, triggers and stored procedures. They control what data
must be stored and how the data items are to be related.
6. Application Programmers: Application programmers are the back-end programmers who write the code for
the application programs. They are computer professionals. These programs could be written in
programming languages such as Visual Basic, Developer, C, FORTRAN, COBOL etc.
7. Casual Users / Temporary Users : Casual users are users who occasionally access the database,
but each time they access it they may require different, new information; for example,
middle or higher level managers.
Types of DBA
• Administrative DBA – This DBA is mainly concerned with installing, and maintaining DBMS servers.
His prime tasks are installing, backups, recovery, security, replications, memory management,
configurations and tuning. He is mainly responsible for all administrative tasks of a database.
• Development DBA – He is responsible for creating queries and procedure for the requirement. Basically
his task is similar to any database developer.
• Database Architect – Database architect is responsible for creating and maintaining the users, roles,
access rights, tables, views, constraints and indexes. He is mainly responsible for designing the structure
of the database depending on the requirement. These structures will be used by developers and
development DBA to code.
• Data Warehouse DBA – This DBA maintains the data and procedures from various sources
in the data warehouse. These sources can be files, COBOL programs, or any other programs; here data and
programs come from different sources. A good DBA should be able to keep the performance and
function levels of these sources at the same pace to make the data warehouse work.
• Application DBA – He acts like a bridge between the application programs and the database. He makes
sure all application programs are optimized to interact with the database. He ensures all activities,
from installing, upgrading, patching, maintaining, backup and recovery to executing the workload, work
without any issues.
• OLAP DBA – He is responsible for installing and maintaining the database in OLAP systems. He
maintains only OLAP databases.
Advantages of Database Management System (DBMS):
1. Improved data security: As the number of users increases, the rate of data transfer and data sharing also
increases, which increases the risk to data security. This matters in the corporate world, where companies
invest large amounts of money, time and effort to ensure that data is secure and used properly. A DBMS
provides a better platform for data privacy and security policies, thus helping companies to improve data
security.
2. Minimized data inconsistency: Data inconsistency occurs between files when different versions of the same
data appear in different places.
For Example, data inconsistency occurs when a student name is saved as “John Wayne” on a main
computer of school but on teacher registered system same student name is “William J. Wayne”, or when
the price of a product is $86.95 in local system of company and its National sales office system shows
the same product price as $84.95.
So if a database is properly designed, data inconsistency can be greatly reduced.
3. Better decision making: With a DBMS we have better-managed data and improved data access, which lets us
generate better quality information, and on this basis better decisions can be made. Better data quality
improves accuracy, validity and the time it takes to read data. A DBMS does not guarantee data quality; it
provides a framework that makes it easy to improve data quality.
4. Increased end-user productivity: The availability of data, combined with tools that
transform data into useful information, helps end users make quick, informed and better
decisions that can make the difference between success and failure in the global economy.
5. Simplicity:
A database management system (DBMS) gives a simple and clear logical view of data. Many operations,
like insertion, deletion or creation of a file or data, are easy to implement.
Three schema Architecture
o The three schema architecture is also called ANSI/SPARC architecture or three-level architecture.
o The three schema architecture is also used to separate the user applications and physical database.
o The three schema architecture contains three-levels. It breaks the database down into three different
categories.
o Mapping is used to transform the request and response between various database levels of architecture.
o Mapping is not good for small DBMS because it takes more time.
o In External / Conceptual mapping, it is necessary to transform the request from external level to
conceptual schema.
o In Conceptual / Internal mapping, DBMS transform the request from the conceptual to internal level.
1. Internal Level
o The internal level has an internal schema which describes the physical storage structure of the database.
o It uses the physical data model. It is used to define how the data will be stored in a block.
o The physical level is used to describe complex low-level data structures in detail.
2. Conceptual Level
o The conceptual schema describes the design of a database at the conceptual level. Conceptual level is
also known as logical level.
o The conceptual level describes what data are to be stored in the database and also describes what
relationship exists among those data.
o In the conceptual level, internal details such as an implementation of the data structure are hidden.
3. External Level
o At the external level, a database contains several schemas, sometimes called subschemas. A
subschema is used to describe a different view of the database.
o Each view schema describes the part of the database that a particular user group is interested in and hides the
remaining database from that user group.
o The view schema describes the end user interaction with database systems.
Data model
• A Database model defines the logical design and structure of a database and defines how data will be
stored, accessed and updated in a database management system
• A data model in a Database Management System (DBMS) is a set of concepts and tools used to
describe the structure of a database.
• Data models are classified into the following types:
• Hierarchical Model
• Network Model
• Entity-Relationship Model
• Relational Model
• Object-Oriented Data Model
• Object-Relational Data Model
• Flat Data Model
• Semi-Structured Data Model
• Associative Data Model
• Context Data Model
• Entity–relationship modeling was developed for database design by Peter Chen and published in 1976.
• The ER model is an abstract data model that defines a data or information structure which can be
implemented in a database, typically a relational database.
• ENTITY-RELATIONSHIP DIAGRAM (ERD) displays the relationships of entity set stored in a
database.
• ER diagrams help you to explain the logical structure of databases.
• An ER diagram looks very similar to a flowchart.
• An ER diagram includes many specialized symbols, and their meanings make this model unique.
• The purpose of an ER diagram is to represent the underlying framework of entities in a system.
• ENTITY-RELATIONSHIP DIAGRAM (ERD) describes interrelated things of interest in a specific
domain of knowledge.
• ER diagrams help to show how “entities” such as people, objects or concepts relate to each other within a
system.
• ER Diagrams are most often used to design or debug relational databases in the fields of software
engineering, business information systems, education and research.
• ER Diagram use a defined set of symbols such as rectangles, diamonds, ovals and connecting lines to
depict the interconnectedness of entities, relationships and their attributes.
• The purpose of an ER diagram is to represent the structure of the data and the relationships among entities.
• ER model defines the conceptual view of a database
• It works around real-world entities and the associations among them
• ER model is considered a good option for designing databases.
• ER modeling helps you to analyze data requirements systematically to produce a well-designed database
• The three main components of the ER Model are
• entities,
• attributes
• and relationships.
• An Entity may be an object with a physical existence – a particular person, car, house, or employee – or
it may be an object with a conceptual existence – a company, a job, or a university course.
• An entity is an instance of an entity type, and the set of all entities of that type is called an entity set. E.g., E1 is an
entity of entity type Student, and the set of all students is the entity set. In an ER diagram, an entity type is
represented as:
Figure 1.8 Entity Type, Entity Set
Attribute(s)
Attributes are the properties which define an entity type. For example, Roll_No, Name, DOB, Age,
Address and Mobile_No are the attributes which define the entity type Student. In an ER diagram, an attribute is
represented by an oval.
For example, a car entity would be described by attributes such as price, registration number, model
number, color etc. Attributes are indicated by ovals in an E-R diagram
• If an attribute of a particular entity represents single value for each instance, then it is called a single-
valued attribute. For example, Ramesh, Kamal and Suraj are the instances of entity ‘student’ and each of
them is issued a separate roll number. A single oval is used to represent this attribute.
Relationships
1. Generalization
2. Specialization
3. Aggregation
Generalization
• Generalization is a bottom-up approach in which two lower level entities combine to form a higher level
entity.
• In generalization, the higher level entity can also combine with other lower level entities to make further
higher level entity.
• It's more like Superclass and Subclass system, but the only difference is the approach, which is bottom-
up.
• entities are combined to form a more generalised entity, in other words, sub-classes are combined to
form a super-class.
• For example, Saving and Current account types entities can be generalised and an entity with
name Account can be created, which covers both.
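As a rough illustration only (not taken from the source), the Account generalization above might be mapped to
relational tables as in the following sketch; the table and column names are assumed for the example.
CREATE TABLE ACCOUNT (
    Acc_no  INT PRIMARY KEY,        -- attributes common to all accounts (superclass)
    Balance DECIMAL(10,2)
);
CREATE TABLE SAVING_ACCOUNT (
    Acc_no        INT PRIMARY KEY REFERENCES ACCOUNT(Acc_no),  -- subclass shares the superclass key
    Interest_rate DECIMAL(4,2)
);
CREATE TABLE CURRENT_ACCOUNT (
    Acc_no          INT PRIMARY KEY REFERENCES ACCOUNT(Acc_no),
    Overdraft_limit DECIMAL(10,2)
);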
Specialization
Specialization is a process in which a group of entities is divided into sub-groups based on their
characteristics.
It is a top-down approach, in which one higher level entity can be broken down into two or more lower level entities.
It maximizes the differences between the members of an entity by identifying the unique characteristics or
attributes of each member.
It defines one or more subclasses for the superclass and also forms the superclass/subclass relationship.
Figure 1.20 Specialization
Aggregation
• In aggregation, the relation between two entities is treated as a single entity; the relationship
together with its corresponding entities is aggregated into a higher level entity.
• For example, the relationship “Center offers Course” acts as a single entity, and that aggregate is in a relationship
with another entity, Visitor. In the real world, if a visitor visits a coaching center he will never
enquire about only the Course or only the Center; he will enquire about both.
• Objects –
An object is an abstraction of a real-world entity, or we can say it is an instance of a class. Objects
encapsulate data and code into a single unit, which provides data abstraction by hiding the
implementation details from the user. For example: instances of student, doctor, engineer in the above
figure.
• Attribute –
An attribute describes the properties of an object. For example: the object is STUDENT and its attributes are
Roll_no, Branch, Semester in the Student class.
• Methods –
Method represents the behavior of an object. Basically, it represents the real-world action. For example:
Finding a STUDENT marks in above figure as Setmarks().
• Class –
A class is a collection of similar objects with shared structure i.e. attributes and behavior i.e. methods.
An object is an instance of class. For example: Person, Student, Doctor, Engineer in above figure.
class student
{
    char Name[20];      // data members
    int roll_no;
    // ... other data members ...
public:
    void search();      // member functions (methods)
    void update();
};
In this example, student refers to the class, and S1, S2 are objects of the class which can be created in the main
function.
Inheritance –
By using inheritance, new class can inherit the attributes and methods of the old class i.e. base class. For
example: as classes Student, Doctor and Engineer are inherited from the base class Person.
Advantages of Object Oriented Data Model :
• Code can be reused due to inheritance.
• Easily understandable.
• Cost of maintenance can be reduced due to reusability of attributes and functions because of inheritance.
Disadvantages of Object Oriented Data Model :
• It is not fully developed and standardized, so it is not easily accepted by users.
Record Storage
Magnetic disks -
bit - by magnetizing an area on disk in certain ways, we can make it represent a bit value
of either 0 or 1
byte - bits are grouped into either 4 or 8 to make a byte - one character per byte
single-sided - disks that store info on only one of its magnetic surfaces
double-sided - both surfaces are used multiple disks stacked on top of each other form a disk pack
info stored on a disk in concentric circles of small
width - each circle called a track
cylinder - tracks of the same diameter on different disks in a disk pack
sector - section of disk divided by an angle from the center - hard-coded on disk
blocks - divisions of a track made by the operating system during formatting, separated by fixed-size interblock gaps
which include control info identifying the next block; transfer of info from main memory to disk is done in blocks
hardware address - combination of surface number, track number and block number
buffer - contiguous reserved area in main mem that holds one block
read- block from disk copied to buffer
write - contents of buffer copied to disk block
cluster - group of blocks - sometimes data moved in clusters - buffer size matches cluster size
disk drive - system that includes - read/write head - one for each surface - actuator moves r/w head over a
cylinder of tracks - motor to rotate disks
transfer a disk block:
- position read/write head on correct track - seek time
- rotate disk to the desired block - rotational delay
- transfer data to/from main memory - block transfer time
- seek time and rotational delay usually much longer than transfer time
- locating data on disk is a major bottleneck in DB applications
- file structures to minimize number of block transfers needed to locate and transfer required data
Magnetic tape
sequential access
tape drive - reads/writes data from/to tape reel
blocks are larger than on disk - so are interblock gaps used to back up DB
Buffering blocks
can reserve several buffers in main mem to speed up transfer while one buffer is being written, CPU can process
data in other buffer
double buffering - can be used to write a continuous stream of blocks from memory to disk; it permits continuous
reading/writing of data on consecutive disk blocks, eliminating seek time and rotational delay for all but the first block
Placing file records on disk
data stored in records- collection of related data values - each value formed by one or
more bytes - corresponds to field of a record
EX: one TOY record consists of the fields TOY#, name, manufacturer, etc...
record type - collection of field names and their data types
data type - (of a field) specifies type of values the field can take number of bytes for each data type is fixed for a
given computer system
BLOB - binary large object - large unstructured objects that represent images, digitized video or audio or free
text
- usually record includes a pointer to BLOB - somewhere else on disk
file - sequence of records
usually - all records in a file are of the same record type
fixed-length records - all records same size
variable-length records - different sizes
- variable-length fields in record
EX: a text field for customer comments
- repeating field - more than one value for a field in an individual record
EX: ages of children in customer relation
- optional fields
- mixed file - different record types
separator character used to separate variable-length records in a file
records are mapped to disk blocks - the unit of transfer from disk to MM buffers
blocking factor for a file - the number of records per block on disk
let B=size of block (bytes); R=size of record
B >= R ==> bfr = floor(B/R) (rounds down to next integer)
unused space = B - (bfr * R) bytes
unused space can be used by storing part of a record on one block and the rest on another
- a pointer at the end of one block to the other block storing the rest of the record
- called spanned records
- must be used if R >= B
bfr for variable-length records represents the average number of records per block.
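As a quick worked example (the values are assumed for illustration): if the block size B = 512 bytes and the
record size R = 100 bytes, then bfr = floor(512/100) = 5 records per block, and the unused space per block is
512 - (5 * 100) = 12 bytes. With spanned records, this leftover space can hold the first part of the next record.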
File organizations
File – A file is a named collection of related information that is recorded on secondary storage such as
magnetic disks, magnetic tapes and optical disks.
What is File Organization?
File Organization refers to the logical relationships among various records that constitute the file, particularly
with respect to the means of identification and access to any specific record. In simple terms, Storing the files
in certain order is called file Organization. File Structure refers to the format of the label and data blocks
and of any logical control record.
Various methods have been introduced to organize files. These methods have advantages and
disadvantages on the basis of access or selection. Thus it is up to the programmer to decide the best-suited
file organization method according to his requirements.
Some types of File Organizations are :
• Sequential File Organization
• Heap File Organization
• Hash File Organization
• B+ Tree File Organization
• Clustered File Organization
We will be discussing each of the file Organizations in further sets of this article along with differences and
advantages/ disadvantages of each file Organization methods.
Sequential File Organization
The easiest method of file organization is the sequential method. In this method the records are stored one after
another in a sequential manner. There are two ways to implement this method:
1. Pile File Method – This method is quite simple: we store the records in a sequence, i.e. one after
another, in the order in which they are inserted into the tables.
Insertion of a new record – Let R1, R3 and so on up to R5 and R4 be the records in the sequence.
Here, records are nothing but rows in a table. Suppose a new record R2 has to be inserted in the
sequence; then it is simply placed at the end of the file.
2. Sorted File Method – In this method the records are kept sorted on a key value.
Let us assume that there is a pre-existing sorted sequence of records R1, R3, and so on up to R7 and
R8. Suppose a new record R2 has to be inserted in the sequence; then it will be inserted at the end of the
file and then the sequence will be sorted again.
Heap File Organization
Heap File Organization works with data blocks. In this method records are inserted at the end of the file, into
the data blocks. No Sorting or Ordering is required in this method. If a data block is full, the new record is
stored in some other block, Here the other data block need not be the very next data block, but it can be any
block in the memory. It is the responsibility of DBMS to store and manage the new records.
Suppose we have five records R1, R5, R6, R4 and R3 in the heap, and suppose a new record R2 has to be
inserted in the heap. Then, since the last data block, i.e. data block 3, is full, it will be inserted in any of the data
blocks selected by the DBMS, let's say data block 1.
If we want to search, delete or update data in heap file organization, then we will traverse the data from the
beginning of the file till we get the requested record. Thus if the database is very huge, searching, deleting or
updating a record will take a lot of time.
What is Indexing?
Indexing is a data structure technique which allows you to quickly retrieve records from a database file. An
index is a small table having only two columns. The first column contains a copy of the primary or candidate
key of a table. The second column contains a set of pointers holding the address of the disk block where that
specific key value is stored.
Types of Indexing
Indexing in Database is defined based on its indexing attributes. Two main types of indexing methods are:
• Primary Indexing
• Secondary Indexing
Primary Index
A primary index is an ordered file of fixed-length records with two fields. The first field is the same as the primary
key of the data file and the second field is a pointer to the corresponding data block. In the primary index there is always
a one-to-one relationship between the entries in the index table and the records or blocks they point to.
The primary Indexing in DBMS is also further divided into two types.
• Dense Index
• Sparse Index
Dense Index
In a dense index, an index record is created for every search key value in the database. This helps you to search faster
but needs more space to store the index records. In this indexing method, each index record contains the search key value
and a pointer to the actual record on the disk.
Sparse Index
It is an index record that appears for only some of the values in the file. Sparse Index helps you to resolve the
issues of dense Indexing in DBMS. In this method of indexing technique, a range of index columns stores the
same data block address, and when data needs to be retrieved, the block address will be fetched.
However, sparse Index stores index records for only some search-key values. It needs less space, less
maintenance overhead for insertion, and deletions but It is slower compared to the dense Index for locating
records.
Secondary Index
The secondary Index in DBMS can be generated by a field which has a unique value for each record, and it
should be a candidate key. It is also known as a non-clustering index.
A two-level indexing technique can be used to reduce the size of the first-level mapping: a large range of
values is grouped at the first level, so the mapping size always remains small.
In a bank account database, data is stored sequentially by acc_no; you may want to find all accounts of a
specific branch of ABC bank.
Here, you can have a secondary index in DBMS on that search key. Each index record points to a bucket
that contains pointers to all the records with that specific search-key value.
Figure 1.30 Secondary index
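For example, in most SQL systems a secondary (non-clustering) index is created with a CREATE INDEX
statement; the table and column names below are assumed for illustration.
CREATE INDEX acc_branch_idx ON ACCOUNT (branch_name);
-- a query such as SELECT * FROM ACCOUNT WHERE branch_name = 'ABC'
-- can now locate the matching rows through the index instead of scanning the whole file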
Clustering Index
In a clustered index, records themselves are stored in the Index and not pointers. Sometimes the Index is created
on non-primary key columns which might not be unique for each record. In such a situation, you can group two
or more columns to get the unique values and create an index which is called clustered Index. This also helps
you to identify the record faster.
Example:
Let's assume that a company recruited many employees in various departments. In this case, clustering indexing
in DBMS should be created for all employees who belong to the same dept.
The employees of one department are considered as a single cluster, and the index points to the cluster as a whole.
Here, Department_no is a non-unique key.
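As a sketch of this idea, some systems (for example Microsoft SQL Server) let you declare a clustered index
explicitly; the table and column names below are assumed for illustration.
CREATE CLUSTERED INDEX emp_dept_idx ON EMPLOYEE (Department_no);
-- the rows of EMPLOYEE are physically ordered (clustered) by Department_no,
-- so all employees of the same department are stored together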
Multilevel indexing is used when a primary index does not fit in memory. In this type of indexing
method, you can reduce the number of disk accesses needed to search for any record: the index itself is kept on disk
as a sequential file and a sparse index is created on that file.
Figure 1.31 Multilevel index
B-Tree Index
The B-tree index is the most widely used data structure for tree-based indexing in DBMS. It is a multilevel
tree-based indexing technique that uses balanced search trees. All leaf nodes of the B tree hold the
actual data pointers.
Moreover, all leaf nodes are interlinked with a linked list, which allows the B tree to support both random and
sequential access.
Advantages of Indexing
• It helps you to reduce the total number of I/O operations needed to retrieve data, so a row in the
database can be located quickly through the index structure.
• Offers faster search and retrieval of data to users.
• Indexing also helps you to reduce tablespace, as there is no need to store the ROWID in the index;
thus you will be able to reduce the tablespace.
• Data in the leaf nodes cannot be sorted in any other order, as the value of the primary key classifies it.
Disadvantages of Indexing
• To perform indexing, the database management system needs a primary key on the table with a
unique value.
• You can't perform any other indexing on the already indexed data.
• You are not allowed to partition an index-organized table.
• SQL indexing decreases performance of INSERT, DELETE, and UPDATE queries.
UNIT II
Relational Model, Languages and Systems: The Relational Data Model, Relational
Constraints, and the relational Algebra – SQL – The Relational Database Standard – ER and EER
to Relational Mapping and Other Relational Languages – Examples of Relational Database
Management Systems: Oracle and Microsoft Access.
Data model
• Hierarchical Model
• Network Model
• Entity-Relationship Model
• Relational Model
• Object-Oriented Data Model
• Object-Relational Data Model
• Flat Data Model
• Semi-Structured Data Model
• Associative Data Model
• Context Data Model
RELATIONAL MODEL
Relational model is an example of implementation model or representational model. An
implementation model provides concepts that may be understood by end users but that are not
too far removed from the way data is organized within the computer.
CONCEPTS:
Tuple: Each row in the relation is known as a tuple. For example,
(3, SAM, 9156253131, 20, ANDHRA PRADESH)
is one tuple of a STUDENT relation.
• Attribute: Each column in a Table. Attributes are the properties which define a relation. e.g.,
Student_Rollno, NAME,etc.
• Tables – In the Relational model the, relations are saved in the table format. It is stored along with its
entities. A table has two properties rows and columns. Rows represent records and columns represent
attributes.
• Tuple – It is nothing but a single row of a table, which contains a single record.
• Relation Schema: A relation schema represents the name of the relation with its attributes.
• Degree: The total number of attributes in the relation is called the degree of the relation.
• Cardinality: Total number of rows present in the Table.
• Column: The column represents the set of values for a specific attribute.
• Relation instance – Relation instance is a finite set of tuples in the RDBMS system. Relation instances
never have duplicate tuples.
• Relation key – One or more attributes that can uniquely identify a row; such a set of attributes is called a relation key.
• Attribute domain – Every attribute has some pre-defined value and scope which is known as attribute
domain
• Constraints are rules that enforce limits on the data, or type of data, that can be inserted/updated/deleted
in a table. The main types are:
1. Domain constraints
2. Key constraints
3. Entity integrity constraints
4. Referential integrity constraints
Entity Integrity Constraint -
The entity integrity constraint says that no primary key attribute can take a NULL value, since the primary
key is used to identify each tuple uniquely in a relation.
• Explanation:
In the above relation, EID is made the primary key, and the primary key can't take NULL values; but in the
third tuple the primary key is null, so it violates the entity integrity constraint of the
table.
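A minimal SQL sketch of how such constraints are declared (table and column names assumed for illustration):
CREATE TABLE DEPARTMENT (
    DNO    INT PRIMARY KEY,                    -- key constraint
    DNAME  VARCHAR(30) NOT NULL
);
CREATE TABLE EMPLOYEE (
    EID    INT PRIMARY KEY,                    -- entity integrity: EID is unique and cannot be NULL
    ENAME  VARCHAR(30),
    SALARY DECIMAL(10,2) CHECK (SALARY >= 0),  -- domain constraint on SALARY
    DNO    INT REFERENCES DEPARTMENT(DNO)      -- referential integrity (foreign key)
);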
Object-An object is a software bundle of variables and related methods. Software objects
are often used to model real-world objects you find in everyday life.
Messages-Software objects interact and communicate with each other using
messages.
Classes-A class is a blueprint or prototype that defines the variables and the methods
common to all objects of a certain kind.
Inheritance-A class inherits state and behavior from its superclass. Inheritance provides a
powerful and natural mechanism for organizing and structuring software programs.
Information Hiding-The ability to protect some components of the object from external
entities. This is realized by language keywords to enable a variable to be declared as
private or protected to the owning class.
Polymorphism-The ability to replace an object with its subobjects; the ability of an object
variable to contain not only that object but also all of its subobjects.
In contrast to a relational DBMS, where a complex data structure must be flattened out to fit into tables or
joined together from those tables to form the in-memory structure, object DBMSs have no
performance overhead to store or retrieve a web or hierarchy of interrelated objects. This
one-to-one mapping of object programming language objects to database objects has two
benefits over other storage approaches: it provides higher-performance management of
objects, and it enables better management of the complex interrelationships between
objects. This makes object DBMSs better suited to support applications such as financial
portfolio risk analysis systems, telecommunications service applications, world wide web
document structures, design and manufacturing systems, and hospital patient record
systems, which have complex relationships between data.
Figure 2.2. Sample user case diagram for object oriented data model.
The examples below also refer to the relations DEPT_LOCATIONS (DNUMBER, DLOCATION), PROJECT,
WORKS_ON and DEPENDENT.
1. INSERT OPERATION:
The insert operation provides a list of attribute values for a new tuple t that is to be
inserted into a relation R. insert can violate any of the four types of constraints.
Example: Insert < ’ProductX’, 501, ’Stafford’, 4 > Into Project.
If there is a null in the primary key field then that insertion violates entity
integrity constraints.
This insertion may violate the key constraint if there is another tuple with a same
project no.
This insertion may violate the referential integrity constraint if there is no
information regarding DNO = 4, which is the foreign key.
2. DELETE OPERATION:
The delete operation is used to delete existing tuples which satisfy a specified
condition. The delete operation can violate referential integrity constraints. Example: Delete the
employee tuple with Name = 'John'.
3. UPDATE OPERATION:
The update operation is used to change the values of one or more attributes in a tuple of
some relation R. it is necessary to specify a condition on the attributes of the relation to select the
tuple to be modified.
Example: Update The Salary Of The Employee Tuple With SSN= ‘99988777’ To 28000.
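The three operations above are expressed in SQL roughly as follows (the relation and attribute names follow the
examples in the text; the exact PROJECT column names are assumed):
INSERT INTO PROJECT (PNAME, PNUMBER, PLOCATION, DNUM)
VALUES ('ProductX', 501, 'Stafford', 4);        -- may violate key, entity or referential integrity

DELETE FROM EMPLOYEE WHERE NAME = 'John';       -- may violate referential integrity

UPDATE EMPLOYEE SET SALARY = 28000 WHERE SSN = '99988777';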
1. SELECT OPERATION:
The SELECT operation is used to select a subset of the tuples of a relation that satisfy a selection condition.
Syntax: σ <selection condition> (R)
Where the σ (sigma) symbol is used to specify the SELECT operator, and the selection condition
is a Boolean expression specified on the attributes of relation R. The result is also a relation
with the same attributes as R. The selection condition is of the form
<attribute name> <comparison op> <attribute name/constant>
Comparison operators include {=, <, ≤, >, ≥, ≠}.
Example 1: Select the employees with dno 4.
Ans.: σ dno=4 (EMPLOYEE) Result:
More than one condition can be connected by the Boolean operators AND, OR, NOT to form a
general selection condition. Example: consider the following query.
Example 3: Select the employees who either work in dno 3 and receive more than 30000, or
work in dno 4 and receive more than 15000.
Ans.: σ (dno=3 AND salary>30000) OR (dno=4 AND salary>15000) (EMPLOYEE) Result:
NAME SSN BDATE ADDRESS SEX SALARY DNO
Mani 384 29-12-78 Chennai M 17000 4
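For comparison, the same selection can be written in SQL (assuming an EMPLOYEE table with dno and salary
columns as in the examples above):
SELECT *
FROM EMPLOYEE
WHERE (dno = 3 AND salary > 30000) OR (dno = 4 AND salary > 15000);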
2. PROJECT OPERATION:
The PROJECT operation is used to select a certain set of attributes from a relation, i.e. it selects
some of the columns from the table.
Syntax: π <attribute list> (R)
Where the π (pi) symbol is used to specify the PROJECT operator and the attribute list is the list
of attributes of relation R. The result is also a relation with the selected set of attributes of R.
Example 4: List each employee’s name and salary.
Result:
NAME SALARY
Alex 20000
Siva 15000
Sruthi 17000
Gayathri 14000
Mani 17000
Example 5: Retrieve the name and salary of the employees who work in department 1, using a sequence of operations.
Ans.: Dep1_emp ← σ dno=1 (EMPLOYEE)
Result ← π name, salary (Dep1_emp)
Result:
NAME SALARY
Alex 20000
Gayathri 14000
Dep1_emp and Result are temporary relations created automatically.
3. RENAME OPERATION:
The RENAME operation can rename either the relation name or the attribute names or both.
Syntax: ρ S (B1, B2, …, Bn) (R)
Where the ρ (rho) symbol denotes the RENAME operator, S denotes the new relation name, and B1, B2, … are the
new attribute names.
Example 6: Change the name of the EMPLOYEE relation to STAFFLIST.
Ans.: ρ STAFFLIST (EMPLOYEE)
4. SET OPERATIONS:
Set operations are used to merge the elements of two sets in various ways including
UNION, INTERSECTION, and SET DIFFERENCE. The three operations UNION,
INTERSECTION, and SET DIFFERENCE can be applied on two union compatible relations.
What is a union compatible relation? Two relations are said to be union compatible relation if
they have same degree n (same no of attributes) and that each pair of corresponding attributes
have the same domain. Consider two union compatible relations.
STUDENT NAM TEACHER NAME
SUS SUSAN
RAM JIMMY
JOH MAYA
JIMM PRIYA
BAR PATRICK
FRA RAMESH
ERN
UNION:
R ∪ S, where R and S are two relations. The result is also a relation that includes all tuples that are
either in R, or in S, or in both R and S.
INTERSECTION:
R ∩ S, where R and S are two relations. The result is also a relation that includes all tuples that are
in both R and S.
SET DIFFERENCE:
R – S: The result is also a relation that includes all tuples that are in R but not in S.
5. CARTESIAN PRODUCT:
It is also known as CROSS JOIN or CROSS PRODUCT and is denoted by X. Like the set
operations, the cross product is also used to merge two relations, but the relations on which it is applied
need not be union compatible.
Let us consider the relational algebra expression Q ← R X S,
where R is a relation with n attributes (A1, A2, …, An) and S is a relation with m attributes
(B1, B2, …, Bm). The resulting relation Q has n+m attributes (A1, A2, …, An, B1, B2, …, Bm) in that
order. The resulting relation Q has one tuple for each combination of tuples, one from R and one
from S. Hence if R has t1 tuples and S has t2 tuples, then R X S has t1 * t2 tuples.
The CARTESIAN product creates tuples with the combined attributes of two relations;
the operation applied by itself is generally meaningless. It is useful when it is followed by a
SELECT operation that selects only related tuples from the two relations according to the selection
condition. Temp2 ← σ ssn = mgrssn (temp1)
RELATION: TEMP 2
NAME SSN … … … … … DNAME DNO MGRSSN
Alex 123 Research 1 123
Sruthi 905 Accounts 3 905
Mani 384 administration 4 384
MANAGERLIST ← π dname, name (temp2)    RELATION: MANAGERLIST
DNAME NAME
Research Alex
Accounts Sruthi
administration Mani
6. JOIN OPERATION:
The CARTESIAN product followed by SELECT is commonly used to select
related tuples from two relations. Hence a special operation called JOIN was created to specify
this sequence as a single operation. The join operation is denoted by
Q ← R ⋈ <join condition> S
Where R (A1, A2, …, An) and S (B1, B2, …, Bm) are two relations with degree n and m respectively, and Q is
the resulting relation with n + m attributes; it has one tuple for each combination of tuples, one
from R and one from S, whenever the combination satisfies the join condition. This is the main
difference between CARTESIAN PRODUCT and JOIN: in JOIN, only combinations of tuples
satisfying the join condition appear in the result, whereas in the CARTESIAN PRODUCT all
combinations of tuples are included in the result.
The JOIN operation with such a general join condition is called THETA JOIN. If the
comparison operator used is = then that join is called EQUI JOIN.
Example 8 can be performed using join operation as following
NATURAL JOIN:
Natural join requires that the two join attributes have the same name in both therelations. If this
is not the case, a renaming operation is applied first
Q ← R * S
Where * denotes the natural join operation; the relations R and S must have an attribute with the
same name. If there is more than one attribute pair with the same name, the NATURAL JOIN
operation is performed by equating all attribute pairs that have the same name in both relations.
Example 9: Retrieve the name of the department that GAYATHRI works: GAYATHRI
belongs to DNO 1 and the DNAME information is present in
DEPARTMENT relation. The foreign key DNO is used to find out these related details. Here
both primary key and foreign key are having same name so you can apply naturaljoin
Temp1 ← DEPARTMENT * EMPLOYEE
This is equivalent to applying
Temp1 ← EMPLOYEE ⋈ employee.dno = department.dno DEPARTMENT
Gayathri_dept ← π dname (σ name='Gayathri' (Temp1))
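The equivalent query can be written in SQL with an explicit join; the column names (dno, dname, name) are
assumed from the examples above:
SELECT D.dname
FROM EMPLOYEE E JOIN DEPARTMENT D ON E.dno = D.dno
WHERE E.name = 'Gayathri';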
LEFT OUTER JOIN:
Use this when you want to return all rows that have data in the left table,
even if there are no matching rows in the right table.
Source Tables: a left table Individual (Id, FirstName, LastName, UserName) and a right table of access
records (IndividualId, AccessLevel), where AccessLevel is Administrator or Contributor.
(The sample rows and result table are not fully reproduced here.)
RIGHT OUTER JOIN:
Use this when you want to return all rows that have data in the right table, even
if there are no matching rows in the left table.
Source Tables: the same Individual and access-record tables as above.
Result columns: IndividualId FirstName LastName UserName IndividualId AccessLevel
(The sample rows are not fully reproduced here.)
FULL OUTER JOIN:
Use this when you want all rows from both tables, even if there are no matching rows in the other
table.
Source Tables: the same Individual and access-record tables as above.
(The sample rows and result table are not fully reproduced here.)
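In SQL these three variants are written with LEFT, RIGHT and FULL OUTER JOIN; the table and column names
below (Individual and IndividualAccess) are assumed for illustration:
SELECT i.Id, i.FirstName, a.AccessLevel
FROM Individual i LEFT OUTER JOIN IndividualAccess a ON i.Id = a.IndividualId;

SELECT i.Id, i.FirstName, a.AccessLevel
FROM Individual i RIGHT OUTER JOIN IndividualAccess a ON i.Id = a.IndividualId;

SELECT i.Id, i.FirstName, a.AccessLevel
FROM Individual i FULL OUTER JOIN IndividualAccess a ON i.Id = a.IndividualId;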
VIEWS IN SQL
SQL is a standard language for storing, manipulating and retrieving data in databases. SQL stands for Structured
Query Language. It is designed for managing data in a relational database management system (RDBMS).
For example:
We may frequently issue queries that retrieve the employee name and project
names that the employee works on
Rather than having to specify the join of the employee, works on & project tables
every time we issue that query, we can define a view that is a result of these joins.
We then issue queries on the view.
CREATE VIEW <View name> AS SELECT <Attribute list> FROM <Table list>
WHERE <condition>;
The view is given a table name (view name), a list of attribute name, anda query
to specify the contents of the view.
Example:
CREATE VIEW EMP_PROJ
AS SELECT FNAME, LNAME, PNAME, HOURS
FROM EMPLOYEE, PROJECT, WORKS_ON
WHERE SSN=ESSN AND PNO=PNUMBER;
In this case EMP_PROJ inherits the names of the view attributes from the defining
tables EMPLOYEE, PROJECT, WORKS_ON.
EMP_PROJ
FNAME LNAME PNAME HOURS
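Once defined, the view can be queried like an ordinary table, for example (using the project name from the
earlier insert example):
SELECT FNAME, LNAME
FROM EMP_PROJ
WHERE PNAME = 'ProductX';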
• A view update is feasible when only one update on the base relations can
accomplish the desired update effect on the view.
• Whenever an update on the view can be mapped to more than one tuple on the
underlying base relations, we must have a certain procedure to choose the
desired update.
For choosing the most likely update:
• A view with a single defining table is updatable if the view attributes contain the
primary key of the base relation, because this maps each view tuple to a single
base relation tuple.
• Views defined on multiple tables using joins are generally not updateable.
• Views defined using grouping and aggregate functions are not updatable.
TUPLE RELATIONAL CALCULUS
A query in the tuple relational calculus is of the form
{ t | COND (t) }
where t is a tuple variable and COND(t) includes the range relation of the tuple and a
selection condition.
The result of such a query is the set of all tuples t that satisfy COND(t).
Example: Q1. To find all employees whose salary is above 50,000.
Tuple relational Expression:
{ t | Employee(t) and t. salary > 50000}
In the general form { t1.A1, t2.A2, …, tn.An | COND(t1, t2, …, tn, tn+1, …, tn+m) },
t1.A1, t2.A2, …, tn.An are tuple variables, and each Ai is an attribute of the relation on
which ti ranges.
• COND is made up of atoms of the form ti.Ai OP tj.Bj, or ti.Ai OP C, or
C OP tj.Bj, where C is a constant.
Q3: Retrieve the name , address of all employees who work for the research
department.
{ t.fname, t.address | Employee(t) and (∃d)( Department(d) and d.dname = 'Research'
and d.Dnumber = t.DNO) }
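The same query expressed in SQL (attribute names as used in the calculus expression above):
SELECT E.fname, E.address
FROM EMPLOYEE E, DEPARTMENT D
WHERE D.dname = 'Research' AND D.Dnumber = E.DNO;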
Safe Expression:
• The expression in relational calculus is the one which produce finite tuples as its
result, then that expression is known as Safe Expression or it is known as Unsafe
Expression.
• An expression is safe if all values in its result are from the domain of the
expression.
• Example: { t | not (Employee(t)) }
This is unsafe: it produces all tuples in the universe that are not employee tuples, which are
infinite.
DOMAIN RELATIONAL CALCULUS
• The domain relational calculus differs from the tuple relational calculus in the
type of variables used in formulas.
• Rather than having variables range over tuples, the domain
variables range over the domains of attributes.
An expression of the domain calculus is of the form:
{ x1, x2, …, xn | COND (x1, x2, …, xn, xn+1, …, xn+m) }
where x1, x2, …, xn are domain variables that range over domains (of attributes).
Q3: Retrieve the SSN of all employees who work on project number 1.
{ a | project(abc) and b = 1 }
QBE LANGUAGE
• Query By Example (QBE) language, it is one of the first graphical
language with minimum syntax developed for database systems.
• It was Developed by IBM
• It uses Point and Click type Interface.
• It differs from SQL in that the user does not have to specify a structured query
explicitly; rather, the query is formulated by filling in templates of relations
that are displayed on the monitor screen.
• The user does not have to remember the names of attributes or
relation, because they are displayed as part of these templates.
• The user does not have to follow the syntax rules.
• In QBE, the user interface first allows the user to choose the tables needed
to formulate a query by displaying a list of all relation names. The templates
for the chosen relations are then displayed.
• The user moves to the appropriate column in the templates and
specifies the query.
• Special function keys are provided to move among templates &
perform certain functions
• To specify the condition, the user can use CONDITION BOX, which is
created by pressing a particular function key.
Q2: List the SSN of employees who work on project no 1
Q3: List the SSN of employees who work more than 20 hrs per week on projectno 1
P.   _X   _HY
CONDITION BOX
_X = 1 AND _HY > 20
Q4: List the name and address of all employees who work for the research
department
EMPLOYEE  FNAME  LNAME  SSN  BDATE  ADDRESS  SALARY  DNO
P.  P.  _DX
CONDITION BOX
_DN = 'Research' AND
_DY = _DX
Aggregate functions:
AVG. , SUM. , CNT. , MAX. , MIN. ,
• In QBE the functions AVG. , SUM. , CNT. , MAX. , MIN. are applied to
distinct values within a group in the default case.
• If we want these functions to apply to all values, we must use the prefixALL
• Grouping is indicated by G. function
Q5: count the no. of distinct salary values in the employee relation.
Q7: Retrieve each department no, the no of employees within each department.
EMPLOYEE  FNAME  LNAME  SSN  BDATE  ADDRESS  SALARY  DNO
P.CNT.ALL  P.G.
Dno column is used for grouping as indicated by G. function.
Q8: Display each project name & the no. of employees working on it for project on which
more than two employee work.
P.  G._PX  CNT._EX
P. _PX
CONDITION BOX
CNT._EX>2
To Delete , specify D. operator & then condition for the tuple being deleted.
Q10: Increase the salary of John B Smith by 500 and also reassign him
to department 4.
QBE was the first user friendly “visual” relational database language
Database Design
• what does one entity do with the other, how do they relate to each other?
• For example, customers buy products, products are sold to customers, a sale
comprises products, a sale happens in a shop.
• Identifying Entities
• Identifying Relationships
• A well-designed database is easy to maintain, improves data consistency and is cost effective in
terms of disk storage space.
Functional Dependency
• Functional dependency helps you to maintain the quality of data in the database.
X → Y
• The left side of FD is known as a determinant, the right side of the production is
known as a dependent.
• A functional dependency A -> B holds in a relation if any two tuples having the same value of
attribute A also have the same value for attribute B. For example, consider the functional
dependencies over a STUDENT relation with attributes ROLL NO, NAME, MARKS, DEPT and COURSE:
• Reflexivity rule: If y is a subset of x, then x -> y holds.
• Augmentation rule: When x -> y holds and c is an attribute set, then xc -> yc also holds;
that is, adding attributes does not change the basic dependency.
• Transitivity rule: This rule is very similar to the transitive rule in algebra: if x ->
y holds and y -> z holds, then x -> z also holds. x -> y means that x functionally
determines y.
• NAME → ROLL NO : NO
• ROLL NO → MARKS : YES
• DEPT → COURSE : NO
• MARKS → DEPT : NO
• B->A : Not Valid (for same determinant B’s value ‘2’, dependent A’s value ‘1’,’2′ is
different)
• B->C : Valid (for same determinant B value, same dependent C value, for different B
value C value could be anything hence it is valid).
• C->B : Valid (for same determinant C value that is 3, same dependent B value that is
2, for different C value B value could be anything hence it is valid)
• AB->A : Valid (A’s values are unique hence determinant AB’s values will also be
unique, hence dependent A’s value could be anything, AB->any attribute of above
relation will be valid)
• AB->B : Valid (A’s values are unique hence determinant AB’s values will also be
unique, hence dependent B’s value could be anything, AB->any attribute of above
relation will be valid)
• AB->C : Valid (A’s values are unique hence determinant AB’s values will also be
unique, hence dependent C’s value could be anything, AB->any attribute or attributes
combination of above relation will be valid)
• BC->A : Not Valid (for same value {2,3} of determinant BC there are two values
{1},{2}of dependent A, hence not valid)
• BC->C : Valid (for the same value {2,3} of determinant BC, the same value {2} of dependent
B exists; this holds for any combination of determinant to dependent, hence it is valid)
• BC->C : Valid (for the same value {2,3} of determinant BC, the same value {3} of dependent
C exists; this holds for any combination of determinant to dependent, hence it is valid)
• CA->A : Valid (A’s values are unique hence determinant AC’s or CA’s values will
also be unique, hence dependent A’s value could be anything, CA->any attribute or
attributes combination of above relation will be valid)
• CA->B : Valid (A’s values are unique hence determinant AC’s or CA’s values will
also be unique, hence dependent B’s value could be anything, CA->any attribute or
attributes combination of above relation will be valid)
• CA->C : Valid (A’s values are unique hence determinant AC’s or CA’s values will
also be unique, hence dependent C’s value could be anything, CA->any attribute or
attributes combination of above relation will be valid)
• ABC->A : Valid (A’s values are unique hence determinant ABC’s values will also be
unique, hence dependent C’s value could be anything, ABC->any attribute or
attributes combination of above relation will be valid)
• ABC->B : Valid (A’s values are unique hence determinant ABC’s values will also be
unique, hence dependent B’s value could be anything, ABC->any attribute or
attributes combination of above relation will be valid)
Other kinds of dependency include trivial, non-trivial, multivalued and transitive dependencies.
Trivial functional dependency:
• A functional dependency X → Y is trivial if Y is a subset of X, i.e. the dependent attributes are
already included in the determinant.
• Examples-
• AB → A
• AB → B
• AB → AB
• For instance, {Emp_id, Emp_name} → Emp_id is trivial because Emp_id is a subset
of {Emp_id, Emp_name}.
Non-trivial functional dependency:
• If there exists at least one attribute in the RHS of a functional dependency that
is not a part of the LHS, then it is called a non-trivial functional dependency.
• Examples-
• AB → BC
• AB → CD
• AB → C, AB → AC, AC → B
Multivalued dependency
• Multivalued dependency occurs in the situation where there are multiple independent
multivalued attributes in a single table
• In this example, maf_year and color are independent of each other but dependent on
car_model. In this example, these two columns are said to be multivalue dependent on
car_model.
• car_model →→ maf_year and car_model →→ colour are multivalued dependencies.
Normalization
• Redundancy in relation may cause insertion, deletion and updation anomalies. So, it
helps to minimize the redundancy in relations.
• Normalization divides the larger table into the smaller table and links them using
relationship.
• The normal form is used to reduce redundancy from the database table.
• The inventor of the relational model Edgar Codd proposed the theory of
normalization with the introduction of the First Normal Form, and he continued to
extend theory with Second and Third Normal Form.
• When developing the schema of a relational database, one of the most important
aspects to be taken into account is to ensure that the duplication is minimized. This is
done for 2 purposes:
• Avoiding unnecessary data conflicts that may creep in because of multiple copies of
the same data getting stored.
• Reducing the storage space wasted by keeping multiple copies of the same data.
• The core idea of database normalization is to divide the tables into smaller subtables
and store pointers to data rather than replicating it.
Anomalies
• "What does anomaly mean?". The dictionary defines anomaly as “an abnormality.”
• Anomalies are problems that can occur in poorly planned, un-normalised databases
where all the data is stored in one table (a flat-file database).
• A well-normalised database has no anomalies
• and no information redundancy.
• This makes it clear when the process of reducing anomalies comes into action.
• There are three types of anomalies: update, deletion and insertion anomalies.
• Update Anomaly: Let's say we have 6 columns in a table, of which 2 are student
name and student contact number. Now if one student changes their contact
number, we would have to update the table. But the problem is, if the table is not
normalized, one student can have multiple entries, and while updating all of those
entries one of them might get missed.
• Insert Anomaly: Let's say we have 6 columns in a table, of which 2 are student
name and student contact number. Now if a new student's details have to be
inserted into the table, the problem is that, if the table is not normalized, one student
can have multiple entries, and while inserting all of those entries some of them might
get missed or become inconsistent.
Types of normalization
• It states that an attribute of a table cannot hold multiple values. It must hold only
single-valued attribute.
• First normal form disallows the multi-valued attribute, composite attribute, and their
combinations.
• A table is in 1NF if and only if each attribute holds only atomic (single) values.
• A relation is in second normal form if it is in 1NF and every non key attribute is fully
functionally dependent on the primary key.
• For a table to be in the Second Normal Form, it must satisfy two conditions: it must be in 1NF,
and every non-key attribute must be fully dependent on the whole primary key.
• Candidate Key
- A minimal set of attributes that can uniquely identify a tuple; when a table has 2 or more such key
choices, each is called a candidate key and one of them is chosen as the primary key.
• So, no partial dependency is allowed.
• Second Normal Form (2NF) relations have less redundancy than those in 1NF, they
may still suffer from update anomalies.
• If we update only one tuple and not the other, the database would be in an
inconsistent state.
• Once a table is in second normal form, we are guaranteed that every column is
dependent on the primary key,
• A relation will be in 3NF if it is in 2NF and does not contain any transitive
dependency for non-prime attributes.
• 3NF is used to reduce the data duplication. It is also used to achieve the data integrity.
• If there is no transitive dependency for non-prime attributes, then the relation must be
in third normal form.
• the same example, where we have 3 tables, Student, Subject and Score.
Student Table
• Subject Table
• In the Score table, we need to store some more information, which is the exam name
and total marks, so let's add 2 more columns to the Score table.
Requirements for Third Normal Form - For a table to be in the third normal form, it must be in the second
normal form and must have no transitive dependency for non-prime attributes.
• Primary key for our Score table is a composite key, which means it's made up of two
attributes or columns → student_id + subject_id.
• Our new column exam_name depends on both student and subject. So we can say
that exam_name is dependent on both student_id and subject_id.
• Well, the column total_marks depends on exam_name as with exam type the total
score changes. For example, practicals are of less marks while theory exams are of
more marks.
• But, exam_name is just another column in the score table. It is not a primary key or
even a part of the primary key, and total_marks depends on it.
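A minimal SQL sketch of removing this transitive dependency (column types are assumed): total_marks is moved
out of Score into a separate Exam table keyed by exam_name.
CREATE TABLE EXAM (
    exam_name   VARCHAR(30) PRIMARY KEY,
    total_marks INT                       -- total_marks now depends only on the key of EXAM
);
CREATE TABLE SCORE (
    student_id  INT,
    subject_id  INT,
    marks       INT,
    exam_name   VARCHAR(30) REFERENCES EXAM(exam_name),
    PRIMARY KEY (student_id, subject_id)  -- composite key, as described above
);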
• For BCNF, the table should be in 3NF, and for every FD, LHS is super key.
• Example: Let's assume there is a company where employees work in more than one
department.
Normal forms are insufficient on their own as a criteria for a good relational database
schema design.
– 1. Create an initial matrix S with one row i for each relation Ri in DECOMP, and
one column j for each attribute Aj in R.
– 2. Set S(i, j) := bij for all matrix entries (each entry starts as a distinct "b" symbol).
– 3. For each row i and each column j: if Ri includes attribute Aj, then set S(i, j) := aj.
– 4. Repeat the following loop until a complete loop execution results in no changes
to S: for each X → Y in F, for all rows in S which have the same symbols in the
columns corresponding to the attributes in X, make the symbols in each column that
corresponds to an attribute in Y the same in all these rows, as follows: if any of the
rows has an "a" symbol for the column, set the other rows to that same "a" symbol;
if no "a" symbol exists, choose one of the "b" symbols and set the other rows to that same "b" symbol.
– 5. If a row is made up entirely of "a" symbols, then the decomposition has the
lossless join property; otherwise it does not.
UNIT IV System Implementation Techniques: Database system architectures and the system
catalog – Query processing and optimization
Architectures for DBMSs followed trends similar to those of general computer systems
architectures. Earlier architectures used mainframe computers to provide the main processing
for all functions of the system, including user application programs, user interface programs, as
well as all the DBMS functionality. As the prices of hardware declined, most users replaced
their terminals with personal computers (PCs) and workstations. At first, database systems used
these computers in the same way as they had used display terminals, so that the DBMS itself
was still a centralized DBMS where all the DBMS functionality, application program
execution, and user interface processing were carried out in one machine.
Client-Server Architecture
The client server architecture was developed to deal with computing environments where a
large number of PCs, workstations, file servers, printers, database servers, Web servers, and
other equipment are connected together via a network. The idea is to define specialized servers
with specific functionalities. For example, it is possible to connect a number of PCs or small
workstations as clients to a file server that maintains the files of the client machines. Another
machine could be designated as a printer server by being connected to various printers;
thereafter, all print requests by the clients are forwarded to this machine. Web servers and e-mail servers also fall into the specialized-server category.
A client in this framework is typically a user machine that provides user interface capabilities
and local processing. When a client requires access to additional functionality—such as
database access—that does not exist at that machine, it connects to a server that provides the
needed functionality. A server is a machine that can provide services to the client machines,
such as printing, archiving, or database access.
In such a client-server architecture, the user interface programs and application programs can
run at the client side. When DBMS access is required, the program establishes a connection to
the DBMS— which is on the server side—and once the connection is created, the client
program can communicate with the DBMS. A standard called Open Database Connectivity
(ODBC) provides an Application Programming Interface (API), which allows client-side
programs to call the DBMS, as long as both client and server machines have the necessary
software installed. Most DBMS vendors provide ODBC drivers for their systems.
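As a concrete illustration, the following minimal Python sketch plays the role of such a client-side program, using the pyodbc binding to the ODBC driver manager. The data source name CompanyDB, the SCOTT/TIGER credentials and the EMPLOYEE table are purely hypothetical; any DBMS for which an ODBC driver is configured could sit on the server side.

import pyodbc

# Connect from the client machine to the server-side DBMS through ODBC.
conn = pyodbc.connect("DSN=CompanyDB;UID=scott;PWD=tiger")
cursor = conn.cursor()

# The client sends SQL to the DBMS on the server and fetches the results locally.
cursor.execute("SELECT Fname, Lname, Salary FROM EMPLOYEE WHERE Dno = ?", 5)
for fname, lname, salary in cursor.fetchall():
    print(fname, lname, salary)

conn.close()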
The second approach to client-server was taken by some object-oriented DBMSs. Because
many of those systems were developed in the era of client-server architecture, the approach
taken was to divide the software modules of the DBMS between client and server in a more
integrated way. For example, the server level may include the part of the DBMS software
responsible for handling data storage on disk pages, local concurrency control and recovery,
buffering and caching of disk pages, and other such functions. Meanwhile, the client level may
handle the user interface, data dictionary functions, DBMS interaction with programming language compilers, and other such higher-level functions.
The System Catalog
In relational DBMSs it is common practice to store the catalog itself as relations and to use the
DBMS software for querying, updating, and maintaining the catalog. This allows DBMS
routines (as well as users) to access the information stored in the catalog—whenever they are
authorized to do so—using the query language of the DBMS, such as SQL.
To include information on secondary key attributes of a relation, we can simply extend the
preceding catalog if we assume that an attribute can be a member of one key only. In this case
we can replace the MEMBER_OF_PK attribute of REL_AND_ATTR_CATALOG with an
attribute KEY_NUMBER; the value of KEY_NUMBER is 0 if the attribute is not a member of
any key, 1 if it is a member of the primary key, and i > 1 for the secondary key, where the
secondary keys of a relation are numbered 2, 3, ..., n.
The definitions of views must also be stored in the catalog. A view is specified by a query, with
a possible renaming of the values appearing in the query result.
The various commercial database products adopt different conventions and terminology with
regard to their system catalog.
In ORACLE, the collection of metadata is called the data dictionary. The metadata is
information about schema objects, such as tables, indexes, views, triggers, and more. Access to
the data dictionary is allowed through numerous views, which are divided into three categories:
USER, ALL, and DBA. These terms are used as prefixes for the various views. The views that
have a prefix of USER contain schema information for objects owned by a given user. Those
with a prefix of ALL contain schema information for objects owned by a user as well as objects
that the user has been granted access to, and those with a prefix of DBA are for the database
administrator and contain information about all database objects.
SELECT *
FROM ALL_CATALOG
WHERE OWNER = 'SMITH';
The result of this query indicates that three base tables are owned by SMITH (ACCOUNT,
CUSTOMERS, and ORDERS), plus a view CUSTORDER. The meaning of each column in the
result should be clear from its name.
The catalog is accessed and used by the following DBMS software modules:
1. DDL (and SDL) compilers: These DBMS modules process and check the specification of a
database schema in the data definition language (DDL) and store that description in the catalog.
Schema constructs and constraints at all levels—conceptual, internal, and external— are
extracted from the DDL and SDL (storage definition language) specifications and entered into
the catalog, as is any mapping information among levels, if necessary. Hence, these software
modules actually populate (load) the catalog’s minidatabase (or metadatabase) with data, the
data being the descriptions of database schemas.
2. Query and DML parser and verifier: These modules parse queries, DML retrieval
statements, and database update statements; they also check the catalog to verify whether all
the schema names referenced in these statements are valid. For example, in a relational system,
a query parser would check that all the relation names specified in the query exist in the catalog
and that the attributes specified belong to the appropriate relations and have the appropriate
type.
3. Query and DML compilers: These compilers convert high-level queries and DML
commands into low-level file access commands. The mapping between the conceptual schema
and the internal schema file structures is accessed from the catalog during this process. For
example, the catalog must include a description of each file and its fields and the
correspondences between fields and conceptual-level attributes.
4. Query and DML optimizer: The query optimizer accesses the catalog for access path,
implementation information, and data statistics to determine the best way to execute a
query or DML command. For example, the optimizer consults the catalog to check which fields
of a relation have indexes or hash access paths before deciding how to execute a selection or
join condition on those fields.
5. Authorization and security checking: The DBA has privileged commands to update the
authorization and security portion of the catalog. All access by a user to a relation is checked
by the DBMS for proper authorization by accessing the catalog.
A catalog is closely coupled with the DBMS software; it provides the information stored in it to
users and the DBA, but it is mainly accessed by the various software modules of the DBMS
itself, such as DDL and DML compilers, the query optimizer, the transaction processor, report
generators, and the constraint enforcer. On the other hand, the software package for a stand-
alone data dictionary or data repository may interact with the software modules of the DBMS,
but it is mainly used by the designers, users, and administrators of a computer system for
information resource management. These systems are used to maintain information on system
hardware and software configurations, documentation, applications, and users, as well as other
information relevant to system administration.
If a data dictionary system is used only by designers, users, and administrators, not by the
DBMS software, it is called a passive data dictionary; otherwise, it is called an active data
dictionary or data directory. Data dictionaries are also used to document the database design
process itself, by storing documentation on the results of every design phase and the design
decisions.
QUERY PROCESSING
Query Processing refers to the range of activities involved in extracting data from a database.
The activities include translation of queries in high-level database languages into expressions
that can be used at the physical level of the file system, a variety of query-optimizing
transformations, and actual evaluation of queries.
A given SQL query can usually be translated into several equivalent relational-algebra
expressions, and the system must choose among the corresponding evaluation plans.
The cost of query evaluation can be measured in terms of a number of different resources,
including disk accesses, CPU time to execute a query, and, in a distributed or parallel database
system, the cost of communication.
The response time for a query-evaluation plan could be used as a good measure of
the cost of the plan. In large database systems, however, disk accesses are usually the most
important cost, since disk accesses are slow compared to in-memory operations. Most
people consider the disk accesses cost a reasonable measure of the cost of a query-
evaluation plan.
The number of block transfers from disk is also used as a measure of the actual cost.
It takes more time to write a block to disk than to read a block from disk. For a more accurate
measure, count the number of blocks read and the number of blocks written separately, since
writes are more expensive than reads.
Heuristic Query Optimization
An SQL query is first translated into an initial query tree: the relations in the FROM clause
are combined with Cartesian product (×), the WHERE condition becomes a selection (σ), and the
SELECT list becomes a projection (π). For example, for the query "Find the last names of
employees born after 1957 who work on a project named 'Aquarius'", the initial tree applies a
single selection to the Cartesian product of EMPLOYEE, WORKS_ON, and PROJECT and then
projects Lname.
Optimization then transforms the query tree into something more efficient by reordering and
transforming the extended relational operations; after that the tree is converted to a query
execution plan that chooses the best algorithms to implement the portions of the tree.
Generally, perform σ and π (which tend to reduce the size of the intermediate tables) before
any ⋈ operations (which tend to multiply their sizes).
The main equivalence-preserving transformation rules are (a small sketch follows the list):
1. Cascade of σ: σc1 AND c2 AND … AND cn(R) ≡ σc1(σc2(…σcn(R)…))
2. Commutativity of σ: σa(σb(R)) ≡ σb(σa(R))
3. Commuting σ with π: πL(σc(R)) ≡ σc(πL(R)), provided the condition c depends only on
attributes in L
4. Commutativity of ⋈ (and ×): R ⋈c S ≡ S ⋈c R, and R × S ≡ S × R
5. Commuting σ with ⋈ (or ×): if all attributes of the condition belong to one relation, the
selection can be pushed to that relation, e.g. σc(R ⋈ S) ≡ (σc(R)) ⋈ S; if the condition is a
conjunction c1 AND c2 with c1 on R and c2 on S, then σc(R ⋈ S) ≡ (σc1(R)) ⋈ (σc2(S))
6. Commuting π with ⋈ (or ×): the projection list can be pushed down to R and S as long as
the join attributes are kept, e.g. πL(R ⋈c S) ≡ (πA1,…,An(R)) ⋈c (πB1,…,Bm(S)); if the join
condition c depends on additional attributes of R and S, those extra attributes are kept in the
inner projections and projected out afterwards, since they are otherwise unneeded
7. Commutativity of set operations: ∪ and ∩ are commutative (− is not), and ⋈, ×, ∪, and ∩
are individually associative
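To make rule 5 concrete, here is a minimal Python sketch using hypothetical EMP and DEPT data. It implements selection and join naively on lists of dictionaries and checks that selecting after the join and pushing the selection below the join produce the same result; in the second plan the join works on a smaller intermediate relation.

def select(rows, pred):
    return [r for r in rows if pred(r)]

def join(r_rows, s_rows, cond):
    return [{**r, **s} for r in r_rows for s in s_rows if cond(r, s)]

EMP = [{"ssn": 1, "dno": 5, "salary": 40000},
       {"ssn": 2, "dno": 4, "salary": 55000},
       {"ssn": 3, "dno": 5, "salary": 70000}]
DEPT = [{"dnumber": 4, "dname": "Admin"},
        {"dnumber": 5, "dname": "Research"}]

on_dno = lambda e, d: e["dno"] == d["dnumber"]
high_paid = lambda r: r["salary"] > 50000        # condition uses only EMP attributes

plan1 = select(join(EMP, DEPT, on_dno), high_paid)   # select after the join
plan2 = join(select(EMP, high_paid), DEPT, on_dno)   # selection pushed below the join
assert plan1 == plan2                                # same answer, smaller intermediate in plan2
print(plan2)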
General approach:
The DBMS attempts to form a good cost model of the various query operations as applied to the
current database state, using information such as attribute value statistics (histograms) and the
nature of the available indexes. The main cost components are:
Access cost
The cost of reading and writing blocks of the file on secondary storage; usually the dominant
factor. (time)
Computation cost
The CPU cost of evaluating the query; usually cheaper than storage access, but sometimes not.
(time)
Communication cost
The cost to transport query data over a network between database nodes. (time)
"Typical" databases emphasize access cost, the usual limiting factor. In-memory databases
minimize computation cost, while distributed databases put increasing emphasis on
communication cost.
Selection cost estimation uses catalog statistics and the file organization (a small sketch follows):
• Basic file data: number of records (r), record size (R), number of blocks (b).
• Primary file organization (heap file, ordered, ordered with index, hashed, with or without overflow).
• Linear search: on average half the blocks must be accessed for an equality condition on a key
attribute, and all the blocks otherwise.
• Clustering index: x block accesses for the index search to get to the cluster block, then
⌈s/bfr⌉ file blocks pointed to by the cluster (s is the number of qualifying records, i.e. the
selectivity, and bfr the blocking factor).
• Secondary (tree) index: tree height + 1 block accesses if the condition is an equality on a key;
that plus the selectivity (s) if it is not a key.
• Conjunctive conditions: the sum of the costs of the subconditions, if the set intersection of the
retrieved record pointers fits in memory.
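A minimal Python sketch of these estimates, using the usual textbook formulas for block accesses; the numeric values in the usage lines are made up purely for illustration.

import math

def cost_linear_search(b, equality_on_key=False):
    return b / 2 if equality_on_key else b            # half the blocks on average, else all

def cost_clustering_index(x, s, bfr):
    return x + math.ceil(s / bfr)                     # index levels + cluster file blocks

def cost_secondary_index(x, s=0, key=True):
    return x + 1 if key else x + 1 + s                # roughly one extra block per qualifying record

print(cost_linear_search(b=2000))                     # 2000
print(cost_clustering_index(x=2, s=300, bfr=30))      # 12
print(cost_secondary_index(x=3, s=80, key=False))     # 84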
A join selectivity (js) is the ratio of the number of tuples produced by a join to the number of
tuples produced by a cross product of the underlying relations, or js = |(R ⋈c S)| / |(R×S)| = |(R
⋈c S)| / (|R|·|S|).
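A tiny worked example of this ratio, with hypothetical relation sizes:

# If |EMPLOYEE| = 100, |DEPARTMENT| = 20, and each employee matches exactly one
# department, the join produces 100 tuples and the cross product 2000 tuples.
r_size, s_size = 100, 20
join_size = 100
js = join_size / (r_size * s_size)
print(js)                                # 0.05
# The optimizer uses js the other way round: estimated |R join S| = js * |R| * |S|.
print(js * r_size * s_size)              # 100.0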
Unit – V
Transactions
A transaction is a program including a collection of database operations, executed as a logical
unit of data processing. The operations performed in a transaction include one or more
database operations such as insert, delete, update or retrieve. It is an atomic process that is
either performed to completion in its entirety or not performed at all. A transaction involving
only data retrieval without any data update is called a read-only transaction.
Each high level operation can be divided into a number of low level tasks or operations. For
example, a data update operation can be divided into three tasks −
• read_item() − reads data item from storage to main memory.
• modify_item() − change value of item in the main memory.
• write_item() − write the modified value from main memory to storage.
Database access is restricted to read_item() and write_item() operations. Likewise, for all
transactions, read and write form the basic database operations.
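The following minimal Python sketch strings these low-level tasks together into one logical unit of work; the in-memory storage dictionary and the transfer operation are hypothetical stand-ins for disk storage and an application transaction.

storage = {"A": 500, "B": 200}       # persistent data items
buffer = {}                          # main-memory copies

def read_item(name):
    buffer[name] = storage[name]     # storage -> main memory

def write_item(name):
    storage[name] = buffer[name]     # main memory -> storage

def transfer(amount):
    """One logical unit of work: debit A, credit B."""
    read_item("A"); buffer["A"] -= amount; write_item("A")
    read_item("B"); buffer["B"] += amount; write_item("B")

transfer(100)
print(storage)                       # {'A': 400, 'B': 300}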
Transaction Operations
The basic operations a transaction can issue are BEGIN_TRANSACTION, READ or WRITE of a
data item, END_TRANSACTION, COMMIT_TRANSACTION, and ROLLBACK (ABORT).
Transaction States
A transaction may go through a subset of five states, active, partially committed, committed,
failed and aborted.
• Active − The initial state where the transaction enters is the active state. The transaction
remains in this state while it is executing read, write or other operations.
• Partially Committed − The transaction enters this state after the last statement of the
transaction has been executed.
• Committed − The transaction enters this state after successful completion of the
transaction and system checks have issued commit signal.
• Failed − The transaction goes from partially committed state or active state to failed
state when it is discovered that normal execution can no longer proceed or system
checks fail.
• Aborted − This is the state after the transaction has been rolled back after a failure and the
database has been restored to the state it was in before the transaction began.
Any transaction must maintain the ACID properties, viz. Atomicity, Consistency, Isolation, and
Durability.
• Atomicity − This property states that a transaction is an atomic unit of processing, that
is, either it is performed in its entirety or not performed at all. No partial update should
exist.
• Consistency − A transaction should take the database from one consistent state to
another consistent state. It should not adversely affect any data item in the database.
• Isolation − A transaction should be executed as if it is the only one in the system. There
should not be any interference from the other concurrent transactions that are
simultaneously running.
• Durability − Once a transaction has committed, its changes must persist in the database
even if the system fails afterwards.
Types of Schedules
There are two types of schedules −
• Serial Schedules − In a serial schedule, at any point of time only one transaction is
active, i.e. there is no overlapping of transactions.
• Parallel Schedules − In parallel schedules, more than one transaction is active
simultaneously, i.e. the transactions contain operations that overlap in time.
Conflicts in Schedules
In a schedule comprising multiple transactions, a conflict occurs when two active
transactions perform non-compatible operations. Two operations are said to be in conflict when
all of the following three conditions hold simultaneously (a small sketch follows the list) −
• The two operations are parts of different transactions.
• Both the operations access the same data item.
• At least one of the operations is a write_item() operation, i.e. it tries to modify the data
item.
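A minimal Python sketch of this conflict test; the Op tuple and the transaction names are illustrative only.

from collections import namedtuple

Op = namedtuple("Op", ["txn", "kind", "item"])   # kind is "read" or "write"

def in_conflict(op1, op2):
    # Conflict iff: different transactions, same data item, at least one write.
    return (op1.txn != op2.txn
            and op1.item == op2.item
            and "write" in (op1.kind, op2.kind))

print(in_conflict(Op("T1", "read", "X"), Op("T2", "write", "X")))   # True
print(in_conflict(Op("T1", "read", "X"), Op("T2", "read", "X")))    # False - read-read never conflicts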
Serializability
A schedule is serializable if it is equivalent to some serial schedule of the same transactions;
concurrency control protocols aim to permit only serializable schedules.
Locking-based concurrency control protocols use the concept of locking data items. A lock is a
variable associated with a data item that determines whether read/write operations can be
performed on that data item. Generally, a lock compatibility matrix is used which states whether
a data item can be locked by two transactions at the same time.
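A minimal Python sketch of such a compatibility matrix for shared (S) and exclusive (X) locks; the matrix values are the standard ones, and the grant check is illustrative.

COMPATIBLE = {("S", "S"): True, ("S", "X"): False,
              ("X", "S"): False, ("X", "X"): False}

def can_grant(requested, held_modes):
    """Grant the requested lock only if it is compatible with every lock already held."""
    return all(COMPATIBLE[(held, requested)] for held in held_modes)

print(can_grant("S", ["S", "S"]))   # True  - many readers may share an item
print(can_grant("X", ["S"]))        # False - a writer must wait for the readers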
Locking-based concurrency control systems can use either one-phase or two-phase locking
protocols.
In systems with low conflict rates, the task of validating every transaction for serializability may
lower performance. In these cases, the test for serializability is postponed to just before commit.
Since the conflict rate is low, the probability of aborting transactions which are not serializable
is also low. This approach is called optimistic concurrency control technique.
In this approach, a transaction’s life cycle is divided into the following three phases −
• Execution Phase − A transaction fetches data items to memory and performs operations
upon them.
• Validation Phase − Before committing, the transaction is checked to ensure that applying
its updates will not violate serializability.
• Commit Phase − If validation succeeds, the updates are written to the database; otherwise
the transaction is rolled back and restarted.
In this section, we will see how these techniques are implemented in a distributed database
system. Deadlocks can be handled using any of three classical approaches −
• Deadlock prevention.
• Deadlock avoidance.
• Deadlock detection and removal.
All of the three approaches can be incorporated in both a centralized and a distributed database
system.
Deadlock Prevention
The deadlock prevention approach does not allow any transaction to acquire locks that will lead
to deadlocks. The convention is that when more than one transaction requests a lock on the
same data item, only one of them is granted the lock.
Deadlock Avoidance
The deadlock avoidance approach handles deadlocks before they occur. It analyzes the
transactions and the locks to determine whether or not waiting leads to a deadlock.
The method can be briefly stated as follows. Transactions start executing and request data items
that they need to lock. The lock manager checks whether the lock is available. If it is available,
the lock manager allocates the data item and the transaction acquires the lock. However, if the
item is locked by some other transaction in incompatible mode, the lock manager runs an
algorithm to test whether keeping the transaction in waiting state will cause a deadlock or not.
Accordingly, the algorithm decides whether the transaction can wait or one of the transactions
should be aborted.
There are two algorithms for this purpose, namely wait-die and wound-wait (a sketch of both
follows the list). Let us assume that there are two transactions, T1 and T2, where T1 tries to
lock a data item which is already locked by T2. The algorithms are as follows −
• Wait-Die − If T1 is older than T2, T1 is allowed to wait. Otherwise, if T1 is younger
than T2, T1 is aborted and later restarted.
• Wound-Wait − If T1 is older than T2, T2 is aborted and later restarted. Otherwise, if T1
is younger than T2, T1 is allowed to wait.
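A minimal Python sketch of the two decision rules, using transaction timestamps (a smaller timestamp means an older transaction); the timestamps in the usage lines are made up.

def wait_die(ts1, ts2):
    """Older requester waits; younger requester dies (aborts and is later restarted)."""
    return "T1 waits" if ts1 < ts2 else "T1 aborted"

def wound_wait(ts1, ts2):
    """Older requester wounds (aborts) the younger holder; younger requester waits."""
    return "T2 aborted" if ts1 < ts2 else "T1 waits"

print(wait_die(ts1=5, ts2=9))     # T1 older   -> T1 waits
print(wait_die(ts1=9, ts2=5))     # T1 younger -> T1 aborted
print(wound_wait(ts1=5, ts2=9))   # T1 older   -> T2 aborted
print(wound_wait(ts1=9, ts2=5))   # T1 younger -> T1 waits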
Transaction processing in a distributed database system is also distributed, i.e. the same
transaction may be processed at more than one site. The two main deadlock handling concerns
in a distributed database system that are not present in a centralized system are transaction
location and transaction control. Once these concerns are addressed, deadlocks are handled
through any of deadlock prevention, deadlock avoidance or deadlock detection and removal.
Transaction Location
Transactions in a distributed database system are processed in multiple sites and use data items
in multiple sites. The amount of data processing is not uniformly distributed among these sites.
The time period of processing also varies. Thus the same transaction may be active at some sites
and inactive at others. When two conflicting transactions are located in a site, it may happen
that one of them is in inactive state. This condition does not arise in a centralized system. This
concern is called transaction location issue.
This concern may be addressed by Daisy Chain model. In this model, a transaction carries
certain details when it moves from one site to another. Some of the details are the list of tables
required, the list of sites required, the list of visited tables and sites, the list of tables and sites
that are yet to be visited and the list of acquired locks with types. After a transaction terminates
by either commit or abort, the information should be sent to all the concerned sites.
Transaction Control
Transaction control is concerned with designating and controlling the sites required for
processing a transaction in a distributed database system. There are many options regarding the
choice of where to process the transaction and how to designate the center of control.
Recovery Scheme
The buffer (cache) management policy determines which recovery actions are needed; the four
combinations and their UNDO/REDO requirements are:
Steal/No-Force (Undo/Redo)
Steal/Force (Undo/No-redo)
No-Steal/No-Force (Redo/No-undo)
No-Steal/Force (No-undo/No-redo)
A transaction may run in a distributed fashion at multiple nodes. In this execution scenario the
transaction commits only when all these multiple nodes agree to commit individually the part of
the transaction they were executing.
This commit scheme is referred to as "two-phase commit" (2PC). If any one of these nodes fails
or cannot commit its part of the transaction, then the transaction is aborted. Each node recovers
the transaction under its own recovery protocol.
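A minimal Python sketch of the 2PC decision logic described above; the participant list and vote flags are hypothetical, and the messaging between coordinator and sites is collapsed into plain function calls.

def two_phase_commit(participants):
    # Phase 1 (voting): ask every node whether it can commit its part of the transaction.
    votes = [node["can_commit"] for node in participants]
    decision = "COMMIT" if all(votes) else "ABORT"
    # Phase 2 (decision): send the global decision to every node.
    for node in participants:
        node["state"] = decision
    return decision

nodes = [{"name": "site1", "can_commit": True},
         {"name": "site2", "can_commit": True},
         {"name": "site3", "can_commit": False}]   # one failed/unwilling node
print(two_phase_commit(nodes))                     # ABORT - a single NO vote aborts the transaction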
Database Security
Database security can be classified into two categories: system security and data security.
System security includes the mechanisms that control the access and use of the database
at the system level; for example, checking valid username/password combinations and enforcing
per-user resource limits.
Data security includes the mechanisms that control the access and use of the database at
the schema object level. For example, data security includes:
•Which users have access to a specific schema object and the specific types of actions
allowed for each user on the schema object (for example, user SCOTT can issue
SELECT and INSERT statements but not DELETE statements using the employees
table)
•The actions, if any, that are audited for each schema object
Security Mechanisms
•Authentication to validate the identity of the entities using your networks, databases,
and applications
•Authorization processes to limit access and actions, limits that are linked to user's
identities and roles.
•Security policies
•Database auditing
•Administer profiles.
•Obtain information about profiles, password management, and resources from the data
dictionary.
Profiles are named sets of specific resource limits that can be assigned to any valid
username or schema in an Oracle database. They provide a mechanism to more easily
manage resource limits and password policies. This chapter provides information on
using profiles to make your life as a DBA simpler.
Often it is necessary to carefully manage resource limits and user passwords. When this
is the case (and let's face it, you aren't likely to be told that every user on your system
should have free rein to use as much CPU and memory as she can), you can use profiles
as a means to manage your resources from within the database. To use profiles, you first
need to categorize related types of users in a database. Then determine how many
profiles are needed at this point (you can always change it later) to be able to manage
and provide equitable resources for all your different types of users. Then you must
determine appropriate resource limits for each profile. Now let's get into some more
detail.
Profiles are named sets of password and resource limits. They are assigned to
individual users using the CREATE USER or ALTER USER command and can be
enabled or disabled. Every database comes with at least one profile, the DEFAULT
profile. With a profile, you can control how users consume system resources and enforce
password-management rules.
After a profile is created, the DBA assigns users to that profile. If resource limits are
used, Oracle limits the database usage and resources to the defined profile.
DEFAULT Profile
When the database is created, Oracle automatically creates the DEFAULT profile. Users
who have not been explicitly assigned to another profile will be assigned to DEFAULT.
When DEFAULT is created, it has all its resource limits set to UNLIMITED. You, as
the DBA, can change these values so that the limits in your environment reflect the
default settings that you want to have applied to your average user.
Profile Usage
Profiles restrict sets of users from performing operations that require heavy use of
resources. They ensure either that users log off the database when they are finished or
that their session is logged off after it has been idle for a specified amount of time. You
can group people and resources together based on similarities in need, a feature that is
particularly useful when managing large complex databases with many different kinds
of users.
•Assigning a profile to a user does not affect the currently logged-in session.
•If you don't assign another profile to a user when that user is created, the DEFAULT
profile is automatically assigned.
•Any parameters not set within a profile will take on the values from the DEFAULT
profile.
Passwords
Often, it is easier to exercise greater control over database security and password
management by using profiles. You can identify similar characteristics that a given set
of users has and determine password maintenance characteristics similar to those users.
Account Locking
Account locking enables the automatic locking of an account whenever a user fails to
log in to the database after a specific number of attempts. Although it is true that users
often forget passwords and need to be able to try a couple of alternatives before they get
it right, it is also a fact that people trying to gain illicit access to a database may have a
good guess at a userid but not know the password. An unlimited number of attempts
may not be a wise security decision. Limiting a user to three or five invalid attempts
limits your security exposure and provides users with a little guessing room for
passwords they may have forgotten.
Password Aging and Expiration
Two parameters are available to help with maintaining password aging and expiration.
PASSWORD_LIFE_TIME sets the maximum lifetime after which the password must
bechanged (there is no minimum time that must pass before the user must change the
password). You, as the DBA, can specify a grace period (specified by the
PASSWORD_GRACE_TIME parameter). As soon as the PASSWORD_LIFE_TIME
time has passed, at the next user login, PASSWORD_GRACE_TIME goes into effect. A
warning message is generated every time the user tries to log in until the grace period
has passed. The user is expected to change the password within the grace period. If the
password is not changed within the grace period, the account is locked.
Password History
Password history allows you to preset an amount of time that must pass before a
password can be reused. This can be set either as a specific amount of time, for example
90 days, or as a specific number of password changes, three or five or any number that
you set.
Oracle provides two parameters that assist with password history maintenance.
PASSWORD_REUSE_TIME specifies the given number of days that must pass before
a user can reuse a previous password. PASSWORD_REUSE_MAX forces a user to
arrive at a password that is not identical to earlier passwords.
PASSWORD_REUSE_MAX specifies the number of password changes required before
the current password can be reused. If you set PASSWORD_REUSE_MAX to an integer
value, you have to set PASSWORD_REUSE_TIME to UNLIMITED.
Complexity verification ensures that a password fits with rules that you set up ahead of
time that deal with how many characters must be in a password and the different
configurations that can be used for passwords. (For example, a password must have at
least one number and at least one non- number character, must not be the same or
similar to your userid, must have at most one set of duplicated letters or numbers within
the password, and so forth.)
When users change their passwords, before the password gets assigned, a PL/SQL
function can be invoked to verify the validity of the password. Oracle provides a default
verification routine, but the DBA can create an alternative PL/SQL function that is a
customized way of checking the validity. The Oracle provided parameter,
PASSWORD_VERIFY_FUNCTION, can be set to the custom PL/SQL function to
enable the use of the custom program unit for password verification.
The default verification routine enforces rules such as the following (a small sketch of this
kind of check follows):
•Password should have at least one alphabetic, one numeric, and one special character.
•Password should differ from the immediately previous password by at least three letters.
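A minimal Python sketch of such a complexity check. It is not Oracle's utlpwdmg routine; it approximates the "differs by at least three letters" rule with a simple position-by-position comparison, and the example passwords are made up.

def verify_password(new_pw, old_pw):
    has_alpha = any(c.isalpha() for c in new_pw)
    has_digit = any(c.isdigit() for c in new_pw)
    has_special = any(not c.isalnum() for c in new_pw)
    # Count positions where the two passwords differ (character-wise approximation).
    diffs = sum(a != b for a, b in zip(new_pw, old_pw)) + abs(len(new_pw) - len(old_pw))
    return has_alpha and has_digit and has_special and diffs >= 3

print(verify_password("tiger#42", "tiger#41"))   # False - differs in only one position
print(verify_password("lion$2024", "tiger#41"))  # True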
To use the VERIFY_FUNCTION, you need to have run the utlpwdmg.sql script as the
SYS user. This script changes the DEFAULT profile and sets the following defaults:
•PASSWORD_LIFE_TIME = 60
•PASSWORD_GRACE_TIME = 10
•PASSWORD_REUSE_TIME = 1800
•PASSWORD_REUSE_MAX = UNLIMITED
•FAILED_LOGIN_ATTEMPTS = 3
•PASSWORD_LOCK_TIME = 1/1440
•PASSWORD_VERIFY_FUNCTION = verify_function
Whether or not password management is enabled in a profile, the user account can be
locked or unlocked with the CREATE USER or ALTER USER command. This is the
same method used to assign profiles to users, and password limits are always enforced
reliably across users.
• Mobile Databases
• Multimedia Databases
• Geographic Information Systems
• GENOME Data Management
2. Multimedia Databases
Our houses will be wired for bandwidth to handle interactive multimedia applications.
Our high-definition TV/computer workstations will have access to a large
number of databases, including digital libraries, image and video databases that
will distribute vast amounts of multisource multimedia content.
Types of multimedia data available in current systems include text, graphics, images,
animations, video, and audio. Knowledge dissemination (for example, through electronic books
and catalogs) is one typical application area.
Categorization of GIS:
• Cartographic applications
• Digital terrain modelling applications
• Geographic objects applications
Data Warehousing
Introduction
A data warehouse is kept separate from operational databases because operational databases are
tuned for well-known, day-to-day transactional workloads, whereas warehouse queries are
complex, analytical, and run over large volumes of historical data. Data warehouses are widely
used in the following fields:
• Financial services
• Banking services
• Consumer goods
• Retail sectors
• Controlled manufacturing
Types of Data Warehouse
Information processing, analytical processing, and data mining are the three types of data
warehouse applications that are discussed below:
• Information Processing - A data warehouse allows to process the data stored in it. The
data can be processed by means of querying, basic statistical analysis, reporting using
crosstabs, tables, charts, or graphs.
• Analytical Processing - A data warehouse supports analytical processing of the
information stored in it. The data can be analyzed by means of basic OLAP operations,
including slice-and-dice, drill down, drill up, and pivoting.
• Data Mining - Data mining supports knowledge discovery by finding hidden patterns
and associations, constructing analytical models, performing classification and
prediction. These mining results can be presented using the visualization tools.
Data Mining is defined as extracting information from huge sets of data. In other words, we
can say that data mining is the procedure of mining knowledge from data. The information or
knowledge so extracted can be used for any of the following applications −
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Apart from these, data mining can also be used in the areas of production control, customer
retention, science exploration, sports, astrology, and Internet Web Surf-Aid.
Listed below are the various fields of market where data mining is used −
Customer Profiling − Data mining helps determine what kind of people buy what
kind of products.
Identifying Customer Requirements − Data mining helps in identifying the best
products for different customers. It uses prediction to find the factors that may attract
new customers.
Cross Market Analysis − Data mining performs associations/correlations between
product sales.
Target Marketing − Data mining helps to find clusters of model customers who share
the same characteristics such as interests, spending habits, income, etc.
Determining Customer purchasing pattern − Data mining helps in determining
customer purchasing patterns.
Providing Summary Information − Data mining provides us with various
multidimensional summary reports.
Fraud Detection
Data mining is also used in the fields of credit card services and telecommunication to detect
frauds. In fraud telephone calls, it helps to find the destination of the call, duration of the call,
time of the day or week, etc. It also analyzes the patterns that deviate from expected norms.
Data mining has applications across many industries, including the following −
• Financial Data Analysis
• Retail Industry
• Telecommunication Industry
• Intrusion Detection
Financial Data Analysis
The financial data in the banking and financial industry is generally reliable and of high quality,
which facilitates systematic data analysis and data mining. Some of the typical cases are as
follows −
• Design and construction of data warehouses for multidimensional data analysis and data
mining.
Retail Industry
Data mining has great application in the retail industry because it collects large amounts of data
on sales, customer purchasing history, goods transportation, consumption and services. It
is natural that the quantity of data collected will continue to expand rapidly because of the
increasing ease, availability and popularity of the web.
• Design and Construction of data warehouses based on the benefits of data mining.
• Customer Retention.
Telecommunication Industry
Today the telecommunication industry is one of the most emerging industries providing various
services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data
transmission, etc. Due to the development of new computer and communication technologies,
the telecommunication industry is rapidly expanding. This is the reason why data mining has
become very important in helping to understand the business.
Biological Data Analysis
In recent times, we have seen tremendous growth in fields of biology such as genomics,
proteomics, functional Genomics and biomedical research. Biological data mining is a very
important part of Bioinformatics. Following are the aspects in which data mining contributes for
biological data analysis −
• Discovery of structural patterns and analysis of genetic networks and protein pathways.
Other Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous data sets for
which the statistical techniques are appropriate. Huge amount of data have been collected from
scientific domains such as geosciences, astronomy, etc. A large amount of data sets is being
generated because of the fast numerical simulations in various fields such as climate and
ecosystem modeling, chemical engineering, fluid dynamics, etc. Following are the applications
of data mining in the field of Scientific Applications −
• Graph-based mining.
Intrusion Detection
Intrusion refers to any kind of action that threatens integrity, confidentiality, or the availability
of network resources. In this world of connectivity, security has become a major issue. With the
increased usage of the internet and the availability of tools and tricks for intruding into and
attacking networks, intrusion detection has become a critical component of network
administration. Data mining may be applied to intrusion detection in areas such as the
following −
• Association and correlation analysis, aggregation to help select and build discriminating
attributes.
Data mining concepts are still evolving and here are the latest trends that we get to see in this
field −
• Application Exploration.
• Integration of data mining with database systems, data warehouse systems and web
database systems.
• Web mining.