Introduction To DBMS July 2018
Introduction To DBMS July 2018
DBMS stands for Database Management System. We can break it like this DBMS
= Database + Management System. Database is a collection of data and
Management System is a set of programs to store and retrieve those data. Based on
this we can define DBMS like this: DBMS is a collection of inter-related data and
set of programs to store & access those data in an easy and effective manner.
What is the need of DBMS?
Database systems are basically developed for large amount of data. When dealing
with huge amount of data, there are two things that require optimization: Storage
of data and retrieval of data.
Storage: According to the principles of database systems, the data is stored in such
a way that it acquires lot less space as the redundant data (duplicate data) has been
removed before storage. Let’s take a layman example to understand this:
In a banking system, suppose a customer is having two accounts, one is saving
account and another is salary account. Let’s say bank stores saving account data at
one place (these places are called tables we will learn them later) and salary
account data at another place, in that case if the customer information such as
customer name, address etc. are stored at both places then this is just a wastage of
storage (redundancy/ duplication of data), to organize the data in a better way the
information should be stored at one place and both the accounts should be linked to
that information somehow. The same thing we achieve in DBMS.
Fast Retrieval of data: Along with storing the data in an optimized and systematic
manner, it is also important that we retrieve the data quickly when needed.
Database systems ensure that the data is retrieved as quickly as possible.
Database Applications – DBMS
Applications where we use Database Management Systems are:
• Telecom: There is a database to keeps track of the information regarding
calls made, network usage, customer details etc. Without the database
systems it is hard to maintain that huge amount of data that keeps updating
every millisecond.
• Industry: Where it is a manufacturing unit, warehouse or distribution
centre, each one needs a database to keep the records of ins and outs. For
example distribution centre should keep a track of the product units that
supplied into the centre as well as the products that got delivered out from
the distribution centre on each day; this is where DBMS comes into picture.
• Banking System: For storing customer info, tracking day to day credit and
debit transactions, generating bank statements etc. All this work has been
done with the help of Database management systems.
• Education sector: Database systems are frequently used in schools and
colleges to store and retrieve the data regarding student details, staff details,
course details, exam details, payroll data, attendance details, fees details etc.
There is a hell lot amount of inter-related data that needs to be stored and
retrieved in an efficient manner.
• Online shopping: You must be aware of the online shopping websites such
as Amazon, Flipkart etc. These sites store the product information, your
addresses and preferences, credit details and provide you the relevant list of
products based on your query. All this involves a Database management
system.
I have mentioned very few applications, this list is never going to end if we start
mentioning all the DBMS applications.
Advantages of DBMS over file system
Drawbacks of File system:
• Data Isolation: Because data are scattered in various files, and files may be
in different formats, writing new application programs to retrieve the
appropriate data is difficult.
• Duplication of data – Redundant data
• Dependency on application programs – Changing files would lead to change
in application programs.
Definition of schema:
The design of a database at physical level is called physical schema, how the data
stored in blocks of storage is described at this level.
keys in DBMS
Key plays an important role in relational database; it is used for identifying unique
rows from table. It also establishes relationship among tables.
Super Key – A super key is a set of one of more columns (attributes) to uniquely
identify rows in a table.
Alternate Key – Out of all candidate keys, only one gets selected as primarykey,
remaining keys are known as alternate or secondary keys.
Composite Key – A key that consists of more than one attribute to uniquely
identify rows (also known as records & tuples) in a table is called composite key.
Foreign Key – Foreign keys are the columns of a table that points to the primary
key of another table. They act as a cross-reference between tables.
Example:
Student Table
Stu _Id Stu _Name Stu _Age
101 Steve 23
102 John 24
103 Robert 28
104 Carl 22
In the above Student table, the Stu _Id column uniquely identifies each row of the
table.
Super keys:
• {Emp _SSN}
• {Emp _Number}
• {Emp _SSN, Emp _Number}
• {Emp _SSN, Emp _Name}
• {Emp _SSN, Emp _Number, Emp _Name}
• {Emp _Number, Emp _Name}
All of the above sets are able to uniquely identify rows of the employee table.
Candidate Keys:
As I stated above, they are the minimal super keys with no redundant attributes.
• {Emp _SSN}
• {Emp _Number}
Only these two sets are candidate keys as all other sets are having redundant
attributes that are not necessary for unique identification.
I have been getting lot of comments regarding the confusion between super key
and candidate key. Let me give you a clear explanation.
1. First you have to understand that all the candidate keys are super keys. This is
because the candidate keys are chosen out of the super keys.
2. How we choose candidate keys from the set of super keys? We look for those
keys from which we cannot remove any fields. In the above example, we have not
chosen {Emp_SSN, Emp_Name} as candidate key because {Emp_SSN} alone can
identify a unique row in the table and Emp_Name is redundant.
Primary key:
Primary key is being selected from the sets of candidate keys by database designer.
So Either {Emp _SSN} or {Emp _Number} can be the primary key.
Definition: Foreign keys are the columns of a table that points to the primary key
of another table. They act as a cross-reference between tables.
For example:
In the below example the Stu _Id column in Course _enrollment table is a foreign
key as it points to the primary key of the Student table.
Student table:
A super key with no redundant attribute is known as candidate key. Candidate keys
are selected from the set of super keys, the only thing we take care while selecting
candidate key is: It should not have any redundant attributes. That’s the reason
they are also termed as minimal super key.
For example:
Note: A primary key is being selected from the group of candidate keys. That
means we can either have Emp _Id or Emp _Number as primary key.
A key that consists of more than one attribute to uniquely identify rows (also
known as records & tuples) in a table is called composite key. It is also known as
compound key.
Out of all candidate keys, only one gets selected as primary key, remaining keys
are known as alternative or secondary keys.
Since we have selected Emp _Id as primary key, the remaining key Emp _Number
would be called alternative or secondary key.
Anomalies in DBMS
There are three types of anomalies that occur when the database is not normalized.
These are – Insertion, update and deletion anomaly. Let’s take an example to
understand this.
The above table is not normalized. We will see the problems that we face when a
table is not normalized.
Update anomaly: In the above table we have two rows for employee Rick as he
belongs to two departments of the company. If we want to update the address of
Rick then we have to update the same in two rows or the data will become
inconsistent. If somehow, the correct address gets updated in one department but
not in other then as per the database, Rick would be having two different
addresses, which is not correct and would lead to inconsistent data.
Insert anomaly: Suppose a new employee joins the company, who is under
training and currently not assigned to any department then we would not be able to
insert the data into the table if emp_dept field doesn’t allow nulls.
Delete anomaly: Suppose, if at a point of time the company closes the department
D890 then deleting the rows that are having emp_dept as D890 would also delete
the information of employee Maggie since she is assigned only to this department.
To overcome these anomalies we need to normalize the data. In the next section we
will discuss about normalization.
Normalization
Here are the most commonly used normal forms:
As per the rule of first normal form, an attribute (column) of a table cannot hold
multiple values. It should hold only atomic values.
Example: Suppose a company wants to store the names and contact details of its
employees. It creates a table that looks like this:
Two employees (Jon & Lester) are having two mobile numbers so the company
stored them in the same field as you can see in the table above.
This table is not in 1NF as the rule says “each attribute of a table must have atomic
(single) values”, the emp_mobile values for employees Jon & Lester violates that
rule.
To make the table complies with 1NF we should have the data like this:
An attribute that is not part of any candidate key is known as non-prime attribute.
Example: Suppose a school wants to store the data of teachers and the subjects
they teach. They create a table that looks like this: Since a teacher can teach more
than one subjects, the table can have multiple rows for a same teacher.
The table is in 1 NF because each attribute has atomic values. However, it is not in
2NF because non- prime attribute teacher_age is dependent on teacher_id alone
which is a proper subset of candidate key. This violates the rule for 2NF as the rule
says “no non-prime attribute is dependent on the proper subset of any candidate
key of the table”.
To make the table complies with 2NF e can break it in two tables like this:
teacher_details table:
teacher_id teacher_age
111 38
222 38
333 40
teacher_subject table:
teacher_id subject
111 Maths
111 Physics
222 Biology
333 Physics
333 Chemistry
An attribute that is not part of any candidate key is known as non-prime attribute.
In other words 3NF can be explained like this: A table is in 3NF if it is in 2NF and
for each functional dependency X-> Y at least one of the following conditions
hold:
An attribute that is a part of one of the candidate keys is known as prime attribute.
Super keys: {emp _id}, {emp _id, emp _name}, {emp _id, emp _name, emp
_zip}…so on
Candidate Keys: {emp _id}
Non-prime attributes: all attributes except emp _id are non-prime as they are not
part of any candidate keys.
Here, emp _state, emp _city & emp _district dependent on emp _zip. And, emp
_zip is dependent on emp _id that makes non-prime attributes (emp _state, emp
_city & emp _district) transitively dependent on super key (emp _id). This violates
the rule of 3NF.
To make this table complies with 3NF we have to break the table into two tables to
remove the transitive dependency:
employee table:
employee_zip table:
Example: Suppose there is a company wherein employees work in more than one
department. They store the data like this:
The table is not in BCNF as neither emp _id nor emp _dept alone are keys.
To make the table comply with BCNF we can break the table in three tables like
this:
emp _nationality table:
Functional dependencies:
emp _id -> emp _nationality
emp _dept -> {dept _type, dept _no_ of _ emp}
Candidate keys:
For first table: emp _id
For second table: emp _dept
For third table: {emp _id, emp _ dept}
This is now in BCNF as in both the functional dependencies left side part is a key.
1. E-R Model
2. Object oriented Model
Record based logical Models – Like Object based model, they also describe data at
the conceptual and view levels. These models specify logical structure of database
with records, fields and attributes.
1. Relational Model
2. Hierarchical Model
3. Network Model – Network Model is same as hierarchical model except that
it has graph-like structure rather than a tree-based structure. Unlike
hierarchical model, this model allows each record to have more than one
parent record.
Physical Data Models – These models describe data at the lowest level of
abstraction.
Here are the geometric shapes and their meaning in an E-R Diagram –
Ellipses: Attributes
Lines: They link attributes to Entity Sets and Entity sets to Relationship Set
Derived Attribute: A derived attribute is one whose value is dynamic and derived
from another attribute. It is represented by dashed ellipses in an E-R Diagram. E.g.
Person age is a derived attribute as it changes over time and can be derived from
another attribute (Date of birth).
Relational
model in DBMS
Sample relationship Model: Student table with 3 columns and four records.
In hierarchical model, data is organized into a tree like structure with each record
is having one parent record and many children. The main drawback of this model
is that, it can have only one to many relationships between nodes.
Course Table:
DBMS languages
Database languages are used for read, update and store data in a database. There
are several such languages that can be used for this purpose; one of them is SQL
(Structured Query Language).
All these commands specify or update the database schema that’s why they come
under Data Definition language.
Data Control language (DCL): DCL is used for granting and revoking user
access on a database –
In practical data definition language, data manipulation language and data control
languages are not separate language; rather they are the parts of a single database
language such as SQL.
For example: Suppose we have a student table with attributes: Stu _Id, Stu _Name,
Stu _Age. Here Stu _Id attribute uniquely identifies the Stu _Name attribute of
student table because if we know the student id we can tell the student name
associated with it. This is known as functional dependency and can be written as
Stu _Id->Stu _Name or in words we can say Stu _Name is functionally dependent
on Stu _Id.
Formally:
If column A of a table uniquely identifies the column B of same table then it can
represented as A->B (Attribute B is functionally dependent on attribute A)
For example: Consider a table with two columns Student _id and Student _Name.
{Student _Id, Student _Name} -> Student _Id is a trivial functional dependency as
Student _Id is a subset of {Student _Id, Student _Name}. That makes sense
because if we know the values of Student _Id and Student _Name then the value of
Student _Id can be uniquely determined.
Also, Student _Id -> Student _Id & Student _Name -> Student _Name are trivial
dependencies too.
RDBMS Concepts
Table:
A table is a collection of data represented in rows and columns. For e.g. following
table stores the information of students.
Records / Tuple:
Each row of a table is known as record or it is also known as tuple. For e.g. The
below row is a record.
Field:
The above table has four fields: Student _Id, Student _Name, Student _ Addr &
Student _Age.
Column / Attribute:
Each attribute and its values are known as attributes in a database. For e.g. Set of
values of Student _Id field is one of the four columns of the Student table.
Student _Id
101
102
103
104
Cardinality in DBMS
In DBMS you may hear cardinality term at two different places and it has two
different meanings as well.
One to One – A single row of table 1 associates with single row of table 2
One to Many – A single row of table 1 associates with more than one rows of table
2
Many to One – Many rows of table 1 associate with a single row of table 2
Many to Many – Many rows of table 1 associate with many rows of table 2
This whole set of operations can be called a transaction. Although I have shown
you read, write and update operations in the above example but the transaction can
have operations like read, write, insert, update, delete.
Now that we understand what is transaction, we should understand what are the
problems associated with it.
The main problem that can happen during a transaction is that the transaction can
fail before finishing the all the operations in the set. This can happen due to power
failure, system crash etc. This is a serious problem that can leave database in an
inconsistent state. Assume that transaction fail after third operation (see the
example above) then the amount would be deducted from your account but your
friend will not receive it.
Rollback: If any of the operation fails then rollback all the changes done by
previous operations.
Even though these operations can help us avoiding several issues that may arise
during transaction but they are not sufficient when two transactions are running
concurrently. To handle those problems we need to understand database ACID
properties.
ACID properties in DBMS
Constraints in DBMS
Constraints enforce limits to the data or type of data that can be
inserted/updated/deleted from a table. The whole purpose of constraints is to
maintain the data integrity during an update/delete/insert into a table. In this
tutorial we will learn several types of constraints that can be created in RDBMS.
Types of constraints
• NOT NULL
• UNIQUE
• DEFAULT
• CHECK
• Key Constraints – PRIMARY KEY, FOREIGN KEY
• Domain constraints
• Mapping constraints
NOT NULL:
NOT NULL constraint makes sure that a column does not hold NULL value.
When we don’t provide value for a particular column while inserting a record into
a table, it takes NULL value by default. By specifying NULL constraint, we can be
sure that a particular column(s) cannot have NULL values.
Example:
Syntax:
Example:
Here we are setting up the UNIQUE Constraint for two columns: STU_NAME &
STU_ADDRESS. which means these two columns cannot have duplicate values.
Note: STU_NAME column has two constraints (NOT NULL and UNIQUE both)
setup.
MySQL:
Syntax:
Syntax:
Example:
Syntax:
Example:
IN MySQL:
syntax:
Example:
Syntax:
Example:
UNIQUE:
DEFAULT:
CHECK:
This constraint is used for specifying range of values for a particular column of a
table. When this constraint is being set on a column, it ensures that the specified
column must have the value falling in the specified range.
In the above example we have set the check constraint on ROLL_NO column of
STUDENT table. Now, the ROLL_NO field must have the value greater than
1000.
Key constraints:
PRIMARY KEY:
Primary key uniquely identifies each record in a table. It must have unique values
and cannot contain nulls. In the below example the ROLL_NO field is marked as
primary key, that means the ROLL_NO field cannot have duplicate and null
values.
FOREIGN KEY:
Foreign keys are the columns of a table that points to the primary key of another
table. They act as a cross-reference between tables.
Domain constraints:
Each table has certain set of columns and each column allows a same type of data,
based on its data type. The column does not accept values of any other data type.
Domain constraints are user defined data type and we can define them like this:
Mapping constraints:
Mapping Cardinality:
One to One: An entity of entity-set A can be associated with at most one entity of
entity-set B and an entity in entity-set B can be associated with at most one entity
of entity-set A.
Many to One: An entity of entity-set A can be associated with at most one entity
of entity-set B and an entity in entity-set B can be associated with any number of
entities of entity-set A.
Example:
Assuming, that a customer orders more than once, the above relation represents
one to many relation. Similarly we can achieve other mapping constraints based
on the requirements.
When more than one transactions are running simultaneously there are chances of
a conflict to occur which can leave database to an inconsistent state. To handle
these conflicts we need concurrency control in DBMS, which allows transactions
to run simultaneously but handles them in such a way so that the integrity of data
remains intact.
Conflict Example
You and your brother have a joint bank account, from which you both can
withdraw money. Now let’s say you both go to different branches of the same bank
at the same time and try to withdraw 5000 INR, your joint account has only 6000
balance. Now if we don’t have concurrency control in place you both can get 5000
INR at the same time but once both the transactions finish the account balance
would be -4000 which is not possible and leaves the database in inconsistent state.
We need something that controls the transactions in such a way that allows the
transaction to run concurrently but maintaining the consistency of data to avoid
such issues.
A lock is kind of a mechanism that ensures that the integrity of data is maintained.
There are two types of a lock that can be placed while accessing the data so that the
concurrent transaction can not alter the data while we are processing it.
1. Shared Lock(S)
2. Exclusive Lock(X)
1. Shared Lock(S): Shared lock is placed when we are reading the data, multiple
shared locks can be placed on the data but when a shared lock is placed no
exclusive lock can be placed.
For example, when two transactions are reading Steve’s account balance, let them
read by placing shared lock but at the same time if another transaction wants to
update the Steve’s account balance by placing Exclusive lock, do not allow it until
reading is finished.
2. Exclusive Lock(X): Exclusive lock is placed when we want to read and write
the data. This lock allows both the read and write operation, Once this lock is
placed on the data no other lock (shared or Exclusive) can be placed on the data
until Exclusive lock is released.
For example, when a transaction wants to update the Steve’s account balance, let it
do by placing X lock on it but if a second transaction wants to read the data(S lock)
don’t allow it, if another transaction wants to write the data(X lock) don’t allow
that either.
__________________________
| | S | X |
|-------------------------
| S | True | False |
|-------------------------
| X | False | False |
--------------------------
Deadlock in DBMS
A deadlock is a condition wherein two or more tasks are waiting for each other in
order to be finished but none of the task is willing to give up the resources that
other task needs. In this situation no task ever gets finished and is in waiting state
forever.
Coffman conditions
Coffman stated four conditions for a deadlock occurrence. A deadlock may occur
if all the following conditions holds true.
Deadlock Handling
Did that made you laugh? You may be wondering how ignoring a deadlock can
come under deadlock handling. But to let you know that the windows you are
using on your PC, uses this approach of deadlock handling and that is reason
sometimes it hangs up and you have to reboot it to get it working. Not only
Windows but UNIX also uses this approach.
The question is why? Why instead of dealing with a deadlock they ignore it
and why this is being called as Ostrich algorithm?
Well! Let me answer the second question first, This is known as Ostrich algorithm
because in this approach we ignore the deadlock and pretends that it would never
occur, just like Ostrich behavior “to stick one’s head in the sand and pretend there
is no problem.”
Let’s discuss why we ignore it: When it is believed that deadlocks are very rare
and cost of deadlock handling is higher, in that case ignoring is better solution than
handling it. For example: Let’s take the operating system example – If the time
requires handling the deadlock is higher than the time requires rebooting the
windows then rebooting would be a preferred choice considering that deadlocks
are very rare in windows.
Deadlock detection
Resource scheduler is one that keeps the track of resources allocated to and
requested by processes. Thus, if there is a deadlock it is known to the resource
scheduler. This is how a deadlock is detected.
Deadlock prevention
We have learnt that if all the four Coffman conditions hold true then a deadlock
occurs so preventing one or more of them could prevent the deadlock.
Deadlock Avoidance
Deadlock can be avoided if resources are allocated in such a way that it avoids the
deadlock occurrence. There are two algorithms for deadlock avoidance.
• Wait/Die
• Wound/Wait
Here is the table representation of resource allocation for each algorithm. Both of
these algorithms take process age into consideration while determining the best
possible way of resource allocation for deadlock avoidance.
Wait/Die Wound/Wait
Older process needs a resource held by Older process Younger process
younger process waits dies
Younger process needs a resource held Younger process Younger process
by older process dies waits