Unit 1 (DBMS) Merged
Introduction:
Data is a raw and unprocessed form of facts and figures, and on its own it is usually
meaningless. For example, 3, Deepti, 90, 88, 70 is considered data. However, when you
interpret this data as "Deepti has obtained 90, 88, and 70 marks in three subjects",
the data becomes meaningful.
Thus, information is data which has been processed and has meaning, so it can be
easily interpreted. It is therefore essential to organize the available data in such
a way that it can be altered, stored, and retrieved easily. This can be achieved
with the help of a database.
Database:
Database is a collection of data, which is stored in a systematic manner,
i.e., in the form of tables so that you can retrieve and manipulate the
information quickly and efficiently from a large set of data.
1   Ravi     56   77   95
2   Riya     87   91   79
3   Deepti   90   88   70
(Each column of the table is a Field, each row is a Record, and the individual
values are the Data.)
UNIT 1
A database is basically used by developers, administrators, and end-users.
[Diagram: developers, end-users, and database administrators all interact with
the database through the DBMS.]
There are many commands in a DBMS, with the help of which the user can easily
create, manage, and delete databases.
Through a DBMS the user can do the following: -
1. The user can create, delete, and manage databases.
2. The user can store data in the database, modify the data, and delete the data.
3. The user can access the data present in the database as per his or her need.
Advantages of DBMS:
1. Controlling data redundancy – It can control data redundancy because it
stores all the data in one single database file, so each piece of data is
recorded in only one place in the database.
2. Data sharing – The authorized users of an organization can share the data
among multiple users.
3. Backup and recovery – It provides backup and recovery: it creates automatic
backups of data against hardware and software failure and restores the data
if required.
4. Multiple user interfaces – It provides users with different types of user
interfaces, such as a GUI or an application program interface.
5. Data security – Data can be accessed only by authorized users of the
organization.
6. Easy maintenance – It is easily maintainable due to the centralized nature
of the database system.
Disadvantages of DBMS:
Components of DBMS:
A database management system is made up of the following components: -
Hardware
Software
Data
Users
Procedures
Hardware
Hardware includes the computer system which is used to store and access the
database. Under this come physical electronic devices such as computers, I/O
channels, storage devices, etc. Mostly, a hard disk is used to store data in a
computer system.
Software
It is the most important component of DBMS. It is a group of programs that is
used to control and manage the entire database.
DBMS software is located between the database and the users. It provides a
very easy interface to store, update, and retrieve data in the database.
Data
Data is one of the main important components of DBMS.
Users
There are many users who access the data according to their needs. The
capability and need of each user are different. The users are as follows: -
1. Database administrators
2. Database designers
3. End-users
4. Application programmers
Procedures
Procedures come with rules and instructions for using the database
management system. Procedures tell how to use the database in the
system. Such as: - Installing and setting up the DBMS, logging in to the
database, logging out, backing up the database, and handling the database,
etc.
File Oriented System: -
The manual paper-based data storage system has many disadvantages; to overcome
the disadvantages of the paper-based system, the computer-based system was
developed.
File-oriented systems were the first attempt at a computerized system. A file
system is a method of organizing files on a hard disk or other storage medium.
The file system arranges the files and helps in retrieving them when required.
It is compatible with different file types, such as mp3, doc, txt, mp4, etc.,
and these are also grouped into directories. It also influences the method of
writing and reading data to the hard disk.
"This system stores the group of directories into separate files, and each file
operates independently from the other files in storage." As an example, a bank
has a personnel department, an accounts department, and a loan department; each
department has a different file to store its data.
In short, it is a collection of programs which manage and store data in files
and folders on a computer hard disk, and it helps in reading and writing data
to the hard disk.
Advantages of file oriented systems:
Types of DBMS models: -
Hierarchical Model
Network Model
Relational Model
Object-Oriented Model
E-R Model
1) Hierarchical Model
The hierarchical data model is one of the oldest models, developed in the 1950s
by IBM. There is a parent-child relationship: each entity has only one parent
but may have many children, and there is only one entity without a parent,
which is called the root. In this model, data is organized in a tree-like
structure.
Advantages –
Because of the parent-child relationship, its concepts are simple.
It provides database security.
It supports one-to-many relationships.
Disadvantages –
It is not flexible.
Changes in its structure have to be made in all the programs.
2) Network Model
In this model, data is organized as a graph, so a child record can have more
than one parent.
Advantages:
1. More flexible than the hierarchical model.
2. Direct access is allowed.
3. One-to-one and one-to-many relationships are allowed.
Disadvantages:
1. It is more complex.
2. Slow and difficult to maintain and operate.
3. Diagram structures are very complex and massive.
4. If we update even a single node, the whole dependency structure changes.
3) Relational DBMS
In this model, tables are used to store data, and the tables show the relation
between the data they contain; that is why they are called tables or relations.
Columns are used to store fields or attributes, and rows are used to store the
data of an entity. (Entities are real-world objects from the computer's point
of view; a mobile, a human, and a vehicle are all entities.)
In this model we can split large data into smaller and simpler tables and
connect them all with the help of the unique-key concept.
Each row of the table is called a tuple or record.
SQL (Structured Query Language) is used with this kind of DBMS.
Advantages:
1. Easy to create, maintain, and operate.
2. Very flexible.
3. Multiple tables are allowed, which improves data simplification.
4. Single-tuple operations are allowed, meaning a change will affect only
the selected tuple; the rest of the table remains the same.
5. Unique-key concepts like primary key and foreign key make operations
easier and safer.
Disadvantages:
1. The software used to create these databases is very expensive, and this kind
of software requires expensive hardware as well.
2. Highly skilled and creative manpower is required.
4) Object Oriented Database Model
In this model, data is stored in the form of objects together with their
attributes and methods, as in object-oriented programming.
Advantages:
1. Simple to use and provides easy access for users.
2. Improves security.
3. Provides durability, i.e., a long life for the product.
4. Easily holds very large groups of data.
Disadvantages:
1. No proper standards to create uniform development.
2. Difficult to create and maintain.
DBMS Architecture:
Database Languages
Database languages are used to read, store and update the data in the
database. Specific languages are used to perform various operations of the
database.
DML (Data Manipulation Language)
DML is used to manipulate the data present in the table or database. We can
easily perform operations such as store, modify, update, and delete on the
database.
Select: It shows the records of the specified table. It can also be used with a
WHERE clause to get a particular record.
Delete: It is used to delete records from the database tables. It can also be
used with a WHERE clause to delete a particular row from the table.
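As a runnable sketch of the SELECT and DELETE behaviour described above, here is a small example using Python's built-in sqlite3 module; the student table and its sample rows are taken from the marks table earlier in these notes, not from any fixed schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE student (sid INTEGER, name TEXT, marks INTEGER)")
cur.executemany("INSERT INTO student VALUES (?, ?, ?)",
                [(1, "Ravi", 56), (2, "Riya", 87), (3, "Deepti", 90)])

# SELECT with a WHERE clause fetches only the matching record
rows = cur.execute("SELECT name FROM student WHERE sid = 3").fetchall()

# DELETE with a WHERE clause removes only the matching row
cur.execute("DELETE FROM student WHERE sid = 1")
remaining = cur.execute("SELECT COUNT(*) FROM student").fetchone()[0]
```

Without the WHERE clause, the same DELETE statement would remove every row of the table, which is why the clause matters.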
DCL (Data Control Language)
DCL deals with SQL commands that are used to permit a user to access, modify,
and work on a database; it is used to control access to stored data. It grants
access, revokes access, and changes permissions as the owner of the database
requires.
Grant: It is used to give users access privileges to the database.
Revoke: It is used to take back from a user the access that was granted by the
GRANT command.
TCL (Transaction Control Language)
TCL commands group a set of changes into a logical transaction and are used to
make permanent, or undo, the changes made by DML commands in the database.
Commit: It saves all the changes made since the last commit permanently.
Rollback: The database gets restored to its original state as of the last
commit.
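The commit/rollback behaviour can be demonstrated with sqlite3 as well; the account table below is just an illustrative example, not part of the notes' own schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
cur.execute("INSERT INTO account VALUES (1, 4000)")
con.commit()  # COMMIT: the insert is now permanent

cur.execute("UPDATE account SET balance = balance - 500 WHERE id = 1")
con.rollback()  # ROLLBACK: restore the database to the state at the last commit

balance = cur.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
```

Because the UPDATE was rolled back, the balance is still the committed value of 4000.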
DBMS Interface
Database Interface is a user interface which provides the user with easy
access to the database.
With the help of database interface, the user can easily insert, delete and
update data in the database.
Command-Line Interface (CLI)
The earliest and most fundamental form of DBMS interface is the command-line
interface (CLI). Through a command prompt, users communicate with the database
by typing commands; users can access and modify the database only by using
particular commands and syntax. Today, programmers and database administrators
are the main users of the CLI.
Graphical User Interface (GUI)
The most prevalent kind of DBMS interface used nowadays is the graphical user
interface (GUI). Users interact with the database through a user-friendly
interface of graphical menus and icons. GUIs are user-friendly and need little
to no programming experience. They are mostly used by end users who are not
programmers.
Web-Based Interface
Users can access the database through a browser with the web-based interface.
They can use online forms, buttons, and other graphical components to interact
with the database. Because they can be accessed from any location with an
internet connection, web-based interfaces are common.
Natural Language Interface
This interface enables users to communicate with the database in their own
language. If queries are entered in plain English, the system converts them
into SQL instructions and carries them out. Users who are unfamiliar with SQL
commands benefit from using natural language interfaces.
Forms-Based Interface
Using the forms-based interface, users may enter data into the database
using a graphical user interface. Users can complete pre-made forms that
provide fields for particular categories of information, such as name,
address, and phone number. Applications for data entry frequently employ
forms-based interfaces.
Menu-Based Interface
Users have access to a number of predefined menus and choices when using
the menu-based interface to interact with the database. Users can choose
the menu choice that reflects the intended action, such as adding or
removing data, from the available options. The majority of specialized
applications and embedded systems employ menu-based interfaces.
Users can interact with the database thanks in large part to DBMS interfaces.
The user's technical proficiency, the nature of the program, and the target
market all influence the interface choice. The user experience and database
usability may both be improved by selecting the correct interface.
Concept of Generalization, Specialization & Aggregation.
Generalization
Generalization is the process of combining two or more lower-level entities
into a single higher-level entity on the basis of their common features.
For example, the Faculty and Student entities can be generalized to create a
higher-level entity, Person.
Specialization
Specialization is the reverse of generalization: a higher-level entity is
divided into two or more lower-level entities on the basis of their
distinguishing characteristics. For example, a Person entity can be specialized
into Faculty and Student.
Aggregation
In aggregation, the relation between two entities is treated as a single
entity; the relationship with its corresponding entities is aggregated into a
higher-level entity.
For example, the Center entity and the Course entity it offers act as a single
entity in a relationship with another entity, Visitor. In the real world, if a
visitor visits a coaching center, he will never enquire about the course alone
or about the center alone; he will enquire about both.
UNIT 2
Relational Model in DBMS
The relational model represents data as tables with columns and rows. Each row
is known as a tuple. Each column of the table has a name, or attribute.
Relational schema: A relational schema contains the name of the relation and
name of all columns or attributes.
Relational key: in a relation, each row has one or more attributes, known as
the relational key, which can identify the row in the relation uniquely.
Properties of Relations
For example, ID is used as a key in the student table because it is unique for each
student. In the PERSON table, passport_number, license_number, SSN are keys
since they are unique for each person.
Types of Keys
Primary Key:
A primary key can be defined as the candidate key that is chosen by the
database designer as the principal means of identifying entities within an
entity set.
It is a unique key.
It can identify only one tuple (a record) at a time.
It has no duplicate values; it has unique values.
It cannot be NULL.
A primary key need not be a single column; more than one column together can
also form the primary key of a table.
Example: In a STUDENT table (ROLL_NO, NAME, Registration_no):
Primary key = P1 -> ROLL_NO
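The uniqueness rule for a primary key can be checked directly with sqlite3; a minimal sketch, assuming a student table keyed on roll_no as in the example above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# roll_no is the primary key: its values must be unique
cur.execute("""CREATE TABLE student (
    roll_no INTEGER PRIMARY KEY,
    name TEXT,
    registration_no TEXT)""")
cur.execute("INSERT INTO student VALUES (1, 'Riya', 'R-101')")

# A duplicate primary-key value is rejected by the DBMS
try:
    cur.execute("INSERT INTO student VALUES (1, 'Siya', 'R-102')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
```

The second INSERT raises an IntegrityError, which is the DBMS enforcing the "no duplicate values" property of the primary key.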
Candidate Key
The minimal set of attributes that can uniquely identify a tuple (record) is
known as a candidate key.
It must contain unique values.
It can contain NULL values.
Every table must have at least one candidate key.
Example: In a Student table with attributes (ROLL_NO, NAME, Registration_no):
Candidate keys = C1 = ROLL_NO
                 C2 = Registration_no
Foreign Key
A foreign key is a column whose values match the primary key of another table.
It connects two or more relations (tables) at a time.
It acts as a cross-reference between the tables.
A foreign key is the column of one table used to point to the primary key of
another table.
TABLE 1                       TABLE 2
ROLL_NO  NAME   ID (FK)       ID (PK)  BRANCH  ADDRESS
1        RIYA   123           123      CSE     DELHI
2        SIYA   111           111      CE      BHOPAL
3        PIYA   112           112      ME      INDORE
(ID in TABLE 1 is a foreign key referencing the primary key ID of TABLE 2.)
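The cross-reference behaviour of a foreign key can be demonstrated with sqlite3; the dept/student names below are illustrative stand-ins for TABLE 2 and TABLE 1. Note that SQLite enforces foreign keys only when the pragma is switched on.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
cur = con.cursor()
cur.execute("CREATE TABLE dept (id INTEGER PRIMARY KEY, branch TEXT, address TEXT)")
cur.execute("""CREATE TABLE student (
    roll_no INTEGER PRIMARY KEY,
    name TEXT,
    id INTEGER REFERENCES dept(id))""")  # id points to dept's primary key

cur.executemany("INSERT INTO dept VALUES (?, ?, ?)",
                [(123, "CSE", "DELHI"), (111, "CE", "BHOPAL"), (112, "ME", "INDORE")])
cur.execute("INSERT INTO student VALUES (1, 'RIYA', 123)")  # 123 exists in dept

# A value with no matching primary key in dept is rejected
try:
    cur.execute("INSERT INTO student VALUES (4, 'JIYA', 999)")
    orphan_allowed = True
except sqlite3.IntegrityError:
    orphan_allowed = False
```

The rejected insert is exactly the referential-integrity guarantee: every foreign-key value must point at an existing primary-key value.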
Super key
A super key is a set of one or more attributes that, taken collectively, allow
us to identify uniquely an entity in the entity set.
Example: In a Student table with attributes (Roll_no, Name, Registration_no):
Super keys = S1 -> Roll_no
             S2 -> Registration_no
             S3 -> (Roll_no, Registration_no)
Composite Key
Whenever a primary key consists of more than one attribute, it is known as a
composite key.
Example: In a Student table with attributes (ROLLNO, SID, NAME, BRANCH):
Composite key -> (ROLLNO, SID)
Integrity Constraint
Integrity constraints are a set of rules. It is used to maintain the quality of information.
Integrity constraints ensure that the data insertion, updating, and other processes have
to be performed in such a way that data integrity is not affected.
1. Domain constraints
o Domain constraints can be defined as the definition of a valid set of values
for an attribute. The data type of a domain includes string, character,
integer, time, date, currency, etc. The value of the attribute must come from
the corresponding domain.
2. Entity integrity constraints
o The entity integrity constraint states that a primary key value cannot be
null. This is because the primary key value is used to identify individual rows
in a relation, and if the primary key had a null value, we could not identify
those rows. A table can contain null values in fields other than the primary
key field.
3. Referential integrity constraints
o A referential integrity constraint is specified between two tables. It states
that a foreign key value in one table must either be null or match an existing
primary key value in the referenced table.
Key constraints
o Keys are attributes used to identify an entity within its entity set
uniquely. An entity set can have multiple keys, but one of them will be chosen
as the primary key. A primary key must contain unique values and cannot contain
a null value in the relational table.
Relational Algebra
Relational algebra came in 1970 and was given by Edgar F. Codd (the father of
DBMS). It is also known as a procedural query language (PQL): in PQL, a
programmer/user has to mention two things, "what to do" and "how to do it".
Suppose our data is stored in a database; then relational algebra is used to
access that data from the database.
The first thing is that we have to access the data; this is specified in the
query as "what to do". But we also have to specify the method/procedure in the
query, that is, "how to do it", or how to access the data from the database.
Basic Operations
Derived Operations
Basic Operations
Six fundamental operations are mentioned below; the majority of data retrieval
operations are carried out by these: Select (σ), Project (∏), Union (∪), Set
Difference (-), Cartesian Product (X), and Rename (ρ).
Select (σ) is used to choose the tuples (rows) of a relation that satisfy a
given condition.
Project (∏) is used to choose particular columns (attributes) of a relation.
Union (∪) combines the tuples of two relations. If the relations don't have the
same set of attributes, then the union of such relations will result in NULL.
Set Difference, as its name indicates, is the difference between two relations
(R - S). It is denoted by a hyphen (-) and returns all the tuples (rows) which
are in relation R but not in relation S. It is also a binary operator.
Notation: R - S = tuples in R but not in S
Where R is the first relation
S is the second relation
R - S is not equivalent to S - R.
Just like union, set difference also requires the same set of attributes in
both relations.
Cartesian product is denoted by the "X" symbol. Let's say we have two relations
R and S. The Cartesian product combines every tuple (row) from R with every
tuple from S. It sounds complicated, but once we look at an example, you'll see
what it means.
Notation: R X S
Where R is the first relation
S is the second relation

T1              T2
A  B  C         C  D  E
1  2  3         2  2  3
2  1  2         3  3  2

(T1 X T2)
A  B  C  C  D  E
1  2  3  2  2  3
1  2  3  3  3  2
2  1  2  2  2  3
2  1  2  3  3  2
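In SQL, the Cartesian product corresponds to CROSS JOIN. As a sketch, the T1 X T2 example above can be reproduced with sqlite3 (table and column names follow the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t1 (a INTEGER, b INTEGER, c INTEGER)")
cur.execute("CREATE TABLE t2 (c INTEGER, d INTEGER, e INTEGER)")
cur.executemany("INSERT INTO t1 VALUES (?, ?, ?)", [(1, 2, 3), (2, 1, 2)])
cur.executemany("INSERT INTO t2 VALUES (?, ?, ?)", [(2, 2, 3), (3, 3, 2)])

# CROSS JOIN pairs every row of t1 with every row of t2 (T1 X T2)
product = cur.execute("SELECT * FROM t1 CROSS JOIN t2").fetchall()
```

With 2 rows in each table, the product has 2 × 2 = 4 rows, matching the table shown above.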
Rename (ρ)
The rename operation is denoted by "rho" (ρ). As its name suggests, it is used
to rename the output relation. Rename is a unary operator.
Notation: ρ (R, S)
Where R is the new relation's name
S is the old relation's name
Query 1: To rename the table Student to Employee with its attributes name,
address, phone_no:
ρ (Employee (name, address, phone_no), Student)
Derived Operations
Also known as extended operations, these operations can be derived from basic operations
and hence named Derived Operations. These include three operations: Join Operations,
Intersection operations, and Division operations.
Join Operations
Join Operation in DBMS are binary operations that allow us to combine two or more
relations.
They are further classified into two types: Inner Join, and Outer Join.
Intersection (∩)
Intersection returns the tuples that are present in both relations. For
example, if R = {1, 2, 3} and S = {3, 4, 5}, then (R ∩ S) = {3}.
Division (÷)
Division returns the values of one attribute whose tuples match every value of
another relation. Consider a relation Enrolled and a relation Course:

Enrolled            Course
S_id  C_id          C_id
S1    DBMS          DBMS
S2    DBMS          OOPM
S1    OOPM
S4    OOPM

The query is to return the S_id of students who are enrolled in every/ALL
courses:

Enrolled ÷ Course
S_id
S1
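SQL has no division operator, but the same query can be expressed with grouping. A sketch in sqlite3, using the Enrolled/Course data above (letter case normalized, since the source table mixes Dbms/dbms):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE enrolled (s_id TEXT, c_id TEXT)")
cur.execute("CREATE TABLE course (c_id TEXT)")
cur.executemany("INSERT INTO enrolled VALUES (?, ?)",
                [("S1", "DBMS"), ("S2", "DBMS"), ("S1", "OOPM"), ("S4", "OOPM")])
cur.executemany("INSERT INTO course VALUES (?)", [("DBMS",), ("OOPM",)])

# Division: keep students whose enrolled courses cover ALL rows of course
rows = cur.execute("""
    SELECT s_id FROM enrolled
    WHERE c_id IN (SELECT c_id FROM course)
    GROUP BY s_id
    HAVING COUNT(DISTINCT c_id) = (SELECT COUNT(*) FROM course)
""").fetchall()
```

Only S1 is enrolled in both DBMS and OOPM, so the HAVING count matches only for S1.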
JOIN: Join is used to combines rows from 2 or more tables based on a related column.
OR
Join commands joins two or more tables & create a temporary view joint table.
TYPES OF JOINS:
1) INNER JOIN/JOIN:
In SQL, INNER JOIN selects records that have matching values in both tables,
as long as the join condition is satisfied. It returns the combination of all
rows from both tables where the condition is satisfied.
Example:Select std1.sid, name, bid, bname from std1 inner join library on std1.sid =
library.sid;
2) LEFT JOIN:
LEFT JOIN returns all rows from the left table and the matching rows from the
right table; where there is no match, NULLs are returned for the right table's
columns.
Syntax: Select column_name from table1 left join table2 on table1.column_name =
table2.column_name;
Example: Select std1.sid, name, bid, bname from std1 left join library on
std1.sid = library.sid;
3) RIGHT JOIN:
RIGHT JOIN returns all rows from the right table and the matching rows from the
left table.
Example: Select std1.sid, name, bid, bname from std1 right join library on
std1.sid = library.sid;
4) FULL JOIN:
A FULL JOIN returns all rows from both tables. It can be built by combining a
LEFT JOIN and a RIGHT JOIN with UNION:
Select std1.sid, name, bid, bname from std1 left join library on std1.sid =
library.sid
UNION
Select std1.sid, name, bid, bname from std1 right join library on std1.sid =
library.sid;
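The join behaviour can be tried out with sqlite3. This sketch reuses the std1/library names from the examples above, with made-up rows; since SQLite has no FULL JOIN keyword in older versions, the full join is emulated with UNION of two left joins.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE std1 (sid INTEGER, name TEXT)")
cur.execute("CREATE TABLE library (sid INTEGER, bid INTEGER, bname TEXT)")
cur.executemany("INSERT INTO std1 VALUES (?, ?)",
                [(1, "Riya"), (2, "Siya"), (3, "Piya")])
cur.executemany("INSERT INTO library VALUES (?, ?, ?)",
                [(1, 501, "DBMS"), (4, 502, "OS")])

# INNER JOIN: only sids present in BOTH tables
inner = cur.execute("""SELECT std1.sid, name, bid, bname
                       FROM std1 INNER JOIN library
                       ON std1.sid = library.sid""").fetchall()

# LEFT JOIN: every std1 row; NULLs where no library row matches
left = cur.execute("""SELECT std1.sid, name, bid, bname
                      FROM std1 LEFT JOIN library
                      ON std1.sid = library.sid""").fetchall()

# FULL JOIN emulated: left join UNION the right side's rows
full = cur.execute("""SELECT std1.sid, name, bid, bname
                      FROM std1 LEFT JOIN library ON std1.sid = library.sid
                      UNION
                      SELECT library.sid, name, bid, bname
                      FROM library LEFT JOIN std1 ON std1.sid = library.sid
                      """).fetchall()
```

Only sid 1 appears in both tables, so the inner join has one row; the left join keeps all three students; the full join also picks up the unmatched library row for sid 4.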
Relational Calculus
Relational calculus is a non-procedural query language, and instead of algebra, it uses
mathematical predicate calculus. Relational Calculus in database management
system (DBMS) is all about "What to do?". Relational calculus does not tell us how to get the
results from the Database, but it just cares about what to do.
The theory of Relational calculus was introduced by computer scientist and
mathematician Edgar Codd.
It is found in two forms:
Tuple relational calculus, which was originally proposed by Codd in 1972, and
Domain relational calculus, which was proposed by Lacroix and Pirotte in 1977.
Tuple relational calculus is used for selecting those tuples that satisfy the given
condition. Tuple Relational Calculus in DBMS uses a tuple variable (t) that goes to each row of the
table and checks if the predicate is true or false for the given row. Depending on the given predicate
condition, it returns the row or part of the row.
{t | P(t)}
Where t is the tuple variable that runs over every Row, and P(t) is the predicate logic
expression or condition. The curly braces {} are used to indicate that the expression is a set of
tuples.
Table: Student
Query to display the first name and age of those students whose age is greater
than 30:
{ t.First_Name, t.Age | Student(t) AND t.Age > 30 }
Output:
First_Name Age
---------- ----
Chaitanya  31
TRIGGERS
A trigger is actually a coding block, written in SQL. A trigger is tied to some
specific event which may happen in the database. If that event takes place,
then the trigger performs the specific action for which it was designed. A
trigger is designed by the user but executes automatically. We can design
multiple triggers for multiple events. Triggers not only maintain security and
integrity but also help users access the database.
Normally a trigger is designed to execute when DML statements such as INSERT,
UPDATE, and DELETE commands are performed.
For example, in a school management database the school's fee is 5000. If a
person enters an amount less than or greater than this amount, a trigger will
execute and set the value to 5000 automatically.
Basic parts of a trigger:
A trigger has three basic parts
1. A triggering event or statement
2. A trigger restriction
3. A trigger action
Syntax of a trigger
create trigger Trigger_name
(before | after)
[insert | update | delete]      (triggering event or statement)
on [table_name]
[for each row]
[trigger_body]                  (trigger action)
Here:
create trigger: keyword to create the trigger.
Trigger_name: name of the trigger.
(before | after): whether the trigger fires before or after the triggering
statement.
[for each row]: the trigger will execute once for each row changed by the
statement.
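The school-fee example above can be sketched as a working trigger in SQLite (run here through Python's sqlite3). SQLite cannot rewrite NEW values in a BEFORE trigger, so this version uses an AFTER INSERT trigger that corrects the row; the fees table is made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE fees (sid INTEGER, amount INTEGER)")

# AFTER INSERT trigger: if the entered amount is not 5000, correct it
cur.execute("""
    CREATE TRIGGER fix_fee
    AFTER INSERT ON fees
    FOR EACH ROW
    WHEN NEW.amount <> 5000
    BEGIN
        UPDATE fees SET amount = 5000 WHERE rowid = NEW.rowid;
    END
""")

cur.execute("INSERT INTO fees VALUES (1, 4500)")  # trigger fires and corrects it
cur.execute("INSERT INTO fees VALUES (2, 5000)")  # already correct; WHEN skips it
amounts = [r[0] for r in cur.execute("SELECT amount FROM fees ORDER BY sid")]
```

Note the three parts from above: the triggering event (AFTER INSERT ON fees), the restriction (the WHEN clause), and the action (the UPDATE in the BEGIN...END body).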
Advantages of Triggers
1. Triggers can maintain the integrity of the data.
2. To perform a similar operation multiple times on a table, we can write a
trigger for that action, and afterwards the operation will execute by itself.
Disadvantages of Triggers
1. It is very tough to modify or stop a trigger. Most users cannot even
understand why a trigger executed.
2. If we can't understand its mechanism, then we can't stop it from executing.
3. A trigger is executed every time the associated operation is performed on a
table. This can increase the load on the database.
Types of triggers:
1. Row Triggers: Apply on rows
2. Statement Triggers: Apply on statement
3. Before Triggers: Before the operation or event
4. AFTER Triggers: after the operation or event
5. Cascading triggers: a trigger in a trigger
SQL QUERIES:
1) DDL (Data Definition Language) commands:
o CREATE
o ALTER
o DROP
o TRUNCATE
a. CREATE: It is used to create a new table in the database.
Syntax:
CREATE TABLE TABLE_NAME (COL_NAME DATATYPES(size));
b. DROP: It is used to delete both the structure and record stored in the
table.
c. ALTER: It is used to alter the structure of the database. This change could
be either to modify the characteristics of an existing attribute or probably to
add a new attribute.
Syntax: ALTER TABLE table_name ADD column_name;
d. TRUNCATE: It is used to delete all the rows from the table and free the
space containing the table.
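The DDL commands above can be tried with sqlite3 (the student table is illustrative). Note that SQLite itself has no TRUNCATE statement, so an unqualified DELETE is used to empty the table instead:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# CREATE: define the table structure
cur.execute("CREATE TABLE student (roll_no INTEGER, name TEXT)")

# ALTER: add a new attribute to the existing structure
cur.execute("ALTER TABLE student ADD COLUMN branch TEXT")
columns = [c[1] for c in cur.execute("PRAGMA table_info(student)")]

# SQLite has no TRUNCATE; DELETE without WHERE empties the table instead
cur.execute("INSERT INTO student VALUES (1, 'Riya', 'CSE')")
cur.execute("DELETE FROM student")
count = cur.execute("SELECT COUNT(*) FROM student").fetchone()[0]

# DROP: remove both the structure and the records
cur.execute("DROP TABLE student")
tables = cur.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
```

After ALTER the table has three columns; after the delete it is empty; after DROP it no longer exists at all, which is the key difference between emptying and dropping a table.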
2) DML (Data Manipulation Language) commands:
o INSERT
o UPDATE
o DELETE
Syntax (UPDATE):
UPDATE table_name SET [column_name1 = value1, ... column_nameN = valueN]
[WHERE condition];
3) DQL (Data Query Language) command:
o SELECT
SQL FUNCTIONS:
For doing operations on data, SQL has many built-in functions. They are
divided into two categories (aggregate and scalar functions), with several
functions under each category. The commonly used aggregate functions are:
1) COUNT Function
The COUNT function is used to count the number of rows that match the query's
condition.
Syntax:
SELECT Count (column_name) from table_name;
2) SUM Function
Sum function is used to calculate the sum of all selected columns. It works on numeric
fields only.
Syntax:
SELECT Sum (column_name) from table_name;
3) AVG function
The AVG function is used to calculate the average value of the numeric type. AVG
function returns the average of all non-Null values.
Syntax:
SELECT Avg (column_name) from table_name;
4) MAX Function
MAX function is used to find the maximum value of a certain column. This function
determines the largest value of all selected values of a column.
Syntax:
SELECT Max (column_name) from table_name;
5) MIN Function
MIN function is used to find the minimum value of a certain column. This function
determines the smallest value of all selected values of a column.
Syntax:
SELECT Min (column_name) from table_name;
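The aggregate functions above can be demonstrated together with sqlite3; the marks table and its scores are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE marks (sid INTEGER, score INTEGER)")
cur.executemany("INSERT INTO marks VALUES (?, ?)", [(1, 56), (2, 87), (3, 90)])

total = cur.execute("SELECT SUM(score) FROM marks").fetchone()[0]   # 56+87+90
average = cur.execute("SELECT AVG(score) FROM marks").fetchone()[0]
highest = cur.execute("SELECT MAX(score) FROM marks").fetchone()[0]
lowest = cur.execute("SELECT MIN(score) FROM marks").fetchone()[0]
```

Each function collapses the whole column to a single value: the sum is 233, the maximum 90, and the minimum 56.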
UNIT 3:
FUNCTIONAL DEPENDENCIES
Sid Name
01 Raj
02 Raj
Table Student
In this Student table the students' names are the same, "Raj", but we know that
they are two different students because each has a separate and unique Sid
(student ID). So here the "Name" attribute is functionally dependent on the
"Sid" attribute. If we remove the "Sid" attribute, then we cannot tell whether
the two entries are the same student or different ones.
Also, with the help of “Sid” we can retrieve student’s other details like address,
results, mobile number, school name etc.
So, we can say that a functional dependency is a constraint through which,
given the values of one attribute (or set of attributes) A, we can find the
values of another attribute B; but attribute A must be unique, i.e., a key
attribute.
It is denoted as A → B, where A is an attribute or a set of attributes that is capable
of determining the value of attribute B. (B may be a set of attributes as well).
The left side of FD (functional dependencies) is known as a determinant, the
right side of the relation is known as a dependent.
So, in A → B, A is determinant and B is dependent.
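The definition of A → B can be turned into a small check: the FD holds if no two rows agree on A but differ on B. A sketch in Python, using the Sid/Name table from above (the helper function name is made up):

```python
def holds(rows, lhs, rhs):
    """Check whether the FD lhs -> rhs holds in a list of row dicts:
    no two rows may agree on lhs while differing on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if key in seen and seen[key] != val:
            return False
        seen[key] = val
    return True

students = [
    {"Sid": "01", "Name": "Raj"},
    {"Sid": "02", "Name": "Raj"},
]

sid_determines_name = holds(students, ["Sid"], ["Name"])  # Sid -> Name holds
name_determines_sid = holds(students, ["Name"], ["Sid"])  # fails: one Name, two Sids
```

This mirrors the table: Sid → Name holds because each Sid maps to one Name, while Name → Sid fails because "Raj" maps to two different Sids.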
1. Insertion anomalies: This occurs when we are not able to insert data into
a database because some attributes may be missing at the time of
insertion.
2. Updation anomalies: This occurs when the same data items are repeated
with the same values and are not linked to each other.
3. Deletion anomalies: This occurs when deleting one part of the data
deletes the other necessary information from the database.
BCNF
It is a strict version of 3NF
4NF
A relation will be in 4NF if:
it is in Boyce-Codd normal form, and
it holds no multi-valued dependency.
5NF
A relation is in 5NF if it is in 4NF and
it does not hold any join dependency between relations.
Let's take an example of a relational table <StudentCourseDetail> that contains
the details of students and their courses.
<StudentCourseDetail>
This is a simple database table and is not normalized yet.
Let's convert this table into a 1NF table.
<StudentCourseDetail> (1NF)
As we can see, in the normalized table all the attributes which had multiple
values, such as Course Id, Course Name, and Contact Number for a single entity,
are now separated into individual rows. So all the attributes have atomic
values.
If a partial dependency exists, we can divide the table to remove the partially
dependent attributes and move them to some other table where they fit in well.
Here, in this table, the Course attribute has 2 functional dependencies, on SID
and on CID:
SID → Course
CID → Course
So now we should remove this partial dependency; the resultant tables become as
given below:
<StudentDetail>    <CourseDetail>
So now we have two tables instead of one, but these tables are interconnected
with the help of key attributes: CID is present in both tables.
A relation is in 3NF if, for every functional dependency X → Y, at least one of
the following holds:
X is a super key,
or
Y is a prime attribute, meaning each element of Y is part of some candidate
key.
What is a trivial dependency?
A functional dependency X → Y is trivial if Y is a subset of X (for example,
(Sid, Name) → Sid). By contrast, the following dependencies are all
non-trivial:
Sid → Name
Sid → City
Sid → State
State → Pincode
Multivalued Dependency
o Multivalued dependency occurs when two attributes in a table are independent of each
other but, both depend on a third attribute.
o A multivalued dependency consists of at least two attributes that are dependent on a
third attribute that's why it always requires at least three attributes.
Example: Suppose there is a bike manufacturer company which produces two colors (white
and black) of each model every year.
In this case, these two columns can be called as multivalued dependent on BIKE_MODEL. The
representation of these dependencies is shown below:
1. BIKE_MODEL →→ MANUF_YEAR
2. BIKE_MODEL →→ COLOR
o A relation will be in 4NF if it is in Boyce Codd normal form and has no multi-valued
dependency.
o For a dependency A → B, if for a single value of A, multiple values of B exist, then the
relation will be a multi-valued dependency.
Example
STUDENT
STU_ID  COURSE     HOBBY
21      Computer   Dancing
21      Math       Singing
34      Chemistry  Dancing
74      Biology    Cricket
59      Physics    Hockey
The given STUDENT table is in 3NF, but the COURSE and HOBBY are two independent entity. Hence,
there is no relationship between COURSE and HOBBY.
In the STUDENT relation, a student with STU_ID, 21 contains two courses, Computer and Math and two
hobbies, Dancing and Singing. So there is a Multi-valued dependency on STU_ID, which leads to
unnecessary repetition of data.
So to make the above table into 4NF, we can decompose it into two tables.
STUDENT_COURSE
STU_ID COURSE
21 Computer
21 Math
34 Chemistry
74 Biology
59 Physics
STUDENT_HOBBY
STU_ID HOBBY
21 Dancing
21 Singing
34 Dancing
74 Cricket
59 Hockey
5th NORMALIZATION FORM OR 5NF
o A relation is in 5NF if it is in 4NF and not contains any join dependency and joining
should be lossless.
o 5NF is satisfied when all the tables are broken into as many tables as possible in order
to avoid redundancy.
o 5NF is also known as Project-join normal form (PJ/NF).
Example
SUBJECT LECTURER SEMESTER
In the above table, John takes both Computer and Math class for Semester 1 but he doesn't take
Math class for Semester 2. In this case, combination of all these fields required to identify a
valid data.
Suppose we add a new Semester as Semester 3 but do not know about the subject and who will
be taking that subject so we leave Lecturer and Subject as NULL. But all three columns
together act as a primary key, so we can't leave other two columns blank. So, to make the above
table into 5NF, we can decompose it into three relations P1, P2 & P3:
P1
SUBJECT LECTURER
Math John
Hindi John
Hindi Akash
Math Rashmi
P2
LECTURER SEMESTER
John Semester 1
John Semester 1
Akash Semester 2
Rashmi Semester 1
P3
SEMESTER SUBJECT
Semester 1 Maths
Semester 1 Hindi
Semester 2 Hindi
Semester 1 Maths
Decomposition in DBMS
Decomposition in Database Management System is to break a relation into
multiple relations to bring it into an appropriate normal form. It helps to
remove redundancy, inconsistencies, and anomalies from a database. The
decomposition of a relation R in a relational schema is the process of replacing
the original relation R with two or more relations in a relational schema. Each
of these relations contains a subset of the attributes of R and together they
include all attributes of R.
Introduction
If a relation is not properly decomposed, then it may lead to problems like
information loss. There are two types of decomposition: lossless decomposition
and lossy decomposition.
Lossless Decomposition
We can follow certain rules to ensure that a decomposition is a lossless join
decomposition. Let's say we have a relation R and we decomposed it into R1 and
R2; then the rules are:
1. The union of attributes of both the sub relations R1 and R2 must contain
all the attributes of original relation R.
R1 ∪ R2 = R
2. The intersection of attributes of both the sub relations R1 and R2 must
not be null, i.e., there should be some attributes that are present in both
R1 and R2.
R1 ∩ R2 ≠ ∅
3. The common attributes (R1 ∩ R2) must form a super key of R1 or R2.
First, we decompose a large table into small appropriate tables, then apply
natural join to reconstruct the original table.
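The decompose-then-rejoin idea can be sketched with sqlite3. Assuming a relation R(sid, name, course) in which sid is unique (so the common attribute sid is a key of R1), the natural join of the two projections rebuilds R exactly; the table names are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Original relation R(sid, name, course)
cur.execute("CREATE TABLE r (sid INTEGER, name TEXT, course TEXT)")
cur.executemany("INSERT INTO r VALUES (?, ?, ?)",
                [(1, "Raj", "English"), (3, "Vikash", "Maths")])

# Decompose: R1(sid, name) and R2(sid, course); sid is the common attribute
cur.execute("CREATE TABLE r1 AS SELECT DISTINCT sid, name FROM r")
cur.execute("CREATE TABLE r2 AS SELECT DISTINCT sid, course FROM r")

# Natural join on sid reconstructs R exactly -> lossless decomposition
rejoined = cur.execute("SELECT * FROM r1 NATURAL JOIN r2 ORDER BY sid").fetchall()
original = cur.execute("SELECT * FROM r ORDER BY sid").fetchall()
```

Because sid is a key of R1, every R2 row matches exactly one R1 row, so no spurious tuples appear and nothing is lost; if the common attribute were not a key of either side, the join would generate extra rows and the decomposition would be lossy.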
Student Details
1 Raj English
3 Vikash Maths
1 Harsh Maths
3 Ajay Science
Student Personal Details:
If we want to see a combined table, then we can apply a natural join between
both tables, like this:
In this operation no data loss occurs, so this is a good option to consider for
decomposition.
Lossy Decomposition
In a lossy decomposition, one or more of these conditions would fail and we will
not be able to recover Complete information as present in the original relation.
1 Raj English
3 Vikash Maths
1 Harsh Maths
3 Ajay Science
Student Personal Details:
Mobile Address
So always decompose a table in such a manner that the data may be easily
reconstructed and retrieved.
Dependency Preserving
The second desirable property of decomposition is dependency preservation, which says
that after decomposing a relation R into R1 and R2, all dependencies of the original
relation R must be present either in R1 or R2, or they must be derivable using the
combination of functional dependencies present in R1 and R2.
Theory
Consider a schema R(A, B, C, D) with functional dependencies A -> B and C -> D, which
is decomposed into R1(AB) and R2(CD). Here A -> B holds within R1 and C -> D holds
within R2, so the decomposition is dependency preserving.
Example
Let a relation R (A, B, C, D) and a set of FDs F = {A -> B , A -> C , C -> D} be given.
The relation R is decomposed into –
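Since the decomposition for this example is elided here, the standard dependency-preservation test can be sketched on the theory example above (R(A, B, C, D) decomposed into R1(AB) and R2(CD)); the function names are illustrative.

```python
def closure(attrs, fds):
    """Compute the closure of a set of attributes under a set of FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def preserves(fd, decomposition, fds):
    """Standard test: grow the left-hand side using only attributes
    visible inside each sub relation Ri, then check the right-hand side."""
    lhs, rhs = fd
    z = set(lhs)
    while True:
        old = set(z)
        for ri in decomposition:
            # Project the closure of (Z n Ri) back onto Ri.
            z |= closure(z & set(ri), fds) & set(ri)
        if z == old:
            break
    return set(rhs) <= z

fds = [("A", "B"), ("A", "C"), ("C", "D")]
decomp = ["AB", "CD"]
print(preserves(("A", "B"), decomp, fds))  # True: A -> B holds inside R1
print(preserves(("A", "C"), decomp, fds))  # False: no Ri contains both A and C
```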
Operations in Transaction
A transaction consists of a set of operations that together perform one logical
unit of work. For example, when we go to withdraw money from an ATM, we
encounter the following set of operations:
1. Transaction Initiated
2. You have to insert an ATM card
3. Select your choice of language
4. Select whether savings or current account
5. Enter the amount to withdraw
6. Entering your ATM pin
7. Transaction processes
8. You collect the cash
9. You press finish to end transaction
The above mentioned are the operations done by you. But in the case of a
transaction in a DBMS, there are three major operations that are used for a
transaction to get executed in an efficient manner. These are:
Read(X): Read operation is used to read the value of X from the database and
store it in a buffer in main memory.
Write(X): Write operation is used to write the value back to the database from
the buffer.
Commit: Commit operation is used to make the changes permanent in the database.
1. R(X);
2. X = X - 500;
3. W(X);
UNIT 4
Let's assume the value of X before starting of the transaction is 4000.
o The first operation reads X's value from database and stores it in a buffer.
o The second operation will decrease the value of X by 500. So buffer will
contain 3500.
o The third operation will write the buffer's value to the database. So X's final
value will be 3500.
But it may be possible that, because of a hardware, software, or power failure,
the transaction fails before finishing all the operations in the set.
For example: if, in the above transaction, the debit transaction fails after
executing operation 2, then X's value will remain 4000 in the database, which is
not acceptable to the bank.
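A minimal sketch of the Read/Write-through-a-buffer behaviour described above (the class and method names are made up for illustration):

```python
class MiniDB:
    """Toy model: read(X) copies X into a main-memory buffer, write(X)
    updates only the buffer, and commit() flushes the buffer to the
    'database' (stable storage)."""
    def __init__(self, data):
        self.db = dict(data)       # stable storage
        self.buffer = {}           # main-memory buffer

    def read(self, x):
        self.buffer[x] = self.db[x]
        return self.buffer[x]

    def write(self, x, value):
        self.buffer[x] = value     # the database itself is untouched

    def commit(self):
        self.db.update(self.buffer)
        self.buffer.clear()

m = MiniDB({"X": 4000})
m.read("X")
m.write("X", m.buffer["X"] - 500)   # buffer now holds 3500
# If the system crashes here, before commit, the database still shows 4000.
print(m.db["X"])                    # 4000
m.commit()
print(m.db["X"])                    # 3500
```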
During the lifetime of a transaction, it passes through a number of states. These
states inform the system about the current progress of the transaction and also
tell the user how to plan further processing of the transaction. These states
determine whether the transaction will finally commit or abort.
The ROLLBACK statement undoes the changes made by the current
transaction. A transaction cannot undo changes after COMMIT has been executed.
Active State: When the operations of a transaction are running then the
transaction is said to be active state. If all the read and write operations
are performed without any error then it progresses to the partially
committed state, if somehow any operation fails, then it goes to a state
known as failed state.
Partially Committed: After all the read and write operations are
completed, the changes which were previously made in the main memory
are now made permanent in the database, after which the state will
progress to committed state but in case of a failure it will go to the failed
state.
Failed State: If any operation during the transaction fails due to some
software or hardware issue, then it goes to the failed state. A failure during a
transaction means that no permanent change is made to the data in the database;
the changes made to the data in local memory are rolled back to the previous
consistent state.
Aborted State: If the transaction fails during its execution, it goes
from failed state to aborted state and because in the previous states all the
changes were only made in the main memory, these uncommitted
changes are either deleted or rolled back. The transaction at this point can
restart and start afresh from the active state.
Committed State: If the transaction completes all sets of operations
successfully, all the changes made during the partially committed state
are permanently stored and the transaction is stated to be completed, thus
the transaction can progress to finally get terminated in the terminated
state.
Terminated State: If the transaction gets aborted after roll-back or the
transaction comes from the committed state, then the database comes to a
consistent state and is ready for further new transactions since the
previous transaction is now terminated.
There are four major properties that are vital for a transaction to be successful.
These are used to maintain state consistency in the database, both before and
after the transaction. These are called ACID properties.
1. Atomicity: This property means that either the transaction takes place
completely at once or doesn’t happen at all. There is no middle option,
i.e., transactions do not occur partially. Each transaction is considered as
one single step which either runs completely or is not executed at all.
2. Consistency: This property means that the integrity constraints of a
database are maintained so that the database is consistent before and after
the transaction. It refers to the correctness of a database.
3. Isolation: This property means that multiple transactions can occur
concurrently without causing any inconsistency to the database state.
These transactions occur independently without any external interference.
Changes that occur in a particular transaction are not visible/ accessible to
any other transaction until that particular change in that transaction has
been committed.
4. Durability: This property ensures that once the transaction has
completed execution, the updates and modifications to the database are
stored in and written to disk and they remain intact even if a system
failure occurs. These updates become permanent and are stored in the
non-volatile memory.
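Atomicity in particular can be pictured with a small sketch: all operations are applied to a working copy, and the database is updated only if every operation succeeds. The helper below is illustrative, not a real DBMS mechanism.

```python
def run_atomically(db, operations):
    """Apply all operations or none: work on a copy of the database and
    install the copy only if every operation succeeds."""
    snapshot = dict(db)        # working copy
    try:
        for op in operations:
            op(snapshot)
        db.clear()
        db.update(snapshot)    # 'commit': all changes become visible at once
        return True
    except Exception:
        return False           # 'abort': the database was never touched

accounts = {"A": 1000, "B": 500}

def debit(d): d["A"] -= 300
def credit(d): d["B"] += 300
def fail(d): raise RuntimeError("power failure")

run_atomically(accounts, [debit, fail, credit])
print(accounts)   # unchanged: {'A': 1000, 'B': 500}
run_atomically(accounts, [debit, credit])
print(accounts)   # {'A': 700, 'B': 800}
```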
Schedule
A series of operations from one or more transactions, arranged in execution order, is
known as a schedule. It is used to preserve the order of the operations within each
individual transaction.
Serial Schedule
As the name says, all the transactions are executed serially, one after the other.
In a serial schedule, a transaction does not start execution until the currently
running transaction finishes execution. This type of execution
is also known as non-interleaved execution. Serial schedules are always
recoverable, cascadeless, strict, and consistent. A serial schedule always gives the
correct result.
Non-Serial Schedule
In a non-serial schedule, the other transaction proceeds without the completion
of the previous transaction. All the transaction operations are interleaved or
mixed with each other.
In this Schedule, there are two transactions, T1 and T2, executing concurrently.
The operations of T1 and T2 are interleaved. So, this Schedule is an example of
a Non-Serial Schedule.
SERIALIZABILITY
Serializable means obtaining an equivalent output as of a serial schedule for the
same 'n' number of transactions.
Serializability helps preserve the consistency and concurrency of a database.
There are two methods widely used to check serializability: conflict equivalence
and view equivalence.
A non-serial schedule is said to be serializable if it is conflict equivalent or
view equivalent to a serial schedule.
Conflict Serializability
To check conflict serializability, we draw a precedence graph
G = (V, E),
where V consists of a set of vertices and E consists of a set of edges. The set of
vertices contains all the transactions participating in the schedule. The set of
edges contains all edges Ti -> Tj for which one of the following three conditions
holds:
1. Ti executes write(Q) before Tj executes read(Q).
2. Ti executes read(Q) before Tj executes write(Q).
3. Ti executes write(Q) before Tj executes write(Q).
For example:
The precedence graph for schedule S1 contains a cycle that's why Schedule S1
is non-conflict serializable.
For example:
The precedence graph for schedule S2 contains no cycle that's why Schedule S2
is Conflict serializable.
View Serializability
The final write operation in S is done by T3 and in S1, it is also done by T3. So,
S and S1 are view Equivalent.
The first schedule S1 satisfies all three conditions, so we don't need to check
another schedule. Hence, view equivalent serial schedule is:
T1 → T2 → T3
Question-01:
Check whether the given schedule S is conflict serializable or not-
S : R1(A) , R2(A) , R1(B) , R2(B) , R3(B) , W1(A) , W2(B)
Solution-
Step-01:
List all the conflicting operations and determine the dependency between the
transactions-
R2(A) , W1(A) (T2 → T1)
R1(B) , W2(B) (T1 → T2)
R3(B) , W2(B) (T3 → T2)
Step-02:
Draw the precedence graph-
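The conflict detection and cycle check for this question can also be sketched programmatically; the schedule string format and function names below are assumptions for illustration.

```python
import re

def precedence_edges(schedule):
    """Build precedence-graph edges Ti -> Tj from conflicting operations:
    two operations on the same item by different transactions where at
    least one is a write, taken in schedule order."""
    ops = re.findall(r"([RW])(\d+)\((\w+)\)", schedule)
    edges = set()
    for i, (a_op, a_tx, a_item) in enumerate(ops):
        for b_op, b_tx, b_item in ops[i + 1:]:
            if a_item == b_item and a_tx != b_tx and "W" in (a_op, b_op):
                edges.add((a_tx, b_tx))
    return edges

def has_cycle(edges):
    """Detect a cycle in the precedence graph with a depth-first search."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
    visiting, done = set(), set()
    def dfs(u):
        visiting.add(u)
        for v in graph.get(u, ()):
            if v in visiting or (v not in done and dfs(v)):
                return True
        visiting.discard(u)
        done.add(u)
        return False
    return any(dfs(u) for u in graph if u not in done)

S = "R1(A), R2(A), R1(B), R2(B), R3(B), W1(A), W2(B)"
print(precedence_edges(S))               # edges T2->T1, T1->T2, T3->T2
print(has_cycle(precedence_edges(S)))    # True -> not conflict serializable
```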
Question-02:
Check whether the given schedule S is conflict serializable and recoverable or not-
Step-02:
Draw the precedence graph-
Question-03:
Check whether the given schedule S is view serializable or not-
Solution-
We know, if a schedule is conflict serializable, then it is surely view serializable.
So, let us check whether the given schedule is conflict serializable or not.
Step-02:
Draw the precedence graph-
Question-04:
Check whether the given schedule S is view serializable or not. If yes, then give the
serial schedule.
S : R1(A) , W2(A) , R3(A) , W1(A) , W3(A)
Solution-
For simplicity and better understanding, we can represent the given schedule pictorially
as-
Step-02:
Draw the precedence graph-
Since, the given schedule S is not conflict serializable, so, it may or may not be
view serializable.
To check whether S is view serializable or not, let us use another method.
Let us check for blind writes.
Recoverability of Schedule
Sometimes a transaction may not execute completely due to a software issue,
system crash, or hardware failure. In that case, the failed transaction has to be
rolled back. But some other transaction may also have used a value produced by the
failed transaction, so we also have to roll back those transactions.
1. Cascading Schedule
2. Cascadeless Schedule
3. Strict Schedule
Cascading Schedule
If in a schedule, the failure of one transaction causes several other
dependent transactions to roll back or abort, then such a schedule is called
a cascading schedule (cascading rollback).
Cascadeless Schedule
If in a schedule, a transaction is not allowed to read a data item until
the last transaction that has written it is committed or aborted,
then such a schedule is called a cascadeless schedule. It avoids cascading
rollback and thus saves CPU time.
Strict Schedule
If in a schedule, a transaction is neither allowed to read nor write a data
item until the last transaction that has written it is committed or aborted,
then such a schedule is called a strict schedule.
Concurrency Control
When several transactions execute concurrently without any rules and
protocols, various problems arise that may harm the data integrity of
several databases. These problems are known as concurrency control
problems. Therefore, several rules are designed, to maintain consistency
in the transactions while they are executing concurrently which are
known as concurrency control protocols.
The dirty read problem in DBMS occurs when a transaction reads the
data that has been updated by another transaction that is still
uncommitted. It arises due to multiple uncommitted transactions
executing simultaneously.
The unrepeatable read problem occurs when two or more different values
of the same data are read during the read operations in the same
transaction.
Transaction A and B initially read the value of DT as 1000. Transaction
A modifies the value of DT from 1000 to 1500 and then again transaction
B reads the value and finds it to be 1500. Transaction B finds two
different values of DT in its two different read operations.
The Lost Update problem arises when two different transactions update the same
data item and one update overwrites the other.
Transaction A          Transaction B
R(A) // 1000
A = A + 500
W(A) // 1500
                       A = A + 300
                       W(A) // 1800
R(A) // 1800
Lock-Based Protocols
There are two kinds of locks used in Lock-based protocols:
Shared Lock(S): The locks which disable the write operations but allow
read operations for any data in a transaction are known as shared locks.
They are also known as read-only locks and are represented by 'S'.
Exclusive Lock(X): The locks which allow both the read and write
operations for any data in a transaction are known as exclusive locks.
Only one transaction at a time can hold an exclusive lock on a given data
item. They are represented by 'X'.
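The usual lock compatibility rules implied above can be sketched as a small table (a hedged illustration, not a real lock manager):

```python
# Lock compatibility: a shared lock is compatible only with other shared
# locks; an exclusive lock is compatible with nothing.
COMPATIBLE = {
    ("S", "S"): True,
    ("S", "X"): False,
    ("X", "S"): False,
    ("X", "X"): False,
}

def can_grant(requested, held_modes):
    """A lock request is granted only if it is compatible with every
    lock currently held on the data item by other transactions."""
    return all(COMPATIBLE[(held, requested)] for held in held_modes)

print(can_grant("S", ["S", "S"]))  # True: many readers can share an item
print(can_grant("X", ["S"]))       # False: a writer must wait for readers
print(can_grant("S", ["X"]))       # False: a reader must wait for the writer
```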
Two-Phase Locking
The two phases of Locking are :
Growing Phase: In the growing phase, the transaction only obtains
locks. The transaction cannot release any lock in the growing phase. Once
the transaction releases its first lock, it enters the shrinking phase.
Shrinking Phase: In the shrinking phase, the transaction only releases
locks; no new locks can be obtained in this phase.
In the above figure, we can see that in the growing phase, locks are
obtained until the point at which all the locks needed by the transaction
have been acquired. This point is called the lock point. After the lock point,
the transaction enters the shrinking phase.
Two-phase Locking is further classified into three types :
Strict Two-Phase Locking: The transaction can release a shared lock
after the lock point, but it cannot release any exclusive lock until the
transaction commits.
In basic two-phase locking, if one transaction rolls back, then the other
transactions that have read its data also have to roll back; the
transactions are dependent on each other. This is called a cascading
rollback, and strict two-phase locking avoids it.
Conservative Two-Phase Locking: The transaction must lock all the data
items it requires before the transaction begins.
If any of the data items are not available for locking, then no data
items are locked.
The read and write sets of data items must be known before the
transaction begins, which is normally not possible.
The conservative two-phase locking protocol is deadlock-free.
The conservative two-phase locking protocol does not ensure a strict
schedule.
The transaction which is created first or you can say older transactions are
given high priority over new transactions.
For example, if there are two transactions T1 and T2, and T1 enters the system
at time 008 and T2 enters the system at time 009, then T1 is given priority over T2.
In order to ensure that the conflicting operations occurring in the schedule do
not violate the timestamp ordering, two timestamp values relating to each
database item (X) are used:
Read_TS(X): the largest timestamp among all the timestamps of transactions
that have successfully read item X.
Write_TS(X): the largest timestamp among all the timestamps of transactions
that have successfully written item X.
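A minimal sketch of the basic timestamp-ordering checks on a single data item X, using the read/write timestamp idea (the class name and return values are illustrative):

```python
class TimestampOrdering:
    """Basic timestamp-ordering checks for one data item X, keeping
    read_TS(X) and write_TS(X)."""
    def __init__(self):
        self.read_ts = 0
        self.write_ts = 0

    def read(self, ts):
        if ts < self.write_ts:          # a younger txn already wrote X
            return "abort"
        self.read_ts = max(self.read_ts, ts)
        return "ok"

    def write(self, ts):
        if ts < self.read_ts or ts < self.write_ts:
            return "abort"              # a younger txn already used X
        self.write_ts = ts
        return "ok"

x = TimestampOrdering()
print(x.read(10))    # ok
print(x.write(10))   # ok
print(x.read(20))    # ok
print(x.write(15))   # abort: txn 15 is older than reader txn 20
```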
Advantages
No deadlock occurs when timestamp ordering protocol is used as no
transaction waits.
No older transaction waits for a longer period of time so the protocol is
free from deadlock.
Timestamp based protocol in dbms ensures that there are no conflicting
items in the transaction execution.
Disadvantages
Log-Based Recovery
Transactions that perform a write operation to the database only modify it when
a log record is created for that write operation. The logs may or may not record
the reading of data items. The reason is that reading a data item has no effect on
the consistency of the database and is not beneficial to the recovery mechanism.
1. Undo(Ti): Restores the old values of all the data items that have
been updated by the transaction Ti.
2. Redo(Ti): Updates all data items updated by transaction Ti with their
new values. Redoing a transaction Ti requires the log to contain both the
record <Ti start> and the record <Ti commit>.
By using immediate updates, the database can be modified and output while still
in the active state of the transaction. Data modifications written during an active
transaction are referred to as uncommitted modifications.
For consistency reasons, old values of data items should be used if the system
crashes or a transaction fails. Undoing an operation is a way to accomplish this.
When Ti starts its execution, a log entry <Ti start> is written. A
new update record is written to the log before each write(X) operation by Ti is
executed. When Ti is partially committed, the record <Ti commit> is written to the log.
All database modifications are recorded in the log using the deferred database
modification technique to ensure transaction atomicity. In this technique, the
writes are applied to the database only after the transaction partially commits.
As soon as the final action of the transaction has been completed, the
transaction is said to be partially committed.
Redo(Ti) sets the values of the data items updated by transaction Ti to their
new values. The log includes a list of the data items that Ti updated and their
new values. Redo must be idempotent: repeating it many times must have the same
effect as performing it once.
In the event of a failure, the recovery subsystem consults the log for information
about which transactions need to be redone. If the log record includes both <Ti
start> and <Ti commit> statements, Transaction Ti is redone.
Suppose there is a failure just after the log record for step write(B) of
transaction T0. The recovery process will not perform any redo operation, as we
only have <T0 start> in the log record but not <T0 commit>.
If a system crash occurs just after writing the log record for C, during recovery
we do only redo(T0), since the log disk contains both <T0 start> and
<T0 commit>. However, the log disk does not yet contain <T1 commit>, so
we cannot do redo(T1).
If the system crashes just after the log record <T1 commit>, then during the
recovery we will do both redo(T0) and redo(T1), since we have both <T0
start>, <T0 commit> and <T1 start>, <T1 commit> in the log disk.
The system uses the log to determine what transactions need to be undone and
which need to be re-done if the system crashes.
If the log contains both the records <Ti, start> and <Ti, commit>, then
transaction Ti should be redone. If the log contains the record <Ti, start> but
contains neither <Ti, commit> nor <Ti, abort>, then transaction Ti should be
undone.
CHECK POINT
The DBMS checkpoint is a procedure of compressing the transaction log file by
transferring the old transaction to permanent storage. It helps in system recovery
when the failure occurs. The procedure of system recovery includes reading of
log file in reverse order and maintaining redo and undo lists.
Following are the steps to be performed for the system recovery using
checkpoint.
The transaction log file is read in the reverse order, i.e., from T4 to T1.
Redo and Undo are the lists that are created and maintained by the
system.
If the transactions contain operations like <Tn, start> and <Tn,
commit> together or <Tn, commit> alone then the transaction will be
stored in the redo list. In the above diagram, The transaction T1 contains
only <Tn, commit> and transactions T2 and T3 contain <Tn, start> and
<Tn, commit> both the operations and therefore transactions T1, T2 and
T3 are stored in the redo list.
If the transactions contain operations like <Tn, start> but not <Tn,
commit>, then the transaction will be stored in undo list. In the above
diagram, The transaction T4 contains <Tn, start> but not <Tn, commit>
operation, and therefore transaction T4 is stored in undo list.
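The redo/undo classification described in these steps can be sketched as follows; the log record format is an assumption for illustration.

```python
def recover(log):
    """Scan the log and classify transactions: a transaction with a
    <commit> record goes to the redo list; one seen without any commit
    record goes to the undo list."""
    seen, committed = [], set()
    for tx, action in log:
        if tx not in seen:
            seen.append(tx)
        if action == "commit":
            committed.add(tx)
    redo = [tx for tx in seen if tx in committed]
    undo = [tx for tx in seen if tx not in committed]
    return redo, undo

# Log after the checkpoint, as in the description above: T1 has only its
# commit record, T2 and T3 have both start and commit, T4 has only start.
log = [("T1", "commit"), ("T2", "start"), ("T3", "start"),
       ("T2", "commit"), ("T4", "start"), ("T3", "commit")]
print(recover(log))   # (['T1', 'T2', 'T3'], ['T4'])
```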
UNIT 5
File Organization
o A file is a collection of records. Using the primary key, we can access the records.
The type and frequency of access are determined by the type of file organization
used for a given set of records.
o File organization is a logical relationship among various records. This method defines
how file records are mapped onto disk blocks.
o File organization is used to describe the way in which the records are stored in terms of
blocks, and the blocks are placed on the storage medium.
o The first approach to mapping the database to files is to use several files and store
only fixed-length records in any given file. An alternative approach is to structure
our files so that they can contain variable-length records.
Sequential File Organization
This method is the easiest method of file organization. In this method, files
are stored sequentially. It can be implemented in two ways:
o Pile File Method: It is a quite simple method. In this method, we store the records in a
sequence, i.e., one after another, in the order in which they are inserted into the tables.
o In case of updating or deleting of any record, the record will be searched in the memory blocks.
When it is found, then it will be marked for deleting, and the new record is inserted.
Insertion of the new record:
Suppose we have records R1, R3, and so on up to R9 and R8 in a sequence. A record
is nothing but a row in a table. If we want to insert a new record R2 into the
sequence, it will be placed at the end of the file.
o Sorted File Method: In this method, the new record is always inserted at the file's end,
and then the sequence is sorted in ascending or descending order. Sorting of records is
based on the primary key or any other key.
o In the case of modification of any record, it will update the record and then sort the file, and
lastly, the updated record is placed in the right place.
o It is a fast and efficient method for handling huge amounts of data.
o In this method, files can easily be stored on cheaper storage mechanisms like magnetic tapes.
o It is simple in design and requires little effort to store the data.
o This method is used when most of the records have to be accessed like grade calculation of a
student, generating the salary slip, etc.
o It wastes time, as we cannot jump directly to a required record but have to move
through the records sequentially.
o The sorted file method takes more time and space for sorting the records.
Heap File Organization
o It is the simplest and most basic type of organization. It works with data blocks. In heap
file organization, the records are inserted at the file's end. When records are inserted,
no sorting or ordering of the records is required.
If we want to search, update, or delete data in heap file organization, then we need to
traverse the data from the start of the file till we get the requested record.
If the database is very large then searching, updating or deleting of record will be time-
consuming because there is no sorting or ordering of records. In the heap file organization, we
need to check all the data until we get the requested record.
o It is a very good method of file organization for bulk insertion. If a large amount
of data needs to be loaded into the database at one time, then this method is best suited.
o In the case of a small database, fetching and retrieving records is faster than in
sequential organization.
o This method is inefficient for the large database because it takes time to search or
modify the record.
o This method is inefficient for large databases.
Need for an Indexing
The main purpose behind designing a database is to store data and get quick access
to it, and to allow data to be easily inserted, updated, and deleted. As a database
is of large size, searching for even one record takes time; even a small transaction
takes time to perform its action. In order to reduce the time spent in transactions,
indexes are used.
Types of Indexes
According to the attributes defined above, we divide indexing into three types:
It is somewhat like the index (or the table of contents) found in a book. Index of
a book contains topic names along with the page number similarly the index
table of the database contains keys and their corresponding block address.
Example:
3. Cluster Indexing: In cluster indexing, the main file is sorted on a non-key
attribute. There is one index entry for each unique value of the non-key
attribute, and the block pointer points to the first block in which that
value appears. A clustering index is an example of a sparse index, because
it does not keep an entry for every record in the file but only one entry
for every unique value of the non-key attribute.
Example:
Let’s assume that a company has recruited many employees in various
departments. In this case, a clustering index should be created for all
employees who belong to the same department.
The employees of one department are considered to be in a single cluster, and
the index points to the cluster as a whole. Here, Department_no is a non-unique key.
Single Level Indexing
B tree
B tree is an m-way tree that self-balances. Because of their balanced nature, such trees are
widely used to organize and manage massive datasets and to facilitate searches.
Binary trees have a maximum of 2 children, whereas a Multiway Search Tree, also known as
an M-way tree, has a maximum of m children where m>2. M-way trees are primarily used in
external searching, i.e. retrieving data from secondary storage such as disc files, due to their
structure.
Structure of B Tree
A B-tree of order m is an m-way search tree that is either empty or satisfies the following
characteristics:
The figure above is an example of a B Tree. It has [11] at the root. 9, being less
than 11, falls in the left child. [16, 18], being greater than 11, falls in the right
child. The same process follows as we go down the tree.
From this, we can see that any node (except root) in a B-tree is at least half-full and this
avoids wastage of storage space. The B tree is perfectly balanced so that the number of nodes
accessed to find a key becomes less.
Operations on B Tree
The B tree may be split or joined to ensure that none of its properties are violated during the
operations.
Searching in B Tree
The process of searching in a B Tree is similar to that of searching in a Binary Search Tree.
Unlike a BST, however, instead of making a 2-way decision (left or right), a B tree makes
an n-way decision at each node, where n is the node's number of children.
1. The search begins at the root node. Compare the search element k with the keys in the root.
o If the root node contains the key k, the search is complete.
o If k is less than the first key in the root, move to the leftmost child and
recursively search that child.
o If k is greater than the last key in the root, move to the rightmost child and
recursively search that child.
o Otherwise, k lies between two keys of the root; follow the child pointer
between them and search that child recursively.
2. If k is not found after traversing the entire tree, the search element does not exist in
the B Tree.
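The n-way search described above can be sketched on the small tree from the figure ([11] at the root, children [9] and [16, 18]); the Node class below is a simplified stand-in for a real B-tree node.

```python
class Node:
    def __init__(self, keys, children=None):
        self.keys = keys                  # sorted keys in this node
        self.children = children or []    # len(children) == len(keys) + 1

def search(node, k):
    """n-way search: find the first key >= k, then descend if needed."""
    i = 0
    while i < len(node.keys) and k > node.keys[i]:
        i += 1
    if i < len(node.keys) and node.keys[i] == k:
        return True                       # key found in this node
    if not node.children:
        return False                      # reached a leaf without finding k
    return search(node.children[i], k)    # descend into the i-th child

# The tree from the figure: root [11], left child [9], right child [16, 18].
root = Node([11], [Node([9]), Node([16, 18])])
print(search(root, 18))   # True
print(search(root, 7))    # False
```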
Insertion in B Tree
When inserting an element into the B tree, we must first check to see if the item
is already existent in the tree. If the element is found, the insertion does not
occur. If not, we will continue with the insertion. 2 cases must be addressed
when inserting the key into the leaf node:
1. Node contains no more than MAX number of keys - Simply add the key
into the node in its proper location.
2. Node has the maximum number of keys - A key is inserted into the full
node, and then a split occurs, dividing the full node into three parts.
The second or median key becomes the parent node, while the first and
third parts serve as the left and right children. If the parent node also has
a MAX number of keys, the process is repeated.
The following are the steps for inserting an element into B Tree:
1. Determine the maximum number of keys in the node based on the B tree's order.
2. If the tree is empty, a root node is allocated and the key is inserted and acts as the root
node.
3. Search for the appropriate node for insertion.
4. If the node is full:
4.1. Insert the elements in increasing order.
4.2. If the elements are greater than the maximum number of keys, split at the median.
4.3. Push the median key upwards and split the left and right keys as left and right
child respectively.
5. If the node is not full, insert the key in increasing order.
Deletion in B Tree
During insertion, we had to make sure that the number of keys in the node did
not exceed a certain limit. Similarly, during deletion, we must ensure that the
number of keys remaining in the node does not fall below the minimum number
of keys that a node can hold. As a result, instead of a split, a combination
process occurs in the case of deletion.
We would consider two scenarios when deleting an element from the B Tree:
1.1. If the node has more than MIN keys - Deleting a key does not violate any
property, and thus the key can be easily deleted by shifting other node keys, if
necessary.
1.2. If the node has less than MIN keys - This type of deletion violates a B tree
property. If the keys in the left sibling exceed MIN, keys are borrowed from
there. If the keys in the right sibling exceed MIN, keys are borrowed from there.
If either of these conditions is not met, a node combination occurs with either of
the siblings.
Applications of B Tree
B+ tree
B+ tree is simply an improved version of the B tree that allows for efficient
and smooth insertion, deletion, and sequential access. It helps to mitigate the
disadvantages of the B-tree by saving data pointers at the leaf node level and
simply storing key values in internal nodes.
The B+ tree is also known as an advanced self-balanced tree since every path
from the tree's root to its leaf is the same length. The fact that all of the leaf
nodes are the same length indicates that they all occur at the same level. It is not
possible for some leaf nodes to appear at the third level while others appear at
the second level.
Structure of B+ Tree
A B+ tree is the same as a B tree; the only difference is that the B+ tree has an
additional level with linked leaves at the bottom. In addition, unlike the B tree,
each node in a B+ tree contains only keys rather than key-value pairs.
Operations of B+ Tree
Most operations in B+ Tree are faster and make up for the drawbacks of the B
tree.
Searching in B+ Tree
In the B tree, our search ended when we discovered the key, but this will not be
the case in the B+ tree. It is possible that a key exists in the internal node but not
in the leaf node of a B+ tree. This is because when a data item is deleted from
the leaf node, the corresponding key is not deleted from the internal node. The
search will take place only within the leaf nodes.
Insertion in B+ Tree
First, a search is performed, and if the key is not found in the leaf node, we can
have one of two outcomes, depending on whether the leaf node has
the maximum number of keys or not:
1. If the leaf node has fewer than the maximum number of keys, key and data are
simply inserted in an ordered manner into the leaf node, and the index set is not
changed.
2. If the leaf node has the maximum number of keys, we will have to split it. After the
old node, a new leaf node is allocated and inserted into the sequence set (linked list of
leaf nodes). All keys less than the median key remain in the old leaf node, while all
keys greater than or equal to the median key are moved to the new node, along with
the corresponding data items. The median key becomes the new node's first key, and
this key (without a data item) is copied (rather than moved) to the parent node,
which is an internal node. As a result, this median key is present both in the leaf
node and in the internal node which is the parent of the leaf node.
Splitting of a leaf node:
o Keys < median remain in the old leaf node.
o Keys >= median go to the new leaf node.
o The median key is copied to the parent node.
If after splitting of the leaf node, the parent becomes full then again a split has
to be done. The splitting of an internal node is similar to that of the splitting of a
node in a B Tree. When an internal node is split the median key is moved to the
parent node.
Splitting of an internal node:
o Keys < median remain in the old node.
o Keys > median go to the new node.
o The median key is moved to the parent node.
This splitting continues till we get a nonfull parent node. If the root node is split
then a new root node has to be allocated.
Deletion in B+ Tree
First, a search is performed, and if the key is found in the leaf node, we
can have one of two outcomes, depending on whether the leaf node has
more than the minimum number of keys or not:
1. If the leaf node has more than the minimum number of keys, then we can simply delete
the key and its data item, shifting the other elements of the leaf node if required. In this case,
the index set is not changed, i.e. if the key is present in any internal node also then it
is not deleted from there. This is because the key still serves as a separator key
between its left and right children.
2. If the leaf node has the minimum number of keys:
2.1. If any sibling has more than the minimum number of keys, then a key is borrowed
from it and the separator key in the parent node is updated accordingly.
2.2. If both siblings have only the minimum number of keys, then we merge the underflow
node with a sibling. This is done by moving keys and data from the underflow node to the
sibling node and deleting the underflow node. The merging could result in an underflow
in the parent node, which is an internal node. For internal nodes, the borrowing and
merging process is similar to that of the B tree.
Applications of B+ Tree
A B+ tree is used to store records very efficiently by indexing the records using
the B+ tree indexed structure. Data access becomes faster and easier as a result
of multi-level indexing.
A B+ tree's structural simplicity is preferred by many database system implementers.
As a result, it is commonly used in multilevel and database indexing.
Hashing
Important Terminologies
Data Bucket
Data buckets are the memory locations where the actual data records are
stored.
Hash function
A mathematical function that returns the value of the address of the
desired data block. Mostly, it uses the primary key to generate the address
of a data block.
Hash Index
It denotes the address of a data block generated by the hash function.
Bucket Overflow
The condition when the memory address generated by the hash function
is not empty.i.e. it already contains some data record) is known as bucket
overflow as shown in the image below –
Hashing techniques can be further divided into two types i.e., static hashing and
dynamic hashing.
1. Static Hashing
In static hashing, the resultant data bucket address always remains the same.
Therefore, if you generate an address for, say, Student_id = 10 using the hash
function mod(3), the resultant bucket address will always be 1. So you will not
see any change in the bucket address.
Therefore, in this static hashing method, the number of data buckets in
memory always remains constant.
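The Student_id example can be shown in a couple of lines (the fixed bucket count of 3 matches the mod(3) function used above):

```python
NUM_BUCKETS = 3  # fixed in static hashing; the bucket count never grows or shrinks

def bucket_address(student_id):
    # static hash function: the address for a given key never changes
    return student_id % NUM_BUCKETS

print(bucket_address(10))  # 10 mod 3 -> 1, the same result every time
```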
Delete a record: Using the hash function, you can first fetch the record that you
want to delete. Then you can remove the record from that address in memory.
If we want to insert a new record into the file but the address of the data bucket
generated by the hash function is not empty (i.e., data already exists at that
address), the situation is known as bucket overflow in static hashing. To
overcome it, the following methods are used –
Open hashing
Close hashing
Open hashing
In open hashing method, instead of overwriting older and the next
available data block is used to enter the new record, this method is also
known as Linear Probing.
For example: A2 is a new record that you want to insert. The hash function
generates the address 222, but it is already occupied by some other value.
Therefore, the system looks for the next available data bucket, 501, and
assigns A2 to it.
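Linear probing can be sketched as below. This is a minimal illustration: the bucket count, the keys, and the `insert_open` helper are assumptions, and here a colliding record simply lands in the next free bucket rather than at a specific address like 501.

```python
NUM_BUCKETS = 1000
buckets = [None] * NUM_BUCKETS  # data buckets, indexed by hash address

def h(key):
    return key % NUM_BUCKETS

def insert_open(key, record):
    """Open hashing / linear probing: on collision, scan forward to the
    next empty bucket (wrapping around at the end of the table)."""
    addr = h(key)
    for i in range(NUM_BUCKETS):
        probe = (addr + i) % NUM_BUCKETS
        if buckets[probe] is None:
            buckets[probe] = (key, record)
            return probe
    raise RuntimeError("all buckets are full")

insert_open(222, "A1")           # lands at its home address, 222
addr = insert_open(1222, "A2")   # also hashes to 222 -> probes to 223
```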
Close hashing
In the close hashing method. When buckets are full a new bucket it
allocated for the same hash and result are linked after the previous one.
This mechanism is known as overflow changing. This method requires on
extra link filed to each table position.
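Overflow chaining can be sketched as follows. A per-bucket Python list plays the role of the linked chain (and of the extra link field); the table size and keys are arbitrary choices for illustration.

```python
NUM_BUCKETS = 8
table = [[] for _ in range(NUM_BUCKETS)]  # each position holds a chain of records

def h(key):
    return key % NUM_BUCKETS

def insert_chained(key, record):
    """Overflow chaining: records that hash to the same address are
    linked one after another in that position's chain."""
    table[h(key)].append((key, record))

insert_chained(3, "A1")
insert_chained(11, "A2")   # 11 % 8 == 3 -> chained after A1
# table[3] -> [(3, 'A1'), (11, 'A2')]
```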
2. Dynamic Hashing
The problem with static hashing is that it does not expand or shrink dynamically
as the database size grows or shrinks. Dynamic hashing helps us to overcome
the problem of bucket overflow. Dynamic hashing provides a mechanism in
which data buckets are added or removed dynamically and on demand. Dynamic
hashing is also known as extended hashing.
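The idea can be sketched with a simplified "grow on overflow" table. This is only an illustration of the principle: real extendible hashing uses a bucket directory with global and local depths and splits only the overflowing bucket, whereas this sketch (with an assumed capacity of 2) doubles the whole address space and rehashes everything.

```python
class DynamicHashTable:
    """Minimal sketch of dynamic hashing: the number of buckets grows
    with the data instead of staying fixed as in static hashing."""

    BUCKET_CAPACITY = 2  # assumed small capacity so overflows happen quickly

    def __init__(self):
        self.num_buckets = 2
        self.buckets = [[] for _ in range(self.num_buckets)]

    def insert(self, key):
        b = self.buckets[key % self.num_buckets]
        b.append(key)
        if len(b) > self.BUCKET_CAPACITY:   # bucket overflow detected
            self._grow()

    def _grow(self):
        self.num_buckets *= 2               # address space doubles on demand
        old = [k for b in self.buckets for k in b]
        self.buckets = [[] for _ in range(self.num_buckets)]
        for k in old:                       # rehash under the new modulus
            self.buckets[k % self.num_buckets].append(k)

t = DynamicHashTable()
for k in [1, 3, 5, 7, 9]:
    t.insert(k)
```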
RAID (Redundant Array of Independent Disks)
RAID combines multiple physical disks into one unit and addresses the following problems:
No backup disk: If a single disk failure occurs, the whole system fails to
perform. With data redundancy, where a copy of the data is stored on multiple
disks, the system can still fetch the data from another redundant disk even if
one disk fails.
Performance: If a large amount of data is stored on a single disk, it can
degrade the performance and effectiveness of the system. This can be
solved by using multiple disks with redundant data.
In the RAID technique, the combined disks are treated as a single logical disk
by the operating system. The individual disks store data using different
methods, depending on the RAID level used.
RAID 0
RAID 1
RAID 2
RAID 3
RAID 4
RAID 5
RAID 6
RAID 0
RAID 0 implements data striping. The data blocks are placed in multiple disks
without redundancy. None of the disks are used for data redundancy so if one
disk fails then all the data in the array is lost.
Blocks 10, 11, 12, and 13 form a stripe. Also, no data block is repeated on any
disk.
UNIT 5
Instead of placing one block of data in a disk, we can place more than one block
of data in a disk and then move to another disk.
In the above example, blocks 10 and 11 are first placed on Disk 0, and then we
move to another disk to place further blocks.
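The striping layout can be expressed as simple arithmetic. The disk count, chunk size, and `raid0_location` helper are illustrative assumptions; the point is that consecutive blocks within a chunk share a disk, and the next chunk moves to the next disk.

```python
NUM_DISKS = 4
BLOCKS_PER_CHUNK = 2  # blocks placed on a disk before moving to the next one

def raid0_location(block_num):
    """Map a logical block number to (disk, offset) under RAID 0 striping,
    with BLOCKS_PER_CHUNK consecutive blocks stored per disk per stripe."""
    chunk = block_num // BLOCKS_PER_CHUNK
    disk = chunk % NUM_DISKS
    offset = (chunk // NUM_DISKS) * BLOCKS_PER_CHUNK + block_num % BLOCKS_PER_CHUNK
    return disk, offset

# blocks 10 and 11 land on the same disk; block 12 moves to the next disk
print(raid0_location(10), raid0_location(11), raid0_location(12))
```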
Pros of RAID 0
All the disk space is utilized and hence performance is increased.
Data requests can be on multiple disks and not on a single disk hence
improving the throughput.
Cons of RAID 0
Failure of one disk can lead to complete data loss in the respective array.
No data Redundancy is implemented so one disk failure can lead to
system failure.
RAID 1
RAID 1 implements mirroring which means the data of one disk is replicated
in another disk. This helps in preventing system failure as if one disk fails then
the redundant disk takes over.
Here Disk 0 and Disk 1 have the same data as disk 0 is copied to disk 1. Same is
the case with Disk 2 and Disk 3.
Pros of RAID 1
Failure of one Disk does not lead to system failure as there is redundant
data in other disk.
Cons of RAID 1
Extra space is required for each disk as each disk data is copied to some
other disk also.
RAID 2
RAID 2 is used when errors in data have to be checked at the bit level, using a
Hamming-code detection method. Two groups of disks are used in this technique:
one group stores the bits of each data word, and the other stores the
error-correction code (parity bits) of the data words. The structure of this
RAID is complex, so it is not commonly used.
Here Disk 3, Disk 4 and Disk 5 store the parity bits of the data stored in
Disk 0, Disk 1, and Disk 2 respectively. Parity bits are used to detect errors in the data.
Pros of RAID 2
It checks for error at a bit level for every data word.
One full disk is used to store parity bits which helps in detecting error.
Cons of RAID 2
Large extra space is used for parity bit storage.
RAID 3
RAID 3 implements byte-level striping with a dedicated parity disk. Here Disk 3
contains the parity bits for Disk 0, Disk 1 and Disk 2. If any one disk's data
is lost, it can be reconstructed using the parity bits in Disk 3.
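Parity-based reconstruction works because XOR is its own inverse: the parity is the XOR of the data at the same position on every data disk, so XORing the parity with the surviving disks recovers the lost one. A small sketch (the two-byte blocks and helper names are illustrative):

```python
from functools import reduce
from operator import xor

def parity(blocks):
    """Dedicated-parity RAID (3/4): each parity byte is the XOR of the
    data bytes at the same position on every data disk."""
    return bytes(reduce(xor, col) for col in zip(*blocks))

def reconstruct(surviving_blocks, parity_block):
    """Rebuild a single lost block: XOR the parity with all survivors."""
    return parity(surviving_blocks + [parity_block])

d0, d1, d2 = b"\x12\x34", b"\xab\xcd", b"\x0f\xf0"
p = parity([d0, d1, d2])
assert reconstruct([d0, d2], p) == d1   # Disk 1 lost, recovered via parity
```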
Pros of RAID 3
Data can be recovered with the help of parity bits.
Cons of RAID 3
Extra space for storing parity bits is used.
RAID 4
RAID 4 implements block-level striping with a dedicated parity disk; it is
similar to RAID 3 except that whole blocks, rather than bytes, are striped
across the disks.
Pros of RAID 4
Parity bits help to reconstruct the data if data from at most one disk is lost.
Cons of RAID 4
Extra space for Parity is required.
If data is lost from more than one disk, then parity cannot help us reconstruct
it.
RAID 5
RAID 5 is similar to RAID 4 with only one difference: the parity rotates among
the disks.
Here we can see the rotation of the parity bits from Disk 3 to Disk 1.
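The rotation can be expressed as a one-line mapping from stripe number to parity disk. This assumes four disks and one common layout (left-asymmetric); real implementations offer several rotation schemes.

```python
NUM_DISKS = 4

def parity_disk(stripe):
    """RAID 5 parity rotation: the parity disk shifts by one for each
    successive stripe, so no single disk holds all the parity."""
    return (NUM_DISKS - 1 - stripe) % NUM_DISKS

# stripes 0..3 place parity on disks 3, 2, 1, 0 respectively
print([parity_disk(s) for s in range(4)])
```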
Pros of RAID 5
Parity is distributed over the disk and makes the performance better.
Data can be reconstructed using parity bits.
Cons of RAID 5
Parity bits are useful only when there is data loss in at most one disk. If
there is loss in more than one disk block, then parity is of no use.
Extra space for parity is required.
RAID 6
If there is more than one disk failure, then a RAID 6 implementation helps in
that case. In RAID 6 there are two parities in each array/row; it is similar to
RAID 5 with an extra parity.
Here P0, P1, P2, P3 and Q0, Q1, Q2, Q3 are the two parities used to reconstruct
the data if at most two disks fail.
Pros of RAID 6
More parity helps in reconstructing at most 2 Disk data.
Cons of RAID 6
Extra space is used for both parities (P and Q).
More than two disk failures cannot be corrected.