Unit 4
Normalization in DBMS
Normalization is a process of organizing the data in a database to avoid data redundancy and insertion, update, and deletion anomalies. It is a database design technique that organizes tables in a manner that reduces redundancy and dependency of data. It divides larger tables into smaller tables and links them using relationships.
Why is Normalization Important?
Reduces Data Redundancy: Eliminates duplicate data, saving disk space and reducing the risk of inconsistency.
Improves Data Integrity: Ensures the accuracy and consistency of data by organizing it in a
structured manner.
Simplifies Database Design: By following a clear structure, database designs become easier to
maintain and update.
Optimizes Performance: Reduces the chance of anomalies and increases the efficiency of
database operations.
What are Normal Forms in DBMS?
Normalization is a technique used in database design to reduce redundancy and improve data integrity by organizing data into tables and ensuring proper relationships. It is also the process of simplifying the design of a database so that it achieves the optimal structure.
Anomalies in DBMS
There are three types of anomalies that occur when the database is not normalized:
1. Insertion Anomaly
2. Update Anomaly
3. Deletion Anomaly
Let us assume we have an Employee table as given below (sample data):
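EmpId  Emp_Name  Emp_Address  Emp_Dept
100    Rock      Delhi        101
100    Rock      Delhi        102
101    Peter     Mumbai       103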
Update anomaly: An update anomaly occurs when updating some records in a table causes data inconsistency.
For example, in the above table we have two records for EmpId 100 because Rock belongs to two departments. If we want to update the address of Rock, we have to update the same value in two rows, or the data will become inconsistent.
Insert anomaly: An insertion anomaly occurs when we are unable to insert data into a table due to some constraint. Suppose a new employee joins the company who is under training and is currently not assigned to any department; we would not be able to insert the data into the table if the Emp_Dept field doesn't allow nulls.
Delete anomaly: A deletion anomaly occurs when deleting some data from a table causes the loss of other useful data.
For example, if at some point the company closes department 103, then deleting the rows having Emp_Dept as 103 would also delete the information of employee Peter, since he is assigned only to this department.
ADVANTAGES OF NORMALIZATION
Here we can see why normalization is an attractive prospect in relational database design.
1) A smaller database can be maintained, as normalization eliminates duplicate data. The overall size of the database is reduced as a result.
2) Better performance is ensured, which can be linked to the above point. As databases become smaller, passes through the data become faster and shorter, thereby improving response time and speed.
3) Narrower tables are possible, as normalized tables will be fine-tuned and will have fewer columns, which allows more data records per page.
4) Fewer indexes per table ensure faster maintenance tasks (index rebuilds).
5) It also allows joining only the tables that are needed.
DISADVANTAGES OF NORMALIZATION
1) More tables to join: by spreading data out into more tables, the need to join tables increases and the task becomes more tedious. The database also becomes harder to comprehend.
2) Tables will contain codes rather than real data, as repeated data is stored as coded values rather than the actual data. Therefore, there is always a need to go to the lookup table.
3) The data model becomes difficult to query against, as it is optimized for applications, not for ad hoc querying. (An ad hoc query is a query that cannot be determined before the issuance of the query. It consists of SQL that is constructed dynamically, usually by desktop-friendly query tools.) Hence it is hard to model the database without knowing what the customer desires.
4) As the normal form progresses, performance becomes slower and slower.
5) Proper knowledge of the various normal forms is required to execute the normalization process efficiently. Careless use may lead to a terrible design filled with major anomalies and data inconsistency.
1. First Normal Form (1NF): Eliminating Duplicate Records
A table is in 1NF if it satisfies the following conditions:
All columns contain atomic values (i.e., indivisible values).
Each row is unique (i.e., no duplicate rows).
Each column has a unique name.
The order in which data is stored does not matter.
Example:
A sample Employee table; it shows employees working with multiple
departments.
Employee Age Department
Melvin 32 Marketing, Sales
Edward 45 Quality Assurance
Alex 36 Human Resource
Employee table following 1NF:
Employee Age Department
Melvin 32 Marketing
Melvin 32 Sales
Edward 45 Quality Assurance
Alex 36 Human Resource
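As a sketch, the 1NF version above could be declared in SQL; the table and column names here are illustrative, with a composite primary key guaranteeing one atomic department value per row:

CREATE TABLE Employee (
    Employee   VARCHAR(50) NOT NULL,
    Age        INT,
    Department VARCHAR(50) NOT NULL,
    -- one row per (employee, department) pair; no repeating groups
    PRIMARY KEY (Employee, Department)
);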
2. Second Normal Form (2NF)
A table is said to be in 2NF if:
1. The table is in 1NF.
2. It has no partial dependency, i.e., no non-prime attribute is dependent on any proper subset of any candidate key of the table.
First, let us understand what prime and non-prime attributes are.
Prime attribute − An attribute that is part of a candidate key is known as a prime attribute.
Non-prime attribute − An attribute that is not part of any candidate key is said to be a non-prime attribute.
For example, we have the following table containing employee data (sample data):
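Emp_Id  Emp_Name  Dept_Id  Dept_Name
E001    Rock      D101     Sales
E001    Rock      D102     Marketing
E002    Peter     D103     Finance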
The above table is in 1NF, as all columns have atomic values. Here (Emp_Id, Dept_Id) is the candidate key, so Emp_Id and Dept_Id are the prime attributes. As per the 2NF rule, Emp_Name and Dept_Name must depend on both prime attributes, but here Emp_Name can be identified by Emp_Id alone and Dept_Name can be identified by Dept_Id alone. So a partial dependency exists. To bring this relation into 2NF we have to break the above table as:
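A minimal sketch of this decomposition in SQL (table and column names follow the example above):

CREATE TABLE Employee (
    Emp_Id   VARCHAR(10) PRIMARY KEY,
    Emp_Name VARCHAR(50)   -- depends on Emp_Id alone
);

CREATE TABLE Department (
    Dept_Id   VARCHAR(10) PRIMARY KEY,
    Dept_Name VARCHAR(50)  -- depends on Dept_Id alone
);

CREATE TABLE Emp_Dept (
    Emp_Id  VARCHAR(10) REFERENCES Employee(Emp_Id),
    Dept_Id VARCHAR(10) REFERENCES Department(Dept_Id),
    PRIMARY KEY (Emp_Id, Dept_Id)  -- no non-prime attribute remains, so no partial dependency
);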
3. Third Normal Form (3NF)
For a relation to be in Third Normal Form it must satisfy the following −
1. It must be in Second Normal Form.
2. No non-prime attribute is transitively dependent on the primary key.
For example, we have the below table for storing employee data (sample data):
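Emp_Id  Emp_Name  ZIP     City
E001    Rock      110001  Delhi
E002    Peter     400001  Mumbai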
In the above relation, Emp_Id is the only candidate key attribute.
Now, City can be identified by Emp_Id as well as by ZIP. ZIP is not a prime attribute, and it is not a superkey either. So the following two dependencies hold here:
Emp_Id -> ZIP (ZIP can be identified by Emp_Id)
ZIP -> City (City can be identified by ZIP)
Therefore, the transitive dependency Emp_Id -> ZIP -> City holds for the above relation.
To convert this relation into 3NF we will break it into 2 relations as:
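Employee (Emp_Id, Emp_Name, ZIP)   with Emp_Id -> Emp_Name, ZIP
ZipCodes (ZIP, City)               with ZIP -> City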
4. Boyce-Codd Normal Form (BCNF): The Strongest Form of 3NF
BCNF is a stricter version of 3NF where for every non-trivial functional dependency (X →
Y), X must be a superkey (a unique identifier for a record in the table).
Example: Consider a table Enrollment(StudentID, CourseID, Instructor) with the dependencies (StudentID, CourseID) → Instructor and Instructor → CourseID. Since Instructor is not a superkey, the dependency Instructor → CourseID violates BCNF. To bring the table into BCNF, decompose it so that each determinant is a candidate key.
Similarly, if a Student_Detail relation is decomposed so that Stu_ID is the superkey of the relation Student_Detail and Zip is the superkey of the relation ZipCodes, then:
Stu_ID → Stu_Name, Zip
Zip → City
Each determinant is a key of its own relation, so both relations are in BCNF.
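Returning to the Enrollment example, a sketch of its BCNF decomposition (table and column names assumed):

-- Enrollment(StudentID, CourseID, Instructor), where Instructor -> CourseID
-- violates BCNF, is decomposed so that each determinant becomes a key:
CREATE TABLE InstructorCourse (
    Instructor VARCHAR(50) PRIMARY KEY,  -- the offending determinant is now a key
    CourseID   VARCHAR(10)
);

CREATE TABLE StudentInstructor (
    StudentID  VARCHAR(10),
    Instructor VARCHAR(50) REFERENCES InstructorCourse(Instructor),
    PRIMARY KEY (StudentID, Instructor)
);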
5. Fourth Normal Form (4NF): Removing Multi-Valued Dependencies (MVD)
A table is in 4NF if it is in BCNF and has no multi-valued dependencies. A multi-valued dependency occurs when one attribute determines a set of values of another attribute, independently of all the other attributes in the table.
Example: Consider a table where (StudentID, Language, Hobby) are attributes. If a student can
have multiple hobbies and languages, a multi-valued dependency exists. To resolve this, split the
table into separate tables for Languages and Hobbies.
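A minimal sketch of that split (assumed names), so that languages and hobbies are no longer paired with each other:

CREATE TABLE StudentLanguage (
    StudentID VARCHAR(10),
    Language  VARCHAR(30),
    PRIMARY KEY (StudentID, Language)
);

CREATE TABLE StudentHobby (
    StudentID VARCHAR(10),
    Hobby     VARCHAR(30),
    PRIMARY KEY (StudentID, Hobby)
);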
6. Fifth Normal Form (5NF): Eliminating Join Dependency
5NF is achieved when a table is in 4NF and all join dependencies are removed. This form ensures
that every table is fully decomposed into smaller tables that are logically connected without losing
information.
Example: If a table contains (StudentID, Course, Instructor) and there is a dependency where all
combinations of these columns are needed for a specific relationship, you would split them into
smaller tables to remove redundancy.
The fifth normal form is also called PJNF (Project-Join Normal Form).
It is the most advanced level of database normalization.
Using fifth normal form you can fix join dependencies and reduce data redundancy.
It also helps in fixing update anomalies in DBMS design.
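For instance, the (StudentID, Course, Instructor) relation above could be decomposed into its three pairwise projections (a sketch with assumed names); joining all three reconstructs the original relation:

CREATE TABLE StudentCourse (
    StudentID VARCHAR(10),
    Course    VARCHAR(30),
    PRIMARY KEY (StudentID, Course)
);

CREATE TABLE CourseInstructor (
    Course     VARCHAR(30),
    Instructor VARCHAR(50),
    PRIMARY KEY (Course, Instructor)
);

CREATE TABLE StudentInstructor (
    StudentID  VARCHAR(10),
    Instructor VARCHAR(50),
    PRIMARY KEY (StudentID, Instructor)
);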
Closure of FDs and MVDs:
Functional Dependencies (FDs):
A functional dependency (A -> B) means that the value of attribute A determines the value of
attribute B. The closure of a set of FDs, denoted F+, includes all FDs that can be derived from the
original set using Armstrong's axioms (reflexivity, augmentation, and transitivity).
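For example, given F = {A -> B, B -> C} on a relation R(A, B, C), F+ additionally contains A -> C (by transitivity), AC -> BC (by augmenting A -> B with C), and trivial dependencies such as AB -> A (by reflexivity). In practice, one usually computes the closure of an attribute set instead: here A+ = {A, B, C}, which shows that A is a key of R.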
Multivalued Dependencies (MVDs):
An MVD (A ->> B) means that the value of attribute A determines the set of values of attribute B.
The closure of a set of MVDs, denoted M+, includes all MVDs that can be derived from the
original set using rules similar to those for FDs.
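For example, in a relation R(Student, Language, Hobby), the MVD Student ->> Language implies Student ->> Hobby by the complementation rule, since Hobby is the set of remaining attributes. Also, every FD A -> B is an MVD A ->> B (the replication rule).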
Lossless Join Decomposition:
A lossless join decomposition of a relation into two or more relations is a decomposition where the
original relation can be reconstructed by performing a natural join on the decomposed relations. This
means that no information is lost during the decomposition process.
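A standard test for a binary decomposition of R into R1 and R2 is that the common attributes R1 ∩ R2 must form a superkey of R1 or of R2. For example, decomposing Employee(Emp_Id, Emp_Name, ZIP, City) into R1(Emp_Id, Emp_Name, ZIP) and R2(ZIP, City) is lossless, because R1 ∩ R2 = {ZIP} and ZIP -> City makes ZIP a key of R2.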
Lossless join decomposition is important in relational database design for normalization, where
decomposing a relation into smaller, well-structured relations can reduce redundancy and improve
query performance.
DBMS TRANSACTION
A transaction can be defined as a group of tasks. A single task is the minimum processing unit
which cannot be divided further.
Let's take an example of a simple transaction. Suppose a bank employee transfers Rs 500 from A's account to B's account. This very simple and small transaction involves several low-level tasks.
A's Account
Open_Account(A)
Old_Balance = A.balance
New_Balance = Old_Balance - 500
A.balance = New_Balance
Close_Account(A)
B's Account
Open_Account(B)
Old_Balance = B.balance
New_Balance = Old_Balance + 500
B.balance = New_Balance
Close_Account(B)
Transactions are a set of operations used to perform a logical set of work. A transaction usually
means that the data in the database has changed. One of the major uses of DBMS is to protect the
user data from system failures. It is done by ensuring that all the data is restored to a consistent state
when the computer is restarted after a crash. The transaction is any one execution of the user
program in a DBMS. One of the important properties of the transaction is that it contains a finite
number of steps. Executing the same program multiple times will generate multiple transactions.
Steps for ATM Transaction
1. Transaction Start.
2. Insert your ATM card.
3. Select a language for your transaction.
4. Select the Savings Account option.
5. Enter the amount you want to withdraw.
6. Enter your secret pin.
7. Wait for some time for processing.
8. Collect your Cash.
9. Transaction Completed.
A transaction can include the following basic database access operations:
Read/Access data (R): Accessing a database item from disk (where the database stores data) into a memory variable.
Write/Change data (W): Writing the data item from the memory variable back to disk.
Commit: Commit is a transaction control command that is used to permanently save the changes made in a transaction.
Example: Transfer of 50₹ from Account A to Account B. Initially A= 500₹, B= 800₹. This data is
brought to RAM from Hard Disk.
R(A) -- 500 // Accessed from RAM.
A = A-50 // Deducting 50₹ from A.
W(A)--450 // Updated in RAM.
R(B) -- 800 // Accessed from RAM.
B=B+50 // 50₹ is added to B's Account.
W(B) --850 // Updated in RAM.
commit // The data in RAM is taken back to Hard Disk.
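The same transfer can be sketched as a SQL transaction (the accounts table and column names here are assumed for illustration):

BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 50 WHERE account_id = 'A';  -- debit A
UPDATE accounts SET balance = balance + 50 WHERE account_id = 'B';  -- credit B
COMMIT;  -- both updates are made permanent together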
Transaction management in a Database Management System (DBMS) ensures that database
transactions are executed reliably and follow ACID properties: Atomicity, Consistency, Isolation,
and Durability.
Transaction States
Transactions can be implemented using SQL queries and servers. During its lifetime, a transaction passes through several states: active, partially committed, committed, failed, and aborted.
Atomicity
States that either all operations of the transaction take place or none do; otherwise, the transaction is aborted.
There is no midway, i.e., the transaction cannot occur partially. Each transaction is treated as one unit and either runs to completion or is not executed at all.
Atomicity involves the following two operations:
Abort: If a transaction stops or fails, none of the changes it made will be saved or visible.
Commit: If a transaction completes successfully, all the changes it made will be saved and
visible.
Consistency
The rules (integrity constraints) that keep the database accurate and consistent are followed before and after a transaction.
When a transaction is completed, it leaves the database either as it was before or in a new stable state.
This property means every transaction works with a reliable and consistent version of the database.
A transaction transforms the database from one consistent state to another consistent state.
Isolation
It ensures that the data being used during the execution of a transaction cannot be used by a second transaction until the first one is completed.
In isolation, if transaction T1 is being executed and is using the data item X, then that data item cannot be accessed by any other transaction T2 until transaction T1 ends.
The concurrency control subsystem of the DBMS enforces the isolation property.
Durability
The durability property guarantees the permanence of the database's consistent state. It states that the changes made by a committed transaction are permanent.
They cannot be lost by the erroneous operation of a faulty transaction or by a system failure.
When a transaction is completed, the database reaches a state known as the consistent state. That consistent state cannot be lost, even in the event of a system failure.
The recovery subsystem of the DBMS is responsible for the durability property.
Concurrent execution
Concurrent execution in a Database Management System (DBMS) refers to the ability of multiple transactions to access and modify shared data simultaneously. While this improves throughput and resource utilization, it can also lead to data inconsistencies if not managed properly. Concurrency control mechanisms, such as locking and timestamping, are used to ensure that concurrent transactions do not interfere with each other, thereby maintaining data integrity.
In a DBMS, concurrent execution can introduce a number of issues that need to be resolved in order to guarantee accurate and dependable database operation. Some of the issues with concurrent execution in a DBMS include the following.
Lost Update
When two or more transactions try to update the same data item at the same time, a lost update can happen; the outcome depends on the sequence in which the transactions are executed. The modifications made by one transaction will be lost if another transaction overwrites them before they are committed. Lost updates can lead to inconsistent data and inaccurate results.
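For example, consider the following interleaving, where T2 overwrites T1's uncommitted update:
T1          T2
R(A)
            R(A)
W(A)
            W(A)   <- T1's update to A is lost
Both transactions read the original value of A, so T2's write silently discards T1's change.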
Dirty Read
A dirty read occurs when a transaction accesses data that has been updated by another transaction but not yet committed. If the modifying transaction rolls back, the information read by the first transaction becomes invalid. Dirty reads can lead to data discrepancies and inaccurate outcomes.
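For example:
T1          T2
W(A)
            R(A)   <- T2 reads T1's uncommitted value
rollback
T1 rolls back, so the value T2 read never existed in any committed state of the database.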
Schedule
A schedule, as the name suggests, is a process of lining up transactions and executing them one by one. When multiple transactions are running concurrently and the order of operations needs to be set so that the operations do not overlap each other, scheduling is brought into play and the transactions are timed accordingly.
Serial Schedules: Schedules in which the transactions are executed non-interleaved, i.e., a serial schedule is one in which no transaction starts until a running transaction has ended, are called serial schedules. Example: Consider the following schedule involving two transactions T1 and T2.
T1          T2
R(A)
W(A)
R(B)
W(B)
            R(A)
            R(B)
where R(A) denotes that a read operation is performed on data item 'A'. This is a serial schedule since the transactions are performed serially in the order T1 -> T2.
Non-Serial Schedule: This is a type of scheduling where the operations of multiple transactions are interleaved. This may give rise to concurrency problems. The transactions are executed in a non-serial manner, keeping the end result correct and the same as in a serial schedule. Unlike a serial schedule, where one transaction must wait for another to complete all its operations, in a non-serial schedule one transaction proceeds without waiting for the previous transaction to complete. A non-serial schedule can be divided further into Serializable and Non-Serializable.
Serializable: This is used to maintain the consistency of the database. It is mainly used in non-serial scheduling to verify whether the schedule will lead to any inconsistency or not. On the other hand, a serial schedule does not need a serializability check, because it starts a transaction only when the previous transaction is complete.
A serializable schedule helps in improving both resource utilization and CPU throughput. These are
of two types:
1. Conflict Serializable: A schedule is called conflict serializable if it can be transformed into a serial schedule by swapping non-conflicting operations. Two operations are said to be conflicting if all of the following conditions hold (see the example after this list):
They belong to different transactions
They operate on the same data item
At least one of them is a write operation
2. View Serializable: A schedule is called view serializable if it is view-equivalent to a serial schedule (with no overlapping transactions). Every conflict-serializable schedule is view serializable, but a view-serializable schedule that contains blind writes may not be conflict serializable.
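For example, consider the following interleaved schedule:
T1          T2
R(A)
W(A)
            R(A)
            W(A)
R(B)
W(B)
            R(B)
            W(B)
Every pair of conflicting operations (those on A and those on B) has T1's operation first, so swapping the non-conflicting operations transforms this into the serial schedule T1 -> T2. The schedule is therefore conflict serializable.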
Non-Serializable: The non-serializable schedule is divided into two types, Recoverable and Non-
recoverable Schedule.
1. Recoverable Schedule: Schedules in which transactions commit only after all transactions
whose changes they read commit are called recoverable schedules.
T1          T2
R(A)
W(A)
            W(A)
            R(A)
commit
            commit
This is a recoverable schedule since T1 commits before T2, which makes the value read by T2 correct. There can be three types of recoverable schedule:
Cascading Schedule: When there is a failure in one transaction that leads to the rolling back or aborting of other dependent transactions, such scheduling is referred to as a cascading rollback or cascading abort. Example:
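T1          T2
R(A)
W(A)
            R(A)   <- T2 reads T1's uncommitted write
            W(A)
failure / rollback
If T1 now fails and rolls back, T2 must also be rolled back, because it has read a value written by the uncommitted T1.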
Cascadeless Schedule: Also called avoids cascading aborts/rollbacks (ACA). Schedules in which transactions read values only after all the transactions whose changes they are going to read have committed are called cascadeless schedules.