Normalization:
Normalization is the process of organizing the data in the database.
Normalization is used to minimize the redundancy from a relation or set of
relations. It is also used to eliminate undesirable characteristics like
Insertion, Update, and Deletion Anomalies.
Normalization divides the larger table into smaller and links them using
relationships.
The normal form is used to reduce redundancy from the database table.
Types of Normal Forms:
Normal Form Description
A relation is in 1NF if it contains an
1NF
atomic value.
A relation will be in 2NF if it is in 1NF
and all non-key attributes are fully
2NF
functional dependent on the primary
key.
A relation will be in 3NF if it is in 2NF
3NF
and no transition dependency exists.
A stronger definition of 3NF is known
BCNF
as Boyce Codd's normal form.
A relation will be in 4NF if it is in Boyce
4NF Codd's normal form and has no multi-
valued dependency.
A relation is in 5NF. If it is in 4NF and
5NF does not contain any join dependency,
joining should be lossless.
1. First Normal Form (1NF)
For a table to be in the First Normal Form, it should follow the following 4
rules:
It should only have single(atomic) valued attributes/columns.
Values stored in a column should be of the same domain.
All the columns in a table should have unique names.
And the order in which data is stored should not matter.
Let's see an example.
If we have an Employee table in which we store the employee
information along with the employee skillset, the table will look like
this:
emp_id emp_name emp_mobile emp_skills
1 KKS 9999957773 Python, JavaScript
2 KK 8888853337 HTML, CSS, JavaScript
emp_id emp_name emp_mobile emp_skills
3 SWAMY 7777720008 Java, Linux, C++
The above table has 4 columns:
All the columns have different names.
All the columns hold values of the same type like emp_name has all the
names, emp_mobile has all the contact numbers, etc.
The order in which we save data doesn't matter
But the emp_skills column holds multiple comma-separated values, while as per the
First Normal form, each column should have a single value.
Hence the above table fails to pass the First Normal form.
So how do you fix the above table? There are two ways to do this:
1. Remove the emp_skills column from the Employee table and keep it in some other
table.
2. Or add multiple rows for the employee and each row is linked with one skill.
1. Create Separate tables for Employee and Employee Skills
So the Employee table will look like this,
emp_id emp_name emp_mobile
1 KKS 9999957773
2 KK 8888853337
3 SWAMY 7777720008
And the new Employee_Skill table:
emp_id emp_skill
1 Python
1 JavaScript
2 HTML
2 CSS
2 JavaScript
3 Java
emp_id emp_skill
3 Linux
3 C++
2. Second Normal Form (2NF)
For a table to be in the Second Normal Form,
1. It should be in the First Normal form.
2. And, it should not have Partial Dependency.
Let's take an example to understand Partial dependency and the Second Normal Form.
What is Partial Dependency?
When a table has a primary key that is made up of two or more columns, then all the
columns(not included in the primary key) in that table should depend on the entire primary key
and not on a part of it. If any column(which is not in the primary key) depends on a part of the
primary key then we say we have Partial dependency in the table.
example.:
If we have two tables Students and Subjects, to store student information and information
related to subjects.
Student table:
student_id student_name branch
1 KKS CSE
2 KK Mechanical
Subject Table:
subject_nam
subject_id
e
1 C Language
2 DSA
Operating
3
System
And we have another table Score to store the marks scored by students in any subject like this,
student_id subject_id marks teacher_name
1 1 70 Miss. C
1 2 82 Mr. D
2 1 65 Mr. Op
student_id subject_id marks teacher_name
Now in the above table, the primary key is student_id + subject_id, because both these
information are required to select any row of data.
But in the Score table, we have a column teacher_name, which depends on the subject
information or just the subject_id, so we should not keep that information in the Score table.
The column teacher_name should be in the Subjects table. And then the entire system will be
Normalized as per the Second Normal Form.
Updated Subject table:
subject_i
subject_name teacher_name
d
1 C Language Miss. C
2 DSA Mr. D
3 Operating System Mr. Op
Updated Score table:
student_id subject_id marks
1 1 70
1 2 82
2 1 65
Third Normal Form (3NF):
A relation will be in 3NF if it is in 2NF and not contain any transitive partial dependency.
3NF is used to reduce the data duplication. It is also used to achieve the data integrity.
If there is no transitive dependency for non-prime attributes, then the relation must be
in third normal form.
A relation is in third normal form if it holds atleast one of the following conditions for every
non-trivial function dependency X → Y.
1. X is a super key.
2. Y is a prime attribute, i.e., each element of Y is part of some candidate key .
Example:
EMPLOYEE_DETAIL table:
EMP_ID EMP_NAME EMP_PINCODE EMP_STATE EMP_CITY
1 KKS 201010 AP KURNOOL
2 KK 02228 TS HYD
3 SWAMY 60007 TN KERALA
4 LEO 06389 KA BALARI
Super key in the table above:
1. {EMP_ID}, {EMP_ID, EMP_NAME}, {EMP_ID, EMP_NAME, EMP_PINCODE}..so on
Candidate key: {EMP_ID}
Non-prime attributes: In the given table, all attributes except EMP_ID are non-prime.
Here, EMP_STATE & EMP_CITY dependent on EMP_PINCODE and EMP_PINCODE
dependent on EMP_ID. The non-prime attributes (EMP_STATE, EMP_CITY)
transitively dependent on super key(EMP_ID). It violates the rule of third normal
form.
That's why we need to move the EMP_CITY and EMP_STATE to the new
<EMPLOYEE_PINCODE> table, with EMP_PINCODE as a Primary key.
EMPLOYEE table:
EMP_ID EMP_NAME EMP_PINCODE
1 KKS 201010
2 KK 02228
3 SWAMY 60007
4 LEO 06389
EMPLOYEE_PINCODE table:
EMP_PINCODE EMP_STATE EMP_CITY
201010 AP KURNOOL
02228 TS HYD
60007 TN KERALA
06389 KA BALARI
Boyce Codd normal form (BCNF):
BCNF is the advance version of 3NF. It is stricter than 3NF.
A table is in BCNF if every functional dependency X → Y, X is the super key of the table.
For BCNF, the table should be in 3NF, and for every FD, LHS is super key.
Example: Let's assume there is a company where employees work in more than one
department.
EMPLOYEE table:
EMP_ID EMP_COUNTRY EMP_DEPT DEPT_TYPE EMP_DEPT_NO
1 India Designing D394 283
1 India Testing D394 300
2 UK Stores D283 232
2 UK Developing D283 549
In the above table Functional dependencies are as follows:
EMP_ID → EMP_COUNTRY
EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate key: {EMP-ID, EMP-DEPT}
The table is not in BCNF because neither EMP_DEPT nor EMP_ID alone are keys.
To convert the given table into BCNF, we decompose it into three tables:
EMP_COUNTRY table:
EMP_ID EMP_COUNTRY
1 India
1 India
EMP_DEPT table:
EMP_DEPT DEPT_TYPE EMP_DEPT_NO
Designing D394 283
Testing D394 300
Stores D283 232
Developing D283 549
EMP_DEPT_MAPPING table:
EMP_ID EMP_DEPT
D394 283
D394 300
D283 232
D283 549
Functional dependencies:
EMP_ID → EMP_COUNTRY
EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate keys:
For the first table: EMP_ID
For the second table: EMP_DEPT
For the third table: {EMP_ID, EMP_DEPT}
Now, this is in BCNF because left side part of both the functional dependencies is a key.