Unit-1

Relational Database Management Systems

Data: The raw facts and figures input by the user are called data. Raw means the data has
not been processed yet. It has no significance beyond its existence and carries no meaning
by itself.

Information: Information is processed data. It is the result of processing, and it is
meaningful.

Data → Data Processor → Information

What is the difference between Data and Information?


Data                                      Information
Raw data                                  Processed data
Does not help in decision making          Helps in decision making
Requires no prerequisite                  Requires data as a prerequisite
Unorganized                               Organized
Database
 Data is a collection of distinct small units of information. It can appear in a variety of
forms like text, numbers, media, bytes, etc., and it can be stored on paper, in electronic
memory, etc.

 The word 'data' originates from the word 'datum', which means 'a single piece of
information.' Data is the plural of datum.

 In computing, data is information that has been translated into a form suitable for
efficient movement and processing; in this sense, the terms data and information are
often used interchangeably.

Database Management Systems


 Database Management Systems (DBMS) are software systems used to store, retrieve,
and run queries on data. A DBMS serves as an interface between an end-user and a
database, allowing users to create, read, update, and delete data in the database.
A DBMS manages the data, the database engine, and the database schema, allowing
data to be manipulated or extracted by users and other programs. This helps provide
data security, data integrity, concurrency, and uniform data administration procedures.
Relational Database Management Systems
 A relational database is a digital database based on the relational model of data, as
proposed by E. F. Codd in 1970. A relational DBMS owes its foundation to the fact that the
values of each table are related to others. It has the capability to handle larger
magnitudes of data and to process queries easily.
 It is basically a program that allows us to create, delete, and update a relational
database. A relational database is a database system that stores and retrieves data in a
tabular format organized in the form of rows and columns.

An RDBMS maintains data integrity by supporting the following features (a code sketch follows the list):


 Entity Integrity: No two records of a database table can be complete duplicates.
 Referential Integrity: Rows of a table can be deleted only if they are not referenced
by other tables. Otherwise, deletion may lead to data inconsistency.
 User-defined Integrity: Rules defined by the users, based on confidentiality and
access.
 Domain Integrity: The columns of database tables are enclosed within structured
limits, based on default values, data types, or ranges.
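A minimal sketch of how these features appear as SQL constraints, using Python's built-in sqlite3 module (the department/student tables are hypothetical examples, not from the text):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces referential integrity only with this pragma

# Entity integrity: PRIMARY KEY forbids duplicate (and NULL) identifiers.
# Domain integrity: CHECK keeps column values inside structured limits.
conn.execute("""CREATE TABLE department (
                    dept_id INTEGER PRIMARY KEY,
                    name    TEXT NOT NULL)""")
conn.execute("""CREATE TABLE student (
                    roll_no INTEGER PRIMARY KEY,
                    age     INTEGER CHECK (age >= 0),
                    dept_id INTEGER REFERENCES department(dept_id))""")  # referential integrity

conn.execute("INSERT INTO department VALUES (1, 'CS')")
conn.execute("INSERT INTO student VALUES (10, 21, 1)")

try:
    # Violates referential integrity: department 99 does not exist.
    conn.execute("INSERT INTO student VALUES (11, 20, 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)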

Characteristics
 Data must be stored in tabular form in DB file, that is, it should be organized in the form
of rows and columns.
 Each row of a table is called a record/tuple. The collection of such records is known as
the cardinality of the table.
 Each column of the table is called an attribute/field. The collection of such columns is
called the arity of the table.
 No two records of a table can be the same. Data duplication is therefore avoided by using
a candidate key: a minimal set of attributes required to identify each record uniquely.
 Tables are related to each other with the help of foreign keys.
 Database tables also allow NULL values: if the value of any element of the
table is not filled in or is missing, it becomes a NULL value, which is not equivalent to
zero. (NOTE: a primary key can't have a NULL value.)
Relational Database Pros
 Easy to manage: Each table can be independently manipulated without affecting
others.
 Security: It is more secure, consisting of multiple levels of security; access to shared
data can be limited.
 Flexible: Relational databases can be very flexible in how they allow users to change or
store data. This can be very helpful when updating data or adding more data to a stored
set.
 Accuracy: Because relational databases are usually designed to be easy to navigate, it
can also be easy to search for potential errors or inconsistencies within a database and
correct them right away.
 Usability: Relational databases are often very easy to use, which can be especially
helpful for users who want a simple data storage solution. It's also possible to navigate
and share data sets easily within a relational database.
 Users: RDBMS supports a client-server architecture, allowing multiple users to work
with the data concurrently.
 Facilitates storage and retrieval of large amounts of data.
 Easy Data Handling:
o Data fetching is faster because of relational architecture.
o Data redundancy or duplicity is avoided due to keys, indexes, and normalization
principles.
o Data consistency is ensured because an RDBMS is based on the ACID properties
(Atomicity, Consistency, Isolation, Durability) for data transactions.
o Fault Tolerance: Replication of databases provides simultaneous access and helps
the system recover in case of disasters, such as power failures or sudden
shutdowns.

Relational Database Cons


 High Cost and Extensive Hardware and Software Support: Huge costs and
setups are required to make these systems functional.
 Scalability: As more data is added, additional servers along with additional power and
memory are required.
 Complexity: Voluminous data creates complexity in understanding the relations and
may degrade performance.
 Structured Limits: The fields or columns of a relational database system are enclosed
within fixed limits, which may lead to loss of data.

Difference between RDBMS and DBMS


 Database Management System (DBMS) is software that is used to define, create and
maintain a database and provides controlled access to the data.
 Relational Database Management System (RDBMS) is an advanced version of a DBMS.

DBMS                                                RDBMS
Stores data as files.                               Stores data in tabular form.
Data elements must be accessed individually.        Multiple data elements can be accessed at
                                                    the same time.
No relationship between data.                       Data is stored in the form of tables which
                                                    are related to each other.
Normalization is not present.                       Normalization is present.
Does not support distributed databases.             Supports distributed databases.
Stores data in either a navigational or             Uses a tabular structure where the headers
hierarchical form.                                  are the column names and the rows contain
                                                    the corresponding values.
Deals with small quantities of data.                Deals with large amounts of data.
Data redundancy is common in this model.            Keys and indexes prevent data redundancy.
Used by small organizations dealing with            Used to handle large amounts of data.
small amounts of data.
Supports a single user.                             Supports multiple users.
Data fetching is slower for large amounts           Data fetching is fast because of the
of data.                                            relational approach.
Data is subject to low security levels              There are multiple levels of data security
with regard to data manipulation.                   in an RDBMS.
Low software and hardware requirements.             Higher software and hardware requirements.
Examples: XML, Windows Registry, etc.               Examples: MySQL, PostgreSQL, SQL Server,
                                                    Oracle, Microsoft Access, etc.

Data Abstraction
Data Abstraction is a process of hiding unwanted or irrelevant details from the end user. It
provides a different view and helps in achieving data independence which is used to enhance
the security of data.
There are mainly 3 levels of data abstraction:

 View Level or External Schema: This is the highest level of abstraction. Only a part of
the actual database is viewed by the users. This level exists to ease the accessibility of the
database by an individual user. Users view data in the form of rows and columns. Tables and
relations are used to store data. Multiple views of the same database may exist. Users can
just view the data and interact with the database, storage and implementation details are
hidden from them.
 Example: If we have a login-id and password in a university system, then as a student,
we can view our marks, attendance, fee structure, etc. But the faculty of the university
will have a different view. He will have options like salary, edit marks of a student,
enter attendance of the students, etc. So, both the student and the faculty have a
different view. By doing so, the security of the system also increases. In this example,
the student can't edit his marks but the faculty who is authorized to edit the marks can
edit the student's marks. Similarly, the dean of the college or university will have some
more authorization and accordingly, he will have his view. So, different users will have
a different view according to the authorization they have.

 Physical Level or Internal Schema: This is the lowest level of data abstraction. It tells
us how the data is actually stored in memory. Access methods like sequential or random
access and file organization methods like B+ trees and hashing are used here. Usability, the
size of memory, and the number of times the records are accessed are factors that we need
to consider while designing the database.
 Example: The Database Administrator (DBA) decides which data should be kept on
which particular disk drive, how the data has to be fragmented, where it has to be
stored, etc. They decide if the data has to be centralized or distributed. Though we see
the data in the form of tables at view level the data here is actually stored in the form
of files only. It totally depends on the DBA, how he/she manages the database at the
physical level. Suppose we need to store the details of an employee. Blocks of storage
and the amount of memory used for these purposes are kept hidden from the user.
 Logical Level or Conceptual Level: This level describes what data is stored in the
database and how the data items are structured and related. We have different data models
by which we can store the data, and the relationships among the data entities are stored in
relatively simple structures. At this level, the individual views seen by users at the view
level are not known.
 Example: Let us take an example where we use the relational model for storing the
data. We have to store the data of a student; the columns in the student table will be
student_name, age, mail_id, roll_no etc. We have to define all these at this level while
we are creating the database. Though the data is stored in the database but the
structure of the tables like the student table, teacher table, books table, etc are defined
here in the conceptual level or logical level. Also, how the tables are related to each
other is defined here. Overall, we can say that we are creating a blueprint of the data
at the conceptual level.

Data Definition Language (DDL)


o DDL stands for Data Definition Language. It is used to define database structure or
pattern.
o It is used to create schema, tables, indexes, constraints, etc. in the database.
o Using the DDL statements, you can create the skeleton of the database.
o Data definition language is used to store the information of metadata like the number of
tables and schemas, their names, indexes, columns in each table, constraints, etc.

Here are some tasks that come under DDL:

o Create: It is used to create objects in the database.
o Alter: It is used to alter the structure of the database.
o Drop: It is used to delete objects from the database.
o Truncate: It is used to remove all records from a table.
o Rename: It is used to rename an object.
o Comment: It is used to add comments to the data dictionary.
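A minimal runnable sketch of these DDL statements via Python's sqlite3 (course/subject are hypothetical names; SQLite supports only a subset of ALTER and has no TRUNCATE or COMMENT statement, so those two are omitted):

import sqlite3

conn = sqlite3.connect(":memory:")

# CREATE: define a new table (a schema object).
conn.execute("CREATE TABLE course (course_id INTEGER PRIMARY KEY, title TEXT)")

# ALTER: change the structure of an existing object.
conn.execute("ALTER TABLE course ADD COLUMN credits INTEGER")

# RENAME: rename an object (in SQLite, via ALTER TABLE ... RENAME TO).
conn.execute("ALTER TABLE course RENAME TO subject")

# DROP: delete the object, definition and data, from the database.
conn.execute("DROP TABLE subject")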
Data Manipulation Language (DML)
DML stands for Data Manipulation Language. It is used for accessing and manipulating data in a database. It
handles user requests.

Here are some tasks that come under DML:

o Select: It is used to retrieve data from a database.
o Insert: It is used to insert data into a table.
o Update: It is used to update existing data within a table.
o Delete: It is used to delete records from a table.
o Merge: It performs an UPSERT operation, i.e., an insert or an update operation.
o Call: It is used to call a PL/SQL or Java subprogram.
o Explain Plan: It describes the access path the optimizer will use to evaluate a query.
o Lock Table: It controls concurrency by locking a table.
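A matching sketch of the core DML statements on a hypothetical student table, again via sqlite3:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# INSERT: add rows to the table.
conn.execute("INSERT INTO student VALUES (1, 'Riya', 20)")
conn.execute("INSERT INTO student VALUES (2, 'Aman', 22)")

# UPDATE: modify existing rows.
conn.execute("UPDATE student SET age = 21 WHERE roll_no = 1")

# SELECT: retrieve rows.
for row in conn.execute("SELECT roll_no, name, age FROM student"):
    print(row)

# DELETE: remove rows (here one selected row, not the whole table).
conn.execute("DELETE FROM student WHERE roll_no = 2")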

Database Administrator
 The DBA can be a single person or a group of persons.
 The DBA has central control over both the data & the application programs.
 The DBA is responsible for everything that is related to the database: making policies &
strategies & providing technical support.
 Role / Responsibilities of DBA
1. Schema definition & modification –
o Database schema is the overall structure of the database. It is the responsibility
of DBA to create schema.
o Database schema is created by writing a set of data definition statements in
DDL.
o Modification of schema can also be performed by DBA.
2. Granting of authorization for data access –
o The DBA can grant authorization for data access to different users so that the
database will not be accessible to unauthorized users.
3. Routine maintenance checks –
o The DBA is responsible for taking the database backup periodically in order to be
able to recover data from any failures due to human, hardware or software
malfunction & to restore the database to a consistent state.
4. New Software Installation –
o It is the responsibility of the DBA to install new DBMS software, application
software & other related software. The DBA also tests the new software after
installation.
5. Monitoring Performance –
o It is the responsibility of the DBA to monitor the performance of the database. The
DBA monitors the CPU & memory usage of the computer.
6. Security enforcement & administration –
o The DBA is responsible for establishing & monitoring the security of the database
system.
o The DBA checks & audits security problems. The DBA can also add & remove users if required.
7. Deciding the storage structure & access strategy –
o The DBA must also decide how the data is to be represented in the database &
must specify the representation by writing the storage structure definition.
8. Database availability –
o The DBA ensures that database information is always available to all users in a form
that suits their needs.

DBMS Database Models


 A Database model defines the logical design and structure of a database and defines
how data will be stored, accessed and updated in a database management system.
 While the Relational Model is the most widely used database model, there are other
models too:
 Hierarchical Model
 Network Model
 Entity-relationship Model
 Relational Model
 Hierarchical Model
 This database model organizes data into a tree-like structure, with a single root, to
which all the other data is linked.
 The hierarchy starts from the root data and expands like a tree, adding child nodes to
the parent nodes.
 In this model, a child node will only have a single parent node.
 This model efficiently describes many real-world relationships like the index of a book,
recipes, etc.
 In the hierarchical model, data is organized into a tree-like structure with one-to-many
relationships between two different types of data.
 For example, one department can have many courses, many professors and, of course,
many students, as in the sketch below.
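A toy sketch of this single-parent, one-to-many idea in ordinary Python structures (all names hypothetical):

# One root (the department) with one-to-many links to its children;
# each child has exactly one parent, as required by the hierarchical model.
department = {
    "name": "CSE",
    "courses":    ["RDBMS", "Operating Systems"],
    "professors": ["Dr. Rao"],
    "students":   ["Riya", "Aman"],
}

# Reaching any record means walking down from the single root.
for course in department["courses"]:
    print(department["name"], "->", course)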

 Network Model
 This is an extension of the hierarchical model. In this model data is organized more like
a graph, and records are allowed to have more than one parent node.
 In this database model data is more interrelated, as more relationships are established.
Also, because the data is more related, accessing the data is easier and faster.
 This database model was used to map many-to-many data relationships.
 This was the most widely used database model, before Relational Model was
introduced.

 Entity-relationship Model
 E-R Models are defined to represent the relationships into pictorial form to make it
easier to understand.
 This model is good to design a database, which can then be turned into tables in
relational model.
 Let's take an example, If we have to design a School Database, then Student will be an
entity with attributes name, age, address etc. As Address is generally complex, it can be
another entity with attributes street name, pincode, city etc, and there will be a
relationship between them.
 Relational Model
 In this model, data is organized in two-dimensional tables and the relationship is
maintained by storing a common field.
 This model was introduced by E. F. Codd in 1970, and since then it has been the most
widely used database model; in fact, we can say it is the dominant database model used
around the world.
 The basic structure of data in the relational model is the table. All the information
related to a particular type is stored in rows of that table.
 Hence, tables are also known as relations in the relational model.

 Table
 A table is a collection of data elements organized in terms of rows and columns.
 A table is also considered a convenient representation of relations. But a table
can have duplicate rows of data, while a true relation cannot have duplicate rows.
 Table is the simplest form of data storage. Below is an example of an Employee
table.

 Tuple
 A single row of a table, which contains a single record for that relation, is called a
tuple.

 Null Value
 NULL is the term used to represent a missing value.
 A NULL value in a table is a value in a field that appears to be blank.

 Attribute Domain
 Every attribute has some pre-defined value scope, known as attribute domain.
 For example,
 Age cannot be less than zero, and
 Telephone numbers cannot contain a digit outside 0-9.

Database System Architecture


• The DBMS design depends upon its architecture. The basic client/server architecture is
used to deal with a large number of PCs, web servers, database servers and other
components that are connected with networks.
• The client/server architecture consists of many PCs and a workstation which are
connected via the network.

• DBMS architecture depends upon how users are connected to the database to get their
request done.

TYPES OF DBMS ARCHITECTURE

 1-TIER ARCHITECTURE
o In this architecture, the database is directly available to the user: the user can sit
directly at the DBMS and use it.
o Any changes done here are made directly on the database itself. It doesn't provide
a handy tool for end users.
o The 1-Tier architecture is used for development of local applications, where
programmers can communicate directly with the database for a quick response.
 2-TIER ARCHITECTURE
o The 2-Tier architecture is the same as a basic client-server setup. In the two-tier
architecture, applications on the client end can directly communicate with the
database at the server side. For this interaction, APIs such as ODBC and JDBC are used.
o The user interfaces and application programs are run on the client side.
o The server side is responsible for providing functionality like query processing and
transaction management.
o To communicate with the DBMS, the client-side application establishes a connection
with the server side, as in the sketch below.
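A hedged sketch of the 2-tier pattern, assuming a PostgreSQL server and the third-party psycopg2 driver (both assumptions, not from the text; any DB-API driver plays the role ODBC/JDBC play above). The client program connects straight to the database server, which does the query processing:

import psycopg2  # hypothetical setup: server at db.example.com, database "school"

conn = psycopg2.connect(host="db.example.com", dbname="school",
                        user="app", password="secret")
cur = conn.cursor()

# The query is shipped to the server; query processing and
# transaction management happen on the server side.
cur.execute("SELECT name FROM student WHERE age > %s", (18,))
for (name,) in cur.fetchall():
    print(name)

conn.commit()
conn.close()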
 3-TIER ARCHITECTURE
o The 3-Tier architecture contains another layer between the client and server. In this
architecture, client can't directly communicate with the server.
o The application on the client-end interacts with an application server which further
communicates with the database system.
o End user has no idea about the existence of the database beyond the application
server. The database also has no idea about any other user beyond the application.
o The 3-Tier architecture is used for large web applications.

Entity Relationship Model


 ER model stands for an Entity-Relationship model.
 ER model or diagram is the graphical representation of an organization’s data storage
requirements.
 This model is used to define the data elements and relationship for a specified system.
 It develops a conceptual (logical) design for the database.
 It also provides a very simple and easy-to-understand view of data, like a flow chart.
 Example:
o Suppose we design a school database. In this database, the student will be an entity
with attributes like address, name, id, age, etc. The address can be another entity
with attributes like city, street name, pin code, etc and there will be a relationship
between them.

 Entity
o An Entity may be a real world object with a physical (Tangible) or conceptual
existence (Intangible).
o Tangible Entity: Tangible Entities are those entities which exist in the real world
physically. Example: Person, car, etc.
o Intangible Entity: Intangible Entities are those entities which exist only logically and
have no physical existence. Example: Bank Account, etc.
o In an ER diagram, an entity is represented as a rectangle.
 Component of ER Diagram

 Example of Entity in DBMS


 An entity has attributes, which can be considered as properties describing it.
 For example, for Professor Entity, the attributes are Professor Name, Professor
Address, Professor Salary, etc.
 The attribute value gets stored in the database.

 Here, Professor Name, Professor Address and Professor Salary are attributes.
ProfessorID is the primary key
 Strong Entity
 The strong entity has a primary key. Weak entities are dependent on strong entity. Its
existence is not dependent on any other entity.
 Strong Entity is represented by a single rectangle: Strong Entity
 Continuing our previous example, Professor is a strong entity here, and the primary key
is Professor_ID.
 Weak Entity
 An entity that depends on another entity is called a weak entity.
 The weak entity doesn't contain any key attribute of its own.
 A weak entity in a DBMS does not have a primary key and is dependent on its parent
entity.
 A weak entity is represented by a double rectangle.

 Continuing our previous example, Professor is a strong entity, and the primary key is
Professor_ID. However, another entity is Professor_Dependents, which is our Weak
Entity.
 A professor's dependents can be his/her father, mother, siblings, child, husband, wife,
or others.
 Example of Strong and Weak Entity
 The Strong Entity is Professor, whereas Dependent is a Weak Entity.
 ID is the primary key (represented with a solid underline) and Name in the Dependent
entity is called a Partial Key (represented with a dotted underline). A relational sketch of
this pair follows.
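A hedged relational sketch of this pair (via sqlite3; the column names are illustrative): the weak entity Dependent has no key of its own, so its primary key combines the owning professor's key with the partial key Name:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE professor (professor_id INTEGER PRIMARY KEY, name TEXT)")

# The weak entity borrows the strong entity's key: (professor_id, name)
# together identify a dependent; name alone is only a partial key.
conn.execute("""CREATE TABLE dependent (
                    professor_id INTEGER REFERENCES professor(professor_id),
                    name         TEXT,          -- partial key
                    relation     TEXT,
                    PRIMARY KEY (professor_id, name))""")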

 Entity Type
 The entity type is a collection of the entity having similar attributes.
 So, in simple words we can say that an entity type in an ER diagram is defined by a
name of the entity (here, STUDENT) and a set of attributes of that entity (here, Roll_no,
Student_name, Age, Mobile_no).

 Entity Set
 A set of all entities of a given type is called an entity set.
 E.g. E1 is an entity having entity type Student, and the set of all students is called an
entity set. In an ER diagram, an entity type is represented as:
 Entity, Entity Type & Entity Set

 Attribute
 An attribute is used to describe a property of an entity.
 Or in simple words, we can say that attributes are the properties or features of an
entity.
 An ellipse is used to represent an attribute.
 E.g.: roll_no, name, birthdate, etc. can be attributes of a student.

 Types of Attributes
1. Key Attributes
2. Composite Attribute
3. Multivalued Attributes
4. Derived Attributes
5. Some More Attributes
1. Key Attribute
 The attribute which uniquely identifies each entity in the entity set is called key
attribute.
 It represents a primary key.
 The key attribute is represented by an ellipse with the text underlined.
 For example, Roll_No will be unique for each student


2. Composite Attribute
 An attribute that is composed of many other attributes is known as a composite attribute.
 The composite attribute is represented by an ellipse, and its component attributes are
drawn as ellipses connected to it.
 For example, the Address attribute of the Student entity type consists of Street, City,
State, and Country.

3. Multivalued Attribute
 An attribute which can have more than one value, is known as a multivalued attribute.
 The double oval is used to represent multivalued attribute.
 For example, a student can have more than one phone number.

4. Derived Attribute
 An attribute which can be derived from other attribute is known as a derived attribute.
 It can be represented by a dashed ellipse.
 For example, a person's age changes over time and can be derived from another
attribute like Date of birth.

5. Some More Attributes

 Simple attribute − Simple attributes have atomic values, which cannot be divided further.
 For example, a student's phone number is an atomic value of 10 digits.
 Single-valued attribute − Single-valued attributes contain a single value.
 For example: Social_Security_Number.
 The complete entity type Student combines all of these attribute kinds (including the
multivalued phone number) in one diagram.

MAPPING CONSTRAINTS
o A mapping constraint is a data constraint that expresses the number of entities to which
another entity can be related via a relationship set.
o It is most useful in describing the relationship sets that involve more than two entity
sets.
o For binary relationship set R on an entity set A and B, there are four possible mapping
cardinalities. These are as follows:
1. One to one (1:1)
2. One to many (1:M)
3. Many to one (M:1)
4. Many to many (M:M)

1. One-to-one
In one-to-one mapping, an entity in E1 is associated with at most one entity in E2, and an
entity in E2 is associated with at most one entity in E1.
2. One-to-many
In one-to-many mapping, an entity in E1 is associated with any number of entities in E2,
and an entity in E2 is associated with at most one entity in E1.

3. Many-to-one
In many-to-one mapping, an entity in E1 is associated with at most one entity in E2, and an
entity in E2 is associated with any number of entities in E1.

4. Many-to-many
In many-to-many mapping, an entity in E1 is associated with any number of entities in E2,
and an entity in E2 is associated with any number of entities in E1.

GENERALIZATION
• Generalization is like a bottom-up approach in which two or more entities of lower level
combine to form a higher level entity if they have some attributes in common.
• In generalization, an entity of a higher level can also combine with the entities of the
lower level to form a further higher level entity.
• Generalization is more like subclass and superclass system, but the only difference is the
approach. Generalization uses the bottom-up approach.
• In generalization, entities are combined to form a more generalized entity, i.e.,
subclasses are combined to make a superclass.
• For example, Faculty and Student entities can be generalized and create a higher level
entity Person.

AGGREGATION
In aggregation, the relation between two entities is treated as a single entity. In aggregation,
relationship with its corresponding entities is aggregated into a higher level entity.

For example: the Center entity, the Course entity, and the "offers" relationship between
them act as a single entity, which is in a relationship with another entity, Visitor. In the
real world, if a visitor visits a coaching center, he will never enquire about the Course
alone or about the Center alone; he will enquire about both.

USE OF ER MODEL FOR THE DESIGN OF DATABASES


o Designing Database Applications
 Database applications are large and complex
 Some of the design areas:
 Database schema (physical/logical/view)
 Programs that access and update data
 Security constraints for data access
 Requires familiarity with the problem domain
 Domain experts must help drive requirements
o Overview of the Design Process

 Requirements definition phase


 Collect user requirements
 Information that needs to be represented
 Operations to perform on that information
 There are several techniques for representing this information (the requirements
document), e.g. textual descriptions, UML, etc.
 Conceptual-design phase
 The conceptual schema describes the information structure of an application,
which is not subject to change as the application evolves.
 A high-level representation of the database’s structure and constraints
 Physical and logical design issues are ignored at this stage
 Often represented graphically
 Includes specification of functional requirements
 What operations will be performed against the data?
 Can be used to verify the conceptual schema
 Are all operations possible?
 How complicated is it?
 Logical-design phase
 Convert the conceptual schema (e.g. an E-R schema) into an implementation data
model
 Physical-design phase
 Implement the database model, including possible additional design and tuning
decisions (e.g. indexes, disk-level partitioning of data)

SEQUENTIAL FILE ORGANIZATION


The easiest method of file organization is the sequential method. In this method the records
are stored one after another in a sequential manner. There are two ways to implement this
method:
 Pile File Method – This method is quite simple: we store the records in a sequence,
i.e. one after another, in the order in which they are inserted into the tables.
1. Insertion of a new record –
Let R1, R3, R5 and R4 be four records in the sequence. Here, records are nothing but rows
of a table. Suppose a new record R2 has to be inserted in the sequence; then it is simply
placed at the end of the file.

 Sorted File Method – In this method, as the name itself suggests, whenever a new record
has to be inserted it is inserted in a sorted (ascending or descending) position. Sorting of
records may be based on the primary key or on any other key.

1. Insertion of a new record –

Let us assume that there is a preexisting sorted sequence of four records R1, R3, R7 and
R8. Suppose a new record R2 has to be inserted in the sequence; then it is appended at
the end of the file and the sequence is re-sorted, as in the sketch below.
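A minimal sketch of both insertion strategies, with records reduced to integer keys (R1, R2, ... in the text become keys 1, 2, ...):

import bisect

def pile_insert(records, new):
    # Pile file: append the new record at the end, in arrival order.
    records.append(new)

def sorted_insert(records, new):
    # Sorted file: place the new record so the sequence stays sorted
    # (equivalent to append-then-sort, but cheaper).
    bisect.insort(records, new)

pile = [1, 3, 5, 4]            # records in insertion order
pile_insert(pile, 2)
print(pile)                    # [1, 3, 5, 4, 2]

sorted_file = [1, 3, 7, 8]     # preexisting sorted sequence
sorted_insert(sorted_file, 2)
print(sorted_file)             # [1, 2, 3, 7, 8]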

Pros and Cons of Sequential File Organization –


Pros –
 Fast and efficient method for huge amounts of data.
 Simple design.
 Files can easily be stored on magnetic tape, i.e. a cheap storage mechanism.
Cons –
 Time is wasted because we cannot jump to a particular record that is required; we have
to move sequentially through the file, which takes time.
 The sorted file method is inefficient because sorting the records takes extra time and space.

RANDOM FILE ORGANIZATION


 Records are stored randomly but accessed directly.
 To access a file stored randomly, a record key is used to determine where a record is
stored on the storage media.
 Magnetic and optical disks allow data to be stored and accessed randomly.

Pros of random file access


 Quick retrieval of records.
 The records can be of different sizes.
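A toy sketch of direct access: the record key is hashed to a slot address, so retrieval needs no sequential scan (a Python list stands in for addressable disk blocks; collision handling is omitted):

NUM_SLOTS = 8
disk = [None] * NUM_SLOTS          # stand-in for addressable disk blocks

def slot_for(key):
    return hash(key) % NUM_SLOTS   # the key determines where the record lives

def store(key, record):
    disk[slot_for(key)] = record

def fetch(key):
    return disk[slot_for(key)]     # direct access by key

store("roll-17", {"name": "Riya", "age": 20})
print(fetch("roll-17"))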

INDEX SEQUENTIAL FILE ORGANIZATION


ISAM (Indexed Sequential Access Method) is an advanced sequential file organization.
In this method, records are stored in the file using the primary key. An index value is
generated for each primary key and mapped to the record. This index contains the address
of the record in the file.

If any record has to be retrieved, then based on its index value the address of its data block
is fetched and the record is retrieved from memory, as the sketch below illustrates.
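A toy sketch of the idea: an index maps each primary key to the address of the data block holding the record, so a lookup fetches that block directly (dictionaries stand in for the on-disk index and blocks; the data is hypothetical):

# Data blocks on "disk", addressed by block number.
blocks = {
    0: [{"roll_no": 1, "name": "Riya"}, {"roll_no": 2, "name": "Aman"}],
    1: [{"roll_no": 3, "name": "Dev"}],
}

# Index: primary key -> address of the block holding the record.
index = {1: 0, 2: 0, 3: 1}

def lookup(key):
    block = blocks[index[key]]     # fetch the addressed data block
    return next(r for r in block if r["roll_no"] == key)

print(lookup(3))                   # {'roll_no': 3, 'name': 'Dev'}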

Pros of ISAM:
 Since each record has the address of its data block, searching for a record in a huge
database is quick and easy.
 This method supports range retrieval and partial retrieval of records. Since the index is
based on the primary key values, we can retrieve the data for a given range of values. In
the same way, partial values can also be easily searched, i.e., student names starting
with 'JA' can be easily searched.
Cons of ISAM
 This method requires extra space in the disk to store the index value.
 When the new records are inserted, then these files have to be reconstructed to maintain
the sequence.
 When the record is deleted, then the space used by it needs to be released. Otherwise, the
performance of the database will slow down.

Relational Algebra
Relational algebra is a procedural query language. It gives a step-by-step process to obtain
the result of a query. It uses operators to perform queries.

 Definition
 “Relational algebra is a theoretical language with operators that are applied on one or
two relations to produce another relation.”
 Both the operands and the result are tables
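A toy sketch of two relational algebra operators, selection (σ) and projection (π), over a relation modeled as a list of dictionaries (the student relation is hypothetical); note that both take relations in and give a relation back:

student = [
    {"roll_no": 1, "name": "Riya", "age": 20},
    {"roll_no": 2, "name": "Aman", "age": 17},
]

def select(relation, predicate):
    # sigma: keep the tuples that satisfy the predicate.
    return [t for t in relation if predicate(t)]

def project(relation, attributes):
    # pi: keep only the named attributes of each tuple.
    return [{a: t[a] for a in attributes} for t in relation]

# pi_name( sigma_{age >= 18}( student ) )
print(project(select(student, lambda t: t["age"] >= 18), ["name"]))
# [{'name': 'Riya'}]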

 Types of Relational operation

NORMALIZATION UP TO DKNF
It is basically a process in a database to organize data efficiently. There are two goals of
normalization:

1. To remove repeated data, or in simple words, to remove redundant data.
2. To ensure that the data dependencies make sense, i.e. that data is stored logically.

DKNF stands for Domain-Key Normal Form. It requires that the database contain no
constraints other than domain constraints and key constraints:
o In DKNF, it is easy to build a database.
o It avoids general constraints in the database which are not clear domain or key
constraints.
o 3NF, 4NF, 5NF and BCNF are special cases of DKNF.
o DKNF is achieved when every constraint on the relation is a logical consequence of the
definitions of keys and domains.
Unit-2

OBJECT ORIENTED DATA MODELING


 Need of Object Oriented Data Model:
To represent complex real-world problems there was a need for a data model that is
closely related to the real world. The Object Oriented Data Model represents real-world
problems easily.

 Object Oriented Data Model:


In the Object Oriented Data Model, data and their relationships are contained in a single
structure, which is referred to as an object in this data model. Real-world problems are
represented as objects with different attributes, and objects can have multiple relationships
between them. Basically, it is a combination of Object Oriented Programming and the
Relational Database Model, as the following summary makes clear:

Object Oriented Data Model = Object Oriented Programming + Relational Data Model

 Components of Object Oriented Data Model :

Objects – An object is an abstraction of a real-world entity, or we can say it is an instance of a
class. Objects encapsulate data and code into a single unit, which provides data abstraction
by hiding the implementation details from the user. For example: instances of Student,
Doctor, Engineer in the above figure.
Attributes – An attribute describes a property of an object. For example: the object is
STUDENT and its attributes are Roll_no and Branch in the Student class.
Methods – A method represents the behavior of an object. Basically, it represents a real-
world action. For example: finding a STUDENT's marks via Setmarks() in the above figure.
Class – A class is a collection of similar objects with a shared structure, i.e. attributes, and
behavior, i.e. methods. An object is an instance of a class. For example: Person, Student,
Doctor, and Engineer in the above figure. A cleaned-up sketch of such a class:
class student
{
    char Name[20];    // attribute
    int roll_no;      // attribute
    // ... other attributes
public:
    void search();    // method
    void update();    // method
};
In this example, student refers to the class, and S1, S2 are objects of the class which can be
created in the main function.
Inheritance –
By using inheritance, a new class can inherit the attributes and methods of an old class, i.e.
the base class. For example: the classes Student, Doctor and Engineer are inherited from the
base class Person.

GENERALIZATION
• Generalization is like a bottom-up approach in which two or more entities of lower level
combine to form a higher level entity if they have some attributes in common.

• In generalization, an entity of a higher level can also combine with the entities of the
lower level to form a further higher level entity.

• Generalization is more like subclass and superclass system, but the only difference is the
approach. Generalization uses the bottom-up approach.

• In generalization, entities are combined to form a more generalized entity, i.e.,


subclasses are combined to make a superclass.

• For example, Faculty and Student entities can be generalized and create a higher level
entity Person.
AGGREGATION
In aggregation, the relation between two entities is treated as a single entity. In aggregation,
relationship with its corresponding entities is aggregated into a higher level entity.

For example: the Center entity, the Course entity, and the "offers" relationship between
them act as a single entity, which is in a relationship with another entity, Visitor. In the
real world, if a visitor visits a coaching center, he will never enquire about the Course
alone or about the Center alone; he will enquire about both.

DISTRIBUTED DATABASE DESIGN


 A distributed database is a collection of multiple interconnected databases, which are
spread physically across various locations that communicate via a computer network.
 A distributed database is basically a database that is not limited to one system; it is spread
over different sites, i.e., on multiple computers or over a network of computers. A
distributed database system is located at various sites that don't share physical
components. This may be required when a particular database needs to be accessed by
various users globally. It needs to be managed such that to the users it looks like one single
database.

Features
 Databases in the collection are logically interrelated with each other. Often they represent
a single logical database.
 Data is physically stored across multiple sites. Data in each site can be managed by a
DBMS independent of the other sites.
 The processors in the sites are connected via a network. They do not have any
multiprocessor configuration.
 A distributed database is not a loosely connected file system.
 A distributed database incorporates transaction processing, but it is not synonymous with
a transaction processing system.
Advantages of Distributed Databases
Following are the advantages of distributed databases over centralized databases.
 Modular Development − If the system needs to be expanded to new locations or new
units, in centralized database systems, the action requires substantial efforts and
disruption in the existing functioning. However, in distributed databases, the work simply
requires adding new computers and local data to the new site and finally connecting them
to the distributed system, with no interruption in current functions.
 More Reliable − In case of database failures, the total system of centralized databases
comes to a halt. However, in distributed systems, when a component fails, the functioning
of the system continues, possibly at reduced performance. Hence a DDBMS is more
reliable.
 Better Response − If data is distributed in an efficient manner, then user requests can be
met from local data itself, thus providing faster response. On the other hand, in
centralized systems, all queries have to pass through the central computer for processing,
which increases the response time.
 Lower Communication Cost − In distributed database systems, if data is located locally
where it is mostly used, then the communication costs for data manipulation can be
minimized. This is not feasible in centralized systems.
Types of Distributed Databases
Distributed databases can be broadly classified into homogeneous and heterogeneous
distributed database environments, each with further sub-divisions, as shown in the following
illustration.

Homogeneous Distributed Databases


In a homogeneous distributed database, all the sites use identical DBMS and operating
systems. Its properties are −
 The sites use very similar software.
 The sites use identical DBMS or DBMS from the same vendor.
 Each site is aware of all other sites and cooperates with other sites to process user
requests.
 The database is accessed through a single interface as if it is a single database.

Types of Homogeneous Distributed Database


There are two types of homogeneous distributed database −
 Autonomous − Each database is independent and functions on its own. The databases are
integrated by a controlling application and use message passing to share data updates.
 Non-autonomous − Data is distributed across the homogeneous nodes, and a central or
master DBMS coordinates data updates across the sites.

Heterogeneous Distributed Databases


In a heterogeneous distributed database, different sites have different operating systems,
DBMS products and data models. Its properties are −
 Different sites use dissimilar schemas and software.
 The system may be composed of a variety of DBMSs like relational, network,
hierarchical or object oriented.
 Query processing is complex due to dissimilar schemas.
 Transaction processing is complex due to dissimilar software.
 A site may not be aware of other sites and so there is limited co-operation in processing
user requests.

Types of Heterogeneous Distributed Databases


 Federated − The heterogeneous database systems are independent in nature and are
integrated together so that they function as a single database system.
 Un-federated − The database systems employ a central coordinating module through
which the databases are accessed.

DISTRIBUTED DBMS ARCHITECTURES


DDBMS architectures are generally developed depending on three parameters −
 Distribution − It states the physical distribution of data across the different sites.
 Autonomy − It indicates the distribution of control of the database system and the
degree to which each constituent DBMS can operate independently.
 Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system
components and databases.
Architectural Models
Some of the common architectural models are −
1. Client - Server Architecture for DDBMS
2. Peer - to - Peer Architecture for DDBMS
3. Multi - DBMS Architecture
1. Client - Server Architecture for DDBMS
This is a two-level architecture where the functionality is divided into servers and
clients. The server functions primarily encompass data management, query processing,
optimization and transaction management. Client functions include mainly user
interface. However, they have some functions like consistency checking and transaction
management.
The two different client - server architecture are −
 Single Server Multiple Client
 Multiple Server Multiple Client (shown in the following diagram)

2. Peer- to-Peer Architecture for DDBMS


In these systems, each peer acts both as a client and a server for imparting database
services. The peers share their resource with other peers and co-ordinate their activities.

This architecture generally has four levels of schemas −


 Global Conceptual Schema − Depicts the global logical view of data.
 Local Conceptual Schema − Depicts logical data organization at each site.
 Local Internal Schema − Depicts physical data organization at each site.
 External Schema − Depicts user view of data.

3. Multi - DBMS Architectures


This is an integrated database system formed by a collection of two or more
autonomous database systems.
Multi-DBMS can be expressed through six levels of schemas −
 Multi-database View Level − Depicts multiple user views comprising of subsets of the
integrated distributed database.
 Multi-database Conceptual Level − Depicts integrated multi-database that comprises
of global logical multi-database structure definitions.
 Multi-database Internal Level − Depicts the data distribution across different sites
and multi-database to local data mapping.
 Local database View Level − Depicts public view of local data.
 Local database Conceptual Level − Depicts local data organization at each site.
 Local database Internal Level − Depicts physical data organization at each site.
There are two design alternatives for multi-DBMS −

 Model with multi-database conceptual level.

 Model without multi-database conceptual level.


Design Alternatives
The distribution design alternatives for the tables in a DDBMS are as follows −

 Non-replicated and non-fragmented


 Fully replicated
 Partially replicated
 Fragmented
 Mixed

Non-replicated & Non-fragmented


In this design alternative, different tables are placed at different sites. Data is placed so
that it is in close proximity to the site where it is used most. This alternative is most suitable
for database systems where the percentage of queries needing to join information in tables
placed at different sites is low. If an appropriate distribution strategy is adopted, this
design alternative helps to reduce the communication cost during data processing.

Fully Replicated
In this design alternative, one copy of all the database tables is stored at each site. Since
each site has its own copy of the entire database, queries are very fast, requiring negligible
communication cost. On the contrary, the massive redundancy in data incurs a huge cost
during update operations. Hence, this is suitable for systems where a large number of
queries must be handled while the number of database updates is low.
Partially Replicated
Copies of tables or portions of tables are stored at different sites. The distribution of the
tables is done in accordance with the frequency of access. This takes into consideration the
fact that the frequency of accessing the tables varies considerably from site to site. The
number of copies of the tables (or portions) depends on how frequently the access
queries execute and on which sites generate those queries.

Fragmented
In this design, a table is divided into two or more pieces referred to as fragments or
partitions, and each fragment can be stored at different sites. This considers the fact that
it seldom happens that all data stored in a table is required at a given site. Moreover,
fragmentation increases parallelism and provides better disaster recovery. Here, there is
only one copy of each fragment in the system, i.e. no redundant data.
The three fragmentation techniques are −

 Vertical fragmentation
 Horizontal fragmentation
 Hybrid fragmentation
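A toy sketch of the first two techniques on one relation (rows as dicts; the site assignment is hypothetical):

student = [
    {"roll_no": 1, "name": "Riya", "city": "Jaipur"},
    {"roll_no": 2, "name": "Aman", "city": "Delhi"},
]

# Horizontal fragmentation: split by rows, one predicate per site.
site_jaipur = [t for t in student if t["city"] == "Jaipur"]
site_delhi  = [t for t in student if t["city"] == "Delhi"]

# Vertical fragmentation: split by columns, repeating the key so the
# original relation can be reconstructed by a join on roll_no.
ids_names  = [{"roll_no": t["roll_no"], "name": t["name"]} for t in student]
ids_cities = [{"roll_no": t["roll_no"], "city": t["city"]} for t in student]

print(site_jaipur)
print(ids_names)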

Mixed Distribution
This is a combination of fragmentation and partial replication. Here, the tables are
initially fragmented in any form (horizontal or vertical), and these fragments are then
partially replicated across the different sites according to the frequency of access to the
fragments.

DATA COMMUNICATION CONCEPT


 Data communication refers to the exchange of data between a source and a receiver via
some form of transmission medium, such as a wire cable. Data communication is said to be
local if the communicating devices are in the same building or a similarly restricted
geographical area.
 A data communication system may collect data from remote locations through data
transmission circuits and then output processed results to remote locations. The
different data communication techniques presently in widespread use evolved
gradually, either to improve the existing techniques or to replace them with better
options and features.
Components of data communication system
A Communication system has following components:
1. Message: It is the information or data to be communicated. It can consist of text,
numbers, pictures, sound or video or any combination of these.
2. Sender: It is the device/computer that generates and sends the message.
3. Receiver: It is the device or computer that receives the message. The location of the
receiver computer is generally different from that of the sender computer. The distance
between sender and receiver depends upon the type of network used in between.
4. Medium: It is the channel or physical path through which the message is carried from
sender to the receiver. The medium can be wired like twisted pair wire, coaxial cable,
fiber-optic cable or wireless like laser, radio waves, and microwaves.

CONCURRENCY CONTROL IN DBMS


It is the process of managing simultaneous execution of transactions in a shared database to
ensure the serializability of transactions.

Purpose of concurrency control

(i) To enforce isolation


(ii) To preserve database consistency
(iii) To resolve read-write and write-write conflicts.

Concurrency control techniques:-

1. Lock-based Protocol: A lock guarantees exclusive use of a data item to the current
transaction.
 To access a data item, a transaction first acquires a lock on it.
 After completion of the transaction, it releases the lock.
[All data items must be accessed in a mutually exclusive manner.]

Types of locks
1. Shared lock
 Lock-s (read data item value)
2. Exclusive lock
 Lock-x (both read and write)

Compatibility between Lock Modes

          S       X
   S      T       F
   X      F       F

Note: Any number of transactions can hold a shared lock on an item, but an exclusive lock
can be held by only one transaction at a time.
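A minimal sketch of how a lock manager might apply this matrix; a request is granted only if it is compatible with every lock already held on the item:

# Lock modes: 'S' (shared/read) and 'X' (exclusive/read-write).
COMPATIBLE = {
    ("S", "S"): True,    # any number of transactions may share read locks
    ("S", "X"): False,
    ("X", "S"): False,
    ("X", "X"): False,   # an exclusive lock admits exactly one holder
}

def can_grant(requested, held_modes):
    return all(COMPATIBLE[(held, requested)] for held in held_modes)

print(can_grant("S", ["S", "S"]))   # True: shared locks coexist
print(can_grant("X", ["S"]))        # False: must wait for the reader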

CONVERSION OF LOCKS
i. Upgrading – Read-Lock -> Write-Lock
ii. Downgrading – Write-Lock -> Read-Lock

RECOVERY IN DDBMS
Recovery is the most complicated process in distributed databases. Recovery of a failed
system in the communication network is very difficult.

For example:
Consider that location A sends a message to location B and expects a response from B, but
does not receive one. There are several possible causes for this situation, which are as follows:

 The message failed due to a failure in the network.
 Location B sent the message, but it was not delivered to location A.
 Location B crashed.
 So it is actually very difficult to find the cause of a failure in a large communication
network.
 Distributed commit in the network is also a serious problem which can affect
recovery in distributed databases.

NEED OF RECOVERY
A database is a very huge system with lots of data and transactions. Transactions are
executed against the database every second and are critical to it. If there is any failure or
crash while executing a transaction, then it is expected that no data is lost. It is necessary
to revert the changes of the transaction to the previously committed point. There are
various techniques to recover the data depending on the type of failure or crash.
Generalization of failure:

 Transaction failure
 System crash
 Disk failure

When a DBMS recovers from a crash, it should maintain the following –

 It should check the states of all the transactions, which were being executed.
 A transaction may be in the middle of some operation; the DBMS must ensure the
atomicity of the transaction in this case.
 It should check whether the transaction can be completed now or it needs to be rolled
back.
 No transaction should be allowed to leave the DBMS in an inconsistent state.

TRANSACTIONS
A transaction is a program comprising a collection of database operations, executed as a
logical unit of data processing. The operations performed in a transaction include one or
more database operations like insert, delete, update or retrieve. It is an atomic
process that is either performed to completion in its entirety or not performed at all. A
transaction involving only data retrieval, without any data update, is called a read-only
transaction.
Each high level operation can be divided into a number of low level tasks or operations. For
example, a data update operation can be divided into three tasks −
 read_item() − reads data item from storage to main memory.
 modify_item() − change value of item in the main memory.
 write_item() − write the modified value from main memory to storage.
Database access is restricted to read_item() and write_item() operations. Likewise, for all
transactions, read and write form the basic database operations.
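A toy sketch of the three low-level tasks above, with one dict standing in for stable storage and another for main-memory buffers:

storage = {"A": 100}     # stand-in for disk
buffer = {}              # stand-in for main memory

def read_item(name):
    buffer[name] = storage[name]    # disk -> main memory

def modify_item(name, delta):
    buffer[name] += delta           # change the buffered value

def write_item(name):
    storage[name] = buffer[name]    # main memory -> disk

read_item("A"); modify_item("A", -10); write_item("A")
print(storage)                      # {'A': 90}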

Transaction Operations
The low level operations performed in a transaction are −
 begin_transaction − A marker that specifies start of transaction execution.
 Read_item or write_item − Database operations that may be interleaved with main
memory operations as a part of transaction.
 end_transaction − A marker that specifies end of transaction.
 Commit − A signal to specify that the transaction has been successfully completed in its
entirety and will not be undone.
 Rollback − A signal to specify that the transaction has been unsuccessful and so all
temporary changes in the database are undone. A committed transaction cannot be
rolled back.
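These markers map directly onto SQL transaction control; a minimal sqlite3 sketch (the account table is a hypothetical example):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100)")
conn.commit()

try:
    conn.execute("UPDATE account SET balance = balance - 10 WHERE id = 1")
    conn.commit()      # Commit: the change is permanent and will not be undone
except sqlite3.Error:
    conn.rollback()    # Rollback: all temporary changes are undone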

Transaction States
A transaction may go through a subset of five states, active, partially committed, committed,
failed and aborted.
 Active − the initial state where the transaction enters is the active state. The transaction
remains in this state while it is executing read, write or other operations.
 Partially Committed − the transaction enters this state after the last statement of the
transaction has been executed.
 Committed − the transaction enters this state after successful completion of the
transaction and system checks have issued commit signal.
 Failed − the transaction goes from partially committed state or active state to failed state
when it is discovered that normal execution can no longer proceed or system checks fail.
 Aborted − this is the state after the transaction has been rolled back after failure and the
database has been restored to its state that was before the transaction began.
The following state transition diagram depicts the states in the transaction and the low level
transaction operations that cause change in states.

Desirable Properties of Transactions


Any transaction must maintain the ACID properties, viz. Atomicity, Consistency, Isolation, and
Durability.
 Atomicity − This property states that a transaction is an atomic unit of processing, that is,
either it is performed in its entirety or not performed at all. No partial update should
exist.
 Consistency − A transaction should take the database from one consistent state to
another consistent state. It should not adversely affect any data item in the database.
 Isolation − A transaction should be executed as if it is the only one in the system. There
should not be any interference from the other concurrent transactions that are
simultaneously running.
 Durability − If a committed transaction brings about a change, that change should be
durable in the database and not lost in case of any failure.

Schedules and Conflicts


In a system with a number of simultaneous transactions, a schedule is the total order of
execution of operations. Given a schedule S comprising n transactions, say T1, T2,
T3, ..., Tn, for any transaction Ti the operations in Ti must execute as laid down in the
schedule S.

Types of Schedules
There are two types of schedules −
 Serial Schedules − in a serial schedule, at any point of time, only one transaction is active,
i.e. there is no overlapping of transactions. This is depicted in the following graph –

 Parallel Schedules − In parallel schedules, more than one transaction is active
simultaneously, i.e. the transactions contain operations that overlap in time. This is
depicted in the following graph −

Conflicts in Schedules
In a schedule comprising multiple transactions, a conflict occurs when two active
transactions perform non-compatible operations. Two operations are said to be in conflict
when all of the following three conditions exist simultaneously −
 The two operations are parts of different transactions.
 Both the operations access the same data item.
 At least one of the operations is a write_item() operation, i.e. it tries to modify the data
item.
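A direct encoding of these three conditions, with each operation written as a (transaction, action, item) triple:

def in_conflict(op1, op2):
    t1, action1, item1 = op1
    t2, action2, item2 = op2
    return (t1 != t2                            # different transactions
            and item1 == item2                  # same data item
            and "write" in (action1, action2))  # at least one write

print(in_conflict(("T1", "read", "A"), ("T2", "write", "A")))   # True
print(in_conflict(("T1", "read", "A"), ("T2", "read", "A")))    # False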
SERIALIZABILITY
A serializable schedule of 'n' transactions is a parallel schedule which is equivalent to some
serial schedule of the same 'n' transactions. A serializable schedule combines the
correctness of a serial schedule with the better CPU utilization of a parallel schedule.

Equivalence of Schedules
Equivalence of two schedules can be of the following types −
 Result equivalence − Two schedules producing identical results are said to be result
equivalent.
 View equivalence − Two schedules that perform similar action in a similar manner are
said to be view equivalent.
 Conflict equivalence − Two schedules are said to be conflict equivalent if both contain the same set of transactions and have the same order of conflicting pairs of operations.
TWO PHASE LOCKING PROTOCOL
Every transaction locks and unlocks its data items in two different phases.

 Growing Phase − all the locks are issued in this phase and no lock is released. Once the transaction has acquired all the locks it needs, the second phase (the shrinking phase) starts.
 Shrinking Phase − no new locks are issued in this phase; the changes to data items are noted (stored) and the locks are released.
The 2PL locking protocol is represented diagrammatically as follows −

In the growing phase the transaction reaches a point where all the locks it may need have been acquired. This point is called the LOCK POINT.
After the lock point has been reached, the transaction enters the shrinking phase.
Types
Two phase locking is of two types −
 Strict two phase locking protocol
A transaction can release a shared lock after the lock point, but it cannot release any exclusive lock until the transaction commits. This protocol produces cascadeless schedules.

Cascading schedule: in this schedule one transaction is dependent on another transaction, so if one has to roll back, the other has to roll back as well.

 Rigorous two phase locking protocol

A transaction cannot release any lock, either shared or exclusive, until it commits.
The 2PL protocol guarantees serializability, but it cannot guarantee that deadlock will not happen.
Example
Let T1 and T2 be two transactions.
T1=A+B and T2=B+A

Here,
Lock-X (B): Cannot execute Lock-X(B) since B is locked by T2.
Lock-X (A): Cannot execute Lock-X(A) since A is locked by T1.
In the above situation T1 waits for B and T2 waits for A, and the waiting never ends. Neither transaction can proceed unless one of them releases its lock voluntarily. This situation is called deadlock.

The wait for graph is as follows –

Wait-for graph: it is used in the deadlock detection method. A node is created for each transaction, and an edge is created from Ti to Tj if Ti is waiting to lock an item locked by Tj. A cycle in the WFG indicates that a deadlock has occurred. The WFG is created at regular intervals.
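A minimal sketch of WFG-based deadlock detection in Python, assuming the graph is kept as an adjacency mapping (transaction names are illustrative):

def has_cycle(wfg):
    # wfg maps each transaction to the set of transactions it waits for.
    visited, on_stack = set(), set()

    def dfs(node):
        visited.add(node)
        on_stack.add(node)
        for nxt in wfg.get(node, ()):
            if nxt in on_stack or (nxt not in visited and dfs(nxt)):
                return True
        on_stack.discard(node)
        return False

    return any(dfs(n) for n in wfg if n not in visited)

# T1 waits for T2 and T2 waits for T1 -- the deadlock from the example above.
print(has_cycle({"T1": {"T2"}, "T2": {"T1"}}))  # True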
Unit-3

QUERY OPTIMIZATION
Query Processing: query processing refers to activities including the translation of high-level language (HLL) queries into operations at the physical file level, query optimization transformations, and the actual evaluation of queries.
Together, the activities involved in parsing, validating, optimizing and executing a query are called Query Processing.
Steps
The steps involved in query processing and optimization are as follows −
 A sequence of primitive operations that can be used to evaluate a query is called query
execution plan or query evaluation plan.
 The query execution engine takes a query evaluation plan, executes that plan and produces the desired output. Different execution plans for a given query can have different costs, based mainly on the number of disk accesses. It is the responsibility of the system to construct a query evaluation plan that minimizes the cost of query evaluation. This task is called query optimization.
 A query is expressed in a high-level query language and is scanned, parsed and validated. The scanner identifies the SQL keywords, attribute names and relation names in the text of the query.
 The parser checks the syntax to determine whether the query is formulated according to the syntax rules of the query language.
 Finally, the query is validated by checking that all attribute and relation names are valid and semantically meaningful in the schema of the particular database.
 An internal representation of the query is then created which is either a tree or a graph
known as query tree or query graph.
 If the query is written in SQL, it is translated into relational algebra and its internal representation is a query tree. If it is written in tuple or domain relational calculus (TRC or DRC), its internal representation is a query graph. A graph has many possible execution strategies, and the process of choosing a strategy with minimum cost is called query optimization.
The query processing and optimization in the DBMS are explained in the form of a diagram
below −
ALGORITHM FOR EXTERNAL SORTING
External sorting
External sorting is a technique for data that resides on secondary memory: the data is loaded into main memory part by part, each part is sorted there, and the sorted parts are stored in intermediate files. Finally, these files are merged to get the fully sorted data. Thus, by using the external sorting technique, a huge amount of data can be sorted easily. External sorting is used when all the data cannot be accommodated in main memory at once, so most of it must be kept on secondary storage such as a hard disk, compact disc and so on.

The requirement for external sorting arises when the data to be sorted does not fit into main memory. Basically, it consists of two phases:

1. Sorting phase: the phase in which a large amount of data is sorted into intermediate files.
2. Merge phase: the phase in which the sorted files are combined into a single larger file.

One of the best examples of external sorting is external merge sort.


External merge sort
The external merge sort is a technique in which the data is stored in intermediate files; each intermediate file is sorted independently, and the files are then combined or merged to get the sorted data.

For example: let us consider 10,000 records that have to be sorted. For this, we apply the external merge sort method. Suppose main memory can hold 500 records at a time, organized as 5 blocks of 100 records each.

In each pass, 5 blocks (500 records) are loaded, sorted and written to an intermediate file. This process is repeated 20 times to cover all the records. We then start merging pairs of intermediate files in main memory to get the sorted output.

Two-Way Merge Sort


Two-way merge sort is a technique which works in two stages, as follows:

Stage 1: First break the records into blocks and then sort the individual blocks with the help of two input tapes.

Stage 2: In this stage merge the sorted blocks and then create a single sorted file with the help of two output tapes.

By this, it can be said that two-way merge sort uses the two input tapes and two output
tapes for sorting the data.

Algorithm for Two-Way Merge Sort:


Step 1) Divide the elements into blocks of size M. Sort each block and then write it on disk.
Step 2) Merge two runs:

1. Read the first value of each of the two runs.
2. Compare the values and sort them.
3. Write the sorted record on the output tape.

Step 3) Repeat step 2 to obtain longer and longer runs on alternate tapes. Finally, we will get a single sorted list.

Analysis
This algorithm requires log2(N/M) merge passes in addition to the initial run-formation pass. Each pass processes all N records, so the overall time complexity is O(N log(N/M)).
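A minimal in-memory simulation of two-way external merge sort, assuming M records fit in memory at a time and modelling tapes/files as Python lists:

def merge(a, b):
    # Merge two sorted runs by repeatedly comparing their front records.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:]); out.extend(b[j:])  # copy whichever tail remains
    return out

def two_way_external_sort(records, m):
    # Run-formation pass: sort blocks of size m (the part that "fits in memory").
    runs = [sorted(records[i:i + m]) for i in range(0, len(records), m)]
    # Merge passes: merge pairs of runs until a single sorted run remains.
    while len(runs) > 1:
        pairs = [runs[i:i + 2] for i in range(0, len(runs), 2)]
        runs = [merge(*p) if len(p) == 2 else p[0] for p in pairs]
    return runs[0] if runs else []

print(two_way_external_sort([9, 4, 7, 1, 8, 2, 6, 3], m=2))  # [1, 2, 3, 4, 6, 7, 8, 9]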

HEURISTICS IN QUERY OPTIMIZATION


Cost-based optimization is expensive. Heuristics are used to reduce the number of choices
that must be made in a cost-based approach.
Rules
Heuristic optimization transforms the expression-tree by using a set of rules which improve
the performance. These rules are as follows −
 Perform the SELECTION operations as early as possible in the query. This should be the first action on any table: by doing so, we decrease the number of records carried through the query, rather than dragging all the rows of every table along.
 Perform all the PROJECTION operations as early as possible in the query. Somewhat like selection, this decreases the number of columns carried through the query.
 Perform the most restrictive joins and selection operations first. This means selecting only those sets of tables and/or views which will result in a relatively smaller number of records and are strictly necessary in the query. Obviously, any query will execute better when tables with few records are joined.
Some systems use only heuristics, while others combine heuristics with partial cost-based optimization.
Steps in heuristic optimization
Let us see the steps involved in heuristic optimization, which are explained below (a worked example follows the list) −
 Deconstruct the conjunctive selections into a sequence of single selection operations.
 Move the selection operations down the query tree for the earliest possible execution.
 Execute first those selection and join operations which will produce the smallest relations.
 Replace a Cartesian product operation followed by a selection operation with a join operation.
 Deconstruct the projection lists and move the projections down the tree as far as possible.
 Identify those subtrees whose operations can be pipelined.
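For instance, consider a hypothetical query over relations EMP(eno, ename, dno) and DEPT(dno, dname, loc) that lists the employees who work in Delhi:

SELECT ename FROM EMP, DEPT WHERE EMP.dno = DEPT.dno AND loc = 'Delhi';

A naive evaluation computes π ename (σ EMP.dno = DEPT.dno AND loc = 'Delhi' (EMP × DEPT)). Applying the rules above, the selection σ loc = 'Delhi' is pushed down onto DEPT, and the Cartesian product followed by the join condition is replaced by a join, giving π ename (EMP ⋈ σ loc = 'Delhi' (DEPT)), which carries far fewer records through the tree.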

MULTI-MEDIA DATABASE
Multimedia database is a collection of multimedia data which includes text, images,
graphics (drawings, sketches), animations, audio, and video, among others. These databases
have extensive amounts of data which can be multimedia and multisource. The framework
which manages these multimedia databases and their different types so that the data can be
stored, utilized, and delivered in more than one way is known as a multimedia database
management system.
The multimedia database can be classified into three types. These types are:
1. Static media
2. Dynamic media
3. Dimensional media

The contents of a multimedia database management system can be:


1. Media data: It is the actual data which represents an object.
2. Media format data: the information about the format of the media data, such as resolution, sampling rate, encoding system, etc., after it undergoes acquisition, processing, and encoding, is the media format data.
3. Media keyword data: Media keyword data are the keyword description related to the
generation of data. This data is also known as content descriptive data. Examples of
content descriptive data are place, time, date of recording.
4. Media feature data: media feature data contains content-dependent data, such as the kind of texture, the distribution of colors, and the different shapes present in the data.

The types of multimedia applications that are based on the data management
characteristics are:
1. Repository applications: An extensive amount of multimedia data stored along with
metadata for retrieval purposes.
2. Presentation applications: These involve the delivery of multimedia data subject to the
temporal constraint. An optimal viewing or listening experience requires DBMS to deliver
the data at a certain rate which offers the quality of service, which is above a particular
threshold. This data is processed as it is being delivered.
3. Collaborative work using multimedia information: It involves the execution of a complex
task by merging drawings and changing notifications.

This still leads to a number of challenges to multimedia databases. These are:


1. Modeling: Work in this area can improve the database versus information retrieval
techniques.
2. Design: The physical, conceptual, and logical design of multimedia databases is not
addressed entirely leading to performance and tuning issues.
3. Storage: The storage of databases on a standard disc can lead to problems like
representation, mapping to disc hierarchies, compression, etc.
4. Performance: Audio-video synchronization and audio playback applications are where
physical limitations dominate. Parallel processing can reduce these problems, but these
techniques have not been completely developed yet. Multimedia databases also consume
a lot of processing power and bandwidth.
5. Queries and Retrieval: Multimedia such as images, audio, video lead to retrieval and
queries issues such as efficient query formation, query execution, etc.

Multimedia Database Applications:


1. Documents and record management: Industries which keep a lot of documentation and
records. Ex: Insurance claim industry.
2. Knowledge dissemination: Multimedia database is an extremely efficient tool for
knowledge dissemination and providing several resources. Ex: electronic books
3. Education and training: Multimedia sources can be used to create resources useful in
education and training. These are popular sources of learning in recent days. Ex: Digital
libraries.
4. Real-time monitoring and control: Multimedia presentation when coupled with active
database technology can be an effective means for controlling and monitoring complex
tasks. Ex: Manufacture control
5. Marketing
6. Advertisement
7. Retailing
8. Entertainment
9. Travel
DATA MINING
What is Data Mining?

 The process of extracting information from huge sets of data to identify patterns, trends, and useful data that allow the business to take data-driven decisions is called Data Mining.

 In other words, we can say that Data Mining is the process of investigating hidden patterns in data from various perspectives and categorizing them into useful information, which is collected and assembled in particular areas such as data warehouses. It supports efficient analysis and data mining algorithms, helps decision making, and ultimately aids cost-cutting and revenue generation.

 Data mining is the act of automatically searching large stores of information to find trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment the data and evaluate the probability of future events. Data Mining is also called Knowledge Discovery in Databases (KDD).

 Data Mining is a process used by organizations to extract specific data from huge databases
to solve business problems. It primarily turns raw data into useful information.

 Data Mining is similar to Data Science carried out by a person, in a specific situation, on a
particular data set, with an objective. This process includes various types of services such as
text mining, web mining, audio and video mining, pictorial data mining, and social media
mining. It is done through software that is simple or highly specific. There are many
powerful instruments and techniques available to mine data and find better insight from it.

Types of Data Mining


Data mining can be performed on the following types of data:

Relational Database:
A relational database is a collection of multiple data sets formally organized by tables, records, and columns from which data can be accessed in various ways without having to reorganize the database tables. Tables convey and share information, which facilitates data searchability, reporting, and organization.

Data warehouses:
A Data Warehouse is the technology that collects the data from various sources within the
organization to provide meaningful business insights. The huge amount of data comes from
multiple places such as Marketing and Finance. The extracted data is utilized for analytical purposes and helps in decision-making for a business organization. The data warehouse is designed for the analysis of data rather than transaction processing.
Data Repositories:
The Data Repository generally refers to a destination for data storage. However, many IT
professionals utilize the term more clearly to refer to a specific kind of setup within an IT
structure. For example, a group of databases, and where an organization has kept various
kinds of information.

Object-Relational Database:
A combination of an object-oriented database model and relational database model is
called an object-relational model. It supports Classes, Objects, Inheritance, etc. One of the
primary objectives of the Object-relational data model is to close the gap between the
Relational database and the object-oriented model practices frequently utilized in many
programming languages, for example, C++, Java, C#, and so on.

Transactional Database:
A transactional database refers to a database management system (DBMS) that has the
potential to undo a database transaction if it is not performed appropriately. Even though
this was a unique capability a very long while back, today, most of the relational database
systems support transactional database activities.

Advantages of Data Mining


o The Data Mining technique enables organizations to obtain knowledge-based data.
o Data mining enables organizations to make lucrative modifications in operation and
production.
o Compared with other statistical data applications, data mining is cost-efficient.
o Data Mining helps the decision-making process of an organization.
o It facilitates the automated discovery of hidden patterns as well as the prediction of trends and behaviors.
o It can be induced in the new system as well as the existing platforms.
o It is a quick process that makes it easy for new users to analyze enormous amounts of
data in a short time.

Disadvantages of Data Mining


o There is a probability that the organizations may sell useful data of customers to other
organizations for money. As per the report, American Express has sold credit card
purchases of their customers to other organizations.
o Many data mining analytics tools are difficult to operate and need advanced training to work with.
o Different data mining instruments operate in distinct ways due to the different algorithms
used in their design. Therefore, the selection of the right data mining tools is a very
challenging task.
o The data mining techniques are not 100% precise and may lead to severe consequences in certain conditions.

Data Mining Applications


Data Mining is primarily used by organizations facing intense consumer demands, such as retail, communication, financial and marketing companies, to determine price, consumer preferences, product positioning, impact on sales, customer satisfaction, and corporate profits. Data mining enables a retailer to use point-of-sale records of customer purchases to develop products and promotions that help the organization to attract customers.

ASSOCIATION RULES IN DATA MINING


Association rule learning is a type of unsupervised learning technique that checks for the dependency of one data item on another data item and maps them accordingly so that the result can be more profitable. It tries to find interesting relations or associations among the variables of the dataset. It is based on different rules to discover the interesting relations between variables in the database.

Association rule learning is one of the very important concepts of machine learning, and it is employed in market basket analysis, web usage mining, continuous production, etc. Here market basket analysis is a technique used by various big retailers to discover the associations between items. We can understand it by taking the example of a supermarket, where all products that are purchased together are placed together.

For example, if a customer buys bread, he is likely to also buy butter, eggs, or milk, so these products are kept on the same shelf or nearby. Consider the below diagram:
Association rule learning can be divided into three types of algorithms:

1. Apriori
2. Éclat
3. F-P Growth Algorithm

How does Association Rule Learning work?

Association rule learning works on the concept of If-Then statements, such as: if A then B.

Here the If element is called the antecedent, and the Then statement is called the consequent. Relationships in which we can find an association between two items are known as single cardinality. It is all about creating rules, and if the number of items increases, the cardinality also increases accordingly. So, to measure the associations between thousands of data items, there are several metrics. These metrics are given below:

o Support
o Confidence
o Lift

Let's understand each of them:


Support
Support is the frequency of A, i.e. how frequently an item appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X. For an itemset X it can be written as:

Support(X) = (Number of transactions containing X) / (Total number of transactions T)

Confidence
Confidence indicates how often the rule has been found to be true, i.e. how often the items X and Y occur together in the dataset, given that the occurrence of X is already known. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:

Confidence(X → Y) = Support(X ∪ Y) / Support(X)

Lift
Lift measures the strength of a rule. It is the ratio of the observed support measure to the support expected if X and Y were independent of each other:

Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))

It has three possible cases:
o If Lift = 1: the probability of occurrence of the antecedent and that of the consequent are independent of each other.
o Lift > 1: it determines the degree to which the two itemsets are dependent on each other.
o Lift < 1: it tells us that one item is a substitute for the other, i.e. one item has a negative effect on the other.
(A small sketch computing these three metrics follows.)
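A minimal sketch computing the three metrics over a toy list of market-basket transactions (item names are illustrative):

def support(transactions, items):
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    return support(transactions, set(x) | set(y)) / support(transactions, x)

def lift(transactions, x, y):
    return (support(transactions, set(x) | set(y))
            / (support(transactions, x) * support(transactions, y)))

baskets = [{"bread", "butter"}, {"bread", "milk"},
           {"bread", "butter", "eggs"}, {"milk"}]
print(support(baskets, {"bread"}))                 # 0.75
print(confidence(baskets, {"bread"}, {"butter"}))  # 0.666...
print(lift(baskets, {"bread"}, {"butter"}))        # 1.333... (> 1: dependent items)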

Types of Association Rule Learning


Association rule learning can be divided into three algorithms:

 Apriori Algorithm
This algorithm uses frequent itemsets to generate association rules. It is designed to work on databases that contain transactions. It uses a breadth-first search and a Hash Tree to calculate the itemsets efficiently.
It is mainly used for market basket analysis and helps to understand the products that can be
bought together. It can also be used in the healthcare field to find drug reactions for
patients.

 Éclat Algorithm
The Éclat algorithm stands for Equivalence Class Transformation. This algorithm uses a depth-first search technique to find frequent itemsets in a transaction database. It executes faster than the Apriori algorithm.

 F-P Growth Algorithm


The F-P growth algorithm stands for Frequent Pattern growth, and it is an improved version of the Apriori algorithm. It represents the database in the form of a tree structure known as a frequent pattern tree (FP-tree). The purpose of this tree is to extract the most frequent patterns.

CLASSIFICATION
Data Mining: data mining in general terms means mining or digging deep into data, which is in different forms, to find patterns and to gain knowledge about those patterns. In the process of data mining, large data sets are first sorted, then patterns are identified and relationships are established to perform data analysis and solve problems.

Classification: It is a data analysis task, i.e. the process of finding a model that describes and distinguishes data classes and concepts. Classification is the problem of identifying to which of a set of categories (subpopulations) a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.

Example: before starting any project, we need to check its feasibility. In this case, a classifier is required to predict class labels such as ‘Safe’ and ‘Risky’ for adopting the project and approving it further. It is a two-step process:
1. Learning Step (Training Phase): Construction of Classification Model
Different Algorithms are used to build a classifier by making the model learn using the
training set available. The model has to be trained for the prediction of accurate results.

2. Classification Step: the model is used to predict class labels, and testing the constructed model on test data estimates the accuracy of the classification rules. (A small sketch of both steps follows.)
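A minimal sketch of the two steps, using a hypothetical nearest-neighbour rule to label projects as 'Safe' or 'Risky' (the features and labels are illustrative):

# Learning step (training phase): here, "training" simply stores labelled observations.
training = [((2, 40), "Safe"), ((9, 10), "Risky"),
            ((3, 35), "Safe"), ((8, 15), "Risky")]  # (risk score, budget) pairs

def classify(x):
    # Classification step: assign the label of the closest training example.
    dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
    return min(training, key=lambda ex: dist(ex[0], x))[1]

print(classify((7, 12)))  # Risky
print(classify((2, 38)))  # Safe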

DATA WAREHOUSING
A Data Warehouse (DW) is a relational database that is designed for query and analysis
rather than transaction processing. It includes historical data derived from transaction data
from single and multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on
providing support for decision-makers for data modeling and analysis.

A Data Warehouse is a group of data specific to the entire organization, not only to a
particular group of users.

It is not used for daily operations and transaction processing but used for making decisions.

A Data Warehouse can be viewed as a data system with the following attributes:

o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.

"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in


support of management's decisions."

Need for Data Warehouse


Data Warehouse is needed for the following reasons:
1. Business User: Business users require a data warehouse to view summarized data from
the past. Since these people are non-technical, the data may be presented to them in an
elementary form.
2. Store historical data: a data warehouse is required to store time-variant data from the past. This input is used for various purposes.
3. Make strategic decisions: Some strategies may be depending upon the data in the data
warehouse. So, data warehouse contributes to making strategic decisions.
4. For data consistency and quality: by bringing the data from different sources to a common place, the user can effectively ensure uniformity and consistency in the data.
5. High response time: Data warehouse has to be ready for somewhat unexpected loads
and types of queries, which demands a significant degree of flexibility and quick
response time.

Architecture of Data Warehouse


Data Warehousing involves data cleaning, data integration, and data consolidations. A Data
Warehouse has 3-layer architecture −

Data Source Layer


It defines how the data comes to a Data Warehouse. It involves various data sources and
operational transaction systems, flat files, applications, etc.

Integration Layer
It consists of Operational Data Store and Staging area. Staging area is used to perform data
cleansing, data transformation and loading data from different sources to a data
warehouse. As multiple data sources are available for extraction at different time zones,
staging area is used to store the data and later to apply transformations on data.

Presentation Layer
This is used to perform BI reporting by end users. The data in a DW system is accessed by BI
users and used for reporting and analysis.
The following illustration shows the common architecture of a Data Warehouse System.

CHARACTERISTICS OF DATA WAREHOUSE

Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view around a
particular subject, such as customer, product, or sales, instead of the global organization's
ongoing operations. This is done by excluding data that are not useful concerning the subject
and including all data needed by the users to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and
online transaction records. It requires performing data cleaning and integration during data warehousing to ensure consistency in naming conventions, attribute types, etc., among different data sources.

Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve files from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transaction system, where often only the most current file is kept.
Non-Volatile
The data warehouse is a physically separate data storage, which is transformed from the
source operational RDBMS. The operational updates of data do not occur in the data
warehouse, i.e., update, insert, and delete operations are not performed. It usually requires
only two procedures in data accessing: Initial loading of data and access to data. Therefore,
the DW does not require transaction processing, recovery, and concurrency capabilities,
which allows for substantial speedup of data retrieval. Non-volatile means that, once entered into the warehouse, the data should not change.

XML TREE DATA MODEL


XML documents form a tree structure that starts at "the root" and branches to "the leaves".
XML DOCUMENT
An XML document is a basic unit of XML information composed of elements and other markup in an orderly package. An XML document can contain a wide variety of data: for example, a database of numbers, numbers representing molecular structure, or a mathematical equation.

XML Document Example


A simple document is shown in the following example −

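<?xml version = "1.0"?>
<!-- an illustrative example document -->
<contact-info>
   <name>Tanmay Patil</name>
   <company>TutorialsPoint</company>
   <phone>(011) 123-4567</phone>
</contact-info>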
The main parts of an XML document are described below.

Document Prolog Section


Document Prolog comes at the top of the document, before the root element. This section
contains −

 XML declaration
 Document type declaration
Document Elements Section
Document Elements are the building blocks of XML. These divide the document into a
hierarchy of sections, each serving a specific purpose. You can separate a document into
multiple sections so that they can be rendered differently, or used by a search engine. The
elements can be containers, with a combination of text and other elements.

XML DTD SCHEMA


The XML Document Type Definition, commonly known as DTD, is a way to describe an XML language precisely. DTDs check the vocabulary and the validity of the structure of XML documents against the grammatical rules of the appropriate XML language.
An XML DTD can either be specified inside the document, or it can be kept in a separate document and then linked to it.
Syntax
Basic syntax of a DTD is as follows −

<!DOCTYPE element DTD identifier
[
   declaration1
   declaration2
   ........
]>
In the above syntax,


 The DTD starts with <!DOCTYPE delimiter.
 An element tells the parser to parse the document from the specified root element.
 DTD identifier is an identifier for the document type definition, which may be the path to a file on the system or a URL to a file on the internet. If the DTD points to an external path, it is called an External Subset.
 The square brackets [ ] enclose an optional list of entity declarations called Internal
Subset.
SECURITY AND INTEGRITY OF DATABASES
1. Data Security:
Data security refers to the protection of data from unauthorized users. Only authorized users are allowed to access the data. In a database, the DBA or head of department can access all of the data. Some users are only allowed to retrieve data, whereas others are allowed to retrieve as well as to modify the data.

2. Data Integrity:
Data integrity means that the data contained in the database is both correct and consistent. For this purpose, the data stored in the database must satisfy certain constraints (rules). DBMS provides different ways to implement such constraints, and this improves data integrity in a database.

Difference between Data Security and Data Integrity:

1. Data Security: refers to the prevention of data corruption through the use of controlled access mechanisms.
   Data Integrity: refers to the quality of data, which assures the data is complete and has a whole structure.
2. Data Security: its motive is the protection of data.
   Data Integrity: its motive is the validity of data.
3. Data Security: its work is to ensure that only the people who should have access to the data are the ones who can access it.
   Data Integrity: its work is to check that the data is correct and not corrupt.
4. Data Security: it refers to making sure that data is accessed by its intended users, thus ensuring the privacy and protection of data.
   Data Integrity: it refers to the structure of the data and how it matches the schema of the database.
5. Data Security: some of the popular means of data security are authentication/authorization, masking, and encryption.
   Data Integrity: some of the means to preserve integrity are backing up, error detection, designing a suitable user interface and correcting data.
6. Data Security: it relates to the physical protection of data against accidental or intentional loss, misuse or destruction.
   Data Integrity: it relates to the logical protection (correct, complete and consistent) of data.
7. Data Security: it avoids unauthorized access to data.
   Data Integrity: it avoids human error when data is entered.
8. Data Security: it can be implemented through user accounts (passwords) and authentication schemes.
   Data Integrity: it can be implemented through rules such as Primary Key, Foreign Key and Relationship.

CRYPTOGRAPHY AND ITS TYPES


Cryptography is an important aspect when we deal with network security. ‘Crypto’ means secret or hidden. Cryptography is the science of secret writing with the intention of keeping the data secret. Cryptanalysis, on the other hand, is the science, or sometimes the art, of breaking cryptosystems. Both of these terms are subsets of what is called Cryptology.

1. Cryptography –
Cryptography is classified into symmetric cryptography, asymmetric cryptography and
hashing. Below are the descriptions of these types.

1. Symmetric key cryptography – It involves usage of one secret key along with encryption
and decryption algorithms which help in securing the contents of the message. The
strength of symmetric key cryptography depends upon the number of key bits. It is
relatively faster than asymmetric key cryptography. A key distribution problem arises, however, as the key has to be transferred from the sender to the receiver through a secure channel.
2. Asymmetric key cryptography – It is also known as public key cryptography because it involves the usage of a public key along with a secret key. It solves the problem of key distribution, as the two parties use different keys for encryption/decryption. It is not feasible to use for bulk messages, as it is very slow compared to symmetric key cryptography.

3. Hashing – It involves taking the plain text and converting it to a hash value of fixed size by a hash function. This process ensures the integrity of the message, as the hash values computed on the sender's and receiver's sides will match only if the message is unaltered. (A small sketch follows.)
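A minimal sketch of the hashing idea using Python's built-in hashlib module (SHA-256 chosen as one common hash function):

import hashlib

def digest(message: str) -> str:
    # Fixed-size hash value of the plain text.
    return hashlib.sha256(message.encode()).hexdigest()

sent = "transfer 100 to account 42"
received = "transfer 900 to account 42"        # message tampered with in transit

print(digest(sent) == digest(sent))      # True: unaltered message, hashes match
print(digest(sent) == digest(received))  # False: the integrity check fails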

SQL *PLUS DATA-TYPES


Data types are used to represent the nature of the data that can be stored in the database
table. For example, in a particular column of a table, if we want to store a string type of data
then we will have to declare a string data type for this column.
Data types are mainly classified into three categories for every database.

o String Data types


o Numeric Data types
o Date and time Data types

SQL Server String Data Type

char(n) − It is a fixed width character string data type. Its size can be up to 8000 characters.
varchar(n) − It is a variable width character string data type. Its size can be up to 8000 characters.
varchar(max) − It is a variable width character string data type. Its size can be up to 1,073,741,824 characters.
text − It is a variable width character string data type. Its size can be up to 2GB of text data.
nchar − It is a fixed width Unicode string data type. Its size can be up to 4000 characters.
nvarchar − It is a variable width Unicode string data type. Its size can be up to 4000 characters.
ntext − It is a variable width Unicode string data type. Its size can be up to 2GB of text data.
binary(n) − It is a fixed width binary string data type. Its size can be up to 8000 bytes.
varbinary − It is a variable width binary string data type. Its size can be up to 8000 bytes.
image − It is also a variable width binary string data type. Its size can be up to 2GB.

SQL Server Numeric Data Types


bit − It is an integer that can be 0, 1 or NULL.
tinyint − It allows whole numbers from 0 to 255.
smallint − It allows whole numbers between -32,768 and 32,767.
int − It allows whole numbers between -2,147,483,648 and 2,147,483,647.
bigint − It allows whole numbers between -9,223,372,036,854,775,808 and 9,223,372,036,854,775,807.
float(n) − It is used to specify floating precision number data from -1.79E+308 to 1.79E+308. The n parameter indicates whether the field should hold 4 or 8 bytes. The default value of n is 53.
real − It is floating precision number data from -3.40E+38 to 3.40E+38.
money − It is used to specify monetary data from -922,337,203,685,477.5808 to 922,337,203,685,477.5807.

SQL Server Date and Time Data Type


datetime − It is used to specify a date and time combination. It supports the range from January 1, 1753, to December 31, 9999, with an accuracy of 3.33 milliseconds.
datetime2 − It is used to specify a date and time combination. It supports the range from January 1, 0001 to December 31, 9999, with an accuracy of 100 nanoseconds.
date − It is used to store the date only. It supports the range from January 1, 0001 to December 31, 9999.
time − It stores the time only, to an accuracy of 100 nanoseconds.
timestamp − It stores a unique number that is updated when a row gets created or modified. The timestamp value is based on an internal clock and does not correspond to real time. Each table may contain only one timestamp column.

SQL *PLUS CONSTRAINTS


Constraints can be specified when the table is created with the CREATE TABLE statement, or
after the table is created with the ALTER TABLE statement.

SQL constraints are used to specify rules for the data in a table.

Constraints are used to limit the type of data that can go into a table. This ensures the
accuracy and reliability of the data in the table. If there is any violation between the
constraint and the data action, the action is aborted.

Constraints can be column level or table level. Column level constraints apply to a column,
and table level constraints apply to the whole table.

The following constraints are commonly used in SQL (a small sketch using several of them follows the list):

 NOT NULL - Ensures that a column cannot have a NULL value
 UNIQUE - Ensures that all values in a column are different
 PRIMARY KEY - A combination of a NOT NULL and UNIQUE. Uniquely identifies each row
in a table
 FOREIGN KEY - Prevents actions that would destroy links between tables
 CHECK - Ensures that the values in a column satisfy a specific condition
 DEFAULT - Sets a default value for a column if no value is specified
 CREATE INDEX - Used to create and retrieve data from the database very quickly
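A minimal sketch showing several of these constraints in action, using Python's built-in sqlite3 module and hypothetical dept/emp tables:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE dept (
    dno   INTEGER PRIMARY KEY,            -- NOT NULL + UNIQUE row identifier
    dname TEXT NOT NULL UNIQUE
);
CREATE TABLE emp (
    eno    INTEGER PRIMARY KEY,
    ename  TEXT NOT NULL,
    salary INTEGER DEFAULT 0 CHECK (salary >= 0),
    dno    INTEGER REFERENCES dept(dno)   -- FOREIGN KEY protects the table link
);
""")
conn.execute("INSERT INTO dept VALUES (10, 'SALES')")
conn.execute("INSERT INTO emp (eno, ename, dno) VALUES (1, 'Riya', 10)")

try:
    conn.execute("INSERT INTO emp (eno, ename, dno) VALUES (2, 'Amit', 99)")
except sqlite3.IntegrityError as e:       # violates the FOREIGN KEY constraint
    print("rejected:", e)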