RDBMS Unit-1
Data: The raw facts and figures input by the user are called data. "Raw" means the data has not been processed yet; it has no significance beyond its existence and carries no meaning on its own.
The word "data" originates from the word "datum", which means "a single piece of information"; data is the plural of datum.
In computing, data is information that has been translated into a form suitable for efficient movement and processing, and in this sense the terms data and information are often used interchangeably.
Characteristics
Data must be stored in tabular form in the DB file, that is, it should be organized in the form of rows and columns.
Each row of a table is called a record or tuple; the number of such records is known as the cardinality of the table.
Each column of a table is called an attribute or field; the number of such columns is known as the arity (degree) of the table.
No two records of a DB table can be the same; data duplication is therefore avoided by using a candidate key. A candidate key is a minimal set of attributes required to identify each record uniquely.
Tables are related to each other with the help of foreign keys.
Database tables also allow NULL values: if the value of any element of the table is not filled in or is missing, it becomes a NULL value, which is not equivalent to zero. (NOTE: a primary key can't have a NULL value.) A minimal SQL sketch of these ideas follows this list.
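The following is a minimal SQL sketch of the ideas above (candidate key, primary key, foreign key and NULL); the table and column names (Department, Student, Roll_No, etc.) are illustrative assumptions, not taken from the notes:

-- The referenced table must exist first so that Student can point to it.
CREATE TABLE Department (
    Dept_Id   INT PRIMARY KEY,
    Dept_Name VARCHAR(50)
);

CREATE TABLE Student (
    Roll_No INT PRIMARY KEY,       -- candidate key chosen as the primary key; can never be NULL
    Name    VARCHAR(50) NOT NULL,
    Email   VARCHAR(100) UNIQUE,   -- another candidate key
    Dept_Id INT,                   -- may be NULL: a missing value, which is not the same as zero
    FOREIGN KEY (Dept_Id) REFERENCES Department(Dept_Id)   -- tables are related via foreign keys
);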
Relational Database Pros
Easy to manage: Each table can be independently manipulated without affecting
others.
Security: It is more secure, providing multiple levels of security; access to shared data can be limited.
Flexible: Relational databases can be very flexible in how they allow users to change or
store data. This can be very helpful when updating data or adding more data to a stored
set.
Accuracy: Because relational databases are usually designed to be easy to navigate, it
can also be easy to search for potential errors or inconsistencies within a database and
correct them right away.
Usability: Relational databases are often very easy to use, which can be especially
helpful for users who want a simple data storage solution. It's also possible to navigate
and share data sets easily within a relational database.
Users: RDBMS supports a client-server architecture, allowing multiple users to access and work on the data at the same time.
Facilitates storage and retrieval of large amount of data.
Easy Data Handling:
o Data fetching is faster because of relational architecture.
o Data redundancy or duplicity is avoided due to keys, indexes, and normalization
principles.
o Data consistency is ensured because RDBMS is based on ACID properties for data
transactions (Atomicity Consistency Isolation Durability).
o Fault Tolerance: Replication of databases provides simultaneous access and helps
the system recover in case of disasters, such as power failures or sudden
shutdowns
DBMS vs RDBMS
DBMS stores data as files; RDBMS stores data in tabular form.
In a DBMS, data elements must be accessed individually; in an RDBMS, multiple data elements can be accessed at the same time.
In a DBMS there is no relationship between data; in an RDBMS, data is stored in the form of tables which are related to each other.
Normalization is not present in a DBMS; normalization is present in an RDBMS.
A DBMS does not support distributed databases; an RDBMS supports distributed databases.
A DBMS stores data in either a navigational or hierarchical form; an RDBMS uses a tabular structure where the headers are the column names and the rows contain the corresponding values.
A DBMS deals with small quantities of data; an RDBMS deals with large amounts of data.
Data redundancy is common in the DBMS model; in an RDBMS, keys and indexes do not allow data redundancy.
A DBMS is used by small organizations dealing with small amounts of data; an RDBMS is used to handle large amounts of data.
A DBMS supports a single user; an RDBMS supports multiple users.
In a DBMS, data fetching is slower for large amounts of data; in an RDBMS, data fetching is fast because of the relational approach.
Data in a DBMS is subject to low security levels with regard to data manipulation; an RDBMS provides multiple levels of data security.
A DBMS has low software and hardware requirements; an RDBMS has higher requirements.
Examples of DBMS: XML, Windows Registry, etc. Examples of RDBMS: MySQL, PostgreSQL, SQL Server, Oracle, Microsoft Access, etc.
Data Abstraction
Data Abstraction is the process of hiding unwanted or irrelevant details from the end user. It provides different views of the data and helps in achieving data independence, which in turn enhances the security of the data.
There are mainly 3 levels of data abstraction:
View Level or External Schema: This is the highest level of abstraction. Only a part of
the actual database is viewed by the users. This level exists to ease the accessibility of the
database by an individual user. Users view data in the form of rows and columns. Tables and
relations are used to store data. Multiple views of the same database may exist. Users can
just view the data and interact with the database, storage and implementation details are
hidden from them.
Example: If we have a login-id and password in a university system, then as a student,
we can view our marks, attendance, fee structure, etc. But the faculty of the university
will have a different view. He will have options like salary, edit marks of a student,
enter attendance of the students, etc. So, both the student and the faculty have a
different view. By doing so, the security of the system also increases. In this example,
the student can't edit his marks but the faculty who is authorized to edit the marks can
edit the student's marks. Similarly, the dean of the college or university will have some
more authorization and accordingly, he will have his view. So, different users will have
a different view according to the authorization they have.
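As a rough illustration of the view level, a DBA can expose only a restricted view of a base table to students; the table, view and role names below (student_record, student_view, student_role) are hypothetical:

-- Base table (hypothetical) holding the full student records.
CREATE TABLE student_record (
    roll_no          INT PRIMARY KEY,
    name             VARCHAR(50),
    marks            INT,
    attendance       INT,
    fee_status       VARCHAR(20),
    internal_remarks VARCHAR(100)   -- not meant for students
);

-- External schema for students: only the columns they are allowed to see.
CREATE VIEW student_view AS
SELECT roll_no, name, marks, attendance, fee_status
FROM student_record;

-- Students may only read this view; faculty would be granted a different view
-- (for example, one that also allows editing marks).
GRANT SELECT ON student_view TO student_role;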
Physical Level or Internal Schema: This is the lowest level of data abstraction. It describes how the data is actually stored in memory: access methods such as sequential or random access and file organization methods such as B+ trees and hashing are used for this purpose. Usability, the size of memory, and the number of times the records are accessed are factors that we need to consider while designing the database.
Example: The Database Administrators (DBA) decides that which data should be kept at
which particular disk drive, how the data has to be fragmented, where it has to be
stored etc. They decide if the data has to be centralized or distributed. Though we see
the data in the form of tables at view level the data here is actually stored in the form
of files only. It totally depends on the DBA, how he/she manages the database at the
physical level. Suppose we need to store the details of an employee. Blocks of storage
and the amount of memory used for these purposes are kept hidden from the user.
Logical Level or Conceptual Level: This level describes what data is stored in the database and how that data is structured. Different data models can be used to store the data at this level, and the relationships among the data entities are recorded in relatively simple structures. The storage and implementation details of the physical level are hidden from this level, just as this level's structure is hidden from users at the view level.
Example: Let us take an example where we use the relational model for storing the
data. We have to store the data of a student; the columns in the student table will be
student_name, age, mail_id, roll_no etc. We have to define all these at this level while
we are creating the database. Though the data is stored in the database but the
structure of the tables like the student table, teacher table, books table, etc are defined
here in the conceptual level or logical level. Also, how the tables are related to each
other is defined here. Overall, we can say that we are creating a blueprint of the data
at the conceptual level.
Database Administrator
The DBA can be a single person or a group of persons.
The DBA has central control over both the data and the application programs.
The DBA is responsible for everything related to the database: he makes policies and strategies and provides technical support.
Role / Responsibilities of DBA
1. Schema definition & modification –
o Database schema is the overall structure of the database. It is the responsibility
of DBA to create schema.
o Database schema is created by writing a set of data definition statements in
DDL.
o Modification of schema can also be performed by DBA.
2. Granting of authorization for data access –
o The DBA can grant authorization for data access to different users so that the database is not accessible to unauthorized users (see the combined DDL/GRANT sketch after this list).
3. Routine maintenance checks –
o The DBA is responsible for taking database backups periodically in order to be able to recover data from failures due to human, hardware or software malfunction and to restore the database to a consistent state.
4. New Software Installation –
o It is the responsibility of the DBA to install new DBMS software, application
software & other related software. DBA also test the new software after
installation.
5. Monitoring Performance –
o It is the responsibility of the DBA to monitor the performance of the database, for example by keeping track of the CPU and memory usage of the computer running it.
6. Security enforcement & administration –
o DBA is responsible for establishing & monitoring the security of the database
system.
o The DBA checks and audits security problems and can also add and remove users if required.
7. Deciding the storage structure & access strategy –
o The DBA must also decide how the data is to be represented in the database &
must specify the representation by writing the storage structure definition.
8. Database availability –
o The DBA ensures that database information is always available to all users in a form that suits their needs.
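A minimal sketch of responsibilities 1 and 2 above (schema definition/modification in DDL and granting of authorization); the table, column and user names are assumptions made for illustration:

-- 1. Schema definition using DDL
CREATE TABLE employee (
    emp_id   INT PRIMARY KEY,
    emp_name VARCHAR(50),
    dept_id  INT
);

-- Schema modification
ALTER TABLE employee ADD salary DECIMAL(10, 2);

-- 2. Granting (and revoking) authorization for data access
GRANT SELECT, INSERT ON employee TO clerk_user;
REVOKE INSERT ON employee FROM clerk_user;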
Network Model
This is an extension of the hierarchical model. In this model, data is organized more like a graph, and a record is allowed to have more than one parent node.
In this database model the data is more inter-related, as more relationships can be established between records; because the data is more related, accessing it is also easier and faster.
This database model was used to map many-to-many data relationships.
This was the most widely used database model, before Relational Model was
introduced.
Entity-relationship Model
E-R Models are defined to represent the relationships into pictorial form to make it
easier to understand.
This model is good to design a database, which can then be turned into tables in
relational model.
Let's take an example, If we have to design a School Database, then Student will be an
entity with attributes name, age, address etc. As Address is generally complex, it can be
another entity with attributes street name, pincode, city etc, and there will be a
relationship between them.
Relational Model
In this model, data is organized in two-dimensional tables, and relationships are maintained by storing a common field.
This model was introduced by E. F. Codd in 1970, and since then it has been the most widely used database model; in fact, we can say it is the dominant database model used around the world.
The basic structure of data in the relational model is tables. All the information related
to a particular type is stored in rows of that table.
Hence, tables are also known as relations in relational model.
Table
A table is a collection of data elements organized in terms of rows and columns.
A table is also considered a convenient representation of a relation, but a table can have duplicate rows of data while a true relation cannot have duplicate tuples.
A table is the simplest form of data storage.
Tuple
A single row of a table, which contains a single record for that relation, is called a
tuple.
Null Value
NULL is the term used to represent a missing value.
A NULL value in a table is a value in a field that appears to be blank.
Attribute Domain
Every attribute has a pre-defined scope of permitted values, known as the attribute domain.
For example, Age cannot be less than zero, and telephone numbers cannot contain a digit outside 0-9 (see the CHECK-constraint sketch below).
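As a rough sketch, an attribute domain can be enforced in SQL with a data type plus a CHECK constraint; the table below is hypothetical, and the digits-only pattern uses SQL Server-style LIKE syntax:

CREATE TABLE person (
    person_id INT PRIMARY KEY,
    age       INT CHECK (age >= 0),                        -- domain rule: age cannot be less than zero
    phone     CHAR(10) CHECK (phone NOT LIKE '%[^0-9]%')   -- domain rule: digits 0-9 only (SQL Server-style pattern)
);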
• DBMS architecture depends upon how users are connected to the database to get their
request done.
1-TIER ARCHITECTURE
o In this architecture, the database is directly available to the user: the user can sit directly at the DBMS and use it.
o Any changes made here are applied directly to the database itself. It doesn't provide a handy tool for end users.
o The 1-Tier architecture is used for the development of local applications, where programmers can communicate directly with the database for a quick response.
2-TIER ARCHITECTURE
o The 2-Tier architecture is the same as a basic client-server architecture. In the two-tier architecture, applications on the client end can directly communicate with the database at the server side. For this interaction, APIs like ODBC and JDBC are used.
o The user interfaces and application programs are run on the client-side.
o The server side is responsible for providing functionalities like query processing and transaction management.
o To communicate with the DBMS, the client-side application establishes a connection with the server side.
3-TIER ARCHITECTURE
o The 3-Tier architecture contains another layer between the client and server. In this
architecture, client can't directly communicate with the server.
o The application on the client-end interacts with an application server which further
communicates with the database system.
o The end user has no idea about the existence of the database beyond the application server, and the database has no idea about any user beyond the application server.
o The 3-Tier architecture is used in the case of large web applications.
Entity
o An entity may be a real-world object with a physical (tangible) or a conceptual (intangible) existence.
o Tangible Entity: tangible entities are those entities which physically exist in the real world. Example: a person, a car, etc.
o Intangible Entity: intangible entities are those entities which exist only logically and have no physical existence. Example: a bank account, etc.
o In the ER diagram, an entity is represented as a rectangle.
Components of an ER Diagram
In the Professor example used in this section, Professor_Name, Professor_Address and Professor_Salary are attributes, and Professor_ID is the primary key.
Strong Entity
A strong entity has a primary key, and its existence does not depend on any other entity; weak entities, in contrast, depend on a strong entity.
A strong entity is represented by a single rectangle.
Continuing our previous example, Professor is a strong entity here, and the primary key is Professor_ID.
Weak Entity
An entity that depends on another entity is called a weak entity.
A weak entity does not contain any key attribute of its own.
A weak entity in DBMS has no primary key and is dependent on its parent (owner) entity.
A weak entity is represented by a double rectangle.
Continuing our previous example, Professor is a strong entity with primary key Professor_ID; another entity, Professor_Dependents, is our weak entity.
A professor's dependents can be his/her father, mother, siblings, child, husband, wife, and so on.
Example of Strong and Weak Entity
The strong entity is Professor, whereas Dependent is a weak entity.
ID is the primary key of Professor (shown underlined), and Name in the Dependent entity is called a partial key (shown with a dotted underline). A mapping of this example to SQL tables is sketched below.
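A minimal, assumed mapping of this strong/weak entity example to relational tables; the weak entity's primary key combines the owner's key with its partial key:

CREATE TABLE Professor (
    Professor_ID   INT PRIMARY KEY,
    Professor_Name VARCHAR(50)
);

-- Weak entity: identified by the owning Professor plus its partial key (Name).
CREATE TABLE Dependent (
    Professor_ID INT,
    Name         VARCHAR(50),      -- partial key
    Relationship VARCHAR(20),
    PRIMARY KEY (Professor_ID, Name),
    FOREIGN KEY (Professor_ID) REFERENCES Professor(Professor_ID)
        ON DELETE CASCADE          -- a dependent cannot exist without its professor
);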
Entity Type
An entity type is a collection of entities having similar attributes.
So, in simple words we can say that an entity type in an ER diagram is defined by an entity name (here, STUDENT) and a set of attributes of that entity (here, Roll_no, Student_name, Age, Mobile_no).
Entity Set
A set of all entities of the same entity type is called an entity set.
E.g., E1 is an entity of entity type Student, and the set of all students is called the entity set. In the ER diagram, an entity type is represented as a rectangle.
(Figure: Entity, Entity Type & Entity Set.)
Attribute
An attribute is used to describe a property of an entity.
Or, in simple words, we can say that attributes are the properties or features of an entity.
An ellipse is used to represent an attribute.
E.g., roll_no, name, birthdate, etc. can be attributes of a student.
Types of Attributes
1. Key Attributes
2. Composite Attribute
3. Multivalued Attributes
4. Derived Attributes
5. Some More Attributes
1. Key Attribute
The attribute which uniquely identifies each entity in the entity set is called key
attribute.
It represents a primary key.
The key attribute is represented by an ellipse with the text underlined.
For example, Roll_No will be unique for each student.
2. Composite Attribute
An attribute that composed of many other attributes is known as a composite attribute.
The composite attribute is represented by an ellipse, and those ellipses are connected
with an ellipse.
For example, Address attribute of student Entity type consists of Street, City, State, and
Country.
3. Multivalued Attribute
An attribute which can have more than one value, is known as a multivalued attribute.
The double oval is used to represent multivalued attribute.
For example, a student can have more than one phone number (Phone_no).
4. Derived Attribute
An attribute which can be derived from other attribute is known as a derived attribute.
It can be represented by a dashed ellipse.
For example, a person's age (Age) changes over time and can be derived from another attribute like date of birth.
MAPPING CONSTRAINTS
o A mapping constraint is a data constraint that expresses the number of entities to which
another entity can be related via a relationship set.
o It is most useful in describing binary relationship sets, although it can also contribute to the description of relationship sets involving more than two entity sets.
o For binary relationship set R on an entity set A and B, there are four possible mapping
cardinalities. These are as follows:
1. One to one (1:1)
2. One to many (1:M)
3. Many to one (M:1)
4. Many to many (M:M)
1. One-to-one
In one-to-one mapping, an entity in E1 is associated with at most one entity in E2, and an
entity in E2 is associated with at most one entity in E1.
2. One-to-many
In one-to-many mapping, an entity in E1 is associated with any number of entities in E2, while an entity in E2 is associated with at most one entity in E1 (illustrated in the SQL sketch after this list).
3. Many-to-one
In many-to-one mapping, an entity in E1 is associated with at most one entity in E2, and an entity in E2 is associated with any number of entities in E1.
4. Many-to-many
In many-to-many mapping, an entity in E1 is associated with any number of entities in E2,
and an entity in E2 is associated with any number of entities in E1.
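As a rough sketch, a one-to-many mapping (one department, many employees) is usually realized with a foreign key on the "many" side; the names below are illustrative, not from the notes:

CREATE TABLE Dept (
    Dept_Id   INT PRIMARY KEY,
    Dept_Name VARCHAR(50)
);

-- Many employees may reference one department (1:M from Dept to Emp).
CREATE TABLE Emp (
    Emp_Id   INT PRIMARY KEY,
    Emp_Name VARCHAR(50),
    Dept_Id  INT,
    FOREIGN KEY (Dept_Id) REFERENCES Dept(Dept_Id)
);
-- A 1:1 mapping would additionally make Emp.Dept_Id UNIQUE;
-- an M:M mapping needs a separate junction table holding both keys.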
GENERALIZATION
• Generalization is like a bottom-up approach in which two or more entities of lower level
combine to form a higher level entity if they have some attributes in common.
• In generalization, an entity of a higher level can also combine with the entities of the
lower level to form a further higher level entity.
• Generalization is more like subclass and superclass system, but the only difference is the
approach. Generalization uses the bottom-up approach.
• In generalization, entities are combined to form a more generalized entity, i.e.,
subclasses are combined to make a superclass.
• For example, Faculty and Student entities can be generalized and create a higher level
entity Person.
AGGREGATION
In aggregation, the relation between two entities is treated as a single entity. In aggregation,
relationship with its corresponding entities is aggregated into a higher level entity.
For example: the relationship "Center offers Course" is treated as a single entity, which then participates in a relationship with another entity, Visitor. In the real world, when a visitor visits a coaching center, he never enquires about the Course alone or about the Center alone; he enquires about both together.
Sorted File Method – In this method, as the name itself suggests, whenever a new record has to be inserted, it is always inserted in sorted (ascending or descending) order. Sorting of records may be based on the primary key or on any other key.
If any record has to be retrieved based on its index value, then the address of the data block is
fetched and the record is retrieved from the memory.
Pros of ISAM:
Since each index entry holds the address of its data block, searching for a record even in a huge database is quick and easy.
This method supports range retrieval and partial retrieval of records. Since the index is
based on the primary key values, we can retrieve the data for the given range of value. In
the same way, the partial value can also be easily searched, i.e., the student name starting
with 'JA' can be easily searched.
Cons of ISAM
This method requires extra space in the disk to store the index value.
When the new records are inserted, then these files have to be reconstructed to maintain
the sequence.
When the record is deleted, then the space used by it needs to be released. Otherwise, the
performance of the database will slow down.
Relational Algebra
Relational algebra is a procedural query language. It gives a step by step process to obtain
the result of the query. It uses operators to perform queries.
Definition
“Relational algebra is a theoretical language with operators that are applied on one or
two relations to produce another relation.”
Both the operands and the result are relations (tables).
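A small assumed example: the relational algebra expression π name ( σ dept = 'CS' (Student) ) first selects the CS students and then projects their names; an equivalent SQL query is sketched below (table and column names are illustrative):

-- Equivalent of  π name ( σ dept = 'CS' (Student) )
SELECT name           -- projection (π)
FROM Student
WHERE dept = 'CS';    -- selection (σ)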
NORMALIZATION UP TO DKNF
It is basically a process in databases to organize data efficiently. There are two main goals of normalization:
1. To remove repeated data, or in simple words, to remove redundant data.
2. To ensure that the remaining data dependencies make sense, i.e., that data is stored logically.
DKNF (Domain-Key Normal Form) requires that the database contain no constraints other than domain constraints and key constraints.
In DKNF, it is easy to build a database.
It avoids general constraints in the database which are not clear domain or key constraints.
3NF, 4NF, 5NF and BCNF are special cases of DKNF.
DKNF is achieved when every constraint on the relation is a logical consequence of the definition of its keys and domains.
Unit-2
Object Oriented Data Model
Object Oriented Data Model = Combination of Object Oriented Programming + Relational Data Model
Features of Distributed Databases
Databases in the collection are logically interrelated with each other. Often they represent
a single logical database.
Data is physically stored across multiple sites. Data in each site can be managed by a
DBMS independent of the other sites.
The processors in the sites are connected via a network. They do not have any
multiprocessor configuration.
A distributed database is not a loosely connected file system.
A distributed database incorporates transaction processing, but it is not synonymous with
a transaction processing system.
Advantages of Distributed Databases
Following are the advantages of distributed databases over centralized databases.
Modular Development − If the system needs to be expanded to new locations or new
units, in centralized database systems, the action requires substantial efforts and
disruption in the existing functioning. However, in distributed databases, the work simply
requires adding new computers and local data to the new site and finally connecting them
to the distributed system, with no interruption in current functions.
More Reliable − In case of database failures, the total system of centralized databases comes to a halt. However, in distributed systems, when a component fails, the functioning of the system continues, possibly at reduced performance. Hence a DDBMS is more reliable.
Better Response − If data is distributed in an efficient manner, then user requests can be
met from local data itself, thus providing faster response. On the other hand, in
centralized systems, all queries have to pass through the central computer for processing,
which increases the response time.
Lower Communication Cost − In distributed database systems, if data is located locally
where it is mostly used, then the communication costs for data manipulation can be
minimized. This is not feasible in centralized systems.
Types of Distributed Databases
Distributed databases can be broadly classified into homogeneous and heterogeneous
distributed database environments, each with further sub-divisions, as shown in the following
illustration.
Fully Replicated
In this design alternative, a copy of all the database tables is stored at each site. Since each site has its own copy of the entire database, queries are very fast and require negligible communication cost. On the contrary, the massive redundancy in data incurs a huge cost during update operations. Hence, this design is suitable for systems where a large number of queries must be handled while the number of database updates is low.
Partially Replicated
Copies of tables or portions of tables are stored at different sites. The distribution of the tables is done in accordance with the frequency of access, taking into account the fact that the frequency of accessing the tables varies considerably from site to site. The number of copies of a table (or of its portions) depends on how frequently the access queries execute and on which sites generate those queries.
Fragmented
In this design, a table is divided into two or more pieces referred to as fragments or
partitions, and each fragment can be stored at different sites. This considers the fact that
it seldom happens that all data stored in a table is required at a given site. Moreover,
fragmentation increases parallelism and provides better disaster recovery. Here, there is
only one copy of each fragment in the system, i.e. no redundant data.
The three fragmentation techniques are −
Vertical fragmentation
Horizontal fragmentation
Hybrid fragmentation
Mixed Distribution
This is a combination of fragmentation and partial replications. Here, the tables are
initially fragmented in any form (horizontal or vertical), and then these fragments are
partially replicated across the different sites according to the frequency of accessing the
fragments.
Types of locks
1. Shared lock
Lock-s (read data item value)
2. Exclusive lock
Lock-x (both read and write)
Lock compatibility matrix (T = the requested lock can be granted, F = it conflicts):
        S    X
   S    T    F
   X    F    F
Note: Any number of transactions can hold a shared lock on an item, but an exclusive lock can be held by only one transaction at a time (see the SQL sketch below).
CONVERSION OF LOCKS
i. Upgrading – Read-Lock -> Write-Lock
ii. Downgrading – Write-Lock -> Read-Lock
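A rough sketch of shared and exclusive locks using MySQL/PostgreSQL-style explicit locking clauses (the exact syntax varies between systems; the account table is hypothetical):

-- Transaction T1: shared (read) lock on a row; other readers are still allowed.
START TRANSACTION;
SELECT balance FROM account WHERE acc_no = 101 FOR SHARE;    -- Lock-S
COMMIT;

-- Transaction T2: exclusive (read/write) lock; conflicts with every other lock on that row.
START TRANSACTION;
SELECT balance FROM account WHERE acc_no = 101 FOR UPDATE;   -- Lock-X
UPDATE account SET balance = balance - 500 WHERE acc_no = 101;
COMMIT;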
RECOVERY IN DDBMS
Recovery is the most complicated process in distributed databases. Recovery of a failed
system in the communication network is very difficult.
For example:
Consider that location A sends a message to location B and expects a response from B, but B is unable to receive it. Several problems can arise in this situation.
NEED OF RECOVERY
A database is a very huge system with lots of data and transactions. Transactions are executed on the database every second and are critical to it. If there is any failure or crash while executing a transaction, it is expected that no data is lost. It is necessary to revert the changes of the transaction to the previously committed point. There are various techniques to recover the data, depending on the type of failure or crash.
Generalization of failure:
Transaction failure
System crash
Disk failure
It should check the states of all the transactions, which were being executed.
A transaction may be in the middle of some operation; the DBMS must ensure the
atomicity of the transaction in this case.
It should check whether the transaction can be completed now or it needs to be rolled
back.
No transactions would be allowed to leave the DBMS in an inconsistent state.
TRANSACTIONS
A transaction is a program comprising a collection of database operations, executed as a logical unit of data processing. The operations performed in a transaction include one or more database operations like insert, delete, update or retrieve. It is an atomic process that is either performed to completion in its entirety or not performed at all. A transaction involving only data retrieval without any data update is called a read-only transaction.
Each high level operation can be divided into a number of low level tasks or operations. For
example, a data update operation can be divided into three tasks −
read_item() − reads data item from storage to main memory.
modify_item() − change value of item in the main memory.
write_item() − write the modified value from main memory to storage.
Database access is restricted to read_item() and write_item() operations. Likewise, for all
transactions, read and write forms the basic database operations.
Transaction Operations
The low level operations performed in a transaction are −
begin_transaction − A marker that specifies start of transaction execution.
Read_item or write_item − Database operations that may be interleaved with main
memory operations as a part of transaction.
end_transaction − A marker that specifies end of transaction.
Commit − A signal to specify that the transaction has been successfully completed in its
entirety and will not be undone.
Rollback − A signal to specify that the transaction has been unsuccessful, so all temporary changes in the database are undone. A committed transaction cannot be rolled back. (A SQL-level sketch of these operations follows.)
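A minimal SQL-level sketch of these operations (the exact keywords vary by DBMS: BEGIN/START TRANSACTION, COMMIT and ROLLBACK are the common forms); the funds-transfer example and account table are hypothetical:

-- begin_transaction
START TRANSACTION;

-- read_item / write_item: transfer 500 from account 101 to account 202
UPDATE account SET balance = balance - 500 WHERE acc_no = 101;
UPDATE account SET balance = balance + 500 WHERE acc_no = 202;

-- end_transaction: commit makes all changes permanent ...
COMMIT;

-- ... whereas, had something gone wrong before the commit, ROLLBACK would undo
-- all temporary changes (a committed transaction cannot be rolled back):
-- ROLLBACK;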
Transaction States
A transaction may go through a subset of five states, active, partially committed, committed,
failed and aborted.
Active − the initial state where the transaction enters is the active state. The transaction
remains in this state while it is executing read, write or other operations.
Partially Committed − the transaction enters this state after the last statement of the
transaction has been executed.
Committed − the transaction enters this state after successful completion of the
transaction and system checks have issued commit signal.
Failed − the transaction goes from partially committed state or active state to failed state
when it is discovered that normal execution can no longer proceed or system checks fail.
Aborted − this is the state after the transaction has been rolled back after failure and the
database has been restored to its state that was before the transaction began.
The following state transition diagram depicts the states in the transaction and the low level
transaction operations that cause change in states.
Types of Schedules
There are two types of schedules −
Serial Schedules − in a serial schedule, at any point of time, only one transaction is active,
i.e. there is no overlapping of transactions. This is depicted in the following graph –
Parallel Schedules − In parallel schedules, more than one transaction is active simultaneously, i.e. the transactions contain operations that overlap in time. This is depicted in the following graph −
Conflicts in Schedules
In a schedule comprising multiple transactions, a conflict occurs when two active transactions perform non-compatible operations. Two operations are said to be in conflict when all of the following three conditions exist simultaneously −
The two operations are parts of different transactions.
Both the operations access the same data item.
At least one of the operations is a write_item() operation, i.e. it tries to modify the data
item.
SERIALIZABILITY
A serializable schedule of 'n' transactions is a parallel schedule which is equivalent to some serial schedule of the same 'n' transactions. A serializable schedule retains the correctness of a serial schedule while achieving the better CPU utilization of a parallel schedule.
Equivalence of Schedules
Equivalence of two schedules can be of the following types −
Result equivalence − Two schedules producing identical results are said to be result
equivalent.
View equivalence − Two schedules that perform similar action in a similar manner are
said to be view equivalent.
Conflict equivalence − Two schedules are said to be conflict equivalent if both contain the same set of transactions and have the same order of conflicting pairs of operations.
TWO PHASE LOCKING PROTOCOL
Every transaction locks and unlocks data items in two different phases.
Growing Phase − All the locks are acquired in this phase; no lock is released.
Shrinking Phase − No new locks are acquired in this phase; the locks already held are released one by one.
The 2PL locking protocol is represented diagrammatically as follows −
In the growing phase, the transaction reaches a point where all the locks it needs have been acquired. This point is called the LOCK POINT.
After the lock point has been reached, the transaction enters the shrinking phase.
Types
Two phase locking is of two types −
Strict two phase locking protocol
A transaction can release a shared lock after the lock point, but it cannot release any exclusive lock until the transaction commits. This protocol produces cascadeless schedules.
Deadlock example: Suppose T1 holds an exclusive lock on A and T2 holds an exclusive lock on B. Then,
Lock-X (B) by T1: cannot be granted, since B is locked by T2.
Lock-X (A) by T2: cannot be granted, since A is locked by T1.
In this situation T1 waits for B and T2 waits for A, and the waiting never ends. Neither transaction can proceed unless at least one of them releases its lock voluntarily. This situation is called a deadlock.
Wait-for graph (WFG): It is used in the deadlock-detection method. A node is created for each transaction, and an edge from Ti to Tj is created if Ti is waiting to lock an item currently locked by Tj. A cycle in the WFG indicates that a deadlock has occurred. The WFG is created (checked) at regular intervals.
Unit-3
QUERY OPTIMIZATION
Query Processing: Query processing refers to activities including translation of high level
languages (HLL) queries into operations at physical file level, query optimization
transformations, and actual evaluation of queries.
The activities involved in parsing, validating, optimizing and executing a query are collectively called query processing.
Steps
The steps involved in query processing and optimization are as follows −
A sequence of primitive operations that can be used to evaluate a query is called query
execution plan or query evaluation plan.
The query execution engine takes a query evaluation plan, executes that plan and produces the desired output. The different execution plans for a given query can have different costs, for example in terms of the number of disk accesses. It is the responsibility of the system to construct a query evaluation plan which minimizes the cost of query evaluation. This task is called query optimization.
A query is expressed in a high-level query language and is scanned, parsed and validated. The scanner identifies the SQL keywords, attribute names and relation names in the text of the query.
The parser checks the syntax which is used to determine if the query is formulated
according to syntax rules of the query language.
Finally, the query is evaluated by checking that all attributes and relation names are valid
and semantically meaningful in the schema of a particular database.
An internal representation of the query is then created which is either a tree or a graph
known as query tree or query graph.
If the query is written in SQL, it is translated into relational algebra and its internal representation is a query tree; if it is written in tuple or domain relational calculus (TRC or DRC), its internal representation is a query graph. A query generally has many possible execution strategies, and the process of choosing a strategy with minimum cost is called query optimization.
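As an assumed illustration, most relational systems expose the chosen evaluation plan through an EXPLAIN command (output format differs between MySQL, PostgreSQL and other systems); the query and tables below are hypothetical:

-- Ask the optimizer to show the evaluation plan it would pick for this query
-- instead of executing it.
EXPLAIN
SELECT s.name, d.dept_name
FROM student s
JOIN department d ON d.dept_id = s.dept_id
WHERE s.age > 20;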
The query processing and optimization in the DBMS are explained in the form of a diagram
below −
ALGORITHM FOR EXTERNAL SORTING
External sorting
External sorting is a technique in which the data is stored on the secondary memory, in
which part by part data is loaded into the main memory and then sorting can be done over
there. Then this sorted data will be stored in the intermediate files. Finally, these files will be
merged to obtain the sorted data. Thus, by using the external sorting technique, a huge amount of data can be sorted easily. In external sorting, all the data cannot be accommodated in main memory at once, so some of it has to be kept on secondary storage such as a hard disk, compact disc and so on.
External sorting is required when the data to be sorted does not fit into main memory. Basically, it consists of two phases:
1. Run-creation (sort) phase – load as much data as fits into main memory, sort it, and write it out as a sorted run (intermediate file).
2. Merge phase – merge the sorted runs to produce the final sorted output.
For example, let us consider 10,000 records which have to be sorted using the external merge sort method. Suppose the main memory has a capacity to hold 500 records at a time, organized as 5 blocks of 100 records each.
In this example, 5 blocks (500 records) are sorted and written to an intermediate file in each pass, and this is repeated 20 times to cover all the records. Then we start merging pairs of intermediate files in the main memory to get a sorted output.
Stage 1: First break the records into blocks and sort the individual blocks, writing the sorted runs with the help of two input tapes.
Stage 2: Merge the sorted runs and create longer sorted runs with the help of two output tapes.
Thus, two-way merge sort uses two input tapes and two output tapes for sorting the data.
Stage 3: Repeat stage 2 to get longer and longer runs on alternate tapes; finally, we will get a single sorted list.
Analysis
This algorithm requires roughly log(N/M) merge passes in addition to the initial run-creation pass. Each pass processes all N records, so the overall time complexity is O(N log(N/M)); the example above is worked out below.
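Plugging the example's numbers into this analysis (a worked calculation, assuming a two-way merge as described above):

\[
\text{initial sorted runs} = \frac{N}{M} = \frac{10{,}000}{500} = 20, \qquad
\text{merge passes} = \lceil \log_2 20 \rceil = 5
\]
\[
\text{total passes} = 1 + 5 = 6, \qquad
\text{cost} = O\!\left(N \log \tfrac{N}{M}\right) = O(10{,}000 \times \log 20)
\]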
MULTI-MEDIA DATABASE
Multimedia database is a collection of multimedia data which includes text, images,
graphics (drawings, sketches), animations, audio, and video, among others. These databases
have extensive amounts of data which can be multimedia and multisource. The framework
which manages these multimedia databases and their different types so that the data can be
stored, utilized, and delivered in more than one way is known as a multimedia database
management system.
The multimedia database can be classified into three types. These types are:
1. Static media
2. Dynamic media
3. Dimensional media
The types of multimedia applications that are based on the data management
characteristics are:
1. Repository applications: An extensive amount of multimedia data stored along with
metadata for retrieval purposes.
2. Presentation applications: These involve the delivery of multimedia data subject to the
temporal constraint. An optimal viewing or listening experience requires DBMS to deliver
the data at a certain rate which offers the quality of service, which is above a particular
threshold. This data is processed as it is being delivered.
3. Collaborative work using multimedia information: It involves the execution of a complex
task by merging drawings and changing notifications.
The process of extracting information to identify patterns, trends, and useful data that
would allow the business to take the data-driven decision from huge sets of data is called
Data Mining.
In other words, we can say that data mining is the process of investigating hidden patterns in data from various perspectives in order to categorize it into useful information. This information is collected and assembled in particular areas such as data warehouses and is analysed with efficient data mining algorithms, which supports decision making and other data requirements and eventually helps in cutting costs and generating revenue.
Data mining is the act of automatically searching large stores of information to find trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also called Knowledge Discovery in Databases (KDD).
Data Mining is a process used by organizations to extract specific data from huge databases
to solve business problems. It primarily turns raw data into useful information.
Data mining is similar to data science: it is carried out by a person, in a specific situation, on a particular data set, with a specific objective. The process includes various types of services such as text mining, web mining, audio and video mining, pictorial data mining, and social media mining. It is done through software that may be simple or highly specialized. There are many powerful tools and techniques available to mine data and gain better insight from it.
Relational Database:
A relational database is a collection of multiple data sets formally organized by tables,
records, and columns from which data can be accessed in various ways without having to
recognize the database tables. Tables convey and share information, which facilitates data
search-ability, reporting, and organization.
Data warehouses:
A Data Warehouse is the technology that collects the data from various sources within the
organization to provide meaningful business insights. The huge amount of data comes from
multiple places such as Marketing and Finance. The extracted data is utilized for analytical
purposes and helps in decision- making for a business organization. The data warehouse is
designed for the analysis of data rather than transaction processing.
Data Repositories:
The data repository generally refers to a destination for data storage. However, many IT professionals use the term more specifically to refer to a particular kind of setup within an IT structure, for example a group of databases where an organization has kept various kinds of information.
Object-Relational Database:
A combination of an object-oriented database model and relational database model is
called an object-relational model. It supports Classes, Objects, Inheritance, etc. One of the
primary objectives of the Object-relational data model is to close the gap between the
Relational database and the object-oriented model practices frequently utilized in many
programming languages, for example, C++, Java, C#, and so on.
Transactional Database:
A transactional database refers to a database management system (DBMS) that has the
potential to undo a database transaction if it is not performed appropriately. Even though
this was a unique capability a very long while back, today, most of the relational database
systems support transactional database activities.
Association rule learning is one of the very important concepts of machine learning, and it is employed in market basket analysis, web usage mining, continuous production, etc. Market basket analysis is a technique used by large retailers to discover associations between items. We can understand it with the example of a supermarket: products that are frequently purchased together are placed together. For example, if a customer buys bread, he is also likely to buy butter, eggs, or milk, so these products are stored on the same shelf or mostly nearby. Consider the below diagram:
Association rule learning can be divided into three types of algorithms:
1. Apriori
2. Éclat
3. F-P Growth Algorithm
Association rule learning works on if-then rules of the form "if A then B". Here the "if" part is called the antecedent, and the "then" part is called the consequent. A relationship of this kind between two items is known as single cardinality; as the number of items in a rule increases, the cardinality increases accordingly. To measure the associations between thousands of data items, several metrics are used. These metrics are given below:
o Support
o Confidence
o Lift
Confidence
Confidence indicates how often the rule has been found to be true, i.e. how often the items X and Y occur together in the dataset given that X occurs. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X.
Lift
It measures the strength of a rule. It is the ratio of the observed support to the support expected if X and Y were independent of each other (see the formulas after this list). It has three possible values:
o If Lift= 1: The probability of occurrence of antecedent and consequent is independent of
each other.
o Lift>1: It determines the degree to which the two item sets are dependent to each other.
o Lift<1: It tells us that one item is a substitute for other items, which means one item has a
negative effect on another.
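The three metrics listed above can be written as the following standard formulas for a rule X ⇒ Y (the support formula, not spelled out in the notes, is included for completeness):

\[
\mathrm{Support}(X \Rightarrow Y) = \frac{\text{number of transactions containing both } X \text{ and } Y}{\text{total number of transactions}}
\]
\[
\mathrm{Confidence}(X \Rightarrow Y) = \frac{\mathrm{Support}(X \cup Y)}{\mathrm{Support}(X)}, \qquad
\mathrm{Lift}(X \Rightarrow Y) = \frac{\mathrm{Support}(X \cup Y)}{\mathrm{Support}(X)\;\mathrm{Support}(Y)}
\]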
Apriori Algorithm
This algorithm uses frequent datasets to generate association rules. It is designed to work on
the databases that contain transactions. This algorithm uses a breadth-first search and Hash
Tree to calculate the item set efficiently.
It is mainly used for market basket analysis and helps to understand the products that can be
bought together. It can also be used in the healthcare field to find drug reactions for
patients.
Éclat Algorithm
Éclat algorithm stands for Equivalence Class Transformation. This algorithm uses a depth-
first search technique to find frequent item sets in a transaction database. It performs faster
execution than Apriori Algorithm.
CLASSIFICATION
Data Mining: Data mining in general terms means mining or digging deep into data that is in
different forms to gain patterns, and to gain knowledge on that pattern. In the process of data
mining, large data sets are first sorted, then patterns are identified and relationships are
established to perform data analysis and solve problems.
Classification: It is a data analysis task, i.e. the process of finding a model that describes and distinguishes data classes and concepts. Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
Example: Before starting any project, we need to check its feasibility. In this case, a classifier
is required to predict class labels such as ‘Safe’ and ‘Risky’ for adopting the Project and to
further approve it. It is a two-step process such as:
1. Learning Step (Training Phase): Construction of Classification Model
Different Algorithms are used to build a classifier by making the model learn using the
training set available. The model has to be trained for the prediction of accurate results.
2. Classification Step: Model used to predict class labels and testing the constructed model
on test data and hence estimate the accuracy of the classification rules.
DATA WAREHOUSING
A Data Warehouse (DW) is a relational database that is designed for query and analysis
rather than transaction processing. It includes historical data derived from transaction data
from single and multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on
providing support for decision-makers for data modeling and analysis.
A Data Warehouse is a group of data specific to the entire organization, not only to a
particular group of users.
It is not used for daily operations and transaction processing but used for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.
Integration Layer
It consists of Operational Data Store and Staging area. Staging area is used to perform data
cleansing, data transformation and loading data from different sources to a data
warehouse. As multiple data sources are available for extraction at different time zones,
staging area is used to store the data and later to apply transformations on data.
Presentation Layer
This is used to perform BI reporting by end users. The data in a DW system is accessed by BI
users and used for reporting and analysis.
The following illustration shows the common architecture of a Data Warehouse System.
Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data warehouses typically provide a concise and straightforward view of a particular subject, such as customer, product, or sales, instead of the organization's ongoing global operations. This is done by excluding data that is not useful for the subject and including all the data needed by the users to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMSs, flat files, and online transaction records. Data cleaning and data integration are performed during data warehousing to ensure consistency in naming conventions, attribute types, etc., among the different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older periods from a data warehouse. This contrasts with a transaction system, where often only the most current data is kept.
Non-Volatile
The data warehouse is a physically separate data storage, which is transformed from the
source operational RDBMS. The operational updates of data do not occur in the data
warehouse, i.e., update, insert, and delete operations are not performed. It usually requires
only two procedures in data accessing: Initial loading of data and access to data. Therefore,
the DW does not require transaction processing, recovery, and concurrency capabilities,
which allows for substantial speedup of data retrieval. Non-volatile means that, once entered into the warehouse, the data should not change.
An XML document typically consists of three parts:
XML declaration
Document type declaration
Document Elements Section
Document Elements are the building blocks of XML. These divide the document into a
hierarchy of sections, each serving a specific purpose. You can separate a document into
multiple sections so that they can be rendered differently, or used by a search engine. The
elements can be containers, with a combination of text and other elements.
2. Data Integrity:
Data integrity means that the data contained in the database is both correct and consistent. For this purpose, the data stored in the database must satisfy certain types of rules (constraints). The DBMS provides different ways to implement such constraints, and this improves the data integrity of the database.
1. Cryptography –
Cryptography is classified into symmetric cryptography, asymmetric cryptography and
hashing. Below are the descriptions of these types.
1. Symmetric key cryptography – It involves usage of one secret key along with encryption
and decryption algorithms which help in securing the contents of the message. The
strength of symmetric key cryptography depends upon the number of key bits. It is
relatively faster than asymmetric key cryptography. There arises a key distribution
problem as the key has to be transferred from the sender to receiver through a secure
channel.
2. Asymmetric key cryptography – It is also known as public key cryptography because it involves the use of a public key along with a secret (private) key. It solves the problem of key distribution, as the two parties use different keys for encryption and decryption. It is not feasible to use for bulk messages, as it is very slow compared to symmetric key cryptography.
3. Hashing – It involves taking the plain text and converting it to a hash value of fixed size using a hash function. This process ensures the integrity of the message, as the hash values computed on the sender's and the receiver's side should match if the message is unaltered (see the sketch below).
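A small illustrative sketch, assuming SQL Server's built-in HASHBYTES function (other DBMSs expose similar hash functions under different names):

-- Compute a fixed-size SHA-256 digest of a message; changing even one character
-- of the message changes the digest, which is how integrity violations are detected.
DECLARE @msg VARCHAR(100) = 'transfer 500 to account 202';
SELECT HASHBYTES('SHA2_256', @msg) AS message_digest;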
SQL string and binary data types:
char(n) – Fixed-width character string; size up to 8,000 characters.
varchar(n) – Variable-width character string; size up to 8,000 characters.
varchar(max) – Variable-width character string; size up to 1,073,741,824 characters.
text – Variable-width character string; size up to 2 GB of text data.
nchar – Fixed-width Unicode string; size up to 4,000 characters.
nvarchar – Variable-width Unicode string; size up to 4,000 characters.
ntext – Variable-width Unicode string; size up to 2 GB of text data.
binary(n) – Fixed-width binary string; size up to 8,000 bytes.
varbinary – Variable-width binary string; size up to 8,000 bytes.
image – Variable-width binary string; size up to 2 GB.
SQL constraints are used to specify rules for the data in a table.
Constraints are used to limit the type of data that can go into a table. This ensures the
accuracy and reliability of the data in the table. If there is any violation between the
constraint and the data action, the action is aborted.
Constraints can be column level or table level. Column level constraints apply to a column,
and table level constraints apply to the whole table.
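A brief assumed sketch showing where column-level and table-level constraints are written; it reuses the hypothetical Student table sketched earlier in these notes:

CREATE TABLE enrollment (
    roll_no   INT NOT NULL,                                      -- column-level constraint
    course_id INT NOT NULL,                                      -- column-level constraint
    grade     CHAR(1) CHECK (grade IN ('A','B','C','D','F')),    -- column-level CHECK
    -- table-level constraints are written after the columns and may span several columns:
    CONSTRAINT pk_enrollment  PRIMARY KEY (roll_no, course_id),
    CONSTRAINT fk_enroll_stud FOREIGN KEY (roll_no) REFERENCES Student(Roll_No)
);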