
[ADVANCED DATABASE AND MINING]

UNIT II
TRANSACTION PROCESSING:
A transaction is a program comprising a collection of database operations, executed as
a logical unit of data processing. The operations performed in a transaction include
one or more database operations such as insert, delete, update or retrieve. It is an
atomic process that is either performed to completion in its entirety or is not performed
at all. A transaction involving only data retrieval without any data update is called a
read-only transaction.
Each high-level operation can be divided into a number of low-level tasks or operations.
For example, a data update operation can be divided into three tasks −
 read_item() − reads the data item from storage into main memory.
 modify_item() − changes the value of the item in main memory.
 write_item() − writes the modified value from main memory back to storage.
Database access is restricted to the read_item() and write_item() operations. Likewise, for
all transactions, read and write form the basic database operations.

TRANSACTION OPERATIONS

The low-level operations performed in a transaction are −

 begin_transaction − A marker that specifies the start of transaction
execution.
 read_item or write_item − Database operations that may be interleaved
with main memory operations as a part of the transaction.
 end_transaction − A marker that specifies the end of the transaction.
 commit − A signal to specify that the transaction has been successfully
completed in its entirety and will not be undone.
 rollback − A signal to specify that the transaction has been unsuccessful
and so all temporary changes in the database are undone. A committed
transaction cannot be rolled back.
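
As a rough illustration (not from the original notes), the following Python sketch maps
these markers onto the sqlite3 module bundled with Python, using an invented accounts
table: the database work happens after the transaction begins, commit makes it permanent,
and any failure triggers a rollback.

import sqlite3

# Hypothetical in-memory database with a single invented accounts table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 300), (2, 100)")
conn.commit()

try:
    # begin_transaction: sqlite3 opens a transaction implicitly on the first write.
    conn.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")  # write_item
    conn.execute("UPDATE accounts SET balance = balance + 10 WHERE id = 2")  # write_item
    conn.commit()        # commit: the changes become permanent and cannot be undone
except sqlite3.Error:
    conn.rollback()      # rollback: all temporary changes are undone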

TRANSACTION STATES

A transaction may go through a subset of five states: active, partially committed,
committed, failed and aborted.
 Active − The initial state that the transaction enters is the active state.
The transaction remains in this state while it is executing read, write or
other operations.
 Partially Committed − The transaction enters this state after the last
statement of the transaction has been executed.
 Committed − The transaction enters this state after successful
completion of the transaction, when the system checks have issued the commit
signal.
 Failed − The transaction goes from the partially committed state or active
state to the failed state when it is discovered that normal execution can no
longer proceed or system checks fail.
 Aborted − This is the state after the transaction has been rolled back
following failure and the database has been restored to the state it was in
before the transaction began.
The following state transition diagram depicts the states of a transaction and the low-
level transaction operations that cause changes in state.

Desirable Properties of Transactions

Any transaction must maintain the ACID properties, viz. Atomicity, Consistency,
Isolation, and Durability.
 Atomicity − This property states that a transaction is an atomic unit of
processing, that is, either it is performed in its entirety or not performed
at all. No partial update should exist.
 Consistency − A transaction should take the database from one
consistent state to another consistent state. It should not adversely
affect any data item in the database.
 Isolation − A transaction should be executed as if it is the only one in the
system. There should not be any interference from the other concurrent
transactions that are simultaneously running.
 Durability − If a committed transaction brings about a change, that
change should be durable in the database and not lost in case of any
failure.


Schedules and Conflicts

In a system with a number of simultaneous transactions, a schedule is the total order
of execution of operations. Given a schedule S comprising n transactions, say T1, T2,
T3, ..., Tn; for any transaction Ti, the operations in Ti must execute as laid down in
the schedule S.
Types of Schedules
There are two types of schedules −
 Serial Schedules − In a serial schedule, at any point of time, only one
transaction is active, i.e. there is no overlapping of transactions. This is
depicted in the following graph –

 Parallel Schedules − In parallel schedules, more than one transaction is
active simultaneously, i.e. the transactions contain operations that
overlap in time. This is depicted in the following graph –

Conflicts in Schedules
In a schedule comprising multiple transactions, a conflict occurs when two active
transactions perform non-compatible operations. Two operations are said to be in
conflict when all of the following three conditions exist simultaneously −
 The two operations are parts of different transactions.
 Both the operations access the same data item.
 At least one of the operations is a write_item() operation, i.e. it tries to
modify the data item.
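
As a small, hypothetical sketch of these three conditions in Python (the tuple
representation of an operation is invented purely for illustration):

def in_conflict(op1, op2):
    """Return True if two schedule operations conflict.

    Each operation is a (transaction_id, kind, item) tuple,
    where kind is 'read' or 'write'.
    """
    t1, kind1, item1 = op1
    t2, kind2, item2 = op2
    return (
        t1 != t2                           # parts of different transactions
        and item1 == item2                 # access the same data item
        and 'write' in (kind1, kind2)      # at least one is a write_item()
    )

# A write in T1 and a read in T2 on the same item A conflict.
print(in_conflict(("T1", "write", "A"), ("T2", "read", "A")))   # True
print(in_conflict(("T1", "read", "A"), ("T2", "read", "A")))    # False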

SERIALIZABILITY OF SCHEDULES

A serializable schedule of ‘n’ transactions is a parallel schedule which is equivalent to
a serial schedule comprising the same ‘n’ transactions. A serializable schedule
retains the correctness of a serial schedule while achieving the better CPU utilization of
a parallel schedule.

Equivalence of Schedules
Equivalence of two schedules can be of the following types −
 Result equivalence − Two schedules producing identical results are said
to be result equivalent.
 View equivalence − Two schedules that perform similar actions in a
similar manner are said to be view equivalent.
 Conflict equivalence − Two schedules are said to be conflict equivalent if
both contain the same set of transactions and have the same order of
conflicting pairs of operations.
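
A common way to test conflict equivalence with some serial schedule is to build a
precedence graph: add an edge Ti → Tj for every conflicting pair in which Ti's operation
appears first, and the schedule is conflict serializable exactly when that graph has no
cycle. The following Python sketch (the operation encoding is an assumption made for
illustration, not part of the notes) shows the idea:

def conflict_serializable(schedule):
    """Test conflict serializability via a precedence graph.

    schedule: ordered list of (transaction_id, kind, item) operations,
    with kind in {'read', 'write'}.
    """
    edges = set()
    txns = {t for t, _, _ in schedule}
    for i in range(len(schedule)):
        for j in range(i + 1, len(schedule)):
            t1, k1, a1 = schedule[i]
            t2, k2, a2 = schedule[j]
            if t1 != t2 and a1 == a2 and 'write' in (k1, k2):
                edges.add((t1, t2))          # Ti precedes Tj on a conflict

    # The schedule is conflict serializable iff the graph has no cycle.
    visited, in_path = set(), set()
    def has_cycle(node):
        visited.add(node)
        in_path.add(node)
        for (u, v) in edges:
            if u == node:
                if v in in_path or (v not in visited and has_cycle(v)):
                    return True
        in_path.discard(node)
        return False
    return not any(has_cycle(t) for t in txns if t not in visited)

# T1 and T2 conflict on A in both directions, so no serial order exists:
s = [("T1", "read", "A"), ("T2", "write", "A"),
     ("T1", "write", "A"), ("T2", "read", "A")]
print(conflict_serializable(s))   # False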

ACID Properties in DBMS

A DBMS manages data whose integrity must be preserved whenever changes are made
to it, because if the integrity of the data is compromised, the data as a whole becomes
unreliable and corrupted. Therefore, to maintain the integrity of the data, four
properties are defined in the database management system, known as the ACID
properties. The ACID properties apply to a transaction, which passes through a group
of different tasks, and that is where these properties come into play.

In this section, we will learn about the ACID properties: what they stand for, what each
property is used for, and how they work, with the help of some examples.

ACID Properties

The term ACID expands to the following four properties:

1) ATOMICITY: The term atomicity means that the data remains atomic: if any
operation is performed on the data, it should either be executed completely or not be
executed at all. The operation should not break in between or execute partially. When
operations are executed on a transaction, the transaction should be executed
completely and not partially.

Example: Suppose Remo has account A holding $30, from which he wishes to send $10
to Sheero's account, which is B. In account B, a sum of $100 is already present. When
$10 is transferred to account B, the sum will become $110. Now, two operations will
take place. One is that the amount of $10 that Remo wants to transfer will be debited
from his account A, and the same amount will get credited to account B, i.e., into
Sheero's account. Now suppose the first operation, the debit, executes successfully,
but the credit operation fails. Thus, in Remo's account A, the value becomes $20, while
Sheero's account remains at $100 as it was previously.


In the above diagram, it can be seen that although $10 was debited from account A,
the amount in account B is still $100 because the credit failed. So it is not an atomic
transaction.

The below image shows that both the debit and credit operations are done successfully.
Thus the transaction is atomic.

Thus, when a transfer loses atomicity, it becomes a huge issue in banking systems, and
so atomicity is a main focus in such systems.
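
A minimal sketch in plain Python (using the figures from the example above, not an
actual banking system) of what atomicity demands: either both the debit and the
credit are applied, or neither is.

def transfer(accounts, src, dst, amount):
    """Atomically move `amount` from accounts[src] to accounts[dst].

    The updates are first computed on a working copy; only if both
    succeed are they applied, so a failure leaves the balances untouched.
    """
    working = dict(accounts)              # temporary, uncommitted state
    if working[src] < amount:
        raise ValueError("insufficient funds")
    working[src] -= amount                # debit
    working[dst] += amount                # credit
    accounts.update(working)              # "commit": both changes or none

accounts = {"A": 30, "B": 100}            # Remo has $30, Sheero has $100
transfer(accounts, "A", "B", 10)
print(accounts)                           # {'A': 20, 'B': 110}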

2) CONSISTENCY: The word consistency means that the value should always remain
preserved. In DBMS, the integrity of the data should be maintained, which means that
if a change is made in the database, it should remain preserved. In the case of
transactions, the integrity of the data is essential so that the database remains
consistent before and after the transaction. The data should always be correct.

Example:


In the above figure, there are three accounts, A, B, and C, where A makes a transaction
T one by one to both B & C. Two operations take place, i.e., Debit and Credit. Account A
first transfers $50 to account B; before this transaction, B reads the balance of account
A as $300. After the successful transaction T, the available amount in B becomes $150.
Now, A transfers $20 to account C, and at that time the value of A read by C is $250
(which is correct, as the debit of $50 to B has already been completed). The debit and
credit operations from account A to C are also done successfully. We can see that the
transactions complete successfully and the values are read correctly, so the data is
consistent. If, instead, C had read the value of A as $300, the data would be
inconsistent, because the $50 debit to B had already executed.

3) ISOLATION: The term 'isolation' means separation. In DBMS, isolation is the property
that transactions which run concurrently should not affect one another. In short, the
operation of one transaction should appear to begin only after the operation of the
other has completed. It means that if two operations are being performed concurrently,
they must not affect the value seen by one another. In the case of transactions, when
two or more transactions occur simultaneously, consistency should remain maintained.
Any changes made by a particular transaction will not be seen by other transactions
until the change is committed.

Example: If two operations are concurrently running on two different accounts, then
the value of both accounts should not get affected. The value should remain
persistent. As you can see in the below diagram, account A is making T1 and T2
transactions to accounts B and C, but both are executing independently without
affecting each other. This is known as Isolation.

4) DURABILITY: Durability ensures the permanency of something. In DBMS, the term
durability ensures that after the successful execution of an operation, the data becomes
permanent in the database. The durability of the data should be such that even if the
system fails or crashes, the database still survives. If the data does get lost, it becomes
the responsibility of the recovery manager to ensure the durability of the database. For
committing the values, the COMMIT command must be used every time we make
changes.

Therefore, the ACID property of DBMS plays a vital role in maintaining the consistency
and availability of data in the database.

Thus, this was a brief introduction to the ACID properties in DBMS. These properties
are also discussed in the transaction section above.

SCHEDULE

A series of operations from one transaction to another transaction is known as a
schedule. It is used to preserve the order of the operations in each of the individual
transactions.


1. Serial Schedule

The serial schedule is a type of schedule where one transaction is executed completely
before starting another transaction. In the serial schedule, when the first transaction
completes its cycle, then the next transaction is executed.

For example: Suppose there are two transactions T1 and T2 which have some
operations. If there is no interleaving of operations, then there are the following two
possible outcomes:

1. Execute all the operations of T1 followed by all the operations of T2.
2. Execute all the operations of T2 followed by all the operations of T1.

o In the given figure (a), Schedule A shows the serial schedule where T1 is followed
by T2.
o In the given figure (b), Schedule B shows the serial schedule where T2 is followed
by T1.

2. Non-serial Schedule

o If interleaving of operations is allowed, then there will be a non-serial schedule.
o It contains many possible orders in which the system can execute the individual
operations of the transactions.
o In the given figures (c) and (d), Schedule C and Schedule D are non-serial
schedules, as they have interleaving of operations.


3. Serializable schedule

o The serializability of schedules is used to find non-serial schedules that allow
the transactions to execute concurrently without interfering with one another.
o It identifies which schedules are correct when the executions of the transactions
have interleaving of their operations.
o A non-serial schedule will be serializable if its result is equal to the result of its
transactions executed serially.


Schedule A and Schedule B are serial schedules.

Schedule C and Schedule D are non-serial schedules.

RECOVERABILITY OF SCHEDULE

Sometimes a transaction may not execute completely due to a software issue, system
crash or hardware failure. In that case, the failed transaction has to be rolled back. But
some other transaction may also have used a value produced by the failed transaction.
So we also have to roll back those transactions.

The above Table 1 shows a schedule which has two transactions. T1 reads and writes
the value of A, and that value is read and written by T2. T2 commits but later on, T1
fails. Due to the failure, we have to roll back T1. T2 should also be rolled back because
it reads the value written by T1, but T2 can't be rolled back because it has already
committed. So this type of schedule is known as an irrecoverable schedule.

Irrecoverable schedule: The schedule will be irrecoverable if Tj reads the updated
value of Ti and Tj commits before Ti commits.

The above Table 2 shows a schedule with two transactions. Transaction T1 reads and
writes A, and that value is read and written by transaction T2. But later on, T1 fails.
Due to this, we have to roll back T1. T2 should also be rolled back because T2 has read
the value written by T1. As T2 has not committed before T1 commits, we can roll back
transaction T2 as well. So it is recoverable with cascading rollback.

Recoverable with cascading rollback: The schedule will be recoverable with cascading
rollback if Tj reads the updated value of Ti and the commit of Tj is delayed till the
commit of Ti.

The above Table 3 shows a schedule with two transactions. Transaction T1 reads and
writes A and commits, and only then is that value read and written by T2. So this is a
cascadeless recoverable schedule.
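
The three cases above can also be summarized programmatically. The following Python
sketch (the schedule encoding is an assumption made for illustration) classifies a
schedule by looking at who reads from whom and when the commits happen:

def classify(schedule):
    """Classify a schedule as 'irrecoverable', 'recoverable (cascading)',
    or 'cascadeless'.

    schedule: ordered list of operations, each ('r'|'w'|'c', txn, item);
    for 'c' (commit) the item field is ignored.
    """
    last_writer = {}          # item -> (txn, position of the write)
    commit_pos = {}           # txn  -> position of its commit
    reads_from = []           # (reader, writer, position of the read)

    for pos, (kind, txn, item) in enumerate(schedule):
        if kind == 'w':
            last_writer[item] = (txn, pos)
        elif kind == 'r' and item in last_writer:
            writer, _ = last_writer[item]
            if writer != txn:
                reads_from.append((txn, writer, pos))
        elif kind == 'c':
            commit_pos[txn] = pos

    cascadeless = True
    for reader, writer, read_pos in reads_from:
        w_commit = commit_pos.get(writer)
        r_commit = commit_pos.get(reader)
        # Irrecoverable: the reader commits although the writer has not
        # committed before it (as in Table 1).
        if r_commit is not None and (w_commit is None or r_commit < w_commit):
            return 'irrecoverable'
        # Cascading: the read happened before the writer committed.
        if w_commit is None or read_pos < w_commit:
            cascadeless = False
    return 'cascadeless' if cascadeless else 'recoverable (cascading)'

# T2 reads A written by T1 and commits while T1 has not committed -> irrecoverable.
s = [('r', 'T1', 'A'), ('w', 'T1', 'A'), ('r', 'T2', 'A'),
     ('w', 'T2', 'A'), ('c', 'T2', None)]
print(classify(s))   # irrecoverable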

CONCURRENCY CONTROL

Concurrency Control is the management procedure that is required for controlling the
concurrent execution of the operations that take place on a database.

But before learning about concurrency control, we should understand concurrent
execution.

Concurrent Execution in DBMS

o In a multi-user system, multiple users can access and use the same database at
one time, which is known as concurrent execution of the database. It means
that the same database is used simultaneously on a multi-user system by
different users.
o While working with database transactions, multiple users may need to use the
database for performing different operations, and in that case, concurrent
execution of the database takes place.
o The simultaneous execution should be performed in an interleaved manner, and
no operation should affect the other executing operations, thus maintaining the
consistency of the database. Thus, when transaction operations are executed
concurrently, several challenging problems occur that need to be solved.

Problems with Concurrent Execution

In a database transaction, the two main operations are READ and WRITE. There is a
need to manage these two operations during the concurrent execution of transactions,
because if they are not managed properly in an interleaved execution, the data may
become inconsistent. The following problems occur with the concurrent execution of
operations:

Problem 1: Lost Update Problems (W - W Conflict)

The problem occurs when two different database transactions perform the read/write
operations on the same database items in an interleaved manner (i.e., concurrent
execution) that makes the values of the items incorrect hence making the database
inconsistent.

For example:

Consider the below diagram where two transactions TX and TY, are performed on the
same account A where the balance of account A is $300.

o At time t1, transaction TX reads the value of account A, i.e., $300 (only read).


o At time t2, transaction TX deducts $50 from account A that becomes $250 (only
deducted and not updated/write).
o Alternately, at time t3, transaction TY reads the value of account A that will be
$300 only because TX didn't update the value yet.
o At time t4, transaction TY adds $100 to account A that becomes $400 (only
added but not updated/write).
o At time t6, transaction TX writes the value of account A that will be updated as
$250 only, as TY didn't update the value yet.
o Similarly, at time t7, transaction TY writes the values of account A, so it will write
as done at time t4 that will be $400. It means the value written by TX is lost, i.e.,
$250 is lost.

Hence the data becomes incorrect, and the database becomes inconsistent.
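
The timeline above can be transcribed almost line for line into plain Python; this
illustrative sketch shows why TX's write is lost:

# Shared data item: the balance of account A.
account_A = 300

# Each transaction works on a private copy read from the database.
tx_copy = account_A          # t1: TX reads A (300)
tx_copy -= 50                # t2: TX deducts 50 locally (250, not yet written)
ty_copy = account_A          # t3: TY reads A, still 300 (TX has not written)
ty_copy += 100               # t4: TY adds 100 locally (400)
account_A = tx_copy          # t6: TX writes 250
account_A = ty_copy          # t7: TY writes 400 -- TX's update of 250 is lost
print(account_A)             # 400, so the $50 deduction has disappeared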

Dirty Read Problems (W-R Conflict)

The dirty read problem occurs when one transaction updates an item of the database,
and somehow the transaction fails, and before the data gets rolled back, the updated
database item is accessed by another transaction. This creates the read-write conflict
between the two transactions.

For example:

Consider two transactions TX and TY in the below diagram performing read/write


operations on account A where the available balance in account A is $300:


o At time t1, transaction TX reads the value of account A, i.e., $300.


o At time t2, transaction TX adds $50 to account A that becomes $350.
o At time t3, transaction TX writes the updated value in account A, i.e., $350.
o Then at time t4, transaction TY reads account A that will be read as $350.
o Then at time t5, transaction TX rollbacks due to server problem, and the value
changes back to $300 (as initially).
o But transaction TY had already read the value of account A as $350, a value that
was never committed. This is a dirty read, and hence the Dirty Read Problem.

Unrepeatable Read Problem (W-R Conflict)

Also known as the Inconsistent Retrievals Problem, it occurs when, within a single
transaction, two different values are read for the same database item.

For example:

Consider two transactions, TX and TY, performing the read/write operations on


account A, having an available balance = $300. The diagram is shown below:


o At time t1, transaction TX reads the value from account A, i.e., $300.
o At time t2, transaction TY reads the value from account A, i.e., $300.
o At time t3, transaction TY updates the value of account A by adding $100 to the
available balance, and then it becomes $400.
o At time t4, transaction TY writes the updated value, i.e., $400.
o After that, at time t5, transaction TX reads the available value of account A, and
that will be read as $400.
o It means that within the same transaction TX, two different values of account A
are read: $300 initially and, after the update made by transaction TY, $400. This
is an unrepeatable read and is therefore known as the Unrepeatable Read
problem.

Thus, in order to maintain consistency in the database and avoid such problems in
concurrent execution, management is needed, and that is where the concept of
Concurrency Control comes into play.

Reasons for using a concurrency control method in DBMS:

 To apply isolation through mutual exclusion between conflicting transactions
 To resolve read-write and write-write conflict issues
 To preserve database consistency by ensuring that interleaved executions remain
consistency-preserving
 The system needs to control the interaction among the concurrent
transactions. This control is achieved using concurrency-control schemes.
 Concurrency control helps to ensure serializability

Example
Assume that two people go to electronic kiosks at the same time to buy a movie
ticket for the same movie and the same show time.

However, there is only one seat left for that movie show in that particular theatre.
Without concurrency control in DBMS, it is possible that both moviegoers will end up
purchasing a ticket. However, a concurrency control method does not allow this to
happen. Both moviegoers can still access the information written in the movie seating
database. But concurrency control only provides a ticket to the buyer who has
completed the transaction process first.

CONCURRENCY CONTROL PROTOCOLS


Different concurrency control protocols offer different trade-offs between the amount
of concurrency they allow and the amount of overhead that they impose. Following
are the concurrency control techniques in DBMS:

 Lock-Based Protocols
 Two Phase Locking Protocol
 Timestamp-Based Protocols
 Validation-Based Protocols

LOCK-BASED PROTOCOLS:

Lock Based Protocols in DBMS is a mechanism in which a transaction cannot Read or


Write the data until it acquires an appropriate lock. Lock based protocols help to
eliminate the concurrency problem in DBMS for simultaneous transactions by locking
or isolating a particular transaction to a single user.
A lock is a data variable which is associated with a data item. The lock signifies which
operations can be performed on the data item. Locks in DBMS help synchronize
access to the database items by concurrent transactions.

All lock requests are made to the concurrency-control manager. Transactions
proceed only once the lock request is granted.

Binary Locks: A binary lock on a data item can be in either a locked or an unlocked state.
Shared/exclusive: This type of locking mechanism separates the locks in DBMS based
on their use. If a lock is acquired on a data item to perform a write operation, it is
called an exclusive lock.

1. Shared Lock (S):

A shared lock is also called a read-only lock. With a shared lock, the data item can
be shared between transactions. This is because you will never have permission to
update the data in the data item.

For example, consider a case where two transactions are reading the account
balance of a person. The database will let them read by placing a shared lock.
However, if another transaction wants to update that account’s balance, the shared
lock prevents it until the reading process is over.

2. Exclusive Lock (X):

With the Exclusive Lock, a data item can be read as well as written. This is exclusive
and can’t be held concurrently on the same data item. X-lock is requested using lock-
x instruction. Transactions may unlock the data item after finishing the ‘write’
operation.

For example, when a transaction needs to update the account balance of a person,
you can allow this transaction by placing an X lock on the item. Then, when a second
transaction wants to read or write the item, the exclusive lock prevents that operation.
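
A hypothetical sketch of a lock table enforcing these two modes follows; the class and
its behaviour on conflicts (simply refusing the request rather than queueing the
transaction) are simplifications made for illustration:

class LockManager:
    """A minimal lock table: shared locks may be held by many
    transactions at once, an exclusive lock by only one."""

    def __init__(self):
        self.locks = {}   # item -> (mode 'S' or 'X', set of holders)

    def acquire(self, txn, item, mode):
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if mode == 'S' and held_mode == 'S':      # S is compatible with S
            holders.add(txn)
            return True
        if holders == {txn}:                      # lock upgrade by the sole holder
            self.locks[item] = ('X' if mode == 'X' else held_mode, holders)
            return True
        return False                              # incompatible: caller must wait

    def release(self, txn, item):
        mode, holders = self.locks.get(item, (None, set()))
        holders.discard(txn)
        if not holders:
            self.locks.pop(item, None)

lm = LockManager()
print(lm.acquire('T1', 'A', 'S'))   # True  - read lock granted
print(lm.acquire('T2', 'A', 'S'))   # True  - shared with T1
print(lm.acquire('T3', 'A', 'X'))   # False - the write must wait for the readers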

3. Simplistic Lock Protocol

This type of lock-based protocol allows transactions to obtain a lock on every object
before beginning an operation. Transactions may unlock the data item after finishing
the ‘write’ operation.

4. Pre-claiming Locking

The pre-claiming lock protocol evaluates the operations and creates a list of the data
items needed to initiate the execution process. Only when all locks are granted does
the transaction execute. After that, all locks are released when all of its operations
are over.

Starvation

Starvation is the situation when a transaction needs to wait for an indefinite period
to acquire a lock.

Following are the reasons for starvation:

 When the waiting scheme for locked items is not properly managed
 In the case of a resource leak
 When the same transaction is selected as a victim repeatedly

Deadlock


Deadlock refers to a specific situation where two or more processes are waiting for
each other to release a resource or more than two processes are waiting for the
resource in a circular chain.

TWO PHASE LOCKING PROTOCOL

Two Phase Locking Protocol, also known as the 2PL protocol, is a method of concurrency
control in DBMS that ensures serializability by applying a lock to the transaction data,
which blocks other transactions from accessing the same data simultaneously. The Two
Phase Locking protocol helps to eliminate the concurrency problem in DBMS.
This locking protocol divides the execution phase of a transaction into three different
parts.

 In the first phase, when the transaction begins to execute, it seeks
permission for the locks it needs.
 In the second part, the transaction obtains all the locks. When a
transaction releases its first lock, the third phase starts.
 In this third phase, the transaction cannot demand any new locks. Instead, it
only releases the acquired locks.

The Two-Phase Locking protocol allows each transaction to make a lock or unlock
request in two steps:

 Growing Phase: In this phase transaction may obtain locks but may not
release any locks.
 Shrinking Phase: In this phase, a transaction may release locks but not obtain
any new lock
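
The growing/shrinking rule itself is easy to enforce mechanically. The sketch below (an
illustrative Python class, not a full lock manager) rejects any lock request made after
the first unlock:

class TwoPhaseLockingTxn:
    """Sketch of the 2PL rule only: once a transaction has released
    any lock (shrinking phase), it may not acquire new ones."""

    def __init__(self, name):
        self.name = name
        self.held = set()
        self.shrinking = False

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError(f"{self.name}: cannot lock {item} after unlocking (2PL)")
        self.held.add(item)          # growing phase

    def unlock(self, item):
        self.shrinking = True        # the first unlock starts the shrinking phase
        self.held.discard(item)

t = TwoPhaseLockingTxn("T1")
t.lock("A")
t.lock("B")
t.unlock("A")
try:
    t.lock("C")                      # violates the two-phase rule
except RuntimeError as err:
    print(err)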


It is true that the 2PL protocol offers serializability. However, it does not ensure that
deadlocks do not happen.

In the above-given diagram, you can see that local and global deadlock detectors
search for deadlocks and resolve them by restoring transactions to their initial
states.

Strict Two-Phase Locking Method

The Strict Two-Phase Locking system is almost similar to 2PL. The only difference is that
Strict-2PL does not release a lock as soon as it has finished using it. It holds all the locks
until the commit point and releases them all in one go when the process is over.

Centralized 2PL
In Centralized 2PL, a single site is responsible for the lock management process. It has
only one lock manager for the entire DBMS.

Primary copy 2PL

In the primary copy 2PL mechanism, many lock managers are distributed to different
sites. Each particular lock manager is responsible for managing the locks for a set of
data items. When the primary copy has been updated, the change is propagated to
the slaves.

Distributed 2PL
In this kind of two-phase locking mechanism, lock managers are distributed to all
sites. Each is responsible for managing locks for the data at its site. If no data is
replicated, it is equivalent to primary copy 2PL. The communication costs of Distributed
2PL are quite a bit higher than those of primary copy 2PL.

TIMESTAMP-BASED PROTOCOLS
A timestamp-based protocol in DBMS is an algorithm which uses the system time or a
logical counter as a timestamp to serialize the execution of concurrent transactions.
The timestamp-based protocol ensures that conflicting read and write operations are
executed in timestamp order.
The older transaction is always given priority in this method. It uses the system time
to determine the timestamp of the transaction. This is the most commonly used
concurrency protocol.

Lock-based protocols help you to manage the order between conflicting transactions
at the time they execute. Timestamp-based protocols manage conflicts as soon as an
operation is created.


Example:

Suppose there are three transactions T1, T2, and T3.

T1 has entered the system at time 0010
T2 has entered the system at time 0020
T3 has entered the system at time 0030
Priority will be given to transaction T1, then transaction T2 and lastly transaction T3.
Advantages:

 Schedules are serializable, just as with 2PL protocols
 No waiting for transactions, which eliminates the possibility of deadlocks!

Disadvantages:

Starvation is possible if the same transaction is restarted and continually aborted
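
A minimal sketch of the basic timestamp-ordering rules follows (the encoding is an
assumption made for illustration: each data item keeps the largest read and write
timestamps seen so far, and a rejected operation stands for rolling back and restarting
the transaction with a new timestamp in a real system):

class TimestampScheduler:
    """Basic timestamp ordering: operations arriving 'too late' with
    respect to younger transactions are rejected."""

    def __init__(self):
        self.read_ts = {}
        self.write_ts = {}

    def read(self, ts, item):
        if ts < self.write_ts.get(item, 0):
            return False                       # a younger txn already wrote the item
        self.read_ts[item] = max(self.read_ts.get(item, 0), ts)
        return True

    def write(self, ts, item):
        if ts < self.read_ts.get(item, 0) or ts < self.write_ts.get(item, 0):
            return False                       # a younger txn already read/wrote the item
        self.write_ts[item] = ts
        return True

sched = TimestampScheduler()
print(sched.write(20, 'A'))   # True:  T2 (timestamp 20) writes A
print(sched.read(10, 'A'))    # False: T1 (timestamp 10) is older, its read is rejected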

VALIDATION BASED PROTOCOL

The Validation-based Protocol in DBMS, also known as the Optimistic Concurrency
Control technique, is a method of concurrency control that avoids locking during
transaction execution. In this protocol, local copies of the transaction data are updated
rather than the data itself, which results in less interference during the execution of
the transaction.
The Validation based Protocol is performed in the following three phases:

1. Read Phase
2. Validation Phase
3. Write Phase

Read Phase
In the Read Phase, the data values from the database can be read by a transaction
but the write operation or updates are only applied to the local data copies, not the
actual database.

Validation Phase
In Validation Phase, the data is checked to ensure that there is no violation of
serializability while applying the transaction updates to the database.

Write Phase


In the Write Phase, the updates are applied to the database if the validation is
successful; otherwise, the updates are not applied and the transaction is rolled back.
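
A simplified sketch of the validation test follows (this is one common backward-
validation variant, not necessarily the exact test used in every system): a transaction
passes only if nothing it read was written by a transaction that committed while it was
running.

def validate(txn_read_set, overlapping_committed_write_sets):
    """Backward validation sketch: a transaction passes validation only if
    nothing it read was overwritten by a transaction that committed while
    it was executing; otherwise it is rolled back and restarted."""
    for write_set in overlapping_committed_write_sets:
        if txn_read_set & write_set:
            return False          # conflict: the updates would violate serializability
    return True                   # safe to enter the write phase

# T reads {A, B}; a concurrent transaction committed a write to A.
print(validate({"A", "B"}, [{"A"}]))   # False -> T is rolled back
print(validate({"A", "B"}, [{"C"}]))   # True  -> T's local updates are written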

Characteristics of Good Concurrency Protocol


An ideal concurrency control DBMS mechanism has the following objectives:

 Must be resilient to site and communication failures.


 It allows the parallel execution of transactions to achieve maximum
concurrency.
 Its storage mechanisms and computational methods should be modest to
minimize overhead.
 It must enforce some constraints on the structure of atomic actions of
transactions.

In the case of transactions, the term ACID is used to state some of the important
properties that a transaction must follow. ACID stands for Atomicity, Consistency,
Isolation and Durability, and these properties are collectively called the ACID
properties.

Properties of transaction

A database system ensures the ACID properties −

 Atomicity − Either all or none of the transaction's operations are done.
 Consistency − A transaction transfers the database from one consistent (correct)
state to another consistent state.
 Isolation − A transaction is isolated from other transactions, i.e. a
transaction is not affected by another transaction. Although multiple
transactions execute concurrently, it must appear as if the transactions are
running serially (one after the other).
 Durability − The results of transactions are permanent, i.e. the result will
never be lost with a subsequent failure; durability refers to long-lasting,
i.e. permanency.

ISOLATION

Isolation determines the visibility of a transaction's changes to other transactions. A
lower isolation level allows many transactions to access the same data concurrently,
and therefore carries a higher risk of anomalies in the data. A higher isolation level
reduces the degree of concurrency over the data but requires more resources and is
slower than lower isolation levels.

The isolation levels help safeguard the data from unwanted interference between
transactions. They maintain the integrity of the data by defining how and when the
changes made by one operation become visible to others.

LEVELS OF ISOLATION

There are four levels of isolation, which are explained below −

 Read Uncommitted − It is the lowest level of isolation. At this level,
dirty reads are allowed, which means one transaction can read the uncommitted
changes made by another.
 Read Committed − It allows no dirty reads; any data that is read must be
committed at the moment it is read.
 Repeatable Read − This is a more restrictive level of isolation. The
transaction holds read locks on all the rows it references and write locks
on all the rows it updates/inserts/deletes, so there is no chance of
non-repeatable reads.
 Serializable − The highest level of isolation. It requires that all
concurrent transactions appear to be executed serially.

Example

Consider an example of isolation.


What is the isolation level of transaction E?
session begins
SET GLOBAL TRANSACTION
ISOLATION LEVEL SERIALIZABLE;
session ends
session begins
SET SESSION TRANSACTION
ISOLATION LEVEL REPEATABLE READ;
transaction A
transaction B
SET TRANSACTION
ISOLATION LEVEL READ UNCOMMITTED;
transaction C
SET TRANSACTION
ISOLATION LEVEL READ COMMITTED;

transaction D
transaction E
session ends
Check which option −
A- Serializable
B- Repeatable read
C- Read uncommitted
Answer: B (Repeatable read). SET TRANSACTION applies only to the next single
transaction (transactions C and D here), while transaction E falls back to the session
level set by SET SESSION TRANSACTION, which is REPEATABLE READ.

ONLINE ANALYTICAL PROCESSING:


What is OLAP (Online Analytical Processing)?

OLAP stands for On-Line Analytical Processing. OLAP is a category of software
technology that enables analysts, managers, and executives to gain insight into data
through fast, consistent, interactive access to a wide variety of possible views of
information that has been transformed from raw data to reflect the real
dimensionality of the enterprise as understood by the user.

OLAP implements multidimensional analysis of business data and supports capabilities
for complex calculations, trend analysis, and sophisticated data modeling. It is rapidly
becoming the essential foundation for intelligent solutions including Business
Performance Management, Planning, Budgeting, Forecasting, Financial Reporting,
Analysis, Simulation Models, Knowledge Discovery, and Data Warehouse Reporting.
OLAP enables end users to perform ad hoc analysis of data in multiple dimensions,
providing the insight and understanding they require for better decision making.

Who uses OLAP and Why?

OLAP applications are used by a variety of the functions of an organization.

Finance and accounting

o Budgeting
o Activity-based costing
o Financial performance analysis
o And financial modeling

Sales and Marketing


o Sales analysis and forecasting


o Market research analysis
o Promotion analysis
o Customer analysis
o Market and customer segmentation

Production

o Production planning
o Defect analysis

OLAP cubes have two main purposes. The first is to provide business users with a data
model more intuitive to them than a tabular model. This model is called a Dimensional
Model.

The second purpose is to enable fast query response that is usually difficult to achieve
using tabular models.

How OLAP Works?

Fundamentally, OLAP has a very simple concept. It pre-calculates most of the queries
that are typically very hard to execute over tabular databases, namely aggregation,
joining, and grouping. These queries are calculated during a process that is usually
called 'building' or 'processing' of the OLAP cube. This process happens overnight, and
by the time end users get to work - data will have been updated.

OLAP Guidelines (Dr.E.F.Codd Rule)

Dr E.F. Codd, the "father" of the relational model, has formulated a list of 12 guidelines
and requirements as the basis for selecting OLAP systems:


1) Multidimensional Conceptual View: This is the central feature of an OLAP system.
Requiring a multidimensional view makes it possible to carry out operations like slice
and dice.

2) Transparency: Make the technology, underlying information repository, computing


operations, and the dissimilar nature of source data totally transparent to users. Such
transparency helps to improve the efficiency and productivity of the users.

3) Accessibility: It provides access only to the data that is actually required to perform
the particular analysis, present a single, coherent, and consistent view to the clients.
The OLAP system must map its own logical schema to the heterogeneous physical data
stores and perform any necessary transformations. The OLAP operations should be
sitting between data sources (e.g., data warehouses) and an OLAP front-end.

4) Consistent Reporting Performance: To make sure that users do not experience any
significant degradation in reporting performance as the number of dimensions or
the size of the database increases. That is, the performance of OLAP should not suffer
as the number of dimensions is increased. Users must observe consistent run time,
response time, or machine utilization every time a given query is run.


5) Client/Server Architecture: Make the server component of OLAP tools sufficiently
intelligent that the various clients can be attached with a minimum of effort and
integration programming. The server should be capable of mapping and consolidating
data between dissimilar databases.

6) Generic Dimensionality: An OLAP method should treat each dimension as equivalent
in both its structure and its operational capabilities. Additional operational capabilities
may be granted to selected dimensions, but such additional functions should be
grantable to any dimension.

7) Dynamic Sparse Matrix Handling: The physical schema should adapt to the specific
analytical model being created and loaded so that sparse matrix handling is optimized.
When encountering a sparse matrix, the system should be able to dynamically infer the
distribution of the data and adjust storage and access paths to obtain and maintain a
consistent level of performance.

8) Multiuser Support: OLAP tools must provide concurrent data access, data integrity,
and access security.

9) Unrestricted Cross-dimensional Operations: The system should be able to recognize
dimensional hierarchies and perform roll-up and drill-down operations within a
dimension or across dimensions without restriction.

10) Intuitive Data Manipulation: Fundamental data manipulation, such as consolidation-
path reorientation (pivoting), drill-down and roll-up, and other manipulations, should be
accomplished naturally and precisely via point-and-click and drag-and-drop actions on
the cells of the analytical model. It should avoid the use of a menu or multiple trips to a
user interface.

11) Flexible Reporting: It gives business users the ability to arrange columns, rows, and
cells in a manner that facilitates easy manipulation, analysis, and synthesis of data.

12) Unlimited Dimensions and Aggregation Levels: The number of data dimensions
should be unlimited. Each of these common dimensions must allow a practically
unlimited number of customer-defined aggregation levels within any given
consolidation path.

DATABASE PERFORMANCE TUNING AND QUERY OPTIMIZATION:


DATABASE PERFORMANCE TUNING
It takes time to become a database expert or an expert database administrator. This
comes with a lot of experience in various database designs and good training.


But the following list may be helpful for beginners to achieve good database
performance −
 Use 3NF database design, explained in this tutorial in the RDBMS Concepts
chapter.
 Avoid number-to-character conversions because numbers and
characters compare differently and lead to performance degradation.
 While using a SELECT statement, only fetch whatever information is
required and avoid using * in your SELECT queries because it would load
the system unnecessarily.
 Create your indexes carefully on all the tables where you have frequent
search operations. Avoid indexes on tables where you have a small number
of search operations and a large number of insert and update operations.
 A full-table scan occurs when the columns in the WHERE clause do not
have an index associated with them. You can avoid a full-table scan by
creating an index on columns that are used as conditions in the WHERE
clause of an SQL statement.
 Be very careful with equality operators on real numbers and date/time
values. Both of these can have small differences that are not obvious to
the eye but that make an exact match impossible, thus preventing your
queries from ever returning rows.
 Use pattern matching judiciously. LIKE COL% is a valid WHERE condition,
reducing the returned set to only those records with data starting with
the string COL. However, COL%Y does not further reduce the returned
result set since %Y cannot be effectively evaluated. The effort to do the
evaluation is too large to be considered. In this case, the COL% is used,
but the %Y is thrown away. For the same reason, a leading wildcard %COL
effectively prevents the entire filter from being used.
 Fine-tune your SQL queries by examining the structure of the queries (and
subqueries) and the SQL syntax, to discover whether you have designed your
tables to support fast data manipulation and written the query in an
optimum manner, allowing your DBMS to manipulate the data
efficiently.
 For queries that are executed on a regular basis, try to use procedures. A
procedure is a potentially large group of SQL statements. Procedures are
compiled by the database engine and then executed. Unlike an SQL
statement, the database engine need not optimize the procedure before
it is executed.
 Avoid using the logical operator OR in a query if possible. OR inevitably
slows down nearly any query against a table of substantial size.
 You can optimize bulk data loads by dropping indexes. Imagine a
history table with many thousands of rows. That history table is also likely
to have one or more indexes. When you think of an index, you normally
think of faster table access, but in the case of batch loads, you can benefit
from dropping the index(es).
 When performing batch transactions, perform a COMMIT after a fair
number of records have been created instead of committing after every
record creation.
 Plan to defragment the database on a regular basis, even if doing so
means developing a weekly routine.

Built-In Tuning Tools

Oracle has many tools for managing SQL statement performance, but among them two
are very popular. These two tools are −
 Explain plan − this tool identifies the access path that will be taken when the
SQL statement is executed.
 tkprof − measures the performance by the time elapsed during each phase
of SQL statement processing.
If you want to simply measure the elapsed time of a query in Oracle, you can use the
SQL*Plus command SET TIMING ON.
Check your RDBMS documentation for more detail on the above-mentioned tools and
on defragmenting the database.
QUERY OPTIMIZATION
Query optimization is of great importance for the performance of a relational
database, especially for the execution of complex SQL statements. A query optimizer
decides the best methods for implementing each query.
The query optimizer selects, for instance, whether or not to use indexes for a given
query, and which join methods to use when joining multiple tables. These decisions
have a tremendous effect on SQL performance, and query optimization is a key
technology for every application, from operational Systems to data warehouse and
analytical systems to content management systems.

There are various principles of query optimization, as follows −

 Understand how your database is executing your query − The first
phase of query optimization is understanding what the database is
doing. Different databases have different commands for this. For
example, in MySQL, one can use the “EXPLAIN [SQL Query]” keyword to
see the query plan. In Oracle, one can use “EXPLAIN PLAN FOR [SQL
Query]” to see the query plan.
 Retrieve as little data as possible − The more information returned from
the query, the more resources the database is required to expend to
process and store those records. For example, if you only need to
fetch one column from a table, do not use ‘SELECT *’.
 Store intermediate results − Sometimes the logic for a query can be quite
complex. It is possible to produce the desired results through the use
of subqueries, inline views, and UNION-type statements. In those
methods, the intermediate results are not saved in the database but are
used directly within the query. This can lead to performance issues,
particularly when the intermediate results have a huge number of rows.
There are various query optimization strategies, as follows −
 Use Index − Using an index is the first strategy one should use
to speed up a query.
 Aggregate Table − Pre-populated tables at higher levels of aggregation can
be used so that a smaller amount of data needs to be parsed.
 Vertical Partitioning − The table can be partitioned by columns.
This method reduces the amount of data a SQL query needs to
process.
 Horizontal Partitioning − The table can be partitioned by data
value, most often time. This method reduces the amount of data
a SQL query needs to process.
 De-normalization − The process of de-normalization combines multiple
tables into a single table. This speeds up query execution because
fewer table joins are required.
 Server Tuning − Each server has its own parameters, and tuning the
server parameters so that it can take full advantage of the hardware
resources can significantly speed up query execution.
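
As an illustration of the first principle (understanding how the database executes a
query) and of the index strategy, the sketch below uses Python's built-in sqlite3 module
and an invented employee table; other DBMSs expose the same idea through their own
EXPLAIN commands.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (empid INTEGER, ename TEXT, salary INTEGER, deptno INTEGER)")

query = "SELECT ename FROM employee WHERE deptno = 10"

# Before indexing: the optimizer can only perform a full-table scan.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)          # the detail column reports a SCAN of employee

# After indexing the WHERE-clause column, the plan switches to an index search.
conn.execute("CREATE INDEX idx_employee_deptno ON employee(deptno)")
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)          # the detail column reports a SEARCH using idx_employee_deptno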

QUERY PARSING AND TRANSLATION

Initially, the SQL query is scanned. Then it is parsed to look for syntactical errors and
correctness of data types. If the query passes this step, the query is decomposed into
smaller query blocks. Each block is then translated to an equivalent relational algebra
expression.
Steps for Query Optimization
Query optimization involves three steps, namely query tree generation, plan
generation, and query plan code generation.
Step 1 − Query Tree Generation
A query tree is a tree data structure representing a relational algebra expression. The
tables of the query are represented as leaf nodes. The relational algebra operations
are represented as the internal nodes. The root represents the query as a whole.


During execution, an internal node is executed whenever its operand tables are
available. The node is then replaced by the result table. This process continues for all
internal nodes until the root node is executed and replaced by the result table.
For example, let us consider the following schemas −
EMPLOYEE

EmpID EName Salary DeptNo DateOfJoining

DEPARTMENT

DNo DName Location

Example 1
Let us consider the query as the following.
π_EmpID (σ_EName = "ArunKumar" (EMPLOYEE))
The corresponding query tree will be −

Example 2
Let us consider another query involving a join.
π_EName, Salary (σ_DName = "Marketing" (DEPARTMENT) ⋈_DNo = DeptNo (EMPLOYEE))
Following is the query tree for the above query.
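
As a complementary sketch (not the figure from the notes), the query tree of Example 1
can be represented and evaluated bottom-up in plain Python, with the EMPLOYEE table
as the leaf node and selection and projection as the internal and root nodes; the sample
rows are invented.

# A query tree node is either a leaf (a table, i.e. a list of row dicts)
# or an internal node (an operation applied to the results of its children).

EMPLOYEE = [
    {"EmpID": 1, "EName": "ArunKumar", "Salary": 50000, "DeptNo": 10},
    {"EmpID": 2, "EName": "Meera",     "Salary": 60000, "DeptNo": 20},
]

def select(rows, predicate):                 # sigma
    return [r for r in rows if predicate(r)]

def project(rows, columns):                  # pi
    return [{c: r[c] for c in columns} for r in rows]

# pi_EmpID( sigma_{EName = "ArunKumar"}( EMPLOYEE ) ), evaluated bottom-up:
leaf = EMPLOYEE                                            # leaf node
inner = select(leaf, lambda r: r["EName"] == "ArunKumar")  # internal node: selection
root = project(inner, ["EmpID"])                           # root node: projection
print(root)                                                # [{'EmpID': 1}]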


Step 2 − Query Plan Generation

After the query tree is generated, a query plan is made. A query plan is an extended
query tree that includes access paths for all operations in the query tree. Access paths
specify how the relational operations in the tree should be performed. For example, a
selection operation can have an access path that gives details about the use of a B+ tree
index for selection.
Besides, a query plan also states how the intermediate tables should be passed from
one operator to the next, how temporary tables should be used and how operations
should be pipelined/combined.
Step 3 − Code Generation
Code generation is the final step in query optimization. It is the executable form of the
query, whose form depends upon the type of the underlying operating system. Once
the query code is generated, the Execution Manager runs it and produces the results.

COST OF QUERY:

In the previous section, we learned about the query processing steps and the evaluation
plan. Though a system can create multiple plans for a query, the chosen plan should be
the best of all. This can be done by comparing each possible plan in terms of its
estimated cost. For calculating the net estimated cost of any plan, the cost of each
operation within the plan should be determined and combined to get the net estimated
cost of the query evaluation plan.

The cost estimation of a query evaluation plan is calculated in terms of various


resources that include:

o Number of disk accesses


o Execution time taken by the CPU to execute a query
o Communication costs in distributed or parallel database systems.

To estimate the cost of a query evaluation plan, we use the number of blocks
transferred from the disk and the number of disk seeks. Suppose the disk takes an
average of tS seconds for one seek (the block access time, i.e. the sum of the disk seek
time and the rotational latency) and an average of tT seconds to transfer one data
block. Then, if a plan transfers b blocks and performs S seeks, the time taken will be
b*tT + S*tS seconds. For example, if tT = 0.1 ms and tS = 4 ms, with a block size of 4 KB
and a transfer rate of 40 MB per second, we can easily calculate the estimated cost of a
given query evaluation plan.
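
With those figures, the estimate b*tT + S*tS can be computed directly; the following
small Python helper (illustrative only, with the values quoted above as defaults) shows
the calculation.

def plan_cost_seconds(blocks_transferred, seeks, t_transfer=0.1e-3, t_seek=4e-3):
    """Estimated cost b*tT + S*tS, with tT = 0.1 ms per 4 KB block and
    tS = 4 ms per seek as assumed above."""
    return blocks_transferred * t_transfer + seeks * t_seek

# e.g. reading a 40 MB relation (10,000 blocks of 4 KB) sequentially with one seek:
print(plan_cost_seconds(10_000, 1))   # approximately 1.004 seconds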

Generally, for estimating the cost, we consider the worst case that could happen. The
users assume that initially the data is read from the disk only. But there is a chance that
the information is already present in main memory. However, users usually ignore this
effect, and due to this, the actual cost of execution comes out less than the estimated
value.

The response time, i.e., the time required to execute the plan, could be used for
estimating the cost of the query evaluation plan. But due to the following reasons, it
becomes difficult to calculate the response time without actually executing the query
evaluation plan:

o When the query begins its execution, the response time becomes dependent
on the contents stored in the buffer, but this information is difficult to obtain
when the query is being optimized, and it may not be available at all.
o When a system with multiple disks is present, the response time depends on
how accesses are distributed among the disks, which is difficult to estimate
without detailed knowledge of the data layout on the disks.
o Consequently, instead of minimizing the response time for any query
evaluation plan, optimizers find it better to reduce the total resource
consumption of the query plan. Thus, to estimate the cost of a query evaluation
plan, it is better to minimize the resources used for accessing the disk and the
use of extra resources.

JOIN:
Join in DBMS is a binary operation which allows you to combine a Cartesian product and
a selection in one single statement. The goal of creating a join condition is that it helps
you to combine the data from two or more DBMS tables. The tables in a DBMS are
associated using primary keys and foreign keys.

Types of Join
There are mainly two types of joins in DBMS:

1. Inner Joins: Theta, Natural, EQUI


2. Outer Join: Left, Right, Full

Let’s see them in detail:

INNER JOIN
Inner Join is used to return rows from both tables which satisfy the given condition.
It is the most widely used join operation and can be considered the default join type.

An inner join or equijoin is a comparator-based join which uses equality comparisons
in the join predicate. However, if you use other comparison operators like “>”, it can’t
be called an equijoin.
Inner Join is further divided into three subtypes:

 Theta join
 Natural join
 EQUI join

THETA JOIN
Theta Join allows you to merge two tables based on the condition represented by
theta. Theta joins work for all comparison operators. It is denoted by symbol θ. The
general case of JOIN operation is called a Theta join.


Syntax:

A ⋈θ B
Theta join can use any conditions in the selection criteria.

Consider the following tables.

Table A                      Table B
column 1   column 2          column 1   column 2
1          1                 1          1
1          2                 1          3
For example:

A ⋈ A.column 2 > B.column 2 (B)

column 1   column 2
1          2

EQUI Join
EQUI Join is done when a theta join uses only the equivalence condition. EQUI join is
one of the most difficult operations to implement efficiently in an RDBMS, and one
reason why RDBMSs can have inherent performance problems.
For example:

A ⋈ A.column 2 = B.column 2 (B)

column 1   column 2
1          1

Natural Join (⋈)


Natural Join does not utilize any of the comparison operators. In this type of join, the
attributes should have the same name and domain. In Natural Join, there should be
at least one common attribute between two relations.
It performs selection forming equality on those attributes which appear in both
relations and eliminates the duplicate attributes.


Example:
Consider the following two tables

C
Num   Square
2     4
3     9

D
Num   Cube
2     8
3     27

C ⋈ D
Num   Square   Cube
2     4        8
3     9        27

OUTER JOIN
An Outer Join doesn’t require each record in the two join tables to have a matching
record. In this type of join, the table retains each record even if no other matching
record exists.

Three types of Outer Joins are:

 Left Outer Join


 Right Outer Join
 Full Outer Join

Left Outer Join (A ⟕ B)


Left Outer Join returns all the rows from the table on the left even if no matching
rows have been found in the table on the right. When no matching record is found in
the table on the right, NULL is returned.

Consider the following 2 Tables


A
Num   Square
2     4
3     9
4     16

B
Num   Cube
2     8
3     27
5     125

A ⟕ B
Num   Square   Cube
2     4        8
3     9        27
4     16       –

Right Outer Join (A ⟖ B)

Right Outer Join returns all the rows from the table on the right even if no matching
rows have been found in the table on the left. Where no matches have been found in
the table on the left, NULL is returned. A RIGHT outer JOIN is the opposite of a LEFT
JOIN.
A right outer join is useful, for example, when you need all rows of the right-hand table
regardless of matches, such as listing every member of a rental store and the movies
they rented, including a new member who has not rented any movie yet.

A ⟖ B
Num   Cube   Square
2     8      4
3     27     9
5     125    –

Full Outer Join (A ⟗ B)

In a Full Outer Join, all tuples from both relations are included in the result,
irrespective of the matching condition.
Example:

A ⟗ B
Num   Square   Cube
2     4        8
3     9        27
4     16       –
5     –        125
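
To make the join semantics concrete, the sketch below implements a natural join and a
left outer join over the Num/Square/Cube tables in plain Python (illustrative only; a
real DBMS would of course use SQL and far more efficient algorithms such as hash or
merge joins).

A = [{"Num": 2, "Square": 4}, {"Num": 3, "Square": 9}, {"Num": 4, "Square": 16}]
B = [{"Num": 2, "Cube": 8},   {"Num": 3, "Cube": 27},  {"Num": 5, "Cube": 125}]

def natural_join(left, right):
    """Join on all common attribute names, keeping one copy of each."""
    common = set(left[0]) & set(right[0])
    return [{**l, **r} for l in left for r in right
            if all(l[c] == r[c] for c in common)]

def left_outer_join(left, right):
    """Like the natural join, but unmatched left rows are padded with NULLs."""
    common = set(left[0]) & set(right[0])
    extra = [c for c in right[0] if c not in common]
    result = []
    for l in left:
        matches = [r for r in right if all(l[c] == r[c] for c in common)]
        if matches:
            result.extend({**l, **r} for r in matches)
        else:
            result.append({**l, **{c: None for c in extra}})
    return result

print(natural_join(A, B))      # rows for Num 2 and 3 only (the inner/natural join)
print(left_outer_join(A, B))   # also keeps {'Num': 4, 'Square': 16, 'Cube': None}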

SELECTION AND PROJECTION IMPLEMENTATION ALGORITHMS AND OPTIMIZATION


DATABASE SECURITY:

In the previous section, we understood that estimating the cost of a query plan should
be done by measuring the total resource consumption.

In this section, we will understand how the selection operation is performed in the
query execution plan.

Generally, the selection operation is performed by a file scan. File scans are search
algorithms used to locate and access the data; the file scan is the lowest-level
operator used in query processing.

Let's see how selection using a file scan is performed.

Selection using File scans and Indices

In relational database systems, a file scan reads a relation when the whole relation is
stored in a single file. When the selection operation is performed on a relation whose
tuples are stored in one file, it can use the following algorithm:

o Linear Search: In a linear search, the system scans each record and tests whether
it satisfies the given selection condition. Accessing the first block of the file
requires an initial seek; if the blocks of the file are not stored contiguously, extra
seeks are needed. Linear search is the slowest search algorithm, but it is
applicable in all cases: it does not depend on the nature of the selection, the
availability of indices, or the ordering of the file, whereas the other algorithms
are not applicable in every case. A minimal sketch of such a scan follows.
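
Here is a minimal, purely illustrative Python sketch of a linear (full) scan; the record layout and example data are assumptions of this rewrite, not taken from the notes.

# Hypothetical illustration: linear search scans every record of the file.
def linear_search(records, predicate):
    """Scan all records and return those satisfying the selection condition."""
    return [rec for rec in records if predicate(rec)]

employees = [("Ann", 30), ("Bob", 45), ("Cara", 30)]
print(linear_search(employees, lambda rec: rec[1] == 30))   # [('Ann', 30), ('Cara', 30)]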


Selection Operation with Indexes

The index-based search algorithms are known as index scans. Such index structures
are called access paths, because they provide a path for locating and accessing the data
in the file. The following algorithms use an index during query processing (a small
dictionary-based sketch follows this list):

o Primary index, equality on key: We use the index to retrieve the single record
that satisfies the equality condition. The equality comparison is performed on
the key attribute on which the primary index is built.
o Primary index, equality on nonkey: The difference from equality on a key is
that multiple records may be fetched. We can fetch multiple records through a
primary index when the selection condition specifies an equality comparison on
a nonkey attribute.
o Secondary index, equality on key or nonkey: A selection that specifies an
equality condition can also use a secondary index. With this strategy, we retrieve
a single record when the equality is on a key, or multiple records when it is on a
nonkey. When retrieving a single record, the time cost is the same as for a
primary index. In the case of multiple records, the records may reside on
different blocks, which results in one I/O operation per fetched record, and each
I/O operation requires a seek and a block transfer.
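
As a rough illustration of an index-based equality lookup (an assumption of this rewrite, not the actual access-path structure), a Python dictionary can play the role of the index: it maps a key value directly to the matching record pointers instead of scanning the whole file.

# Hypothetical illustration: equality selection through an index structure.
records = [("E1", "CSE"), ("E2", "ECE"), ("E3", "CSE")]

# Secondary "index" on the branch attribute: value -> list of record positions.
branch_index = {}
for pos, rec in enumerate(records):
    branch_index.setdefault(rec[1], []).append(pos)

def index_lookup(value):
    """Fetch only the records whose pointers are stored under the key value."""
    return [records[pos] for pos in branch_index.get(value, [])]

print(index_lookup("CSE"))   # [('E1', 'CSE'), ('E3', 'CSE')]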

Selection Operations with Comparisons

To evaluate a selection based on a comparison, we can proceed either by using a linear
search or by using indices in the following ways:

o Primary index, comparison: When the selection condition is a comparison, we
can use a primary ordered index, such as a primary B+-tree index. For example,
when an attribute A of a relation R is compared with a given value v, as in A > v,
we use the primary index on A to locate the first tuple satisfying the condition,
and a file scan from that tuple up to the end of the file returns all tuples that
satisfy the selection.
o Secondary index, comparison: A secondary ordered index can be used for
selection operations involving <, ≤, >, or ≥. Here the scan searches the blocks of
the lowest-level index:
(<, ≤): the scan proceeds from the smallest value up to the given value v.
(>, ≥): the scan proceeds from the given value v up to the maximum value.

However, a secondary index should be used only when few records are selected. Such
an index stores a pointer to each record, so every fetched record may require its own
I/O operation, because the records may be stored on different blocks of the file. If the
number of fetched records is large, using the secondary index therefore becomes
expensive. (A sketch of such a range scan on an in-memory ordered index follows.)
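
The following Python sketch (illustrative only; the keys and record names are made up) mimics a comparison selection on an ordered index: the sorted key list stands in for the lowest level of a B+-tree, and bisect finds the starting point of the range scan.

# Hypothetical illustration: comparison selection (A > v) on an ordered index.
import bisect

sorted_keys = [5, 10, 15, 20, 25]          # ordered search-key values
key_to_record = {5: "r1", 10: "r2", 15: "r3", 20: "r4", 25: "r5"}

def select_greater_than(v):
    """Locate the first key > v, then scan to the end of the ordered index."""
    start = bisect.bisect_right(sorted_keys, v)
    return [key_to_record[k] for k in sorted_keys[start:]]

print(select_greater_than(12))   # ['r3', 'r4', 'r5']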

Implementing Complex Selection Operations

More complex selections involve three kinds of selection predicates: conjunction,
disjunction, and negation.

Conjunction: A conjunctive selection has the form

σθ1 ∧ θ2 ∧ … ∧ θn (r)

A conjunction is the intersection of all records that satisfy every condition θi.

Disjunction: A disjunctive selection has the form

σθ1 ∨ θ2 ∨ … ∨ θn (r)

A disjunction is the union of all records that satisfy at least one condition θi.

Negation: The result of the selection σ¬θ(r) is the set of tuples of the relation r for
which the condition θ evaluates to false. In the absence of nulls, this is simply the set
of tuples of r that are not in σθ(r).

Using these selection predicates, we can implement the selection operations with the
following algorithms:

o Conjunctive selection using one index: In this implementation, we first
determine whether an access path is available for any attribute; if one is found,
an index-based algorithm can be used for that condition. The selection is then
completed by testing, for each retrieved record, whether it satisfies the
remaining simple conditions. The cost of the chosen index algorithm gives the
cost of this method.
o Conjunctive selection via composite index: A composite index is an index built
on multiple attributes. Such an index may exist for some conjunctive selections.
If the selection specifies equality conditions on two or more attributes and a
composite index exists on these combined attribute fields, the index can be
searched directly. The type of the index determines which index-based
algorithm is used.
o Conjunctive selection via the intersection of identifiers: This implementation
uses record pointers (record identifiers). It requires indices with record pointers
on the fields involved in the individual selection conditions. Each index is
scanned for pointers to tuples that satisfy its individual condition; the
intersection of all the retrieved pointer sets is the set of pointers to tuples that
satisfy the whole conjunctive condition. The algorithm then uses these pointers
to fetch the actual records. If indices are not available for every individual
condition, the retrieved records are tested against the remaining conditions
(a sketch appears after this list).
o Disjunctive selection by the union of identifiers: This algorithm scans each
index for pointers to tuples that satisfy its individual condition, but it is
applicable only if access paths are available for all of the disjunctive conditions.
The union of the retrieved pointer sets yields pointers to all tuples that satisfy
the disjunctive condition, and these pointers are then used to fetch the actual
records. If an access path is missing for even one condition, a linear search must
be performed to find the tuples satisfying that condition, so in that case it is
better to evaluate the whole disjunction with a single linear search.
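
Here is a small, purely illustrative Python sketch of the "intersection of identifiers" approach described above: each per-condition index yields a set of record pointers, and only the pointers present in every set are fetched.

# Hypothetical illustration: conjunctive selection via intersection of pointers.
records = {1: ("CSE", "A"), 2: ("ECE", "B"), 3: ("CSE", "B"), 4: ("IT", "A")}

# Per-attribute indexes: value -> set of record identifiers (pointers).
branch_index = {"CSE": {1, 3}, "ECE": {2}, "IT": {4}}
section_index = {"A": {1, 4}, "B": {2, 3}}

def conjunctive_select(branch, section):
    """Intersect the pointer sets, then fetch only the surviving records."""
    pointers = branch_index.get(branch, set()) & section_index.get(section, set())
    return [records[p] for p in sorted(pointers)]

print(conjunctive_select("CSE", "B"))   # [('CSE', 'B')]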

Cost Estimation

Here, the overall cost of the algorithm is obtained by adding the cost of the individual
index scans to the cost of fetching the records in the intersection of the retrieved
pointer lists. We can reduce this cost by sorting the list of pointers and fetching the
records in sorted order, which gives two benefits:

o We can fetch all selected records of the block using a single I/O operation
because each pointer in the block appears together.
o The disk-arm movement gets minimized as blocks are read in sorted order.

Cost Estimation Chart for various Selection algorithms

Here, br is the number of blocks in the file, hi is the height of the index, b is the
number of blocks holding records with the specified search key, n is the number of
fetched records, tT is the time for one block transfer, and tS is the time for one seek.

Selection Algorithm | Cost | Why So?
Linear Search | tS + br * tT | One initial seek plus br block transfers.
Linear Search, Equality on Key | tS + (br/2) * tT | Average case: at most one record satisfies the condition, so the scan stops as soon as it is found.
Primary B+-tree index, Equality on Key | (hi + 1) * (tT + tS) | One seek and one block transfer per level of the tree, plus one more to fetch the record.
Primary B+-tree index, Equality on Nonkey | hi * (tT + tS) + b * tT | One seek and one transfer per level of the tree, then b transfers for the consecutive blocks holding the matching records.
Secondary B+-tree index, Equality on Key | (hi + 1) * (tT + tS) | One seek and one block transfer per level of the tree, plus one more to fetch the record.
Secondary B+-tree index, Equality on Nonkey | (hi + n) * (tT + tS) | One seek and one transfer per fetched record, because each record may be on a different block.
Primary B+-tree index, Comparison | hi * (tT + tS) + b * tT | One seek and one transfer per level of the tree, then b transfers for the consecutive blocks that satisfy the comparison.
Secondary B+-tree index, Comparison | (hi + n) * (tT + tS) | One seek and one transfer per fetched record, because each record may be on a different block.
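
The formulas in the chart can be evaluated directly. The Python sketch below (with made-up timing and size values, not figures from the notes) compares the estimated cost of a linear search with a secondary-index lookup on a nonkey attribute.

# Hypothetical illustration: plugging values into the selection cost formulas.
tS = 4e-3    # assumed time for one seek, in seconds
tT = 1e-4    # assumed time for one block transfer, in seconds

def linear_search_cost(br):
    return tS + br * tT                    # one initial seek + br transfers

def secondary_index_nonkey_cost(hi, n):
    return (hi + n) * (tT + tS)            # one seek + one transfer per record

print(linear_search_cost(br=10_000))               # ~1.004 seconds
print(secondary_index_nonkey_cost(hi=3, n=50))     # ~0.217 seconds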

PROJECTION IMPLEMENTATION ALGORITHMS


A query is a request for information. A query language is a language used to retrieve
information from a database.
Query languages are divided into two types −
 Procedural language
 Non-procedural language


Procedural language

Information is retrieved from the database by specifying the sequence of operations
to be performed.
For example − relational algebra.
Structured Query Language (SQL) is based on relational algebra.
Relational algebra consists of a set of operations that take one or two relations as
input and produce a new relation as output.

Types of relational algebra operations

The different types of relational algebra operations are as follows −


 Select operation
 Project operation
 Rename operation
 Union operation
 Intersection operation
 Difference operation
 Cartesian product operation
 Join operation
 Division operation
Select, project, and rename are unary operations (they operate on a single relation).

Projection operation

It displays specific columns of a relation. It is denoted by pi (∏). The result is a vertical
subset of the original relation, and duplicate tuples are eliminated.
Syntax
The syntax is as follows −
∏regno(student)

Example

Consider the student table:

Regno   Branch   Section
1       CSE      A
2       ECE      B
3       CIVIL    B
4       IT       A

To display the regno column of the student table, we can use the following command −
∏regno(student)
Output
RegNo
1
2
3
4
To display the branch and section columns of the student table, use the following command −
∏branch,section(student)
The result is as follows −

Branch Section

CSE A

ECE B

CIVIL B

IT A
To display regno, section of ECE students, use the following command −
∏regno,section(σbranch=ECE(student))
Output
Regno Section

2 B
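
For illustration only (not from the source), the projection ∏branch,section(student) can be mimicked in Python by keeping just the requested columns and eliminating duplicate tuples with a set.

# Hypothetical illustration: projection with duplicate elimination.
student = [
    (1, "CSE", "A"),
    (2, "ECE", "B"),
    (3, "CIVIL", "B"),
    (4, "IT", "A"),
]

def project(rows, *columns):
    """Keep only the requested column positions and drop duplicate tuples."""
    return sorted({tuple(row[c] for c in columns) for row in rows})

print(project(student, 1, 2))   # branch, section pairs without duplicates
print(project(student, 0))      # [(1,), (2,), (3,), (4,)] -- the regno column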

DATABASE SECURITY:


Database security refers to the array of controls, tools, and procedures designed to
ensure and safeguard confidentiality, integrity, and availability. This section
concentrates on confidentiality because it is the component most at risk in data
security breaches.

Security for databases must cover and safeguard the following aspects:

o The database containing data.


o Database management systems (DBMS)
o Any applications that are associated with it.
o The physical or virtual database server and the underlying hardware that
runs it.
o The infrastructure for computing or network that is used to connect to the
database.

Database security is a complicated and challenging task that involves all aspects of
security practices and technologies. It is inherently at odds with database accessibility:
the more usable and accessible the database is, the more susceptible it is to security
threats; and the more invulnerable it is to threats, the more difficult it is to access
and use.

Why Database Security is Important?

By definition, a data breach is a failure to maintain the confidentiality of the data in a
database. The amount of damage a data breach causes our business depends on a
number of consequences or factors.

o Intellectual property that is compromised: Our intellectual property--trade


secrets, inventions, or proprietary methods -- could be vital for our ability to
maintain an advantage in our industry. If our intellectual property has been
stolen or disclosed and our competitive advantage is lost, it could be difficult to
keep or recover.
o The damage to our brand's reputation: Customers or partners may not want
to purchase goods or services from us (or deal with our business) If they do not
feel they can trust our company to protect their data or their own.
o The concept of business continuity (or lack of it): Some businesses cannot
continue to function until a breach has been resolved.
o Penalties or fines to be paid for not complying: The cost of failing to comply
with international regulations such as the Sarbanes-Oxley Act (SOX) or the
Payment Card Industry Data Security Standard (PCI DSS), with industry-specific
data-privacy regulations such as HIPAA, or with regional privacy laws such as the
European Union's General Data Protection Regulation (GDPR) can be severe,
with fines in the worst cases exceeding several million dollars per violation.
o Costs for repairing breaches and notifying consumers about them: Alongside
notifying customers of a breach, the company that has been breached is
required to cover the investigation and forensic services such as crisis
management, triage repairs to the affected systems, and much more.

Common Threats and Challenges

Numerous software configurations that are not correct, weaknesses, or patterns of


carelessness or abuse can lead to a breach of security. Here are some of the most
prevalent kinds of reasons for security attacks and the reasons.

Insider Dangers

An insider threat is a security threat from any of three kinds of actors who hold
privileged access to the database:

o A malicious insider who intends to cause harm
o A negligent insider who makes mistakes that leave the database vulnerable to
attack
o An infiltrator: an outsider who obtains credentials through a scheme such as
phishing, or by gaining access to the credential store itself

Insider threats are among the most frequent causes of database security breaches,
and they often occur because too many employees have been granted privileged user
credentials.

Human Error

Accidental mistakes, weak passwords, password sharing, and other negligent or
uninformed user behaviours remain the cause of nearly half (49 percent) of all
reported data breaches.

Database Software Vulnerabilities can be Exploited

Hackers make their living by finding and exploiting vulnerabilities in software,
including database management software. The major commercial and open-source
database management platforms release regular security patches to address these
weaknesses, but failing to apply the patches promptly increases the risk of
compromise.

SQL/NoSQL Injection Attacks

A threat specific to databases is the injection of arbitrary SQL (or non-SQL) attack
strings into database queries, typically delivered through web applications or HTTP
headers. Organizations that do not follow secure coding practices for web applications
and do not perform regular vulnerability testing are open to these attacks; a
parameterized-query sketch follows.
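
As a minimal, illustrative defence against SQL injection (the table and values here are assumptions, not from the notes), queries should be parameterized rather than built by string concatenation. Python's built-in sqlite3 module is used for the sketch.

# Hypothetical illustration: parameterized queries prevent SQL injection.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin'), ('bob', 'staff')")

user_input = "alice' OR '1'='1"   # a typical injection attempt

# Unsafe string concatenation would splice the attack into the SQL text;
# the placeholder below passes user_input purely as data, never as SQL.
rows = conn.execute("SELECT role FROM users WHERE name = ?", (user_input,)).fetchall()
print(rows)   # [] -- no user is literally named "alice' OR '1'='1"
conn.close()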

Buffer Overflow Exploits

A buffer overflow occurs when a program attempts to copy more data into a
fixed-length block of memory than the block can hold. Attackers can use the excess
data, which spills into adjacent memory addresses, as a foundation from which to
launch attacks.

Denial of Service (DoS/DDoS) Attacks

In a denial-of-service (DoS) attack, the attacker floods the target server (in this case,
the database server) with so many requests that the server can no longer fulfil
legitimate requests from actual users. In many cases, the server becomes unstable or
fails altogether.

Malware

Malware is software designed to exploit vulnerabilities or cause harm to databases.


Malware can be accessed via any device that connects to the databases network.

Attacks on Backups

Companies that do not protect backup data using the same rigorous controls
employed to protect databases themselves are at risk of cyberattacks on backups.

The following factors amplify the threats:

o Data volumes are growing: Data capture, storage, and processing continue to
increase exponentially in almost all organizations. Any tools or methods must
be highly flexible to meet current as well as far-off needs.
o The infrastructure is sprawling: Network environments are becoming more
complicated, especially as companies move workloads into multi-cloud and
hybrid-cloud architectures, which makes the selection, deployment,
management, and administration of security solutions more difficult.
o More stringent regulatory compliance requirements: The worldwide
regulatory compliance landscape continues to grow in complexity, which makes
it harder to comply with every mandate.

Database Security Best Practices

As databases are almost always accessible via the network, any security risk to any
component or part of the infrastructure can threaten the database. Likewise, any
security attack that impacts a device or workstation could endanger the database.
Therefore, security for databases must go beyond the limits of the database.

In evaluating the security of databases in our workplace to determine our


organization's top priorities, look at each of these areas.

o Physical security: Whether the database server is on-premises or in a cloud
data centre, it should be located within a secure, climate-controlled
environment. (If the database server is in a cloud data centre, the cloud
provider handles physical security on our behalf.)
o Administrative and network access controls: The number of users granted
access to the database should be kept to a practical minimum, and their access
rights should be restricted to the minimum level required to perform their jobs.
Likewise, network access should be limited to the minimum permissions needed.
o End security of the user account or device: Be aware of who has access to the
database and when and how data is used. Monitoring tools for data can notify
you of data-related activities that are uncommon or seem to be dangerous. Any
device that connects to the network hosting the database must be physically
secured (in the sole control of the appropriate person) and be subject to
security checks throughout the day.
o Encryption: All data, including data stored in the database and credential
information, should be protected with best-in-class encryption both at rest and
in transit. All encryption keys should be handled in accordance with
best-practice guidelines.


o Security of databases using software: Always use the most current version of
our software to manage databases and apply any patches immediately after
they're released.
o Security for web server applications and websites: Any application or web
server that connects to the database could be a target and should be subjected
to periodic security testing and best practices management.
o Security of backups: All backups, images, or copies of the database should have
the identical (or equally rigorous) security procedures as the database itself.
o Auditing: Audits of database security standards should be conducted every few
months. Record all logins to the database server and the operating system, and
log all operations performed on sensitive data as well.

Data protection tools and platforms

Today, a variety of companies provide data protection platforms and tools. A


comprehensive solution should have all of the following features:

o Discovery: The ability to discover is often needed to meet regulatory


compliance requirements. Look for a tool that can detect and categorize
weaknesses across our databases, whether they're hosted in the cloud or on-
premises. It will also provide recommendations to address any vulnerabilities
that are discovered.
o Monitoring of Data Activity: The solution should be capable of monitoring and
analysing the entire data activity in all databases, whether our application is on-
premises, in the cloud, or inside a container. It will alert us to suspicious activity
in real-time to allow us to respond more quickly to threats. It also provides
visibility into the state of our information through an integrated and
comprehensive user interface. It is also important to choose a system that
enforces rules that govern policies, procedures, and the separation of duties.
Be sure that the solution we select is able to generate the reports we need to
comply with the regulations.
o The ability to Tokenize and Encrypt Data: In case of an incident, encryption is
an additional line of protection against any compromise. Any software we
choose must be flexible enough to protect data in cloud, on-premises, hybrid, or
multi-cloud environments. Look for a tool with volume, file, and application
encryption capabilities that meet our company's compliance requirements; this
could require tokenization (data masking) or advanced management of security keys.
o Optimization of Data Security and Risk Analysis: A tool that provides contextual
insights by combining security information with advanced analytics lets users
perform optimization, risk analysis, and reporting with ease. Choose a tool that
can retain and combine large volumes of current and historical data about the
security and state of your databases, and one that offers data exploration,
auditing, and reporting capabilities through a comprehensive but user-friendly
self-service dashboard.

ACCESS CONTROL

Database access control is a method of allowing access to company’s sensitive data


only to those people (database users) who are allowed to access such data and to
restrict access to unauthorized persons. It includes two main components:
authentication and authorization.

Authentication is a method of verifying the identity of a person who is accessing your


database. Note that authentication isn’t enough to protect data. An additional layer
of security is required, authorization, which determines whether a user should be
allowed to access the data or make the transaction he’s attempting. Without
authentication and authorization, there is no data security.
Any company whose employees connect to the internet (that is, every company today)
needs some level of access control in place.
Types of Access Control
Older access models include Discretionary Access Control (DAC) and Mandatory
Access Control (MAC). Role-Based Access Control (RBAC) is the most common
method today, and the most recent model is Attribute-Based Access Control (ABAC).


Discretionary Access Control (DAC)


With DAC models, the data owner allows access. DAC is a means of assigning access
rights based on user-specified rules.
Mandatory Access Control (MAC)
MAC was developed using a nondiscretionary model, in which people are granted
access based on an information clearance. MAC is a policy in which access rights are
assigned based on central authority regulations.
Role Based Access Control (RBAC)
RBAC grants access based on a user’s role and implements key security principles
such as “least privilege” and “separation of privilege.” Thus, someone attempting to
access information can only access data necessary for their role.
Attribute Based Access Control (ABAC)
In ABAC, each resource and user are assigned a series of attributes. In this dynamic
method, a comparative assessment of the user’s attributes, including time of day,
position and location, are used to make a decision on access to a resource.

MAC, RBAC, Authorization, SQL Injection Attacks


Identity and access management is an extremely vital part of information security. An
access control model is a framework that helps manage identity and access in an
organization. There are five main types of access control models: discretionary,
rule-based, role-based, attribute-based, and mandatory. Each model uses a different
method to control how subjects access objects; one may focus on rules, while another
focuses on the roles of the subject. As security professionals, we must know about all
of these access control models. A company may choose to implement one of them
depending on its culture, but there is no rule book that says you cannot implement
multiple models in your organization.

These models are built into the core or the kernel of the different operating systems
and possibly their supporting applications. Every operating system has a security
kernel that enforces a reference monitor concept, which differs depending on the type
of access control model embedded into the system. For every access attempt, before
a subject can communicate with an object, the security kernel reviews the rules of the
access control model to determine whether the request is allowed.

So let's understand what these models have to say about themselves:

1. Discretionary Access Control Model

If you have used any platform such as Windows, Mac or Linux, you can easily
understand and appreciate this model. If you create a folder in any of these, you can


easily add/delete/modify the permissions which you want to give to different subjects.
Sounds confusing? Well, it isn’t. Let’s take an example to understand this.

I have created a folder named "SSCP Video Course". Since I am the owner, it is at my
discretion to assign permissions to users. I can go to the "Security" tab, "Edit" the
permissions, and define which users should be given "Full control" and which users
should only be given "Read" access.
A system that uses discretionary access control (DAC) enables the owner of the
resource to specify which subjects can access specific resources. This model is called
discretionary because the control of access is based on the discretion of the owner.

There is another term which is used quite often with reference to the models. It is the
Access Control List. An ACL for a file would list all the users and/or groups that are
authorized access to the file and the specific access granted to each.

While all seems good in the world of DAC, there are some issues with this model.
Although DAC offers the most flexibility of any model, that flexibility is also its weakest
point. For example, if a user opens an attachment that is infected with a virus, the
point. For example, if a user opens an attachment that is infected with a virus, the
code can install itself in the background without the user being aware of this activity.
This code basically inherits all the rights and permissions that the user has and can
carry out all the activities a user can perform on the system. It can send copies of itself
out to all the contacts listed in the user’s e-mail client, install a back door, attack other
systems, delete files on the hard drive, and more. The user is actually giving rights to
the virus to carry out its dirty deeds, because the user has very powerful discretionary
rights and is considered the owner of many objects on the system. And the fact that
many users are assigned local administrator or root accounts means that once
malware is installed, it can do anything on a system.
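
A toy Python sketch (an assumption of this rewrite, not part of the source) of the discretionary idea: the owner of an object edits its access control list at their own discretion, and access checks consult that ACL.

# Hypothetical illustration: owner-managed ACL in a DAC-style model.
acl = {"SSCP Video Course": {"owner": "me", "read": {"me"}, "full_control": {"me"}}}

def grant(obj, requester, user, right):
    """Only the owner may change the ACL -- access is at the owner's discretion."""
    if requester != acl[obj]["owner"]:
        raise PermissionError("only the owner can grant rights")
    acl[obj].setdefault(right, set()).add(user)

def can(obj, user, right):
    """A user has a right if the ACL lists them for it (or for full control)."""
    entry = acl[obj]
    return user in entry.get(right, set()) or user in entry.get("full_control", set())

grant("SSCP Video Course", "me", "student1", "read")
print(can("SSCP Video Course", "student1", "read"))           # True
print(can("SSCP Video Course", "student1", "full_control"))   # False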

2. Mandatory Access Control (MAC) Model

Do not confuse this with the Apple Mac; this model is not even remotely related to it. This
model is the complete opposite of the DAC model. In a mandatory access control
(MAC) model, users do not have the discretion of determining who can access objects
as in a DAC model. An operating system that is based on a MAC model greatly reduces
the number of rights, permissions, and functionality a user has for security purposes.

You have surely seen movies where Ethan Hunt or Jason Bourne tries to access top
secret or confidential files that he does not have access to. The MAC model uses
security labels to implement this. Security labels are attached to all objects; thus,
every file, directory, and device has its own security label with its classification
information. Jason Bourne may have a security clearance of secret, but the files he
requests in order to find his identity may carry a security label with the classification
of top secret. In this case, he will be denied, because his clearance is not


equivalent or does not dominate (is not equal to or higher than) the classification of
the object.

This type of model is used in environments where information classification and


confidentiality is of utmost importance, such as military institutions, government
agencies, and government contract companies.

The problem with DAC was that the malware could inherit all permissions which the
user had and could install itself on the system. However, in the MAC systems, this isn’t
the case. Since users that work within a MAC system cannot install software, the
operating system does not allow any type of software, including malware, to be
installed while the user is logged in. But while MAC systems might seem an answer to
all our security prayers, they have very limited user functionality, require a lot of
administrative overhead, are very expensive, and are not user-friendly. DAC systems
are general-purpose computers, while MAC systems serve a very specific purpose.

3. Role Based Access Control (RBAC) Model

If you work in a large organization, you can easily see that awarding permissions user
by user would become a herculean task. Why not adopt a simpler approach? RBAC does
exactly that: it simplifies the process by allowing security administrators to grant the
same level of access permissions to an entire group. Suppose three people join your
organization: one in R&D, another in the HR department, and the third in a sales role.
As a security administrator using RBAC, you can award all of their access permissions
in a jiffy, simply by adding each of them to their respective group.

The moment the person is added to the HR group, the RBAC model, extends all
privileges in this group to the new person who has been added.

In a similar manner, if a person leaves the role, the security administrator just needs
to remove this person from the group and all privileges would be revoked
immediately.
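
A short illustrative Python sketch (the role names and permissions are assumptions) of the RBAC idea: permissions hang off roles, so adding a person to a group grants everything at once, and removing them revokes everything at once.

# Hypothetical illustration: role-based access control.
role_permissions = {
    "HR":    {"view_payroll", "edit_employee_records"},
    "R&D":   {"read_research_docs"},
    "Sales": {"view_crm"},
}
user_roles = {"asha": {"HR"}}          # new joiner added to the HR group

def allowed(user, permission):
    """A user holds a permission only if one of their roles grants it."""
    return any(permission in role_permissions[r] for r in user_roles.get(user, set()))

print(allowed("asha", "view_payroll"))       # True
user_roles["asha"].discard("HR")             # the person leaves the role
print(allowed("asha", "view_payroll"))       # False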

4. Rule-Based Access Control Model

This model is based on the simple premise "If X, then Y". Rules are pre-defined, and
access permissions are granted according to those rules. For example, firewalls
include a set of rules, or filters, within an ACL defined by an administrator. The
firewall examines all the traffic going through it and only allows traffic that matches
one of the rules. Firewalls also include a final rule (referred to as the implicit deny
rule) denying all other traffic. For example, the last rule might


be “deny all” to indicate the firewall should block all traffic in or out of the network
that wasn’t previously allowed by another rule.

If this sounds too complex, then consider the following statement. If the role id of the
person is XXX such as RND for research development, you can give him default access
to the top 3 basic folders. The system would follow this simple rule and access would
be provided only to such people.

Rule-based access allows a developer to define specific and detailed situations in


which a subject can or cannot access an object, and what that subject can do once
access is granted. Traditionally, Rule-based access control has been used in MAC
systems as an enforcement mechanism for the complex rules of access that MAC
systems provide.

While the rule-based model is commonly used in firewalls and routers to implement
access control, a basic drawback is that it becomes very complex and unproductive to
provide finer granularity of access. In simple terms, rule-based models enforce rules
on all users, and if you start devising rules for fine granularity, the system loses its
ease of use.

5. Attribute-Based Access Control (ABAC) Model


Consider this example – A person who belongs to the R&D department, working on
the secret project of Genetic Mutation and having the role of super senior specialist
and above must only be able to access the website “GeneticResearch.com” only from
office premises and from timings 0900 hours to 1700 hours.

Well, this sounds quite simple to implement, doesn't it? Yet if you tried to configure it
through rule-based access control, it would become quite complex.

To solve such complex access requirements, we have the Attribute-Based Access
Control (ABAC) model. It is an advanced implementation of the RBAC model.
Attributes can be almost any characteristic of users, the network, and the devices on
the network, and it is these attributes that are evaluated to implement the ABAC
model.

How would ABAC solve the above problem? Well, you just need to enter the attributes
of the person and then configure the rule. The attributes which can be extracted from
the problem are R&D, Genetic Mutation, Super Senior Specialist, Office Network &
Timings. If a person has these attributes which can be added or removed at any point
in time, he can access the website and the firewall will not block it.
From a rule-based perspective, every attribute would have to be checked against a
rule, which would consume a lot of resources when access decisions need to be made
in a jiffy. A user will not wait an hour to find out whether or not he can


access a website; the system must evaluate the request against the rules and
attributes and provide an answer immediately.

Hence, in a nutshell, the Rule-based access control applies to all users, but the ABAC
can be much more specific.
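
The ABAC decision in the Genetic Mutation example can be sketched as a simple attribute check in Python; everything here (attribute names, values, the 0900-1700 window, the extra "Director" grade) is illustrative, not taken from a real product or from the notes.

# Hypothetical illustration: attribute-based access control decision.
def abac_allows(subject, resource, environment):
    """Grant access only when every required attribute matches the policy."""
    return (subject.get("department") == "R&D"
            and subject.get("project") == "Genetic Mutation"
            and subject.get("grade") in {"Super Senior Specialist", "Director"}
            and environment.get("network") == "office"
            and 9 <= environment.get("hour", -1) < 17
            and resource == "GeneticResearch.com")

subject = {"department": "R&D", "project": "Genetic Mutation",
           "grade": "Super Senior Specialist"}
print(abac_allows(subject, "GeneticResearch.com", {"network": "office", "hour": 10}))  # True
print(abac_allows(subject, "GeneticResearch.com", {"network": "home", "hour": 10}))    # False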

From an exam perspective, these models are extremely important. As a security


professional, you must have clarity on these models to help guide organizations as to
which combination of these IAM models would support their work culture best.
