Transactions
A transaction is a program comprising a collection of database operations executed
as a single logical unit of data processing. The operations performed in a transaction
include one or more database operations such as insert, delete, update or retrieve.
It is an atomic unit of work that is either carried out to completion in its entirety or
not performed at all. A transaction that only retrieves data without updating any
data is called a read-only transaction.
Each high level operation can be divided into a number of low level tasks or
operations. For example, a data update operation can be divided into three tasks
−
• read_item() − reads the data item from storage into main memory.
• modify_item() − changes the value of the item in main memory.
• write_item() − writes the modified value from main memory back to storage.
Database access is restricted to read_item() and write_item() operations;
for every transaction, read and write form the basic database operations.
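As a minimal illustrative sketch (not the API of any particular DBMS), the low-level tasks behind a single update can be written out explicitly; the storage dictionary and the item name X below are assumptions made purely for illustration:

storage = {"X": 500}    # stands in for disk-resident data
buffer = {}             # main-memory copies of items

def read_item(name):
    """Copy a data item from storage into main memory."""
    buffer[name] = storage[name]

def write_item(name):
    """Write the (possibly modified) in-memory value back to storage."""
    storage[name] = buffer[name]

# An "update X by subtracting 200" operation decomposed into the three tasks:
read_item("X")          # read_item(): storage -> main memory
buffer["X"] -= 200      # modify_item(): change the value in memory
write_item("X")         # write_item(): main memory -> storage
print(storage["X"])     # 300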
Concurrency problems in DBMS Transactions
When multiple transactions execute concurrently in an uncontrolled or
unrestricted manner, several problems can arise. These problems are
commonly referred to as concurrency problems in a database environment. The five
concurrency problems that can occur in a database are:
(1). Temporary Update Problem
(2). Incorrect Summary Problem
(3). Lost Update Problem
(4). Unrepeatable Read Problem
(5). Phantom Read Problem
These are explained below.
1. Temporary Update Problem:
The temporary update (dirty read) problem occurs when one transaction
updates an item and then fails, but the updated item is read by another
transaction before the item is reverted to its previous value.
Example:
T1: read_item(X) → X = 500; X = X − N (N = 200); write_item(X) → X = 300; T1 then fails.
T2: read_item(X) → reads the dirty value X = 300; X = X + M (M = 100); write_item(X) → X = 400.
In the above example, if transaction 1 fails for some reason then X will
revert back to its previous value. But transaction 2 has already read
the incorrect value of X.
2. Incorrect Summary Problem:
Consider a situation, where one transaction is applying the aggregate
function on some records while another transaction is updating these
records. The aggregate function may calculate some values before
the values have been updated and others after they are updated.
Example:
Suppose transaction 2 is calculating the sum of a set of records while
transaction 1 is updating them. The aggregate function may then include
some values from before they were updated and others from after they
were updated.
3. Lost Update Problem:
In the lost update problem, update done to a data item by a
transaction is lost as it is overwritten by the update done by another
transaction.
Example:
Suppose transaction 1 changes the value of X, but this change is
overwritten by the update that transaction 2 makes to X. The update
made by transaction 1 is therefore lost.
4. Unrepeatable Read Problem:
The unrepeatable read problem occurs when two or more read operations
of the same transaction read different values of the same variable.
Example:
Suppose transaction 2 reads the variable X, after which a write
operation in transaction 1 changes the value of X. When transaction 2
performs another read of X, it sees the new value written by
transaction 1.
5. Phantom Read Problem:
The phantom read problem occurs when a transaction reads a
variable once but when it tries to read that same variable again, an
error occurs saying that the variable does not exist.
Example:
Suppose transaction 2 reads the variable X, and then transaction 1
deletes X without transaction 2’s knowledge. When transaction 2 tries
to read X again, it is unable to find it.
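A minimal sketch of the lost update problem with the interleaving written out step by step; the item name and the values are illustrative assumptions:

# Two transactions both read X, compute locally, then write back.
# Because their steps interleave, T1's update is lost.
storage = {"X": 100}

# T1: read X                 (X = 100 in T1's workspace)
t1_x = storage["X"]
# T2: read X                 (X = 100 in T2's workspace, before T1 writes)
t2_x = storage["X"]
# T1: X = X - 40; write X    (storage now 60)
storage["X"] = t1_x - 40
# T2: X = X + 30; write X    (storage now 130 -- T1's subtraction is lost)
storage["X"] = t2_x + 30

print(storage["X"])  # 130, although a serial execution would give 90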
Transaction Operations
The low level operations performed in a transaction are −
• begin_transaction − A marker that specifies the start of transaction
execution.
• read_item or write_item − Database operations that may be interleaved
with main-memory operations as part of the transaction.
• end_transaction − A marker that specifies the end of the transaction.
• commit − A signal to specify that the transaction has been successfully
completed in its entirety and will not be undone.
• rollback − A signal to specify that the transaction has been unsuccessful
and so all temporary changes in the database are undone. A committed
transaction cannot be rolled back.
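As a concrete, simplified illustration of commit and rollback, the sketch below uses Python's built-in sqlite3 module; the table name and the values are assumptions made for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 500)")
conn.commit()

try:
    # begin_transaction: sqlite3 starts a transaction implicitly on the first write
    conn.execute("UPDATE account SET balance = balance - 200 WHERE id = 1")
    conn.commit()        # commit: the change is recorded permanently and cannot be undone
except sqlite3.Error:
    conn.rollback()      # rollback: all temporary changes made by the transaction are undone

print(conn.execute("SELECT balance FROM account WHERE id = 1").fetchone())  # (300,)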
Database systems, like any other computer systems, are subject to failures,
but the data stored in them must be available as and when required. When a
database fails, it must provide facilities for fast recovery. It must also ensure
atomicity: either a transaction completes successfully and is committed
(its effect is recorded permanently in the database), or it has no effect on the
database at all.
There are both automatic and non-automatic ways of backing up data and
recovering from failure situations. The techniques used to recover data lost
due to system crashes, transaction errors, viruses, catastrophic failures,
incorrect command execution and so on are called database recovery
techniques. To prevent data loss, recovery techniques based on deferred
update, immediate update, or backing up data can be used.
Recovery techniques are heavily dependent upon the existence of a special file
known as a system log. It contains information about the start and end of
each transaction and any updates which occur in the transaction. The log
keeps track of all transaction operations that affect the values of database
items. This information is needed to recover from transaction failure.
The log is kept on disk. Typical log entries are:
• start_transaction(T): This log entry records that transaction T starts
its execution.
• read_item(T, X): This log entry records that transaction T reads the
value of database item X.
• write_item(T, X, old_value, new_value): This log entry records that
transaction T changes the value of the database item X from
old_value to new_value. The old value is sometimes known as a
before an image of X, and the new value is known as an afterimage of
X.
• commit(T): This log entry records that transaction T has completed
all accesses to the database successfully and its effect can be
committed (recorded permanently) to the database.
• abort(T): This records that transaction T has been aborted.
• checkpoint: A checkpoint marks a point up to which all earlier log
records and the corresponding updates have been written permanently
to disk. It declares a point before which the DBMS was in a
consistent state and all transactions had been committed.
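A minimal sketch of what such log records might look like in memory; the field names and layout are assumptions made for illustration and are not taken from any particular DBMS:

from dataclasses import dataclass
from typing import Optional, Any, List

@dataclass
class LogRecord:
    kind: str                      # "start", "read", "write", "commit", "abort", "checkpoint"
    txn: Optional[str] = None      # transaction identifier, e.g. "T1"
    item: Optional[str] = None     # database item, e.g. "X"
    old_value: Any = None          # before-image (write records only)
    new_value: Any = None          # after-image (write records only)

# A possible log for one small transaction:
log: List[LogRecord] = [
    LogRecord("start", "T1"),
    LogRecord("write", "T1", "X", old_value=500, new_value=300),
    LogRecord("commit", "T1"),
]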
A transaction T reaches its commit point when all its operations that access
the database have been executed successfully i.e. the transaction has reached
the point at which it will not abort (terminate without completing). Once
committed, the transaction is permanently recorded in the database.
Commitment always involves writing a commit entry to the log and writing the
log to disk. At the time of a system crash, the log is searched backward for
all transactions T that have written a start_transaction(T) entry into the log but
have not yet written a commit(T) entry; these transactions may have to be
rolled back to undo their effect on the database during the recovery process.
• Undoing – If a transaction crashes, the recovery manager may
undo the transaction, i.e. reverse its operations. This involves
examining the log for each write_item(T, X, old_value, new_value)
entry of the transaction and setting the value of item X in the
database back to old_value.
There are two major techniques for recovery from non-catastrophic
transaction failures: deferred update and immediate update.
• Deferred update – This technique does not physically update the
database on disk until a transaction has reached its commit point.
Before reaching commit, all transaction updates are recorded in the
local transaction workspace. If a transaction fails before reaching its
commit point, it will not have changed the database in any way so
UNDO is not needed. It may be necessary to REDO the effect of the
operations that are recorded in the local transaction workspace,
because their effect may not yet have been written in the database.
Hence, deferred update is also known as the NO-UNDO/REDO
algorithm.
• Immediate update – In the immediate update, the database may
be updated by some operations of a transaction before the
transaction reaches its commit point. However, these operations are
recorded in a log on disk before they are applied to the database,
making recovery still possible. If a transaction fails before reaching its
commit point, the effects of its operations must be undone, i.e. the
transaction must be rolled back; hence we require both undo and
redo. This technique is known as the UNDO/REDO algorithm.
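A minimal sketch of UNDO/REDO recovery driven by such a log under the immediate-update approach; the log layout and the in-memory "database" dictionary are simplifying assumptions made for illustration:

# Each log entry: (kind, txn, item, old_value, new_value); only "write" uses the values.
log = [
    ("start",  "T1", None, None, None),
    ("write",  "T1", "X", 500, 300),
    ("commit", "T1", None, None, None),
    ("start",  "T2", None, None, None),
    ("write",  "T2", "Y", 70, 90),       # T2 never committed before the crash
]
database = {"X": 300, "Y": 90}           # state on disk at the time of the crash

committed = {e[1] for e in log if e[0] == "commit"}

# UNDO: scan backward, restoring old values written by uncommitted transactions.
for kind, txn, item, old, new in reversed(log):
    if kind == "write" and txn not in committed:
        database[item] = old

# REDO: scan forward, reapplying new values written by committed transactions.
for kind, txn, item, old, new in log:
    if kind == "write" and txn in committed:
        database[item] = new

print(database)  # {'X': 300, 'Y': 70}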
• Caching/Buffering – In this technique, one or more disk pages that include
the data items to be updated are cached into main-memory buffers and
updated in memory before being written back to disk. A collection of
in-memory buffers called the DBMS cache is kept under the control of the
DBMS for holding these pages. A directory is used to keep track of which
database items are in the buffers. A dirty bit is associated with each buffer:
it is 0 if the buffer has not been modified and 1 if it has.
• Shadow paging – This technique provides atomicity and durability. A directory
with n entries is constructed, where the ith entry points to the ith
database page on disk. When a transaction begins executing, the
current directory is copied into a shadow directory. When a page is to
be modified, a new page is allocated and the changes are made there;
when the transaction is ready to become durable, the directory entries
that referred to the original pages are updated to refer to the new
replacement pages.
Some of the backup techniques are as follows :
• Full database backup – The full database, including the data and the
metadata needed to restore the whole database (such as full-text
catalogs), is backed up on a predefined schedule.
• Differential backup – It stores only the data changes that have
occurred since the last full database backup. When the same data has
changed many times since the last full backup, a differential backup
stores only the most recent version of the changed data. To restore
from it, the full database backup must be restored first.
• Transaction log backup – This backs up all events that have occurred in
the database, i.e. a record of every statement executed. It is a backup
of the transaction log entries and contains all transactions that have
been applied to the database. With it, the database can be recovered
to a specific point in time. It is even possible to take a backup of the
transaction log after the data files are destroyed, so that not even a
single committed transaction is lost.
Transaction States
A transaction may go through a subset of five states, active, partially committed,
committed, failed and aborted.
• Active − The initial state; the transaction enters this state as soon as it
starts executing, and it remains in this state while it is executing read,
write or other operations.
• Partially Committed − The transaction enters this state after the last
statement of the transaction has been executed.
• Committed − The transaction enters this state after successful completion
of the transaction and system checks have issued commit signal.
• Failed − The transaction goes from partially committed state or active
state to failed state when it is discovered that normal execution can no
longer proceed or system checks fail.
• Aborted − The state after the transaction has been rolled back following a
failure and the database has been restored to the state it was in before the
transaction began.
The following state transition diagram depicts the states of a transaction and
the low-level transaction operations that cause the changes in state.
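A minimal sketch of the same states and transitions as a lookup table; the event names mirror the operations listed above, and the dictionary layout is an assumption made for illustration:

# Allowed state transitions of a transaction, keyed by (state, event).
TRANSITIONS = {
    ("active", "end_transaction"): "partially committed",
    ("active", "failure"):         "failed",
    ("partially committed", "commit"):  "committed",
    ("partially committed", "failure"): "failed",
    ("failed", "rollback"):        "aborted",
}

def next_state(state: str, event: str) -> str:
    """Return the next transaction state, or raise if the transition is illegal."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event!r} in state {state!r}")

state = "active"
for event in ("end_transaction", "commit"):
    state = next_state(state, event)
print(state)  # committed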
Desirable Properties of Transactions
Any transaction must maintain the ACID properties, viz. Atomicity, Consistency,
Isolation, and Durability.
• Atomicity − This property states that a transaction is an atomic unit of
processing, that is, either it is performed in its entirety or not performed
at all. No partial update should exist.
• Consistency − A transaction should take the database from one consistent
state to another consistent state. It should not adversely affect any data
item in the database.
• Isolation − A transaction should be executed as if it is the only one in the
system. There should not be any interference from the other concurrent
transactions that are simultaneously running.
• Durability − If a committed transaction brings about a change, that
change should be durable in the database and not lost in case of any
failure.
Types of Schedules
There are two types of schedules −
• Serial Schedules − In a serial schedule, at any point of time, only one
transaction is active, i.e. there is no overlapping of transactions. This is
depicted in the following graph −
• Parallel Schedules − In parallel schedules, more than one transaction is
active simultaneously, i.e. the transactions contain operations that
overlap in time. This is depicted in the following graph −
Conflicts in Schedules
In a schedule comprising multiple transactions, a conflict occurs when two
active transactions perform non-compatible operations. Two operations are said
to be in conflict when all of the following three conditions hold simultaneously
−
• The two operations are parts of different transactions.
• Both the operations access the same data item.
• At least one of the operations is a write_item() operation, i.e. it tries to
modify the data item.
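A minimal sketch of this conflict test as a small predicate; representing each operation as a (transaction, action, item) tuple is an assumption made here for illustration:

def conflicts(op1, op2) -> bool:
    """Return True if two operations conflict.

    Each operation is a (transaction, action, item) tuple,
    e.g. ("T1", "W", "X") for write_item(X) in T1.
    """
    t1, a1, x1 = op1
    t2, a2, x2 = op2
    return (
        t1 != t2                      # parts of different transactions
        and x1 == x2                  # access the same data item
        and ("W" in (a1, a2))         # at least one is a write_item()
    )

print(conflicts(("T1", "R", "X"), ("T2", "W", "X")))  # True
print(conflicts(("T1", "R", "X"), ("T2", "R", "X")))  # False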
Serializability
A serializable schedule of ‘n’ transactions is a parallel schedule which is
equivalent to a serial schedule comprising the same ‘n’ transactions. A
serializable schedule retains the correctness of a serial schedule while
achieving the better CPU utilization of a parallel schedule.
Equivalence of Schedules
Equivalence of two schedules can be of the following types −
• Result equivalence − Two schedules producing identical results are said
to be result equivalent.
• View equivalence − Two schedules that perform similar action in a similar
manner are said to be view equivalent.
• Conflict equivalence − Two schedules are said to be conflict equivalent
if both contain the same set of transactions and have the same order of
conflicting pairs of operations.
Concurrency controlling techniques ensure that multiple transactions are
executed simultaneously while maintaining the ACID properties of the
transactions and serializability in the schedules.
View Serializability in DBMS
View serializability is a concept that is used to compute whether schedules are
View-Serializable or not. A schedule is said to be View-Serializable if it is
view equivalent to a Serial Schedule (where no interleaving of transactions is
possible).
Why do we need to use View-Serializability ?
There may be schedules that are not Conflict-Serializable but still give a
consistent result, because the concept of Conflict-Serializability is limited
when the precedence graph of a schedule contains a loop/cycle. In such a
case we cannot predict whether the schedule would be consistent or inconsistent.
As per the concept of Conflict-Serializability, we can say that a schedule is
Conflict-Serializable (and hence equivalent to a serial, consistent schedule) iff its
corresponding precedence graph does not have any loop/cycle.
But what if a schedule’s precedence graph contains a cycle/loop and yet the
schedule gives the same consistent, accurate result that a conflict-serializable
schedule would? To address such cases the concept of View-Serializability
was introduced, so that serializability is not confined to Conflict-Serializability
alone.
Example: The given schedule S1: r1(X); r2(Z); r1(Z); r3(X); r3(Y); w1(X); w3(Y); r2(Y);
w2(Z); w2(Y);
Step 1 -
For better understanding, we represent the given schedule S1 in transaction-table
format as follows:
T1        T2        T3
R(X)
          R(Z)
R(Z)
                    R(X)
                    R(Y)
W(X)
                    W(Y)
          R(Y)
          W(Z)
          W(Y)
Step 2 -
List all the conflicting operations and determine the dependency between the
transactions.
Therefore,
Conflicting Operations Dependencies between the Transactions
R1(Z), W2(Z) T1 → T2
R3(X), W1(X) T3 → T1
R3(Y), W2(Y) T3 → T2
W3(Y), W2(Y) T3 → T2
Step 3 -
Draw the Precedence Graph for schedule S1 based on the above transaction
dependencies.
Therefore, the precedence graph contains the edges T1 → T2, T3 → T1 and T3 → T2.
Step 4 -
Check whether any cycle is present in the graph or not.
• Clearly, there exists no cycle in the precedence graph.
• Therefore, the given schedule S1 is Conflict Serializable.
• By Rule 1 of view serializability, all Conflict Serializable schedules are View
Serializable.
• Therefore, the given schedule S1 is also View Serializable.
Schedule S1 is Conflict as well as View Serializable.
Step 5 -
Now, schedule S1 is conflict/view serializable, therefore find out the equivalent serial
schedule.
• All the possible topological orderings of the above precedence graph will be the
possible equivalent serialized schedules.
• Therefore, the topological orderings can be found by performing the Topological
Sort of the above precedence graph as follows:
After performing the topological sort, the possible equivalent serial schedule
obtained for S1 is T3 → T1 → T2.
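A minimal sketch of Steps 2–5 in code: build the precedence graph from the conflicting operations of S1, check it for a cycle, and topologically sort it. The (transaction, action, item) encoding of the schedule is an assumption made for illustration:

from graphlib import TopologicalSorter, CycleError

# Schedule S1: r1(X); r2(Z); r1(Z); r3(X); r3(Y); w1(X); w3(Y); r2(Y); w2(Z); w2(Y)
S1 = [("T1","R","X"), ("T2","R","Z"), ("T1","R","Z"), ("T3","R","X"), ("T3","R","Y"),
      ("T1","W","X"), ("T3","W","Y"), ("T2","R","Y"), ("T2","W","Z"), ("T2","W","Y")]

def precedence_edges(schedule):
    """Edge Ti -> Tj for every pair of conflicting operations where Ti's comes first."""
    edges = set()
    for i, (ti, ai, xi) in enumerate(schedule):
        for tj, aj, xj in schedule[i + 1:]:
            if ti != tj and xi == xj and "W" in (ai, aj):
                edges.add((ti, tj))
    return edges

edges = precedence_edges(S1)
print(sorted(edges))  # [('T1', 'T2'), ('T3', 'T1'), ('T3', 'T2')]

# graphlib expects, for each node, the set of nodes that must come before it.
preds = {t: set() for t, _, _ in S1}
for u, v in edges:
    preds[v].add(u)

try:
    print(list(TopologicalSorter(preds).static_order()))  # ['T3', 'T1', 'T2']
except CycleError:
    print("cycle found: not conflict serializable")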
Example : Understanding View-Serializability first with a Schedule S1 :
T1 T2 T3
a=100
read(a)
a=a-40
write(a) //60
a=a-40
write(a) //20
a=a-20
write(a) //0
So, its Conflict Precedence Graph is as follows –
The above graph contains a cycle/loop, which means the schedule is not
conflict-serializable. But this does not mean that it cannot be consistent and
equivalent to a serial schedule; it may or may not be.
Look at Schedule S’1:
If we swap some of the transactions’ operations in the above example, the
table looks like this –
T1 T2 T3
a=100
read(a) //100
a=a-40
write(a) //60
a=a-40
write(a) //20
a=a-20
write(a) //0
Its Precedence Graph is as follows –
Now we see that the precedence graph of the second table does not contain
any cycle/loop, which means it is conflict serializable (equivalent to a serial
schedule, and consistent), and the final result is the same as that of the first table.
Note: From the above example we see that if a schedule is Conflict-Serializable,
we can easily predict that it would be –
1. Equivalent to a serial schedule,
2. Consistent,
3. And also View-Serializable.
But what if it is not conflict serializable (its precedence graph contains a loop)?
In that situation we cannot predict whether it is consistent and serializable or not.
As we saw in the above example, Schedule S1 gave a consistent result,
equivalent to the result of Schedule S’1, despite its precedence graph
containing a cycle/loop. So, to address this limitation of the
Conflict-Serializability concept, the View-Serializability method came into the
picture.
Methods to check View-Serializability of a schedule –
Method-1 :
Two schedules S1 and S2 are said to be view-equivalent if the following
conditions are satisfied –
1) Initial Read
If a transaction T1 reads data item A from the database (i.e. reads its initial
value) in S1, then in S2 also T1 should read A from the database.
T1 T2 T3
-------------------
R(A)
W(A)
R(A)
R(B)
Transaction T2 is reading A from database.
2) Updated Read
If Ti reads A, and A was updated by Tj, in S1, then in S2 also Ti should read
the value of A updated by Tj.
        S1                            S2
T1      T2      T3            T1      T2      T3
W(A)                          W(A)
        W(A)                                  R(A)
                R(A)                  W(A)
The above two schedules are not view-equivalent: in S1, T3 reads A updated
by T2, whereas in S2, T3 reads A updated by T1.
3) Final Write operation
If transaction T1 performs the final write on A in S1, then in S2 also T1 should
perform the final write operation on A.
    S1                    S2
T1      T2            T1      T2
R(A)                  R(A)
        W(A)          W(A)
W(A)                          W(A)
The above two schedules are not view-equivalent because the final write
operation in S1 is done by T1 while in S2 it is done by T2.
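A minimal sketch of these three checks in code; schedules are encoded as (transaction, action, item) tuples, an assumption made here for illustration:

def view_info(schedule):
    """Collect the initial read, read-from pairs and final write per data item."""
    initial_read, reads_from, last_writer = {}, set(), {}
    for txn, action, item in schedule:
        if action == "R":
            if item in last_writer:
                reads_from.add((txn, last_writer[item], item))   # updated read
            else:
                initial_read.setdefault(item, txn)               # initial read
        elif action == "W":
            last_writer[item] = txn
    return initial_read, reads_from, last_writer                  # last_writer = final writes

def view_equivalent(s1, s2):
    return view_info(s1) == view_info(s2)

# "Updated Read" example: S1 = W1(A), W2(A), R3(A);  S2 = W1(A), R3(A), W2(A)
S1 = [("T1", "W", "A"), ("T2", "W", "A"), ("T3", "R", "A")]
S2 = [("T1", "W", "A"), ("T3", "R", "A"), ("T2", "W", "A")]
print(view_equivalent(S1, S2))  # False: T3 reads A from T2 in S1 but from T1 in S2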
Method-2 :
First of all, check whether the given schedule is Non-Conflict Serializable or
Conflict-Serializable –
• If the given schedule is conflict serializable (means its precedence
graph does not contain any loop/cycle), then the given schedule must
be a view serializable. Stop and submit your final answer.
• If the given schedule is non-conflict serializable, then it may or may
not be view serializable. We cannot predict it just by using the
concept of conflict serializability, So we need to look at the below
cases.
If, after performing the above steps, you find that the given schedule is not
conflict serializable, you need to perform the following steps –
Blind write: A write (update) operation performed on a data item without
first reading it is known as a blind write.
• If no blind write exists, then the schedule is definitely not
View-Serializable. Stop and submit your final answer.
• If there exists any blind write, then the schedule may or may not be
view serializable, so we need to look at the case below. (If it did not
contain any blind write, we could state with certainty that the
schedule would not be View-Serializable.)
• If the above two conditions have not settled the question, draw a
dependency graph from the read-from and final-write dependencies
(this is different from the precedence graph, as the example below shows).
If no cycle/loop exists in that graph, the schedule is View-Serializable;
otherwise it is not.
Problem : Prove whether the given schedule is View-Serializable or not?
S' : read1(A), write2(A), read3(A), write1(A), write3(A)
Solution: First of all, we make a table for a better understanding of the given
transactions of schedule S’ –
T1          T2          T3
read(A)
            write(A)
                        read(A)
write(A)
                        write(A)
• First we check whether the schedule is Conflict-Serializable or not, because if
it is Conflict-Serializable it will also be View-Serializable; for this we draw the
precedence graph for schedule S’. The graph contains a cycle (for example,
read1(A) before write2(A) gives T1 → T2, while write2(A) before write1(A)
gives T2 → T1), so S’ is not Conflict-Serializable.
• Next we check whether the schedule S’ contains any blind write. We find
that S’ contains a blind write, write2(A), in transaction T2. Hence the
schedule S’ may or may not be View-Serializable, and we must look at the
dependency graph. (If it did not contain any blind write, we could state
with certainty that the schedule would not be View-Serializable.)
• Now we draw a dependency graph, which is different from the
precedence graph.
Its dependency graph is derived as follows –
• Transaction T1 first reads data item “A”, and transaction T2 performs the
first update (write) of “A”.
• So transaction T1 must execute before T2.
• In that way, we get the dependency T1 → T2 in the graph.
• The final update (write) on “A” is made by transaction T3.
• So transaction T3 must execute after all the other transactions (T1, T2).
• Thus, we get the dependencies (T1, T2) → T3 in the graph.
As there is no cycle/loop in the dependency graph, the schedule S’ is
View-Serializable.
Conflict Serializability in DBMS
As discussed under concurrency control, serial schedules have lower resource utilization
and low throughput. To improve this, two or more transactions are run concurrently.
But concurrent execution of transactions may lead to inconsistency in the database.
To avoid this, we need to check whether such concurrent schedules are serializable or not.
Conflict Serializable: A schedule is called conflict serializable if it can be
transformed into a serial schedule by swapping non-conflicting operations.
Conflicting operations: Two operations are said to be conflicting if all
conditions satisfy:
• They belong to different transactions
• They operate on the same data item
• At Least one of them is a write operation
Examples –
• The pair (R1(A), W2(A)) is conflicting because the operations belong to two
different transactions, act on the same data item A, and one of them is a
write operation.
• Similarly, the pairs (W1(A), W2(A)) and (W1(A), R2(A)) are also conflicting.
• On the other hand, the pair (R1(A), W2(B)) is non-conflicting because the
operations act on different data items.
• Similarly, the pair (W1(A), W2(B)) is non-conflicting.
Consider the following schedule:
S1: R1(A), W1(A), R2(A), W2(A), R1(B), W1(B), R2(B), W2(B)
If Oi and Oj are two operations in a transaction and Oi < Oj (Oi is executed before
Oj), the same order will be followed in the schedule as well. Using this property, we
can extract the two transactions of schedule S1 as:
T1: R1(A), W1(A), R1(B), W1(B)
T2: R2(A), W2(A), R2(B), W2(B)
Possible Serial Schedules are: T1->T2 or T2->T1
-> Swapping non-conflicting operations, R1(B) can be moved before W2(A) and
R2(A) in S1 (it conflicts with neither, since they act on different data items), so
the schedule becomes,
S11: R1(A), W1(A), R1(B), R2(A), W2(A), W1(B), R2(B), W2(B)
-> Similarly, swapping non-conflicting operations to move W1(B) before W2(A)
and R2(A) in S11, the schedule becomes,
S12: R1(A), W1(A), R1(B), W1(B), R2(A), W2(A), R2(B), W2(B)
S12 is a serial schedule in which all operations of T1 are performed before any
operation of T2 starts. Since S1 has been transformed into the serial schedule
S12 by swapping non-conflicting operations, S1 is conflict serializable.
Let us take another Schedule:
S2: R2(A), W2(A), R1(A), W1(A), R1(B), W1(B), R2(B), W2(B)
Two transactions will be:
T1: R1(A), W1(A), R1(B), W1(B)
T2: R2(A), W2(A), R2(B), W2(B)
Possible Serial Schedules are: T1->T2 or T2->T1
Original Schedule is:
S2: R2(A), W2(A), R1(A), W1(A), R1(B), W1(B), R2(B), W2(B)
In S2, the conflicting pairs on data item A (for example W2(A) before R1(A))
require T2 to come before T1 in any conflict-equivalent serial schedule, while
the conflicting pairs on data item B (for example W1(B) before R2(B)) require
T1 to come before T2. These two requirements contradict each other, and
conflicting operations can never be swapped, so S2 cannot be transformed into
either serial schedule T1->T2 or T2->T1 by swapping non-conflicting
operations. So S2 is not conflict serializable.
Conflict Equivalent: Two schedules are said to be conflict equivalent when one
can be transformed to another by swapping non-conflicting operations. In the
example discussed above, S11 is conflict equivalent to S1 (S1 can be converted
to S11 by swapping non-conflicting operations). Similarly, S11 is conflict
equivalent to S12 and so on.
Note 1: Although S2 is not conflict serializable, it is still conflict equivalent to
every schedule that can be obtained from it by swapping its non-conflicting
operations; none of those schedules, however, is a serial schedule.
Note 2: A schedule that is conflict serializable is always conflict equivalent
to one of the serial schedules. Schedule S1 discussed above (which is conflict
serializable) is conflict equivalent to the serial schedule (T1->T2).
Question: Consider the following schedules involving two transactions.
Which one of the following statements is true?
S1: R1(X) R1(Y) R2(X) R2(Y) W2(Y) W1(X)
S2: R1(X) R2(X) R2(Y) W2(Y) R1(Y) W1(X)
• Both S1 and S2 are conflict serializable
• Only S1 is conflict serializable
• Only S2 is conflict serializable
• None
[GATE 2007]
Solution: Two transactions of given schedules are:
T1: R1(X) R1(Y) W1(X)
T2: R2(X) R2(Y) W2(Y)
Let us first check serializability of S1:
S1: R1(X) R1(Y) R2(X) R2(Y) W2(Y) W1(X)
To convert it to a serial schedule, we have to swap non-conflicting operations so
that S1 becomes equivalent to the serial schedule T1->T2 or T2->T1. Here, to
obtain T1->T2 we would have to swap R2(X) and W1(X), and to obtain T2->T1
we would have to swap R1(Y) and W2(Y); both pairs are conflicting. So S1 can’t
be converted to a serial schedule.
Now, let us check serializability of S2:
S2: R1(X) R2(X) R2(Y) W2(Y) R1(Y) W1(X)
Swapping non conflicting operations R1(X) and R2(X) of S2, we get
S2’: R2(X) R1(X) R2(Y) W2(Y) R1(Y) W1(X)
Again, swapping non-conflicting operations R1(X) and R2(Y) of S2’, we get
S2’’: R2(X) R2(Y) R1(X) W2(Y) R1(Y) W1(X)
Again, swapping non-conflicting operations R1(X) and W2(Y) of S2’’, we get
S2’’’: R2(X) R2(Y) W2(Y) R1(X) R1(Y) W1(X)
which is equivalent to a serial schedule T2->T1.
So, correct option is C. Only S2 is conflict serializable.
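A minimal sketch that re-checks both schedules from this question by building their precedence graphs and testing for a cycle; the tuple encoding of the schedules is, as before, an assumption made for illustration:

from graphlib import TopologicalSorter, CycleError

S1 = [("T1","R","X"), ("T1","R","Y"), ("T2","R","X"),
      ("T2","R","Y"), ("T2","W","Y"), ("T1","W","X")]
S2 = [("T1","R","X"), ("T2","R","X"), ("T2","R","Y"),
      ("T2","W","Y"), ("T1","R","Y"), ("T1","W","X")]

def conflict_serializable(schedule) -> bool:
    """True if the schedule's precedence graph is acyclic."""
    preds = {t: set() for t, _, _ in schedule}
    for i, (ti, ai, xi) in enumerate(schedule):
        for tj, aj, xj in schedule[i + 1:]:
            if ti != tj and xi == xj and "W" in (ai, aj):
                preds[tj].add(ti)          # edge Ti -> Tj
    try:
        list(TopologicalSorter(preds).static_order())
        return True
    except CycleError:
        return False

print(conflict_serializable(S1))  # False: T1 -> T2 (R1(Y)/W2(Y)) and T2 -> T1 (R2(X)/W1(X))
print(conflict_serializable(S2))  # True: only T2 -> T1, equivalent serial schedule T2 -> T1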
Concurrency-control protocols : allow concurrent schedules, but ensure that
the schedules are conflict/view serializable, and are recoverable and maybe even
cascadeless.
These protocols do not examine the precedence graph as it is being created;
instead, a protocol imposes a discipline that avoids non-serializable schedules.
Different concurrency control protocols provide different trade-offs between the
amount of concurrency they allow and the amount of overhead that they impose.
Different categories of protocols:
▪ Lock Based Protocol
• Basic 2-PL
• Conservative 2-PL
• Strict 2-PL
• Rigorous 2-PL
▪ Graph Based Protocol
▪ Time-Stamp Ordering Protocol
▪ Multiple Granularity Protocol
▪ Multi-version Protocol
Lock Based Protocols –
A lock is a variable associated with a data item that describes the status of the data
item with respect to the operations that can be applied to it. Locks
synchronize the access of concurrent transactions to the database items. This
protocol requires that all data items be accessed in a mutually exclusive manner.
Let us introduce the two common locks that are used and some terminology
followed in this protocol.
1. Shared Lock (S): also known as Read-only lock. As the name
suggests it can be shared between transactions because while holding
this lock the transaction does not have the permission to update data on
the data item. S-lock is requested using lock-S instruction.
2. Exclusive Lock (X): The data item can be both read and written.
This lock is exclusive and cannot be held simultaneously with any other lock
on the same data item. An X-lock is requested using the lock-X instruction.
Lock Compatibility Matrix –
• A transaction may be granted a lock on an item if the requested lock is
compatible with locks already held on the item by other
transactions.
• Any number of transactions can hold shared locks on an item, but if
any transaction holds an exclusive (X) lock on the item, no other
transaction may hold any lock on it.
• If a lock cannot be granted, the requesting transaction is made to
wait till all incompatible locks held by other transactions have been
released. Then the lock is granted.
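A minimal sketch of the lock compatibility matrix and the grant test it implies; the dictionary encoding is an illustrative assumption:

# (held lock, requested lock) -> may the request be granted?
COMPATIBLE = {
    ("S", "S"): True,   # any number of shared locks may coexist
    ("S", "X"): False,  # an X request is blocked by an existing S lock
    ("X", "S"): False,  # an S request is blocked by an existing X lock
    ("X", "X"): False,  # two X locks can never coexist
}

def can_grant(requested: str, held_modes: list) -> bool:
    """A lock is granted only if it is compatible with every lock already held."""
    return all(COMPATIBLE[(held, requested)] for held in held_modes)

print(can_grant("S", ["S", "S"]))  # True: shared locks are compatible
print(can_grant("X", ["S"]))       # False: the requester must wait until S is released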
Upgrade / Downgrade locks: A transaction that holds a lock on an
item A is allowed, under certain conditions, to change the lock from
one state to another.
Upgrade: An S(A) lock can be upgraded to X(A) if Ti is the only transaction
holding the S-lock on item A.
Downgrade: We may downgrade X(A) to S(A) when we no longer want
to write data item A. As we were already holding the X-lock on A, we
need not check any further conditions.
So, by now we have been introduced to the types of locks and how to
apply them. But wait, if our problems could be avoided just by
applying locks, life would be so simple! If you have studied
Process Synchronization under OS you must be familiar with two
persistent problems, starvation and deadlock! We’ll discuss
them shortly, but just so you know, we have to apply locks and they
must follow a set of protocols to avoid such undesirable problems.
Shortly we’ll use Two-Phase Locking (2-PL), which builds on the concept of
locks to guarantee serializability. Applying simple locking on its own, we may
not always produce serializable results; it may also lead to deadlock or
inconsistency.
Problem With Simple Locking…
Consider the Partial Schedule:
       T1              T2
1      lock-X(B)
2      read(B)
3      B := B - 50
4      write(B)
5                      lock-S(A)
6                      read(A)
7                      lock-S(B)
8      lock-X(A)
9      ……              ……
9 …… ……
Deadlock – Consider the above execution phase. T1 holds an
exclusive lock on B, and T2 holds a shared lock on A. At
statement 7, T2 requests a lock on B, while at statement
8, T1 requests a lock on A. This, as you may notice, creates
a deadlock, as neither can proceed with its execution.
Starvation – Starvation is also possible if the concurrency control manager is badly
designed. For example, a transaction may be waiting for an X-lock on
an item while a sequence of other transactions request and are
granted an S-lock on the same item. This can be avoided if the
concurrency control manager is properly designed.
Two Phase Locking Protocol
We have discussed briefly the first type of Concurrency Control Protocol, i.e.,
Lock-based Protocol.
Now, recalling where we last left off, there are two types of locks
available, Shared S(a) and Exclusive X(a). Implementing this lock system
without any further restrictions gives us the simple lock-based protocol (or binary
locking), but it has its own disadvantage: it does not guarantee
serializability. Schedules may follow the locking rules, yet a non-serializable
schedule may still result.
To guarantee serializability, we must follow some additional protocol concerning
the positioning of locking and unlocking operations in every transaction. This is
where the concept of Two-Phase Locking(2-PL) comes into the picture, 2-PL
ensures serializability. Now, let’s dig deep!
Two-Phase Locking –
A transaction is said to follow the Two-Phase Locking protocol if all of its locking
and unlocking operations are performed in two phases.
1. Growing Phase: New locks on data items may be acquired but none
can be released.
2. Shrinking Phase: Existing locks may be released but no new locks
can be acquired.
Note – If lock conversion is allowed, then upgrading of lock( from S(a) to X(a) )
is allowed in the Growing Phase, and downgrading of lock (from X(a) to S(a))
must be done in shrinking phase.
Let’s see a transaction implementing 2-PL.
       T1              T2
1      lock-S(A)
2                      lock-S(A)
3      lock-X(B)
4      …….             ……
5      Unlock(A)
6                      Lock-X(C)
7      Unlock(B)
8                      Unlock(A)
9                      Unlock(C)
10     …….             ……
This is just a skeleton transaction that shows how unlocking and locking work
with 2-PL. Note for:
Transaction T1:
• The growing Phase is from steps 1-3.
• The shrinking Phase is from steps 5-7.
• Lock Point at 3
Transaction T2:
• The growing Phase is from steps 2-6.
• The shrinking Phase is from steps 8-9.
• Lock Point at 6
What is LOCK POINT? The Point at which the growing phase ends, i.e., when a
transaction takes the final lock it needs to carry on its work. Now look at the
schedule, you’ll surely understand.
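A minimal sketch of how the two-phase rule could be enforced per transaction; the class and method names are assumptions made for illustration, not a real DBMS API:

class TwoPhaseTxn:
    """Tracks a transaction's locks and rejects lock requests after the first unlock."""

    def __init__(self, name):
        self.name = name
        self.locks = {}          # item -> "S" or "X"
        self.shrinking = False   # becomes True at the first unlock

    def lock(self, item, mode):
        if self.shrinking:
            raise RuntimeError(f"{self.name}: 2-PL violation, no locks in the shrinking phase")
        self.locks[item] = mode

    def unlock(self, item):
        self.shrinking = True    # growing phase ends; the lock point has been passed
        del self.locks[item]

t1 = TwoPhaseTxn("T1")
t1.lock("A", "S")
t1.lock("B", "X")    # growing phase
t1.unlock("A")       # shrinking phase begins
# t1.lock("C", "X")  # would raise: violates two-phase locking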
It is said that 2-PL ensures serializability, but there are still some drawbacks of
2-PL. Let’s glance at the drawbacks:
• Cascading Rollback is possible under 2-PL.
• Deadlocks and Starvation are possible.
Cascading Rollbacks in 2-PL –
Let’s see the following Schedule:
Take a moment to analyze the schedule. Yes, you’re correct: because of the dirty
reads in T2 and T3 at lines 8 and 12 respectively, when T1 fails we have to roll
back the others as well. Hence, cascading rollbacks are possible in 2-PL. I have
taken skeleton schedules as examples because it’s easy to understand when it’s
kept simple; when explained with real transaction problems with many
variables, it becomes very complex.
Deadlock in 2-PL –
Consider this simple example, it will be easy to understand. Say we have two
transactions T1 and T2.
Schedule: Lock-X1(A) Lock-X2(B) Lock-X1(B) Lock-X2(A)
Drawing the wait-for graph, you can detect the loop (T1 waits for T2’s lock on B
while T2 waits for T1’s lock on A). So deadlock is also possible in 2-PL.
Two-phase locking may also limit the amount of concurrency that occurs in a
schedule because a Transaction may not be able to release an item after it has
used it. This may be because of the protocols and other restrictions we may put
on the schedule to ensure serializability, deadlock freedom, and other factors.
This is the price we have to pay to ensure serializability and the other factors;
hence it can be considered a trade-off between concurrency and maintaining the
ACID properties.
The above-mentioned type of 2-PL is called Basic 2PL. To sum it up it ensures
Conflict Serializability but does not prevent Cascading Rollback and Deadlock.
Timestamp Concurrency Control Algorithms
Timestamp-based concurrency control algorithms use a transaction’s timestamp
to coordinate concurrent access to a data item to ensure serializability. A
timestamp is a unique identifier given by DBMS to a transaction that represents
the transaction’s start time.
These algorithms ensure that transactions commit in the order dictated by their
timestamps. An older transaction should commit before a younger transaction,
since the older transaction enters the system before the younger one.
Timestamp-based concurrency control techniques generate serializable schedules
such that the equivalent serial schedule is arranged in order of the age of the
participating transactions.
Some of timestamp based concurrency control algorithms are −
• Basic timestamp ordering algorithm.
• Conservative timestamp ordering algorithm.
• Multiversion algorithm based upon timestamp ordering.
Timestamp-based ordering follows three rules to enforce serializability −
• Access Rule − When two transactions try to access the same data item
simultaneously, for conflicting operations, priority is given to the older
transaction. This causes the younger transaction to wait for the older
transaction to commit first.
• Late Transaction Rule − If a younger transaction has written a data item,
then an older transaction is not allowed to read or write that data item.
This rule prevents the older transaction from committing after the younger
transaction has already committed.
• Younger Transaction Rule − A younger transaction can read or write a
data item that has already been written by an older transaction.
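A minimal sketch of the checks made by the basic timestamp-ordering algorithm for a read and a write; the read_ts/write_ts dictionaries and the function names are illustrative assumptions:

# Per-item bookkeeping: the largest timestamps that have read / written the item so far.
read_ts = {}    # item -> largest timestamp of a transaction that read it
write_ts = {}   # item -> largest timestamp of a transaction that wrote it

def read(item, ts):
    """Basic TO read rule: abort if a younger transaction already wrote the item."""
    if ts < write_ts.get(item, 0):
        return "abort"                      # the older transaction arrives too late to read
    read_ts[item] = max(read_ts.get(item, 0), ts)
    return "ok"

def write(item, ts):
    """Basic TO write rule: abort if a younger transaction already read or wrote the item."""
    if ts < read_ts.get(item, 0) or ts < write_ts.get(item, 0):
        return "abort"
    write_ts[item] = ts
    return "ok"

print(write("X", ts=5))   # ok: T5 writes X
print(read("X", ts=3))    # abort: older T3 tries to read X after younger T5 wrote it
print(read("X", ts=7))    # ok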
Optimistic Concurrency Control Algorithm
In systems with low conflict rates, the task of validating every transaction for
serializability may lower performance. In these cases, the test for serializability
is postponed to just before commit. Since the conflict rate is low, the probability
of aborting transactions which are not serializable is also low. This approach is
called optimistic concurrency control technique.
In this approach, a transaction’s life cycle is divided into the following three
phases −
• Execution Phase − A transaction fetches data items to memory and
performs operations upon them.
• Validation Phase − A transaction performs checks to ensure that
committing its changes to the database passes serializability test.
• Commit Phase − A transaction writes back the modified data items from
memory to the disk.
This algorithm uses three rules to enforce serializability in validation phase −
Rule 1 − Given two transactions Ti and Tj, if Ti is reading the data item which
Tj is writing, then Ti’s execution phase cannot overlap with Tj’s commit phase.
Tj can commit only after Ti has finished execution.
Rule 2 − Given two transactions Ti and Tj, if Ti is writing the data item that Tj is
reading, then Ti’s commit phase cannot overlap with Tj’s execution phase. Tj can
start executing only after Ti has already committed.
Rule 3 − Given two transactions Ti and Tj, if Ti is writing the data item which Tj is
also writing, then Ti’s commit phase cannot overlap with Tj’s commit phase. Tj can
start to commit only after Ti has already committed.
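A minimal sketch of a validation-phase test in the spirit of these rules; each transaction is represented as a dictionary of read/write sets and phase timestamps, a layout assumed purely for illustration:

def validate(tj, committed):
    """Return True if Tj may commit, given already-validated transactions Ti."""
    for ti in committed:
        if ti["finish"] < tj["start"]:
            continue                                   # Ti finished before Tj began: no overlap
        if ti["finish"] < tj["validation"] and not (ti["write_set"] & tj["read_set"]):
            continue                                   # Ti finished writing before Tj validates
                                                       # and did not write anything Tj read
        return False                                   # phases overlap on conflicting items: abort Tj
    return True

t1 = {"read_set": {"X"}, "write_set": {"X"}, "start": 1, "validation": 5, "finish": 6}
t2 = {"read_set": {"X", "Y"}, "write_set": {"Y"}, "start": 4, "validation": 7, "finish": 8}
print(validate(t2, [t1]))  # False: T1 wrote X, which T2 read while T1 was still finishing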
OBJECT-ORIENTED DATABASE(OODB)
An object-oriented database combines object-oriented programming concepts with
database capabilities. Various items created using object-oriented programming
languages like C++ and Java can be stored in relational databases, but
object-oriented databases are better suited for such items.
An object-oriented database is organized around objects rather than actions,
and data rather than logic. For example, a multimedia record in a relational
database can be a definable data object, as opposed to an alphanumeric value.
WEB BASED DATABASE
The web-based database management system is an essential part of DBMS
usage and is used to store web application data. A web-based database
management system handles databases that hold data for e-commerce,
e-business, blogs, e-mail, and other online applications.
An example of where a web database may be used is for an online forum. Forum
software often creates a database with a number of tables, including one for users,
posts, and settings. It is important for the relationships between the database
tables to be properly set and defined so that posts and users can be linked together
easily. In some cases, web databases can be bought with information already
included. For example, a database could include a list of all the dentists in the US
along with state and address. These databases are commonly integrated into
related websites using PHP and HTML, along with additional content.
DATA WAREHOUSE
A Data Warehouse (DW) is a relational database that is designed for query
and analysis rather than transaction processing. It includes historical data
derived from transaction data from single and multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data
and focuses on providing support for decision-makers for data modelling
and analysis.
A Data Warehouse is a group of data specific to the entire organization,
not only to a particular group of users.
It is not used for daily operations and transaction processing but used for
making decisions.
A Data Warehouse can be viewed as a data system with the following
attributes:
o It is a database designed for investigative tasks, using data from
various applications.
o It supports a relatively small number of clients with relatively long
interactions.
o It includes current and historical data to provide a historical
perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.
GOALS OF DATA WAREHOUSING
o To help reporting as well as analysis
o Maintain the organization's historical information
o Be the foundation for decision making.
DATA MINING
Data mining is the process of investigating data from various perspectives to
discover hidden patterns and categorize them into useful information. This
information is collected and assembled in particular areas such as data warehouses,
supports efficient analysis and decision making, and ultimately helps to cut costs and
generate revenue.
Data mining is the act of automatically searching large stores of information for
trends and patterns that go beyond simple analysis procedures. Data mining uses
complex mathematical algorithms to segment the data and evaluate the probability of
future events. Data mining is also called Knowledge Discovery in Databases (KDD).
Data Mining is a process used by organizations to extract specific data from huge
databases to solve business problems. It primarily turns raw data into useful
information.
Data Mining is similar to Data Science carried out by a person, in a specific situation,
on a particular data set, with an objective. This process includes various types of
services such as text mining, web mining, audio and video mining, pictorial data
mining, and social media mining. It is done through software that is simple or highly
specific. By outsourcing data mining, all the work can be done faster with low operation
costs. Specialized firms can also use new technologies to collect data that is impossible
to locate manually. There are tonnes of information available on various platforms, but
very little knowledge is accessible. The biggest challenge is to analyze the data to
extract important information that can be used to solve a problem or for company
development. There are many powerful instruments and techniques available to mine
data and find better insight from it.
Basis of Comparison – Data Warehousing vs. Data Mining
• Definition: A data warehouse is a database system designed for analytical
work instead of transactional work. Data mining is the process of analyzing
data patterns.
• Process: In data warehousing, data is stored periodically. In data mining,
data is analyzed regularly.
• Purpose: Data warehousing is the process of extracting and storing data to
allow easier reporting. Data mining is the use of pattern-recognition logic to
identify patterns.
• Managing Authorities: Data warehousing is carried out solely by engineers.
Data mining is carried out by business users with the help of engineers.
• Data Handling: Data warehousing is the process of pooling all relevant data
together. Data mining is considered the process of extracting data from
large data sets.
• Functionality: Subject-oriented, integrated, time-varying and non-volatile
data constitute data warehouses. AI, statistics, databases, and machine
learning systems are all used in data mining technologies.
• Task: Data warehousing is the process of extracting and storing data in
order to make reporting more efficient. Pattern-recognition logic is used in
data mining to find patterns.
• Uses: A data warehouse extracts data and stores it in an orderly format,
making reporting easier and faster. Data mining employs pattern-recognition
tools to aid in the identification of access patterns.
• Examples: A data warehouse adds value when it is connected with
operational business systems like CRM (Customer Relationship Management)
systems. Data mining aids in the creation of suggestive patterns of key
parameters; customer purchasing behavior, items, and sales are examples,
allowing businesses to make the required adjustments to their operations and
production.
QUERY OPTIMIZATION:
Query optimization is the process of choosing an efficient way to execute SQL
statements. A query optimizer decides the best method for implementing each query.
The query optimizer selects, for instance, whether or not to use indexes for a given
query, and which join methods to use when joining multiple tables. These decisions
have a tremendous effect on SQL performance, and query optimization is a key
technology for every application. The query optimizer uses the following two
techniques to determine which plan or expression to consider for evaluating the
query.
There are two methods of query optimization.
1. Cost-based Optimization (Physical)
This is based on the cost of the query. The query can use different paths based
on indexes, constraints, sorting methods etc. This method mainly uses the
statistics like record size, number of records, number of records per block,
number of blocks, table size, whether whole table fits in a block, organization
of tables, uniqueness of column values, size of columns etc.
Suppose, we have series of table joined in a query.
T1 ∞ T2 ∞ T3 ∞ T4 ∞ T5 ∞ T6
For the above query we can have any order of evaluation: we can start by taking
any two tables in any order and evaluate the query from there. In general, we can
have join combinations in (2(n−1))! / (n−1)! ways. For example, suppose we have 5
tables involved in the join; then we can have 8! / 4! = 1680 combinations. But
when the query optimizer runs, it does not always evaluate all of these orders. It
uses dynamic programming, in which it generates the costs for the join orders of
every combination of tables; these are calculated and generated only once. The
least cost for each combination of tables is then stored in the database and reused,
i.e. given a set of tables T = {T1, T2, T3 … Tn}, it generates the least-cost
combination for all the tables and stores it.
• Dynamic Programming
As we learnt above, the least cost for the joins of any combination of table is
generated here. These values are stored in the database and when those
tables are used in the query, this combination is selected for evaluating the
query.
While generating the costs, it follows the steps below:
Suppose we have a set of tables, T = {T1, T2, T3 … Tn}, in a database. The optimizer
picks the first table and computes the cost of joining it with each of the rest of the
tables in set T. It calculates the cost for each of the tables and then chooses the
best (lowest) cost. It continues doing the same with the rest of the tables in set T.
This generates 2^n − 1 cases, and the lowest cost for each is selected and stored.
When a query uses those tables, these stored costs are looked up and that
combination is used to evaluate the query. This is called dynamic programming.
In this method, the time required to find the optimized query is of the order of 3^n,
where n is the number of tables. Suppose we have 5 tables; then the time required
is 3^5 = 243, which is less than finding all the combinations of tables and then
deciding the best combination (1680). Also, the space required for computing and
storing the costs is less, of the order of 2^n; in the above example, it is
2^5 = 32.
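A minimal sketch of this dynamic-programming enumeration over subsets of tables; the row counts and the join cost formula below are made-up placeholders, purely for illustration:

from itertools import combinations
from math import prod

tables = {"T1": 100, "T2": 30, "T3": 500, "T4": 10}   # table -> estimated row count

def rows(subset):
    """Toy cardinality estimate for a joined subset: product of base row counts."""
    return prod(tables[t] for t in subset)

best = {frozenset([t]): (0, t) for t in tables}        # subset -> (cost so far, plan string)

# Dynamic programming: cheapest left-deep plan for every subset of tables.
for k in range(2, len(tables) + 1):
    for subset in map(frozenset, combinations(tables, k)):
        choices = []
        for last in subset:                             # table joined last
            rest = subset - {last}
            rest_cost, rest_plan = best[rest]
            join_cost = rest_cost + rows(rest) * tables[last]   # toy join cost
            choices.append((join_cost, f"({rest_plan} ∞ {last})"))
        best[subset] = min(choices)                     # keep only the cheapest plan

cost, plan = best[frozenset(tables)]
print(plan, cost)   # joins the small tables first under this toy cost model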
• Left Deep Trees
This is another method of determining the cost of joins. Here, the tables
and joins are represented in the form of a tree. A join always forms the root
of the (sub)tree, and a base table is kept on the right-hand side of the join. The
left-hand side of the join points to the next join, so the tree grows deeper and
deeper on the left. Hence it is called a left-deep tree.
Here, instead of calculating the best join cost for every set of tables, the best cost
of joining each individual table with the current result is calculated. In this method,
the time required to find the optimized query is of the order of n·2^n, where n is the
number of tables. Suppose we have 5 tables; then the time required is 5·2^5 = 160,
which is less than with dynamic programming. Also, the space required for computing
and storing the costs is again of the order of 2^n; in the above example it is
2^5 = 32, the same as for dynamic programming.
• Interesting Sort Orders
This method is an enhancement to dynamic programming. Here, while
calculating the best join-order costs, it also considers sorted tables, on the
assumption that calculating the join orders on sorted inputs will be more
efficient. For example, suppose we have unsorted tables T1, T2, T3 … Tn and a
join on these tables:
(T1 ∞ T2) ∞ T3 ∞ … ∞ Tn
This method uses the hash join or merge join method to calculate the cost. A hash
join simply joins the tables, while a merge join produces sorted output but is
costlier than a hash join at this stage. Even so, when the plan moves on to join with
the third table, that join needs less effort to sort its input, because its first input is
already the sorted result of the first two tables. Hence it can reduce the total cost
of the query.
But the number of tables involved in the join would be relatively less and this
cost/space difference will be hardly noticeable.
All these cost-based optimizations are expensive and are suitable for queries over
large amounts of data. There is another method of optimization called heuristic
optimization, which is cheaper to perform than cost-based optimization.
2. Heuristic Optimization (Logical)
This method is also known as rule-based optimization. It is based on equivalence
rules over relational expressions, so the number of query combinations to consider
is reduced, and hence the cost of the query is reduced too.
This method creates a relational-algebra tree for the given query based on the
equivalence rules. The equivalence rules, by providing alternative ways of writing and
evaluating the query, give a better path to evaluate it. These rules need not give the
best plan in all cases, so the result needs to be examined after applying them. The
most important rules followed in this method are listed below:
• Perform all the selection operations as early as possible in the
query. This should be the first and foremost set of actions on the tables
in the query. By performing the selection operations early, we reduce
the number of records involved in the query, rather than carrying the
whole tables throughout the query.
Suppose we have a query to retrieve the students with age 18 who are studying in
class DESIGN_01. We can get the student details from the STUDENT table and the
class details from the CLASS table, and we can write this query in two different
ways, as written out below. Both queries return the same result, but when we
observe them closely we can see that the first query joins the two tables first and
then applies the filters; it therefore traverses the whole tables to perform the join,
so the number of records involved is larger. The second query applies the filters on
each table first, which reduces the number of records in each table (in the CLASS
table, the number of records reduces to one in this case!), and only then joins these
intermediate tables. Hence the cost in this case is comparatively lower. Rather than
the written query, the optimizer works with the corresponding relational algebra
expressions and tree for this case.
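In the notation used later in this section, the two forms of this query can be written roughly as follows; the column names AGE and CLASS_NAME and the join column CLASS_ID are assumptions made for illustration:
σ AGE=18 ∧ CLASS_NAME=DESIGN_01 (STUDENT ∞ CLASS_ID CLASS)    Not so efficient
(σ AGE=18 (STUDENT)) ∞ CLASS_ID (σ CLASS_NAME=DESIGN_01 (CLASS))    Efficient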
• Perform all the projection as early as possible in the query. This is
similar to selection but will reduce the number of columns in the
query.
Suppose for example, we have to select only student name, address and class
name of students with age 18 from STUDENT and CLASS tables.
Here again, both queries look alike and produce the same results. But when we
compare the number of records and attributes involved at each stage, the second
query uses fewer records and attributes and is hence more efficient.
• The next step is to perform the most restrictive joins and selection
operations. By most restrictive we mean those joins and selections
over tables and views that result in comparatively few records. Any
query performs better when tables with few records are joined.
Hence, throughout the heuristic method of optimization, the rules are
designed to produce as few records as possible at each stage, so that
query performance is better. The same applies here.
Suppose we have STUDENT, CLASS and TEACHER tables. Any student can attend
only one class in an academic year and only one teacher takes a class. But a class
can have more than 50 students. Now we have to retrieve STUDENT_NAME,
ADDRESS, AGE, CLASS_NAME and TEACHER_NAME of each student in a school.
∏ STD_NAME, ADDRESS, AGE, CLASS_NAME, TEACHER_NAME ((STUDENT ∞ CLASS_ID CLASS) ∞ TECH_ID TEACHER)    Not so efficient
∏ STD_NAME, ADDRESS, AGE, CLASS_NAME, TEACHER_NAME (STUDENT ∞ CLASS_ID (CLASS ∞ TECH_ID TEACHER))    Efficient
In the first query, the join first combines the student records with each class, which
results in a very large intermediate table; this table is then joined with another small
table, so the number of records traversed is larger. In the second query, CLASS and
TEACHER are joined first, a one-to-one relationship here, so the intermediate result
stays small; joining it with the STUDENT table then gives the final result. Hence this
second method is more efficient.
• Sometimes we can combine the above heuristic steps with cost-based
optimization techniques to get better results.
None of these methods is guaranteed to be best in every case; the outcome also
depends on the table size, column size, the type of selection, projection, join and
sort, constraints, indexes, statistics, etc. The approaches above describe the general
ways of optimizing queries.