UNIT II
TRANSACTION PROCESSING:
A transaction is a program comprising a collection of database operations, executed as
a logical unit of data processing. The operations performed in a transaction include
one or more database operations such as insert, delete, update or retrieve. It is an
atomic process that is either performed in its entirety or not performed at all. A
transaction involving only data retrieval without any data update is called a read-only
transaction.
Each high level operation can be divided into a number of low level tasks or operations.
For example, a data update operation can be divided into three tasks −
read_item() − reads data item from storage to main memory.
modify_item() − change value of item in the main memory.
write_item() − write the modified value from main memory to storage.
Database access is restricted to read_item() and write_item() operations. Likewise, for
all transactions, read and write form the basic database operations.
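To make the decomposition concrete, the following is a minimal sketch in Python, with one dict standing in for disk storage and another for the main-memory buffer; the data item name and the amounts are assumptions made purely for illustration.

# Low-level tasks of a data update operation (illustrative sketch).
disk = {"A": 300}     # stable storage
buffer = {}           # main memory

def read_item(name):
    # reads a data item from storage into main memory
    buffer[name] = disk[name]

def modify_item(name, delta):
    # changes the value of the item in main memory only
    buffer[name] = buffer[name] + delta

def write_item(name):
    # writes the modified value from main memory back to storage
    disk[name] = buffer[name]

read_item("A")
modify_item("A", -50)   # e.g. debit 50
write_item("A")
print(disk)             # {'A': 250}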
TRANSACTION OPERATIONS
TRANSACTION STATES
During its lifetime a transaction passes through the states active, partially committed, committed, failed, and aborted (terminated).
Any transaction must maintain the ACID properties, viz. Atomicity, Consistency,
Isolation, and Durability.
Atomicity − This property states that a transaction is an atomic unit of
processing, that is, either it is performed in its entirety or not performed
at all. No partial update should exist.
Consistency − A transaction should take the database from one
consistent state to another consistent state. It should not adversely
affect any data item in the database.
Isolation − A transaction should be executed as if it is the only one in the
system. There should not be any interference from the other concurrent
transactions that are simultaneously running.
Durability − If a committed transaction brings about a change, that
change should be durable in the database and not lost in case of any
failure.
Conflicts in Schedules
In a schedule comprising multiple transactions, a conflict occurs when two active
transactions perform non-compatible operations. Two operations are said to be in
conflict when all of the following three conditions hold simultaneously −
The two operations are parts of different transactions.
Both the operations access the same data item.
At least one of the operations is a write operation.
SERIALIZABILITY OF SCHEDULES
Equivalence of Schedules
Equivalence of two schedules can be of the following types −
Result equivalence − Two schedules producing identical results are said
to be result equivalent.
View equivalence − Two schedules that perform similar actions in a
similar manner are said to be view equivalent.
Conflict equivalence − Two schedules are said to be conflict equivalent if
both contain the same set of transactions and have the same order of
conflicting pairs of operations.
A DBMS must keep data integrated when changes are made to it, because if the
integrity of the data is affected, the whole data set gets disturbed and corrupted.
Therefore, to maintain the integrity of the data, four properties are defined in the
database management system, known as the ACID properties. The ACID properties
apply to a transaction that goes through a group of tasks, and that is where we come
to see their role.
In this section, we will learn and understand the ACID properties: what these
properties stand for and what each property is used for. We will also understand the
ACID properties with the help of some examples.
ACID Properties
1) ATOMICITY: The term atomicity means that the data remains atomic: if any
operation is performed on the data, it should either be executed completely or not be
executed at all. The operation should not break in between or execute partially. When
operations are executed within a transaction, the transaction should be executed
completely, not partially.
Example: Remo has account A holding $30, from which he wishes to send $10 to
Sheero's account B. Account B already holds $100, so when $10 is transferred to
account B its sum should become $110. Two operations take place: the $10 that Remo
wants to transfer is debited from his account A, and the same amount is credited to
account B, i.e., into Sheero's account. Now suppose the first operation, the debit,
executes successfully, but the credit operation fails. Then the value in Remo's account
A becomes $20, while Sheero's account B remains at the $100 that was previously
present.
In that failure case, even though the debit completed, the amount in account B is still
$100, so the transaction is not atomic. When both the debit and credit operations are
done successfully, the transaction is atomic.
Thus, when a transfer loses atomicity, it becomes a huge issue in banking systems, and
so atomicity is the main focus in banking systems.
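How atomicity is enforced in practice can be sketched as follows: both updates are wrapped in a single transaction and the whole unit is rolled back if any step fails. This is a hedged example using Python's sqlite3 module; the accounts table and the amounts are assumptions chosen to match the Remo/Sheero example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 30), ("B", 100)])
conn.commit()

def transfer(amount, src, dst):
    try:
        # both updates belong to one transaction: all or nothing
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
        conn.commit()       # the whole unit becomes durable together
    except Exception:
        conn.rollback()     # undo the partial work so no partial update exists
        raise

transfer(10, "A", "B")
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('A', 20), ('B', 110)]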
2) CONSISTENCY: The word consistency means that the value should always remain
preserved. In DBMS, the integrity of the data should be maintained, which means that
if a change is made in the database, it should remain preserved. In the case of
transactions, the integrity of the data is very essential so that the database remains
consistent before and after the transaction. The data should always be correct.
Example:
In the above figure, there are three accounts, A, B, and C, where A is making a
transaction T one by one to both B and C. Two operations take place, debit and credit.
Account A first debits $50 to account B, and the amount in account A read by B before
the transaction is $300. After the successful transaction T, the available amount in B
becomes $150. Now A debits $20 to account C, and the value read by C at that time is
$250 (which is correct, as the debit of $50 to B has already been completed
successfully). The debit and credit operations from account A to C are then completed
successfully. We can see that the transactions are done successfully and the values are
also read correctly; thus, the data is consistent. Had the value read by B and C been
$300 in both cases, the data would be inconsistent, because the effect of the
completed debit operation would not be visible.
Example: If two operations are running concurrently on two different accounts, then
the value of both accounts should not get affected. The value should remain
persistent. In the diagram referred to below, account A is executing transactions T1
and T2 concurrently.
Therefore, the ACID properties of a DBMS play a vital role in maintaining the
consistency and availability of data in the database.
SCHEDULE
1. Serial Schedule
The serial schedule is a type of schedule in which one transaction is executed
completely before another transaction starts. In a serial schedule, when the first
transaction completes its cycle, the next transaction is executed.
For example: Suppose there are two transactions T1 and T2 which have some
operations. If there is no interleaving of operations, then there are the following two
possible outcomes:
1. Execute all the operations of T1, followed by all the operations of T2.
2. Execute all the operations of T2, followed by all the operations of T1.
o In the given figure (a), Schedule A shows the serial schedule where T1 is followed
by T2.
o In the given figure (b), Schedule B shows the serial schedule where T2 is followed
by T1.
2. Non-serial Schedule
In a non-serial schedule, the operations of multiple transactions are interleaved; a transaction may start before the previous one has finished.
3. Serializable Schedule
A serializable schedule is a non-serial schedule whose effect on the database is equivalent to that of some serial schedule of the same transactions.
RECOVERABILITY OF SCHEDULE
Sometimes a transaction may not execute completely due to a software issue, system
crash or hardware failure. In that case, the failed transaction has to be rolled back. But
some other transactions may also have used values produced by the failed transaction,
so those transactions have to be rolled back as well.
The above table 1 shows a schedule with two transactions. T1 reads and writes the
value of A, and that value is read and written by T2. T2 commits, but later on T1 fails.
Due to the failure, we have to roll back T1. T2 should also be rolled back because it
read the value written by T1, but T2 cannot be rolled back because it has already
committed. This type of schedule is known as an irrecoverable schedule.
The above table 2 shows a schedule with two transactions. Transaction T1 reads and
writes A, and that value is read and written by transaction T2. But later on, T1 fails.
Due to this, we have to roll back T1. T2 should be rolled back because T2 has read the
value written by T1. Since T2 has not committed yet, we can roll back transaction T2
as well. So this schedule is recoverable with cascading rollback.
Recoverable with cascading rollback: A schedule is recoverable with cascading
rollback if Tj reads a value updated by Ti before Ti has committed; the commit of Tj is
delayed until Ti commits.
The above table 3 shows a schedule with two transactions. Transaction T1 reads and
writes A and then commits, and only after that is the value read and written by T2. So
this is a cascadeless recoverable schedule.
CONCURRENCY CONTROL
But before knowing about concurrency control, we should know about concurrent
execution.
o In a multi-user system, multiple users can access and use the same database at
one time, which is known as concurrent execution of the database. It means
that the same database is used simultaneously on a multi-user system by
different users.
o While working on database transactions, there is often a requirement that
multiple users use the database to perform different operations, and in that
case concurrent execution of the database is performed.
o The simultaneous execution should be done in an interleaved manner, and no
operation should affect the other executing operations, thus maintaining the
consistency of the database. However, when transactions execute concurrently
without proper management, several challenging problems can arise, as
described below.
In a database transaction, the two main operations are READ and WRITE. These
operations need to be managed in the concurrent execution of transactions because,
if they are not performed in a properly controlled interleaved manner, the data may
become inconsistent. The following problems occur with the concurrent execution of
the operations:
Lost Update Problem: This problem occurs when two different database transactions
perform read/write operations on the same database items in an interleaved manner
(i.e., concurrent execution) in such a way that the values of the items become
incorrect, making the database inconsistent.
For example:
Consider the below diagram where two transactions TX and TY, are performed on the
same account A where the balance of account A is $300.
o At time t1, transaction TX reads the value of account A, i.e., $300 (only read).
o At time t2, transaction TX deducts $50 from account A that becomes $250 (only
deducted and not updated/write).
o Alternately, at time t3, transaction TY reads the value of account A that will be
$300 only because TX didn't update the value yet.
o At time t4, transaction TY adds $100 to account A that becomes $400 (only
added but not updated/write).
o At time t6, transaction TX writes the value of account A that will be updated as
$250 only, as TY didn't update the value yet.
o Similarly, at time t7, transaction TY writes the values of account A, so it will write
as done at time t4 that will be $400. It means the value written by TX is lost, i.e.,
$250 is lost.
The dirty read problem occurs when one transaction updates an item of the database,
that transaction then fails, and before the data is rolled back, the updated database
item is accessed by another transaction. This gives rise to a Read-Write conflict
between the two transactions.
For example:
Unrepeatable Read Problem: Also known as the Inconsistent Retrievals Problem, this
occurs when, within a single transaction, two different values are read for the same
database item.
For example:
o At time t1, transaction TX reads the value from account A, i.e., $300.
o At time t2, transaction TY reads the value from account A, i.e., $300.
o At time t3, transaction TY updates the value of account A by adding $100 to the
available balance, and then it becomes $400.
o At time t4, transaction TY writes the updated value, i.e., $400.
o After that, at time t5, transaction TX reads the available value of account A, and
that will be read as $400.
o It means that within the same transaction TX, two different values of account A
are read: $300 initially and, after the update made by transaction TY, $400. This
is an unrepeatable read and is therefore known as the Unrepeatable Read
problem.
Thus, in order to maintain consistency in the database and avoid such problems that
take place in concurrent execution, management is needed, and that is where the
concept of Concurrency Control comes into role.
Example
Assume that two people go to electronic kiosks at the same time to buy a movie ticket
for the same movie and the same show time.
However, there is only one seat left for that movie show in that particular theatre.
Without concurrency control in DBMS, it is possible that both moviegoers will end up
purchasing a ticket. However, a concurrency control method does not allow this to
happen. Both moviegoers can still access the information written in the movie seating
database, but concurrency control provides a ticket only to the buyer who completes
the transaction process first.
The main concurrency control protocols are the following:
Lock-Based Protocols
Two Phase Locking Protocol
Timestamp-Based Protocols
Validation-Based Protocols
LOCK-BASED PROTOCOLS:
Binary Locks: A binary lock on a data item can be in either the locked or the unlocked state.
Shared/exclusive: This type of locking mechanism separates the locks in DBMS based
on their use. If a lock is acquired on a data item to perform a write operation, it is
called an exclusive lock.
A shared lock is also called a read-only lock. With a shared lock, the data item can
be shared between transactions, because a transaction holding a shared lock never
has permission to update the data item.
For example, consider a case where two transactions are reading the account
balance of a person. The database will let them read by placing a shared lock.
However, if another transaction wants to update that account's balance, the shared
lock prevents it until the reading process is over.
With the Exclusive Lock, a data item can be read as well as written. This is exclusive
and can’t be held concurrently on the same data item. X-lock is requested using lock-
x instruction. Transactions may unlock the data item after finishing the ‘write’
operation.
For example, when a transaction needs to update the account balance of a person,
you can allow this transaction by placing an X lock on it. Therefore, when a second
transaction wants to read or write, the exclusive lock prevents this operation.
Simplistic Lock Protocol: This type of lock-based protocol allows transactions to obtain
a lock on every object before the operation begins. Transactions may unlock the data
item after finishing the 'write' operation.
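A minimal sketch of the shared/exclusive compatibility rule (many readers or a single writer) in Python; the class and method names are illustrative, and a real DBMS lock manager would keep such a lock per data item and per transaction.

import threading

class SharedExclusiveLock:
    # Many transactions may hold the shared (read) lock at the same time;
    # the exclusive (write) lock is held by at most one and excludes all readers.
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def lock_shared(self):
        with self._cond:
            while self._writer:                 # wait while a writer holds the item
                self._cond.wait()
            self._readers += 1

    def unlock_shared(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def lock_exclusive(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()               # wait until no reader and no writer remain
            self._writer = True

    def unlock_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()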
Pre-claiming Locking: The pre-claiming lock protocol evaluates the operations and
creates a list of the required data items which are needed to initiate the execution.
When all the locks are granted, the transaction executes; after all of its operations are
over, all the locks are released.
Starvation
Starvation is the situation when a transaction needs to wait for an indefinite period
to acquire a lock.
Deadlock
Deadlock refers to a specific situation where two or more processes are waiting for
each other to release a resource or more than two processes are waiting for the
resource in a circular chain.
Two Phase Locking Protocol, also known as the 2PL protocol, is a method of concurrency
control in DBMS that ensures serializability by applying a lock to the transaction data,
which blocks other transactions from accessing the same data simultaneously. The Two
Phase Locking protocol helps to eliminate the concurrency problems in DBMS.
This locking protocol divides the execution phase of a transaction into three different
parts.
The Two-Phase Locking protocol allows each transaction to make a lock or unlock
request in two steps:
Growing Phase: In this phase, a transaction may obtain locks but may not
release any locks.
Shrinking Phase: In this phase, a transaction may release locks but may not
obtain any new locks.
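A hedged sketch of how the two-phase rule itself can be enforced; the TwoPhaseTransaction class and its bookkeeping are illustrative rather than a real DBMS API, and the actual granting of locks is abstracted away.

class TwoPhaseTransaction:
    # 2PL rule: once a transaction releases any lock (shrinking phase),
    # it may never acquire a new one.
    def __init__(self, name):
        self.name = name
        self.held = set()
        self.shrinking = False          # becomes True after the first unlock

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError(f"{self.name}: cannot lock {item} after an unlock (2PL)")
        self.held.add(item)             # growing phase

    def unlock(self, item):
        self.shrinking = True           # shrinking phase begins
        self.held.discard(item)

t1 = TwoPhaseTransaction("T1")
t1.lock("A"); t1.lock("B")              # growing phase
t1.unlock("A")                          # shrinking phase starts
t1.unlock("B")
# t1.lock("C") would now raise an error, because 2PL forbids it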
It is true that the 2PL protocol offers serializability. However, it does not ensure that
deadlocks do not happen.
In the above-given diagram, you can see that local and global deadlock detectors
search for deadlocks and resolve them by restarting the affected transactions from
their initial states.
Centralized 2PL
In Centralized 2PL, a single site is responsible for the lock management process. It has
only one lock manager for the entire DBMS.
Distributed 2PL
In this kind of two-phase locking mechanism, lock managers are distributed across all
sites. Each is responsible for managing locks for the data at its own site. If no data is
replicated, this is equivalent to primary copy 2PL. Communication costs of distributed
2PL are considerably higher than those of primary copy 2PL.
TIMESTAMP-BASED PROTOCOLS
Timestamp based Protocol in DBMS is an algorithm which uses the System Time or
Logical Counter as a timestamp to serialize the execution of concurrent transactions.
The Timestamp-based protocol ensures that every conflicting read and write
operations are executed in a timestamp order.
The older transaction is always given priority in this method. It uses system time to
determine the time stamp of the transaction. This is the most commonly used
concurrency protocol.
Lock-based protocols help you to manage the order between the conflicting
transactions when they will execute. Timestamp-based protocols manage conflicts as
soon as an operation is created.
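A minimal sketch of the timestamp-ordering rules, assuming each data item records the largest read and write timestamps seen so far; an operation that arrives "too late" causes its transaction to be rolled back. The dictionaries and the Abort exception are assumptions made for the example.

# Per-item bookkeeping: the largest timestamps that have read / written the item so far.
read_ts = {"A": 0}
write_ts = {"A": 0}

class Abort(Exception):
    pass

def read(item, ts):
    # a transaction may not read a value already written by a younger transaction
    if ts < write_ts[item]:
        raise Abort(f"read({item}) by TS {ts} rejected; item written at TS {write_ts[item]}")
    read_ts[item] = max(read_ts[item], ts)

def write(item, ts):
    # a transaction may not overwrite a value already read or written by a younger one
    if ts < read_ts[item] or ts < write_ts[item]:
        raise Abort(f"write({item}) by TS {ts} rejected")
    write_ts[item] = ts

read("A", ts=5)    # allowed
write("A", ts=7)   # allowed
# write("A", ts=6) would now raise Abort, because a younger write (TS 7) already happened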
VALIDATION-BASED PROTOCOLS
The validation-based (optimistic) protocol executes a transaction in three phases:
1. Read Phase
2. Validation Phase
3. Write Phase
Read Phase
In the Read Phase, the data values from the database can be read by a transaction
but the write operation or updates are only applied to the local data copies, not the
actual database.
Validation Phase
In Validation Phase, the data is checked to ensure that there is no violation of
serializability while applying the transaction updates to the database.
Write Phase
In the Write Phase, the updates are applied to the database if the validation is
successful; otherwise, the updates are not applied and the transaction is rolled back.
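A hedged sketch of the three phases of an optimistic (validation-based) scheme: reads see the database, writes go to local copies, and validation refuses to commit if another transaction committed a conflicting write after this one started. The database and committed_writes structures are assumptions for illustration.

database = {"A": 300, "B": 100}
committed_writes = []     # (commit_time, set_of_items_written) of finished transactions
clock = 0

class OptimisticTxn:
    def __init__(self):
        self.start = clock
        self.local = {}               # read phase: updates touch local copies only
        self.read_set = set()

    def read(self, item):
        self.read_set.add(item)
        return self.local.get(item, database[item])

    def write(self, item, value):
        self.local[item] = value

    def commit(self):
        global clock
        # validation phase: fail if a transaction that committed after we started
        # wrote an item that we have read
        for commit_time, items in committed_writes:
            if commit_time > self.start and items & self.read_set:
                raise Exception("validation failed; transaction rolled back")
        # write phase: apply the local copies to the actual database
        database.update(self.local)
        clock += 1
        committed_writes.append((clock, set(self.local)))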
In the case of transactions, the term ACID is used to state the important properties
that a transaction must follow. As noted earlier, ACID stands for Atomicity,
Consistency, Isolation and Durability, and these properties are collectively called the
ACID properties.
Properties of transaction
ISOLATION
Isolation determines the visibility of a transaction's data to other users and systems. A
lower isolation level allows every user to access the same data at the same time;
therefore, it involves a higher risk to data privacy and the security of the system. A
higher isolation level reduces the degree of concurrency over the data, but it requires
more resources and is slower than lower isolation levels.
The isolation protocols help safeguard the data from unwanted transactions. They
maintain the integrity of the data by defining how and when the changes made by
one operation become visible to others.
LEVELS OF ISOLATION
The four standard isolation levels, from lowest to highest, are Read Uncommitted, Read Committed, Repeatable Read, and Serializable.
OLAP (ONLINE ANALYTICAL PROCESSING)
OLAP is used across many business areas. Typical application areas include the following.
Finance
o Budgeting
o Activity-based costing
o Financial performance analysis
o Financial modeling
Production
o Production planning
o Defect analysis
OLAP cubes have two main purposes. The first is to provide business users with a data
model more intuitive to them than a tabular model. This model is called a Dimensional
Model.
The second purpose is to enable fast query response that is usually difficult to achieve
using tabular models.
Fundamentally, OLAP has a very simple concept. It pre-calculates most of the queries
that are typically very hard to execute over tabular databases, namely aggregation,
joining, and grouping. These queries are calculated during a process that is usually
called 'building' or 'processing' of the OLAP cube. This process happens overnight, and
by the time end users get to work - data will have been updated.
Dr E.F. Codd, the "father" of the relational model, has formulated a list of 12 guidelines
and requirements as the basis for selecting OLAP systems:
3) Accessibility: The OLAP system provides access only to the data that is actually
required to perform the particular analysis, presenting a single, coherent, and
consistent view to the clients. The OLAP system must map its own logical schema to
the heterogeneous physical data stores and perform any necessary transformations.
The OLAP operations should sit between data sources (e.g., data warehouses) and an
OLAP front-end.
4) Consistent Reporting Performance: Users should not experience any significant
degradation in reporting performance as the number of dimensions or the size of the
database increases. That is, the performance of OLAP should not suffer as the number
of dimensions is increased. Users must observe consistent run time, response time, or
machine utilization every time a given query is run.
7) Dynamic Sparse Matrix Handling: The OLAP system should adapt its physical schema
to the specific analytical model being created and loaded, so that sparse matrix
handling is optimized. When encountering a sparse matrix, the system must be able to
dynamically deduce the distribution of the data and adjust the storage and access
paths to obtain and maintain a consistent level of performance.
8) Multiuser Support: OLAP tools must provide concurrent data access, data integrity,
and access security.
12) Unlimited Dimensions and Aggregation Levels: The number of data dimensions
should be unlimited. Each of these common dimensions must allow a practically
unlimited number of customer-defined aggregation levels within any given
consolidation path.
But the following list may be helpful for beginners to obtain good database
performance −
Use the 3NF (third normal form) database design explained in the RDBMS
Concepts chapter of this tutorial.
Avoid number-to-character conversions because numbers and
characters compare differently and lead to performance downgrade.
While using SELECT statement, only fetch whatever information is
required and avoid using * in your SELECT queries because it would load
the system unnecessarily.
Create your indexes carefully on all the tables where you have frequent
search operations. Avoid index on the tables where you have less number
of search operations and more number of insert and update operations.
A full-table scan occurs when the columns in the WHERE clause do not
have an index associated with them. You can avoid a full-table scan by
creating an index on columns that are used as conditions in the WHERE
clause of an SQL statement.
Be very careful of equality operators with real numbers and date/time
values. Both of these can have small differences that are not obvious to
the eye but that make an exact match impossible, thus preventing your
queries from ever returning rows.
Use pattern matching judiciously. LIKE COL% is a valid WHERE condition,
reducing the returned set to only those records with data starting with
the string COL. However, COL%Y does not further reduce the returned
results set since %Y cannot be effectively evaluated. The effort to do the
evaluation is too large to be considered. In this case, the COL% is used,
but the %Y is thrown away. For the same reason, a leading wildcard %COL
effectively prevents the entire filter from being used.
Fine tune your SQL queries examining the structure of the queries (and
subqueries), the SQL syntax, to discover whether you have designed your
tables to support fast data manipulation and written the query in an
optimum manner, allowing your DBMS to manipulate the data
efficiently.
For queries that are executed on a regular basis, try to use procedures. A
procedure is a potentially large group of SQL statements. Procedures are
compiled by the database engine and then executed. Unlike an SQL
statement, the database engine need not optimize the procedure before
it is executed.
Avoid using the logical operator OR in a query if possible. OR inevitably
slows down nearly any query against a table of substantial size.
You can optimize bulk data loads by dropping indexes. Imagine the
history table with many thousands of rows. That history table is also likely
to have one or more indexes. When you think of an index, you normally
think of faster table access, but in the case of batch loads, you can benefit
by dropping the index(es).
When performing batch transactions, perform a COMMIT after a fair
number of records have been created, instead of committing after every
single record creation.
Plan to defragment the database on a regular basis, even if doing so
means developing a weekly routine.
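A small, hedged illustration of two of the tips above (fetching only the needed columns and indexing a frequently searched column), using Python's sqlite3 module; the orders table and its columns are invented for this example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders (customer, total) VALUES (?, ?)",
                 [("alice", 10.0), ("bob", 25.5), ("alice", 7.25)])

# Fetch only the columns that are needed instead of SELECT *.
rows = conn.execute("SELECT id, total FROM orders WHERE customer = ?", ("alice",)).fetchall()

# An index on the frequently searched column lets the WHERE clause avoid a full-table scan.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
plan = conn.execute("EXPLAIN QUERY PLAN SELECT id, total FROM orders WHERE customer = ?",
                    ("alice",)).fetchall()
print(rows)
print(plan)   # the plan should now mention the index rather than a full scan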
Oracle has many tools for managing SQL statement performance but among them two
are very popular. These two tools are −
Explain plan − tool identifies the access path that will be taken when the
SQL statement is executed.
tkprof − measures the performance by time elapsed during each phase
of SQL statement processing.
If you want to simply measure the elapsed time of a query in Oracle, you can use the
SQL*Plus command SET TIMING ON.
Check your RDBMS documentation for more detail on the above-mentioned tools and
defragmenting the database.
QUERY OPTIMIZATION
Query optimization is of great importance for the performance of a relational
database, especially for the execution of complex SQL statements. A query optimizer
decides the best methods for implementing each query.
The query optimizer selects, for instance, whether or not to use indexes for a given
query, and which join methods to use when joining multiple tables. These decisions
have a tremendous effect on SQL performance, and query optimization is a key
technology for every application, from operational Systems to data warehouse and
analytical systems to content management systems.
Retrieve as little data as possible − A query should fetch only the rows and
columns it actually needs to process and save these records. For example, if it
only requires fetching one column from a table, do not use 'SELECT *'.
Store intermediate results − Sometimes the logic for a query can be quite
complex. It is possible to produce the desired outcomes through the use
of subqueries, inline views, and UNION-type statements. For those
methods, the intermediate results are not saved in the database but are
used directly within the query. This can lead to performance issues,
particularly when the intermediate results have a huge number of rows.
There are various query optimization strategies, as follows −
Use Index − Using an index is the first strategy one should use
to speed up a query.
Aggregate Table − Pre-aggregated tables at higher levels can be used
so that a smaller amount of information needs to be parsed.
Vertical Partitioning − The table can be partitioned by columns.
This method reduces the amount of information a SQL query needs to
process.
Horizontal Partitioning − The table can be partitioned by data
value, most often time. This method reduces the amount of information
a SQL query needs to process.
De-normalization − The process of de-normalization combines multiple
tables into a single table. This speeds up query execution because
fewer table joins are required.
Server Tuning − Each server has its own parameters, and tuning these
server parameters so that the server can take full advantage of the hardware
resources can significantly speed up query execution.
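A hedged sketch of the Aggregate Table strategy with sqlite3; the sales table and the monthly summary are invented for the example. Reports that only need monthly totals can then read the small pre-aggregated table instead of scanning every detail row.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("2024-01-03", "pen", 2.0), ("2024-01-19", "pen", 3.0),
                  ("2024-02-07", "book", 12.0)])

# Pre-populate an aggregate (summary) table at the month level.
conn.execute("""
    CREATE TABLE sales_by_month AS
    SELECT substr(day, 1, 7) AS month, product, SUM(amount) AS total
    FROM sales
    GROUP BY month, product
""")

# Queries now scan the small summary table instead of the detail table.
print(conn.execute("SELECT * FROM sales_by_month ORDER BY month").fetchall())
# [('2024-01', 'pen', 5.0), ('2024-02', 'book', 12.0)]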
Initially, the SQL query is scanned. Then it is parsed to look for syntactical errors and
correctness of data types. If the query passes this step, the query is decomposed into
smaller query blocks. Each block is then translated into an equivalent relational algebra
expression.
Steps for Query Optimization
Query optimization involves three steps, namely query tree generation, plan
generation, and query plan code generation.
Step 1 − Query Tree Generation
A query tree is a tree data structure representing a relational algebra expression. The
tables of the query are represented as leaf nodes. The relational algebra operations
are represented as the internal nodes. The root represents the query as a whole.
During execution, an internal node is executed whenever its operand tables are
available. The node is then replaced by the result table. This process continues for all
internal nodes until the root node is executed and replaced by the result table.
For example, let us consider the following schemas −
EMPLOYEE
DEPARTMENT
Example 1
Let us consider the query as the following.
$$\pi_{EmpID} (\sigma_{EName = \small "ArunKumar"} {(EMPLOYEE)})$$
The corresponding query tree will be −
Example 2
Let us consider another query involving a join.
$$\pi_{EName, Salary} (\sigma_{DName = \small "Marketing"} {(DEPARTMENT)}) \bowtie_{DNo=DeptNo}{(EMPLOYEE)}$$
Following is the query tree for the above query.
COST OF QUERY:
In the previous section, we understood about Query processing steps and evaluation
plan. Though a system can create multiple plans for a query, the chosen method
should be the best of all. It can be done by comparing each possible plan in terms of
their estimated cost. For calculating the net estimated cost of any plan, the cost of
each operation within a plan should be determined and combined to get the net
estimated cost of the query evaluation plan.
To estimate the cost of a query evaluation plan, we use the number of blocks
transferred from the disk and the number of disk seeks. Suppose the disk takes an
average of tS seconds per seek (the block access time is the sum of the disk seek time
and the rotational latency) and an average of tT seconds to transfer one block of data.
Then a plan that transfers b data blocks and performs S seeks takes b*tT + S*tS
seconds. For example, if tT = 0.1 ms, tS = 4 ms, the block size is 4 KB, and the transfer
rate is 40 MB per second, we can easily calculate the estimated cost of a given query
evaluation plan.
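As a hedged worked example with these figures (the block and seek counts are assumed purely for illustration), a plan that transfers 100 blocks and performs 4 seeks would cost:
$$100 \times t_T + 4 \times t_S = 100 \times 0.1\,\text{ms} + 4 \times 4\,\text{ms} = 26\,\text{ms}$$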
Generally, for estimating the cost, we consider the worst case that could happen. The
users assume that initially, the data is read from the disk only. But there must be a
chance that the information is already present in the main memory. However, the
users usually ignore this effect, and due to this, the actual cost of execution comes out
less than the estimated value.
The response time, i.e., the time required to execute the plan, could be used for
estimating the cost of the query evaluation plan. But due to the following reasons, it
becomes difficult to calculate the response time without actually executing the query
evaluation plan:
o When the query begins its execution, the response time becomes dependent
on the contents stored in the buffer. This information is difficult to obtain while
the query is being optimized, and it may not be available at all.
o When a system with multiple disks is present, the response time depends on
how accesses are distributed among the disks. This is difficult to estimate
without detailed knowledge of the data layout on the disks.
o Consequently, instead of minimizing the response time for a query evaluation
plan, optimizers find it better to reduce the total resource consumption of the
query plan. Thus, to estimate the cost of a query evaluation plan, it is better to
minimize the resources used for accessing the disk and the use of extra
resources.
JOIN:
Join in DBMS is a binary operation which allows you to combine join product and
selection in one single statement. The goal of creating a join condition is that it helps
you to combine the data from two or more DBMS tables. The tables in DBMS are
associated using the primary key and foreign keys.
Types of Join
There are mainly two types of joins in DBMS:
INNER JOIN
Inner Join is used to return rows from both tables which satisfy the given condition.
It is the most widely used join operation and can be considered as a default join-type
Theta join
Natural join
EQUI join
THETA JOIN
Theta Join allows you to merge two tables based on the condition represented by
theta. Theta joins work for all comparison operators. It is denoted by symbol θ. The
general case of JOIN operation is called a Theta join.
Syntax:
A ⋈θ B
Theta join can use any conditions in the selection criteria.
Table A
column 1   column 2
1          1
1          2

Table B
column 1   column 2
1          1
1          3
For example:
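The worked theta-join output is not reproduced here, so the following is a hedged Python/sqlite3 sketch using the two tables from the preceding figure (column names adapted to column1/column2) and the condition θ: A.column2 > B.column2, which is an assumption chosen for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (column1 INTEGER, column2 INTEGER)")
conn.execute("CREATE TABLE B (column1 INTEGER, column2 INTEGER)")
conn.executemany("INSERT INTO A VALUES (?, ?)", [(1, 1), (1, 2)])
conn.executemany("INSERT INTO B VALUES (?, ?)", [(1, 1), (1, 3)])

# Theta join with the condition A.column2 > B.column2 (any comparison operator is allowed).
rows = conn.execute("""
    SELECT A.column1, A.column2, B.column1, B.column2
    FROM A JOIN B ON A.column2 > B.column2
""").fetchall()
print(rows)   # [(1, 2, 1, 1)] - the only pair of rows satisfying the theta condition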
EQUI JOIN
An EQUI join is done when a theta join uses only the equality condition. EQUI join is
the most difficult operation to implement efficiently in an RDBMS, and is one reason
why RDBMSs can have inherent performance problems.
NATURAL JOIN (⋈)
A natural join can be performed only if the two relations share at least one common
attribute. The join is made on equality of the common attribute(s), and the common
column appears only once in the result.
Example:
Consider the following two tables.
C
Num   Square
2     4
3     9

D
Num   Cube
2     8
3     27

C ⋈ D
Num   Square   Cube
2     4        8
3     9        27
OUTER JOIN
An Outer Join doesn’t require each record in the two join tables to have a matching
record. In this type of join, the table retains each record even if no other matching
record exists.
A
Num   Square
2     4
3     9
4     16

B
Num   Cube
2     8
3     27
5     125

Left Outer Join (A ⟕ B)
In a left outer join, every tuple of the left relation A appears in the result; where no
matching tuple exists in B, the missing attributes are shown as –.

A ⟕ B
Num   Square   Cube
2     4        8
3     9        27
4     16       –

Right Outer Join (A ⟖ B)
In a right outer join, every tuple of the right relation B appears in the result; where no
matching tuple exists in A, the missing attributes are shown as –.

A ⟖ B
Num   Cube   Square
2     8      4
3     27     9
5     125    –

Full Outer Join (A ⟗ B)
In a full outer join, all tuples from both relations are included in the result,
irrespective of the matching condition.

Example:

A ⟗ B
Num   Square   Cube
2     4        8
3     9        27
4     16       –
5     –        125
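A hedged Python/sqlite3 sketch of the left outer join above; RIGHT and FULL OUTER JOIN require SQLite 3.39 or newer (or a UNION-based emulation), so only the left outer join is shown. The table and column names follow the example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (Num INTEGER, Square INTEGER)")
conn.execute("CREATE TABLE B (Num INTEGER, Cube INTEGER)")
conn.executemany("INSERT INTO A VALUES (?, ?)", [(2, 4), (3, 9), (4, 16)])
conn.executemany("INSERT INTO B VALUES (?, ?)", [(2, 8), (3, 27), (5, 125)])

# Left outer join: every row of A appears; a missing match from B comes back as NULL (None).
rows = conn.execute("""
    SELECT A.Num, A.Square, B.Cube
    FROM A LEFT OUTER JOIN B ON A.Num = B.Num
""").fetchall()
print(rows)   # [(2, 4, 8), (3, 9, 27), (4, 16, None)]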
In the previous section, we understood that estimating the cost of a query plan should
be done by measuring the total resource consumption.
In this section, we will understand how the selection operation is performed in the
query execution plan.
Generally, the selection operation is performed by the file scan. File scans are the
search algorithms that are used for locating and accessing the data. It is the lowest-
level operator used in query processing.
In RDBMS or relational database systems, the file scan reads a relation only if the
whole relation is stored in one file only. When the selection operation is performed on
a relation whose tuples are stored in one file, it uses the following algorithms:
o Linear Search: In a linear search, the system scans each record and tests whether
it satisfies the given selection condition. An initial seek is needed to access the
first block of the file, and if the blocks of the file are not stored contiguously,
extra seeks are needed. Linear search is the slowest search algorithm, but it is
applicable in all cases: it does not depend on the nature of the selection, the
availability of indices, or the ordering of the file, whereas the other algorithms
are not applicable in all cases.
The index-based search algorithms are known as Index scans. Such index structures
are known as access paths. These paths allow locating and accessing the data in the
file. There are following algorithms that use the index in query processing:
o Primary index, equality on a key: We use the index to retrieve a single record
that satisfies the equality condition for making the selection. The equality
comparison is performed on the key attribute carrying a primary key.
o Primary index, equality on nonkey: The difference between equality on key
and nonkey is that in this, we can fetch multiple records. We can fetch multiple
records through a primary key when the selection criteria specify the equality
comparison on a nonkey.
o Secondary index, equality on key or nonkey: The selection that specifies an
equality condition can use the secondary index. Using secondary index strategy,
we can either retrieve a single record when equality is on key or multiple
records when the equality condition is on nonkey. When retrieving a single
record, the time cost is equal to the primary index. In the case of multiple
records, they may reside on different blocks. This results in one I/O operation
per fetched record, and each I/O operation requires a seek and a block transfer.
For making any selection on the basis of a comparison in a relation, we can proceed it
either by using the linear search or via indices in the following ways:
o Primary index, comparison: When the selection condition given by the user is
a comparison, we use a primary ordered index, such as a primary B+-tree index.
For example, when an attribute A of a relation R is compared with a given value
v as A > v, we use a primary index on A to locate the first tuple satisfying the
condition and then scan the file from that tuple to the end, outputting all tuples
that satisfy the given selection condition.
o Secondary index, comparison: A secondary ordered index is used for a selection
operation that involves <, >, ≤, or ≥. Here the file scan searches the blocks of the
lowest-level index.
(<, ≤): In this case, it scans from the smallest value up to the given value v.
(>, ≥): In this case, it scans from the given value v up to the maximum value.
However, the use of the secondary index should be limited for selecting a few
records. It is because such an index provides pointers to point each record, so
users can easily fetch the record through the allocated pointers. Such retrieved
records may require an I/O operation as records may be stored on different
blocks of the file. So, if the number of fetched records is large, it becomes
expensive with the secondary index.
Conjunction: σθ1∧θ2∧…∧θn(r) − A conjunctive selection is the intersection of all records
that satisfy every selection condition θi.
Disjunction: σθ1∨θ2∨…∨θn(r) − A disjunctive selection is the union of all records that
satisfy any one of the given selection conditions θi.
Negation: The result of a selection σ¬θ(r) is the set of tuples of the given relation r for
which the selection condition θ evaluates to false. When nulls are not present, this is
simply the set of tuples of r that are not in σθ(r).
o Conjunctive selection using a composite index: If an appropriate composite
(multi-attribute) index exists on these combined attribute fields, then the index
can be searched directly. The type of composite index determines which of the
index algorithms above applies.
o Conjunctive selection via the intersection of identifiers: This implementation
involves record pointers or record identifiers. It uses indices with the record
pointers on those fields which are involved in the individual selection condition.
It scans each index for pointers to tuples satisfying the individual condition.
Therefore, the intersection of all the retrieved pointers is the set of pointers to
the tuples that satisfies the conjunctive condition. The algorithm uses these
pointers to fetch the actual records. However, in absence of indices on each
individual condition, it tests the retrieved records for the other remaining
conditions.
o Disjunctive selection by the union of identifiers: This algorithm scans each
index for pointers to tuples that satisfy the individual condition, but only if
access paths are available on all of the disjunctive selection conditions. The
union of all fetched pointers then gives the set of pointers to all tuples that
satisfy the disjunctive condition, and these pointers are used to fetch the actual
records. If an access path is not available for even one of the conditions, a linear
search has to be used to find the tuples satisfying that condition, so in that case
it is better to use a single linear search for the whole test.
Cost Estimation
Here, the overall cost of the algorithm is composed by adding the cost of individual
index scans and cost of fetching the records in the intersection of the retrieved lists of
pointers. We can minimize the cost by sorting the list of pointers and fetching the
sorted records. So, we found the following two points for cost estimation:
o We can fetch all selected records of the block using a single I/O operation
because each pointer in the block appears together.
o The disk-arm movement gets minimized as blocks are read in sorted order.
The estimated costs of the selection algorithms can be summarised as follows (hi is the height of the index, br the number of blocks in the file, b the number of blocks holding matching records, and n the number of matching records).

Linear Search, Equality on Key
Cost: tS + (br/2) * tT
This is the average case, where only one record satisfies the condition, so the scan terminates as soon as the record is found.

Primary B+-tree index, Equality on Key
Cost: (hi + 1) * (tT + tS)
Each I/O operation needs one seek and one block transfer to fetch the record by traversing the height of the tree.

Primary B+-tree index, Equality on a Nonkey
Cost: hi * (tT + tS) + b * tT
It needs one seek and one block transfer for each level of the tree, followed by the transfer of the b consecutive blocks that hold the matching records.

Secondary B+-tree index, Equality on Key
Cost: (hi + 1) * (tT + tS)
Each I/O operation needs one seek and one block transfer to fetch the record by traversing the height of the tree.

Secondary B+-tree index, Equality on Nonkey
Cost: (hi + n) * (tT + tS)
It requires one seek per record because each of the n matching records may be on a different block.

Primary B+-tree index, Comparison
Cost: hi * (tT + tS) + b * tT
It needs one seek and one block transfer for each level of the tree, followed by the transfer of the b consecutive blocks that hold the matching records.

Secondary B+-tree index, Comparison
Cost: (hi + n) * (tT + tS)
It requires one seek per record because each record may be on a different block.
Relational algebra is a procedural query language.
Projection operation
It displays specific columns of a table. It is denoted by pi (∏). It gives a vertical subset
of the original relation and eliminates duplicate tuples.
Syntax
The syntax is as follows −
∏regno(student)
Example
Consider the following student table −
RegNo   Branch   Section
1       CSE      A
2       ECE      B
3       CIVIL    B
4       IT       A
To display regno column of student table, we can use the following command −
∏regno(student)
Output
RegNo
1
2
3
4
To display branch, section column of student table, use the following command −
∏branch,section(student)
The result is as follows −
Branch Section
CSE A
ECE B
CIVIL B
IT A
To display regno, section of ECE students, use the following command −
∏regno,section(σbranch=ECE(student))
Output
Regno Section
2 B
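For comparison, here is a hedged sqlite3 sketch of the last expression, ∏regno,section(σbranch=ECE(student)); the student table is the one shown above.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (regno INTEGER, branch TEXT, section TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?, ?)",
                 [(1, "CSE", "A"), (2, "ECE", "B"), (3, "CIVIL", "B"), (4, "IT", "A")])

# Selection (WHERE branch = 'ECE') followed by projection (SELECT DISTINCT regno, section).
rows = conn.execute("SELECT DISTINCT regno, section FROM student WHERE branch = ?",
                    ("ECE",)).fetchall()
print(rows)   # [(2, 'B')]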
Security of databases refers to the array of controls, tools, and procedures designed
to ensure and safeguard confidentiality, integrity, and accessibility. This tutorial will
concentrate on confidentiality because it's a component that is most at risk in data
security breaches.
Security for databases must cover and safeguard the following aspects:
Security of databases is a complicated and challenging task that involves all aspects of
security practices and technologies. It is also inherently at odds with the accessibility
of databases: the more usable and accessible the database is, the more susceptible it
is to security threats; the more the database is secured against threats, the more
difficult it is to access and utilize.
Insider Dangers
An insider threat is a security threat from any of three kinds of sources that hold
privileged access to the database.
Insider dangers are among the most frequent sources of security breaches to
databases. They often occur as a consequence of too many employees being granted
access to privileged user credentials.
Human Error
Exploitation of Database Software Vulnerabilities: Hackers earn their money by
identifying and exploiting vulnerabilities in software such as database management
software. The major database software companies regularly issue security patches to
address these vulnerabilities, and failing to apply them promptly leaves the database
exposed.
SQL/NoSQL Injection Attacks: A database-specific threat is the insertion of malicious
SQL, as well as other non-SQL string attacks, into database queries delivered by
web-based apps and HTTP headers. Companies that do not follow safe coding
practices for web applications and do not conduct regular vulnerability tests are
susceptible to attacks using these.
Buffer Overflow Attacks: A buffer overflow happens when a program seeks to copy
more data into a memory block of a certain length than the block can accommodate.
Attackers may make use of the extra data, which is stored in adjacent memory
addresses, as a foundation from which they can begin attacks.
Malware
Attacks on Backups
Companies that do not protect backup data using the same rigorous controls
employed to protect databases themselves are at risk of cyberattacks on backups.
o Data volumes are growing: Data capture, storage, and processing continue to
increase exponentially in almost all organizations. Any tools or methods must
be highly flexible to meet current as well as far-off needs.
o The infrastructure is sprawling: Network environments are becoming more
complicated, especially as companies shift their workloads into multiple clouds.
As databases are almost always accessible via the network, any security risk to any
component or part of the infrastructure can threaten the database. Likewise, any
security attack that impacts a device or workstation could endanger the database.
Therefore, security for databases must go beyond the limits of the database.
o Physical security: Whether the database servers are on-premises or in a cloud
data centre, they should be placed in a secure, climate-controlled environment.
(If our database server is located in a cloud-based data centre, the cloud
provider will handle this security on our behalf.)
o Access to the network and administrative restrictions: The number of users
granted access to the database should be kept to the practical minimum, and
their access rights should be restricted to the minimum level required to fulfil
their tasks. Additionally, access to the network should be limited to the
minimum permissions needed.
o End-user account and device security: Be aware of who has access to the
database and when and how the data is used. Data monitoring tools can notify
you of data-related activities that are uncommon or seem to be dangerous. Any
device that connects to the network hosting the database must be physically
secured (in the sole control of the appropriate person) and be subject to
security checks throughout the day.
o Encryption: All data, including data stored in databases as well as credential
information, should be secured using the highest-quality encryption when in
storage and while in transport. All encryption keys must be handled in
accordance with best-practice guidelines.
o Database software security: Always use the most current version of the
database management software and apply any patches as soon as they are
released.
o Security for web server applications and websites: Any application or web
server that connects to the database can be a target and should be subjected
to periodic security testing and best-practice management.
o Security of backups: All backups, images, or copies of the database should have
the same (or equally rigorous) security controls as the database itself.
o Auditing: Audits of database security standards should be conducted every few
months. Record all logins to the database server and the operating system, and
also record any operations performed on sensitive data.
ACCESS CONTROL
These models are built into the core or the kernel of the different operating systems
and possibly their supporting applications. Every operating system has a security
kernel that enforces a reference monitor concept, which differs depending on the type
of access control model embedded into the system. For every access attempt, before
a subject can communicate with an object, the security kernel reviews the rules of the
access control model to determine whether the request is allowed.
If you have used any platform such as Windows, Mac or Linux, you can easily
understand and appreciate this model. If you create a folder in any of these, you can
easily add/delete/modify the permissions which you want to give to different subjects.
Sounds confusing? Well, it isn’t. Let’s take an example to understand this.
I have created a folder named "SSCP Video Course". Now, since I'm the owner, it is at
my discretion to assign various permissions to users. I can go to the "Security" tab,
"Edit" the permissions, and define which users should be given "Full control" and which
users can only be given "Read" access.
A system that uses discretionary access control (DAC) enables the owner of the
resource to specify which subjects can access specific resources. This model is called
discretionary because the control of access is based on the discretion of the owner.
There is another term which is used quite often with reference to the models. It is the
Access Control List. An ACL for a file would list all the users and/or groups that are
authorized access to the file and the specific access granted to each.
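A hedged sketch of a discretionary ACL check in Python; the object, owner, and permission names are invented, and a real DBMS would express the same idea through GRANT and REVOKE statements issued by the object's owner.

# Minimal discretionary access control: only the owner of an object may edit its ACL.
acl = {
    "SSCP Video Course": {
        "owner": "alice",
        "permissions": {"alice": {"read", "write", "full_control"},
                        "bob": {"read"}},
    }
}

def grant(obj, grantor, user, perm):
    entry = acl[obj]
    if grantor != entry["owner"]:
        raise PermissionError("only the owner may change the ACL (discretionary control)")
    entry["permissions"].setdefault(user, set()).add(perm)

def check(obj, user, perm):
    return perm in acl[obj]["permissions"].get(user, set())

print(check("SSCP Video Course", "bob", "write"))    # False
grant("SSCP Video Course", "alice", "bob", "write")  # the owner grants at her discretion
print(check("SSCP Video Course", "bob", "write"))    # True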
While all seems good in the world of DAC, there are some issues with this model.
Although it offers the best flexibility among all of the models, that flexibility is also its
weakest point. For example, if a user opens an attachment that is infected with a virus,
the code can install itself in the background without the user being aware of this activity.
This code basically inherits all the rights and permissions that the user has and can
carry out all the activities a user can perform on the system. It can send copies of itself
out to all the contacts listed in the user’s e-mail client, install a back door, attack other
systems, delete files on the hard drive, and more. The user is actually giving rights to
the virus to carry out its dirty deeds, because the user has very powerful discretionary
rights and is considered the owner of many objects on the system. And the fact that
many users are assigned local administrator or root accounts means that once
malware is installed, it can do anything on a system.
Do not confuse this with Apple MAC, this model is not even remotely related to it. This
model is the complete opposite of the DAC model. In a mandatory access control
(MAC) model, users do not have the discretion of determining who can access objects
as in a DAC model. An operating system that is based on a MAC model greatly reduces
the number of rights, permissions, and functionality a user has for security purposes.
You have surely seen movies where Ethan Hunt or Jason Bourne tries to access top
secret or confidential files to which he does not have access. Well, the MAC model
uses security labels to help implement this. Security labels are attached to all objects;
thus, every file, directory, and device has its own security label with its classification
information. Now, Jason Bourne may have a security clearance of secret, but the files
he requests in order to find his identity may carry a security label with the
classification of top secret. In this case, he will be denied access because his clearance is not
equivalent or does not dominate (is not equal to or higher than) the classification of
the object.
The problem with DAC was that the malware could inherit all permissions which the
user had and could install itself on the system. However, in the MAC systems, this isn’t
the case. Since users that work within a MAC system cannot install software, the
operating system does not allow any type of software, including malware, to be
installed while the user is logged in. But while MAC systems might seem an answer to
all our security prayers, they have very limited user functionality, require a lot of
administrative overhead, are very expensive, and are not user-friendly. DAC systems
are general-purpose computers, while MAC systems serve a very specific purpose.
If you work in a large organization, you can easily understand that awarding
permissions user by user could become a herculean task. Why not adopt a simpler way?
RBAC does exactly that: it simplifies the process by allowing security administrators to
provide the same level of access permissions to a whole group at once.
Suppose 3 people have joined your organization: one in R&D, another in the HR
department, and the third in a sales role. As a security administrator using RBAC, you
can award all the access permissions to them in a jiffy, simply by adding each of them
to their respective group.
The moment the person is added to the HR group, the RBAC model, extends all
privileges in this group to the new person who has been added.
In a similar manner, if a person leaves the role, the security administrator just needs
to remove this person from the group and all privileges would be revoked
immediately.
Rule-Based Access Control: This model is simply based on the premise "If X, then Y".
Rules are pre-defined, and access permissions are provided according to those rules.
For example, firewalls include a set of rules or filters within an ACL,
defined by an administrator. The firewall examines all the traffic going through it and
only allows traffic that meets one of the rules. Firewalls include a final rule (referred
to as the implicit deny rule) denying all other traffic. For example, the last rule might
be “deny all” to indicate the firewall should block all traffic in or out of the network
that wasn’t previously allowed by another rule.
If this sounds too complex, then consider the following statement. If the role id of the
person is XXX such as RND for research development, you can give him default access
to the top 3 basic folders. The system would follow this simple rule and access would
be provided only to such people.
While the rule-based model is commonly used in firewalls and routers to implement
access control permissions, a basic drawback is that it is very complex and
unproductive to provide finer granularity of access. In simple terms, rule-based
models enforce rules on all users, and if you start devising rules for granularity, the
system loses its ease of use.
Well, this sounds quite simple to implement, doesn't it? Yet if you tried to configure
such a fine-grained case through Rule-Based Access Control, it would become quite
complex.
To solve such complex access issues, we have the Attribute-Based Access Control
(ABAC) model. It is an advanced implementation of the RBAC model. Attributes can be
almost any characteristic of the users, the network, and the devices on the network,
and it is these attributes that are evaluated to implement the ABAC model.
How would ABAC solve the above problem? Well, you just need to enter the attributes
of the person and then configure the rule. The attributes which can be extracted from
the problem are R&D, Genetic Mutation, Super Senior Specialist, Office Network &
Timings. If a person has these attributes which can be added or removed at any point
in time, he can access the website and the firewall will not block it.
From an RBAC perspective, every attribute would have to be checked with a rule,
which would have consumed a lot of resources when access decisions need to be taken
in a jiffy. A user will not wait for an hour to find out whether or not he can access a
website; the system must evaluate the request on the basis of the rules and attributes
and provide an answer immediately.
Hence, in a nutshell, the Rule-based access control applies to all users, but the ABAC
can be much more specific.