Unit 5:
Storage & Data Security
• Storage structure, file organization, Recovery and
atomicity, Performance measures of discs, RAID
levels, Indices, B+ Tree, Hashing, Bitmap indices,
Query optimizations, Database Security, Data
mining models and techniques, Distributed
Databases, GIS.
Storage System
• For storing the data, there are different types
of storage options available. These storage
types differ from one another as per the speed
and accessibility. There are the following types
of storage devices used for storing the data:
Types of Data Storage
• Primary Storage- Main Memory, Cache
• Secondary Storage- Flash Memory, Magnetic
Disk Storage
• Tertiary Storage- Optical Storage, Tape Storage
System Issues: How to Build a DBMS
Query Optimization
and Execution
Discussed so far Relational Operators
Files and Access Methods
New topic
Buffer Management
Disk Space Management
DB
Data on External Storage
• Disks: Can retrieve random page at fixed cost
– But reading several consecutive pages is much cheaper than reading
them in random order
• Tapes: Can read pages only in sequence
– Cheaper than disks; used for archival storage
• File organization: Method of arranging a file of records on
external storage.
– Record id (rid) is sufficient to physically locate record
– Indexes are data structures that allow us to find the record ids of
records with given values in index search key fields
• Architecture: Buffer manager stages pages from external
storage to main memory buffer pool. File and index layers
make calls to the buffer manager.
Alternative File Organizations
Many alternatives exist, each ideal for some
situations, and not so good in others:
– Heap (random order) files: Suitable when typical
access is a file scan retrieving all records.
– Sorted Files: Best if records must be retrieved in
some order, or only a `range’ of records is needed.
– Indexes: Data structures to organize records via
trees or hashing.
• Like sorted files, they speed up searches for a subset of
records, based on values in certain (“search key”) fields
• Updates are much faster than in sorted files.
RAID
• Redundant Array of Independent Disk
(RAID) combines multiple small,
inexpensive disk drives into an array of disk
drives which yields performance more than
that of a Single Large Expensive Drive
(SLED). RAID is also called Redundant
Array of Inexpensive Disks.
Level of RAID
1.RAID-0 (Stripping)
2.RAID-1 (Mirroring)
3.RAID-2 (Bit-Level Stripping with Dedicated Parity)
4.RAID-3 (Byte-Level Stripping with Dedicated Parity)
5.RAID-4 (Block-Level Stripping with Dedicated Parity)
6.RAID-5 (Block-Level Stripping with Distributed Parity)
7.RAID-6 (Block-Level Stripping with two Parity Bits)
Indexes
• An index on a file speeds up selections on the
search key fields for the index.
– Any subset of the fields of a relation can be the
search key for an index on the relation (e.g., age or
colour).
– Search key is not the same as key (minimal set of
fields that uniquely identify a record in a relation).
• An index contains a collection of data entries,
and supports efficient retrieval of all data entries
k* with a given key value k.
• Example of Index: Essentials of Game Theory
Example of Alternative 2
Loca- colour
tion
6 data entries,
1 Red
sorted by colour
2 Red
3 Red
4 blue
5 blue
6 blue
Index Classification
• Primary vs. secondary: If search key contains primary key,
then called primary index.
– Unique index: Search key uniquely identifies record.
• Clustered vs. unclustered: If order of data records is the same
as, or `close to’, order of data entries, then called clustered
index.
– Alternative 1 implies clustered; in practice, clustered also implies
Alternative 1 (since sorted files are rare).
– A file can be clustered on at most one search key.
– Cost of retrieving data records through index varies greatly based on
whether index is clustered or not!
Clustered vs. Unclustered Index
• Suppose that Alternative (2) is used for data entries, and that the
data records are stored in a Heap file.
– To build clustered index, first sort the Heap file (with some free space on
each page for future inserts).
– Overflow pages may be needed for inserts. (Thus, order of data recs is `close
to’, but not identical to, the sort order.)
Index entries
CLUSTERED direct search for UNCLUSTERED
data entries
Data entries Data entries
(Index File)
(Data file)
Data Records Data Records
Hash-Based Indexes
• Good for equality selections.
• Index is a collection of buckets. Bucket = primary
page plus zero or more overflow pages.
• Hashing function h: h(r) = bucket in which record r
belongs. h looks at the search key fields of r.
• If Alternative (1) is used, the buckets contain
the data records; otherwise, they contain
<key, rid> or <key, rid-list> pairs.
B+ Tree Indexes
Non-leaf
Pages
Leaf
Pages
Leaf pages contain data entries, and are chained (prev & next)
Non-leaf pages contain index entries; they direct searches:
index entry
P0 K 1 P1 K 2 P 2 K m Pm
Example B+ Tree
Root
17
Entries <= 17 Entries > 17
5 13 27 30
2* 3* 5* 7* 8* 14* 16* 22* 24* 27* 29* 33* 34* 38* 39*
• Find 28*? 29*? All > 17* and < 30*
• Insert/delete: Find data entry in leaf, then
change it. Need to adjust parent sometimes.
– And change sometimes bubbles up the tree
Comparing File Organizations
• Heap files (random order; insert at eof)
• Sorted files, sorted on <age, sal>
• Clustered B+ tree file, Alternative (1), search key
<age, sal>
• Heap file with unclustered B + tree index on search
key <age, sal>
• Heap file with unclustered hash index on search key
<age, sal>
Hashing
• In this technique, data is stored at the data
blocks whose address is generated by using
the hashing function. The memory location
where these records are stored is known as
data bucket or data blocks.
Database Security
• Security of databases refers to the array of
controls, tools, and procedures designed to
ensure and safeguard confidentiality, integrity,
and accessibility. This tutorial will concentrate
on confidentiality because it's a component
that is most at risk in data security breaches.
Security for databases must cover and
safeguard the following aspects:
• The database containing data.
• Database management systems (DBMS)
• Any applications that are associated with it.
• Physical database servers or the database
server virtual, and the hardware that runs it.
• The infrastructure for computing or network
that is used to connect to the database.
Data Mining
• Data mining refers to extracting or mining
knowledge from large amounts of data. In
other words, Data mining is the science, art,
and technology of discovering large and
complex bodies of data in order to discover
useful patterns.
Data Mining Techniques
Distributed Database System
• A distributed database is basically a database
that is not limited to one system, it is spread
over different sites, i.e, on multiple computers
or over a network of computers.
• A distributed database system is located on
various sites that don’t share physical
components.
Types:
1. Homogeneous Database:
In a homogeneous database, all different sites store database identically. The operating system,
database management system, and the data structures used – all are the same at all sites. Hence,
they’re easy to manage.
2. Heterogeneous Database:
In a heterogeneous distributed database, different sites can use different schema and software that
can lead to problems in query processing and transactions. Also, a particular site might be
completely unaware of the other sites. Different computers may use a different operating system,
different database application. They may even use different data models for the database. Hence,
translations are required for different sites to communicate.
GIS: Geographic Information System
• GIS stands for Geographic Information System.
• It is a system designed to collect, analyze, manipulate,
manage, and display all types of geographical and
spatial data and information.
• It allows you to perform spatial analysis and manage
large data and display the information in maps or
graphical form for analysis and presentation.
• These benefits make GIS a valuable tool to visualize
spatial data or to build decision support systems for an
organization.
Thank you!!