File Structures and
UNIT 12 FILE STRUCTURES Advanced Data Structures
Structure
12.0 Introduction
12.1 Objectives
12.2 Basic concepts of file
12.3 File Organization
12.3.1 Key drivers for file organization
12.4 Sequential files
12.4.1 Operations on sequential files
12.4.2 Types of sequential files
12.4.3 Advantages and disadvantages of sequential files
12.4.4 Use cases for sequential files
12.5 Direct file organization
12.5.1 Hash function and collision resolution
12.5.2 Advantages and disadvantages of direct file
organization
12.6 Indexed Sequential File Organization
12.6.1 Advantages and disadvantages of indexed sequential
files
12.7 Summary
12.8 Solutions/Answers
12.9 Further Readings
12.0 INTRODUCTION
The data is stored and organized through file structure. Software applications and
platforms store the data in files. Operating systems manage critical configuration
through file and even databases also store the data in files such as transaction logs.
Files essentially serves as the key building blocks for data organization. We can
enforce governance such as data integrity checks, security rules, access controls,
sharing controls for files.
In this unit we discuss various aspects of file such as file organization, file types and
related concepts.
12.1 OBJECTIVES
After going through this unit, you should be able to
understand key concepts of files,
understand the file organization concepts,
understand sequential files and direct file organization,
Understand indexed file organization and indexed sequential files
12.2 BASIC CONCEPTS OF FILE
A file is a collection of data records. The key operations of the file are as follows:
File creation when the file gets created initially.
File update when we update the file with new data records or modify
existing data records.
1
File Structures File deletion when we delete the file
File merging that combines the content of multiple files into a single file.
File searching that involves searching for a specific data record in a file.
A data record in a file encapsulates the data of a single entity. For instance, an
employee data record captures details of a single employee such as employee id,
employee name, employee address, employee phone number and employee date of
join.
Each data record is uniquely identified by a key. For instance, in the employee data
record, employee id is the key.
The table 1 provides the employee data records in an organization:
Employee Employee Employee Department Date of Join
Number Name Address
45278 Employee 1 Kanpur Sales 01-Jan-2020
167352 Employee 2 Mysore Development 01-Mar-2020
12.3 FILE ORGANIZATION
File organization defines the way the data record is stored and retrieved from the files.
File organization provides the data record relationship specifying how the data records
are mapped to the disk blocks. We can either store fixed length record in the files or
use flexible file structure to support data records of varying length.
There are multiple ways of file organization such as sequential, direct, indexed
sequential and others. We can select the file organization type based on the use case,
information retrieval performance, ease of information management and others.
Given below are the most commonly used file organizations:
Sequential files – The data records are stored in a pre-defined order. For instance,
the data records are sequenced based on the key values. In this file organization
type we need to sequentially access the record.
Relative files – The data record is stored at a fixed location that can be accessed
using a static key value. As we can access the data record using a fixed value, we
can retrieve or insert the data record randomly in this file organization type.
Direct files – Similar to the relative file that supports non-integer key value.
Indexed sequential files – In this file organization we add an index to provide the
random access to the sequential files.
Indexed files – The data records are indexed to improve the performance of the
data retrieval.
Figure 1 provides various file organizations that are most commonly used.
2
File Structures and
Advanced Data Structures
Figure 1 Types of File organizations
12.3.1 Key drivers for file organization
Given below are the main drivers for the file organization:
File organization helps in optimal design of file records so that they can be
retrieved as quickly as possible.
Improve the performance of file operations such as insert, update and delete.
Ensure the data consistency and avoid duplicates during file operations.
Provide optimal storage for managing the files.
12.4 SEQUENTIAL FILES
We write the data records in a specific sequence in sequential files. The data is also
accessed in the order it is written to the physical device. We usually order the data
records based on the key value. In Sequential files the reading happens from the
beginning of the file and write happens at the end of the file.
The legacy storage systems such as tape storage use sequential files.
12.4.1 Operations on sequential files
We discuss the main operations on the sequential files:
Insert operation
As the sequential files are organized based on their keys, when we insert we should
find the exact position of the new key value and insert the key over there.
If there are multiple insertions, we can batch them to optimize the performance
optimization. We can collect all the new inserts into transaction log and then sort the
new keys in order in the transaction log. We can then merge the entries in the
transaction log into the main file in batch mode.
Delete Operation
In the delete operation, we need to reclaim the space from the deleted record. Here as
well we can use the concept of transaction log for performance optimization. We can
add the delete marker for the key position in the transaction log and later we can
merge the changes to the main file in batch mode. During merge operation, we drop
the data records in the main file based on the marker.
Update Operation
3
File Structures For performing the update operation, we need to add the delete marker for the
corresponding key and then insert the new data record for the corresponding key in
the transaction log. Like always we merge the transaction log to the main file
Retrieval Operation
We read the data record based on its key. For stale data reads, we can directly read the
data record for the key from the main file. To maintain read consistency, we need to
check If there are any pending updates in the transaction log and if so we need to
merge the changes (deletion/insertion).
12.4.2 Types of sequential files
There are mainly two types of sequential files – Pile files and sorted files.
Pile Files
In Pile files, the data records are inserted in the order of their arrival. For instance,
let’s consider the below given incoming Data records –
DR1, DR3, DR5, DR2
The data records are inserted as given in Figure 2:
Figure 2 Pile File
For searching the records, the file is searched from the beginning till the desired
record is found. For deleting a record, we search for the record and add a deletion
marker for the found record.
For inserting a new record, it will be done at the end of the file. The new record DR4
is to be inserted as shown in Figure 3
Figure 3 Pile File Insertion
Sort Files
In case of sort files, the data records are sorted based on the key and are inserted into
the file.
For instance, let’s consider the below given incoming Data records –
DR1, DR3, DR5, DR2
The data records are inserted as given in Figure 4:
Figure 4 Sort File
4
While inserting a new record the key of the new record is sorted along with the File Structures and
Advanced Data Structures
existing records and then inserted. For instance when we try to insert the data record
DR4, the sorted insert is depicted in Figure 5
Figure 5 Sort file insert
12.4.3 Advantages and disadvantages of sequential files
In this section we discuss the main advantages and disadvantages of the sequential
files.
Advantages
The main advantages of sequential files are as follows:
1. Simple and easy to implement
2. Provides best usage of the storage space.
3. Can be used for data archival needs.
4. The underlying storage systems provide low-cost storage option.
5. Can be stored in inexpensive device like tape device.
Disadvantages
The main disadvantages of the sequential files are as follows:
1. The update operation is costlier and time consuming.
2. The access speed is slow due to the sequential access.
3. It is not possible to directly access a given data record key
4. All records should have uniform structure.
5. The insert time of sorted file is high.
6. Overall processing is slow.
12.4.4 Use cases for sequential files
We generally use sequential files for long-term archival of data and for batch
processing operations. We also use sequential access for storing large amount of data
that tolerates slow retrieval time.
☞ Check Your Progress – 1
1. A ______ in a file encapsulates the data of a single entity
2. Each data record is uniquely identified by a ____.
3. In _______ files, the data records are inserted in the order of their arrival.
4. For optimizing the performance, we get all the updates into a ________ and
later merge it into main file.
5. The two types of sequential files are ______ and __________.
6. Define a data record for a typical university’s student enrolment system.
5
File Structures
12.5 DIRECT FILE ORGANIZATION
In the direct file organization, we directly access the file by its key. To enable this
direct access, we need to map the key to the address where the data record is stored.
Direct access is also known as random access. The file records are stored in direct
access storage device (DASD) like hard disk, CD, magnetic disk. We randomly place
the record throughout the hard disk.
We have depicted the direct file organization in Figure 6
Figure 6 Direct File Organization
We use hash function to convert the key into the address. Naturally direct file access
is faster compared to sequential files.
In direct file organization, the records are stored at a known addressed as depicted in
Figure 6.
12.5.1 Hash Function and Collision Resolution
The most popular hash function is the modulo based function that computes the
remainder value as the hash value based on the key. The hash value is then used to
store the data record.
When there are huge set of keys, multiple keys will end up with same hash value
leading to collision. In such cases, we use the collision resolution technique. The
separate chaining technique chains the values to the same slot and open addressing
technique probes for next available slot for placing the value.
12.5.2 Advantages and disadvantages of Direct file organization
We shall look at the main advantages and disadvantages of the direct file organization.
Advantages
Given below are the main advantages of direct file organization:
1. As we can directly access the data, the retrieval is fast that helps in transaction
management systems such as relational database systems.
2. The insert, update and delete operations are faster.
3. The direct file organization is efficient in storage and retrieval of large data.
4. Key-based search is faster.
5. Can be used for real-time transactions that needs optimal performance.
6. Concurrent processing is possible
6
Disadvantages File Structures and
Advanced Data Structures
Given below are the main disadvantages of direct file organization:
1. The required storage technology is costlier.
2. The usage of storage space is sub optimal.
3. Insert, delete and update needs update of the index table.
12.6 INDEXED SEQUENTIAL FILE
ORGANIZATION
Indexed sequential file organization supports direct access of keys and also sequential
access by keys.
We use the indexed sequential file organization for hierarchical data organization. For
instance, lets consider a use case for retrieving the data record for the states. We use
the indexed/direct file for getting the country. The country data record points to the
sequential file storing the records of its constituent states. Indexed sequential files can
be stored only on random access device such as magnetic disk.
To implement the indexed sequential file, we decouple the index and data files. The
index file is structured hierarchically in tree structure whereas the data file stores the
information sequentially.
We use two types of indexes static index and dynamic index.
We have depicted the depicted the indexed sequential file in Figure 7.
Figure 7 Indexed sequential file
As depicted in Figure 7, the indexed sequential file is depicted as a two-level
hierarchy. The first level holds the index to each of the three records in the data file.
The second level holds the sequential set of records.
The first level is accessed directly where we get the keys from the index file. In the
second level we store the data record sequentially. As the data records are stored
sequentially, the key in the first level just points to the first record in the second level
from where it can be accessed sequentially.
7
File Structures When we have to insert the record, we update the data file based on the sequence and
update the index accordingly. Static index and dynamic index are two main types of
indexes. In static indexes the update to the records in the data file does not change the
index structure whereas in the dynamic index, the updates to the data file changes the
index structure.
12.6.1 Advantages and disadvantages of Indexed sequential file organization
We shall look at the main advantages and disadvantages of the indexed sequential file
organization
Advantages
Given below are the main advantages of indexed sequential file organization:
As the records are directly accessed, the access speed is high
Record insert is very fast.
Disadvantages
Given below are the main disadvantages of indexed sequential file organization:
Usage of storage space is sub optimal
The implementation is costly due to the required of costly hardware
Extra storage for index file is required.
☞ Check Your Progress – 2
1. In the direct file organization, we directly access the file by its ___
2. The function that converts the key to address is called _____
3. The first level in the indexed sequential acess is done _______ and second
level is accessed _________
4. The data records in direct file organization are stored in ______storage device
5. _____ file organization for managing hierarchical data
12.7 SUMMARY
In this unit we mainly discussed about file organization and various types of file
organization. A file is a collection of data records. The key operations of the file are
File creation, file update, file deletion, file merging, file searching. File organization
defines the way the data record is stored and retrieved from the files. We write the
data records in a specific sequence in sequential files. There are mainly two types of
sequential files - Pile Files and sort files. In Pile files, the data records are inserted in
the order of their arrival. In case of sort files, the data records are sorted based on the
key and are inserted into the file. In the direct file organization, we directly access the
file by its key. Indexed sequential file organization supports direct access of keys and
also sequential access by keys.
12.8 SOLUTIONS/ANSWERS
☞ Check Your Progress – 1
1. Data record.
8
File Structures and
2. Key
Advanced Data Structures
3. Pile files
4. Transaction log
5. Pile files and sort files
6. A sample student enrollment structure is as follows:
Student Number, Student Name, Date of Enrollment, Address, Programme
☞ Check Your Progress – 2
1. Key
2. Hash function
3. Directly and sequentially
4. Direct access
5. Indexed sequential
12.9 FURTHER READINGS
References
https://en.wikipedia.org/wiki/File_system
https://en.wikipedia.org/wiki/Sequential_access
https://en.wikipedia.org/wiki/Indexed_file
https://en.wikipedia.org/wiki/ISAM