Introduction
• Most Applications collect huge amount of
                                                       Data, [ Bank Accounts, College Admissions,
            File Structure And Hashing                 Scientific Experiments etc.]
                                                     • Files stored on computer are a good alternative
                                                       to store.
                                                     • Primary Storage vs. Secondary Storage.
                                                     • Less Time and Less Space vs. More Time and
                                                       More Space.
    File structure and data structure                    File structure and data structure
• A data structure could be present both in          • In most (binary) file structures, the basic
  RAM and DISK.                                        organizing elements are Pages and Physical
• Technically the file structures are more             Links. 
  standardized, especially if one wants to allow     • Page (computer memory) map to file system
  a third party app or a newer version of the
  same app to open the file.                           chunks that are (hopefully) guaranteed to be
                                                       stored contiguously and can be read and
• Data structures can change between app
  versions and there is no need for compatibility.     written in a single operation.
2015-11-2                                            2015-11-2
                    Data Hierarchy                                            File Attributes
• Every file contains data which can be               • Every file is stored in directory in computer
  organized in hierarchy for systematic                 system.
  organization.                                       • Each file is having list of attributes associated
     – Data Field                                       with it. They state how to use file to any
     – Record                                           software program looks up the directory.
     – File                                           • Just like hidden files are not displayed in DOS
     – Directory                                        when we apply DIR command.
2015-11-2                                             2015-11-2
                    File Attributes                                           File Attributes
• File name – string of characters vary by OS         • Attributes Flag – a single byte is used to
• File position – position where the next               specify 6 flags attached to a file
  read/write operation can be performed                           Attribute         Byte
• File structure – whether text file or binary file               Read-Only         0000 0001
                                                                  Hidden            0000 0010
• File access method – how to access records,                     System            0000 0100
  sequentially or randomly                                        Volume Label      0000 1000
                                                                  Directory         0001 0000
                                                                  Archive           0010 0000
2015-11-2                                             2015-11-2
                 File Attributes                                        File Attributes
• Read-only – can’t get deleted or modified. Get       • Directory – Extra bit to differentiate directory
  message ‘access denied’ if you do so.                  from file. In directory listing it is used to
• Hidden – Not displayed in directory listing.           specify sub directories. Theoretically by
• System – Important as used by system and               changing this bit, we can convert a file to
  must not be altered or removed. More than              directory but practically results in failure.
  Read-only.                                           • Archive – Backup program uses it by setting 1
• Volume label – Every disk volume is assigned           to 0 and 0 to 1 to have the latest copy.
  for identification, mainly at the time of
  formatting.
2015-11-2                                              2015-11-2
            Text files vs. Binary files                            Text files vs. Binary files
• Basically every file is just a series of bytes one   • A binary file is basically any file that is not
  after the other.                                       "line-oriented". Any file where besides the
                                                         actual written characters and newlines there
• In general every file is a binary file, but if the     are other symbols as well.
  data in it contains only text like letter,           • A Microsoft word file is a binary file as besides
  numbers and other symbols one would use in             the actual text, it also contains various
  writing, and if it consists of lines, then we          characters representing font size and color.
  consider it a text file.                             • Program written in the C programming
                                                         language is a text file, but after you compiled
                                                         it, the compiled version is binary.
2015-11-2                                              2015-11-2
                File Organization                                        File Organization
• File is a collection of related records so main         • Sequential Organization
  issue in file management is the way records             • Relative File Organization
  are organized inside the file.
                                                          • Indexed Sequential File Organization
• How to select file organization method:
     – Rapid access to one or more records
     – Ease of insert/update/delete one or more records
       w/o affecting speed of accessing records
     – Efficient storage of records
2015-11-2                                                 2015-11-2
            Sequential Organization                                   Sequential Organization
• In a sequential file, we place records in a block       • Suppose we only wish to find one record. We
  one after another until there is no room for              may have to search the entire file, reading
  another complete record in the block.                     one block at a time to find the record.
• If we wish to access all the records, one after
  another, this is a highly efficient file
  organization.
2015-11-2                                                 2015-11-2
            Relative File Organization                               Relative File Organization
• In contrast to SEQ files, records of a RELATIVE   • Address of ith record =
  file can be accessed by specifying the record       base_add + (i-1)*record_length
  sequence number/relative key and without          • Add of 5th record = 1000+(4)*20 = 1080
  needing to read all the previous records.         • Possible only if records are of equal lenths.
                                                                Relative Record Number   Records in Memory
• Record with sequence number 16 is located                     0                        Record 0
                                                                1                        Record 1
  just after the 15th record.
                                                                2                        Free
• Range from 0 to n-1.                                          3                        Free
                                                                ……                       …..
                                                                98                       Free
                                                                99                       Record 99
2015-11-2                                           2015-11-2
            Relative File Organization                  Indexed Sequential File Organization
• This can be used as Relative as well as           •   Provides fast data retrieval
  sequential                                        •   Mostly records are of fixed length
• Random access, ease of processing, low            •   Index table stores address of each record
  overhead, add/delete based on relative key        •   Sequential and random access is possible
• Only disk devices uses this, limited to fixed     •   Index table is read sequentially to find the
  length records, relative record number must           address of the desired record
  be known in advance to access any record
2015-11-2                                           2015-11-2
   Indexed Sequential File Organization                                        Indexing
• Advantage is that indexes are small and             • Index to file is like a catalogue to library
  searched quickly                                    • Indexed sequential file organization is efficient
• Disadvantage is extra space for index                 to use but in practical where a file may have
       Record Number    Address of Record
                                                        millions of records, method fails
       1                1234                Record    • Based on access time, access type, insertion
       2                NULL
                                                        time, deletion time, we have two categories,
       ….               ….
       99               4567                Record         – Ordered Indices
                                                           – Hash Indices
2015-11-2                                             2015-11-2
                       Ordered Indices                                Ordered Indices
• Indexes are used to provide fast and random         • Secondary Index, while primary index
  access to records                                     provides sequential order only. Here the
• Primary Index – Index whose search key                search key which provides random access or
  specifies the sequential order of file.               different from sequential order will work as
• Example, STUDENT file has sequential order            secondary index.
  starting from roll no. 1 to 70. So roll number is   • Example, in STUDENT file use student name as
  primary index. Call roll no 20 and you get            index which never provides sequential order,
  record.                                               as they are sorted roll number wise.
2015-11-2                                             2015-11-2
                    Hash Indices                                         Why Hashing ?
• Based on Hashing Techniques                              • Internet has grown to millions of users
                                                             generating terabytes of content every day.
                                                           • With this kind of growth, it is impossible to
                                                             find anything in the internet, unless we
                                                             develop new data structures and algorithms
                                                             for storing and accessing data.
2015-11-2                                                  2015-11-2
Why not traditional data structures?                                     Why Hashing ?
• Traditional data structures like Arrays and              • Therefore we discuss a new technique called
  Linked Lists?                                              hashing that allows us to update and retrieve
• Amount of time required to look up an                      any entry in constant time O(1).
  element in the array is either O(log n) or O( n).        • The constant time or O(1) performance means,
     – If array is sorted then binary search or searched     the amount of time to perform the operation
       linearly.                                             does not depend on data size 'n'.
• Either case may not be desirable if we need to
  process a very large data set.
2015-11-2                                                  2015-11-2
                 Hash Function                                              Hash Function
• Pair is of the form (key, value), where for               • The concept of a hash table is a generalized
  given a key, we can find a value using some                 idea of an array where key does not have to
  kind of a “function” that maps keys to values.              be an integer.
• The key for a given object can be calculated              • We can have a name as a key, or for that
                                                              matter any object as the key.
  using a function called a hash function.
                                                            • The trick is to find a hash function to compute
• For example, given an array A, if i is the key,             an index so that an object can be stored at a
  then we can find the value by simply looking                specific location in a table such that it can
  up A[i].                                                    easily be found.
2015-11-2                                                   2015-11-2
                     Example                                                     Example
• Suppose we have a set of strings {“abc”, “def”, “ghi”}    • If we assume that we have a table of size 5 to store
  that we’d like to store in a table.                         these strings,
• Our objective here is to find or update them quickly      • we can compute the location of the string by taking
  from a table, actually in O(1).                             the sum mod 5. So we will then store “abc” in 6 mod
• We are not concerned about ordering them or                 5 = 1, “def” in 15 mod 5 = 0, and “ghi” in 24 mod 5 =
  maintaining any order at all.                               4.
• Suppose we assign “a” = 1, “b”=2, … etc to all            • So locations 1, 0 and 4 as follows.
  alphabetical characters, then simply compute a
  number for each of the strings.
• “abc” = 1 + 2 + 3=6, “def” = 4 + 5 + 6=15 , “ghi” = 7 +
  8 + 9=24
2015-11-2                                                   2015-11-2
                  Example                                     Problem with Hashing
• We can immediately compute the location         • The method discussed seems too good as we begin
  using a simple hash function, which is sum of     to think more about the hash function.
  the characters mod Table size.                  • First of all, the hash function we used, that is the
• Using this hash value, we can search for the      sum of the letters, is a bad one.
  string. This seems to be great way to store a   • In case we have permutations of the same letters,
  Dictionary.                                       "abc", "bac" etc in the set, we will end up with the
• Therefore the idea of hashing seems to be a       same value for the sum and hence the key.
  great way to store pairs of (key, value) in a   • In this case, the strings would hash into the same
  table.                                            location, creating what we call a "collision".
2015-11-2                                         2015-11-2
            Problem with Hashing                                     Question 1
• Question 1: How do we pick a good hash          • Picking a “good” hash function is key to
  function?                                         successfully implementing a hash table.
                                                  • What we mean by “good” is that the function
• Question 2: How do we deal with collisions?       must be easy to compute and avoid collisions
                                                    as much as possible.
2015-11-2                                         2015-11-2
                     Question 1                                              Division Method
• Properties of Good Hash Function                          • This method divides Key K by Prime number
     – Low Cost – Binary search find value with log n, so     M and then uses the reminder thus obtained.
       hash function must cost less than this
     – Determinism – same hash value for same input
                                                            • h(K) = K mod M
     – Uniformity – uniform hash values minimize the
       number of collisions
                                                            • Example:
                                                                 – h(12345) = 12345 % 95 = 90
2015-11-2                                                   2015-11-2
              Midsquare Method                                               Folding Method
• Square the value of the Key                               • Divide the Key value into a number of parts.
• Extract the middle r bits of the result                        – K into k1, k2, k3, .....kn. where each part has same
                                                                   number of digits except the last one.
                                                            • Add the individual parts.
• Example:                                                       – obtain k1+k2+k3+....+kn. the hash value is
     – if K = 12345                                                obtained by ignoring the last carry.
     – Square(K) = 152399025                                • Example:
     – middle 3 bits: 399                                        – K = 12345: k1=12, k2=34,k3=5
     – middle 2 bits: 39                                         – k1+k2+k3 = 51
2015-11-2                                                   2015-11-2
            Multiplication Method                                  Algebraic Method
•   Choose a constant A such that 0 < A < 1            • Suppose we need to store a dictionary in a
•   Multiply Key K by A
•   Extract the fractional part of KA
                                                         hash table. A dictionary is a set of Strings and
•   Multiply the result by m and take floor              we can define a hash function as follows.
                                                         Assume that S is a string of length n and
• h(K) = floor [m (KA mod 1)]
• Example:
   – h(12345) = floor[ 100 (12345*0.357840 mod 1)]
               = floor[ 100 (4417.5348 mod 1) ]
               = floor[ 100 (0.5348) ]
               = floor[ 53.48 ]                        • where p is a prime number. Obviously, each
               = 53                                      string will lead to a unique number
2015-11-2                                              2015-11-2
                        Collisions                                 Collision Resolution
• It is difficult to find a “perfect” hash function,   • If two keys hash to the same index, the
  that is a function that has no collisions.             corresponding records cannot be stored in the
                                                         same location.
• One problem with hashing is that it is possible      • So, if it's already occupied, we must find
  that two strings can hash into the same                another location to store the new record, and
  location. This is called a collision.                  do it so that we can find it when we look it up
                                                         later on.
                                                       • There are number of collision resolution
                                                         techniques, but the most popular
                                                         are chaining and open addressing.
2015-11-2                                              2015-11-2
              Open Addressing                                       Open Addressing
• Open addressing hash tables can store the            • Linear probing, looking for the next available
  records directly within the array.                     location i+1, i+2, etc. from the hashed value i
• A hash collision is resolved by probing, or          • Quadratic probing, same as linear probing,
  searching through alternate locations in the           except we look for available positions i+1 , i +
  array (the probe sequence) until either the            4, i + 9, etc from the hashed value i
  target record is found, or an unused array slot
  is found, which indicates that there is no such
  key in the table.
2015-11-2                                              2015-11-2
                     Chaining
• Each slot in the array references a linked list of
  inserted records that collide to the same slot.
• Insertion requires finding the correct slot, and
  appending to either end of the list in that slot;                    Thank You...
  deletion requires searching the list and removal.
2015-11-2                                              2015-11-2