OS File Systems Overview
OS File Systems Overview
Raju Pandey
Department of Computer Sciences
  University of California, Davis
           Spring 2011
                                                        Overall view
                                                              Application
                                                               Program
                            mount()
                        write()                                                       WriteFile()
                                                                         CreateFile()
                      close() open()                                      CloseHandle() ReadFile()
                    lseek()     read()
                                                                                SetFilePointer()
Memory Mgr
                                                                                                           Memory Mgr
                                                Process Mgr
                                                                                                           Process Mgr
                                   Device Mgr
                                                                                              Device Mgr
                        File Mgr
                                                                                   File Mgr
                 UNIX                                                                                      Windows
Hardware
ECS 150 (Operating Systems)                                   File System (1), 2                                   Spring 2011 UC Davis
                                        Files
• A file is a collection of data with some properties
       contents, size, owner, last read/write time, protection …
• Files may also have types
       understood by file system
           o device, directory, symbolic link
       understood by other parts of OS or by runtime libraries
           o executable, dll, source code, object code, text file, …
ECS 150 (Operating Systems)          File System (1), 3       Source: Gribble, Lazowska,
                                                              Levy, Zahorjan
                               File Management
                                                   Applications
• The file manager
                                                                          Records
  administers the
  collection by:
    Storing the information
     on a device                                         Structured Record Files
  programmer?
 ECS 150 (Operating Systems)       File System (1), 4              Spring 2011 UC Davis
                               Interface Layers
                                             Std.
                        App.                                                               Disk
                                           Runtime                     OS
                        Code
                                            Library
                               Procedure                                   Device-type
                                 Calls                                 Dependent Commands
Whatever… Syscalls
ECS 150 (Operating Systems)            File System (1), 5               Source: Gribble, Lazowska,
                                                                        Levy, Zahorjan
                               Exported Abstractions
                                         Std.
                       App.                                                                 Disk
                                       Runtime                     OS
                       Code
                                        Library
/ /
ECS 150 (Operating Systems)            File System (1), 6                 Source: Gribble, Lazowska,
                                                                          Levy, Zahorjan
             Primary Roles of the OS (file system)
ECS 150 (Operating Systems)              File System (1), 7           Source: Gribble, Lazowska,
                                                                      Levy, Zahorjan
                              File access methods
• Some file systems provide different access methods that
  specify ways the application will access data
       sequential access
           o read bytes one at a time, in order
       direct access
           o random access given a block/byte #
       record access
           o file is array of fixed- or variable-sized records
       indexed access
           o FS contains an index to a particular field of each record in a file
           o apps can find a file based on value in that record (similar to DB)
• Why do we care about distinguishing sequential from direct
  access?
       what might the FS do differently in these cases?
ECS 150 (Operating Systems)           File System (1), 8         Source: Gribble, Lazowska,
                                                                 Levy, Zahorjan
                         Byte Stream File Interface
                  fileID = open(fileName)
                  close(fileID)
                  read(fileID, buffer, length)
                  write(fileID, buffer, length)
                  seek(fileID, filePosition)
ECS 150 (Operating Systems)       File System (1), 9   Spring 2011 UC Davis
                               Low Level Files
   fid = open(“fileName”,…);
   …
   read(fid, buf, buflen);                   b0 b1 b2     ...   bi   ...
   …
   close(fid);
ECS 150 (Operating Systems)        File System (1), 10           Spring 2011 UC Davis
                              Structured Files
Records
Record-Block Translation
ECS 150 (Operating Systems)         File System (1), 11      Spring 2011 UC Davis
                  Record-Oriented Sequential Files
Logical Record
                   fileID = open(fileName)
                   close(fileID)
                   getRecord(fileID, record)
                   putRecord(fileID, record)
                   seek(fileID, position)
ECS 150 (Operating Systems)   File System (1), 12           Spring 2011 UC Davis
                  Record-Oriented Sequential Files
Logical Record
ECS 150 (Operating Systems)         File System (1), 13           Spring 2011 UC Davis
                              Indexed Sequential File
                   fileID = open(fileName)
                   close(fileID)
                   getRecord(fileID, index)
                   index = putRecord(fileID, record)
                   deleteRecord(fileID, index)
ECS 150 (Operating Systems)          File System (1), 14   Spring 2011 UC Davis
                     Indexed Sequential File (cont)
           Application structure
             Account # Index                                index = i
             012345
             123456       i
             294376       k
                                                                           index = k
             ...
             529366       j
             ...
             965987                                      index = j
ECS 150 (Operating Systems)        File System (1), 15           Spring 2011 UC Davis
                                  Directories
• Directories provide:
       a way for users to organize their files
       a convenient file name space for both users and FS’s
• Most file systems support multi-level directories
       naming hierarchies (/, /usr, /usr/local, /usr/local/bin, …)
• Most file systems support the notion of current directory
       absolute names: fully-qualified starting from root of FS
           bash$ cd /usr/local
ECS 150 (Operating Systems)         File System (1), 16           Source: Gribble, Lazowska,
                                                                  Levy, Zahorjan
                              Single-Level Directory
Naming problem
Grouping problem
       Path name
       Can have the same file name for different user
       Efficient searching
       No grouping capability
• Grouping Capability
ECS 150 (Operating Systems)         File System (1), 27      Source: Gribble, Lazowska,
                                                             Levy, Zahorjan
                              Path name translation
• Let’s say you want to open “/one/two/three”
      fd = open(“/one/two/three”, O_RDWR);
• What goes on inside the file system?
         open directory “/” (well known, can always find)
         search the directory for “one”, get location of “one”
         open directory “one”, search for “two”, get location of “two”
         open directory “two”, search for “three”, get loc. of “three”
         open file “three”
         (of course, permissions are checked at each step)
• FS spends lots of time walking down directory paths
       this is why open is separate from read/write (session state)
       OS will cache prefix lookups to enhance performance
           o /a/b, /a/bb, /a/bbb all share the “/a” prefix
ECS 150 (Operating Systems)         File System (1), 28      Source: Gribble, Lazowska,
                                                             Levy, Zahorjan
                              File protection
• FS must implement some kind of protection system
       to control who can access a file (user)
       to control how they can access it (e.g., read, write, or exec)
• More generally:
       generalize files to objects (the “what”)
       generalize users to principals (the “who”, user or program)
       generalize read/write to actions (the “how”, or operations)
• A protection system dictates whether a given action
  performed by a given principal on a given object should be
  allowed
       e.g., you can read or write your files, but others cannot
       e.g., your can read /etc/motd but you cannot write to it
ECS 150 (Operating Systems)      File System (1), 29   Source: Gribble, Lazowska,
                                                       Levy, Zahorjan
                 Model for representing protection
objects
      principals gribble      r                  rw             r
                     guest                                      r                  capability
                                   ACL
ECS 150 (Operating Systems)              File System (1), 30         Source: Gribble, Lazowska,
                                                                     Levy, Zahorjan
                              ACLs vs. Capabilities
• Capabilities are easy to transfer
       they are like keys: can hand them off
       they make sharing easy
• ACLs are easier to manage
       object-centric, easy to grant and revoke
           o to revoke capability, need to keep track of principals that have it
           o hard to do, given that principals can hand off capabilities
• ACLs grow large when object is heavily shared
       can simplify by using “groups”
           o put users in groups, put groups in ACLs
           o you are all in the “VMware powerusers” group on Win2K
       additional benefit
           o change group membership, affects ALL objects that have this group
             in its ACL
ECS 150 (Operating Systems)         File System (1), 31       Source: Gribble, Lazowska,
                                                              Levy, Zahorjan
                     Implementing Low Level Files
ECS 150 (Operating Systems)      File System (1), 32   Spring 2011 UC Davis
                                         Disk Organization
Block 0
b0 b1 b2 b3 … … bn-1
...
ECS 150 (Operating Systems)             File System (1), 34    Spring 2011 UC Davis
                              File Descriptors
 •External name
 •Current state
 •Sharable
 •Owner
 •User
 •Locks
 •Protection settings
 •Length
 •Time of creation
 •Time of last modification
 •Time of last access
 •Reference count
 •Storage device details
ECS 150 (Operating Systems)      File System (1), 35   Spring 2011 UC Davis
                              An open() Operation
ECS 150 (Operating Systems)        File System (1), 36   Spring 2011 UC Davis
                 File Manager Data Structures
 ECS 150 (Operating Systems)           File System (1), 37     Spring 2011 UC Davis
                              Opening a UNIX File
             0    stdin
             1    stdout
                                 File structure
             2    stderr
             3    ...
                                                                    inode
ECS 150 (Operating Systems)       File System (1), 39    Spring 2011 UC Davis
                              Contiguous Allocation
                              File descriptor
                                Head position                237
                                …
                                First block                  785
                                Number of blocks             25
ECS 150 (Operating Systems)            File System (1), 40         Spring 2011 UC Davis
ECS 150 (Operating Systems)   File System (1), 41   Source: Stallings
                              After Compaction
ECS 150 (Operating Systems)      File System (1), 42   Source: Stallings
                 Linked Lists or Chained Allocation
        First block
        …                     Length              Length            Length
                              Byte 0             Byte 0            Byte 0
                               ...                ...               ...
        Head: 417             Byte 4095          Byte 4095         Byte 4095
        ...
                              Block 0            Block 1           Block N-1
ECS 150 (Operating Systems)     File System (1), 43          Spring 2011 UC Davis
ECS 150 (Operating Systems)   File System (1), 44   Source: Stallings
                              After Consolidation
ECS 150 (Operating Systems)      File System (1), 45   Source: Stallings
                               Indexed Allocation
                                                             Byte 0
                                                              ...
  Index block
                                                             Byte 4095
  …
                              Length                         Block 0
                              Length                                                     Byte 0
  Head: 417                                                                               ...
  ...                                                                                    Byte 4095
                                                                  Byte 0                 Block 1
                                                                   ...
                              Length                              Byte 4095
                                                                  Block N-1
ECS 150 (Operating Systems)            File System (1), 46               Spring 2011 UC Davis
                    Indexed allocation with Block portions
ECS 150 (Operating Systems)         File System (1), 47      Source: Stallings
                Indexed allocation with variable length portions
ECS 150 (Operating Systems)      File System (1), 49   Source: Gribble, Lazowska,
                                                       Levy, Zahorjan
      (Old) Unix disks are divided into five parts …
• Boot block
       can boot the system by loading from this block
• Superblock
       specifies boundaries of next 3 areas, and contains head of
        freelists of inodes and file blocks
• i-node area
       contains descriptors (i-nodes) for each file on the disk; all i-
        nodes are the same size; head of freelist is in the superblock
• File contents area
       fixed-size blocks; head of freelist is in the superblock
• Swap area
       holds processes that have been swapped out of memory
ECS 150 (Operating Systems)     File System (1), 50    Source: Gribble, Lazowska,
                                                       Levy, Zahorjan
                                  So …
• You can attach a disk to a dead system …
• Boot it up …
• Find, create, and modify files …
       because the superblock is at a fixed place, and it tells you
        where the i-node area and file contents area are
       by convention, the second i-node is the root directory of the
        volume
ECS 150 (Operating Systems)    File System (1), 51   Source: Gribble, Lazowska,
                                                     Levy, Zahorjan
                              i-node format
    • User number
    • Group number
    • Protection bits
    • Times (file last read, file last written, inode last written)
    • File code: specifies if the i-node represents a directory, an
      ordinary user file, or a “special file” (typically an I/O
      device)
    • Size: length of file in bytes
    • Block list: locates contents of file (in the file contents
      area)
           more on this soon!
    • Link count: number of directories referencing this i-node
ECS 150 (Operating Systems)      File System (1), 52   Source: Gribble, Lazowska,
                                                       Levy, Zahorjan
                       The flat (i-node) file system
• Each file is known by a number, which is the number of the
  i-node
       seriously – 1, 2, 3, etc.!
       why is it called “flat”?
• Files are created empty, and grow when extended through
  writes
ECS 150 (Operating Systems)      File System (1), 53   Source: Gribble, Lazowska,
                                                       Levy, Zahorjan
     The tree (directory, hierarchical) file system
                                       …
           1
                 …
                                                            …
          10
                                  …
          11
          12
                                                                …
                                                    …
                              …
ECS 150 (Operating Systems)           File System (1), 55       Source: Gribble, Lazowska,
                                                                Levy, Zahorjan
                                    So …
ECS 150 (Operating Systems)      File System (1), 56    Source: Gribble, Lazowska,
                                                        Levy, Zahorjan
    • Can get to 128 x 128 x 128 x 512B = an additional 1GB
      with a triple indirect reference
           (the 13th pointer in the i-node gets you to a 512B block in
            the file contents area that contains 128 4B pointers to
            512B blocks in the file contents area that contain 128 4B
            pointers to 512B blocks in the file contents area that
            contain 128 4B pointers to 512B blocks holding file data)
    • Maximum file size is 1GB + a smidge
    • Berkeley Unix went to 1KB block sizes
           What’s the effect on the maximum file size?
               o 256x256x256x1K = 17 GB + a smidge
           What’s the price?
    • Subsequently went to 4KB blocks
           1Kx1Kx1Kx4K = 4TB + a smidge
ECS 150 (Operating Systems)      File System (1), 57   Source: Gribble, Lazowska,
                                                       Levy, Zahorjan
                                Putting it all together
                                                                             •••
                                                                                               directory ‘var/’
                                                        directory ‘usr/’
                                                                                               (table of entries)
                                     inode for          (table of entries)
                                                                              inode for
                                     ‘usr/’                                   ‘var/’
                                                                  data blocks
                              indirection
                              block
 inode       file block                                  indirection block
 free list   free list
                                            •••
                                                                                 data blocks
                       Indirection
                       block
ECS 150 (Operating Systems)                      File System (1), 58                 Source: Gribble, Lazowska,
                                                                                     Levy, Zahorjan
    The UNIX File System
ECS 150 (Operating Systems)       File System (1), 60   Source: Gribble, Lazowska,
                                                        Levy, Zahorjan
                              File system consistency
• Both i-nodes and file blocks are cached in memory
• The “sync” command forces memory-resident disk
  information to be written to disk
       system does a sync every few seconds
• A crash or power failure between sync’s can leave an
  inconsistent disk
• You could reduce the frequency of problems by reducing
  caching, but performance would suffer big-time
ECS 150 (Operating Systems)          File System (1), 61   Source: Gribble, Lazowska,
                                                           Levy, Zahorjan
                     What do you do after a crash?
• Run a program called “fsck” to try to fix any consistency
  problems
• fsck has to scan the entire disk
       as disks are getting bigger, fsck is taking longer and longer
       modern disks: fsck can take a full day!
• Newer file systems try to help here
       are more clever about the order in which writes happen, and
        where writes are directed
           o e.g., Journaling file system: collect recent writes in a log called a
             journal. On crash, run through journal to replay against file
             system.
ECS 150 (Operating Systems)          File System (1), 62      Source: Gribble, Lazowska,
                                                              Levy, Zahorjan
                              VFS-based File Manager
ECS 150 (Operating Systems)            File System (1), 63       Spring 2011 UC Davis
    Linux VFS
• Multiple interfaces build
  up VFS:                                    File access
         files
         dentries
         inodes
         superblock
         quota
• VFS can do all caching &
  provides utility fctns to FS
• FS provides methods to
  VFS; many are optional
ECS 150 (Operating Systems)   File System (1), 64          Source: P.J.Braam, CMU
    User level file access
ECS 150 (Operating Systems)   File System (1), 65   Source: P.J.Braam, CMU
    VFS
ECS 150 (Operating Systems)   File System (1), 66   Source: P.J.Braam, CMU
    File system level
ECS 150 (Operating Systems)       File System (1), 67     Source: P.J.Braam, CMU
    Anatomy of stat system call
sys_stat(path, buf) {
 dentry = namei(path);
 if ( dentry == NULL ) return -
ENOENT;                                             Establish VFS data
 inode = dentry->d_inode;
 rc =inode->i_op-
                                                    Call into inode
>i_permission(inode);
                                                    layer of filesystem
 if ( rc ) return -EPERM;
                                                    Call into inode
  rc = inode->i_op-                                 layer of filesystem
>i_getattr(inode, buf);
  dput(dentry);
  return rc;
}
ECS 150 (Operating Systems)   File System (1), 68      Source: P.J.Braam, CMU
     Anatomy of fstatfs system call
ECS 150 (Operating Systems)   File System (1), 69   Source: P.J.Braam, CMU
                                       DOS FAT Files
    File Descriptor
                              43                254
                                                                 …           107
                                   Disk           Disk                         Disk
                                   Block          Block                        Block
    File Descriptor
  43                          43                254
                                                                 …           107
 107                               Disk           Disk                         Disk
                                   Block          Block                        Block
254