spcl.inf.ethz.
ch
                                                                      @spcl_eth
ADRIAN PERRIG & TORSTEN HOEFLER
Networks and Operating Systems                   (252-0062-00)
Chapter 9: I/O Subsystems
                                  Never underestimate the KISS principle!
                                                                      spcl.inf.ethz.ch
                                                                           @spcl_eth
Cache re-load and a magic trick
 Last time
   On-disk data structures
    File representation
    Block allocation
    Directories
   FAT32 case study
    Very simple block interface
    Single table
   FFS case study
    Blocked interface
    Uses inodes, direct, (single, double, triple …) indirect blocks
   NTFS case study
    Extent interface
    Direct and indirect extent pointers
                                                                                     2
                                                                           spcl.inf.ethz.ch
                                                                                @spcl_eth
Our Small Quiz
 True or false (raise hand)
   1. Directory structures can never contain cycles
   2. Access control lists scale to large numbers of principals
   3. Capabilities are stored with the principals and revocation can be complex
   4. POSIX (Unix) access control is scalable to large numbers of files
   5. Named pipes are just (special) files in Unix
   6. Memory mapping improves sequential file access
   7. Accessing different files on disk can have different speeds
   8. The FAT filesystem enables fast random access
   9. FFS enables fast random access for small files
   10. The minimum storage for a file in FFS is 8kB (4kB inode + block)
   11. Block groups in FFS are used to simplify the implementation
   12. Multiple hard links in FFS are stored in the same inode
   13. NTFS stores files that are contiguous on disk more efficiently than FFS
   14. The volume information in NTFS is a file in NTFS
                                                                                          3
                            spcl.inf.ethz.ch
                                 @spcl_eth
In-memory data structures
                                                                   spcl.inf.ethz.ch
                                                                        @spcl_eth
Opening a file
 Directories translated into kernel data structures on demand:
     open(“foo”);
                                                       directory
                          directory structure             file inode
      User space                 Kernel                Disk
                                                                         spcl.inf.ethz.ch
                                                                              @spcl_eth
Reading and writing
 Per-process open file table → index into…
 System open file table → cache of inodes
                                                            file inode
     read(5,…)
                      Per-process           System
                     open file table     open file table
                                                           File blocks
      User space                       Kernel                 Disk
                                                                       spcl.inf.ethz.ch
                                                                            @spcl_eth
Efficiency and Performance
 Efficiency dependent on:
   disk allocation and directory algorithms
   types of data kept in file’s directory entry
 Performance
   disk cache – separate section of main memory for frequently used blocks
   free-behind and read-ahead – techniques to optimize sequential access
   improve PC performance by dedicating section of memory as virtual disk,
    or RAM disk
                                                               spcl.inf.ethz.ch
                                                                    @spcl_eth
Page Cache
 A page cache caches pages rather than disk blocks using virtual
  memory techniques
 Memory-mapped I/O uses a page cache
 Routine I/O through the file system uses the buffer (disk) cache
 This leads to the following figure
                                                            spcl.inf.ethz.ch
                                                                 @spcl_eth
Two layers of caching?
                                         File access with
          Memory-mapped files            read()/write()
                 Page cache
                          Buffer cache
                           File system
                                                            spcl.inf.ethz.ch
                                                                 @spcl_eth
Unified Buffer Cache
                                         File access with
          Memory-mapped files            read()/write()
                          Buffer cache
                           File system
                                                              spcl.inf.ethz.ch
                                                                   @spcl_eth
Filesystem Recovery
 Consistency checking – compares data in directory structure
  with data blocks on disk, and tries to fix inconsistencies
 Use system programs to back up data from disk to another
  storage device (floppy disk, magnetic tape, other magnetic disk,
  optical)
 Recover lost file or disk by restoring data from backup
                                        spcl.inf.ethz.ch
                                             @spcl_eth
Disks, Partitions and Logical Volumes
                                                                      spcl.inf.ethz.ch
                                                                           @spcl_eth
Partitions
        Partition
                                    File system
         table
                    File system A                     File system C
                                          B
       0               Logical block address (LBA) on a single disk
 Multiplex single disk among >1 file systems
 Contiguous block ranges per FS
                                                                  spcl.inf.ethz.ch
                                                                       @spcl_eth
Logical volumes
              Disk 1             Disk 2               Disk 3
           File system A     File system A        File system A
               (part 1)          (part 2)             (part 3)
                   Single logical volume with file system A
 Emulate 1 virtual disk from >1 physical ones
 Single file system spanning >1 disk
                                                 spcl.inf.ethz.ch
                                                      @spcl_eth
Multiple file systems
 How to name files in multiple file systems?
 Top-level volume names:
   Windows A:, B:, C:, D:, etc. (problematic)
   \\fs-systems.ethz.ch\
 Bind “mount points” in name space
   Unix, etc. (flexible)
               spcl.inf.ethz.ch
                    @spcl_eth
Mount points
                                                                          spcl.inf.ethz.ch
                                                                               @spcl_eth
File hierarchy with mounts
        home           etc             dev         var            usr
 htor          netos         shm             run         lock            bin
                                                           Mount point
                                                     Normal directory
                                                                 spcl.inf.ethz.ch
                                                                      @spcl_eth
Virtual File Systems
 Virtual File Systems (VFS) provide an object-oriented way of
  implementing file systems.
 VFS allows the same system call interface (the API) to be used
  for different types of file systems.
 The API is to the VFS interface, rather than any specific type of
  file system.
                                                                       spcl.inf.ethz.ch
                                                                            @spcl_eth
   Virtual File System
                                File system interface
                                   VFS interface
                                     EXT4 file          NFS network
              FAT file system
                                      system             file system
Advanced: check out Linux’ FUSE (Filesystem in Userspace)
                                     spcl.inf.ethz.ch
                                          @spcl_eth
Rest of today: I/O
1.   Recap: what devices look like
2.   Device drivers
3.   The I/O subsystem
                                spcl.inf.ethz.ch
                                     @spcl_eth
Recap from CASP:
What does a device look like?
                                                spcl.inf.ethz.ch
                                                     @spcl_eth
Recap: What is a device?
Specifically, to an OS programmer:
 Piece of hardware visible from software
 Occupies some location on a bus
 Set of registers
   Memory mapped or I/O space
 Source of interrupts
 May initiate Direct Memory Access transfers
                               spcl.inf.ethz.ch
                                    @spcl_eth
Recap: Registers
 Details of registers given
  in chip “datasheets” or
  “data books”
 Information is rarely
  trusted by OS
  programmers 
        From the data
         sheet for the
       PC16550 UART
        (standard PC
          serial port)
                                 spcl.inf.ethz.ch
                                      @spcl_eth
Registers
 Slightly more readable
  version:
   From Barrelfish, in a
    language called “Mackerel”
   Compiler generates code to
    do the “bit-banging”
                                   spcl.inf.ethz.ch
                                        @spcl_eth
Using registers
 From the Barrelfish console
  driver
   Very simple!
 Note the issues:
   Polling loop on send
   Polling loop on receive
     Only a good idea for debug
    CPU must write all the data
     not much in this case
                                                                        spcl.inf.ethz.ch
                                                                             @spcl_eth
Very simple UART driver
 Actually, far too simple!
    But this is how the first version always looks…
 No initialization code, no error handling.
 Uses Programmed I/O (PIO)
    CPU explicitly reads and writes all values to and from registers
    All data must pass through CPU registers
 Uses polling
    CPU polls device register waiting before send/receive
     Tight loop!
    Can’t do anything else in the meantime
     Although could be extended with threads and care…
    Without CPU polling, no I/O can occur
                                                              spcl.inf.ethz.ch
                                                                   @spcl_eth
Recap: Interrupts
 CPU Interrupt-request line triggered by I/O device
 Interrupt handler receives interrupts
 Maskable to ignore or delay some interrupts
 Interrupt vector to dispatch interrupt to correct handler
    Based on priority
    Some nonmaskable
 Interrupt mechanism also used for exceptions
                                                                spcl.inf.ethz.ch
                                                                     @spcl_eth
Interrupt-driven I/O cycle
                      CPU                  Device
               Process A performs
              blocking I/O operation
               Driver initiates I/O
              operation with device
                                        Device starts I/O
             Scheduler blocks process           …
               A; switches to other
                    processes
                                         I/O completes (or
                        …               error occurs); device
                                           raises interrupt
                 Interrupt handler
                  processes data
             CPU resumes interrupted
                    process
                        …
             Process A unblocks and
                operation returns
                                                             spcl.inf.ethz.ch
                                                                  @spcl_eth
Recap: Direct Memory Access
 Avoid programmed I/O for lots of data
   E.g. fast network or disk interfaces
 Requires DMA controller
   Generally built-in these days
 Bypasses CPU to transfer data directly between I/O device and
  memory
   Doesn’t take up CPU time
   Can save memory bandwidth
   Only one interrupt per transfer
                                                                         spcl.inf.ethz.ch
                                                                              @spcl_eth
I/O protection
I/O operations can be dangerous to normal system operation!
 Dedicated I/O instructions usually privileged
 I/O performed via system calls
   Register locations must be protected
 DMA transfers must be carefully checked
   Bypass memory protection!
   How can that happen today?
    Multiple operating systems on the same machine (e.g., virtualized)
   IOMMUs are beginning to appear…
                                                               spcl.inf.ethz.ch
                                                                    @spcl_eth
IOMMUs
IOMMU does the same for the I/O devices as MMU does for the CPU!
➔ Translates device adresses (so called DVAs) into physical ones
➔ Uses so called IOTLB (I/O TLB)
➔ Works for DMA-capable
  devices :-)
➔   Examples:
    ➔ Intel VT-d
    ➔ AMD IOMMU
    ➔   ...very similar in functionality
                                                    Source: Wikipedia
                                                                                spcl.inf.ethz.ch
                                                                                     @spcl_eth
IOMMUs
➔   Security features for VMs
    ➔ Possibility to assign different devices to different address domains
    ➔ By address remapping we can isolate the domains from one another,
      thus 'sandboxing' untrusted devices
                                                      Source: Intel VT-d specification
                                                                            spcl.inf.ethz.ch
                                                                                 @spcl_eth
IOMMUs
➔   IOMMUs were designed for enhancing virtualization
    ➔ Remapping & security features can be applied to guest virtual
      machines
    ➔ Better performance than software-based I/O virtualization
                                                             Source: Intel VT-d specification
                                                                                       spcl.inf.ethz.ch
                                                                                            @spcl_eth
IOMMUs
➔   IOMMUs take as the 'input request' the ID consisting of:
    ➔   Bus ID, stored in root tables (support for multiple buses),
    ➔   Device ID, stored in context tables (support for multiple devices within each bus)
    ➔   Function ID, also stored in context tables (support for multiple func. within each
        device)
➔   Different page
    table per I/O device
           Source: Intel VT-d specification
                                                                             spcl.inf.ethz.ch
                                                                                  @spcl_eth
IOMMUs - Address remapping
➔   IOMMUs support page remapping
      ➔ Some PCI devices use 32 bit addressing
➔   IOMMU Page Tables                     bounce
                                          buffers             IOMMU
    ➔ Similar to 'standard' multi-level
      page tables
    ➔ Write-only / read-only bits
    ➔ Support for huge pages
    ➔ Currently no support for
      more extended features
      (e.g., reference bits)
                                                    Source: http://codingrelic.geekhold.com/
                                                                      spcl.inf.ethz.ch
                                                                           @spcl_eth
IOMMUs
 ➔   IOMMUs are much broader topic
 ➔   They provide also:
     ➔ Interrupt remapping (you can control interrupts in a similar
       way as memory accesses)
     ➔ Device I/O TLBs (Intel VT-d)
     ➔ Fault logging
     ➔ …
 ➔   You can think of many interesting use cases for them :-)
       ➔ Interested? New ideas?
                 spcl.inf.ethz.ch
                      @spcl_eth
Device drivers
                                                             spcl.inf.ethz.ch
                                                                  @spcl_eth
Device drivers
 Software object (module, object, process, hunk of code) which
  abstracts a device
   Sits between hardware and rest of OS
   Understands device registers, DMA, interrupts
   Presents uniform interface to rest of OS
 Device abstractions (“driver models”) vary…
   Unix starts with “block” and “character” devices
                                                           spcl.inf.ethz.ch
                                                                @spcl_eth
Device driver structure: the basic problem
 Hardware is interrupt driven.
   System must respond to unpredictable I/O events
    (or events it is expecting, but doesn’t know when)
 Applications are (often) blocking
   Process is waiting for a specific I/O event to occur
 Often considerable processing in between
   TCP/IP processing, retries, etc.
   File system processing, blocks, locking, etc.
                                                                spcl.inf.ethz.ch
                                                                     @spcl_eth
Example: network receive
                 Recv()
  User process
  Kernel
                          Block                       Unblock
                                       Demux
                                   TCP processing
                                   Retransmissions
                                      Timeouts
                                    Port allocation
                                         Etc.
    Packet arrives;
    Interrupt                     Interrupt handler
                                                                                    spcl.inf.ethz.ch
                                                                                         @spcl_eth
Example: network receive
                 Recv()
  User process
  Kernel
                          Block                       Unblock
                                                                • Can’t take too long
                                       Demux                         • Interrupts disabled?
                                   TCP processing               • Can’t change much
                                   Retransmissions                   • Interrupt context
                                      Timeouts                       • Arbitrary system state
                                    Port allocation                  • Can’t hold locks
                                         Etc.
    Packet arrives;
    Interrupt                     Interrupt handler
                                                                                        spcl.inf.ethz.ch
                                                                                             @spcl_eth
Example: network receive
                 Recv()
  User process
  Kernel
                          Block                       Unblock
                                                                • Process is blocked
                                       Demux                    • Don’t even know it’s this
                                   TCP processing                 process until demux
                                   Retransmissions
                                      Timeouts
                                    Port allocation
                                         Etc.
    Packet arrives;
    Interrupt                     Interrupt handler
                                                             spcl.inf.ethz.ch
                                                                  @spcl_eth
Solution 1: driver threads
                 Recv()
  User process
  Kernel
                          Block                    Unblock
                                   Driver thread
    Packet arrives;
    Interrupt                     Interrupt handler
                                                                        spcl.inf.ethz.ch
                                                                             @spcl_eth
Solution 1: driver threads
                       Recv()
     User process
     Kernel
                                     Block                    Unblock
1. Interrupt handler
       i. Masks interrupt
       ii. Does minimal processing            Driver thread
       iii. Unblocks driver thread
        Packet arrives;
        Interrupt                            Interrupt handler
                                                                                  spcl.inf.ethz.ch
                                                                                       @spcl_eth
Solution 1: driver threads
                 Recv()
  User process
  Kernel
                          Block                    Unblock
                                                        2. Thread
                                                              i. Performs all necessary
                                                                  packet processing
                                                              ii. Unblocks user processes
                                   Driver thread              iii.Unmasks interrupt
    Packet arrives;
    Interrupt                     Interrupt handler
                                                                 spcl.inf.ethz.ch
                                                                      @spcl_eth
Solution 1: driver threads
                 Recv()
  User process
  Kernel
                              Block                    Unblock
            3. User process
                  i. Per-process handling
                  ii. Copies packet to user space
                  iii.Returns from kernel
                                       Driver thread
    Packet arrives;
    Interrupt                         Interrupt handler
                                                       spcl.inf.ethz.ch
                                                            @spcl_eth
Terminology – very confused!
 1st-level Interrupt Handler (FLIH)
    Linux calls this the “top half”.
    In contrast to every other OS on the planet.
 Thread is an “interrupt handler thread” in Solaris
    Other names in other systems… 
                                                                       spcl.inf.ethz.ch
                                                                            @spcl_eth
 Solution 2: deferred procedure calls (DPCs)
           Recv()
User process
Kernel
                     Block                         Unblock
                                                             Run all
                                                             pending
                                    Enqueue                   DPCs
                                      DPC
                                    (closure)
   Packet arrives;
   Interrupt                 FLIH               FLIH
                                                                 spcl.inf.ethz.ch
                                                                      @spcl_eth
Deferred Procedure Calls
 Instead of using a thread, execute on the next process to be
  dispatched
    Before it leaves the kernel
 Solution in most versions of Unix
    Don’t need kernel threads
    Saves a context switch
    Can’t account processing time to the right process
  3rd solution: demux early, run in user space
    Covered in Advanced OS Course!
                                                                 spcl.inf.ethz.ch
                                                                      @spcl_eth
More confusing terminology
 DPCs: also known as:
     2nd-level interrupt handlers
     Soft interrupt handlers
     Slow interrupt handlers
     In Linux ONLY: bottom-half handlers
 Any non-Linux OS (the way to think about it):
   Bottom-half = FLIH + SLIH, called from “below”
   Top-half = Called from user space (syscalls etc.), “above”
                                                                                       spcl.inf.ethz.ch
                                                                                            @spcl_eth
Life cycle of an I/O request
   •   Request I/O                 User process       •   I/O complete
                  System call                                         Return from system call
                                                      •   Transfer data to/from user
            Can satisfy
                                                          space,
             request?      Yes                        •   Return completion code
                   No              I/O subsystem
   •   Send request to driver
   •   Block process if needed
   •   Issue commands to                              •   Demultiplex I/O complete
       device                      Device driver      •   Determine source of
   •   Block until interrupted                            request
                                                      •   Handle interrupt
                                  Interrupt handler   •   Signal device driver
                                                                      Interrupt
   •   Issue interrupt when I/O   Physical device     •   I/O complete
       completed                                      •   Generate Interrupt
                                         Time
                    spcl.inf.ethz.ch
                         @spcl_eth
The I/O subsystem
                                                                 spcl.inf.ethz.ch
                                                                      @spcl_eth
Generic I/O functionality
 Device drivers essentially move data to and from I/O devices
   Abstract hardware
   Manage asynchrony
 OS I/O subsystem includes generic functions for dealing with
  this data
   Such as…
                                                     spcl.inf.ethz.ch
                                                          @spcl_eth
The I/O subsystem
 Caching - fast memory holding copy of data
   Always just a copy
   Key to performance
 Spooling - hold output for a device
   If device can serve only one request at a time
   E.g., printing
                                                                spcl.inf.ethz.ch
                                                                     @spcl_eth
The I/O subsystem
 Scheduling
   Some I/O request ordering via per-device queue
   Some OSs try fairness
 Buffering - store data in memory while transferring between
  devices or memory
   To cope with device speed mismatch
   To cope with device transfer size mismatch
   To maintain “copy semantics”
                                                         spcl.inf.ethz.ch
                                                              @spcl_eth
Naming and discovery
 What are the devices the OS needs to manage?
   Discovery (bus enumeration)
   Hotplug / unplug events
   Resource allocation (e.g. PCI BAR programming)
 How to match driver code to devices?
   Driver instance ≠ driver module
   One driver typically manages many models of device
 How to name devices inside the kernel?
 How to name devices outside the kernel?
                                                  spcl.inf.ethz.ch
                                                       @spcl_eth
Matching drivers to devices
 Devices have unique (model) identifiers
    E.g. PCI vendor/device identifiers
 Drivers recognize particular identifiers
    Typically a list…
 Kernel offers a device to each driver in turn
    Driver can “claim” a device it can handle
    Creates driver instance for it.
                                                            spcl.inf.ethz.ch
                                                                 @spcl_eth
Naming devices in the Unix kernel
               (Actually, naming device driver instances)
 Kernel creates identifiers for
    Block devices
    Character devices
    [ Network devices – see later… ]
 Major device number:
    Class of device (e.g. disk, CD-ROM, keyboard)
 Minor device number:
    Specific device within a class
                                                    spcl.inf.ethz.ch
                                                         @spcl_eth
Unix block devices
 Used for “structured I/O”
    Deal in large “blocks” of data at a time
 Often look like files (seekable, mappable)
    Often use Unix’ shared buffer cache
 Mountable:
    File systems implemented above block devices
                                                       spcl.inf.ethz.ch
                                                            @spcl_eth
Character devices
 Used for “unstructured I/O”
   Byte-stream interface – no block boundaries
   Single character or short strings get/put
   Buffering implemented by libraries
 Examples:
   Keyboards, serial lines, mice
 Distinction with block devices somewhat arbitrary…
                                                  spcl.inf.ethz.ch
                                                       @spcl_eth
Naming devices outside the kernel
 Device files: special type of file
    Inode encodes <type, major num, minor num>
    Created with mknod() system call
 Devices are traditionally put in /dev
      /dev/sda – First SCSI/SATA/SAS disk
      /dev/sda5 – Fifth partition on the above
      /dev/cdrom0 – First DVD-ROM drive
      /dev/ttyS1 – Second UART
                                                     spcl.inf.ethz.ch
                                                          @spcl_eth
Pseudo-devices in Unix
 Devices with no hardware!
 Still have major/minor device numbers. Examples:
       /dev/stdin
       /dev/kmem
       /dev/random
       /dev/null
       /dev/loop0
etc.
                                                           spcl.inf.ethz.ch
                                                                @spcl_eth
Old-style Unix device configuration
 All drivers compiled into the kernel
 Each driver probes for any supported devices
 System administrator populates /dev
   Manually types mknod when a new device is purchased!
 Pseudo devices similarly hard-wired in kernel
                                                       spcl.inf.ethz.ch
                                                            @spcl_eth
Linux device configuration today
 Physical hardware configuration readable from /sys
   Special fake file system: sysfs
   Plug events delivered by a special socket
 Drivers dynamically loaded as kernel modules
   Initial list given at boot time
   User-space daemon can load more if required
 /dev populated dynamically by udev
   User-space daemon which polls /sys
                                             spcl.inf.ethz.ch
                                                  @spcl_eth
Next time:
   Network stack implementation
   Network devices and network I/O
   Buffering
   Memory management in the I/O subsystem