Vijayan Thesis
by
Vijayan Prabhakaran
                          Doctor of Philosophy
                                   in
                           Computer Sciences
                    University of Wisconsin-Madison
                                  2006
Committee in charge:
  Andrea C. Arpaci-Dusseau (Co-chair)
  Remzi H. Arpaci-Dusseau (Co-chair)
  David J. DeWitt
  Mary K. Vernon
  Mikko H. Lipasti
Abstract
     Disk drives are widely used as a primary medium for storing information.
While commodity file systems trust disks to either work or fail completely, modern
disks exhibit complex failure modes such as latent sector faults and block corrup-
tions, where only portions of a disk fail.
     In this thesis, we focus on understanding the failure policies of file systems and
improving their robustness to disk failures. We suggest a new fail-partial failure
model for disks, which incorporates realistic localized faults such as latent sector
faults and block corruption. We then develop and apply a novel semantic failure
analysis technique, which uses file system block type knowledge and transactional
semantics, to inject interesting faults and investigate how commodity file systems
react to a range of more realistic disk failures.
     We apply our technique to five important journaling file systems: Linux ext3,
ReiserFS, JFS, XFS, and Windows NTFS. We classify their failure policies in a new
taxonomy that measures their Internal RObustNess (IRON), which includes both
failure detection and recovery techniques. Our analysis results show that commod-
ity file systems store little or no redundant information, and contain failure policies
that are often inconsistent, sometimes buggy, and generally inadequate in their abil-
ity to recover from partial disk failures.
     We remedy the reliability shortcomings in commodity file systems by address-
ing two issues. First, we design new low-level redundancy techniques that a file
system can use to handle disk faults. We begin by qualitatively and quantitatively
evaluating various forms of redundant information such as checksums, parity, and replicas.
Then, in order to account for spatially correlated faults, we propose a new prob-
abilistic model that can be used to construct redundancy sets. Finally, we describe
two update strategies, an overwrite and a no-overwrite approach, that a file system can
use to update its data and parity blocks atomically without NVRAM support. Over-
all, we show that low-level redundant information can greatly enhance file system
robustness while incurring modest time and space overheads.
     Second, to remedy the problem of failure handling diffusion, we develop a mod-
ified ext3 that unifies all failure handling in a Centralized Failure Handler (CFH).
We then showcase the power of centralized failure handling in ext3c, a modified
IRON version of ext3 that uses the CFH, by demonstrating its support for flexible, con-
sistent, and fine-grained policies. By carefully separating policy from mechanism,
ext3c demonstrates how a file system can provide a thorough, comprehensive, and
easily understandable failure-handling policy.
Acknowledgements
     I’m also indebted to all those who gave me useful feedback on the dissertation
topic during my interview process. My discussions with Chandhu Thekkath at Mi-
crosoft Research, Yasushi Saito at Google, Erik Riedel at Seagate, Ram Rajmony
at IBM-Austin, the IBM-Almaden interview panel, Fred Douglis at IBM-Watson,
and Mary Baker at HP Labs were fruitful.
     I’m fortunate to work with wonderful colleagues at UW: Nitin Agrawal, Lak-
shmi Bairavasundaram, Nathan Burnett, Timothy Denehy, Haryadi Gunawi, Todd
Jones, Florentina Popovici, and Muthian Sivathanu. Our late-night work during
conference deadlines was a lot of fun (with free pizzas, thanks to our advisors).
Here, I’d like to thank Muthian Sivathanu especially. He was always eager to dis-
cuss my problems (both technical and personal) and offer his help. There were
numerous occasions when I was surprised at his complex yet clear analytical
reasoning. I’d also like to thank Himani Apte and Meenali Rungta for letting me
participate in their course project. While trying to answer their in-depth questions,
I learnt a lot about the internals of ext3.
     I’d like to thank my friends at Madison. I don’t have any family members
in the United States, but when I was with my friends, I never felt bad about being
away from home. To name a few: Arini Balakrishnan, Gogul Balakrishnan, Vinod
Deivasigamany, Aditi Ganesan, Karthik Jayaraman, Vignesh Kannappan, Indirajith
Krishnamurthy, Ramanathan Palaniappan, Prabu Ravindran, Muthian Sivathanu,
Sanjay Sribalusu, and Joseph Stanley. They were happy even at my small successes
and supported me during difficult times. I’m going to miss all the fun we had:
screening late-night movies, watching Disney channel with Gogul, Prabu’s witty
comments, and evening coffee conversations with Arini (whose concerns about my
eating habits reminded me of my mother).
     I’m grateful to my family back in India. My Grandpa’s regular phone calls (in
spite of his old age) were a great source of encouragement. Whenever I felt low
during these years, I’d call up my sister and her family. A few minutes of laughter
with them helped me relax more than anything. Finally, I’m grateful to my parents.
They had faith in me when I went against their wishes to pursue the doctorate.
I’ve always been an irresponsible son, pursuing his own desires rather than supporting
the family. But they neither complained nor showed me their personal struggles.
Instead, they rejoiced even at my smallest successes, showered me with all their
love, and always made me feel like a little prince. Thanks Ma and Pa, I
hope I’ve made you all proud.
Contents

Abstract
Acknowledgements

1 Introduction
  1.1 Analysis of File System Robustness
  1.2 Building Robust File Systems
  1.3 Contributions
  1.4 Outline

3 Background
  3.1 Journaling File Systems
      3.1.1 Ext3
      3.1.2 ReiserFS
      3.1.3 JFS
      3.1.4 XFS
      3.1.5 NTFS
      3.1.6 File System Summary
Chapter 1
Introduction
of such errors become larger at higher areal densities as they can corrupt more
bits [4]. In addition, increased density can also increase the complexity of the
logic, that is, the firmware that manages the data [7], which can result in an increased
number of bugs. For example, buggy firmware is known to issue misdirected
writes [126], where correct data is placed on disk but in the wrong location.
    Second, the increased use of low-end desktop drives, such as IDE/ATA drives,
worsens the reliability problem. Low cost dominates the design of personal storage
drives [7]; therefore, they are tested less and have less machinery to handle
disk errors [56]. It is important to consider ATA drives here because they
are used not only in laptops and desktops but also in high-end storage arrays
such as EMC Centera [36, 50] and large-scale internet clusters like Google [40].
Because personal storage drives are designed for active use only several hours per
day, they are less likely to perform reliably under the operational stresses of enterprise
systems [7].
    Finally, the amount of software in the storage stack has increased. Firmware
on a desktop drive contains about 400 thousand lines of code [33]. Moreover,
the storage stack consists of several layers of low-level device driver code, which
has been shown to contain more bugs than the rest of the operating system
code [38, 113]. As Jim Gray points out in his study of Tandem Availability, “As the
other components of the system become increasingly reliable, software necessarily
becomes the dominant cause of outages” [44].
     Developers of high-end systems have realized the nature of these disk faults
and built mechanisms into their systems to handle them. For example, many redun-
dant storage systems incorporate a background disk scrubbing process [59, 100] to
proactively detect and subsequently correct latent sector faults by creating a new
copy of inaccessible blocks (although auditing the disk for latent sector faults can
itself increase the rate of visible or latent faults [14]); some recent storage arrays in-
corporate extra levels of redundancy to lessen the potential damage of undiscovered
latent faults [28]. Finally, highly-reliable systems (e.g., Tandem NonStop) utilize
end-to-end checksums to detect when block corruption occurs [16].
     These failure characteristics (latent sector faults and block corruption),
trends (e.g., the increased use of cheap drives), and our knowledge of reliable high-
end systems raise the question: how do commodity file systems handle disk failures?
Our file system analysis answers this question, and the answer leads us to a second
question: how can we improve the robustness of commodity file
systems? Our research efforts to answer these two questions form the two broad
parts of this dissertation. We discuss each part in the following
sections.
file system consistency). By using such a model, we can inject faults at various
“interesting” points during a file system transaction, and thus monitor how the
system reacts to such failures. For read failure analysis, we simply fail different
block types. Our semantic failure analysis approach leverages gray-box knowledge
[9, 108] of file system data structures to meticulously exercise file system access
paths to disk across a wide range of about 30 different workloads.
     To better characterize failure policy, we develop an Internal RObustNess (IRON)
taxonomy, which catalogs a broad range of detection and recovery techniques that a
file system can use to handle disk failures. For example, a file system can use error
codes from lower layers to detect if its read request has completed successfully and
use redundant information such as a replica to recover from the failure. The output
of our fingerprinting tool is a broad categorization of which IRON techniques a file
system uses across its constituent data structures.
    Our study focuses on four important and substantially different open-source file
systems, ext3 [121], ReiserFS [89], IBM’s JFS [19], and XFS [112], and one closed-
source file system, Windows NTFS [109]. From our analysis results, we find that
the technology used by high-end systems (e.g., checksumming, disk scrubbing, and
so on) has not filtered down to the realm of commodity file systems. Across all plat-
forms, we find ad hoc failure handling and a great deal of illogical inconsistency in
failure policy, often due to the diffusion of failure handling code through the kernel;
such inconsistency leads to substantially different detection and recovery strate-
gies under similar fault scenarios, resulting in unpredictable and often undesirable
fault-handling strategies. Moreover, failure handling diffusion makes it difficult to
examine any one or few portions of the code and determine how failure handling
is supposed to behave. Diffusion also implies that failure handling is inflexible;
policies that are spread across so many locations within the code base are hard to
change. In addition, we observe that failure handling is quite coarse-grained; it is
challenging to implement nuanced policies in the current system.
     We also discover that most systems implement portions of their failure policy
incorrectly; the presence of bugs in the implementations demonstrates the difficulty
and complexity of correctly handling certain classes of disk failure. We observe
little tolerance to transient failures; most file systems assume a single temporarily-
inaccessible block indicates a fatal whole-disk failure. We show that none of the file
systems can recover from partial disk failures, due to a lack of in-disk redundancy.
     Finally, in addition to fingerprinting the failure policy, our analysis also helps us
find several bugs within file systems that can catastrophically affect on-disk data.
For example, under certain write failures, all the analyzed file systems commit
failed transactions to disk; doing so can lead to serious problems, including an
In this context, we design, implement, and evaluate new techniques to store redun-
dant information on disk, focusing primarily on parity as the redundant information
due to its simplicity and widespread use.
    First, we address the problem of spatially local errors by proposing a probabilis-
tic disk failure model that characterizes disk block errors with spatial correlation,
and we use this model to lay out blocks such that the probability of two or more errors
affecting related blocks is low. Second, we address the lack of NVRAM support
in low-end systems by developing new parity update techniques. Specifically, we
design two block update techniques: one updates the data blocks
in the traditional fashion by overwriting the old contents, while the other, more radical
approach, called no-overwrite, issues writes to different locations on the disk and
changes the block pointers to point to the new locations.
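    To make the two strategies concrete, the following user-space C sketch contrasts
them under simplifying assumptions: a toy in-memory disk stands in for real block
I/O, and parity is assumed to protect the logical contents of a block wherever it is
placed. It is a sketch of the idea only; the crash-ordering details, and how the
update is made atomic without NVRAM, are exactly what Chapter 5 addresses.

    #include <stdint.h>
    #include <string.h>

    #define BLK    4096
    #define NBLKS  1024

    /* Toy in-memory disk standing in for real block I/O (an assumption of this sketch). */
    static uint8_t disk[NBLKS][BLK];

    static int read_block(uint64_t b, uint8_t buf[BLK])
    {
        if (b >= NBLKS) return -1;
        memcpy(buf, disk[b], BLK);
        return 0;
    }

    static int write_block(uint64_t b, const uint8_t buf[BLK])
    {
        if (b >= NBLKS) return -1;
        memcpy(disk[b], buf, BLK);
        return 0;
    }

    static void xor_into(uint8_t dst[BLK], const uint8_t src[BLK])
    {
        for (size_t i = 0; i < BLK; i++)
            dst[i] ^= src[i];
    }

    /*
     * Overwrite approach: patch the parity subtractively (P' = P ^ D_old ^ D_new)
     * and overwrite the data block in place.  Without NVRAM, a crash between the
     * two writes leaves parity and data disagreeing; closing that window is the
     * problem the overwrite protocol must solve.
     */
    int overwrite_update(uint64_t data_blk, uint64_t parity_blk,
                         const uint8_t new_data[BLK])
    {
        uint8_t old_data[BLK], parity[BLK];

        if (read_block(data_blk, old_data) || read_block(parity_blk, parity))
            return -1;
        xor_into(parity, old_data);      /* remove the old data's contribution */
        xor_into(parity, new_data);      /* add the new data's contribution    */
        if (write_block(parity_blk, parity))
            return -1;
        return write_block(data_blk, new_data);
    }

    /*
     * No-overwrite approach: write the new data to a free block, update parity,
     * and only then switch the block pointer.  The old data block stays intact
     * until the pointer switch, so data and parity never disagree in place; the
     * cost is that the file's blocks migrate over time.
     */
    int no_overwrite_update(uint64_t *data_blk_ptr, uint64_t free_blk,
                            uint64_t parity_blk, const uint8_t new_data[BLK])
    {
        uint8_t old_data[BLK], parity[BLK];

        if (read_block(*data_blk_ptr, old_data) || read_block(parity_blk, parity))
            return -1;
        xor_into(parity, old_data);
        xor_into(parity, new_data);
        if (write_block(free_blk, new_data) || write_block(parity_blk, parity))
            return -1;
        *data_blk_ptr = free_blk;        /* the pointer switch publishes the update */
        return 0;
    }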
    We evaluate the validity of the probabilistic model by comparing it with results
from fault injection experiments; the probability of multiple failures predicted
by the failure model closely matches the results of those experiments.
We also evaluate the performance overheads of the parity update techniques
and show that both the overwrite and no-overwrite techniques incur a performance
overhead compared to ext3; the cost varies from less than 1 percent for read-
only workloads to about 50 percent for highly synchronous, write-intensive work-
loads. The no-overwrite technique can also fragment a file, so running a
defragmenter periodically can improve performance.
    In the last part of this dissertation, we seek to improve the state of the art of
failure handling in modern file systems by solving a problem we observed
in our analysis: failure handling in commodity file systems is diffused,
resulting in ad hoc, inconsistent, and coarse-grained failure policies. Our approach
here begins with the age-old wisdom of separating policy and mechanism [66],
which we achieve with the design and implementation of a Centralized Failure
Handler (CFH). All important failure-handling decisions are routed through the
CFH, which is thus responsible for enforcing the specific failure-handling policy of
the system. All relevant information is also made available within the CFH, hence
enabling the construction of the desired failure-handling policies.
    We demonstrate the power of centralized failure handling in a modified version
of the Linux ext3 file system (ext3c). We first show that ext3c is flexible by imple-
menting a broad range of vastly different policies. For example, we show how ext3c
can mimic the failure-handling behavior of different file systems such as ReiserFS
and NTFS, all with changes to only a few lines of code. We also demonstrate that
ext3c is inherently consistent; because policies are specified within a single lo-
calized portion of the code, the desired failure handling is enacted. Because of the
presence of block-type information within the CFH, ext3c also enables fine-grained
policies to be developed. For example, we show that ext3c can implement a highly
paranoid policy for its own metadata while passing user data errors onto the appli-
cation. The end result is a file system that handles disk failures in a comprehensive,
thorough, and easily understood fashion; as a result, ext3c reacts in the expected
manner even when an unexpected disk failure occurs.
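    To illustrate the policy/mechanism separation, the hypothetical sketch below
(not the actual ext3c code) drives a centralized handler from a small table indexed
by block type and fault type; specifying a different failure policy then amounts to
editing table entries in one place rather than changing scattered call sites.

    #include <stdio.h>

    /* Hypothetical block and fault classifications used by this sketch. */
    enum block_type { BT_SUPER, BT_JOURNAL, BT_INODE, BT_DIR, BT_DATA, BT_NTYPES };
    enum fault_type { FT_READ_ERR, FT_WRITE_ERR, FT_CORRUPTION, FT_NTYPES };
    enum action     { ACT_PROPAGATE, ACT_RETRY, ACT_REPLICA, ACT_REMOUNT_RO, ACT_STOP };

    static const char *action_name[] = {
        "propagate to caller", "retry", "use replica", "remount read-only", "stop"
    };

    /* One failure policy: paranoid about metadata, lenient about user data. */
    static const enum action policy[BT_NTYPES][FT_NTYPES] = {
        [BT_SUPER]   = { ACT_REPLICA,   ACT_STOP,       ACT_REPLICA    },
        [BT_JOURNAL] = { ACT_RETRY,     ACT_REMOUNT_RO, ACT_REMOUNT_RO },
        [BT_INODE]   = { ACT_RETRY,     ACT_REMOUNT_RO, ACT_REPLICA    },
        [BT_DIR]     = { ACT_RETRY,     ACT_REMOUNT_RO, ACT_REPLICA    },
        [BT_DATA]    = { ACT_PROPAGATE, ACT_PROPAGATE,  ACT_PROPAGATE  },
    };

    /* Every failure-handling decision in the file system funnels through here. */
    enum action cfh_decide(enum block_type b, enum fault_type f)
    {
        return policy[b][f];
    }

    int main(void)
    {
        printf("read error, inode block -> %s\n",
               action_name[cfh_decide(BT_INODE, FT_READ_ERR)]);
        printf("read error, data block  -> %s\n",
               action_name[cfh_decide(BT_DATA, FT_READ_ERR)]);
        return 0;
    }

In such a design, mimicking another file system's failure behavior, as described
above, corresponds to little more than swapping in a different table.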
1.3 Contributions
The contributions of this dissertation are as follows:
1.4 Outline
The rest of this dissertation is organized as follows. In Chapter 2, we present the
common causes of disk failures, followed by a discussion of the fail-partial model
and the IRON taxonomy. In Chapter 3, we give a brief introduction to journaling file
systems and the different commodity file systems we analyze. Chapter 4 explains our
failure policy analysis, where we discuss our semantic failure analysis methodol-
ogy and results. In Chapter 5, we present the details of building a robust file system.
We start with a broad exploration of different redundant information, followed by
a discussion of the system evaluation. Then, we explain our techniques to address
spatially correlated errors and the lack of hardware support. Chapter 6 presents our
centralized failure handler for file systems. Related work is discussed in Chapter 7,
and conclusions and future directions are presented in Chapter 8.
Chapter 2
In order to understand how commodity file systems handle disk failures, we must
first understand the storage system architecture and the ways it can fail. Second,
we need a standard taxonomy of the various detection and recovery techniques that a
file system can use to handle disk failures, in order to classify their failure-handling
policies.
     In this chapter, we first explain the components of a storage stack, which is a
complex layered collection of electrical, mechanical, firmware and software com-
ponents (Section 2.2). There are many reasons that the file system may see faults
from the storage system below, and we discuss the common causes of disk failures (Sec-
tion 2.3). We then present a new, more realistic fail-partial model for disks and
discuss various aspects of this model (Section 2.4). We follow this with a discus-
sion about the techniques that an IRON file system can use to detect and recover
from disk errors (Section 2.5). Finally, we conclude the chapter by explaining why
failure handling must be performed at the file system level (Section 2.5.4) and why RAID
is not a complete solution for improving storage reliability (Section 2.5.5).
2.1 Terminology
Before we proceed to the details, we clarify certain terminology in order
to make the following discussion unambiguous and easy to follow. We follow
the IEEE Standard Software Engineering Terminology [22] and use the following
definitions for error, fault, and failure.
   • Error: 1. The difference between a computed, observed, or measured value
     or condition and the true, specified, or theoretically correct value or condi-
     tion. 2. An incorrect step, process, or data definition. 3. An incorrect result.
[Figure 2.1 schematic: the host (generic block I/O, device driver, device controller)
sits atop the storage subsystem, which comprises the transport and the disk
(firmware, cache, electrical and mechanical components, and the media).]
Figure 2.1: The Storage Stack. We present a schematic of the entire storage stack. At the
top is the file system; beneath are the many layers of the storage subsystem. Gray shading
implies software or firmware, whereas white (unshaded) is hardware.
     As we mentioned in Section 1.1, in this thesis we focus on, and limit our scope
to, local file systems. For the rest of the thesis, unless otherwise stated, we use
the term “file system” to refer to a specific file system on top of the storage stack
(i.e., the end-point of the storage stack).
    The storage stack, shown in Figure 2.1, consists of three main layers: the host,
the transport layer, and the disk. An error
can occur in any of these layers and propagate itself to the file system above.
    At the bottom of the “storage stack” is the disk itself; beyond the magnetic stor-
age media, there are mechanical components (e.g., the motor and arm assembly), electrical
components (e.g., buses), and a cache. A particularly important component is the firmware:
the code embedded within the drive that controls most of its higher-level functions,
including caching, disk scheduling, and error handling. This firmware code is of-
ten substantial and complex (e.g., a modern Seagate drive contains roughly 400,000
lines of code [33]).
    Connecting the drive to the host is the transport. In low-end systems, the trans-
port medium is often a bus (e.g., SCSI), whereas networks are common in higher-
end systems (e.g., FibreChannel). If a hardware RAID is used instead of a single
disk, an additional layer of RAID controller is added between the transport layer
and disk drives. A hardware RAID controller manages multiple disk drives and
exports a standard interface (e.g., SCSI) to the transport layer.
    At the top of the stack is the host. Herein there is a hardware controller that
communicates with the device, and above it a software device driver that controls
the hardware. Block-level software forms the next layer, providing a generic device
interface and implementing various optimizations (e.g., request reordering). Mod-
ern operating systems such as Windows and Linux provide support to stack multiple
layers of software components such as volume managers or software RAID layers
beneath the file system and above the device specific drivers, resulting in a more
complex storage stack.
    Above all other software is the file system. This layer is often split into two
pieces: a high-level component common to all file systems, and a specific com-
ponent that maps generic operations onto the data structures of the particular file
system. A standard interface (e.g., Vnode/VFS [63]) is positioned between the two.
drive head contacts the surface momentarily. A media scratch can also occur when
a particle is trapped between the drive head and the media [100]. Thermal asperity
is a spike in the read signal caused by disk asperities or contaminant particles,
which can cause the disk heads to lose their reading capabilities temporarily [124].
Such dangers are well-known to drive manufacturers, and hence modern disks park
the drive head when the drive is not in use to reduce the number of head crashes;
SCSI disks sometimes include filters to remove particles [7]. Media errors most
often lead to permanent failure or corruption of individual disk blocks.
Mechanical: “Wear and tear” eventually leads to failure of moving parts. A drive
motor can spin irregularly or fail completely. Erratic arm movements can cause
head crashes and media flaws; inaccurate arm movement can misposition the drive
head during writes, leaving blocks inaccessible or corrupted upon subsequent reads.
Electrical: A power spike or surge can damage in-drive circuits and hence lead to
drive failure [116]. Thus, electrical problems can lead to entire disk failure.
Drive firmware: Interesting errors arise in the drive controller, which consists of
many thousands of lines of real-time, concurrent firmware. For example, disks
have been known to return correct data but circularly shifted by a byte [67] or have
memory leaks that lead to intermittent failures [116]. Other firmware problems
can lead to poor drive performance [97]. Drive manufacturers often introduce new
firmware versions to fix existing problems on production drives and such fixes can
inadvertently increase the failure rates of the drive [103]. Some firmware bugs
are well-enough known in the field that they have specific names; for example,
misdirected writes are writes that place the correct data on the disk but in the wrong
location, and phantom writes are writes that the drive reports as completed but
that never reach the media [126]. Phantom writes can be caused by a buggy or
even a misconfigured cache (e.g., when write-back caching is enabled). In summary, drive
firmware errors often lead to sticky or transient block corruption but can also lead
to performance problems.
Transport: The transport connecting the drive and host can also be problematic.
For example, a study of a large disk farm [115] reveals that most of the systems
tested had interconnect problems, such as bus timeouts. Parity errors also occurred
with some frequency, either causing requests to succeed (slowly) or fail altogether.
Thus, the transport often causes transient errors for the entire drive.
Bus controller: The main bus controller can also be problematic. For example, the
EIDE controller on a particular series of motherboards incorrectly indicates com-
pletion of a disk request before the data has reached the main memory of the host,
leading to data corruption [125]. A similar problem causes some other controllers
to return status bits as data if the floppy drive is in use at the same time as the hard
                                                                                    13
drive [47]. Others have also observed IDE protocol version problems that yield cor-
rupt data [40]. In summary, controller problems can lead to transient block failure
and data corruption.
Low-level drivers: Recent research has shown that device driver code is more
likely to contain bugs than the rest of the operating system [27, 38, 113]. While
some of these bugs will likely crash the operating system, others can issue disk
requests with bad parameters, data, or both, resulting in data corruption.
2.4.4 Trends
In many other areas (e.g., processor performance), technology and market trends
combine to improve different aspects of computer systems. In contrast, we believe
that technology trends and market forces may combine to make storage system
failures occur more frequently over time, for the following three reasons.
     First, reliability is a greater challenge when drives are made increasingly
dense; as more bits are packed into smaller spaces, drive logic (and hence complex-
ity) increases [7].
     Second, at the low-end of the drive market, cost-per-byte dominates, and hence
many corners are cut to save pennies in IDE/ATA drives [7]. Low-cost “PC class”
drives tend to be tested less and have less internal machinery to prevent failures
from occurring [56]. The result, in the field, is that ATA drives are observably less
reliable [115]; however, cost pressures serve to increase their usage, even in server
environments [40].
     Finally, the amount of software is increasing in storage systems and, as others
have noted, software is often the root cause of errors [44]. In the storage system,
hundreds of thousands of lines of software are present in the lower-level drivers
and firmware. This low-level code is generally the type of code that is difficult to
write and debug [38, 113] – hence a likely source of increased errors in the storage
stack.
Tables 2.1 and 2.2 present our IRON detection and recovery taxonomies, respec-
tively. Note that the taxonomy is by no means complete. Many other techniques
are likely to exist, just as many different RAID variations have been proposed over
the years [3, 127].
    The detection and recovery mechanisms employed by a file system define its
failure policy. Currently, it is difficult to discuss the failure policy of a system. With
the IRON taxonomy, one can describe the failure policy of a file system, much as
one can already describe a cache replacement or a file-layout policy.
consistent. This check can be performed either within a single block or across
blocks.
     When checking a single block, the file system can either verify individual fields
(e.g., that pointers are within valid ranges) or verify the type of the block. For
example, most file system superblocks include a “magic number” and some older
file systems such as Pilot even include a header per data block [88]. By checking
whether a block has the correct type information, a file system can guard against
some forms of block corruption.
     Checking across blocks can involve verifying only a few blocks (e.g., that a
bitmap corresponds to allocated blocks) or can involve periodically scanning all
structures to determine if they are intact and consistent (e.g., similar to fsck [72]).
Even journaling file systems can benefit from periodic full-scan integrity checks.
For example, a buggy journaling file system could unknowingly corrupt its on-disk
structures; running fsck in the background could detect and recover from such
problems.
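     As a concrete (and deliberately simplified) illustration, the sketch below shows
both levels of checking with hypothetical structures: a single-block check that
verifies a superblock’s magic number and field ranges, and a cross-block, fsck-like
check that every block referenced by an in-use inode is marked allocated in the
block bitmap.

    #include <stdint.h>
    #include <stdbool.h>

    #define MAGIC   0xEF53u   /* the ext2/ext3 superblock magic number */
    #define NBLOCKS 1024
    #define NINODES 128

    struct superblock { uint16_t magic; uint32_t block_count; };
    struct inode      { bool in_use; uint32_t nblocks; uint32_t block[12]; };

    /* Single-block check: verify the type and field ranges of a superblock. */
    bool superblock_sane(const struct superblock *sb)
    {
        return sb->magic == MAGIC && sb->block_count <= NBLOCKS;
    }

    /* Cross-block check: every block referenced by an in-use inode must be
       marked allocated in the block bitmap (one bit per block). */
    bool bitmap_consistent(const struct inode inodes[NINODES],
                           const uint8_t bitmap[NBLOCKS / 8])
    {
        for (int i = 0; i < NINODES; i++) {
            if (!inodes[i].in_use)
                continue;
            for (uint32_t j = 0; j < inodes[i].nblocks && j < 12; j++) {
                uint32_t b = inodes[i].block[j];
                if (b >= NBLOCKS || !(bitmap[b / 8] & (1u << (b % 8))))
                    return false;   /* pointer out of range or block not allocated */
            }
        }
        return true;
    }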
 • Redundancy: The final level of the detection taxonomy is redundancy. Many
forms of redundancy can be used to detect block corruption. For example, check-
summing has been used in reliable systems for years to detect corruption [16] and
has recently been applied to improve security as well [79, 110]. Checksums are use-
ful for a number of reasons. First, they assist in detecting classic “bit rot”, where
the bits of the media have been flipped. However, in-media ECC often catches and
corrects such errors. Checksums are therefore particularly well-suited for detecting
corruption in higher levels of the storage system stack (e.g., a buggy controller that
“misdirects” disk updates to the wrong location or does not write a given block to
disk at all). However, checksums must be carefully implemented to detect these
problems [16, 126]; specifically, a checksum that is stored along with the data it
checksums will not detect such misdirected or phantom writes.
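     The following sketch makes the last point concrete. It assumes a simple
Adler-style checksum (not the particular function any of these systems uses) that is
seeded with the block number the write was intended for and is stored in a separate
checksum block; a misdirected write then lands under the wrong block number, and a
phantom write leaves a stale block behind, so in either case the stored checksum no
longer matches what is read back.

    #include <stdint.h>
    #include <stddef.h>

    #define BLK 4096

    /* Simple Adler-style checksum seeded with the block's intended location. */
    static uint32_t loc_checksum(const uint8_t *buf, size_t len, uint64_t blkno)
    {
        uint32_t a = (uint32_t)(blkno % 65521);
        uint32_t b = (uint32_t)((blkno >> 32) % 65521);
        for (size_t i = 0; i < len; i++) {
            a = (a + buf[i]) % 65521;
            b = (b + a) % 65521;
        }
        return (b << 16) | a;
    }

    /*
     * On write: store loc_checksum(data, BLK, blkno) in a checksum block kept
     * apart from the data.  On read: recompute and compare.  A mismatch flags
     * corruption, a misdirected write, or a phantom write of this block.
     */
    int verify_block(uint64_t blkno, const uint8_t data[BLK], uint32_t stored)
    {
        return loc_checksum(data, BLK, blkno) == stored ? 0 : -1;
    }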
     Higher levels of redundancy, such as block mirroring [20], parity [78, 81] and
other error-correction codes [69], can also detect corruption. For example, a file
system could keep three copies of each block, reading and comparing all three to
determine if one has been corrupted. However, such techniques are truly designed
for correction (as discussed below); they often assume the presence of a lower-
overhead detection mechanism [81].
eager methods.
    For example, disk scrubbing is a classic eager technique used by RAID sys-
tems to scan a disk and thereby discover latent sector faults [60]. Disk scrubbing
is particularly valuable if a means for recovery is available, that is, if a replica
exists to repair the now-unavailable block. To detect whether an error occurred,
scrubbing typically leverages the return codes explicitly provided by the disk and
hence discovers block failure but not corruption. If combined with other detection
techniques (such as checksums), scrubbing can discover block corruption as well.
 • Retry: A simple response to failure is to retry the failed operation; recent
work shows that file systems can recover from most disk errors by sim-
ply retrying [45]. Retry can appropriately handle transient errors, but wastes time
retrying if the failure is indeed permanent.
 • Repair: If an IRON file system can detect an inconsistency in its internal data
structures, it can likely repair them, just as fsck would. For example, a block
that is not pointed to, but is marked as allocated in a bitmap, could be freed. As
discussed above, such techniques are useful even in the context of journaling file
systems, as bugs may lead to corruption of file system integrity.
 • Remap: IRON file systems can perform block remapping. This technique can
be used to fix errors that occur when writing a block, but cannot recover failed
reads. Specifically, when a write to a given block fails, the file system could choose
to simply write the block to another location. More sophisticated strategies could
remap an entire “semantic unit” at a time (e.g., a user file), thus preserving logical
contiguity.
 • Redundancy: Finally, redundancy (in its many forms) can be used to recover
from block loss. The simplest form is replication, in which a given block has
two (or more) copies in different locations within a disk. Another redundancy
approach employs parity to facilitate error correction. Similar to RAID 4/5 [81], by
adding a parity block per block group, a file system can tolerate the unavailability
or corruption of one block in each such group. More complex encodings (e.g.,
Tornado codes [69]) could also be used, a subject worthy of future exploration.
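     A minimal sketch of the parity mechanism follows; the group size is illustrative,
and the construction of redundancy sets this thesis actually proposes appears in
Chapter 5. The parity block is the XOR of the data blocks in its group, and any
single lost block can be regenerated by XORing the parity with the surviving blocks.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define BLK    4096
    #define GROUP  8            /* data blocks per parity group (illustrative) */

    /* Compute the parity block of a group: the XOR of its data blocks. */
    void compute_parity(const uint8_t data[GROUP][BLK], uint8_t parity[BLK])
    {
        memset(parity, 0, BLK);
        for (int i = 0; i < GROUP; i++)
            for (size_t j = 0; j < BLK; j++)
                parity[j] ^= data[i][j];
    }

    /*
     * Recover one lost data block: XOR the parity with every surviving block
     * in the group.  A single unavailable or corrupted block per group can
     * thus be regenerated; two failures in the same group cannot.
     */
    void reconstruct(const uint8_t data[GROUP][BLK], int lost,
                     const uint8_t parity[BLK], uint8_t out[BLK])
    {
        memcpy(out, parity, BLK);
        for (int i = 0; i < GROUP; i++) {
            if (i == lost)
                continue;
            for (size_t j = 0; j < BLK; j++)
                out[j] ^= data[i][j];
        }
    }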
    However, redundancy within a disk can have negative consequences. First,
replicas must account for the spatial locality of failure (e.g., a surface scratch that
corrupts a sequence of neighboring blocks); hence, copies should be allocated
across remote parts of the disk, which can lower performance. Second, in-disk
redundancy techniques can incur a high space cost; however, in many desktop set-
tings, drives have sufficient available free space [32].
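     The placement tension can be seen with a toy model. The exponential correlation
form and the constants below are assumptions made purely for illustration, not the
probabilistic model the dissertation develops; the point is only that the chance of
losing both a block and its replica falls toward the independent-failure case as their
separation grows, which is why remote placement (and its seek cost) pays off.

    #include <math.h>
    #include <stdio.h>

    /*
     * Toy spatial-correlation model (an assumption of this sketch): each block
     * fails with probability P_BASE, and, given a failure at one block, a block
     * d blocks away also fails with probability
     *     q(d) = P_BASE + (1 - P_BASE) * C * exp(-d / L),
     * so nearby blocks are much more likely to share a fault.
     */
    #define P_BASE 1e-4      /* per-block failure probability        */
    #define C      0.5       /* strength of the spatial correlation  */
    #define L      64.0      /* correlation length, in blocks        */

    static double both_fail(double d)
    {
        double q = P_BASE + (1.0 - P_BASE) * C * exp(-d / L);
        return P_BASE * q;   /* P(first fails) * P(second fails | first) */
    }

    int main(void)
    {
        double gaps[] = { 1, 8, 64, 512, 4096, 65536 };
        printf("%10s  %s\n", "gap", "P(block and replica both fail)");
        for (size_t i = 0; i < sizeof gaps / sizeof gaps[0]; i++)
            printf("%10.0f  %.3e\n", gaps[i], both_fail(gaps[i]));
        printf("independent placement would give %.3e\n", P_BASE * P_BASE);
        return 0;
    }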
place that can detect corruption of data in higher levels of the storage stack (e.g.,
within the device driver or drive controller).
     A second reason for implementing detection and recovery in the file system
is that the file system has exact knowledge of how blocks are currently being
used. Thus, the file system can apply detection and recovery intelligently across
different block types. For example, the file system can provide a higher level
of replication for its own metadata, perhaps leaving failure detection and correc-
tion of user data to applications (indeed, this is one specific solution that we ex-
plore in Section 5.2). Similarly, the file system can provide machinery to enable
application-controlled replication of important data, thus enabling an explicit per-
formance/reliability trade-off.
     A third reason is performance: file systems and storage systems have an “un-
written contract” [98] that allows the file system to lay out blocks to achieve high
bandwidth. For example, the unwritten contract stipulates that adjacent blocks in
the logical disk address space are physically proximate. Disk-level recovery mech-
anisms, such as remapping, break this unwritten contract and cause performance
problems. If the file system instead assumes this responsibility, it can itself remap
logically-related blocks (e.g., a file) and hence avoid such problems.
     However, there are some complexities to placing IRON functionality in the
file system. First, some of these techniques require new persistent data structures
(e.g., to track where redundant copies or parity blocks are located). Second, some
mechanisms require control of the underlying drive mechanisms. For example, to
recover on-disk data, modern drives will attempt different positioning and reading
strategies [7]; no interface exists to control these different low-level strategies in
current systems. Third, a file system may not know the exact disk geometry, which
makes it harder to place redundant copies such that they tolerate spatially local
errors.
stack, as shown in Figure 2.1. Because many layers exist between the storage
subsystem and the file system, and errors can occur in these layers as well, the file
system must ultimately be responsible for detecting and perhaps recovering from
such errors. Ironically, a complex RAID controller can consist of millions of lines
of code [127], and hence be a source of faults itself.
     Third, depending on the particular RAID system employed, not all types of
disk faults may be handled. For example, lower-end RAID controller cards do not
use checksums to detect data corruption, and only recently have some companies
included machinery to cope with latent sector faults [28].
     Hence, we believe that IRON techniques within a file system are useful for
all single-disk systems, and even when multiple drives are used in a RAID-like
manner. Although we focus on single-disk systems in this dissertation, we believe there
is a rich space left for exploration between IRON file systems and redundant storage
arrays.
Chapter 3
Background
Our failure policy analysis focuses entirely on journaling file systems, as most mod-
ern file systems perform journaling. To understand the analysis methodology, it is
necessary to understand the general concepts of journaling, transactions, and
checkpointing, along with some file-system-specific details, such as how the on-disk
structures are updated.
    In order to maintain file system integrity, journaling file systems use the
concept of transactions [46] and follow certain write-ordering constraints when
they issue their updates. Depending upon the file system, transactions may be
composed of whole file system blocks or only modified records. In our analysis, we
examine five different journaling file systems. Four of them are open source (Linux
ext3 [121], ReiserFS [89], JFS [19], and XFS [112]), and Windows NTFS [109] is
closed source. There are two reasons for selecting these file systems. First, they are
widely used, being the default file systems in several operating system configurations
(e.g., ext3 is the default file system in Red Hat Linux, ReiserFS in the SuSE Linux
distribution, and NTFS in Windows NT, 2000, and XP). Second, in addition to
being used in personal computers, they are also used in high-end storage arrays [36,
50].
    In this chapter, we first give a general introduction to journaling file systems
(Section 3.1) and then briefly discuss the different commodity journaling file systems
we analyze (Sections 3.1.1 to 3.1.5).
Figure 3.1: Journaling Modes. The diagram depicts the three different journaling modes
supported in modern file systems: writeback, ordered, and data. In the diagram, time flows
downward. Boxes represent updates to the file system, e.g., “Journal (Inode)” implies the
write of an inode to the journal; the other destination for writes is labeled “Fixed”, which
is a write to the fixed in-place ext2 structures. An arrow labeled with a “Sync” implies that
the two blocks are written out in immediate succession synchronously, hence guaranteeing
the first completes before the second. A curved arrow indicates ordering but not immediate
succession; hence, the second write will happen at some later time. Finally, for writeback
mode, the dashed box around the “Fixed (Data)” block indicates that it may happen at any
time in the sequence. In this example, we consider a data block write and its inode as the
updates that need to be propagated to the file system; the diagrams show how the data flow
is different for each of the journaling modes.
circular buffer; once the necessary information has been propagated to its fixed
location structures, journal space can be reclaimed.
Journaling Modes: Modern journaling file systems such as ext3 and ReiserFS
present three flavors of journaling: writeback mode, ordered mode, and data jour-
naling mode; Figure 3.1 illustrates the differences between these modes. The
choice of mode is made at mount time and can be changed via a remount.
    In writeback mode, only file system metadata is journaled; data blocks are writ-
ten directly to their fixed location. This mode does not enforce any ordering be-
tween the journal and fixed-location data writes, and because of this, writeback
mode has the weakest integrity and consistency semantics of the three modes. Al-
though it guarantees integrity and consistency for file system metadata, it does not
provide any corresponding guarantees for data blocks.
    For example, assume a process appends a block to a file; internally, the file
system allocates a new pointer within the inode I and a new data block at address
D to hold the data. In writeback mode, the file system is free to write D at any
time, but is quite careful in how it updates I. Specifically, it will force a journal
record to disk containing I, followed by a commit block. Once the commit block is
written, the file system knows it can safely recover from a crash, and at some time
later will write I to its fixed location structures. However, if I makes it to the log
successfully, and a crash happens before D gets to disk, upon recovery I will point
to D but the contents of D will not be correct.
    In ordered journaling mode, again only metadata writes are journaled; however,
data writes to their fixed location are ordered before the journal writes of the meta-
data. In contrast to writeback mode, this mode provides more sensible integrity
semantics, in which a metadata block is guaranteed not to point to a block that does
not belong to the file.
    Continuing from our example above, D will be forced to its fixed location
before I is written to the journal. Thus, even with an untimely crash, the file system
will recover in a reasonable manner.
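    The ordering can be summarized in a short sketch. The helpers below are
hypothetical (real ext3 drives this through its journaling layer with compound
transactions and asynchronous I/O); the sketch only fixes the order in which the
writes described above must reach the disk.

    #include <stdint.h>

    /* Hypothetical primitives: each issues one block write; wait_for_io() is a
       barrier guaranteeing that previously issued writes are on the platter. */
    int  write_fixed_block(uint64_t blkno, const void *buf);
    int  write_journal_block(const void *buf);
    void wait_for_io(void);

    /*
     * Ordered-mode append of data block D and its inode I:
     *   1. D reaches its fixed location,
     *   2. the journal record containing I is written,
     *   3. the commit block seals the transaction,
     *   4. I is checkpointed to its fixed location (in practice, lazily).
     * A crash before step 3 simply discards the transaction, so I never points
     * to a data block whose contents did not make it to disk.
     */
    int ordered_append(uint64_t d_blkno, const void *data,
                       uint64_t i_blkno, const void *inode,
                       const void *commit_record)
    {
        if (write_fixed_block(d_blkno, data))      /* step 1 */
            return -1;
        wait_for_io();
        if (write_journal_block(inode))            /* step 2 */
            return -1;
        wait_for_io();
        if (write_journal_block(commit_record))    /* step 3 */
            return -1;
        wait_for_io();
        return write_fixed_block(i_blkno, inode);  /* step 4 */
    }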
    However, in ordered journaling mode, consistency between metadata and data
might be affected. For example, if a data block is overwritten and the system
crashes before writing the inode block to the journal, after recovery, the file sys-
tem inode timestamps will be inconsistent with the recency of the data.
    In full data journaling mode, the file system logs both metadata and data to
the journal. This decision implies that when a process writes a data block, it will
typically be written out to disk twice: once to the journal, and then later to its fixed
location. Data journaling mode provides the strongest integrity and consistency
guarantees; however, it has different performance characteristics, in some cases
3.1.1 Ext3
Linux ext3 [121, 122] is the default journaling file system in several distributions of
Linux (e.g., Red Hat), and it is built as an extension to the ext2 file system. In ext3,
IB = Inode Bitmap, DB = Data Bitmap, JS = Journal Superblock, JD = Journal Descriptor Block, JC = Journal Commit Block
Figure 3.2: Ext3 Layout. The picture shows the layout of an ext3 file system. The disk
address space is broken down into a series of block groups (akin to FFS cylinder groups),
each of which has bitmaps to track allocations and regions for inodes and data blocks.
The ext3 journal is depicted here as a file within the first block group of the file system;
it contains a superblock, various descriptor blocks to describe its contents, and commit
blocks to denote the ends of transactions.
data and metadata are eventually placed into the standard ext2 structures, which
are the fixed-location structures. In this organization (which is loosely based on
FFS [71]), the disk is split into a number of block groups; within each block group
are bitmaps, inode blocks, and data blocks. The ext3 journal (or log) is commonly
stored as a file within the file system, although it can be stored on a separate device
or partition. Figure 3.2 depicts the ext3 on-disk layout.
Ext3 Journal Structure: Ext3 supports all three journaling modes and uses
additional metadata structures to track the list of journaled blocks. The journal
superblock tracks summary information for the journal, such as the block size and
head and tail pointers. A journal descriptor block marks the beginning of a trans-
action and describes the subsequent journaled blocks, including their final fixed
on-disk location. In data journaling mode, the descriptor block is followed by the
data and metadata blocks; in ordered and writeback mode, the descriptor block
is followed by the metadata blocks. In all modes, ext3 logs full blocks (physical
journaling), as opposed to differences from old versions; thus, even a single bit
change in a bitmap results in the entire bitmap block being logged. Depending
upon the size of the transaction, multiple descriptor blocks each followed by the
corresponding data and metadata blocks may be logged. Finally, a journal commit
block is written to the journal at the end of the transaction; once the commit block
is written, the journaled data can be recovered without loss.
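    The structures below restate that layout in simplified C. They are illustrative
only: the field names, sizes, and the tag limit are assumptions of this sketch and do
not reproduce the actual on-disk format of ext3’s journaling layer.

    #include <stdint.h>

    #define MAX_TAGS 32   /* journaled blocks described by one descriptor (illustrative) */

    /* Journal superblock: summary information such as block size, head, and tail. */
    struct journal_sb {
        uint32_t block_size;
        uint32_t head;        /* where the next transaction will be written */
        uint32_t tail;        /* oldest transaction not yet checkpointed    */
    };

    /* Descriptor block: marks the beginning of a transaction and records the
       final fixed on-disk location of each full block journaled after it. */
    struct journal_descriptor {
        uint32_t magic;
        uint32_t transaction_id;
        uint32_t ntags;
        uint64_t fixed_location[MAX_TAGS];
    };

    /* Commit block: written last; once it is on disk, the transaction can be
       replayed after a crash without loss. */
    struct journal_commit {
        uint32_t magic;
        uint32_t transaction_id;
    };

    /* One committed transaction in the log is therefore laid out as:
       [descriptor][journaled block 1] ... [journaled block N][commit]       */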
Ext3 Transactions: Ext3 maintains a single, system-wide, compound transac-
tion into which all the file system updates are added and committed. Although
group commit improves performance, it can potentially result in combining unre-
lated updates, which can lead to a tangled synchrony between asynchronous and
synchronous traffic in the file system [84]. Ext3 commits and checkpoints trans-
actions under the influence of three factors: application-initiated synchronization
calls, journal size, and commit timer settings.
3.1.2 ReiserFS
ReiserFS [89] is the default journaling file system under certain distributions of
Linux such as SuSE. The general behavior of ReiserFS is similar to ext3. For exam-
ple, both file systems support all three journaling modes, both have compound
transactions, and both perform physical journaling. However, ReiserFS differs from
ext3 in three primary ways.
     First, the two file systems use different on-disk structures to track their fixed-
location data. Ext3 uses the same structures as ext2; for improved scalability, Reis-
erFS uses a B+ tree, in which data is stored on the leaves of the tree and the metadata
is stored on the internal nodes. Since the impact of the fixed-location data structures
is not the focus of this thesis, this difference is largely irrelevant.
     Second, the format of the journal is slightly different. In ext3, the journal can
be a file, which may be anywhere in the partition and may not be contiguous. The
ReiserFS journal is not a file and is instead a contiguous sequence of blocks at
the beginning of the file system; as in ext3, the ReiserFS journal can be put on a
different device. Further, ReiserFS limits the journal to a maximum of 32 MB.
     Third, ext3 and ReiserFS differ slightly in their journal contents. In ReiserFS,
the fixed locations for the blocks in the transaction are stored not only in the de-
scriptor block but also in the commit block. Also, unlike ext3, ReiserFS uses only
one descriptor block in every compound transaction, which limits the number of
blocks that can be grouped in a transaction.
3.1.3 JFS
JFS is a journaling file system developed by IBM; it was designed with
journaling support fully integrated from the start rather than added later. The
journal is located, by default, at the end of the partition and is treated as a contigu-
ous sequence of blocks [18]. One cannot specify a journaling mode when mounting
the file system.
     In previous work, we inferred that JFS uses ordered journaling mode [84]. Due to
the small amount of traffic to the journal, it was clear that JFS was not employing
data journaling. To differentiate between writeback and ordered modes, we observed
that the ordering of writes matched that of ordered mode: when a data block
is written by the application, JFS orders the write such that the data block is written
successfully before the metadata writes are issued.
     JFS does logging at the record level (logical journaling). That is, whenever an
inode, index tree, or directory tree structure changes, only that structure is logged
instead of the entire block containing the structure. As a result, JFS writes fewer
                                                                                      29
journal blocks than ext3 and ReiserFS for the same operations.
    By default, JFS does not group concurrent updates into a single compound trans-
action, although there are circumstances in which transactions are grouped: for
example, if the write commit records are on the same log page.
    Finally, there are no commit timers in JFS and the fixed-location writes happen
whenever the Linux kernel timer (i.e., the kupdate daemon) expires. However,
the journal writes are never triggered by the timer: journal writes are indefinitely
postponed until there is another trigger such as memory pressure or an unmount
operation. This infinite write delay can limit reliability, as a crash can result in data
loss even for data that was written minutes or hours before.
3.1.4 XFS
XFS is a 64-bit journaling file system developed by SGI. XFS supports only ordered
journaling mode; data and writeback journaling modes are not available [112]. Sim-
ilar to JFS, XFS journals only records in the log instead of writing whole blocks.
Since only records are logged, XFS has no separate journal commit block; a jour-
nal block contains different types of records and a commit record is one of them.
Also, there is no separate journal super block. Usually, the journal super block
(or its equivalent) stores the ‘dirty’ state of the journal, which is used to determine
whether the journal has to be replayed on a mount. In XFS, on every mount, the log
is searched to find any transaction that needs to be replayed. On a clean unmount,
XFS writes an ‘unmount’ record into the journal to mark the journal as clean.
     The fixed-location structures of XFS consist of allocation groups where each
allocation group manages its own inodes and free space. Since each allocation
group is, in effect, an independent entity, the kernel can interact with multiple
groups simultaneously. Internally, each allocation group uses B+ trees to keep track
of data and metadata.
3.1.5 NTFS
Microsoft’s NTFS is a widely used (and default) journaling file system on Win-
dows operating systems such as NT, 2000, and XP. Although neither the source code nor
the documentation of NTFS is publicly available, tools for determining the NTFS file
layout exist [2].
    Every object in NTFS is a file. Even metadata is stored in terms of files. The
journal itself is a file and is located near the center of the file system. In
related work, we found that NTFS does not implement data journaling [84]. We
found that NTFS, similar to JFS and XFS, does logical journaling, where it journals
metadata in terms of records. We also inferred that NTFS performs ordered journaling
by introducing artificial delays on data writes and noting the corresponding delays
on metadata writes as well.
Table 3.1: Journaling File Systems Summary. This table gives a summary of the journal-
ing properties of commodity journaling file systems. Although ext3 and ReiserFS support
all three journaling modes, ordered journaling is the default.
Chapter 4
In this chapter, we describe our file system failure policy analysis. Broadly, there
are two ways to unearth failure policies. The first is to examine the source code
to understand what detection and recovery techniques are used by a file system
under different failure scenarios. The second is to inject failures and monitor how
they are handled. The first approach is time consuming and error prone: the code
base consists of several thousands (even hundreds of thousands, for commercial file
systems) of lines of intricate code involving many low-level calls.
Instead, we use a unique Semantic Failure Analysis (SFA) technique, wherein we
use block type information and transactional semantics to fail particular blocks in
order to understand how file systems handle different block failures.
    We analyze five different journaling file systems (Linux ext3, ReiserFS, JFS,
XFS, and Windows NTFS) and from our analysis we draw the following main
conclusions: (a) Commodity file systems store little or no redundant information
on disk. As a result, they are not able to recover when portions of a disk fail. (b) We
find illogical inconsistency in file system failure handling policies. (c) Finally, we
find several bugs in failure handling routines, which show that it is hard to handle
block failures in a logically consistent manner.
    We begin by explaining the failure analysis, which uses semantic block level
information (Section 4.1), followed by a description of our analysis methodology
(Section 4.2), and finally present the results (Section 4.3).
              Workload                              Purpose

              Singlets:
                access, chdir, chroot, stat,
                statfs, lstat, open, utimes,
                read, readlink, getdirentries,
                creat, link, mkdir, rename,         Exercise the Posix API
                chown, symlink, write, truncate,
                rmdir, unlink, mount, chmod,
                fsync, sync, umount

              Generics:
                Path traversal                      Traverse hierarchy
                Recovery                            Invoke recovery
                Log writes                          Update journal
Table 4.1: Workloads. The table presents the workloads applied to the file systems under
test. The first set of workloads each stresses a single system call, whereas the second group
invokes general operations that span many of the calls (e.g., path traversal).
steps in detail.
Table 4.2: Ext3 Data Structures. The table presents the data structures of interest in ext3
file system. In the table, we list the names of the major structures and their purpose.
sufficiently large files are created to access these structures. Other file systems
have similar peculiarities that we make sure to exercise (e.g., the B+-tree balanc-
ing code of ReiserFS). The block types of the file systems we test are listed in
Tables 4.2, 4.3, 4.4, 4.5, and 4.6.
Error Model
In our error model, we assume that the latent faults or block corruption originate
from any of the layers of the storage stack. These errors can be accurately mod-
eled through software-based fault injection because in Linux, all detected low-level
errors are reported to the file system in a uniform manner as “I/O errors” at the
device-driver layer.
Table 4.3: ReiserFS Data Structures. The table presents the ReiserFS data structures of
interest and their purpose.
Table 4.4: JFS Data Structures. The table presents the JFS data structures of interest and
their purpose.
Table 4.5: XFS Data Structures. The table presents the XFS data structures of interest
and their purpose.
Table 4.6: NTFS Data Structures. The table presents the NTFS data structures of interest
and their purpose.
     The errors we inject into the block write stream have three different attributes,
similar to the classification of faults injected into the Linux kernel by Gu et al. [49].
The fault specification consists of the following attributes:
Failure Type: This specifies whether a read or write must be failed. If it is a read
error, one can specify either a latent sector fault or block corruption. Additional
information, such as whether the system must be crashed before or after a certain
block failure, can also be specified.
Block Type: This attribute specifies the file system and block type to be failed.
The request to be failed can be a dynamically-typed one (like a directory block) or
a statically typed one (like a super block). Specific parameters can also be passed
such as an inode number of an inode to be corrupted or a particular block number
to be failed.
Transience: This determines whether the fault that is injected is a transient error
(i.e., fails for the next N requests, but then succeeds), or a permanent one (i.e., fails
upon all subsequent requests).
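    Taken together, a fault specification is a small record with one field per attribute.
The C structure below is a hypothetical rendering of such a record (the actual
coordinator/driver interface is not reproduced here):

    #include <stdint.h>
    #include <stdbool.h>

    enum io_dir     { IO_READ, IO_WRITE };
    enum read_fault { RF_LATENT_SECTOR_FAULT, RF_CORRUPTION };
    enum blk_type   { BT_SUPER, BT_JOURNAL_DESC, BT_JOURNAL_COMMIT,
                      BT_INODE, BT_DIR, BT_BITMAP, BT_DATA };

    struct fault_spec {
        /* Failure type */
        enum io_dir     dir;            /* fail a read or a write                */
        enum read_fault read_fault;     /* which kind, if dir == IO_READ         */
        bool            crash_before;   /* optionally crash around the failure   */
        bool            crash_after;

        /* Block type */
        enum blk_type   type;           /* statically or dynamically typed block */
        int64_t         inode_number;   /* specific inode to corrupt, or -1      */
        int64_t         block_number;   /* specific block to fail, or -1         */

        /* Transience */
        bool            permanent;      /* fail all subsequent requests ...      */
        unsigned        fail_next_n;    /* ... or only the next N requests       */
    };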
Our testing framework is shown in Figure 4.1(a). It consists of two main com-
ponents: a device driver called the fault-injection driver and a user-level process
labeled the coordinator. The driver is positioned between the file system and the
disk and is used to observe I/O traffic from the file system and to inject faults at
certain points in the I/O stream. The coordinator monitors and controls the entire
process by informing the driver as to which specific fault to insert, running work-
loads on top of the file system, and then observing the resultant behavior. We use
a similar fault injection framework for NTFS in Windows, where a filter driver is
used to interpose on the file system traffic.
    A flow diagram of the benchmarking process is shown in Figure 4.1(b). We
now describe the entire process in detail.
The Fault-Injection Driver: The fault-injection driver (or just “driver”) is a pseudo-
device driver, and hence appears as a typical block device to the file system. It is
placed in the Linux kernel and associated with a real disk. The file system of inter-
est is mounted on top of the pseudo-device and the driver simply interposes upon
all I/O requests to the real underlying disk. As the driver passes the traffic to and
from the disk, it also efficiently tracks each request and response by storing a small
record in a fixed-sized circular buffer.
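The following is an illustrative sketch of such a fixed-size circular trace buffer; the record fields and function names are hypothetical rather than the driver's actual data structures.

    /* Illustrative fixed-size circular buffer of request/response records. */
    #define TRACE_SLOTS 4096

    struct io_record {
        unsigned long long block_num;  /* block addressed by the request */
        int                is_write;   /* 1 for a write, 0 for a read */
        int                result;     /* 0 on success, error code on failure */
    };

    struct io_trace {
        struct io_record slots[TRACE_SLOTS];
        unsigned long    next;         /* monotonically increasing insert index */
    };

    static void trace_record(struct io_trace *t, unsigned long long block,
                             int is_write, int result)
    {
        struct io_record *r = &t->slots[t->next % TRACE_SLOTS]; /* wraps around */
        r->block_num = block;
        r->is_write  = is_write;
        r->result    = result;
        t->next++;                     /* the oldest record is overwritten when full */
    }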
     The driver has three main roles in our system. First, it must classify each block
that is written to disk based on its type (i.e., what specific file-system data struc-
[Figure 4.1 appears here. Panel (a) shows the coordinator (communicating via ioctl() and the system log) above the VFS layer, the ext3/ReiserFS/JFS/XFS file systems on Linux 2.6.9, the fault-injection driver beneath them, and the SCSI/IDE layer below. Panel (b) shows the driver's flow: save the fault specification from the coordinator, receive file system read/write requests, classify block types, check whether the block matches the journaling model and the fault specification, and report errors accordingly.]
Figure 4.1: Benchmarking Framework and Algorithm Flow. Figure (a) shows the
benchmarking framework we use to measure the fault tolerance of journaling file systems
to I/O failures. The two main components in the figure are the user level process that issues
the fault and the fault-injection driver that classifies blocks and injects faults. Figure (b)
shows a simplified flowchart of our benchmarking algorithm that is implemented by the
fault-injection driver.
ture this write represents). For example, one must interpret the contents of the
journal to infer the type of journal block (e.g., a descriptor or commit block) and
one must interpret the journal descriptor block to know which data blocks are jour-
naled. Therefore, it is necessary that we interpret the semantics online so that block
failures can be injected correctly. We explain the complexity and overhead of this
classification in Section 4.2.5.
    Second, the driver must “model” what the journaling file system above is doing.
Specifically, for write failures, such a model represents the correct sequence of
states that a transaction must go through in committing to disk. For read failures, it
is sufficient to correctly interpret the file system block types. By inserting failures
for specific block types and transactional states, we can observe how the file system
handles different types of faults and better judge if it correctly handles the faults
we have injected.
    Third, the driver is used to inject faults into the system. This layer injects both
block faults (on reads or writes) and block corruption (on reads). To emulate a block
fault, we simply return the appropriate error code and do not issue the operation to
the underlying disk. To emulate corruption, we change bits within the block before
returning the data; in some cases we inject random noise, whereas in other cases
we use a block similar to the expected one but with one or more corrupted fields.
The software layer also models both transient and permanent (sticky) faults.
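The per-request decision just described can be sketched as follows; the helper functions and types (classify_block, matches_spec, pass_to_disk) are hypothetical stand-ins, assuming block classification and fault-specification matching are implemented elsewhere in the driver.

    /* Sketch of the per-request fault-injection decision; illustrative only. */
    #include <errno.h>
    #include <stdlib.h>

    enum action { PASS_THROUGH, FAIL_REQUEST, CORRUPT_DATA };

    /* Assumed to exist elsewhere in the driver. */
    extern int classify_block(unsigned long long block_num, int is_write);
    extern enum action matches_spec(int block_type, int is_write);
    extern int pass_to_disk(unsigned long long block_num, void *buf, int is_write);

    int handle_request(unsigned long long block_num, void *buf,
                       size_t len, int is_write)
    {
        int block_type = classify_block(block_num, is_write);

        switch (matches_spec(block_type, is_write)) {
        case FAIL_REQUEST:
            /* Emulate a block fault: do not issue the request to the disk,
             * simply return an I/O error to the file system. */
            return -EIO;
        case CORRUPT_DATA: {
            /* Emulate corruption on a read: serve the real data, then flip
             * bits before handing the block back to the file system. */
            int ret = pass_to_disk(block_num, buf, 0);
            if (ret == 0 && len > 0)
                ((unsigned char *)buf)[rand() % (unsigned)len] ^= 0xFF;
            return ret;
        }
        default:
            /* The request does not match the fault specification. */
            return pass_to_disk(block_num, buf, is_write);
        }
    }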
     By injecting failures just below the file system, we emulate faults that could be
caused by any of the layers in the storage subsystem. Therefore, unlike approaches
that emulate faulty disks using additional hardware [24], we can imitate faults in-
troduced by buggy device drivers and controllers. A drawback of our approach
is that it does not discern how lower layers handle disk faults; for example, some
SCSI drivers retry commands after a failure [90]. However, given that we are char-
acterizing how file systems react to faults, we believe this is the correct layer for
fault injection.
After the fault is injected, the coordinator collects the error logs from the child process,
the system log, and the driver.
After running a workload and injecting a fault, the final step is to determine how the
file system behaved. To determine how a fault affected the file system, we compare
the results of running with and without the fault. We perform this comparison
across all observable outputs from the system: the error codes and data returned
by the file system API, the contents of the system log, and the low-level I/O traces
recorded by the fault-injection layer. Currently, this is the most human-intensive
part of the process, as it requires manual inspection of the visible outputs.
    In certain cases, if we identify an anomaly in the failure policy, we check the
source code to verify specific conclusions; however, given its complexity, it is not
feasible to infer failure policy only through code inspection.
     We collect large volumes of results in terms of traces and error logs for each
fault injection experiment we run. Due to the sheer volume of experimental data, it
is difficult to present all results for the reader’s inspection. We therefore represent file system
failure policies using a representation we call failure policy graphs, an example of which
is shown in Figure 4.2.
    In Figure 4.2, we plot the different workloads on the x-axis and the file system data
structures on the y-axis. If applicable, each <row, column> entry presents the IRON
detection or recovery technique used by a file system. If not applicable (i.e., if the
workload does not generate the particular block type traffic), a gray box is used.
The symbols are superimposed when multiple mechanisms are employed by a file
system.
    Next, we explain how the entries in Figure 4.2 must be interpreted by walking
through an example. Specifically, consider the entry for workload “W1” and “D5”
data structure. It has three symbols superimposed: retry (RRetry), error propagation
(RPropagate), and a file system stop (RStop). This means that whenever an
I/O to “D5” fails during workload “W1”, the file system first retries the request; if that fails,
it stops the file system and propagates the error to the application.
     We use failure policy graphs not only to present our analysis results but also
throughout this thesis to represent the failure policies of different versions of IRON
file systems we build.
[Figure 4.2 appears here: workloads W1-W4 on the x-axis and data structures D1-D5 on the y-axis.]
Figure 4.2: Example Failure Policy Graph. The figure presents a sample failure policy
graph. A gray box indicates that the workload is not applicable for the block type. If
multiple mechanisms are observed, the symbols are superimposed.
[Figure 4.3 appears here: (a) Data Journaling Model, (b) Ordered Journaling Model, and (c) Writeback Journaling Model, each a state diagram over states S0, S1, and S2 plus the failure state S3, with transitions labeled J, C, D, K, S, and F.]
Figure 4.3: Journaling Models. This figure shows the models that we use for verifying
the different journaling modes. Each model is built based on a regular expression and then
the state S3, which represents the state that is reached on any write failure, is added to it.
In the above models, J represents journal writes, D represents data writes, C represents
journal commit writes, K represents checkpoint writes, S represents journal super block
writes, and F represents any write failure.
Data Journaling: Data journaling can be expressed by the following regular expression:
((J+ C)+ (K∗ S∗)∗)+.
In data journaling mode, all the file system writes are journaled (represented by J)
and there are no ordered or unordered writes. After writing one or more journal
blocks, a commit block (represented by C) is written by the file system to mark the
end of the transaction. The file system could write one or more such transactions
to the log. Once the transactions are committed, the file system might write the
checkpoint blocks (represented by K) to their fixed locations or the journal super
block (represented by S) to mark the new head and tail of the journal. We convert
this regular expression to a state diagram as shown in Figure 4.3(a) and add the
failure state S3 to it.
Ordered Journaling: Ordered journaling can be expressed by the following reg-
ular expression:
(((D∗ J+ D∗)+ C)+ (K∗ S∗)∗)+.
In ordered mode, data (D) must be written before metadata blocks are committed
to the journal. Note that the data blocks can be issued in parallel with the journal
writes (J) but all of those writes must be complete before the commit block (C) is
written. Once the commit block is written, the transaction is over. There could be
one or more such transactions. Similar to data journaling, the file system can write
the checkpoint blocks (K) or the journal super block (S) after the transactions. This
regular expression is converted into a state diagram and a failure state S3 is added
to it as shown in Figure 4.3(b).
Writeback Journaling: Writeback journaling is given by the following regular
expression:
(((D∗ J+ D∗)+ C)+ (K∗ D∗ S∗)∗)+.
In writeback journaling mode, the data (D) can be written at any time by the file
system. It can be written before the journal writes (J) or after them. Once the
journal writes are complete, a commit block (C) is written. After the transaction is
committed, the file system can write the journal super block (S) or the checkpoint
blocks (K) or the unordered writes (D). The writeback journaling model in Fig-
ure 4.3(c) is obtained by taking this regular expression and adding the state S3 to
it.
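As an illustration, the ordered journaling model above can be encoded as a small state machine that the driver steps on every write it observes; the encoding below is a simplified sketch (writes that fall outside the model are simply ignored), not the actual driver implementation.

    /* Sketch of the ordered-journaling model as a state machine; S3 is the
     * failure state reached on any write failure. Illustrative only. */
    enum jstate { S0, S1, S2, S3 };
    enum jwrite { W_DATA, W_JOURNAL, W_COMMIT, W_CHECKPOINT, W_SUPER };

    static enum jstate model_step(enum jstate s, enum jwrite w, int failed)
    {
        if (failed)
            return S3;                  /* any write failure moves to S3 */

        switch (s) {
        case S0:                        /* collecting data (D) and journal (J) writes */
            return (w == W_JOURNAL) ? S1 : S0;
        case S1:                        /* journal written, waiting for the commit block */
            return (w == W_COMMIT) ? S2 : S1;
        case S2:                        /* committed: checkpoint (K) and super (S) allowed */
            if (w == W_CHECKPOINT || w == W_SUPER)
                return S2;
            /* a new D or J write begins the next transaction */
            return (w == W_JOURNAL) ? S1 : S0;
        default:
            return S3;                  /* remain in the failure state */
        }
    }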
Table 4.7: Code Size of Fault Injection Drivers. The number of C statements (counted as
the number of semicolons) needed to perform a thorough failure policy analysis for ext3,
ReiserFS, and JFS and a preliminary analysis for XFS and NTFS. Our XFS and NTFS
analyses are preliminary because we do not run all the workloads and fail all the data
structures in these file systems.
Only approximately 350 to 500 statements are needed to support each of the different
journaling file systems.
[Figure 4.4 appears here: panels (a)-(e) show the model state (S0, S1, S2, S3) as data (D), journal (J), commit (C), and checkpoint (K) writes are observed, with panel (e) showing the transition to the failure state S3 on the commit block write failure.]
Figure 4.4: Fault Injection Example. This figure shows the sequence of steps followed
by the fault-injection driver to track transaction writes in ordered journaling mode and fail
the commit block write.
recovery properties, the fault injection can take up to a few minutes if the system
needs to be rebooted. Since a system crash is caused in only a few experiments, it
did not add much overhead to the analysis.
writes, the journal blocks are logged (Figure 4.4b). The commit block is written
after all the data and journal writes are complete and it is failed (Figure 4.4c). The
file system can be oblivious to this commit block failure and continue to check-
point the journaled blocks (Figure 4.4d). Or, the file system can recognize this
failure and take steps to prevent file system corruption by moving to state S3 (Fig-
ure 4.4e). In state S3, the file system could abort the failed transaction, or do bad
block remapping, or remount itself read-only, or crash the system.
     From the example, we can see that it is not sufficient to know just the block
types to inject faults in file system requests. Without the model, the fault-injection
driver cannot determine if the requests following a write failure belong to the failed
transaction or to new transactions from the file system. By keeping track of the
writes using the journaling model, the fault-injection driver can explain why a par-
ticular block write failure leads to certain file system errors.
4.2.7 Summary
We have developed a three-step fingerprinting methodology to determine file sys-
tem failure policy. We believe our approach strikes a good balance: it is straight-
forward to run and yet exercises the file system under test quite thoroughly. Our
workload suite contains roughly 30 programs, each file system has on the order of
10 to 20 different block types, and each block can be failed on a read or a write or
have its data corrupted. For each file system, this amounts to roughly 400 relevant
tests.
failure handling by the file system is the same for all those workloads. We now provide
a qualitative summary of the results that are presented within the figure.
Table 4.8: Keys for Detection and Recovery. The table presents the keys we use to repre-
sent the detection and recovery policies in file systems.
Detection
To detect read failures, ext3 primarily uses error codes (DErrorCode ). However,
when a write fails, ext3 does not record the error code (DZero ); hence, write errors
are often ignored, potentially leading to serious file system problems (e.g., when
checkpointing a transaction to its final location). Ext3 also performs a fair amount
of sanity checking (DSanity ). For example, ext3 explicitly performs type checks
for certain blocks such as the superblock and many of its journal blocks. However,
little type checking is done for many important blocks, such as directories, bitmap
blocks, and indirect blocks. Ext3 also performs numerous other sanity checks (e.g.,
when the file-size field of an inode contains an overly-large value, open detects
this and reports an error).
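As a rough illustration of this style of sanity check, the sketch below validates the size field of a simplified inode; the structure layout and the limit are placeholders and do not reflect ext3's actual on-disk format.

    /* Illustrative inode sanity check; the layout is a simplified placeholder. */
    #include <errno.h>
    #include <stdint.h>

    struct simple_inode {
        uint32_t i_size_lo;   /* low 32 bits of the file size */
        uint32_t i_size_hi;   /* high 32 bits of the file size */
    };

    static int inode_sanity_check(const struct simple_inode *inode,
                                  uint64_t max_file_size)
    {
        uint64_t size = ((uint64_t)inode->i_size_hi << 32) | inode->i_size_lo;

        /* An overly large size field indicates a corrupt inode; report an
         * error instead of using the value. */
        if (size > max_file_size)
            return -EFBIG;
        return 0;
    }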
Recovery
For most detected errors, ext3 propagates the error to the user (RPropagate). For
read failures, ext3 also often aborts the journal (RStop ); aborting the journal usually
leads to a read-only remount of the file system, preventing future updates without
explicit administrator interaction. Ext3 also uses retry (RRetry ), although spar-
ingly; when a prefetch read fails, ext3 retries only the originally requested block.
[Figure 4.5 appears here: for read failure, write failure, and corruption faults, detection and recovery grids plot the ext3 block types (inode, dir, bitmap, i-bitmap, indirect, data, super, g-desc, j-super, j-revoke, j-desc, j-commit, j-data) against workloads a-t.]
Figure 4.5: Ext3 Failure Policies. The failure policy graphs plot detection and re-
covery policies of ext3 for read, write, and corruption faults injected for each block
type across a range of workloads. The workloads are a: path traversal b: ac-
cess,chdir,chroot,stat,statfs,lstat,open c: chmod,chown,utimes d: read e: readlink f: getdi-
rentries g: creat h: link i: mkdir j: rename k: symlink l: write m: truncate n: rmdir o: unlink
p: mount q: fsync,sync r: umount s: FS recovery t: log write operations. A gray box in-
dicates that the workload is not applicable for the block type. If multiple mechanisms are
observed, the symbols are superimposed. The keys for detection and recovery are presented
in Table 4.8.
these modes, ext3 checkpoints all the journaled blocks even from a failed transac-
tion. Below we describe a generic case where it can cause file system corruption.
    Let there be two transactions T1 and T2 , where T1 is committed first followed
by T2 . Let block B1 be journaled in T1 and blocks B1 and B2 be journaled in T2 .
Assume transaction T2 fails and that the file system continues to checkpoint blocks
B1 and B2 of the failed transaction T2 . If a crash occurs after writing blocks B1
and B2 to their fixed locations, the file system log recovery runs during next mount.
During the recovery only transaction T1 will be recovered because T2 is a failed
transaction. When T1 is recovered, the contents of block B1 will be overwritten by
old contents from T1 . After the recovery, the file system will be in an inconsistent
state where block B1 is from transaction T1 and block B2 is from transaction T2 .
    The problem explained above can occur in ext3 when the write of a journal
metadata block such as a descriptor, revoke, or commit block fails. This can
lead to file system corruption, resulting in loss of files, inaccessible directories, and
so on.
Not Replaying Failed Checkpoint Writes: Checkpointing is the process of writ-
ing the journaled blocks from the log to their fixed locations. When a checkpoint
write fails, the file system must either attempt to write again or mark the journal
such that the checkpoint write will happen again during the next log replay. Ext3
does not replay failed checkpoint writes. This can cause data corruption, data loss,
or loss of files and directories.
Not Replaying Transactions: Journaling file systems maintain a state variable to
mark the log as dirty or clean. When the file system is mounted, if the log is dirty,
the transactions from the log are replayed to their fixed locations. Usually jour-
naling file systems update this state variable before starting a transaction and after
checkpointing the transaction. If the write to update this state variable fails, two
things can happen: first, the file system might replay a transaction that
need not be replayed; second, it might fail to replay a transaction that needs recovery.
Replaying the same transaction again does not cause any integrity problems, but
the second possibility (i.e., not replaying the journal contents) can lead to corruption
or loss of data.
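A minimal sketch of this mount-time decision is shown below, assuming a hypothetical journal super block structure and helper routines; it only illustrates why a lost update to the state variable can cause transactions that need recovery to be skipped.

    /* Sketch of the mount-time replay decision; names are hypothetical. */
    struct journal_sb {
        unsigned int s_state;          /* non-zero: journal dirty, needs replay */
    };

    /* Assumed helpers: read/write the journal super block, replay the log. */
    extern int read_journal_sb(struct journal_sb *jsb);
    extern int write_journal_sb(const struct journal_sb *jsb);
    extern int replay_log(void);

    int mount_time_recovery(void)
    {
        struct journal_sb jsb;

        if (read_journal_sb(&jsb) != 0)
            return -1;

        /* If the earlier write that set s_state to non-zero silently failed,
         * the journal looks clean here and transactions that still need
         * replay are skipped -- the second failure case described above. */
        if (jsb.s_state != 0) {
            if (replay_log() != 0)
                return -1;
            jsb.s_state = 0;           /* mark the journal clean again */
            return write_journal_sb(&jsb);
        }
        return 0;
    }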
    Ext3 maintains its journal state in the journal super block. Ext3 clears this field
and writes the journal super block to indicate a clean journal. To mark the journal
as dirty, the journal super block is written with a non-zero value in this field. When
the journal super block write fails, ext3 does not attempt to write it again or save the
super block in other locations. Moreover, even after the journal super block failure,
ext3 continues to commit transactions to the log. If the write of the journal super block
that marks the journal as dirty fails, the journal appears clean on the next mount. If
any transactions need replay due to a previous crash, ext3 fails to replay them. This
can result in lost files and directories.
Replaying Failed Transactions: When a journal data block write fails, that trans-
action must be aborted and not replayed. If the transaction is replayed, journal data
blocks with invalid contents might be read and written to the fixed location. If not
handled properly, this can lead to serious file system errors. As noted earlier, ext3
does not abort failed transactions; it continues to commit them to the log. Therefore,
during recovery, it can write invalid contents to fixed-location file system blocks.
This can corrupt important file system metadata and even result in an unmountable
file system.
4.3.2 ReiserFS
Detection
Our analysis reveals that ReiserFS pays close attention to error codes across reads
and writes (DErrorCode ). ReiserFS also performs a great deal of internal sanity
checking (DSanity ). For example, all internal and leaf nodes in the balanced tree
have a block header containing information about the level of the block in the tree,
the number of items, and the available free space; the super block and journal
metadata blocks have “magic numbers” which identify them as valid; the jour-
nal descriptor and commit blocks also have additional information; finally, inodes
and directory blocks have known formats. ReiserFS checks whether each of these
blocks has the expected values in the appropriate fields. However, not all blocks
are checked this carefully. For example, bitmaps and data blocks do not have asso-
ciated type information and hence are never type-checked.
Recovery
The most prominent aspect of the recovery policy of ReiserFS is its tendency to
panic the system upon detection of virtually any write failure (RStop ). When
ReiserFS calls panic, the file system crashes, usually leading to a reboot and
recovery sequence. By doing so, ReiserFS attempts to ensure that its on-disk struc-
tures are not corrupted. ReiserFS recovers from read and write failures differently.
For most read failures, ReiserFS propagates the error to the user (RPropagate) and
sometimes performs a single retry (RRetry ) (e.g., when a data block read fails, or
when an indirect block read fails during unlink, truncate, and write opera-
tions). However, ReiserFS never retries upon a write failure.
[Figure 4.6 appears here: for read failure, write failure, and corruption faults, detection and recovery grids plot the ReiserFS block types against workloads a-t.]
Figure 4.6: ReiserFS Failure Policy. The failure policy graphs plot detection and re-
covery policies of ReiserFS for read, write, and corruption faults injected for each block
type across different workloads. The workloads are varied across the columns of the
figure, and the different block types of the ReiserFS file system are varied across the
rows. The workloads are a: path traversal b: access,chdir,chroot,stat,statfs,lstat,open
c: chmod,chown,utimes d: read e: readlink f: getdirentries g: creat h: link i: mkdir j: re-
name k: symlink l: write m: truncate n: rmdir o: unlink p: mount q: fsync,sync r: umount
s: FS recovery t: log write operations. A gray box indicates that the workload is not ap-
plicable for the block type. For multiple mechanisms, the symbols are superimposed. The
keys for detection and recovery are presented in Table 4.8.
4.3.3 JFS
Recovery
The recovery strategies of JFS vary dramatically depending on the block type.
For example, when an error occurs during a journal superblock write, JFS crashes
[Figure 4.7 appears here: for read failure, write failure, and corruption faults, detection and recovery grids plot the JFS block types against workloads a-t.]
Figure 4.7: JFS Failure Policy. The tables plot detection and recovery policies of JFS
for read, write, and corruption faults. The workloads are varied across the columns, and
the different block types are varied across the rows. The workloads are a: path traver-
sal b: access,chdir,chroot,stat,statfs,lstat,open c: chmod,chown,utimes d: read e: readlink
f: getdirentries g: creat h: link i: mkdir j: rename k: symlink l: write m: truncate n: rmdir
o: unlink p: mount q: fsync,sync r: umount s: FS recovery t: log write operations. A gray
box indicates that the workload is not applicable for the block type. For multiple mecha-
nisms, the symbols are superimposed. The keys for detection and recovery are presented in
Table 4.8.
the system (RStop ); however, other write errors are ignored entirely (RZero ). On
a block read failure to the primary superblock, JFS accesses the alternate copy
(RRedundancy ) to complete the mount operation; however, a corrupt primary results
in a mount failure (RStop ). Explicit crashes (RStop ) are used when a block allo-
cation map or inode allocation map read fails. Error codes for all metadata reads
are handled by generic file system code called by JFS; this generic code attempts
to recover from read errors by retrying the read a single time (RRetry ). Finally, the
reaction for a failed sanity check is to propagate the error (RPropagate) and remount
the file system as read-only (RStop ); during journal replay, a sanity-check failure
causes the replay to abort (RStop ).
We also found problems with the JFS failure policy. First, while JFS has some
built-in redundancy, it does not always use it as one would expect; for example,
JFS does not use its secondary copies of aggregate inode tables (special inodes used
to describe the file system) when an error code is returned for an aggregate inode
read. Second, a blank page is sometimes returned to the user (RGuess ), although
we believe this is not by design (i.e., it is a bug); for example, this occurs when
a read to an internal tree block does not pass its sanity check. Third, some bugs
limit the utility of JFS recovery; for example, although generic code detects read
errors and retries, a bug in the JFS implementation leads to ignoring the error and
corrupting the file system. Finally, we present the cases of journaling semantics
violations during write failures in JFS.
Not Replaying Failed Checkpoint Writes: When a checkpoint write fails, JFS
does not rewrite it or mark the transaction for a replay. JFS simply ignores this
error, which can lead to a corrupted file system. This behavior is similar to that
of ext3. Since both these file systems do not record failed checkpoint writes, they
have no way of identifying which transactions must be replayed again.
Committing, Checkpointing, and Replaying Failed Transactions: We find that
all five journaling file systems commit and checkpoint a transaction on an or-
dered block write failure. JFS does not notify the application of an ordered write
failure and commits the transaction. This can lead to data corruption.
Failing to Recover: When a journal block write fails, JFS does not abort the failed
transaction but commits it. If a crash happens after a journal write failure, the
logredo routine of JFS fails because of an unrecognized log record type. This can
lead to an unmountable file system.
4.3.4 XFS
We perform a preliminary analysis of failure policy for write failures in XFS. We
fail the writes of the following block types in XFS: ordered data, journal blocks
containing both the metadata and commit records, journal blocks with unmount
records, and checkpoint blocks.
Detection
XFS detects most write failures using the error code (DErrorCode ), including the
asynchronous checkpoint write errors, which are ignored by ext3 and JFS. XFS
does not always handle write failures correctly. We find the same flaw that we found in
ext3, ReiserFS, and JFS: when an ordered data block write fails,
XFS continues to log the failed transaction to the journal, resulting in data corruption.
Recovery
XFS employs different recovery techniques for different block failures. It retries
checkpoint-write failures (which are asynchronous) repeatedly without bound
(RRetry). If the block error is transient, the repeated retries eventually write
the block; on permanent errors, the failed block write is retried indefinitely. If one
unmounts the file system to rectify the write error, XFS crashes (RStop). On journal
write failures, XFS reacts immediately by stopping the file system (RStop ).
Table 4.9: IRON Techniques Summary. The table depicts a summary of the IRON tech-
niques used by the file systems under test. More check marks (√) indicate a higher relative
frequency of usage of the given technique.
Table 4.10: File System Bugs. This table gives a summary of the bugs we identified in
ext3, ReiserFS, IBM JFS, and XFS.
With writes, the number of retries varies (e.g., three times for data blocks, two
times for MFT blocks).
cluding XFS and NTFS) and Table 4.10 lists the bugs we found in our file system
analysis.
• Ext3: Overall simplicity. Ext3 implements a simple and mostly reliable failure
policy, matching the design philosophy found in the ext family of file systems. It
checks error codes, uses a modest level of sanity checking, and recovers by prop-
agating errors and aborting operations. The main problem with ext3 is its failure
handling for write errors, which are ignored and cause serious problems including
possible file system corruption.
• ReiserFS: First, do no harm. ReiserFS is the most concerned about disk fail-
ure. This concern is particularly evident upon write failures, which often induce a
panic; ReiserFS takes this action to ensure that the file system is not corrupted.
ReiserFS also uses a great deal of sanity and type checking. These behaviors com-
bine to form a Hippocratic failure policy: first, do no harm.
• JFS: The kitchen sink. JFS is the least consistent and most diverse in its failure
detection and recovery techniques. For detection, JFS sometimes uses sanity checks, some-
times checks error codes, and sometimes does nothing at all. For recovery, JFS
sometimes uses available redundancy, sometimes crashes the system, and some-
times retries operations, depending on the block type that fails, the error detection
mechanism, and the API that was called.
• XFS: Simple and well-defined. From our preliminary analysis, we find that XFS
has a simple and well-defined failure policy to handle write failures. It checks error
codes and on synchronous write failures, XFS stops the file system and propagates
errors. On asynchronous write failures, the failed write is retried persistently.
• NTFS: Persistence is a virtue. Compared to several Linux file systems, NTFS is
more persistent, retrying failed requests many times before giving up. It also seems
to propagate errors to the user quite reliably. However, more thorough testing of
NTFS is needed in order to broaden these conclusions (a part of our ongoing work).
file system. JFS is the most illogically inconsistent, employing different techniques
in scenarios that are quite similar.
     We note that inconsistency in and of itself is not problematic [37]; for example,
it would be logically inconsistent (and a good idea, perhaps) for a file system to
provide a higher level of redundancy to data structures it deems more important,
such as the root directory [106]. What we are criticizing are inconsistencies that
are undesirable (and likely unintentional); for example, JFS will attempt to read the
alternate superblock if a read failure occurs when reading the primary superblock,
but it does not attempt to read the alternate if it deems the primary corrupted.
     In our estimation, the root cause of illogical inconsistency is failure policy dif-
fusion; the code that implements the failure policy is spread throughout the kernel.
Indeed, the diffusion is encouraged by some architectural features of modern file
systems, such as the split between generic and specific file systems. Further, we
have observed some cases where different developers implement different portions
of the code and hence implement different failure policies (e.g., one of the few cases
in which ReiserFS does not panic on write failure arises due to this); perhaps this
inconsistency is indicative of the lack of attention paid to failure policy. We address
this problem by designing a centralized failure handler for file systems,
which we discuss in Chapter 6.
• Detection and Recovery: Bugs are common. We also found numerous bugs
across the file systems we tested, some of which are serious, and many of which
are not found by other sophisticated techniques [129]. We believe this is generally
indicative of the difficulty of implementing a correct failure policy; it certainly hints
that more effort needs to be put into testing and debugging of such code. One sug-
gestion in the literature that could be helpful would be to periodically inject faults
in normal operation as part of a “fire drill” [80]. Our method reveals that testing
needs to be broad and cover as many code paths as possible; for example, only by
testing the indirect-block handling of ReiserFS did we observe certain classes of
fault mishandling.
• Detection: Error codes are sometimes ignored. Amazingly (to us), error codes
were sometimes clearly ignored by the file system. This was most common in JFS,
but found occasionally in the other file systems. Perhaps a testing framework such
as ours should be a part of the file system developer’s toolkit; with such tools, this
class of error is easily discovered.
• Detection: Sanity checking is of limited utility. Many of the file systems use
sanity checking to ensure that the metadata they are about to use meets the expec-
tations of the code. However, modern disk failure modes such as misdirected and
phantom writes lead to cases where the file system could receive a properly for-
matted (but incorrect) block; the bad block thus passes sanity checks, is used, and
can corrupt the file system. Indeed, all file systems we tested exhibit this behavior.
Hence, we believe stronger tests (such as checksums) should be used.
• Recovery: Stop is useful – if used correctly. Most file systems employed some
form of RStop in order to limit damage to the file system when some types of errors
arose; ReiserFS is the best example of this, as it calls panic on virtually any
write error to prevent corruption of its structures. However, one has to be careful
with such techniques. For example, upon a write failure, ext3 tries to abort the
transaction, but does not correctly squelch all writes to the file system, leading to
corruption. Perhaps this indicates that fine-grained rebooting is difficult to apply in
practice [25].
• Recovery: Stop should not be overused. One downside to halting file system
activity in reaction to failure is the inconvenience it causes: recovery takes time and
often requires administrative involvement to fix. However, all of the file systems
used some form of RStop when something as innocuous as a read failure occurred;
instead of simply returning an error to the requesting process, the entire system
stops. Such draconian reactions to possibly temporary failures should be avoided.
• Recovery: Retry is underutilized. Most of the file systems assume that failures
are not transient, or that lower layers of the system handle such failures, and hence
do not retry requests at a later time. The systems that employ retry generally assume
read retry is useful, but write retry is not; however, transient faults due to device
drivers or transport issues are equally likely to occur on reads and writes. Hence,
retry should be applied more uniformly. NTFS is the lone file system that embraces
retry; it is willing to issue a much higher number of requests when a block failure
is observed.
• Recovery: Automatic repair is rare. Automatic repair is used rarely by the file
systems; instead, after using an RStop technique, most of the file systems require
manual intervention to attempt to fix the observed problem (i.e., running fsck).
• Detection and Recovery: Redundancy is not used. Finally, and perhaps most
importantly, while virtually all file systems include some machinery to detect disk
failures, none of them apply redundancy to enable recovery from such failures. The
lone exception is the minimal amount of superblock redundancy found in JFS; even
this redundancy is used inconsistently. Also, JFS places the copies in close proxim-
ity, making them vulnerable to spatially-local errors. As it is the least explored and
potentially most useful in handling the failures common in drives today, we next
investigate the inclusion of various forms of redundancy into the failure policy of a
file system.
4.4 Conclusion
Commodity file systems use a variety of detection and recovery techniques to han-
dle disk sector errors. In this chapter, we explain the process of semantic failure
analysis, wherein we apply our knowledge about file system block types and trans-
actional semantics to unearth the failure policies. From our analysis, we find that
commodity file systems are built mostly with the assumption that disks fail in a fail-
stop manner and, therefore, store little or no redundant information on disk. We also
find that the failure policies enacted by a file system are often ad hoc, inconsistent,
and buggy. We conclude that it is time to rearchitect file systems with more redundant
information to handle partial disk errors and with a framework that can support
well-defined failure policies.
Chapter 5
As we mention in the Introduction (Section 1.2), there are two challenges in build-
ing a robust file system. First, we must design low-level redundancy machinery to
detect and recover from partial disk errors. Second, we need a new failure handling
framework to unify the different pieces of low-level machinery with various partial disk er-
rors and failure policies. In this chapter, we focus on the first problem, and design
and implement new redundancy techniques within a single disk drive (the second
problem is addressed in the next chapter). Note that the challenge of building cost
effective low-level machinery is especially important in the context of commodity
file systems because currently they store little or no redundant information on disk.
     A well-known technique in building reliable systems is to use redundant com-
ponents and specialized hardware mechanisms, not only to improve reliability but
also to keep the performance from degrading [16]. However, commodity systems
such as desktops and laptops may not be able to afford redundant components such
as extra disk drives or special hardware support like NVRAM (both of which are
usually available in high-end systems) as they will result in an increased system
cost, system size, and heat dissipation. Therefore, we specifically focus on stor-
ing redundant information within a single disk drive with no additional hardware
support.
     Redundant information can be stored and used in various ways, where each
technique has its own merits and demerits. We explore three types of redundant
information in this thesis: checksums, parity, and replicas, due to their widespread
use in high-end robust file and storage systems [16, 20, 21, 28, 71, 79, 96, 110].
We first describe and qualitatively evaluate the robustness, performance, and space
copies to achieve atomicity, which in turn can cause additional rotational latencies
and disk seeks. Fourth, a portion of the host main memory might be used to hold
redundant information such as checksums, which can affect the performance as it
reduces the amount of available memory to the overall system. Finally, it is compu-
tationally expensive to produce certain types of redundant information such as the
Reed-Solomon or Tornado error correction codes [69], and such processing may
use a significant amount of the CPU power available in a system.
Space Overheads: Not surprisingly, redundant information incurs space overheads
as well, and the overhead varies according to the redundancy technique used. For exam-
ple, although stronger at detecting corruption, a cryptographic checksum such as
SHA-1 occupies more bytes than a simple CRC-type checksum. Similarly, parity
blocks, replicas, and error correction codes each occupy different amounts of space
depending on how many copies are maintained.
    Table 5.1 presents a qualitative evaluation of effectiveness, performance, and
space overheads of various redundancy techniques. Different redundancy tech-
niques are listed under each row and the effectiveness under a single block error,
spatially local errors that spread up to t disk tracks, and spatially local errors that
spread beyond t disk tracks are listed under different columns along with the per-
formance and space cost. We also study whether a technique can recover from a
failure both during normal file system operation and journal recovery. A √ mark
means that a particular technique is guaranteed to detect or recover from the failure
whereas a × mark represents that no guarantee can be given for failure detection
or recovery.
    We now explain the table entries in detail. Broadly, checksums can be stored
in three classes of locations: within a sector, at a neighboring block, or at a block
separated by t tracks. If the checksum is stored within a sector, its update can be guar-
anteed to be atomic due to the basic write guarantees offered by a disk drive. If it is
separated from the data, the checksum update can be either atomic or non-atomic. Although
ChecksumWithin is atomic, it is not guaranteed to detect phantom-write or mis-
directed-write errors. While the space and time costs of checksums are generally
tolerable, atomic techniques might require extra writes to the journal, which can
increase the performance overhead.
    In the case of parity, it can either be calculated over a set of N contiguous blocks
or, if one wants to tolerate spatially correlated failures, over blocks that are
separated by, say, t tracks. Parity updates can be made atomic as well; how-
ever, logging both the parity and its corresponding blocks in the journal can decrease
performance significantly. Finally, replicas incur high space and performance
overheads. One can use the journal to ensure atomic updates, although with severe
                                         Effectiveness                                             Overheads
                           Single block        Spatial (<= t tracks)   Spatial (> t tracks)
 Techniques                Normal FS Recovery  Normal FS Recovery      Normal FS Recovery      Time    Space
 ChecksumWithin               ×         ×         ×         ×             ×         ×          low     low
 Checksumd=0,Non-atomic       √         ×         ×         ×             ×         ×          low     low
 Checksumd=0,Atomic           √         √         ×         ×             ×         ×          high    low
 Checksumd=t,Non-atomic       √         ×         √         ×             ×         ×          low     low
 Checksumd=t,Atomic           √         √         √         √             ×         ×          high    low
 Parityd=0,Non-atomic         √         ×         ×         ×             ×         ×          low     low
 Parityd=0,Atomic             √         √         ×         ×             ×         ×          high    low
 Parityd=t,Non-atomic         √         ×         √         ×             ×         ×          low     high
 Parityd=t,Atomic             √         √         √         √             ×         ×          high    high
 Replicad=0,Non-atomic        √         ×         ×         ×             ×         ×          high    high
 Replicad=0,Atomic            √         √         ×         ×             ×         ×          high    high
 Replicad=t,Non-atomic        √         ×         √         ×             ×         ×          high    high
 Replicad=t,Atomic            √         √         √         √             ×         ×          high    high
Table 5.1: Qualitative Evaluation of Redundancy Techniques. The table presents a qualitative evaluation of the effectiveness,
performance, and space overheads of different redundancy techniques to detect and recover from sector errors or corruption. A √
mark means that a particular technique is guaranteed to detect or recover from the failure whereas a × mark represents that no
guarantee can be given for failure detection or recovery.
performance penalties.
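For concreteness, the sketch below shows the basic XOR parity computation over a redundancy set; the block size and the comment on how set members might be placed are illustrative assumptions, not the exact scheme evaluated later in this chapter.

    /* Illustrative XOR parity over a redundancy set of n blocks. */
    #include <stddef.h>
    #include <string.h>

    #define BLOCK_SIZE 4096

    /* With any single block of the set lost, XOR-ing the parity with the
     * surviving blocks reconstructs the missing block's contents. */
    static void compute_parity(unsigned char *const blocks[], size_t n,
                               unsigned char parity[BLOCK_SIZE])
    {
        memset(parity, 0, BLOCK_SIZE);
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < BLOCK_SIZE; j++)
                parity[j] ^= blocks[i][j];
    }

    /* To tolerate spatially correlated faults, the members of a set can be
     * chosen t tracks apart, e.g., block numbers b, b + t*blocks_per_track,
     * b + 2*t*blocks_per_track, and so on (a simplified placement rule). */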
    In conclusion, we can observe from the table that the more effective a redun-
dancy technique is in handling various errors, the higher its performance and/or space
overhead becomes. In the following sections, we implement various redundancy
techniques from the table and evaluate their robustness and cost quantitatively.
5.2.1 Implementation
We now describe the ixt3 implementation. We explain how we add checksumming,
metadata replication, user parity, and a new performance enhancement known as
transactional checksums into the existing ext3 file system framework.
Checksumming: To implement checksumming within ixt3, we borrow techniques
from other recent research in checksumming in file systems [79, 110]. Specifically,
we compute checksums for each file system block and treat the checksums as metadata,
placing them first into the journal and then checkpointing them to their final
location, which is at the end of the file system. The checksums are separated from
system must recompute the parity and make it consistent with the modified data
block contents. However, in ext3 ordered journaling mode, no additional infor-
mation is stored in the journal about the locations of data blocks that are being
modified.1 To address this problem, Denehy et al. introduce a new journaling
mode called declared journaling mode, where they store the block numbers of data
blocks that are being modified as part of a transaction commit into the journal [30].
To support parity for data blocks, we add declared mode to ixt3. In summary, we
implement the Parityd=0,Non-atomic design for parity blocks in ixt3.
Transactional Checksums: We also explore a new idea for leveraging checksums
in a journaling file system; specifically, checksums can be used to relax ordering
constraints and thus to improve performance. In particular, when updating its jour-
nal, standard ext3 ensures that all previous journal data reaches disk before the
commit block; to enforce this ordering, standard ext3 induces an extra wait be-
fore writing the commit block, and thus incurs extra rotational delay. To avoid this
wait, ixt3 implements what we call a transactional checksum, which is a checksum
over the contents of a transaction. By placing this checksum in the journal commit
block, ixt3 can safely issue all blocks of the transaction concurrently. If a crash oc-
curs during the commit, the recovery procedure can reliably detect the crash and not
replay the transaction, because the checksum over the journal data will not match
the checksum in the commit block. We use SHA-1 for transactional checksums
also. Note that a transactional checksum provides the same crash semantics as in
the original ext3 and thus can be used without other IRON extensions.
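A simplified sketch of the commit-time and recovery-time use of a transactional checksum is shown below; a simple 32-bit running hash stands in for SHA-1, and all names are illustrative rather than the actual ixt3 code.

    /* Illustrative transactional checksum: a checksum over the journal blocks
     * of a transaction is stored in the commit block, so all blocks can be
     * issued concurrently and a torn commit is detected at recovery time. */
    #include <stddef.h>
    #include <stdint.h>

    #define JBLOCK_SIZE 4096

    static uint32_t checksum_block(const unsigned char *block, uint32_t sum)
    {
        for (size_t i = 0; i < JBLOCK_SIZE; i++)
            sum = (sum << 5) + sum + block[i];   /* djb2-style running hash */
        return sum;
    }

    /* At commit time: checksum all journal blocks of the transaction and
     * record the result in the commit block before issuing all writes. */
    uint32_t transaction_checksum(unsigned char *const jblocks[], size_t n)
    {
        uint32_t sum = 5381;
        for (size_t i = 0; i < n; i++)
            sum = checksum_block(jblocks[i], sum);
        return sum;
    }

    /* At recovery time: replay the transaction only if the recomputed
     * checksum matches the one recorded in the commit block. */
    int transaction_is_valid(unsigned char *const jblocks[], size_t n,
                             uint32_t recorded_sum)
    {
        return transaction_checksum(jblocks, n) == recorded_sum;
    }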
Cleaning Overheads: Note that “cleaning overhead”, which can be a large prob-
lem in pure log-structured file systems [93, 101], is not a major performance issue
for journaling file systems, even with ixt3-style checksumming and replication.
Journaling file systems already incorporate cleaning into their on-line maintenance
costs; for example, ext3 first writes all metadata to the journal and then cleans the
journal by checkpointing the data to a final fixed location. Hence, the additional
cleaning performed by ixt3 increases total traffic only by a small amount.
5.2.2 Evaluation
We now evaluate our prototype implementation of ixt3. We focus on three ma-
jor axes of assessment: robustness of ixt3 (with checksums, parity, and replicas)
to modern disk failures, and both the time and space overhead of the additional
redundancy mechanisms employed by ixt3.
     1
         The same problem exists in other journaling file systems such as ReiserFS, JFS, XFS, and NTFS.
[Figure 5.1 appears here: for read failure, write failure, and corruption faults, detection and recovery grids plot the ixt3 block types against workloads a-t.]
Figure 5.1: Ixt3 Failure Policy. The tables plot both detection and recovery policies of
ixt3 for read, write, and corruption faults injected for each block type across a range of
workloads. The workloads are varied across the columns of the figure, and the different
block types of the ixt3 file system are varied across the rows. The workloads are grouped
in the same manner as in Figure 4.5. The keys for detection and recovery are presented in
Table 4.8.
Robustness: To test the robustness of ixt3, we harness our fault injection frame-
work, running the same partial-failure experiments on ixt3. The results are shown
in Figure 5.1.
     Ixt3 detects read failures in the same way as ext3, by using the error codes from
the lower level (DErrorCode ). When a metadata block read fails, ixt3 reads the
corresponding replica copy (RRedundancy ). If the replica read also fails, it behaves
like ext3 by propagating the error (RP ropagate ) and stopping the file system activity
(RStop ). When a data block read fails, the parity block and the other data blocks of
the file are read to compute the failed data block’s contents (RRedundancy ).
     Ixt3 detects write failures using error codes as well (DErrorCode ). It then aborts
the journal and mounts the file system as read-only to stop any writes from going
to the disk (RStop ).
     When a data or metadata block is read, the checksum of its contents is computed
and is compared with the corresponding checksum of the block (DRedundancy ). If
the checksums do not match, a read error is generated (RP ropagate ). On read errors,
the contents of the failed block are read either from the replica or computed using
the parity block (RRedundancy ).
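The read path described above can be condensed into the following sketch; the block
abstraction and helper routines are hypothetical stand-ins for the ixt3 internals, shown
only to make the detection and recovery ordering explicit.

/* Condensed sketch of the detection/recovery sequence described above.
 * All helper functions and the block type are hypothetical stand-ins,
 * not the actual ixt3 interfaces. */
#include <errno.h>

struct ixt3_block;                                     /* opaque for this sketch  */

extern int  disk_read(struct ixt3_block *b);           /* 0 on success            */
extern int  disk_read_replica(struct ixt3_block *b);   /* metadata replica copy   */
extern int  rebuild_from_parity(struct ixt3_block *b); /* parity + sibling blocks */
extern int  checksum_matches(struct ixt3_block *b);    /* D_Redundancy            */
extern int  is_metadata(struct ixt3_block *b);
extern void propagate_error(struct ixt3_block *b);     /* R_Propagate             */
extern void stop_filesystem(void);                     /* R_Stop                  */

int ixt3_read_block(struct ixt3_block *b)
{
    /* Detection: lower-level error code (D_ErrorCode) plus block checksum. */
    if (disk_read(b) == 0 && checksum_matches(b))
        return 0;

    if (is_metadata(b)) {
        /* Recovery: fall back to the replica copy (R_Redundancy). */
        if (disk_read_replica(b) == 0 && checksum_matches(b))
            return 0;
        propagate_error(b);     /* behave like ext3 if the replica also fails */
        stop_filesystem();
        return -EIO;
    }

    /* Data block: recompute its contents from the parity block and the
     * other data blocks of the file (R_Redundancy). */
    if (rebuild_from_parity(b) == 0)
        return 0;
    propagate_error(b);
    return -EIO;
}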
     In the process of building ixt3, we also fixed numerous bugs within ext3. By
doing so, we avoided some cases where ext3 would commit failed transactions to
disk and potentially corrupt the file system [85].
     Overall, by employing checksumming to detect corruption, and replication and
parity to recover lost blocks, ixt3 provides robust file service in spite of partial
disk failures. More quantitatively, ixt3 detects and recovers from over 200 possible
different partial-error scenarios that we induced. The result is a logical and well-
defined failure policy.
Time Overhead: We now assess the performance overhead of ixt3. We isolate the
overhead of each IRON mechanism by enabling checksumming for metadata (Mc )
and data (Dc ), metadata replication (Mr ), parity for user data (Dp ), and transac-
tional checksumming (Tc ) separately and in all combinations.
     We use four standard file system benchmarks: SSH-Build, which unpacks and
compiles the SSH source distribution; a web server benchmark, which responds
to a set of static HTTP GET requests; PostMark [61], which emulates file system
traffic of an email server; and TPC-B [117], which runs a series of debit-credit
transactions against a simple database. We run each experiment five or more times
and present the average results.
     The SSH-Build time measures the time to unpack, configure, and build the SSH
source tree (the tar’d source is 11 MB in size); the web server benchmark transfers
25 MB of data using HTTP requests; with PostMark, we run 1500 transactions with
file sizes ranging from 4 KB to 1 MB, with 10 subdirectories and 1500 files; with
TPC-B, we run 1000 randomly generated debit-credit transactions. These bench-
marks exhibit a broad set of behaviors. Specifically, SSH-Build is a good (albeit
simple) model of the typical action of a developer or administrator; the web server
is read intensive with concurrency; PostMark is metadata intensive, with many file
creations and deletions; TPC-B induces a great deal of synchronous update traffic
to the file system.
    Table 5.2 reports the relative performance of the variants of ixt3 for the four
workloads, as compared to stock Linux ext3. Along the rows, we vary which re-
dundancy technique is implemented, in all possible combinations: Mc implies that
metadata checksumming is enabled; Dc that data checksumming is enabled; Mr
that replication of metadata is turned on; Dp that parity for data blocks is enabled;
Tc that transactional checksums are in use. All results are normalized to the perfor-
mance of standard Linux ext3; for the interested reader, running times for standard
ext3 on SSH-Build, Web, PostMark, and TPC-B are 117.78, 53.05, 150.80, and
58.13 seconds, respectively. Slowdowns greater than 10% are marked in bold,
whereas speedups relative to base ext3 are marked in [brackets]. All testing is done
on the Linux 2.6.9 kernel on a 2.4 GHz Intel P4 with 1 GB of memory and a West-
ern Digital WDC WD1200BB-00DAA0 disk.
We now explain the reasons behind some of the performance overheads. Check-
summing data and metadata does not add much overhead for most workloads (i.e.,
less than 5% in most cases). However, data checksumming increases the cost of
PostMark by about 13%. This is because our PostMark benchmark writes over
1.4 GB of data, with checksums calculated for every data block. Since checksums
are treated as metadata, when data checksums are coupled with metadata replica-
tion, the overhead on PostMark rises to about 26%. When applied individually,
metadata replication reduces performance for metadata-intensive workloads such
as PostMark and TPC-B, while causing little overhead for the SSH-Build and web
server benchmarks. Parity for data blocks does not add much cost individually, but
when combined with metadata replication, it raises the overhead to as much as
39% for TPC-B. This behavior is due to the interaction between the declared
mode [30] and metadata replication. In declared mode, extra synchronous writes
are issued to the journal, which in addition to incurring rotational latencies also
incur disk seeks between the primary and secondary journal. However, when
coupled with transactional checksums, this overhead drops to about 20%.
   From these numbers, we draw three main conclusions. First, for both SSH-
Build and the web server workload, there is little time overhead, even with all
Table 5.3: Space Overhead. The above table lists the space overhead due to redundant
information across nine different home directories. Inode blocks are allocated statically
in ext3, which adds about 64 MB of replica overhead per file system. Checksums for each
block cause a constant space overhead of about 0.5% per file system.
Space Overhead: To evaluate space overhead, we measured nine local file
systems (mostly home directories of graduate students in the UW Computer Sciences
Department) and computed the increase in space required if all metadata were repli-
cated, room for checksums was included, and an extra block for parity was allo-
cated. Table 5.3 presents a detailed list of space overhead numbers. Each row lists
the space cost per file system. For each file, one parity block is allocated. Directory
and indirect blocks are dynamic metadata blocks, whose numbers vary across file sys-
tems, while inode blocks and bitmap blocks are allocated statically when the file
system is created. Checksums cause only a constant, small space overhead of about
0.5%. Overall, we found that the space overhead of checksumming and metadata
replication is small, in the 3% to 10% range. We found that the parity-block overhead
for all user files is a bit more substantial, in the range of 3% to 17% depending on
the volume analyzed.
5.2.3 Summary
We investigate a family of redundancy techniques, and find that ixt3 greatly in-
creases the robustness of the file system under partial failures while incurring mod-
est time and space overheads. We also see that parity blocks can strike a balance
in terms of robustness, performance, and space overheads, especially for user data,
which occupies a significant part of a local file system. We believe that ixt3 rep-
resents just a single point in a large space of possible IRON file systems. In the
following sections, we explore other designs, focusing specifically on parity blocks.
way they interact with the file system journaling when ordering their block reads
and writes. The overwrite technique does not guarantee atomic update of data
and parity (ParityN on-atomic ) but no-overwrite technique provides this atomicity
(ParityAtomic ). Finally, we evaluate the performance overhead of overwrite and
no-overwrite techniques.
redundancy set(k,1) that can tolerate only up to 1 failure, and cannot recover from
multiple failures. 2 We define two events:
F1 : A block, Bi , in the redundancy set fails.
F2 : At least one more block in the redundancy set fails.
    When event F2 happens, the file system cannot recover the failed blocks from
the redundancy set. P (F2 |F1 ) gives the probability that a second failure happens
given that a first failure has already occurred. Using conditional probability,
                              P(F2 | F1) = P(F2 ∩ F1) / P(F1).
The objective of the data layout algorithm is to select blocks such that the prob-
ability of more than one block failure occurring in the same redundancy set is below
a certain threshold T. That is, we want P(F2 | F1) ≤ T.
     We calculate the probability of more than one block failure under two different
cases. First, we consider block errors as independent events and derive the probabil-
ity for 2 or more failures in a redundancy set. Then, we incorporate spatial locality
into the model and solve for a more generic case. We denote P (F2 ∩ F1 ) by P (Er ),
where Er denotes the event of two or more failures happening in a redundancy set
r containing k blocks.
     Case 1: We can treat block failures as independent events and assume that
they are uniformly spread on the disk. That is, if there are n blocks on the disk and
some block fails, then the probability that a block B fails is 1/n. If the block failures
are independent, then P (Er ) is given by:
2 Note that we interchangeably use the term “multiple failures” to refer to 2 or more failures.
failures may be correlated and the probability of failure can differ according to the
location of a block with respect to other faulty blocks on the disk.
     We propose that the spatial dependencies among a set of sites can be expressed
using Markov Random Fields (MRF). MRF gives the probability distribution at a
site i, specified conditionally on the sites in the neighborhood of i. The neighbor-
hood of a site i is defined by some function. For example, all the sites within a
radius of R from site i could be defined as neighbors of i.
     We give a brief introduction to MRF and then discuss how we can express the
dependencies between the disk block failures as an MRF (a more detailed treatment
of MRF can be found elsewhere [68]). Let F be a finite collection of random
variables i.e., F = {V1 , ..., Vn }. Each random variable Vi corresponds to a site i
and takes a value vi from a set Li . Let Ni be the set of neighbors of i. F is an MRF
if and only if the random variable Vi depends only on its neighbors. More formally,
For the sake of simplicity, let all the mi blocks be at an exact distance of di from
the failed block Bi . The probability of two or more failures in a redundancy set
given that block Bi has failed is given by:
P (At least one more block out of mi fails | Bi has failed)×P (Block Bi has failed)
⇒ [1 − P (None of the mi blocks fail | Bi has failed)] × P (Block Bi has failed)
                        ⇒ [1 − (1 − P(di))^mi] × (1/n)                          (5.5)
     where (1 − P(di))^mi is the probability that none of the mi blocks fails. In practice,
the mi blocks are at least at distance di, so Equation 5.5 gives an upper bound
on the probability of multiple failures. Our objective is to
find the probability of multiple failures for all such Bi in the redundancy set r. This
is given by,
                  P(Er) = Σ_{i=1}^{k} [1 − (1 − P(di))^mi] × (1/n)              (5.6)
     Equations 5.3 and 5.6 give the probability of 2 or more failures in a redundancy
set r with k blocks. A disk with n blocks has n/k redundancy sets. To find the
probability of multiple failures happening over the entire disk, one must consider
all n/k redundancy sets while calculating the probability using either Equation 5.3
or 5.6. Let q = n/k. We derive the probability of multiple failures within at least one
redundancy set (represented by P (M)) over the entire disk as follows:
                        P(M) = 1 − Π_{i=1}^{q} [1 − P(Ei)]                      (5.7)

where Π_{i=1}^{q} [1 − P(Ei)] gives the probability of no multiple failures happening in
any of the redundancy sets.
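The following self-contained C program sketches how Equations 5.6 and 5.7 could be
evaluated numerically; the form of P(d), the disk size, and the redundancy set
parameters are assumed values chosen only for illustration.

/* Numerical sketch of Equations 5.6 and 5.7 (illustrative values only).
 * p_of_d models P(d), the probability that a block at distance d from a
 * failed block also fails; the exponential form here is an assumption. */
#include <stdio.h>
#include <math.h>

static double p_of_d(double d)
{
    return 0.5 * exp(-d / 50.0);            /* assumed decay curve for P(d) */
}

/* Equation 5.6: probability of 2+ failures in one redundancy set of k
 * blocks, where block i has m_i neighbors at distance d_i. */
static double p_multi_in_set(int k, const double *d, const int *m, double n)
{
    double sum = 0.0;
    for (int i = 0; i < k; i++)
        sum += (1.0 - pow(1.0 - p_of_d(d[i]), m[i])) * (1.0 / n);
    return sum;
}

int main(void)
{
    double n = 12500000.0;                  /* blocks on a 50 GB disk (4 KB blocks) */
    int    k = 4;                           /* redundancy set size                  */
    double d[] = { 100, 100, 100, 100 };    /* distance of neighbors from block i   */
    int    m[] = { 3, 3, 3, 3 };            /* neighbors per block in the set       */

    double p_er = p_multi_in_set(k, d, m, n);
    double q    = n / k;                    /* number of redundancy sets            */

    /* Equation 5.7: P(M) = 1 - prod_i (1 - P(E_i)); with identical sets,
     * the product collapses to a power. */
    double p_m = 1.0 - pow(1.0 - p_er, q);
    printf("P(Er) = %.3e, P(M) = %.3e\n", p_er, p_m);
    return 0;
}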
[Figure 5.2 appears here: three plots (Exponential, Linear, and Step-like distributions)
of P(failure) versus distance d, each showing the curve p(d) and the resulting
multiple-failure probabilities for k = 50 and k = 400.]
tion is available today (due to the proprietary nature of the disk industry), they can
be made available in the future by disk manufacturers or users of large disk farms.
Since there are no numbers available to choose for P (d), we use different distribu-
tion functions for P (d) to compute the probability of multiple errors happening on
at least one redundancy set over the entire disk.
    In Figure 5.2, we plot the probability of multiple failures under three different
distributions for P (d): one, an exponential decay curve, where the probability of
[Figure 5.3 appears here: a diagram of disk tracks 1, 2, ..., 101, 102 showing how the
blocks of a file are assigned alternately to redundancy set 1 and redundancy set 2.]
Figure 5.3: Block Layout Using the Failure Model. The above figure shows an example
of how redundancy sets are constructed from various disk tracks. Note that the physical
contiguity of logically contiguous blocks is not affected. For example, all the blocks in a
file can still be allocated contiguously.
the assumption that file systems can observe the disk geometry and control block
placements. Logical-block-to-physical-disk-address mappings can be obtained for
SCSI disks using low-level SCSI commands. However, such features may not be
available for low-end ATA drives; for those drives, the disk characteristics can be
extracted by running microbenchmarks similar to the ones developed by Talagala et
al. [114]. Finally, lower layers of the storage stack can reassign blocks to different
locations on the disk, breaking the layout pattern suggested by the file system [111].
For example, disk drives themselves can perform bad-block remapping, which can
invalidate the file system's assumptions about where disk blocks are located. However,
in modern systems, such remapping is rare and the locations of the remapped blocks
can be obtained through low-level disk commands.
Model Evaluation
[Figure 5.4 appears here: plots of the probability of multiple failures versus distance d
under exponential, linear, and step distributions, comparing the model and the
simulation for k = 50 and k = 400.]
Figure 5.4: Comparing the Probability of Multiple Failures Using Failure Model and
Simulation.
random and failed. Following the first failure, the other blocks from the same parity
set as B are considered for the second failure. Depending on the distance of a block
from the failed block B, we compute the chances of the second failure and inject
the second fault probabilistically. We run the experiment for 10,000 iterations on
a simulated disk of 50 GB. The probability of a second (or higher) failure is
computed as the fraction of the 10,000 runs in which a second or subsequent fault
was injected. We repeat the experiment for different parity set
sizes and block distances. In Figure 5.4, we plot the probability of 2 or more failures
from Equation 5.7 along with the results of the fault injection tests. From Figure 5.4,
we can see that the curves from the model and the experiments match closely.
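A minimal sketch of this simulation is shown below; the parity set size, block spacing,
and the assumed form of P(d) are illustrative parameters rather than the exact values
used in the thesis experiments.

/* Sketch of the fault-injection simulation described above: a block in a
 * parity set of size k fails, and each of the remaining k-1 blocks then
 * fails independently with probability P(d) based on its distance from
 * the failed block.  Parameters are illustrative, not those of the thesis. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static double p_of_d(double d) { return 0.5 * exp(-d / 50.0); }  /* assumed P(d) */

int main(void)
{
    const int    iterations = 10000;
    const int    k          = 50;        /* parity set size              */
    const double spacing    = 100.0;     /* distance between set blocks  */
    int multi_failures = 0;

    srand(42);
    for (int it = 0; it < iterations; it++) {
        int failed_idx = rand() % k;               /* first (random) failure */
        for (int i = 0; i < k; i++) {
            if (i == failed_idx)
                continue;
            double d = abs(i - failed_idx) * spacing;
            if ((double)rand() / RAND_MAX < p_of_d(d)) {
                multi_failures++;                  /* a second fault injected */
                break;
            }
        }
    }
    printf("P(2+ failures | first failure) ~= %.4f\n",
           (double)multi_failures / iterations);
    return 0;
}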
differs from the other by the type of blocks they write to the journal and the order
in which the blocks are written. We focus on ordered journaling mode, which is the
default mode in modern file systems such as Linux ext3, ReiserFS, IBM JFS,
XFS, and Windows NTFS, because of its lower performance overhead compared to
data journaling mode and its stronger consistency semantics compared to writeback
mode. In fact, JFS, XFS, and NTFS do not support any other journaling modes.
     In ordered journaling mode, only the metadata blocks are written to the journal.
The data blocks are written directly to their fixed locations. But, the data writes are
ordered such that they complete before the metadata blocks are committed to the
journal. For example, when an application writes to a block, first the data block on
the disk is updated. Next, the inode block and other metadata blocks are committed
to the journal. Once the transaction is committed, the inode and other metadata
blocks are later written to their final fixed locations on disk; this process is called
checkpointing.
[Figure 5.5 appears here: a diagram of the in-memory and on-disk state of the inode,
data blocks B0 ... Bk−1, parity block P, journal, and commit block as the overwrite
update proceeds through steps (a)–(h).]
Figure 5.5: Overwrite Technique for Parity Update. The figure shows the sequence of
steps followed by the file system to update the parity block in overwrite technique.
Overwrite
Figures 5.5(i) to 5.5(iii) depict the three phases of overwrite update. The first phase
is the parity update phase. When a new write is issued, the old copies of the data
block and the parity block are read from the disk (steps (a), (b), and (c)). Using
the old copies and the new write, the new parity contents are calculated (step (d)).
The second phase is the transaction commit phase in which the new data and parity
blocks are written to their fixed locations on disk (steps (e) and (f)). Once the or-
dered writes are over, the metadata blocks including the inode block are committed
to the journal (step (g)). The last phase is the checkpointing phase during which the
journaled blocks are written to their fixed locations on disk (step (h)). At the end
of the third phase, the parity set is consistent and if one of the blocks in the parity
set fails, it can be recovered using the rest of the blocks.
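Step (d) reduces to the familiar read-modify-write parity computation; the following
sketch shows the XOR update for a single 4 KB block (the block size and function name
are illustrative).

/* Sketch of the parity computation in step (d) of the overwrite technique:
 * the new parity is derived from the old data, old parity, and new data
 * without reading the rest of the parity set. */
#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE 4096

/* new_parity = old_parity XOR old_data XOR new_data */
void update_parity(const uint8_t *old_data, const uint8_t *new_data,
                   const uint8_t *old_parity, uint8_t *new_parity)
{
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        new_parity[i] = old_parity[i] ^ old_data[i] ^ new_data[i];
}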
     However, the overwrite update algorithm does not provide atomic update of
data and parity, and therefore, cannot recover from block errors during journal re-
covery. This can be explained by considering the following case: if the file system
crashes after step (e) and before the completion of step (g) in Figure 5.5(ii), then
during journal recovery, all the disk blocks in the parity set must be read to recom-
pute parity. This is due to the lack of atomicity, which results in an inconsistent
parity block after the crash. However, while reading the parity set blocks, if the file
system encounters a latent sector fault or block corruption, then the parity block
cannot be used to recover the failed block because the parity set is in an inconsis-
tent state. The root cause of the problem is in the overwrite update; by modifying
the old contents of the data block, the parity set is made inconsistent until the par-
ity block is also updated with its new contents. Note that even if the parity block
is in a consistent state during the crash (say, the crash happened after step (f) and
before step (g)), the file system cannot provide guarantees about the consistency of
the parity block, and therefore it has to read all k blocks to compute the parity. In
general, the same problem would occur if the file system crashed at any point in
time before the completion of journal commit (step (g)) in the transaction commit
phase (Figure 5.5(ii)).
No-Overwrite
In this section, we describe the no-overwrite technique to update the parity block. In
this schema, the data block is not overwritten; rather, it is written to a new location
on disk. The three phases of this technique are shown in Figures 5.6(i) to 5.6(iii).
     The first phase is the transaction commit (in contrast to the parity update in the
overwrite technique). When a write to a block is issued, instead of overwriting the
old contents, a new block is allocated and the data is written to the new location
(steps (a) and (b)). This constitutes the ordered write in the ordered journaling
mode. Note that the new block allocated must be in the same parity set as the old
block. That is, in Figure 5.6, blocks B1 and B1′ must be from the same parity group.
After the ordered writes, the metadata blocks including the inode are committed to
[Figure 5.6 appears here: a diagram of the in-memory and on-disk state during the
no-overwrite update, showing the new block B1′ being written, the parity
recomputation (steps (d)–(f)), and the journal commit and checkpoint.]
Figure 5.6: No-overwrite Technique for Parity Update. The figure shows the sequence
of steps followed by the file system to update the parity block in no-overwrite technique.
the journal (step (c)). The second phase, which is the parity update phase, happens
before checkpointing. During this phase, the old versions of the data and parity
blocks are read from the disk, and the new parity contents are calculated (steps (d),
(e), and (f)). The final phase is the checkpoint phase, where the parity block is
written to its original location on disk (step (g)), followed by the checkpoint of the
metadata blocks to their fixed locations (step (h)). Note that in step (h), the block
pointer in the inode has changed to refer to the new location of the data block.
     Since the data blocks are not overwritten, this schema provides the necessary
atomicity and can recover from block failures even during journal recovery. Con-
sidering the same example we saw before, even if the file system crashes after step
(b) in Figure 5.6(i), the parity set remains in a consistent state. Therefore, any
block error in the parity set can be recovered by reading the other blocks in the par-
ity set. A file system crash happening during any step in Figures 5.6(i) to 5.6(iii)
followed by a single block error during journal recovery can be successfully recov-
ered because at any point the parity block is either consistent with the old contents
of block B1 or with the new contents of block B1′. Since the parity block is always
in a consistent state, block errors can be recovered successfully.
     The disadvantage of this approach is that it fragments a file into several blocks,
because a new block is allocated on every overwrite. Periodically cleaning the frag-
mented blocks and reallocating them contiguously can improve file contiguity on
the disk.
[Figure 5.7 appears here: three plots comparing the overwrite (Over), no-overwrite
(No-Over), and default ext3 configurations: File Creates versus the number of files
created, and File Writes and File Reads bandwidth (Mb/s) versus the amount of data
written or read (Mb).]
Figure 5.7: Performance Under Microbenchmarks. The three graphs plot the perfor-
mance of the overwrite and no-overwrite techniques relative to ext3. Performance under
file creates, file writes, and file reads is plotted in the top, middle, and bottom graphs,
respectively.
consists of a Linux 2.6.9 kernel on a 2.4 GHz Intel P4 with 1 GB of memory and a
Western Digital WDC WD1200BB-00DAA0 disk.
and some sequential I/Os within each file. Between each phase, the file system is
unmounted to clear the cache contents. Although artificial, this helps us separate
the performance impact in each phase.
     Figure 5.7 presents the results of the microbenchmarks, from which we make the
following observations. First, in File Creates, overwrite and no-overwrite perform worse
than default ext3; this is due to the additional disk traffic incurred by reads of old
copies of blocks and writes of parity blocks. Second, in File Writes, no-overwrite
performs better than the overwrite approach. Examining the disk traffic further, we
find that this is due to the extra disk seeks incurred by the overwrite approach. If
writes are issued over multiple transactions, the overwrite technique updates the parity
on every transaction by reading the old copies of the data and parity blocks (Fig-
ure 5.5(i)), whereas the no-overwrite scheme batches all the parity updates together
just before checkpointing (Figure 5.6(ii)). This parity update batching reduces
disk seeks and improves write performance. Finally, in File Reads, in contrast to
writes, the no-overwrite scheme performs worse than the overwrite approach (which
shows no performance degradation with respect to default ext3). Since no-overwrite
allocates a new block for every ordered write, over time the data gets scattered around
the disk. Therefore, while reading back the data, we pay an additional cost in seeks
to the scattered data locations. With periodic de-fragmentation, this overhead can
be reduced.
Table 5.4: Performance Overheads of Overwrite and No-Overwrite. The table lists
the relative running overheads of four different benchmarks in overwrite and no-overwrite
techniques when compared to Linux ext3. Overheads greater than 10% are marked in bold.
head depends on the workload. For write-intensive workloads with no idle time,
such as PostMark and TPC-B, the overhead varies from 19% to 50%. Note that we run
these benchmarks with a cold cache, with no copies of disk blocks present in memory,
which results in extra overhead due to reads of old disk block copies. However,
on an actual system some blocks may be present in the cache, which we expect
would reduce some of this overhead. For SSH, which is a desktop-like workload, the
overhead is modest (9% for overwrite and 11% for no-overwrite). Second, there
is no additional overhead for the web server workload under the overwrite technique,
but no-overwrite incurs some overhead (about 4%) due to scattered file blocks (we
see similar behavior in the File Reads microbenchmark as well). Finally, al-
though the no-overwrite technique performs better than the overwrite tech-
nique on simple write-intensive microbenchmarks, its performance degrades when
the benchmark involves both reads and writes.
     In summary, the no-overwrite technique incurs less overhead than the overwrite
technique on write-intensive workloads, primarily due to its batching of parity updates.
However, the no-overwrite technique has the disadvantage of reallocating file blocks
to different locations on disk and thereby fragmenting files. Periodic cleaning with
file re-layout is necessary to maintain the physical contiguity of logically contiguous
file blocks.
5.4 Conclusion
Although redundant information has been used for years in building robust file and
storage systems, its use within a single disk has not been thoroughly studied
before. Low-end systems pose several challenges, including the lack of redundant
disk drives and NVRAM support. In this chapter, we explore the effectiveness and
cost of different redundancy techniques that use checksums, parity, and replicas to
detect and recover from errors. From our experience in building ixt3, we conclude
that while the performance cost of redundant information can be severe for
I/O-intensive workloads, the overheads are modest for typical desktop applications.
We also find that the parity approach strikes a balance in terms of effectiveness,
performance, and space overheads. Since disk sectors can be subject to spatially
correlated failures, parity blocks must be laid out carefully. We derive a simple
probabilistic equation that a file system can use to construct parity sets such that
the chance of multiple errors affecting the same set is low. Finally, we develop two
parity update techniques. While the traditional overwrite approach modifies data
blocks in place, the radical no-overwrite approach writes data to new locations and
changes block pointers. Although no-overwrite offers the atomicity
Chapter 6
In this chapter, we address the second challenge in building robust file systems:
how to couple low-level redundancy machinery with the myriad high-level
failure-handling policies. We restructure commodity file systems with
a Centralized Failure Handler (CFH) that unifies all I/O failure handling at a single
point. We first describe the design and implementation and then present an
evaluation of the system.
policies, where different failure-handling techniques are unintentionally applied even
under similar failure scenarios. For example, in Figure 4.5, presented in
Section 4.3.1, we can observe that for similar indirect block read errors, four dif-
ferent recovery actions are taken by ext3: zero (RZero ), propagate (RP ropagate ),
stop (RStop ), and retry (RRetry ). By inspecting the source code, we confirm that this
inconsistent failure handling is due to ad hoc application of recovery techniques.
Tangled Policies and Mechanisms: Due to the diffusion of I/O calls and their
corresponding failure handling, it becomes harder to separate failure policies (e.g.,
“detect block corruption”) from their implementation (e.g., “perform a specific sanity
check”). Because policy and mechanism are tangled in this way, neither can be mod-
ified without affecting the other, resulting in an inflexible failure-handling system.
Coarse Grained Policies: Modern file systems do not support fine-grained per-
block-type failure policies. Under different block-type failures, file systems often
use the same recovery technique, such as stopping the entire file system. In the cur-
rent file system framework, if different I/O failures must be handled with different
failure policies, the detection and recovery code must change at every place where
an I/O is issued. This increases the diffusion of failure handling and, in turn,
worsens the above-mentioned problems.
    We propose a Centralized Failure Handler (CFH) for file systems as a way to
mitigate the aforementioned problems and support well-defined failure policies.
The CFH architecture decouples failure handling mechanisms from policies, and
offers a configurable framework where a file system can specify per-block-type
failure policies.
    There are many challenges in building a CFH. First, the CFH needs semantic
information about I/O calls; that is, what block type each I/O represents. Broadly,
there are two ways to obtain this information. One, we can apply reverse engineer-
ing techniques with file system level gray-box knowledge [9, 83] to the I/O stream
and find the block types. Although similar techniques have been developed in the
past to do this to a certain extent [10, 12, 105, 106, 108], they are complex and
cannot guarantee correct block type detection due to file system asynchrony. The
other approach is to modify the file system source code to explicitly pass block
type information along with each I/O call, and we adopt this approach. Second,
even if one has access to source code, it is not always possible to get the correct
block type due to generic file system components. For example, if an I/O is issued
by a file system via the generic buffer cache layer, block type information is lost.
We modify the common journaling layer and buffer cache manager to overcome
this limitation. Finally, failure policies are harder to specify without the correct
Basic Framework
Figure 6.1 shows the framework of the CFH. I/O calls are issued from different
components of a file system such as the core file system, journaling layer, and
buffer cache manager. Each read and write is passed to the CFH, which
then issues the request to the lower-level device drivers of the disk.
     Each I/O call, besides containing standard information such as the
block number and request size, also carries semantic information about the I/O.
For example, semantic information of each I/O includes the block type, inode num-
Figure 6.1: The CFH Framework. The figure shows components of a centralized failure
handler. It has three main subsystems: a file system specific layer, a generic layer, and
threads for failure handling. The file system specific layer also maintains a failure policy
table. The generic layer contains various mechanisms such as replication, parity, and
checksums that can be commonly used by various file systems.
ber, pointers to specific detection and recovery routines, and so on. Such informa-
tion is crucial in enabling fine-grained and flexible policies at the CFH.
    The CFH basic framework consists of three main components: a file system
specific layer, a generic layer, and additional threads to handle failures. We describe
each in turn.
    File system specific layer: The file system specific layer receives requests that
are issued from a particular file system or from other layers on behalf of the file
system, and it understands the semantics of the I/O requests. For example, the ext3
specific component understands different block types of ext3.
     A file system can use generic detection and recovery routines (such as the check-
sum and retry modules), file-system-specific block-type sanity-checking modules, or
its own special recovery routines. In our current prototype implementation, the
policies are specified by the file system developer. However, extending the inter-
face to other users such as system administrators is simple. When a file system
mounts, it registers its specific detection and recovery routines. For example, ext3
could register a function pointer to ext3_inode_sanity_check() with the
CFH during its mount, which will be called every time an inode block is read or
written by ext3. Note that using this framework, a file system can even perform
sanity checking before its writes are issued to the disk, which can help find certain
software-induced or memory-related corruptions before the data reaches the disk.
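A hypothetical registration interface for this step might look as follows; the type
names, the cfh_register_fs call, and the ext3_stop recovery hook are assumptions for
illustration, not the actual CFH API.

/* Hypothetical registration interface for the file-system-specific layer.
 * At mount time the file system hands CFH one sanity-check and one
 * recovery routine per block type. */
#include <stddef.h>

enum cfh_block_type { CFH_INODE, CFH_DIR, CFH_BITMAP, CFH_DATA, CFH_NTYPES };

typedef int  (*cfh_sanity_fn)(const void *block, size_t len);
typedef void (*cfh_recover_fn)(void *fs_private, unsigned long blocknr);

struct cfh_fs_ops {
    cfh_sanity_fn  sanity[CFH_NTYPES];    /* e.g. ext3_inode_sanity_check */
    cfh_recover_fn recover[CFH_NTYPES];   /* e.g. a file system stop hook */
};

extern int cfh_register_fs(const char *fs_name, const struct cfh_fs_ops *ops);

/* Called from the file system's mount path. */
int ext3_register_with_cfh(void)
{
    static struct cfh_fs_ops ops;   /* zero-initialized; fill in known hooks */
    extern int  ext3_inode_sanity_check(const void *block, size_t len);
    extern void ext3_stop(void *fs_private, unsigned long blocknr);

    ops.sanity[CFH_INODE]  = ext3_inode_sanity_check;
    ops.recover[CFH_INODE] = ext3_stop;
    return cfh_register_fs("ext3", &ops);
}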
     Generic layer: The CFH consists of a generic layer that implements several
mechanisms such as retry, replication, parity, remap, and different types of check-
sums, all of which are available to the policy implementor. The functionality pro-
vided by the generic layer can be used commonly across multiple file systems and
the generic mechanisms can be easily configured to suit the needs of the file sys-
tem. For example, the number of retries can easily be set for ext3, and it can differ
for reads and writes and across block types. This design aspect of the CFH
provides a clean separation of mechanisms from policy.
     Failure handlers: CFH contains additional threads to handle I/O failures. An
I/O completion is notified in interrupt context; if the I/O has failed, the recovery
actions cannot be invoked there directly, because interrupt context is time critical
and performing complex recovery actions in it might crash the entire sys-
tem.
     When an error is detected, the I/O block along with its context information is
added to a queue and the failure handler thread is woken up. The failure handler
thread removes the failed I/O and enforces the recovery policy specified by the
file system. For example, if the file system wants to retry a failed I/O, the I/O
is reissued. One can implement and apply any of the techniques under the IRON
taxonomy (Tables 2.1 and 2.2) using the failure handlers.
     Failure handling threads are also useful in monitoring asynchronous I/O calls.
Typically, asynchronous I/O failures are not detected by a file system and no recov-
ery action is taken. In CFH, the failure handlers capture all I/O failures including
asynchronous ones and handle them appropriately.
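The hand-off can be sketched in user space with POSIX threads as follows; the
structure and names are illustrative, and a kernel implementation would use the
kernel's own work queues and locking primitives instead.

/* User-space sketch of the failure-handler hand-off: the completion path
 * only enqueues the failed I/O and wakes a worker thread, which applies
 * the recovery policy outside interrupt context. */
#include <pthread.h>
#include <stdlib.h>

struct failed_io {
    unsigned long     blocknr;
    int               error;
    struct failed_io *next;
};

static struct failed_io *queue_head;
static pthread_mutex_t   queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t    queue_wake = PTHREAD_COND_INITIALIZER;

extern void apply_recovery_policy(struct failed_io *io);  /* per-block-type policy */

/* Called from the I/O completion path when an error is detected. */
void cfh_report_failure(unsigned long blocknr, int error)
{
    struct failed_io *io = malloc(sizeof(*io));
    if (!io)
        return;
    io->blocknr = blocknr;
    io->error   = error;

    pthread_mutex_lock(&queue_lock);
    io->next   = queue_head;
    queue_head = io;
    pthread_cond_signal(&queue_wake);          /* wake the failure handler */
    pthread_mutex_unlock(&queue_lock);
}

/* Body of the failure-handler thread. */
void *cfh_failure_handler(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (!queue_head)
            pthread_cond_wait(&queue_wake, &queue_lock);
        struct failed_io *io = queue_head;
        queue_head = io->next;
        pthread_mutex_unlock(&queue_lock);

        apply_recovery_policy(io);             /* retry, remap, stop, ... */
        free(io);
    }
    return NULL;
}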
Table 6.1: CFH Policy Table. The table presents various fields in a CFH policy table
record. The fields can be configured separately for each I/O and offer great flexibility in
supporting fine-grained failure policies.
the policy specified by the record is applied. Validate_IO is a flag that speci-
fies whether a block's contents must be validated or not, and if so, whether to use
checksums or sanity checking for validation. Possible values for this field include
DONT_VALIDATE, SANITY_CHECK, and CHECKSUM.
     We implement certain generic recovery routines, such as retrying failed requests
and remapping writes to a different location. If a file system wants to use those re-
covery techniques, it can use Retry_max to specify the maximum number of times
a failed I/O must be retried and Remap to point to the block to which a write fail-
ure must be remapped. Blocks around a failed block may have a higher chance of
failing, and therefore a file system can use Isolate to specify the number of
blocks that must be remapped around a failed block. In addition to offering a per-
block-type failure policy, one can fine-tune the policy per file or directory.
For example, one can apply stronger failure handling to the root directory than
to other directories, which can be achieved by specifying the inode num-
ber of the file or directory in the Inode field. (*Sanity) and (*Recover) are
function pointers specific to a file system. For example, in the inode failure policy
record, a file system can set (*Sanity) to an inode sanity-checking routine and
(*Recover) to a file system stop function. Certain file system recovery actions,
such as stopping the entire file system, require access to the in-core super block of
the file system, and the policy table record therefore provides a field (FS_Arg) to
store it.
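As one concrete (and purely illustrative) rendering, the record could be declared as
follows in C; the exact types and layout in ext3c may differ.

/* One possible C rendering of a policy-table record with the fields listed
 * in Table 6.1; the exact layout in ext3c may differ. */
enum cfh_validate { DONT_VALIDATE, SANITY_CHECK, CHECKSUM };

struct cfh_policy_record {
    int                 Type;          /* block type this record applies to     */
    enum cfh_validate   Validate_IO;   /* how to validate block contents        */
    int                 Retry_max;     /* max retries of a failed I/O           */
    unsigned long       Remap;         /* target block for remapping writes     */
    int                 Isolate;       /* blocks around a failure to remap      */
    unsigned long       Inode;         /* restrict policy to one file/directory */
    int               (*Sanity)(const void *block);   /* FS-specific check      */
    void              (*Recover)(void *fs_arg);       /* FS-specific recovery   */
    void               *FS_Arg;        /* e.g. in-core super block              */
};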
     It must be noted that this is just one possible design for the policy table fields.
While we have taken a first approach here, more complex designs are possible. For
example, a file system might want to use its own remap routine in order to remap
an entire logically contiguous unit. In general, a file system can either implement
its own specialized routines for every IRON detection and recovery technique, or
use all of the techniques from the CFH generic layer. To support such highly flexible
designs, the policy table fields must be changed. However, we believe that those
changes are not significant and can be implemented as an extension of our prototype
design.
One of the design issues in building a CFH is understanding the semantics of each
I/O. The file system specific component of a CFH must understand the context of
an I/O such as the block type to invoke specific failure policies on I/O failure.
    Understanding the I/O semantics is straightforward for reads and writes that
are issued directly from the file system. We modified ext3 to pass the I/O context
information along with an I/O call.
    However, I/O calls can be issued from common file system layers such as the
generic buffer cache manager. There are two issues in modifying the functions in the
buffer cache layer to use CFH. First, since the buffer cache is commonly used by
other parts of the system, file systems that do not use CFH could be affected. We
solve this problem by altering the buffer cache layer such that I/O calls are redirected
only if the issuing file system uses CFH. Second, since the I/Os are issued from a
generic layer, the file system specific semantics is lost in these I/O calls. We handle
this problem by adding calls in ext3 that pass the block type information to CFH
for those I/Os that are issued via the generic layer.
    When a file system is recovering after a crash, file system blocks that are written
in the journal as part of a successfully committed transaction are read and written
to their fixed locations. It is important that we know the type of the recovered
blocks so that the failure policies can be applied even during recovery. We cannot
configure the file system to pass the block type information during journal recovery
because the file system itself is not active until the recovery completes. We solve
this problem by storing the block type information along with the blocks in the
journal, which can be read during recovery to apply the appropriate failure policy.
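One way to carry this information is to extend each journal descriptor tag with a
block-type field, as in the following sketch; the field names and sizes are assumptions,
not the ext3c on-disk format.

/* Sketch of how block-type information could ride along with each journal
 * descriptor entry so the failure policy can be applied during journal
 * recovery; this mirrors the idea above, not the exact on-disk format. */
#include <stdint.h>

struct journal_block_tag {
    uint32_t blocknr;      /* final fixed location of the journaled block   */
    uint32_t flags;
    uint16_t block_type;   /* e.g. inode, dir, bitmap: consulted by CFH
                            * when the block is read back at recovery       */
};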
Table 6.2: Fixing Error Propagation. The table presents a list of functions in ext3 that
fail to propagate errors to the application on various block-type failures. We fix all of
these functions to return an “I/O Error” code to the application upon failure.
Implementation Details
We have implemented a centralized failure handler for the Linux 2.6.9 kernel and mod-
ified Linux ext3 to use it. We also modify the generic buffer cache and journaling
layer to issue disk reads and writes to CFH if the requests are made on behalf of
ext3. We call the resulting system ext3c .
     We verified that all the I/O calls are redirected to the CFH by flagging I/O
requests within the CFH and ensuring that all requests that are about to be passed
to disk have such a flag. Although not comprehensive, this simple technique gave
us confidence that the CFH interposed on all relevant I/Os.
     In order to build a robust system, error codes must be propagated reliably across
different layers of the system until the error is handled appropriately. CFH is de-
signed such that when an I/O failure cannot be recovered, it propagates the failure
to the file system and also logs an error message in the system log. However, file
systems often ignore failures and fail to propagate the error to the application [86].
From our analysis, we find that ext3 fails to propagate errors to the application under
several POSIX calls (for example, a directory block failure in rmdir is not reported
to the application). Table 6.2 lists a set of functions that ignore error propagation in
ext3; we fix all of those functions in ext3c .
6.1.2 Evaluation
We evaluate ext3c as follows. First, we show the flexibility and consistency of
ext3c by making it mimic ReiserFS-like and NTFS-like failure policies. By chang-
ing a few lines of code in the policy table, ext3c acts entirely differently in response
to a disk failure. Second, we evaluate how ext3c can be used to implement fine-
grained policies for different block types. Specifically, we develop and evaluate a
policy that acts in a more paranoid fashion about its own metadata, while simply
propagating failures encountered when accessing user data. Finally, we evaluate
the performance overhead of routing I/O requests via CFH by running several mac-
robenchmarks.
 ext3c_dir_block                 = {
     .Type                   =    ext3c_DIR_BLOCK,
     .Validate_IO            =    SANITY_CHECK,
     .Retry_max              =    0,
     .Sanity                 =    ext3c_dir_sanity_check,
     .Recover                =    ext3c_stop,
 };
     The above code defines the failure policy for directory blocks in ext3c . It in-
structs CFH to perform sanity checking using the ext3c_dir_sanity_check rou-
tine and, if there is an I/O failure, to stop the system by calling ext3c_stop.
     The above configuration mimics a ReiserFS-like failure policy. Previous work
has shown that ReiserFS is paranoid about disk failures and therefore stops the
file system on most block I/O errors [86]. We show the flexibility of CFH by
changing its failure policy from that of ReiserFS to that of NTFS by altering only a
few lines. NTFS assumes failures are more transient, and therefore performs persistent
retries [45, 86]. We make ext3c behave like NTFS on directory failures by setting
Retry_max to a non-zero value and Recover to an error-propagating function.
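For concreteness, the NTFS-like variant of the directory policy might look as follows;
the retry count and the name of the error-propagating recovery routine are
illustrative, not the exact ext3c settings.

 ext3c_dir_block                 = {
     .Type                   =    ext3c_DIR_BLOCK,
     .Validate_IO            =    SANITY_CHECK,
     .Retry_max              =    5,                        /* persistent retries     */
     .Sanity                 =    ext3c_dir_sanity_check,
     .Recover                =    ext3c_propagate_error,    /* propagate, do not stop */
 };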
     We use a testing framework similar to that of our previous analysis (Section 5.2)
to perform a thorough failure policy analysis of ext3c . We run different workloads
and fail specific file system block types. The tables in Figure 6.2 present the ext3c
failure policy under read failures. Each column represents one or a set of workloads
and each row represents a data structure in ext3. If a data structure is read by a
workload, then the corresponding entry gives the failure policy enforced by CFH. We
can see that by changing just two lines for each of the 13 data structures, ext3c can be
made to mimic a different failure policy for all block-type failures. Although the
results are not shown, we verify that ext3c enforces a similar policy on write failures
as well.
Figure 6.2: Consistent Failure Policies in ext3c . The table indicates recovery poli-
cies of ext3c for read faults injected for each block type across a range of work-
loads. The workloads are a: path traversal b: access,chdir,chroot,stat,statfs,lstat,open
c: chmod,chown,utimes d: read e: readlink f: getdirentries g: creat h: link i: mkdir j: re-
name k: symlink l: write m: truncate n: rmdir o: unlink p: mount q: fsync,sync r: umount
s: FS recovery t: log write operations. A gray box indicates that the workload is not ap-
plicable for the block type. If multiple mechanisms are observed, the symbols are superim-
posed. Key for recovery: A “/”, “−”, and “|” represent retry, error propagation, and file
system stop respectively.
     In addition to being flexible, ext3c is highly consistent in applying its poli-
cies, as can be seen in Figure 6.2. Two design aspects of CFH enable this
consistent application of policies: first, we separate failure policies from mech-
anisms; second, all failure policies are enforced at one central place. The result
is a file system with a well-defined failure-handling system.
Figure 6.3: Fine-grained Failure Policies in ext3c . The table indicates fine-grained poli-
cies of ext3c for write failures. The workloads and data structures are the same as in Fig-
ure 6.2. Key for recovery failure policy: A “/”, “\”, and “|” represent retry, remap, and
file system stop respectively. Errors are propagated on all failures either through error
codes or by logging error messages.
     File systems can configure CFH to implement different policies for different
block types depending upon their importance to the system. For example, a file
system can be more paranoid about metadata block failures and handle them dif-
ferently from data block failures. We show how CFH can be used to implement
fine-grained policies by configuring ext3c with the following policy: file system
metadata write errors are retried and remapped, and journal write failures are only
retried. In both the above cases, ext3c stops the system when it cannot recover from
the write error. When a data write fails, ext3c only retries; it does not remap or stop,
but rather just propagates the error, enabling the application to take the appropriate
action. We evaluate this failure policy for write failures of all block types, and Fig-
ure 6.3 presents the failure policy table. From the table, we can see that under all
[Figure 6.4 appears here: a timeline of block numbers versus I/O completion time (ms),
showing retries and remaps of commit, directory, and super block write failures, and
the isolation reads and remaps around the failed directory block.]
Figure 6.4: Fine-grained Failure Handling. The above figure shows how different block
type failures are handled differently by CFH. I/O completion time is plotted on the X-axis
and block numbers are shown on the Y-axis. Failed I/Os are shown with unfilled markers,
while successful I/Os are shaded black. Writes are marked with a “W” and reads with an
“R”. We show three different block type failures: a commit block write failure, which is a
transient error, and directory block and super block write failures, which are permanent.
CFH retries the commit write failure and successfully recovers (Section A). CFH is
configured to retry 3 times after a directory write failure; when all the retries have failed,
it remaps the directory block to a different location on disk (Section B). CFH also marks a
few blocks around the failed directory block for isolation, and reads and remaps them
(Section C). Finally, when the super block write fails, it is retried 10 times and then
remapped. When the remap also fails, CFH stops the file system (Section D).
workloads, all metadata write failures are handled more thoroughly than data block
failures.
    Figure 6.4 shows a specific example of the above policy. When a commit block
write fails in a transient manner, it is recovered by a retry. However, when a di-
rectory block write fails continuously, CFH remaps it to a new location. We im-
plement an isolation policy in CFH that reads blocks around a failed block, writes
them to a new location, and marks the region as unusable to prevent future errors.
Finally, when the super block write fails, CFH first retries 10 times, then tries to
recover by remapping, and finally stops the file system when the remap also fails.
Even though we show a simple policy here, a file system can implement more fine-
grained policies such as treating root directory I/O failures differently from other
directory failures.
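     To make the shape of such a configuration concrete, the following is a minimal sketch
in C of a per-block-type policy table of the kind CFH could consult on a write failure. The
block-type names, structure fields, and dispatch function are hypothetical illustrations, not
the actual ext3c code, and the mechanisms are stubbed out.

    #include <stdbool.h>
    #include <stdio.h>

    enum block_type { BT_DATA, BT_DIR, BT_INODE, BT_SUPER, BT_JOURNAL, BT_NTYPES };

    struct failure_policy {
        int  retries;      /* how many times to retry the failed write        */
        bool remap;        /* remap to a new on-disk location if retries fail */
        bool stop_on_fail; /* stop the file system if recovery fails          */
    };

    /* Metadata is treated more conservatively than data, mirroring the example
     * policy above: retry+remap+stop for metadata, retry+stop for the journal,
     * retry+propagate for data. */
    static const struct failure_policy policy[BT_NTYPES] = {
        [BT_DATA]    = { 3,  false, false },
        [BT_DIR]     = { 3,  true,  true  },
        [BT_INODE]   = { 3,  true,  true  },
        [BT_SUPER]   = { 10, true,  true  },
        [BT_JOURNAL] = { 3,  false, true  },
    };

    /* Returns 0 if the failure was recovered, -1 if it must be propagated. */
    static int handle_write_failure(enum block_type t,
                                    int (*retry)(void), int (*remap)(void),
                                    void (*stop)(void))
    {
        const struct failure_policy *p = &policy[t];
        for (int i = 0; i < p->retries; i++)
            if (retry() == 0)
                return 0;            /* transient fault: a retry succeeded  */
        if (p->remap && remap() == 0)
            return 0;                /* permanent fault: block remapped     */
        if (p->stop_on_fail)
            stop();                  /* unrecoverable: halt the file system */
        return -1;                   /* propagate the error to the caller   */
    }

    /* Stub mechanisms for illustration: every retry fails, remap succeeds. */
    static int  always_fail(void) { return -1; }
    static int  remap_ok(void)    { return 0;  }
    static void fs_stop(void)     { puts("file system stopped"); }

    int main(void)
    {
        int r = handle_write_failure(BT_DIR, always_fail, remap_ok, fs_stop);
        printf("directory write failure %s\n", r == 0 ? "recovered" : "propagated");
        return 0;
    }

     The table mirrors the example policy above: metadata and journal failures escalate
to a remap or a file system stop, while data failures are only retried and then propagated
to the application.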
Performance Overhead
Table 6.3: Performance Overheads in CFH. The table lists the running time (in seconds)
for four different benchmarks in ext3 and ext3c .
6.1.3 Conclusion
We design and implement a centralized failure handler for file systems that for-
wards I/O requests and responses between a file system and the disk. CFH imple-
ments generic and specific error handling routines that a file system can use to
handle disk failures. By unifying all failure handling at a single point and sep-
arating policies from mechanisms, CFH enables fine-grained and consistent failure
handling in file systems.
    While CFH offers greater flexibility in failure handling, the implementation
effort is non-trivial. Overall, we modified about 600 lines of existing code in
ext3, the buffer cache, and the journaling layer, and added about 3100 lines of code
to build the CFH (including its generic detection and recovery mechanisms such
as checksums and remap). Extending CFH to other file systems is less expensive
once the basic CFH framework is in place. In fact, certain file systems such as
ReiserFS do not use the generic buffer cache, and it is therefore relatively easy to
embed the block-type semantic information in their I/O calls. We also find that
CFH is easier to maintain even as individual file systems go through design changes.
Since CFH maintains a clear separation between the file-system-specific layer and
the generic layer, whenever a file system data structure is modified, only the routines
that handle that data structure must change; the rest of the failure handler is
unaffected.
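    This separation can be pictured as a small registration interface. The sketch below is a
hypothetical simplification (the operation table, the toy classifier, and all names are assumed
for illustration, not taken from our implementation), but it shows why only the file-system-
specific classifier must track on-disk format changes while the generic mechanisms stay
untouched.

    #include <stdint.h>
    #include <stdio.h>

    enum block_type { BT_DATA, BT_DIR, BT_INODE, BT_SUPER, BT_JOURNAL, BT_UNKNOWN };

    /* File-system-specific layer: only this table of operations needs to change
     * when a file system's on-disk data structures change. */
    struct cfh_fs_ops {
        enum block_type (*classify)(uint64_t block_nr);
    };

    /* A toy ext3-like classifier: block 0 is the super block, everything else
     * is treated as data.  Purely illustrative. */
    static enum block_type toy_classify(uint64_t block_nr)
    {
        return block_nr == 0 ? BT_SUPER : BT_DATA;
    }

    /* Generic layer: mechanisms such as retry, remap, and checksumming live
     * here and consult the registered classifier; they never parse on-disk
     * structures themselves. */
    static void cfh_on_write_failure(const struct cfh_fs_ops *ops, uint64_t block_nr)
    {
        enum block_type t = ops->classify(block_nr);
        printf("write failure on block %llu, type %d: dispatching policy\n",
               (unsigned long long)block_nr, (int)t);
    }

    int main(void)
    {
        struct cfh_fs_ops ext3_ops = { .classify = toy_classify };
        cfh_on_write_failure(&ext3_ops, 0);
        return 0;
    }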
Chapter 7
Related Work
Our effort builds upon related work from two bodies of literature. Our file system
analysis (Chapter 4) is related to efforts that inject faults or otherwise test the robust-
ness of systems to failure. Our work on building robust file systems, including the
prototype ixt3 (Section 5.2), the probabilistic model for data layouts (Section 5.3.1),
the parity update techniques (Section 5.3.2), and the centralized failure handler (Chap-
ter 6), draws on efforts in building software that is more robust to hardware failure.
We discuss each in turn.
FIAT is one of the early systems to use fault injection techniques to simulate the occurrence
of hardware errors by changing the contents of memory or registers [17]. FIAT
injects memory bit errors into a task image, and the physical location of the fault
is gathered automatically from compiler and loader information. In our fault
injection experiments, the driver automatically determines the fault location (i.e.,
the block to fail) by inferring the block type information. FERRARI (Fault and
ERRor Automatic Real-time Injector) is another tool that emulates permanent and
transient hardware faults in software and uses software traps to inject faults [58].
Its four modules roughly correspond to the functionality of the coordinator and fault
injection driver of our system. The initializer and activator module in FERRARI
prepares the system for fault injection; the user information module obtains param-
eters such as fault and error types from the user; the fault injector injects faults dur-
ing program execution; and finally, the data collection and analysis module collects
logs of all the results. The coordinator and driver perform all of the above activities
in our system.
     Other systems that use the SWIFI approach include FTAPE (Fault Tolerance And
Performance Evaluator), a tool that performs dynamic workload measure-
ments and injects faults by automatically determining the time and location that
will maximize fault propagation [118]. FTAPE uses stress-based injection tech-
niques to inject faults during high workload activity, which ensures higher fault
propagation. FTAPE uses a framework similar to ours to inject disk system faults,
where a driver is used to emulate I/O errors. An important difference is that in their
system a fault is initiated by the workload stress, while in our framework a fault is
initiated when a particular block type is written to disk. Fault injection experiments
have also been used to study fault propagation in the system. FINE (Fault Injection
and moNitoring Environment) is a tool developed by Kao et al. to inject hardware-
induced software faults into the UNIX kernel and trace the execution flow of the ker-
nel [70]. Although we do not explicitly trace fault propagation in our system,
the disk errors that we inject propagate from their origin through a generic device
driver layer until they reach the file system, where they can cause several behaviors,
including system crashes. Several efforts have targeted the kernel for reliability anal-
ysis. In a more recent work, fault injection techniques are used to test the Linux
kernel behavior in the presence of errors in the kernel instruction code [49].
They test four kernel subsystems: the architecture-dependent code, the virtual
file system, memory management, and the core kernel. Our testing focuses only on
file systems, and while they inject faults in instruction streams, we inject disk I/O
failures.
File and Storage System Testing: Fault injection has been used specifically to
study the reliability of storage systems as well. Brown et al. developed a method to
measure system robustness and applied it to measure the availability of software
RAID systems in Linux, Solaris, and Windows [24]. They use a PC to emulate
a disk and use the disk emulator to inject faults. They test software RAID
systems, while our work targets file systems. Moreover, we use file system
knowledge to carefully select and fail specific block types, whereas they do not
require any semantic information for fault injection. Other studies have evaluated
RAID storage systems for reliability and availability [55, 62]. These studies
developed detailed simulation models of RAID storage arrays and network clusters
and used them to obtain dependability measures. In contrast to their simulation-
based fault injection, we perform prototype-based fault injection for our failure
analysis.
    Several file system testing tools exercise the file system API with various types of
invalid arguments. Siewiorek et al. develop a benchmark to measure a system’s
robustness and use it to test the dependability of file system libraries [104]. Their
benchmark consists of two parts, each exercising a different part of the file man-
agement system. The first part focuses on data structures in file management, such
as corrupting a file pointer or a file system flag. The second part measures the ef-
fectiveness of the input error detection mechanisms of the functions in the C standard
input/output library (STDIO). Similarly, Koopman et al. use the Ballista testing
suite to find robustness problems in the Safe/Fast IO (SFIO) library [31]. They test 36
functions in the SFIO API and show that although SFIO robustness is far greater
than that of STDIO, it still has a fair number of robustness failures in functions like read
and write. In contrast to both of the above works, which test the robustness of the file
system API, we measure the robustness of file systems to disk write failures.
    One major difference between the related work in fault injection and ours is
that our approach focuses on how file systems handle the broad class of modern
disk failure modes; we know of no previous work that does so. Our approach also
assumes implicit knowledge of file-system block types; by doing so, we ensure that
we test many different paths of the file system code. Much of the previous work
inserts faults in a “blind” fashion and hence is less likely to uncover the problems
we have found.
Model checking has also been used to find errors in complex file system operations. In a
recent work, Yang et al. use model checking comprehensively to find bugs in three different
file systems: ext3, ReiserFS, and JFS [129]. They use formal verification techniques to
systematically enumerate a set of file system states and verify them against valid file system
states. Their work can be used to identify problems like deadlocks and NULL pointers,
whereas our work focuses mainly on how file systems handle latent sector faults.
    More recently, Yang et al. develop a method to automatically find bugs in file
system code that sanity-checks on-disk data structure values [128]. They use a
symbolic execution system, called EXE, which instruments a program and runs it on
symbolic input that is initially free to have any value. Based on the conditional
expressions on data values, EXE generates constraints on the input values and gen-
erates test cases for the “true” path and “false” path of the conditional expressions.
This approach, like other formal methods, requires access to source code, whereas
our analysis does not.
The original RAID paper includes a discussion of how RAID-5 can be constructed using
parity blocks [81]. While RAID-5 can tolerate a single disk failure, more recent work has
focused on using parity blocks to withstand double disk failures. For example, EVENODD [21]
and RDP [28] are techniques that spread parity among multiple disks such that the
data can be recovered after a double disk failure. In a more recent work, Denehy et
al. develop techniques to handle the parity update consistency issues in software
RAID systems [30]. They introduce a new mode of operation in the journaling file sys-
tem, called declare mode, that provides information about outstanding disk writes
after a crash. They augment the software RAID layer to improve the speed of resyn-
chronization of parity with other blocks with the help of the file system. While all of
the previous work focuses on adding parity at a layer beneath the file system,
our work is primarily built around implementing redundancy within the file system.
Chapter 8
      “One therefore has the problem of being able to deal ... with ... bizarre
      occurrences such as a disc channel which writes to the wrong place on the
      disc without any error indication or, perhaps even worse, one which says that
      it has written but in fact has not.”
    Commodity operating systems have grown to assume the presence of mostly re-
liable hardware. The result, in the case of file systems, is that most commodity file
systems do not include the requisite machinery to handle the types of partial faults
one can reasonably expect from modern disk drives. In this thesis, we develop
and employ a new technique, semantic failure analysis (SFA), that uses block type
information and transaction semantics to test and understand failure handling in
journaling file systems. We then build robust versions of Linux ext3 that use various
forms of redundant information and a centralized failure handler to provide fine-
grained and well-defined failure policies.
    In this chapter, we summarize this dissertation by recapitulating our failure
policy analysis and our experiences in building IRON file systems (Section 8.1). We
then list a set of lessons we learned from this dissertation (Section 8.2). Finally,
we present future directions in which this thesis can be extended (Section 8.3).
8.1 Summary
Storage stacks on modern computers present complex failure modes such as la-
tent sector faults, data corruption, and transient errors. The first part of this thesis
focuses on understanding the failure policies of file systems when confronted with
such faults. We choose to focus on local file systems due to their ubiquitous pres-
ence and the new challenges they present.
     Commodity file systems are large, complex pieces of software with intricate
interactions with other parts of the system such as the buffer cache manager, I/O
schedulers, the journaling layer, and low-level drivers. Given this complexity, we need
new techniques and approaches that can be used by system developers, and even users
of file systems, to test and understand how commodity file systems handle disk fail-
ures. In this regard, we develop a new technique called Semantic Failure Analysis,
which applies file-system-specific knowledge, such as block types and transaction
semantics, to the low-level block I/O stream and fails specific file system I/Os. In con-
trast to traditional fault injection, which fails blocks in a “blind” fashion, semantic
failure analysis is fast and can be used to identify failure policies, bugs, and incon-
sistencies relatively easily.
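     The following sketch illustrates the core idea in C; the classifier, the fault specification,
and all names are hypothetical simplifications of what our fault-injection driver does, not
the driver itself.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    enum block_type { BT_DATA, BT_DIR, BT_INODE, BT_SUPER, BT_COMMIT, BT_OTHER };

    struct fault_spec {
        enum block_type target;  /* the block type to fail               */
        bool transient;          /* fail only the first matching request */
        bool triggered;          /* set once a transient fault has fired */
    };

    /* Supplied per file system: infer a block's type from its number and, for
     * journal traffic, from transactional context.  A toy classifier stands in
     * for the real inference here. */
    typedef enum block_type (*classifier_fn)(uint64_t block_nr, const void *buf);

    static enum block_type toy_classify(uint64_t block_nr, const void *buf)
    {
        (void)buf;
        return block_nr == 0 ? BT_SUPER : BT_DATA;
    }

    /* Called on every write the file system issues; returns true when the
     * driver should report an I/O error instead of performing the write. */
    static bool should_inject(struct fault_spec *spec, classifier_fn classify,
                              uint64_t block_nr, const void *buf)
    {
        if (classify(block_nr, buf) != spec->target)
            return false;                 /* not the block type under test */
        if (spec->transient && spec->triggered)
            return false;                 /* transient fault already fired */
        spec->triggered = true;
        return true;                      /* report EIO for this request   */
    }

    int main(void)
    {
        struct fault_spec spec = { .target = BT_SUPER, .transient = true };
        printf("block 0: %s\n",
               should_inject(&spec, toy_classify, 0, NULL) ? "fail" : "pass");
        printf("block 0 again: %s\n",
               should_inject(&spec, toy_classify, 0, NULL) ? "fail" : "pass");
        return 0;
    }

     In the real driver, the classifier is derived from file system layout and journaling
semantics (for example, recognizing a commit block within a transaction), which is what
allows us to fail exactly the block type under test.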
     We analyze five commodity journaling file systems: Linux ext3, ReiserFS, JFS,
XFS, and Windows NTFS. We make the following observations. First, com-
modity file systems are built with the assumption that disks fail in a fail-stop man-
ner and, therefore, they store little or no redundant information on disk. As a result,
when portions of a disk fail, file systems are not able to recover from them. Second,
file systems exhibit illogical inconsistency in their failure handling policies. That
is, even under similar disk failure scenarios, different detection and recovery ac-
tions are employed. We suspect that this may be the result of the assumptions made
by different developers when writing different sections of the code. Finally, failure
handling is hard in current file system architectures, and we need new frameworks
for supporting well-defined failure policies.
     In the second part of this thesis, we focus on improving the robustness of com-
modity file systems to disk failures. First, we explore the effectiveness and cost
of different forms of redundant information, such as checksums, replicas, and parity.
By applying redundancy to file system metadata and data, both individually and in
various combinations, we study in breadth the cost in terms of time and space. Our
results show that for typical desktop-like workloads, it is indeed feasible to use
redundant information with small overheads.
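     As one concrete example of such redundant information, the sketch below shows
block-level checksumming in its simplest form: compute a checksum when a block is
written, verify it when the block is read, and fall back to recovery (for example, reading
a replica) on a mismatch. The particular checksum (a simple Fletcher-style sum) and the
surrounding code are assumptions for illustration, not the exact scheme we evaluate.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Fletcher-style 32-bit checksum over a block's contents. */
    static uint32_t block_checksum(const uint8_t *buf, size_t len)
    {
        uint32_t a = 0, b = 0;
        for (size_t i = 0; i < len; i++) {
            a = (a + buf[i]) % 65535;
            b = (b + a) % 65535;
        }
        return (b << 16) | a;
    }

    /* On read: recompute and compare against the stored checksum; a mismatch
     * signals silent corruption and should trigger recovery. */
    static bool block_verify(const uint8_t *buf, size_t len, uint32_t stored)
    {
        return block_checksum(buf, len) == stored;
    }

    int main(void)
    {
        uint8_t block[4096];
        memset(block, 0xab, sizeof block);
        uint32_t sum = block_checksum(block, sizeof block);  /* on write */
        block[17] ^= 0x01;                                   /* simulate corruption */
        printf("verify after corruption: %s\n",
               block_verify(block, sizeof block, sum) ? "ok" : "MISMATCH");
        return 0;
    }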
     Second, we focus in depth on two issues presented by single disk drives on low-
end systems: spatially correlated faults and the lack of NVRAM support. We develop
a probabilistic model that a file system can use to construct redundancy sets such
that the chances of multiple faults affecting blocks in the same set are low. We also
develop two parity update techniques that integrate with the journaling framework to
update data and parity blocks consistently. One of the techniques, the no-overwrite
approach, provides atomicity in updating the redundant information and, as a
result, can handle latent sector faults even during journal recovery. Our evaluation
of the techniques shows that the performance overhead of the parity update tech-
niques varies between less than 1% and 11% for typical desktop-type workloads and
increases to as high as 51% for synchronous, write-intensive benchmarks.
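     The sketch below combines the two ideas in simplified form: members of a redundancy
set are placed far apart on disk, so that a single spatially correlated fault is unlikely to
damage two of them, and the parity block is the XOR of the members, so that any single
lost member can be reconstructed. The set size, stride, and names are hypothetical
parameters rather than values derived from our probabilistic model, and the journaling-
integrated update ordering is omitted.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE 4096
    #define SET_SIZE   4          /* data blocks per redundancy set (assumed) */
    #define STRIDE     65536      /* gap, in blocks, between set members      */

    /* k-th member of redundancy set s: members are STRIDE blocks apart, so a
     * single spatially correlated fault is unlikely to hit two of them. */
    static uint64_t set_member(uint64_t s, unsigned k)
    {
        return s + (uint64_t)k * STRIDE;
    }

    /* Parity is the XOR of all data blocks in the set; XOR-ing the surviving
     * members with the parity block reconstructs a single lost block. */
    static void compute_parity(uint8_t parity[BLOCK_SIZE],
                               const uint8_t data[SET_SIZE][BLOCK_SIZE])
    {
        memset(parity, 0, BLOCK_SIZE);
        for (unsigned k = 0; k < SET_SIZE; k++)
            for (size_t i = 0; i < BLOCK_SIZE; i++)
                parity[i] ^= data[k][i];
    }

    int main(void)
    {
        static uint8_t data[SET_SIZE][BLOCK_SIZE];
        static uint8_t parity[BLOCK_SIZE];
        memset(data[2], 0x5a, BLOCK_SIZE);   /* some content in one member */
        compute_parity(parity, data);
        printf("set 0 members: %llu %llu %llu %llu, parity[0]=0x%02x\n",
               (unsigned long long)set_member(0, 0), (unsigned long long)set_member(0, 1),
               (unsigned long long)set_member(0, 2), (unsigned long long)set_member(0, 3),
               parity[0]);
        return 0;
    }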
     Finally, we rearchitect the file system with a Centralized Failure Handler (CFH)
to handle all failures in a consistent manner. The CFH receives all the I/O calls
from a file system along with the I/O semantics, such as the block type information.
By separating policies from mechanisms and handling all I/O failures at a single
point, CFH enables uniform, fine-grained, and flexible failure policies. We also
show that CFH incurs no additional overhead in forwarding requests between a file
system and disk.
     To conclude, we believe it is time to reexamine how file systems handle failures.
One excellent model is already available to us within the operating system kernel:
the networking subsystem. Indeed, because network hardware has long been con-
sidered an unreliable medium, the software stacks above it have been designed with
well-defined policies to cope with common failure modes [82].
        modifying a common layer like buffer cache that is used by several sub-
        systems is generally discouraged. Under such circumstances, we believe
        that the techniques developed in the context of semantically-smart disk sys-
        tems [10, 12, 105, 106, 108] to reverse engineer file system information can
        be immensely useful.
      • Failure as a first class citizen: Traditionally, systems have been built with
        performance as the main focus and as a result, various policies have been de-
        veloped for different applications, where each policy is aimed at maximizing
        the performance of a specific workload behavior.
      • Ways to describe policies: Policies are hard to get right without the right
        framework. This is especially so for failure policies, where hundreds of
        combinations are possible among different block types, detection techniques,
        and recovery techniques. In this thesis, we show using CFH that, given the
        right interface, it is indeed possible to specify even intricate policies using
        simple commands.
blocks) on disk platters. Deriving such data from drives that are discarded as un-
usable can shed light on block faults and help higher layers improve their data
layout patterns.
     SCSI drives provide low-level commands to obtain the mapping between a
logical block and its physical location on disk (including details like cylinder, track,
and sector). As a first step, one can reconstruct the layout of logical blocks on disk,
which can show the bad blocks that are not used by the disk drive. Although this
only brings out the bad blocks that have already been identified by the disk drive, it
can unearth certain internal policies of the disk firmware, such as its bad block
remapping algorithm, which can possibly be used by a higher layer to improve its layout.
Prior work on semantically-smart disk systems explored some uses of file system level
knowledge at the disk drives, where the higher-level semantics were reverse engineered.
However, with the advent of object-based storage systems [6], some of this information
can be made available through explicit interfaces.
[1] Anurag Acharya. Reliability on the Cheap: How I Learned to Stop Worrying
    and Love Cheap PCs. EASY Workshop ’02, October 2002.
[4] Dave Anderson. You Don’t Know Jack about Disks. ACM Queue, June
    2003.
[5] Dave Anderson. “Drive manufacturers typically don’t talk about disk fail-
    ures”. Personal Communication from Dave Anderson of Seagate, 2005.
[7] Dave Anderson, Jim Dykes, and Erik Riedel. More Than an Interface: SCSI
    vs. ATA. In Proceedings of the 2nd USENIX Symposium on File and Storage
    Technologies (FAST ’03), San Francisco, California, April 2003.
[8] Akshat Aranya, Charles P. Wright, and Erez Zadok. Tracefs: A File System
    to Trace Them All. In Proceedings of the 3rd USENIX Symposium on File
    and Storage Technologies (FAST ’04), San Francisco, California, April 2004.
[13] Mary Baker, John Hartman, Martin Kupfer, Ken Shirriff, and John Ouster-
     hout. Measurements of a Distributed File System. In Proceedings of the
     13th ACM Symposium on Operating Systems Principles (SOSP ’91), pages
     198–212, Pacific Grove, California, October 1991.
[14] Mary Baker, Mehul Shah, David S. H. Rosenthal, Mema Roussopoulos, Pet-
     ros Maniatis, TJ Giuli, and Prashanth Bungale. A fresh look at the reliability
     of long-term digital storage. In Proceedings of the 2006 EuroSys Confer-
     ence, Leuven, Belgium, April 2006.
[16] Wendy Bartlett and Lisa Spainhower. Commercial Fault Tolerance: A Tale
     of Two Systems. IEEE Transactions on Dependable and Secure Computing,
     1(1):87–96, January 2004.
[17] J.H. Barton, E.W. Czeck, Z.Z. Segall, and D.P. Siewiorek. Fault Injection
     Experiments Using FIAT. IEEE Transactions on Computers, 39(4):1105–
     1118, April 1990.
[18] Steve Best. JFS Log. How the Journaled File System performs logging. In
     Proceedings of the 4th Annual Linux Showcase and Conference, pages 163–
     168, Atlanta, 2000.
[20] Dina Bitton and Jim Gray. Disk shadowing. In Proceedings of the 14th
     International Conference on Very Large Data Bases (VLDB 14), pages 331–
     338, Los Angeles, California, August 1988.
[25] George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, and Ar-
     mando Fox. Microreboot – A Technique for Cheap Recovery. In Proceed-
     ings of the 6th Symposium on Operating Systems Design and Implementation
     (OSDI ’04), pages 31–44, San Francisco, California, December 2004.
[26] Remy Card, Theodore Ts’o, and Stephen Tweedie. Design and Implementa-
     tion of the Second Extended Filesystem. In Proceedings of the First Dutch
     International Symposium on Linux, Amsterdam, Holland, 1994.
[27] Andy Chou, Jun-Feng Yang, Benjamin Chelf, Seth Hallem, and Dawson
     Engler. An Empirical Study of Operating System Errors. In Proceedings
     of the 18th ACM Symposium on Operating Systems Principles (SOSP ’01),
     pages 73–88, Banff, Canada, October 2001.
[28] Peter Corbett, Bob English, Atul Goel, Tomislav Grcanac, Steven Kleiman,
     James Leong, and Sunitha Sankar. Row-Diagonal Parity for Double Disk
     Failure Correction. In Proceedings of the 3rd USENIX Symposium on File
     and Storage Technologies (FAST ’04), San Francisco, California, April 2004.
[33] James Dykes. “A modern disk has roughly 400,000 lines of code”. Personal
     Communication from James Dykes of Seagate, August 2005.
[34] Jon G. Elerath. Specifying Reliability in the Disk Drive Industry: No More
     MTBF’s. In Proceedings of the Annual Reliability and Maintainability Sym-
     posium (RAMS), pages 194–199, January 2000.
[35] Jon G. Elerath and Sandeep Shah. Server Class Disk Drives: How Reliable
     Are They? In Proceedings of the Annual Reliability and Maintainability
     Symposium (RAMS), pages 151–156, 2004.
[37] Ralph Waldo Emerson. Essays and English Traits – IV: Self-Reliance. The
     Harvard classics, edited by Charles W. Eliot. New York: P.F. Collier and
     Son, 1909-14, Volume 5, 1841. A foolish consistency is the hobgoblin of
     little minds, adored by little statesmen and philosophers and divines.
[38] Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, and Benjamin
     Chelf. Bugs as Deviant Behavior: A General Approach to Inferring Errors
     in Systems Code. In Proceedings of the 18th ACM Symposium on Operating
     Systems Principles (SOSP ’01), pages 57–72, Banff, Canada, October 2001.
[39] Gregory R. Ganger, Marshall Kirk McKusick, Craig A.N. Soules, and
     Yale N. Patt. Soft Updates: A Solution to the Metadata Update Problem
     in File Systems. ACM Transactions on Computer Systems, 18(2):127–153,
     May 2000.
[40] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File
     System. In Proceedings of the 19th ACM Symposium on Operating Systems
     Principles (SOSP ’03), pages 29–43, Bolton Landing (Lake George), New
     York, October 2003.
[41] Garth A. Gibson. Redundant disk arrays: reliable, parallel secondary stor-
     age. PhD thesis, University of California at Berkeley, 1991.
[42] Garth A. Gibson, David Rochberg, Jim Zelenka, David F. Nagle, Khalil
     Amiri, Fay W. Chang, Eugene M. Feinberg, Howard Gobioff, Chen
     Lee, Berend Ozceri, and Erik Riedel. File server scaling with network-
     attached secure disks. In Proceedings of the 1997 Joint International Con-
     ference on Measurement and Modeling of Computer Systems (SIGMET-
     RICS/PERFORMANCE ’97), pages 272–284, Seattle, Washington, June
     1997.
[43] Jason Goldberg. “Spatially local failures can be caused by thermal asper-
     ity”. Personal Communication from Jason Goldberg of Seagate Research,
     February 2006.
[44] Jim Gray. A Census of Tandem System Availability Between 1985 and 1990.
     Technical Report 90.1, Tandem Computers, 1990.
[45] Jim Gray and Catharine Van Ingen. Empirical measurements of disk failure
     rates and error rates. Microsoft Technical Report, December 2005.
[46] Jim Gray and Andreas Reuter. Transaction Processing: Concepts and Tech-
     niques. Morgan Kaufmann, 1993.
[49] Weining Gu, Z. Kalbarczyk, Ravishankar K. Iyer, and Zhenyu Yang. Char-
     acterization of Linux Kernel Behavior Under Error. In Proceedings of the In-
     ternational Conference on Dependable Systems and Networks (DSN-2003),
     pages 459–468, San Francisco, California, June 2003.
[51] Robert Hagmann. Reimplementing the Cedar File System Using Logging
     and Group Commit. In Proceedings of the 11th ACM Symposium on Oper-
     ating Systems Principles (SOSP ’87), Austin, Texas, November 1987.
[53] Dave Hitz, James Lau, and Michael Malcolm. File System Design for an
     NFS File Server Appliance. In Proceedings of the USENIX Winter Technical
     Conference (USENIX Winter ’94), San Francisco, California, January 1994.
[54] Mei-Chen Hsueh, Timothy K. Tsai, and Ravishankar K. Iyer. Fault injection
     techniques and tools. IEEE Computer, 30(4):75–82, 1997.
[56] Gordon F. Hughes and Joseph F. Murray. Reliability and Security of RAID
     Storage Systems and D2D Archives Using SATA Disk Drives. ACM Trans-
     actions on Storage, 1(1):95–107, February 2005.
[59] Hannu H. Kari. Latent Sector Faults and Reliability of Disk Arrays. PhD
     thesis, Helsinki University of Technology, September 1997.
[63] Steve R. Kleiman. Vnodes: An Architecture for Multiple File System Types
     in Sun UNIX. In Proceedings of the USENIX Summer Technical Conference
     (USENIX Summer ’86), pages 238–247, Atlanta, Georgia, June 1986.
[64] John Kubiatowicz, David Bindel, Patrick Eaton, Yan Chen, Dennis Geels,
     Ramakrishna Gummadi, Sean Rhea, Westley Weimer, Chris Wells, Hakim
     Weatherspoon, and Ben Zhao. OceanStore: An Architecture for Global-
     Scale Persistent Storage. In Proceedings of the 9th International Conference
     on Architectural Support for Programming Languages and Operating Sys-
     tems (ASPLOS IX), pages 190–201, Cambridge, Massachusetts, November
     2000.
[65] James Larus. The Singularity Operating System. Seminar given at the Uni-
     versity of Wisconsin, Madison, 2005.
[67] Blake Lewis. Smart Filers and Dumb Disks. NSIC OSD Working Group
     Meeting, April 1999.
[68] Stan Li. Markov Random Fields and Gibbs Distributions. In Markov Ran-
     dom Field Modeling in Image Analysis, chapter 1. Springer-Verlag, 2001.
[70] Wei-lun Kao, Ravishankar K. Iyer, and Dong Tang. FINE: A Fault Injection
     and Monitoring Environment for Tracing the UNIX System Behavior Under
     Faults. IEEE Transactions on Software Engineering, pages 1105–1118, 1993.
[71] Marshall K. McKusick, William N. Joy, Sam J. Leffler, and Robert S. Fabry.
     A Fast File System for UNIX. ACM Transactions on Computer Systems,
     2(3):181–197, August 1984.
[72] Marshall Kirk McKusick, Willian N. Joy, Samuel J. Leffler, and Robert S.
     Fabry. Fsck - The UNIX File System Check Program. Unix System Man-
     ager’s Manual - 4.3 BSD Virtual VAX-11 Version, April 1986.
[73] Larry McVoy and Carl Staelin. lmbench: Portable Tools for Performance
     Analysis. In Proceedings of the USENIX Annual Technical Conference
     (USENIX ’96), San Diego, California, January 1996.
[76] John K. Ousterhout. Why Aren’t Operating Systems Getting Faster as Fast
     as Hardware? In Proceedings of the 1990 USENIX Summer Technical Con-
     ference, Anaheim, CA, June 1990.
[77] John K. Ousterhout, Herve Da Costa, David Harrison, John A. Kunze, Mike
     Kupfer, and James G. Thompson. A Trace-Driven Analysis of the UNIX 4.2
     BSD File System. In Proceedings of the 10th ACM Symposium on Operating
     Systems Principles (SOSP ’85), Orcas Island, Washington, December 1985.
[79] Swapnil Patil, Anand Kashyap, Gopalan Sivathanu, and Erez Zadok. I3 FS:
     An In-kernel Integrity Checker and Intrusion detection File System. In Pro-
     ceedings of the 18th Annual Large Installation System Administration Con-
     ference (LISA ’04), Atlanta, Georgia, November 2004.
[80] David Patterson, Aaron Brown, Pete Broadwell, George Candea, Mike
     Chen, James Cutler, Patricia Enriquez, Armando Fox, Emre Kiciman,
     Matthew Merzbacher, David Oppenheimer, Naveen Sastry, William Tetzlaff,
     Jonathan Traupman, and Noah Treuhaft. Recovery Oriented Computing
     (ROC): Motivation, Definition, Techniques, and Case Studies. Technical
     Report CSD-02-1175, U.C. Berkeley, March 2002.
[81] David Patterson, Garth Gibson, and Randy Katz. A Case for Redundant
     Arrays of Inexpensive Disks (RAID). In Proceedings of the 1988 ACM
     SIGMOD Conference on the Management of Data (SIGMOD ’88), pages
     109–116, Chicago, Illinois, June 1988.
[87] Feng Qin, Joseph Tucek, Jagadeesan Sundaresan, and Yuanyuan Zhou. Rx:
     Treating Bugs as Allergies – A Safe Method to Survive Software Failures. In
     Proceedings of the 20th ACM Symposium on Operating Systems Principles
     (SOSP ’05), Brighton, United Kingdom, October 2005.
[90] Peter M. Ridge and Gary Field. The Book of SCSI 2/E. No Starch, June
     2000.
[91] Martin Rinard, Christian Cadar, Daniel Dumitran, Daniel M. Roy, Tudor
     Leu, and William S. Beebe, Jr. Enhancing Server Availability and Security
     Through Failure-Oblivious Computing. In Proceedings of the 6th Sympo-
     sium on Operating Systems Design and Implementation (OSDI ’04), San
     Francisco, California, December 2004.
[93] Mendel Rosenblum and John Ousterhout. The Design and Implementation
     of a Log-Structured File System. ACM Transactions on Computer Systems,
     10(1):26–52, February 1992.
[94] Jerome H. Saltzer, David P. Reed, and David D. Clark. End-to-end argu-
     ments in system design. ACM Transactions on Computer Systems, 2(4):277–
     288, November 1984.
[95] Russel Sandberg. The Design and Implementation of the Sun Network File
     System. In Proceedings of the 1985 USENIX Summer Technical Conference,
     pages 119–130, Berkeley, CA, June 1985.
 [96] Stefan Savage and John Wilkes. AFRAID — A Frequently Redundant Array
      of Independent Disks. In Proceedings of the USENIX Annual Technical Con-
      ference (USENIX ’96), pages 27–39, San Diego, California, January 1996.
 [97] Jiri Schindler. “We have experienced a severe performance degradation that
      was identified as a problem with disk firmware. The disk drives had to be re-
      programmed to fix the problem”. Personal Communication from J. Schindler
      of EMC, July 2005.
[100] Thomas J.E. Schwarz, Qin Xin, Ethan L. Miller, Darrell D.E. Long, Andy
      Hospodor, and Spencer Ng. Disk Scrubbing in Large Archival Storage Sys-
      tems. In Proceedings of the 12th Annual Meeting of the IEEE Interna-
      tional Symposium on Modeling, Analysis, and Simulation of Computer and
      Telecommunication Systems (MASCOTS), Volendam, Netherlands, October
      2004.
[101] Margo Seltzer, Keith Bostic, Marshall Kirk McKusick, and Carl Staelin. An
      Implementation of a Log-Structured File System for UNIX. In Proceedings
      of the USENIX Winter Technical Conference (USENIX Winter ’93), pages
      307–326, San Diego, California, January 1993.
[102] Sandeep Shah and Jon G. Elerath. Disk Drive Vintage and Its Effect on
      Reliability. In Proceedings of the Annual Reliability and Maintainability
      Symposium (RAMS), pages 163–167, 2000.
[103] Sandeep Shah and Jon G. Elerath. Reliability analysis of disk drive failure
      mechanisms. In Proceedings of the Annual Reliability and Maintainability
      Symposium (RAMS), pages 226–231, January 2005.
[104] D.P. Siewiorek, J.J. Hudak, B.H. Suh, and Z.Z. Segal. Development of a
      Benchmark to Measure System Robustness. In Proceedings of the 23rd In-
      ternational Symposium on Fault-Tolerant Computing (FTCS-23), Toulouse,
      France, June 1993.
[110] Christopher A. Stein, John H. Howard, and Margo I. Seltzer. Unifying File
      System Protection. In Proceedings of the USENIX Annual Technical Con-
      ference (USENIX ’01), Boston, Massachusetts, June 2001.
[111] Lex Stein. Stupid File Systems Are Better. In The Tenth Workshop on Hot
      Topics in Operating Systems (HotOS X), Sante Fe, New Mexico, June 2005.
[112] Adam Sweeney, Doug Doucette, Wei Hu, Curtis Anderson, Mike Nishimoto,
      and Geoff Peck. Scalability in the XFS File System. In Proceedings of the
      USENIX Annual Technical Conference (USENIX ’96), San Diego, Califor-
      nia, January 1996.
[113] Michael M. Swift, Brian N. Bershad, and Henry M. Levy. Improving the Re-
      liability of Commodity Operating Systems. In Proceedings of the 19th ACM
      Symposium on Operating Systems Principles (SOSP ’03), Bolton Landing
      (Lake George), New York, October 2003.
[118] T. K. Tsai and R. K. Iyer. Measuring Fault Tolerance with the FTAPE Fault
      Injection Tool. In The 8th International Conference On Modeling Techniques
      and Tools for Computer Performance Evaluation, pages 26–40, September
      1995.
[120] Theodore Ts’o. Email discussion on soft updates with the subject “soft up-
      date vs journaling?”. http://lkml.org/lkml/2006/1/22/27.
[121] Stephen C. Tweedie. Journaling the Linux ext2fs File System. In The Fourth
      Annual Linux Expo, Durham, North Carolina, May 1998.
[123] Werner Vogels. File system usage in Windows NT 4.0. In Proceedings of the
      17th ACM Symposium on Operating Systems Principles (SOSP ’99), pages
      93–109, Kiawah Island Resort, South Carolina, December 1999.
[124] Larry Y. Wang, Mike Sullivan, and Jim Chao. Thermal asperities sensitivity
      to particles: Methodology and test results. Journal of Tribology, 123(2):376–
      379, April 2001.
[125] John Wehman and Peter den Haan. The Enhanced IDE/Fast-ATA FAQ.
      http://thef-nym.sci.kun.nl/cgi-pieterh/atazip/atafq.html, 1998.
[127] John Wilkes, Richard Golding, Carl Staelin, and Tim Sullivan. The HP
      AutoRAID Hierarchical Storage System. ACM Transactions on Computer
      Systems, 14(1):108–136, February 1996.
[128] Junfeng Yang, Can Sar, Paul Twohey, Cristian Cadar, and Dawson Engler.
      Automatically Generating Malicious Disks using Symbolic Execution. In
      IEEE Security and Privacy, Berkeley, California, May 2006.
[129] Junfeng Yang, Paul Twohey, Dawson Engler, and Madanlal Musuvathi. Us-
      ing Model Checking to Find Serious File System Errors. In Proceedings of
      the 6th Symposium on Operating Systems Design and Implementation (OSDI
      ’04), San Francisco, California, December 2004.
[131] S. Zhou, H. Da Costa, and A.J. Smith. A File System Tracing Package for
      Berkeley UNIX. In Proceedings of the USENIX Summer Technical Con-
      ference (USENIX Summer ’84), pages 407–419, Salt Lake City, Utah, June
      1984.