Byzantine Fault Isolation in the Farsite Distributed File System
John R. Douceur and Jon Howell
                                                         Microsoft Research
                                                   {johndo, howell}@microsoft.com
ABSTRACT

In a peer-to-peer system of interacting Byzantine-fault-tolerant replicated-state-machine groups, as system scale increases, so does the probability that a group will manifest a fault. If no steps are taken to prevent faults from spreading among groups, a single fault can result in total system failure. To address this problem, we introduce Byzantine Fault Isolation (BFI), a technique that enables a distributed system to operate with application-defined partial correctness when some of its constituent groups are faulty. We quantify BFI's benefit and describe its use in Farsite, a peer-to-peer file system designed to scale to 100,000 machines.

1 INTRODUCTION

Farsite [2] is a distributed peer-to-peer file system that runs on a network of desktop workstations and provides centralized file-system service. File storage in Farsite is secure, even though the system runs on unsecured machines. This security is established, in part, through the use of Byzantine Fault Tolerance (BFT), a well-known mechanism for building trusted services from untrusted components [14, 6]. BFT enables a service to continue functioning correctly as long as fewer than a third of the machines it is running on are faulty.

Farsite is designed to be scalable. As more and more workstations make use of the file-system service, the resources of these workstations become available for use by the system. However, the BFT technique cannot exploit these additional resources to achieve the greater throughput demanded by an increasing user base: Adding machines to a BFT group decreases throughput, rather than increasing it.

To achieve an increase in throughput with scale, Farsite partitions its workload among multiple interacting BFT groups. Unfortunately, as the count of BFT groups increases, so too does the probability that some group will contain enough faulty machines that the group will be unable to suppress the fault. If the system design does not account for the failure of one or more BFT groups, a single group failure can cause the entire system to fail.

The alternative to total failure is degraded operation, wherein individual group failures cause the system to operate in a way that is still partially correct, rather than completely unspecified. However, "partial correctness" is not something that can be defined in an application-independent fashion [11]. It is thus not possible to build a generic substrate that enables an arbitrary service to degrade gracefully in a meaningful and useful manner.

We have therefore developed a methodology for designing a distributed system of BFT groups, wherein a faulty group is prevented from corrupting the entire system. We call our method Byzantine Fault Isolation (BFI). BFI makes use of formal specification [15] to constrain the semantic behavior of a faulty system. The technique involves selectively weakening the system semantics and concomitantly strengthening the system design. Formal specification helps determine when these two processes have satisfactorily converged.

The next section surveys previous approaches to resisting Byzantine faults. Section 3 describes the Farsite file system, the context for our work. Section 4 quantifies the value of isolating Byzantine faults, and Section 5 describes our technique. Section 6 shows an example of BFI in Farsite, and Section 7 summarizes.

2 PREVIOUS WORK

In 1980, Pease et al. [23] introduced the problem of reaching agreement among a group of correctly functioning processors in the presence of arbitrarily faulty processors; they proved that a minimum of 3t + 1 processors are needed to tolerate t faults. Two years later, they christened this the Byzantine Generals Problem [14], as it has been known ever since. A large body of research has addressed the problem of Byzantine agreement, the first decade of which is well surveyed by Barborak and Malek [4].

In the mid-to-late 1990's, several researchers combined Byzantine-fault-tolerant protocols with state-machine replication [26] to produce toolkits such as Rampart [24], SecureRing [12], and BFT [6]. These toolkits provide a replication substrate for services written as deterministic state machines. The substrate guarantees that the service will operate correctly as long as fewer than a third of its replicas are faulty. An unfortunate property of these toolkits is that their throughput scales negatively: As the group size grows, the system throughput shrinks, which is the exact opposite of the behavior desired for scalable systems.

Throughput scaling is non-positive because a non-decreasing fraction of replicas redundantly perform each computation. Throughput scaling is made negative by the message load of each processor, which is linear in group size. Some recent research has addressed the latter issue: Lewis and Saia [16] have developed a protocol that probabilistically reaches agreement if fewer than an eighth of the replicas are faulty. The message workload
of each processor is logarithmic in group size, so throughput scaling is less dramatically negative than for BFT. Abd-El-Malek et al. [1] have built a replicated state machine based on the Query/Update (Q/U) protocol, which requires 5t + 1 processors to tolerate t Byzantine faults. In theory, this protocol has zero throughput scaling with system size; however, their implementation exhibits negative throughput scaling, albeit at a lower rate than BFT.

The above systems all exhibit two properties: (1) non-positive throughput scaling and (2) all-or-nothing failure semantics, meaning that failures beyond the tolerated threshold can cause the entire system to fail.

In the absence of a Byzantine-fault-tolerant substrate that provides positive throughput scaling, researchers have built systems that partition their workload among multiple machines. However, as the system size grows, so does the expected number of faulty machines, which in turn – given all-or-nothing failure semantics – leads to an increasing likelihood of total system failure. We observe that this problem could be assuaged if there were some means to limit the spread of Byzantine faults.

Three avenues of research are related to this problem: First, several researchers have isolated Byzantine faults in distributed problems of academic interest, such as dining philosophers [21], vertex coloring [21], and edge coloring [25, 18]. Their solutions employ self-stabilizing protocols to guarantee that correct results are eventually obtained by all nodes that are beyond a specified distance (the "containment radius") from faulty nodes. The formal notion of fault containment for self-stabilizing systems was introduced by Ghosh et al. [10], who applied it only to transient faults. Such transient-fault containment was achieved by Demirbas et al. [8] for the problem of tracking in sensor networks. None of this research offers a broad approach to containing Byzantine faults.

Second, a number of researchers have investigated ways to limit Byzantine corruption when performing broadcast [3], multicast [22], or gossip [17, 20]. These closely related problems have no computational aspect; they merely propagate data. Furthermore, they have the property that correct operation implicitly replicates all of the data to all machines. The resulting redundancy enables machines to vote on the data's correctness, as in the original Byzantine agreement problem.

Third, some researchers have tackled specialized subclasses of the general problem. Merideth [19] proposes a proactive fault-containment system that relies on fault detection, a well-known specialization of fault-tolerance problems [7]. Krings and McQueen [13] employ standard Byzantine-fault-tolerant protocols only for carefully defined "critical functionalities." The TTP/C protocol [5] isolates only a constrained subset of Byzantine faults, namely reception failures and consistent transmission failures.

Thus, every known technique for building systems that resist Byzantine faults has at least one of the following weaknesses:
• Its throughput does not increase with scale.
• It addresses only a narrow academic problem.
• It does not support computation.
• It does not address general Byzantine faults.

3 CONTEXT – FARSITE FILE SYSTEM

We developed BFI in the context of a scalable, peer-to-peer file system called Farsite [2]. Farsite logically functions as a centralized file server, but it is physically distributed among the desktop workstations of a university or corporation, which may have over 10^5 machines. In this environment, independent Byzantine faults are significantly more likely than they would be in a physically secure server cluster.

Farsite employs different techniques for managing file content and metadata. File content is encrypted, replicated, and distributed among the machines in the system, and a secure hash of each file's content is maintained with its metadata. File metadata is managed by BFT groups of workstations; we call each group a "server". Every machine in a Farsite system fills three roles: a client for its local user, a file host storing encrypted content of data files, and a member of a BFT group that acts as a server for metadata.

File metadata is dynamically partitioned among servers, as follows: A Farsite system is initialized with a single server, called the root, which initially manages metadata for all files. When the metadata load on the root becomes excessive, it assembles a randomly chosen set of machines into a new BFT group, and it delegates a subset of its files to this newly formed server. This process continues as necessary, resulting in a tree of delegations. The fanout of the tree is a matter of policy.

The only difference between directories and data files is that the former may have no content and the latter may have no children. This is a small enough distinction that we refer to them both simply as "files".

4 MOTIVATION

This section argues for the value of isolating Byzantine faults in a scalable peer-to-peer system. In particular, we consider a distributed file system, specifically a Farsite-like system of interacting BFT groups. For analytical simplicity, we assume that the system's files are partitioned evenly among the constituent BFT groups, and we assume independent machine failure. For concreteness, we assume a machine fault probability of 0.001; in the analysis, we discuss our sensitivity to this value. We evaluate the operational fault rate, which is the probability that an operation on a randomly selected file exhibits a fault.
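These assumptions are easy to exercise directly in simulation. The sketch below (illustrative Python of our own, not part of Farsite; the function names are hypothetical) estimates the probability that a single BFT group fails to mask its faults, under the independent-machine-failure assumption, and checks the estimate against the exact binomial value:

```python
import random
from math import comb

def exact_group_fault(g, pm):
    """P(a BFT group of size g cannot mask its faults): more than
    floor((g - 1)/3) of its members are faulty (the BFT threshold)."""
    t = (g - 1) // 3
    return sum(comb(g, k) * pm**k * (1 - pm)**(g - k) for k in range(t + 1, g + 1))

def simulated_group_fault(g, pm, trials, rng):
    """Monte Carlo estimate: draw each machine's fault independently."""
    t = (g - 1) // 3
    faulty_groups = sum(
        sum(rng.random() < pm for _ in range(g)) > t for _ in range(trials)
    )
    return faulty_groups / trials

rng = random.Random(42)
# With the paper's pm = 0.001 the event is too rare to simulate cheaply,
# so this demo inflates pm; the estimate tracks the exact value.
pm, g = 0.05, 4
est = simulated_group_fault(g, pm, 200_000, rng)
exact = exact_group_fault(g, pm)   # P(>= 2 of 4 faulty) ≈ 0.014
```

The closed-form model in the next section computes the same quantity analytically, for the realistic fault probability.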
4.1 Model

If a third or more of the machines in a BFT group are faulty, the group cannot mask the fault. Therefore, if Pm is the probability that a machine is faulty, the probability that a group of size g manifests a fault is:

    Pg = 1 − B(⌊(g − 1)/3⌋, g, Pm)

Function B is the cumulative binomial distribution function. In a system of n BFT groups, the probability that at least one group manifests a fault is:

    Ps = 1 − B(0, n, Pg)

We consider three cases. In the first case, there is no fault-isolation mechanism, so a single faulty group may spread misinformation to other groups and thereby corrupt the entire system. The probability that any given file operation exhibits a fault is thus equal to the probability Ps that the system contains even one faulty group. This is shown by the three dashed lines in Fig. 1 for BFT groups of size 4, 7, and 10, as the count of groups scales up to 10^5.

The second case illustrates ideal fault isolation. Each BFT group can corrupt only operations on the files it manages, so the probability of a faulty file operation equals the probability Pg that the file's managing group is faulty. System scale is thus irrelevant, as illustrated by the dark line in Fig. 1 for 4-member BFT groups.

The third case illustrates BFI in Farsite. Pathname lookups involve metadata from all files along the path, so a faulty group can corrupt lookups to its descendent groups' files. Recall that the count of nodes in a tree with l levels and node fanout of f is:

    N(f, l) = (f^l − 1) / (f − 1)

In a tree of N(f, l) BFT groups, the expected number of groups that are faulty or have a faulty ancestor is defined by the recurrence:

    F(f, l) = Pg · N(f, l) + (1 − Pg) · f · F(f, l − 1)
    F(f, 1) = Pg

Thus, the probability of a faulty absolute-path-based file operation in a system of N(f, l) BFT groups is:

    Pt = F(f, l) / N(f, l)

This is illustrated by the light and medium lines in Fig. 1 for systems in which each 4-member BFT group has 4 or 16 children, respectively. Since not all file operations are path-based, and since paths can be specified relative to open files, the operational fault rate will actually be somewhat lower than that indicated by Fig. 1.

4.2 Analysis

As the light dashed line shows, a single 4-member group has an operational fault rate of 6×10^−6, which is better than five nines. However, when the scale reaches 10^5 groups, the operational fault rate is 0.45; almost half of all operations exhibit faults. By contrast, with a group fanout of 16, a 10^5-group Farsite system exhibits an operational fault rate of 3×10^−5, showing that fault tolerance is largely preserved as scale increases.

An alternate way to improve the large-scale operational fault rate is to increase the size of BFT groups. However, as Fig. 1 shows, in a system of 10^5 groups, the group size must be increased from 4 to 10 to achieve the operational fault rate that BFI achieves. This increase cuts the system throughput by 60% (= 1 − 4/10) or more, because it increases the redundant computation by a factor of 2.5 (= 10/4) and the intra-group message traffic by a factor of 6.25 (= 10² / 4²).

The curves in Fig. 1 are based on a machine fault rate of 10^−3. For higher rates, the benefit of BFI is even greater, since a larger increase in group size (and corresponding drop in throughput) is needed to achieve the same operational fault rate as BFI. By contrast, a lower machine fault rate reduces the benefit, although not by much: Even with a machine fault probability of 10^−5 in a system of 10^5 groups, BFI enables 4-member groups to achieve an operational fault rate that would otherwise require 7-member groups and an attendant drop in throughput of over 42% (= 1 − 4/7).
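The model is easy to evaluate numerically. The following standard-library Python sketch (our own illustration; the function and variable names are not from the paper) computes Pg, Ps, and Pt and reproduces the headline numbers of the analysis:

```python
from math import comb

def binom_cdf(k, n, p):
    """Cumulative binomial distribution B(k, n, p): P(X <= k), X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def group_fault_prob(g, pm):
    """Pg: a group of size g manifests a fault when a third or more are faulty."""
    return 1 - binom_cdf((g - 1) // 3, g, pm)

def system_fault_prob(n, pg):
    """Ps: at least one of n groups manifests a fault (no isolation)."""
    return 1 - binom_cdf(0, n, pg)

def tree_fault_rate(f, l, pg):
    """Pt = F(f, l) / N(f, l): fault rate for absolute-path operations."""
    n = lambda depth: (f**depth - 1) // (f - 1)     # N(f, l)
    F = pg                                          # F(f, 1) = Pg
    for depth in range(2, l + 1):
        F = pg * n(depth) + (1 - pg) * f * F        # the recurrence above
    return F / n(l)

pm = 0.001
pg = group_fault_prob(4, pm)        # ~6e-6 for a 4-member group
ps = system_fault_prob(10**5, pg)   # ~0.45: no isolation, 1e5 groups
pt = tree_fault_rate(16, 5, pg)     # ~3e-5: BFI, fanout 16 (~7e4 groups)
```

With Pm = 0.001, these values match the analysis: roughly 6×10^−6 per group, 0.45 for 10^5 unisolated groups, and 3×10^−5 for a fanout-16 delegation tree.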
[Fig. 1: Faults at scale (machine fault probability = 0.001). Log-log plot of operational fault rate (1×10^−7 to 1×10^0) versus system scale (1 to 100,000 BFT groups), with curves for: BFT 4, no BFI; BFT 7, no BFI; BFT 10, no BFI; BFT 4, tree (4) BFI; BFT 4, tree (16) BFI; and BFT 4, ideal BFI.]

5 BYZANTINE FAULT ISOLATION

BFI is a technique that prevents a faulty BFT group from corrupting an entire system. The technique is based on using formal specification to design a distributed system with well-defined semantics [15]. BFI semantically specifies the faulty behavior that can manifest when faults occur in the distributed system.

Several approaches may be used to validate that the system design adheres to the specified fault semantics. In our work, we used only informal proof, but greater confidence can be attained through model checking [15] or mechanically verified formal proof [9]. Our limited experience with such methods suggests that they would be challenging to apply to a formal spec as big as ours.
5.1 Formal Distributed System Specification

We follow an approach to formal specification advocated by Lamport [15]. This approach has three components: a semantic spec, a distributed-system spec, and a refinement.* Each of the two specs defines state information and actions that affect the state.

    * Lamport uses different terminology because his approach is applicable to a broader class of problems than distributed systems.

A semantic specification describes how users experience the system. Farsite logically functions as a centralized file server, so Farsite's semantic spec defines the behavior of a centralized file server. The state defined by the semantic spec includes files, handles, and pending operations. The actions defined by the semantic spec are file-system operations: open, close, read, write, create, delete, and move/rename. For example, the semantic spec says that the open operation resolves a pathname, checks access rights to the file, adds a new handle to the set of handles open on the file, and returns the handle to the caller. The spec also describes the error codes returned if a pathname fails to resolve or the access rights are invalid. In sum, the semantic spec characterizes the system from the users' point of view.

A distributed-system specification describes how a set of machines collaborate to produce a desired behavior. The state defined by the distributed-system spec includes machines, abstract data structures maintained by machines, and messages exchanged between machines. The actions defined by the distributed-system spec are tasks performed by individual machines or BFT groups: sending messages, receiving messages, and modifying local-machine state. For example, the distributed-system spec says that when a server receives a message from a client asking for the necessary resources to perform an open operation, if the server has the resources available, it sends them to the client; otherwise, it begins the process of recalling the resources from other clients that are holding the resources. In sum, the distributed-system spec characterizes the system from the system designers' point of view.

A refinement is a formal correspondence between the semantic spec and the distributed-system spec. It describes how to interpret distributed-system state as semantic state and distributed-system actions as semantic actions. For a state example, the refinement may say that the attributes of a particular semantic-level file are defined by a particular data structure on a particular machine in the distributed system; the particulars express the system designer's rules for what data means and which data is authoritative.

Actions in the distributed-system spec are classified as either foreground or background actions, according to whether they have semantic effects. A foreground action corresponds to an action in the semantic spec; for example, updating a certain distributed-system data structure may correspond to the completion of an open operation in the semantic spec. A background action corresponds to a non-action in the semantic spec; for example, passing resources to a client by means of a message has no semantic effect.

The basic distributed-system spec defines interactions among a set of non-faulty machines. The basic semantic spec defines the user-visible behavior of a fault-free system. If the system design is correct, the refinement will show that the distributed-system spec satisfies the semantic spec.

5.2 Specifying Failure Semantics

To understand the system's behavior under fault conditions, we modify the distributed-system spec as follows: We augment the state of each machine with a new flag indicating whether the machine is corrupt, and we add a new action that sets a machine's corrupt flag. When a machine is corrupt, its local state is undefined, and it can send arbitrary messages to other machines at any time.

These modifications to the distributed-system spec prevent it from refining to the basic semantic spec. The BFI technique proceeds by progressively modifying the two specs until the distributed-system spec again satisfies the semantic spec:
• The semantic spec is weakened to describe how faults appear to the users of the system.
• The distributed-system spec is strengthened to quarantine faults from non-corrupt machines.

The art of Byzantine fault isolation is in determining what modifications to make to the two specs. An overly weak semantic spec may not isolate faults sufficiently; an overly strong distributed-system spec may not facilitate a practical and performant implementation; and a semantic spec that is too strong or a distributed-system spec that is too weak will not admit a satisfactory refinement.

In Farsite, we modified the semantic spec by associating a flag with each file to indicate whether the file is tainted, and we added an action that sets the flag. We established a refinement in which a file becomes tainted if and only if the BFT group that manages the file becomes corrupt.

It would have been ideal to weaken the semantic spec so little that tainted files could not corrupt operations on non-tainted files. Unfortunately, because path-based file operations involve metadata from all files along the path, we were unable to design a distributed system that refined to such a semantic spec. We thus had to weaken the semantic spec further, permitting tainted files to lie about their parents and children and thereby to corrupt path-based operations on the tainted file's descendents.
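One way to picture the corrupt flag, the tainted flag, and the refinement between them is as two state sets plus an interpretation function. The sketch below is our own illustrative Python (the real Farsite specs are written formally and are far richer than the two flags modeled here; all names are hypothetical):

```python
# Distributed-system spec state (sketch): which BFT group manages each
# file's metadata, and which groups have taken the "become corrupt" action.
dist_state = {
    "manager": {"/": "g0", "/a": "g1", "/a/b": "g1", "/c": "g2"},
    "corrupt": set(),
}

def corrupt_group(state, group):
    """The added distributed-system action: set a group's corrupt flag.
    Once set, the group's local state is undefined (not modeled here)."""
    state["corrupt"].add(group)

def refine(state):
    """Refinement (sketch): interpret distributed-system state as semantic
    state. The weakened semantic spec adds a per-file tainted flag, and the
    refinement sets it iff the file's managing BFT group is corrupt."""
    return {f: (g in state["corrupt"]) for f, g in state["manager"].items()}

corrupt_group(dist_state, "g1")
semantic = refine(dist_state)
# Files managed by g1 become tainted; files of non-corrupt groups do not.
```

The interesting engineering, described next, is in constraining what tainted files may appear to do.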
We were, however, able to constrain those lies to prevent a tainted file from affecting operations on files elsewhere in the namespace. In particular, a tainted file cannot falsely claim an existing non-tainted file as its child or parent. Exhaustively, the weakened safety semantics allow a tainted file to appear to…
(1) …have arbitrary contents and attributes,
(2) …not be linked into the file tree,
(3) …not have children that it actually has,
(4) …have children that do not actually exist, or
(5) …have another tainted file as a child or parent.
In addition, the weakened liveness semantics allow operations involving a tainted file to not complete.

The modifications to Farsite's distributed-system spec are far more involved. Some of the key principles include maintaining redundant information across BFT group boundaries, augmenting messages with information that justifies their correctness, ensuring unambiguous chains of authority over data, and carefully ordering messages and state updates for operations involving more than two BFT groups. We illustrate an example in the next section.

6 BFI EXAMPLE: MOVE OPERATION

Our BFI modifications to the distributed-system spec are too extensive to permit even a high-level summary in this paper. Nonetheless, to convey the flavor of these modifications, we outline one example of how we modified Farsite's distributed-system spec to satisfy the semantic spec. The example exploits a redundancy that we added for BFI, namely that parent-child links are stored by both parent and child files. If a client observes inconsistent information about the link between a parent and a child, the client regards that link as nonexistent.

The specific example we present is a component of the most complex file-system operation, move, whose semantics are that an object file has its parent changed from a source file to a destination file, thereby changing the full pathnames of all descendents of the object file.

The object, source, and destination files might be managed by three separate servers, each of which may be faulty or not. We will not present our full case analysis here, but we will describe the highlights.

If all non-faulty servers agree on whether the move succeeds, the refinement defines the semantic result as the result obtained by the non-faulty servers. Detailed analysis shows that this satisfies the fault semantics enumerated above. This rule also covers the case in which all servers are faulty, because then any semantic result is consistent with the fault semantics.

If no servers are faulty, our protocol ensures that all servers agree on the result.

The tricky cases are those in which one server is faulty and the other two disagree on the result. It would be ideal to somehow prevent these cases from ever occurring, but this is provably impossible [23]. As Table 1 shows, for each case (faulty object, faulty source, and faulty destination), refinement can select a satisfactory semantic result if the other servers disagree in a particular way (the a subcases) but not if they disagree in the opposite way (the b subcases).

Table 1
Case    Object    Source    Dest.     Semantic
1a      faulty    success   failure   either
1b      faulty    failure   success   none
2a      success   faulty    failure   failure
2b      failure   faulty    success   none

For example, in case 1, the object server is faulty. In subcase a, the source believes the move to be successful so it unlinks the object, but the destination believes the move to be unsuccessful so it fails to link the object. Consequently, the object becomes unlinked from the file tree. However, safety weakness 2 (see previous section) states that a tainted file may appear to not be linked into the file tree. Thus, our refinement could interpret the outcome either as a tainted file successfully moving and then not appearing linked into the tree or as a tainted file failing to move and then not appearing linked into the tree.

The analysis of the a subcases for cases 2 and 3 is similar albeit slightly more complex, because 2a must be interpreted as a failed move and 3a must be interpreted as a successful move, to be consistent with the safety weaknesses allowed by the failure semantics.

In the b subcases, no semantic result is consistent with the distributed-system outcome. For example, in subcase 1b, the destination links the object but the source does not unlink it. If the refinement were to interpret this as a successful move, the destination file would become the object's parent, but because the source thinks the move failed, it still believes that it is the object's parent, so the object could pretend that the source is in fact its parent, which our failure semantics disallow. A similar argument holds for interpreting the result as a failed move. Since it is impossible to refine the subcase b outcomes, Farsite must prevent them.

Our protocol precludes the b subcases by ensuring that the non-faulty servers observe certain ordering constraints before declaring a move operation successful: The object server does not commit until after the source server commits, and the destination server does not commit until after the source and object servers commit. Examination of the table shows that this prevents the problematic subcases. For example, subcase 1b cannot
       a     failure   success     faulty     success         occur because the destination will not declare success
    3
       b    success    failure     faulty       none          until it first hears that the source has declared success,
      Table 1: Fault case analysis for move operation         which it has not.
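To make the case analysis concrete, the following Python sketch (our own encoding; the names, tuple keys, and string labels are illustrative, not Farsite's) enumerates Table 1 and checks that the commit-ordering rule excludes exactly the unrefinable b subcases:

```python
# Each of the three servers (object, source, destination) declares "success"
# or "failure" for a move; exactly one server per row is marked "faulty".
# Table 1 maps each disagreement pattern to its semantic interpretation,
# with None marking the unrefinable b subcases.
TABLE_1 = {
    # (object,    source,    dest)    -> semantic result
    ("faulty",  "success", "failure"): "either",   # 1a
    ("faulty",  "failure", "success"): None,       # 1b: unrefinable
    ("success", "faulty",  "failure"): "failure",  # 2a
    ("failure", "faulty",  "success"): None,       # 2b: unrefinable
    ("failure", "success", "faulty"):  "success",  # 3a
    ("success", "failure", "faulty"):  None,       # 3b: unrefinable
}

def ordering_ok(obj, src, dst):
    """True iff the non-faulty declarations respect the commit order
    (source first, then object, then destination); a faulty server's
    declaration imposes no constraint."""
    succeeded = lambda x: x == "success"
    failed = lambda x: x == "failure"
    if succeeded(obj) and failed(src):
        return False  # object commits only after the source commits
    if succeeded(dst) and (failed(src) or failed(obj)):
        return False  # destination commits only after source and object
    return True

# The ordering rule permits exactly the refinable rows of Table 1.
for (obj, src, dst), semantic in TABLE_1.items():
    assert ordering_ok(obj, src, dst) == (semantic is not None)
```

The assertion loop passes because in every "none" row a later-committing non-faulty server declares success while an earlier one declares failure, which the ordering rule forbids; the a subcases all respect the order.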
7     SUMMARY
    Although Byzantine Fault Tolerance (BFT) allows a trusted service to run on a peer-to-peer system of untrusted machines, it does not support scaling up to increase system throughput. Scale-up can be achieved by partitioning a workload among multiple BFT groups, but this leads to an increase in the probability of total system failure as the system scale increases.

    To solve this problem, we introduced Byzantine Fault Isolation (BFI), a methodology for using formal specification to constrain the semantic behavior of a faulty system. BFI modifies a system design to formally recognize that machines may become corrupt, wherein they have undefined local state and can send arbitrary messages to other machines. These modifications highlight areas that require design changes to restrict the spread of faulty information.

    Even in the presence of design features that restrict faults, corrupt machines may still affect the system's semantics. Thus, the BFI technique involves modifying the system's defined semantics to reflect the partial correctness achievable when the system is degraded. BFI uses formal specification to ensure that the modified system design satisfies the modified system semantics.

    We quantified the benefit of BFI to scalable systems: in a tree-structured system of 10^5 BFT groups, wherein a faulty group can corrupt its descendants' operations, BFI can enable 4-member BFT groups to achieve the same operational fault rate as 10-member BFT groups, without the corresponding 60% drop in throughput due to increased replication and message traffic.

    We employed the BFI technique in the design of the Farsite distributed file system, a large and complex peer-to-peer system designed to scale to 10^5 machines. BFI guided us toward adding specific redundancies, enriching messages, restricting authority, and constraining the order of distributed-operation steps. Using BFI, we related these design changes to the system's semantics, thereby showing that file corruption cannot spread to unrelated regions of the file-system namespace.

    Prior to BFI, no technique had addressed how to interconnect multiple BFT groups in a way that isolates Byzantine faults.

REFERENCES
[1]  M. Abd-El-Malek, G. Ganger, G. Goodson, M. Reiter, J. Wylie. "Fault-Scalable Byzantine Fault-Tolerant Services." 20th SOSP, 2005.
[2]  A. Adya, W. J. Bolosky, M. Castro, G. Cermak, R. Chaiken, J. R. Douceur, J. Howell, J. R. Lorch, M. Theimer, R. P. Wattenhofer. "FARSITE: Federated, Available, and Reliable Storage for an Incompletely Trusted Environment." 5th OSDI, 2002.
[3]  Y. Azar, S. Kutten, B. Patt-Shamir. "Distributed Error Confinement." PODC, 2003.
[4]  M. Barborak, M. Malek. "The Consensus Problem in Fault-Tolerant Computing." ACM Computing Surveys, 25(2), 1993.
[5]  G. Bauer, H. Kopetz, W. Steiner. "Byzantine Fault Containment in TTP/C." Workshop on Real-Time LANs in the Internet Age, 2002.
[6]  M. Castro, B. Liskov. "Practical Byzantine Fault Tolerance." 3rd OSDI, 1999.
[7]  T. D. Chandra, S. Toueg. "Unreliable Failure Detectors for Asynchronous Systems." JACM, 43(2), 1996.
[8]  M. Demirbas, A. Arora, T. Nolte, N. Lynch. "A Hierarchy-Based Fault-Local Stabilizing Algorithm for Tracking in Sensor Networks." 8th OPODIS, 2004.
[9]  U. Engberg, P. Grønning, L. Lamport. "Mechanical Verification of Concurrent Systems with TLA." 4th Computer Aided Verification, 1992.
[10] S. Ghosh, A. Gupta, T. Herman, S. V. Pemmaraju. "Fault-Containing Self-Stabilizing Algorithms." PODC, 1996.
[11] M. P. Herlihy, J. M. Wing. "Specifying Graceful Degradation." IEEE TPDS, 2(1), 1991.
[12] K. P. Kihlstrom, L. E. Moser, P. M. Melliar-Smith. "The SecureRing Protocols for Securing Group Communication." Hawaii International Conference on System Sciences, 1998.
[13] A. W. Krings, M. A. McQueen. "A Byzantine Resilient Approach to Network Security." 29th International Symposium on Fault-Tolerant Computing, 1999.
[14] L. Lamport, R. Shostak, M. Pease. "The Byzantine Generals Problem." ACM TOPLAS, 4(3), 1982.
[15] L. Lamport. Specifying Systems. Addison-Wesley, 2003.
[16] C. S. Lewis, J. Saia. "Scalable Byzantine Agreement." Technical Report, University of New Mexico, 2004.
[17] D. Malkhi, Y. Mansour, M. K. Reiter. "On Diffusing Updates in a Byzantine Environment." 18th SRDS, 1999.
[18] T. Masuzawa, S. Tixeuil. "A Self-Stabilizing Link-Coloring Protocol Resilient to Unbounded Byzantine Faults in Arbitrary Networks." Laboratoire de Recherche en Informatique Report #1396, 2005.
[19] M. G. Merideth. "Enhancing Survivability with Proactive Fault-Containment." DSN Student Forum, 2003.
[20] Y. M. Minsky, F. B. Schneider. "Tolerating Malicious Gossip." Distributed Computing, 16(1), 2003.
[21] M. Nesterenko, A. Arora. "Tolerance to Unbounded Byzantine Faults." SRDS, 2002.
[22] V. Pappas, B. Zhang, A. Terzis, L. Zhang. "Fault-Tolerant Data Delivery for Multicast Overlay Networks." ICDCS, 2004.
[23] M. Pease, R. Shostak, L. Lamport. "Reaching Agreement in the Presence of Faults." JACM, 27(2), 1980.
[24] M. K. Reiter. "The Rampart Toolkit for Building High-Integrity Services." Theory and Practice in Distributed Systems (LNCS 938), 1995.
[25] Y. Sakurai, F. Ooshita, T. Masuzawa. "A Self-Stabilizing Link-Coloring Protocol Resilient to Byzantine Faults in Tree Networks." 8th OPODIS, 2004.
[26] F. Schneider. "Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial." ACM Computing Surveys, 22(4), 1990.