A 161126
A 161126
1. Introduction
Checkpointing and rollback-recovery are well-known techniques that allow
processes to make progress in spite of failures [Rand78]. The failures under con-
sideration are transient problems such as hardware errors and transaction aborts,
i.e., those that are unlikely to recur when a process restarts. With this scheme, a
process takes a checkpoint from time to time by saving its state on stable storage
[Lamp79]. When a failure occurs, the process rolls back to its most recent check-
point, assumes the state saved in that checkpoint, and resumes execution.
We first identify consistency problems that arise in applying this technique to
a distributed system. We then propose a checkpoint algorithm and a rollback-
recovery algorithm to restart the system from a consistent state when failures
occur. Our algorithms prevent the well-known "domino effect" as well as liveness
problems associated with rollback-recovery. In contrast to previous algorithms,
they are fault-tolerant and involve a minimal number of processes. With our
approach, each process stores at most two checkpoints in stable storage. This
storage requirement is shown to be minimal under general assumptions.
The paper is organized as follows: We discuss the notion of consistency in a
distributed system in section 2 and describe our system model in section 3. In sec-
tion 4 we identify the problems to be solved. Sections 5 and 6 contain the check-
point and rollback-recovery algorithms respectively. The algorithms are extended
for concurrent executions in section 7. In section 8 we consider optimizations. Sec-
tions 9 and 10 contain a discussion and our conclusion.
The transitive closure of the directly happens before relation is the happens before
relation. If event a happens before event b, b happens after a. (We abbreviate hap-
pens before, "before" and happens after, 'after".)
The local state of a process at time 0 is its initial state; the local state of a pro-
ces at time t is the state resulting from applying the sequence of events occurring
in(, t to its initial state. If a process has failed by time t, its local state at t is
undefined. A global state of a system: at time t is the set of all processes' local
states at t. The state of a channel at time t is the set of messages sent over that
channel but not yet received at t. We car depict the occurrences of events over
time with a time diagram, in which horizontal lines are time axes of processes,
points are events, and arrows represent messages from the sending process to the
receiving process. In this representation, a global state is a cut dividing the time
diagram into two halves. The channel states are the arrows (messages) that cross
the cut. Figure 1 is a time diagram for a system of four processes.
Informally, a cut (global state) in the time diagram is consistent if no arrow
starts on the right hand side and ends on the left hand side of it. This notion of
consistency fits the observation that a message cannot be received before it is sent
in any temporal frame of reference. For example, the cuts c and c' in Figure 1 are
consistent and inconsistent cuts, respectively. The channel states corresponding to
cut c consists of one message in the channel from p to q, and one in the channel
from s to r. Readers are referred to [Chan85] for a formal discussion of consistent
global states.
3. System Model
The distributed system considered in this paper has the following characteris-
tics:
I..
,
. , -
.-.
-- , * - ,,, * m
*, . ,. .~'~t, .ia.iadlt * .. .. ..
. . . . . - , - . . . . . . . -. . . . . - . . . . -F. i . 7 , - , r : , - . 2
Page 4
(1) Processes do not share memory or clocks. They communicate via messages
sent through reliable first-;n-first-out (FIFO) channels with variable non-zero
transmission time.
(2) Processes fail by stopping, and whenever a process fails, all other processes are
informed of the failure in finite time. We assume that processes' failures
never partition the communication network.
We want to develop our algorithms under the weakest possible set of assump-
tions. In particular, we do not assume that the underlying system is a database
transaction system ([Fisc82] and [Jose85]). This special case admits simpler solu-
tions: the mechanisms that ensure atomicity of transactions can hide inconsisten-
cies introduced by the failure of a transaction. Furthermore, we do not assume
that processes are deterministic: this simplifying assumption is made in previous
results (e.g., [Stro85] and [Jose85]).
4. Identification of Problems
A checkpoint is a saved state of a process. A set of checkpoints, one per process
in the system, is consistent if the saved states form a consistent global state. For
example, consider the system history in Figure 2. Process p takes a checkpoint at
time X and sends a message to q some time later. After receiving this message, q
takes a checkpoint at time Y. Subsequently, p fails and restarts from the check-
point taken at X. The global state at p's restart is inconsistent because p's local
state shows that no message has been sent to q, while q's local state shows that a
message from p has been received. If p and q are processes supervising a
customer's accounts at different banks, and the message transfers funds from p to
q, the customer will have the funds at both banks when p restarts. This incon-
sistency persists even if q is forced to roll back and restart from its checkpoint
taken at Y.
X failure
P
The problem of ensuring that the system recovers to a consistent state after
transient failures has two components: checkpoint creation and rollback-recovery;
we examine each one in turn.
4.1. Checkpoint Creation
There are two approaches to creating checkpoints. With the first approach,
processes take checkpoints independently and save all checkpoints on stable
storage. Upon a fMiiure; processes must find and agree upon a. consistent set of
checkpoints among the saved ones. The system is then rolled back to and restarted
from this set of checkpoints (Ande79, Russ8O, Wood81, Hadz82J.
With the second approach, processes coordinate their checkpointing actions
such that each process saves only its most recent checkpoint, and the set of check-
points in the system is guaranteed to be consistent. When a failure occurs, the sys-
tem restarts from these checkpoints ITami84].
A disadvantage of the first approach has long been recognized (Rand75,
Pres831 and is named. the "domino effect". We illustrate this effect in Figure 3. In
this example, processes p and q have independently taken a sequence of check-
points. The interleaving of messages and checkpoints leaves no consistent set of
checkpoints for p and q, except the initial one at {Xo, Y 0}. Consequently, after p
fails, both p and q must roll back to the starting point of the computation. For
time-critical applications that require a guaranteed rate of progress, such as real
time process control, this behavior results in unacceptable delays. An additional
disadvantage of independent checkpoints is the large amount of stable storage
required for the saved states.
To avoid, these drawbacks, we pursue the second approach. In contrast to
£Tami84], our method ensures that when a process takes a checkpoint, a minimal
number of additional processes are forced. to take checkpoints.
X Xt X2 X3 faildu
qI
O r2Y' Y.
"..-.
-. .-. '. . . .-' ---................-..-....- ,.."..-.....-".,.."........'.'
.-'.-'.-. ..............-'.-..,..,..-,.-.....-.., -.
Page 6
4.2. Rollback-Recovery
Rollback- recovery from a consistent set of checkpoints appears deceptively
simple. The following scheme seems to work: Whenever a process rolls back to its
checkpoint, it notifies all other processes to also roll back to their respective check-
points. It then installs its checkpointed state and resumes execution. Unfor-
tunately, this simple recovery method has a major flaw. In the absence of syn-
chronization, processes cannot all recover (from their respective checkpoints) simul-
taneously. Recovering processes at different times introduces a liveness problem as
illustrated below.
Consider two processes p and q. Figure 4 illustrates their histories up to the
time p fails. Process p fails before receiving the message n1 , rolls back to its
checkpoint, and notifies q. Then p recovers, it sends m., and receives n 1. After p's
recovery, p has no record of sending m 1, while q has a record of its receipt. There-
fore, the global state is inconsistent. To restore consistency, q must also roll back
(to "forget" the receipt of ml1 ), and notify p. Note that after q rolls back, q has no
record of sending n, while p has a record of its receipt. Hence, the global state is
inconsistent again, and upon notification of q's rollback, p must roll back a second
time. After q recovers, q sends n2 and receives in2 . Suppose p rolls back before
receiving n2 as shown in Figure 5. With the second rollback of p, the sending of
M2 is "forgotten". To restore consistency, q must roll back a second time. After p
recovers it receives n2 , and upon notification of q's rollback, it must roll back a
third time. It is now clear that p and q can be forced to roll back forever, even
though no additional failures occur.
Our ro llback-recovery algorithm solves this liveness problem. It tolerates
failures that occur during its execution, and forces a minimal number of processes
to roll back after a failure. In [Tami84], a single failure forces the system to roll
failure
checkpoints/V
back as a whole. Furthermore, the system crashes (and does not recover) if a
failure occurs while it is rolling back.
S. Checkpoint Creation
53.L Motivation
To create consistent checkpoints, processes can execute an algorithm that is
patterned on two-phase-commit protocols. In the first phase, the initiator q takes a
tentative checkpoint and requests all processes to take tentative checkpoints. If q
learns that all processes have taken tentative checkpoints, q decides all tentative
checkpoints should be made permanent; otherwise, q decides tentative checkpoints
should be discarded. In the second phase, q's decision is propagated and carried out
by all processes. Since all or none of the processes take permanent checkpoints, the
most recent set of checkpoints is always consistent.
However, our goal is to force a minimal number of processes to take check-
points. The above algorithm is modified as follows: A process p takes a tentative
checkpoint after it receives a checkpoint request from q only if q's tentative check-
point records the receipt of a message from p, while p's latest permanent check-
point does not record the sending of that message. Process p determines whether
this condition is true using the label appended to q's request. This labeling scheme
is described below.
Messages that are not sent by the checkpoint or rollback-recovery algorithms
are system messages. Every system message n contains a label m.l. Each process
appends outgoing system messages with monotonically increasing labels. We
define I and 7 to be the smallest and largest labels, respectively. For any
processes r and p, let m be the last message that r received from p after r took its
last permanent or tentative checkpoint. Define:
-- ",' ' . ,' . " . ' " ." . . € "' - " m . . . . . .. .m .. ... ' " -. . . . . .i''
Page 10
.-if m exists
first_smsgr(p) _ otherwise
5.3.2. Description
Process p is a ckpt-cohort of q if q has taken a tentative checkpoint, and
last_rmsgq(p)>1_ before the tentative checkpoint was taken. The set of
ckptcohorts of q is denoted ckpt.cohortq. Every process p keeps a variable
willing.to-ckpt, to denote its willingness to take checkpoints. Whenever p cannot
be interrupted to run the checkpoint algorithm, willing-to.ckptP is "no". The ini-
tiator q starts the checkpoint algorithm by making a tentative checkpoint and
sending a request "take a tentative checkpoint and last_rmsgq(p)" to all
pEckpt-cohortq. A process p znherits this request if willing_.tockptp is "yes" and
lastrrmsgq(p)-firstsmsgp(q)>-L. After p inherits a request, it takes a tentative
checkpoint and sends "take a tentative checkpoint and last-rmsgp(r)" requests to
all rEckpt-cohort.. If p receives but does not inherit a request from q, p replies
willing.to-ckpt, to q.
After p sends out its requests, it waits for replies that can be either "yes" or
-no", indicating a ckpLcohort's acceptance or rejection of p's request. If at least one
reply is "no", willtng.to-ckptp becomes "no"; otherwise willing-to-ckptp is
unchanged. Process p then sends willing__to-ckptp to the process whose request p
has inherited.
If all the replies from its ckpLcohorts arrive and are all "yes", the initiator
decides to take all tentative checkpoints permanent. Otherwise the decision is to
undo all tentative checkpoints. This decision is propagated in the same fashion as
the request "take a tentative checkpoint" was delivered. Between the times a pro-
cess p takes a tentative checkpoint and it receives the decision from the initiator, p
does not send any system messages. Also, after processes take new permanent
checkpoints, they may discard their previous checkpoints.
The algorithm is presented in Figure 7. For simplicity, we create a fictitious
process called daemon to assume the initiation and decision tasks of -he initiator.
In practice, daemon is a part of the initiator process.
Daemon process:
All processes :
INITIAL STATE:
first.smsg,(daemon) = T-,
w'yest if p is willing to take a checkpoint
wiling..to..ckpt "no" otherwise
. .. . *.
Page 12
:-""-:-
- : -.'.:
: "' "- :.''.'' i" . '':/'-:- :.- ',' -,.-." - "'.. . . . . . "-"":"
. . ..-. - .:?'"' "
" "
_ :.,: , " "-. . .,, ***** .***.* iai"d m non - '" "' ' "' '.. - '- '" " "n .. "" .
"b. '. - '"
. u. - W w :r ".- 7. . -_ - - . .. .' . . .. .-- .
-r .,., . .. -.. i
-.. . , .,
Page 13
Since every process receives replies from all its ckpt-cohorts, the initia-
tor will receive replies from all its ckpt-cohorts to decide on the tenta-
tive checkpoints. Its decision is guaranteed to reach all processes that
have taken tentative checkpoints because all processes will pass on the
decision and messages are always delivered. Thus we have shown that
n& process waits forever for replies from its ckpt-cohorts or the
initiator's decision. 0
The next lemma shows that C1 takes a consistent set of checkpoints.
Lemma 4: If the set of checkpoints in the system is consistent before the execution
of Algorithm CI, the set of checkpoints in the system is consistent after
the termination of CL
Proof. Without loss of generality, assume new checkpoints are taken in C1.
The proof is by contradiction. Suppose the set of checkpoints after C1
terminate is not consistent. Then there must exist two processes p and
q suchr that p sent q a message m after making its permanent check-
point, and q received m before making its permanent checkpoint. Since
al checkpoints are consistent before the execution of C1, q must have
taken its permanent checkpoint during this execution. Before q took a
tentative checkpoint in C1, last..rmsgq(p) ti.1; therefore, p was in
ckpt-cohortq and received a request to take a tentative checkpoint from
q. When p received the request, willingto.ckpt, had to be "yes"
because q cannot have taken its tentative checkpoint permanent other-
wise. Moreover, if p had not taken a tentative checkpoint when q's
request arrived, t8Lrrmg (p) afirs .smsg.(q) because
first.smsgp(q) 5m. Hence, process p took a tentative checkpoint after
sending m. Process p, however, must take its tentative checkpoint per-
manent if q takes its permanent. Consequently p takes a permanent
checkpoint after sending m, a contradiction. 0
We now show that the nuriber of processes that take new permanent check-
points during the execution of Algorithm C1 is minimal- Let P-pol,Pi -. , Pk}
be the set of processes that take new permanent checkpoints in CI, where Po is the
initiator of CL Let C(P)={c(po), c(p 1 ), • c(pk)j be the permanent checkpoints
Page 14
•
*. . - . . . . .. •,. . . o. . o . . o . q . . . . . . - . . i - o - - . ° . .' - °
Page 15
Suppose that a process p does not receive the decision regarding its tentative
checkpoint. Ifp undoes its tentative checkpoint or takes it permanent, it risks con-
tradicting the initiator. A common practice in this situation is to have p blocked
until it discovers the initiator's decision [Skee82I. We will discuss ways to obviate
blocking in section S.
We now consider the recovery of faulty processes. When a process restarts
after a failure, its latest checkpoint on stable storage may be tentative or per-
manent. If this checkpoint is tentative, the recovering process must decide whether
to discard it or to take it permanent. The decision is made as follows:
Suppose the recovering process is the initiator. The initiator knows that every
process that has taken a tentative checkpoint is still blocked waiting for its deci-
sion. Hence it is safe for the initiator to decide to undo the tentative checkpoints
and send this decision to its ckpt.coharta
If the recovering process is not the initiator, it must discover the initiator's
decision regarding tentative checkpoints. It may contact either the initiator or
those processes of which it is a ckpt-cohort; it follows the decision accordingly.
Now the recovering process is left with one permanent checkpoint on stable
storage. Recovery is complete when it uses the rollback-recovery algorithm to be
presented in section .6 to restart from this checkpoint.
Let C2 be the Algorithm C1 as modified above. C2 terminates if all processes
that fail during the execution of C2 recover. At termination, the set of checkpoints
in the system is consistent, and the number of processes that took new permanent
checkpoints is minimal The proofs for these properties are similar to those of CI
and are omitted.
6. Rollack-Recovery
We assume that the algorithm is invoked by a single process that wants to roll
back and recover (henceforth. denoted restart). We also assume that the checkpoint
algorithni and the rollback-recovery algorithm are not invoked concurrently. Con-
current invocations of the algorithms are described in section 7.
6.1. Motivation
The rollback-recovery algorithm is patterned on two-phase-commit protocols.
In the first phase, the initiator q requests all processes to indicate their willingness
to restart from their checkpoints. Process q decides to restart all the processes if
and only if they are all willing to restart. In the second phase, q's decision is pro-
pagated and carried out by all processes. We will prove that the two-phase struc-
ture of this algorithm prevents the liveness problem discussed in section 4.2. Since
all or none of the processes restart, when the rollback-recovery algorithm ter-
minates the global state is consistent.
However, our goal is an algorithm that rolls back a minimal number of
processes in order to recover from a failure. If a process p rolls back to a state
saved before an event e occurred, we say that e is undone by p. With our algo-
rithm, process p must restart only if q's rollback will undo the sending of a mes-
sage to p. Process p determines if it must restart using the label appended to q's
request.
For any processes r and p, let m be the last message that r sent to p before r
took its latest permanent checkpoint. Define
m.m if m exists
last-smsgr(p) T otherwise
When q requests p to restart, it appends last_smsgq(p) to its request. Process p
restarts from its permanent checkpoint if last-rmsgp(q) >last.smsgq(p).
6.2. Description
Process p is a roll -cohort of q if q can send messages to it. The set of roll-
cohorts of q is roll-cohortq2 . Every process p keeps a variable willing.Jo...rollP to
denote its willingness to roll back. The initiator q starts the rollback-recovery algo-
rithm by sending a request "prepare to roll back and lastsmsgq(p)" to all
pEroll-cohortq. A process p inherits this request if willing.to..rollp is "yes",
lastrmsgp(q)>last__smsgq(p), and p has not already inherited another request to
roll back. After p inherits the request, it sends "prepare to roll back and
last.smsgp(r)" to all. rEroll -cohort ,; otherwise, it replies willing._to-rollP to q.
'The relationship between roll -cohort and ckpt-cohort is not symmetric. if p is a ckpt-cohort of
q, lastrm.s 4 (p)>I and q must then be a roll -cohort of p. On the other hand, it is possible that
pfckpt-cohortq but qEroll -cohort:, because p can but does not send messages to q.
Page 17
After p sends out its requests, it waits for replies from each process in
rol-cohort.,. The reply can be an explicit "ys or "no message, or an implicit
ttno" when p discovers that r has failed. If at least one reply is "no",
willing-Jo.roll,, becomes "no", otherwise willingjo-rollp is uinchanged. Process p
then sends wiling-to..roll, to the process whose request p inherits. Between the
times p inherits the rollback request and it receives the decision from the initiator,
it does not send any system messages.
If all the replies from its roll-cohorts arrive and are all "yes", the initiator
decides the rollbacks will proceed, otherwise it decides no process will roll back.
This decision is propagated to all processes in the same fashion as the request
"tprepare to roll back"' is delivered. Process p blocks waiting for the discovery of the
initiator's decision, if failures prevent the decision from reaching p. We discuss
non-blocking algorithms in section 8.
The rollback-recovery algorithm is presented in Figure 8. Like the presenta-
tion of Algorithm C1, we introduce a fictitious process called daemon to perform
functions that are unique to the initiator of the algorithm.
Daemon process:
All processes p:
INITIAL STATE:
ready_to_.ro1I =
" true;
ast..rmsg, daenon) = T;
="yes" if p is willing to roll back
Lwdhnlg.torol1 = l~no" otherwise
-- . . . .. . . .J
Page 19
Lemma : After every process has terminated its execution of Algorithm R, for
each. send event that was undone, its corresponding receive event was
also undone.
Proof. Without loss of generality, assume that the initiator decides to roll
back. The proof is by contradiction. Suppose that after Algorithm R
terminates, there exists a message m such that the receiver p did not
undo the receipt of n while the sender q undid the sending of M. First,
we show that p inherited a request to roll back. Since q cannot send
system messages after inheriting a rollback request, q must have sent
mnbefore inheriting the request. And since q undid the sending of m,
rn.I >Iastsmsgq(p). Therefore, when p receives q's request,
lasLtrmsg,(q) 2.m.1 >las.snsgq(p). In addition, the variable
willing.toroll, must have been "yes"; otherwise the initiator cannot
have decided to roll back. Consequently, when q's request reached p,
either p had. already inherited a rollback request or it inherited q's
request.
. . . . . •. - o . . . . .
.- ° , .-. . . . . . .-. . . .• ... ,. , ..-. •-.-.-.,. o . . ., , - -
Page 20
Proof Without loss of generality, assume IPI a:2. The if part is by lemma 8.
We show the only if part by contradiction. Suppose that there exists a
Q such that even if all processes in Q do not roll back, for each send
event that is undone by Algorithm R, its corresponding receive event is
undone. For any processes p and q, if p inherits a rollback request
from q, ready-to-rollq becomes true before ready-oroll, becomes true.
Therefore, the inherit relation is non-circular. Because of this non-
circularity and the fact that the initiator is in Q, there exists q EQ such
that q inherits a rollback request from another process p outside of Q.
Since q EP, p EP. When q inherits p's request,
lastrmsgq(p)>last_.smsgp(q). Let m be the message such that
m.l=last_.rmsgq(p). If processes in Q do not roll back while those in
P-Q do, p undoes the sending of m while q does not undo the receipt
of m, a contradiction.
7. Interference
In this section, we consider concurrent invocations of the checkpoint and
rollback-recovery algorithms. An execution of these algorithms by process p is
interfered with if any of the following events occur:
(1) Process p receives a rollback request from another process q while executing
the checkpoint algorithm.
(2) Process p receives a checkpoint request from q while executing the rollback-
recovery algorithm.
(3) Process p, while executing the checkpoint algorithm for initiator i, receives a
checkpoint request from q, but q's request originates from a different initiator
than i.
(4) Process p, while executing the rollback-recovery algorithm for initiator i,
receives a rollback request from q, but q's request originates from a different
initiator than i.
One single rule handles the four cases of interference: once p starts the execu-
tion of a checkpoint [rollback] algorithm, p is unwilling to take a tentative check-
point [roll back] for another initiator, or to roll back [take a tentative checkpoint].
As a result, in all four cases, p replies "no" to q. We can show that this rule
suffices to guarantee that all previuas lemmas and theorems hold despite con-
current invocations of the algorithms. This rule can, however, be modified to per-
mit more concurrency in the system. The modification is that in case (1), instead of
Page 21
8. Optimization
When the initiator of the checkpoint or of the rollback-recovery algorithm fails
before propagating its decision to its cohorts, it is desirable for processes not to
block waiting for its recovery. To prevent processes from blocking, we can modify
our algorithms by replacing the underlying two-phase commit protocol with a non-
blocking three-phase commit protocol [Skee82J. However, non-blocking protocols
are inherently more expensive than blocking ones [Dwor83].
We now address the following problem: after a ckpt-cohort q of a process p
fails, p is unable to take a permanent checkpoint until q recovers (p cannot know if
the latest checkpoint of q records the sendings of all messages it received from q).
To avoid waiting for q's recovery, p can remove q from ckpt-cohortp by restarting
frour its checkpoint (using the rollback-recovery algorithm). Thereafter, process p
can take checkpoints.
9. Message Loss.
Rollback-recovery can cause message loss as illustrated in Figure 9. When p
is rolled back to X following a failure at F, the global state is consistent, but the
message m from q is lost It is lost because the set of checkpoints {X, Y}
corresponds to a consistent global state with m in the channel.
One method to circumvent message loss requires that processes use transmis-
sion protocols that transform lossy channels to virtual error-free channels, e.g.,
sliding window protocols [Tane81]. Another method is to ensure that the most
recent set of checkpoints corresponds to a consistent global state with no messages
railum
q
F
FIG. 9. Message loss following p's rollback to X.
10. Conclusion
We have presented a checkpoint algorithm and a rollback-recovery algorithm
to solve the problem of bringing a distributed system to a consistent state after
transient failures. In contrast to previous algorithms, they tolerate failures that
occur during their executions. Furthermore, when a process takes a checkpoint, a
minimal number of additional processes are forced to take checkpoints. Similarly,
when a process restarts after a failure, a minimal number of additional processes
are forced to restart with it. We also show that the stable storage requirement of
our algorithms is minimal.
Acknowledge ments We would like to thank Amr El Abbadi, Ken Birman, Rance
Cleaveland, and Jennifer Widomn for commenting on earlier drafts of this paper.
Page 23
Bibliography
-A
Page 24
• 5"..
. "- -.'.. ..
.""-••- -. '.............. ....... ..- .. i.......:- -.... '..... ... :.-...]
:. .....
- . .,.G
. .. .-- --.... '. a ,.::. -, .- : -: ,, ,,.-. . - -. . - . . -.. . .
, .
,,