University of Pennsylvania
CIS 5050
Software Systems
Linh Thi Xuan Phan
Department of Computer and Information Science
University of Pennsylvania
Lecture 13: Fault tolerance
March 28, 2024
©2016-2024 Linh Thi Xuan Phan
Announcements
• Each team please email me the GitHub IDs of your
team members
• Needed for creating the team GitHub repository
2
©2016-2024 Linh Thi Xuan Phan
Recall: Replication
• How does replication help?
– Durability: Once data is stored, the system should not lose it
– Availability: Data should be accessible whenever we need it
– Speed: Accessing the data should be fast
– Scalability: Distribute load and functionality across machines
• What do we need for replication?
– A replication protocol (how to propagate updates to all replicas)
– A consistency model (what is the ‘right’ value to return to the client)
• Are there any new challenges with replication?
3
©2016-2024 Linh Thi Xuan Phan
Plan for today
• Distributed commit NEXT
– Two-phase commit (2PC)
– Three-phase commit (3PC)
• Logging and recovery
– Centralized checkpointing
– Chandy-Lamport algorithm
4
©2016-2024 Linh Thi Xuan Phan
Why distributed commit?
• Suppose a large bank is operating a ‘shared’
account database
– Example: Node #1 has account data for customers whose first
names start with A, node #2 has B, node #3 C, ...
• Now suppose Alice wants to send $100 to Bob
– This involves changes on two separate nodes: Adding $100 to Bob's
account (node #2), and taking $100 from Alice's account (node #1)
– What if node #2 already finished the former, and then node #1
notices that there isn't enough money in Alice's account?
– What if node #1 finishes its part, but node #2 crashes before it can
complete the operation?
– ...
5
©2016-2024 Linh Thi Xuan Phan
Atomicity
• Goal: We need to ensure atomicity
– Either all parts of the transaction are completed, or none of them is!
– This is one of the four classical ACID properties from databases
• Atomicity, Consistency, Isolation, Durability
• How can we ensure atomicity?
• Idea: Let's do one-phase commit
– We elect one node as the coordinator; the others are subordinates
– The coordinator tells all the subordinates whether to finalize their part of
the transaction (commit), or whether to undo it (abort)
– The subordinates do what they are told, and then acknowledge
• Is this a good solution?
6
©2016-2024 Linh Thi Xuan Phan
Why one-phase commit fails
[Figure: the coordinator sends COMMIT messages to subordinates A, B, and C, but crashes in the middle.
One subordinate answers "OK!", another protests "But I already aborted!", and a third never receives
the message ("???").]
• Problem #1: Subordinate cannot independently abort the
transaction
• Problem #2: Coordinator might crash before subordinate
receives the message; partial results are lost
7
©2016-2024 Linh Thi Xuan Phan
Two-Phase Commit (2PC)
• Idea: We need two rounds of communication!
• First round: Voting
– Coordinator sends prepare message to each subordinate
– Each subordinate responds with yes if it is able to commit its part of
the transaction, otherwise with no
• If the subordinate needs locks to commit, it needs to acquire them before
responding with yes (why?)
• Second round: Decision
– Coordinator sends commit or abort to each subordinate
– Each subordinate responds with an ack
• Result: Any site can decide to abort a transaction!
8
©2016-2024 Linh Thi Xuan Phan
2PC: What about crashes?
• We also need to handle the case where a node
crashes in the middle
– Nodes need to be able to 'remember' some information!
• Idea: Each node is given a local, persistent log
– This log could be stored on disk, or in NVRAM
• Is this enough to ensure that data remains persistent despite a possible crash?
– When a node recovers after a crash, it can look at its local log to
figure out what the next steps should be
9
©2016-2024 Linh Thi Xuan Phan
2PC: Steps in more detail
• When a transaction wants to commit:
– The coordinator sends a prepare message to each subordinate
– Subordinate force-writes an abort or prepare log record, then
sends a no (abort) or yes (prepare) message to coordinator
– The coordinator then considers the votes:
• If everyone has voted yes, it force-writes a commit log record and sends
commit message to all subordinates
• Else, it force-writes an abort log record and sends an abort message
– The subordinates force-write abort/commit log records based on
the message they get, and then send an ack message to
coordinator
– The coordinator writes an end log record after getting all the
acks
• Why is the 'end' record useful?
10
©2016-2024 Linh Thi Xuan Phan
2PC: Protocol (1/2)
void coordinator(Transaction t, Set nodes)
{
    log.write("BEGIN");
    foreach (n : nodes)
        send(n, "PREPARE");

    Set responses = new Set();
    bool allInFavor = true;
    if (!t.localPartCanCommit())
        allInFavor = false;

    while (!responses.equals(nodes) &&
           !timeout() && allInFavor) {
        Node sender;
        Message msg = recv(&sender);
        responses.add(sender);
        if (msg == "NO")
            allInFavor = false;
    }
    if (timeout())
        allInFavor = false;

    String result;
    if (allInFavor)
        result = "COMMIT";
    else
        result = "ABORT";

    log.write(result);    // commits the result
    foreach (n : nodes)
        send(n, result);
    if (result == "COMMIT")
        t.performLocalPart();

    Set finished = new Set();
    while (!finished.equals(nodes)) {
        Node sender;
        Message msg = recv(&sender);
        if (msg == "STATUS?")
            send(sender, result);
        if (msg == "ACK")
            finished.add(sender);
    }
    log.write("END");
}
11
©2016-2024 Linh Thi Xuan Phan
2PC: Protocol (2/2)
void subordinate(Transaction t, Node coordinator)
{
    log.write("BEGIN");
    while (true) {
        Message msg = recvFrom(coordinator);
        if (msg == "PREPARE") {
            if (t.localPartCanCommit()) {
                log.write("PREPARE");
                send(coordinator, "YES");
            } else {
                log.write("ABORT");
                send(coordinator, "NO");
            }
        } else if (msg == "COMMIT") {
            log.write("COMMIT");
            t.performLocalPart();
            log.write("END");
            send(coordinator, "ACK");
            break;
        } else if (msg == "ABORT") {
            log.write("ABORT");
            log.write("END");
            send(coordinator, "ACK");
            break;
        }
    }
}
12
©2016-2024 Linh Thi Xuan Phan
2PC: Illustration
• Message flow (coordinator on the left, Subordinates 1 and 2 on the right):
– Coordinator force-writes a begin log entry, then sends "prepare" to each subordinate
– Each subordinate force-writes a prepare log entry and sends back "yes"
– Coordinator force-writes a commit log entry (this is the commit point), then sends "commit" to each subordinate
– Each subordinate force-writes a commit log entry and sends back "ack"
– Coordinator writes an end log entry
©2016-2024 Linh Thi Xuan Phan 13
2PC: Some observations
• All log records for a transaction contain its ID and
the coordinator’s ID
– The coordinator’s abort/commit record also includes IDs of all
subordinates (why?)
• Every message reflects a decision by the sender
– To ensure that this decision survives failures, it is first recorded in
the local log
• There exists no distributed commit protocol that
can recover without communicating with other
processes, in the presence of multiple failures!
©2016-2024 Linh Thi Xuan Phan
What if a node fails in the middle?
• Suppose we find a commit or abort log record for
transaction T, but not an end record?
– Need to redo/undo T
– If this node is the coordinator for T, keep sending commit / abort
messages to subordinates until acks have been received
• Suppose we find a prepare log record for
transaction T, but not commit/abort?
– This node is a subordinate for T
– Repeatedly contact the coordinator to find status of T, then write
commit/abort log record; redo/undo T; and write end log record
• Suppose we don’t find even a prepare record for T?
– Unilaterally abort and undo T
– This site may be the coordinator! If so, subordinates may send messages asking about T;
their parts of T must be aborted and undone as well
©2016-2024 Linh Thi Xuan Phan
Coordinator failure
• What should a subordinate do when the
coordinator fails after it has already voted yes?
– Problem: Cannot decide whether to commit or abort T until
coordinator recovers - T is blocked!
– Consequences?
• Suppose all the subordinates know each other?
– Can be implemented (requires extra overhead in prepare msg)
– But: They are still blocked, unless one of them voted no, or one of
them has already received the coordinator's decision
©2016-2024 Linh Thi Xuan Phan
Link and remote site failures
• What should a node do when a remote node does
not respond during the commit protocol for
transaction T (either because the remote node
failed or the link failed)?
– If the current node is the coordinator for T, it should abort T
– If the current node is a subordinate, and has not yet voted yes, it
should abort T
– If the current node is a subordinate and has voted yes, it is blocked
until the coordinator responds!
• Can we do better?
©2016-2024 Linh Thi Xuan Phan
Plan for today
• Distributed commit
– Two-phase commit (2PC)
– Three-phase commit (3PC) NEXT
• Logging and recovery
– Centralized checkpointing
– Chandy-Lamport algorithm
18
©2016-2024 Linh Thi Xuan Phan
How can we improve 2PC?
• What is the real reason why 2PC can block?
– Suppose both the coordinator and a subordinate crash
– The decision could have been COMMIT, and the subordinate may
have already completed the operation
– But the other subordinates have no way to distinguish this from the
situation where the decision was ABORT
• Idea: Let's make sure that the subordinates know
the decision before they execute anything!
– When the coordinator has received yes votes from all the subordinates, it tells them
that there will be a COMMIT ("PRECOMMIT")
– Once everyone (or at least a large number) acknowledges the
receipt, the coordinator then sends the actual COMMIT message
– What happens now when the coordinator fails?
• ... during the PRECOMMIT phase? ... during the actual COMMIT phase? 19
©2016-2024 Linh Thi Xuan Phan
Three-phase commit (3PC)
• Phase 1: Voting
– Coordinator sends PREPARE to each subordinate
– Each subordinate votes either YES or NO
• Phase 2: Precommit
– If at least one vote is NO, the coordinator sends ABORT as before
– If all votes are YES, the coordinator sends PRECOMMIT to
at least k subordinates (where k is a tunable parameter)
• OK to send more PRECOMMITs if there are not enough responses
– Each subordinate replies with an ACK
• Phase 3: Commit
– Once the coordinator has received k ACKs, it sends COMMIT to
each subordinate
– Each subordinate responds with an ACK
20
©2016-2024 Linh Thi Xuan Phan
3PC: Handling coordinator failures
• What if some nodes fail, including the coordinator?
– Remaining nodes ask each other what the coordinator has told them
• Situation #1: Nobody has seen a PRECOMMIT
– 2PC would have blocked in this case!
– But with 3PC, the remaining nodes can safely ABORT, since the
failed nodes could not have made any changes yet
• ... at least unless more than k nodes have failed! (why?)
• Situation #2: At least one PRECOMMIT or COMMIT
– The remaining subordinates know that the decision was (or was
going to be) COMMIT
– They can all decide to go ahead and COMMIT; once the other nodes
come back up, they can learn about this decision and COMMIT too
• Situation #3: Network partition
– 3PC isn't safe in this case! (why?)
21
©2016-2024 Linh Thi Xuan Phan
Recap: Distributed commit
• Goal: Do something on all nodes, or none of them
– A very common requirement in distributed systems
– Naïve solution (one-phase commit) isn't safe
• We have seen two solutions: 2PC and 3PC
– The key idea is for the nodes to 'vote' whether to commit or abort
– A coordinator is elected to collect votes and broadcast the decision
– 2PC is much simpler, but can block when the coordinator fails
– 3PC doesn't have that problem, but it is more complicated
– Neither 2PC nor 3PC can tolerate network partitions
22
©2016-2024 Linh Thi Xuan Phan
Plan for today
• Distributed commit
– Two-phase commit (2PC)
– Three-phase commit (3PC)
• Logging and recovery NEXT
– Centralized checkpointing
– Chandy-Lamport algorithm
23
©2016-2024 Linh Thi Xuan Phan
Why logging and recovery?
• Suppose a distributed system fails in the middle of
a long and expensive computation
• What should we do?
– Restarting from scratch may be expensive, or even impossible!
• Idea: Periodically record a checkpoint of the system
– Each node writes down ("logs") its current state, e.g., on its local disk
– If something bad happens, the nodes can go back to the latest
checkpoint and then continue from there!
24
©2016-2024 Linh Thi Xuan Phan
Message logging
• What if the latest checkpoint was some time ago?
– Rolling back the entire system would destroy a lot of useful work
• Idea: Use message logging + deterministic replay
– Whenever a message is sent, it is saved somewhere – either by the
sender (sender-based logging) or by the recipient (receiver-based)
– Nodes also record all nondeterministic events
• What are these? Why do they need to be remembered?
– When a node crashes, we can roll back only that node to its latest
checkpoint, and then feed it all the recorded messages and
nondeterministic events
– This will bring the node back to the state it was before the crash
• What are we assuming about the software on the node?
• Does this assumption generally hold? What does it take to make it hold?
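• A tiny, self-contained sketch of the idea in Python (the "node" is just a
deterministic counter, and the log is an in-memory list standing in for sender- or
receiver-based message logging):

# Sketch: checkpoint + message log + deterministic replay on a single node.
class Node:
    def __init__(self):             self.state = 0
    def handle(self, msg):          self.state += msg      # deterministic processing
    def checkpoint(self):           return self.state
    def restore(self, snapshot):    self.state = snapshot

node, log_since_checkpoint = Node(), []

node.handle(5); node.handle(7)
snapshot = node.checkpoint()                # periodic checkpoint (state = 12)

for msg in [3, 11]:                         # messages after the checkpoint are logged
    node.handle(msg); log_since_checkpoint.append(msg)

state_before_crash = node.state             # 26
node = Node()                               # "crash": volatile state is lost
node.restore(snapshot)                      # roll back only this node...
for msg in log_since_checkpoint:            # ...and replay the logged messages
    node.handle(msg)
assert node.state == state_before_crash     # back to the pre-crash state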
25
©2016-2024 Linh Thi Xuan Phan
Checkpointing on a single node
• How would you record a checkpoint on one node?
• Idea #1: Just write its memory contents to disk!
– Problem: Write can take a long time!
– If the node keeps running during that time, the checkpoint will be
inconsistent: the state at the beginning is older than that at the end
– The node may never have been in the state that the checkpoint
describes (at least not at any given time)
• Idea #2: Stop the node during the checkpoint
– Not ideal either – system will be unresponsive during that time!
– Actual checkpointing systems use techniques like copy-on-write
(CoW) to take a checkpoint in memory very quickly (a few ms)
• This can then be written to disk asynchronously!
– But this trick won't work if we have a distributed system!
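• One concrete way to get a fast in-memory checkpoint on a single Unix node is fork():
the child gets a copy-on-write view of the parent's memory and can write it to disk
while the parent keeps running (this is how Redis takes its background snapshots, for
example). A minimal sketch, assuming a POSIX system:

# Sketch: fork-based copy-on-write checkpoint of in-memory state (POSIX only).
import json, os

state = {"counter": 42, "items": ["a", "b"]}    # the state we want to checkpoint

pid = os.fork()
if pid == 0:
    # Child: sees a CoW snapshot of `state` as of the fork.
    with open("checkpoint.json", "w") as f:
        json.dump(state, f)
    os._exit(0)
else:
    # Parent: keeps running and may mutate its own copy immediately.
    state["counter"] += 1
    os.waitpid(pid, 0)
    print("checkpoint written; live counter is", state["counter"])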
26
©2016-2024 Linh Thi Xuan Phan
Consistent vs. inconsistent cuts
[Figure: two process timelines, A with events A0..A3 and B with events B0..B3, connected by
Message 1 and Message 2. Two cuts are drawn: one labeled "Inconsistent cut", the other
"Consistent cut".]
• We can define a cut of the distributed execution
– Basically, one prefix of each node's local execution, taken together
• When can we call a cut consistent?
– If, for every event it contains, it also contains all events that
'happened before' that event.
– In particular, every received message must have been sent
• Which of the above cuts are consistent?
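• The definition can be checked mechanically: a cut is consistent iff every receive event
it contains also has its matching send inside the cut. A small illustrative check in
Python (the example message is hypothetical):

# Sketch: cut = {node: number of events included}; events on each node are 0,1,2,...
# messages = list of ((send_node, send_idx), (recv_node, recv_idx)) pairs.
def is_consistent(cut, messages):
    def inside(node, idx):
        return idx < cut[node]
    # consistent iff no message is received inside the cut but sent outside it
    return all(inside(*send) or not inside(*recv) for send, recv in messages)

msgs = [(("A", 1), ("B", 2))]                  # sent as A's event 1, received as B's event 2
print(is_consistent({"A": 1, "B": 3}, msgs))   # False: receive is in the cut, send is not
print(is_consistent({"A": 2, "B": 3}, msgs))   # True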
27
©2016-2024 Linh Thi Xuan Phan
Recovery lines
[Figure: node timelines with checkpoints marked; after a failure, one set of checkpoints
forms a consistent recovery line, while another set forms an inconsistent cut. The initial
state is shown at the far left.]
• Suppose we want to roll back the entire system
– Can we simply roll back each individual node to a recent checkpoint?
– Problem: Checkpoints could describe an inconsistent cut!
• A recovery line is a consistent set of checkpoints
– How can we find such a recovery line?
28
©2016-2024 Linh Thi Xuan Phan
The Domino Effect
[Figure: node timelines with uncoordinated checkpoints; after a failure, each rollback
exposes further inconsistencies, pushing the recovery line back toward the initial state.]
• What if a set of checkpoints is not consistent?
– Need to roll back some of the nodes to an even earlier checkpoint!
– But that could create further inconsistencies!
• Problem: Cascading rollbacks! Why?
– In this example, the nodes just took their checkpoints individually
– Thus, it is unlikely that they would (ever) form a consistent cut
– The nodes need to coordinate their checkpointing!
29
©2016-2024 Linh Thi Xuan Phan
Centralized checkpointing
• Similar to 2PC with a coordinator
• Coordinator first multicasts a CHECKPOINT
message to all processes
• When a process receives the message:
– Takes a local checkpoint
– Queues any subsequent outgoing messages (why?)
– Responds with an ACK
• When the coordinator receives all the ACKs, it
sends a DONE message
– At that point, the queued messages are sent, and the execution
proceeds normally again
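• A hedged sketch of this protocol as a single-process simulation in Python (a list
stands in for the network; all names are illustrative):

# Sketch: coordinator-driven checkpointing. Between CHECKPOINT and DONE,
# each process queues its outgoing application messages instead of sending them.
class Process:
    def __init__(self, name):
        self.name, self.state, self.queued, self.paused = name, 0, [], False
    def on_checkpoint(self):
        self.snapshot = self.state              # take the local checkpoint
        self.paused = True                      # start queueing outgoing messages
        return "ACK"
    def send(self, msg, network):
        (self.queued if self.paused else network).append((self.name, msg))
    def on_done(self, network):
        self.paused = False
        network.extend(self.queued)             # release everything that was held back
        self.queued.clear()

procs, network = [Process("P1"), Process("P2")], []
acks = [p.on_checkpoint() for p in procs]       # coordinator multicasts CHECKPOINT
procs[0].send("m1", network)                    # held back, so the cut stays consistent
if all(a == "ACK" for a in acks):               # coordinator has received all ACKs...
    for p in procs:
        p.on_done(network)                      # ...and multicasts DONE
print(network)                                  # [('P1', 'm1')]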
©2016-2024 Linh Thi Xuan Phan
Can we do better?
• Centralized checkpointing has several drawbacks
• Problem #1: Need for a central coordinator
– Ideally, we would like a fully distributed protocol
– Any node should be able to initiate a checkpoint at any time
• Problem #2: Queueing
– This is necessary to ensure that the cut is consistent
– But it also holds up the system for a while!
• Problem #3: Messages are not captured
– Some messages may still be 'in flight' at checkpoint time
– If we roll back every node to the checkpoint, these messages will
not be re-sent!
31
©2016-2024 Linh Thi Xuan Phan
Snapshots
• Can we get a consistent "snapshot"
of the system?
• This should include:
– A checkpoint for each node
– The set of messages that were "in flight"
at the time (i.e., sent but not yet received)
• Can we expect to capture a state of the entire
system as of some particular time?
– Not unless clocks are perfectly synchronized!
– All we can hope for is a state that the system could have been in,
and from which the actual state is reachable (why?)
32
©2016-2024 Linh Thi Xuan Phan
What are in flight messages from B to A?
[Figure: timelines for Node A and Node B with their checkpoints, and four messages m1..m4
sent from B to A]
– m1: sent and received before both A's and B's checkpoints
– m2, m3: sent before B's checkpoint, but received after A's checkpoint (these are the
in-flight messages)
– m4: sent and received after both A's and B's checkpoints
33
©2016-2024 Linh Thi Xuan Phan
Some simplifying assumptions
• To simplify our discussion,
we will assume that:
– Each node Ni has a direct "channel" cij
to every other node Nj
– Channels are unidirectional and
deliver messages in FIFO order
– Channels are reliable, i.e., messages
are never lost
• How can we make these assumptions true?
– Direct channel: This is just an abstraction; all we need is that every
node can send messages to every other node
– FIFO order: Can use sequence numbers
– Reliable: Can use retransmissions
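• For example, FIFO order over a channel that may reorder messages can be restored with
per-channel sequence numbers; a minimal receiver-side sketch in Python (retransmission
for reliability is omitted):

# Sketch: deliver messages in sequence-number order, buffering any that arrive early.
class FifoReceiver:
    def __init__(self):
        self.expected, self.buffer, self.delivered = 0, {}, []
    def on_receive(self, seq, payload):
        self.buffer[seq] = payload
        while self.expected in self.buffer:                 # deliver any ready prefix
            self.delivered.append(self.buffer.pop(self.expected))
            self.expected += 1

r = FifoReceiver()
for seq, payload in [(1, "b"), (0, "a"), (2, "c")]:         # arrives out of order
    r.on_receive(seq, payload)
print(r.delivered)                                          # ['a', 'b', 'c']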
34
©2016-2024 Linh Thi Xuan Phan
K. Mani Chandy and Leslie Lamport: "Distributed Snapshots: Determining Global States of Distributed Systems", ACM TOCS 3(1):63-75
Chandy-Lamport algorithm
• When a node Ni wants to initiate a snapshot:
– it takes a local checkpoint,
– it sends a special marker message on each of its outgoing channels, and
– it begins recording messages that arrive on its incoming channels
• When a node Nj receives a marker on channel cij:
– If Nj has not yet taken a checkpoint:
• it takes a local checkpoint and sends a special marker on each outgoing channel,
• it records the state of cij as the empty set, and
• it begins recording messages that arrive on any
other incoming channel ckj
– If Nj has already taken a checkpoint:
• it stops recording messages for cij
• Termination condition:
– The node that initiated the snapshot has received a marker from every other node
• What are the 'recorded' sets of messages?
– These are the messages that are "in flight" at the snapshot time
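• Below is a hedged, single-process simulation of the algorithm in Python with two nodes
and FIFO channels (deques stand in for the network; the scenario and all names are
illustrative):

from collections import deque

MARKER = "MARKER"

class Node:
    def __init__(self, name, state):
        self.name, self.state = name, state
        self.snapshot = None      # local checkpoint, once taken
        self.recording = {}       # incoming channel -> recorded in-flight messages
        self.marker_seen = {}     # incoming channel -> has its marker arrived yet?

    def start_snapshot(self, channels, others):
        self.snapshot = self.state
        for other in others:
            channels[(self.name, other)].append(MARKER)    # marker on every outgoing channel
            self.recording[(other, self.name)] = []        # record every incoming channel
            self.marker_seen[(other, self.name)] = False

    def receive(self, channel, msg, channels, others):
        if msg == MARKER:
            if self.snapshot is None:                      # first marker: checkpoint now
                self.start_snapshot(channels, others)
            self.marker_seen[channel] = True               # stop recording this channel
        else:
            self.state += msg                              # ordinary application message
            if self.snapshot is not None and not self.marker_seen.get(channel, True):
                self.recording[channel].append(msg)        # it was in flight at snapshot time

nodes = {"A": Node("A", 100), "B": Node("B", 100)}
channels = {("A", "B"): deque(), ("B", "A"): deque()}

channels[("B", "A")].append(30)                # B -> A: 30 is still in flight
nodes["A"].start_snapshot(channels, ["B"])     # A initiates the snapshot
channels[("A", "B")].append(5)                 # A -> B: 5, sent after A's marker

for (src, dst), q in channels.items():         # deliver everything in FIFO order
    while q:
        nodes[dst].receive((src, dst), q.popleft(), channels, [n for n in nodes if n != dst])

print(nodes["A"].snapshot, nodes["B"].snapshot)   # 100 100
print(nodes["A"].recording)                       # {('B', 'A'): [30]}  <- in-flight message
print(nodes["B"].recording)                       # {('A', 'B'): []}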
35
©2016-2024 Linh Thi Xuan Phan
Chandy-Lamport: Example
[Figure: timelines for nodes A (Alice), B (Bob), and C (Charlie). Alice's balance goes
$500 → $600 → $350 → $400, Bob's $300 → $200 → $150 → $50, and Charlie's $100 → $350 → $450
as the transfers Bob→Alice: $100, Alice→Charlie: $250, Bob→Alice: $50, and Bob→Charlie: $100
are delivered. Each node's checkpoint and the recorded channel contents are:]
Node C: Chk="Charlie: $100"; SAC = { Alice→Charlie: $250 }; SBC = {};
Node B: Chk="Bob: $150"; SAB = {}; SCB = {};
Node A: Chk="Alice: $350"; SBA = { Bob→Alice: $50 }; SCA = {};
• What happens if we roll back & replay messages?
– We arrive at a consistent intermediate state (Alice: $400, Bob: $150, Charlie: $350)
from which the rest of the execution can be re-run! 36
©2016-2024 Linh Thi Xuan Phan
Recap: Chandy-Lamport algorithm
• Goal: A consistent "snapshot" of the system
– One checkpoint per node, and all the messages that were "in flight"
– Not necessarily a state that the system ever was in at any given
(wallclock) time
• Properties of the algorithm:
– Fully distributed; no need for a central coordinator
– Any node can trigger a snapshot at any time
– Execution can continue; no need to "queue" messages
• How can we use this?
– Example: Stable property detection
• Some of the algorithms we have discussed (e.g., deadlock detection via circular
wait) assume that we have a consistent snapshot of the global system state
– Example: Failure recovery
37
©2016-2024 Linh Thi Xuan Phan