0% found this document useful (0 votes)
36 views33 pages

Distributed Checkpointing Guide

Study Notes

Uploaded by

menakababu
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
36 views33 pages

Distributed Checkpointing Guide

Study Notes

Uploaded by

menakababu
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 33

CS 3551

DISTRIBUTED
COMPUTING
Checkpoint Based
Recovery
1. Uncoordinated checkpointing
2. Coordinated checkpointing
a. Blocking coordinated checkpointing
b. Non-blocking checkpoint coordination
3. Impossibility of min-process non-blocking
checkpointing
4. Communication-induced checkpointing
a. Model-based checkpointing
b. Index-based checkpointing
Checkpoint Based
Recovery
● In the checkpoint-based recovery approach, the state of each
process and the communication channel is checkpointed frequently
so that, upon a failure, the system can be restored to a globally
consistent set of checkpoints.
● It does does not need to detect, log, or replay non-
deterministic events. Checkpoint-based protocols are
therefore less restrictive and simpler to implement than log-
based rollback recovery.
● However, checkpoint-based rollback recovery does not
guarantee that prefailure execution can be deterministically
regenerated after a rollback.
● It may not be suitable for applications that require frequent
1. Uncoordinated
Checkpointing
● Each process has autonomy in deciding when to take
checkpoints.
● Synchronization overhead is minimal as there is no need for
coordination between processes. ( Lower runtime overhead).
● Autonomy in taking checkpoints also allows each process
to select appropriate checkpoints positions

Drawbacks:
1. Domino effect may occur during a recovery.
2. Recovery is slow because processes need to iterate to find a consistent set
of checkpoints.
3. Useless Checkpoint:
a. Since no coordination is done at the time the checkpoint is taken, checkpoints taken by a process
may be useless checkpoints.
b. Useless checkpoints are undesirable because they incur overhead and do not contribute to
advancing the
recovery line.
4. forces each process to maintain multiple checkpoints, and to periodically
invoke a garbage collection algorithm to reclaim the checkpoints that are no
How consistent global checkpoint is
determined? Steps:

1. When a failure occurs, the recovering


process initiates rollback by
broadcasting a dependency request
message to collect all the dependency
information maintained by each
process.
2. When a process receives this message,
it
stops its execution and replies with the
dependency information saved on the
stable storage as well as with the
dependency information, if any, which
is associated with its current state.
3. The initiator then calculates the
recovery line
based on the global dependency
information and broadcasts a rollback
request message containing the
recovery line.
4. Upon receiving this message, a
process
whose current state belongs to the
Blocking Coordinated
Checkpoint
Non-Blocking Coordinated
Checkpoint
● In this approach the processes need not stop their execution
while taking checkpoints.
● A fundamental problem in coordinated checkpointing is to prevent
a process from receiving application messages that could make
the checkpoint inconsistent.
Solution 1

If channels are FIFO,


this problem can be
avoided by preceding
the first
post-checkpoint message
on each channel by a
checkpoint request,
forcing each process to
take a checkpoint before
receiving the first
post-checkpoint message,
Solution
2
● If the channels are non-FIFO, the following two approaches can be
used: first, the marker can be piggybacked on every post-
checkpoint message.
● When a process receives an application message with a marker, it
treats it as if it has received a marker message, followed by the
application message.
● Alternatively, checkpoint indices can serve the same role as
markers, where a checkpoint is triggered when the receiver’s local
checkpoint index is lower than the piggybacked checkpoint index.
Impossibility of min-process non-blocking
checkpointing
● A min-process, non-blocking checkpointing algorithm is one that forces only a
minimum number of processes to take a new checkpoint, and at the same
time it does not force any process to suspend its computation.
● Clearly, such checkpointing algorithms will be very attractive. Cao and Singhal
[7] showed that it is impossible to design a min-process, non-blocking
checkpointing algorithm.
● Possible Algorithm:
● Phase 1:
○ checkpoint initiator identifies all processes with which it has communicated since the last
checkpoint and sends them a request. Upon receiving the request, each process in turn identifies
all processes it has communicated with since the last checkpoint and sends them a request, and
so on, until no more processes can be identified.
● Phase 2:
○ all processes identified in the first phase take a checkpoint.
○ The result is a consistent checkpoint that involves only the participating processes.
○ In this protocol, after a process takes a checkpoint, it cannot send any message until the
second phase terminates successfully, although receiving a message after the checkpoint
has been taken is allowable.
● Based on a concept called “Z-dependency,” Cao and Singhal proved that there
does not exist a non-blocking algorithm that will allow a minimum number of
Communicati
on Induced
Checkpoint
Model
Based
Checkpoint
Index-based
checkpointing
● Index-based communication-induced checkpointing assigns
monotonically increasing indexes to checkpoints, such that the
checkpoints having the same index at different processes form a
consistent state.
● Inconsistency between checkpoints of the same index can be
avoided in a lazy fashion if indexes are piggybacked on
application messages to help receivers decide when they should
take a forced a checkpoint.
CS 3551
DISTRIBUTED
COMPUTING
Koo–Toueg coordinated checkpointing
algorithm
Koo–Toueg coordinated checkpointing
algorithm
Koo–Toueg coordinated checkpointing
algorithm
Objective:

● Takes a consistent set of checkpoints and avoids the domino effect and livelock problems during the
recovery.
● Processes coordinate their local checkpointing actions such that the set of all checkpoints in the
system is consistent.

Assumptions of Checkpointing Algorithm:

● Processes communicate by exchanging messages through communication channels. Communication


channels are FIFO.
● It is assumed that end-to-end protocols (such as the sliding window protocol) exist to cope with
message loss due to rollback recovery and communication failure.
● Communication failures do not partition the network.

Permanent vs Tentative:

1. A permanent checkpoint is a local checkpoint at a process and is a part of a consistent global


checkpoint.
2. A tentative checkpoint is a temporary checkpoint that is made a permanent checkpoint on the
successful termination of the checkpoint algorithm.
3. In case of a failure, processes roll back only to their permanent checkpoints for recovery.
Checkpoint - Phase 1 Checkpoint - Phase 2
1. An initiating process Pi takes a
tentative checkpoint and 1. Pi informs all the processes of
requests all other processes to the decision it reached at the
take tentative checkpoints. end of the first phase.
2. Each process informs Pi whether
it 2. A process, on receiving the
succeeded in taking a message from Pi, will
tentative checkpoint. act accordingly.
3. A process says “no” to a
request if it 3. Therefore, either all
fails to take a tentative or none of the
checkpoint, which could be processes advance the
due to several reasons,
depending upon the
checkpoint by taking
underlying application. permanent checkpoints.
4. If Pi learns that all the 4. The algorithm requires that
processes
after a
have successfully taken
tentative checkpoints, Pi process has taken a tentative
decides that all tentative checkpoint, it cannot send
checkpoints should be made messages related to the
permanent; otherwise, Pi
decides that all the tentative underlying computation until
checkpoints should be it is informed of Pi’s decision.
Optimizati
Correctnes
on
s
● A set of permanent
checkpoints taken by this
algorithm is consistent
because of the following
two reasons:
○ Either all or none of the
processes
take permanent checkpoints;
○ no process sends a message
after taking a tentative
checkpoint until the receipt of
the initiating process’s
decision, as by then all
processes would have taken
checkpoints.
● Thus, a situation will not
arise
where there is a record of
Assumption

● a single process invokes the algorithm.


Rollback Recovery ● It also assumes that the checkpoint and the
rollback recovery algorithms are not invoked

Algorithm
Phase 1
concurrently

Phase 2
● An initiating process Pi sends a
message to all other processes to ● Pi propagates its
check if they all are willing to decision to all the
restart from their previous
checkpoints.
processes.
● A process may reply “no” to a ● On receiving Pi’s
restart request due to any reason decision, a
(e.g., it is already participating in a
checkpoint or recovery process
process acts accordingly.
initiated by some other process). ● During the execution of
● If Pi learns that all processes are the recovery algorithm,
willing to restart from their previous
checkpoints, Pi decides that all a process cannot send
processes should roll back to their messages related to the
previous checkpoints. Otherwise, Pi underlying computation
aborts the rollback attempt and it
may attempt a recovery at a later while it is waiting for Pi’s
time. decision.
Optimizati
Correctnes
on
s
● All processes restart
from an appropriate
state because, if they
decide to restart, they
resume execution from a
consistent state (the
checkpointing algorithm
takes a consistent set of
checkpoints).
CS 3551
DISTRIBUTED
COMPUTING
Juang–Venkatesan algorithm for asynchronous
checkpointing and recovery
Juang–Venkatesan algorithm for asynchronous
checkpointing and recovery
Juang–Venkatesan algorithm for asynchronous
checkpointing and recovery
System Model and Assumptions:
● Communication channels are reliable, deliver the messages in FIFO order, and have
infinite buffers.
● The message transmission delay is arbitrary, but finite.
● The processors directly connected to a processor via communication channels
are called its neighbors.
● The underlying computation or application is assumed to be event-driven: a processor P
waits until a message m is received, it processes the message m, changes its state
from s to s , and sends zero or more messages to some of its neighbors.
● Then the processor remains idle until the receipt of the next message.
● The new state s and the contents of messages sent to its neighbors depend on state s
and the contents of message m. The events at a processor are identified by unique
monotonically increasing numbers, ex0, ex1, ex2,
● Storage can be:
○ Volatile Log - Less time but data lost when power is lost
○ Stable Storage - More time but data not lost.
Asynchronous Checkpointing

● After executing an event, a processor records a triplet (s, m,


msgs_sent) in its volatile storage,
○ s is the state of the processor before the event
○ m is the message (including the identity of the sender of m, denoted as
m.sender) whose arrival caused the event
○ msgs_sent is the set of messages that were sent by the processor during the
event.
● Therefore, a local checkpoint at a processor consists of the record of
an event occurring at the processor and it is taken without any
synchronization with other processors.
● Periodically, a processor independently saves the contents of the
volatile log
in the stable storage and clears the volatile log. (Equivalent to
taking a local checkpoint)
Recovery
Algorithm

You might also like