Distributed Checkpointing Guide

Study Notes

Uploaded by

menakababu

CS 3551 DISTRIBUTED COMPUTING

Checkpoint-Based Recovery
1. Uncoordinated checkpointing
2. Coordinated checkpointing
   a. Blocking coordinated checkpointing
   b. Non-blocking coordinated checkpointing
3. Impossibility of min-process non-blocking checkpointing
4. Communication-induced checkpointing
   a. Model-based checkpointing
   b. Index-based checkpointing

Checkpoint-Based Recovery
● In the checkpoint-based recovery approach, the state of each
process and the communication channel is checkpointed frequently
so that, upon a failure, the system can be restored to a globally
consistent set of checkpoints.
● Checkpoint-based rollback recovery does not need to detect, log, or replay
non-deterministic events. Checkpoint-based protocols are therefore less
restrictive and simpler to implement than log-based rollback recovery.
● However, checkpoint-based rollback recovery does not
guarantee that prefailure execution can be deterministically
regenerated after a rollback.
● It may therefore not be suitable for applications that require frequent
interactions with the outside world.
1. Uncoordinated Checkpointing
● Each process has autonomy in deciding when to take checkpoints.
● Synchronization overhead is minimal because there is no need for
coordination between processes (lower runtime overhead).
● Autonomy in taking checkpoints also allows each process to select
appropriate checkpoint positions.

Drawbacks:
1. The domino effect may occur during recovery.
2. Recovery is slow because processes need to iterate to find a consistent set
of checkpoints.
3. Useless checkpoints:
   a. Since no coordination is done at the time a checkpoint is taken, checkpoints taken by a process
   may be useless.
   b. Useless checkpoints are undesirable because they incur overhead and do not contribute to
   advancing the recovery line.
4. Each process is forced to maintain multiple checkpoints and to periodically
invoke a garbage collection algorithm to reclaim the checkpoints that are no
longer required.

How is a consistent global checkpoint determined? Steps:

1. When a failure occurs, the recovering process initiates rollback by
broadcasting a dependency request message to collect all the dependency
information maintained by each process.
2. When a process receives this message, it stops its execution and replies
with the dependency information saved on the stable storage as well as with
the dependency information, if any, which is associated with its current state.
3. The initiator then calculates the recovery line based on the global
dependency information and broadcasts a rollback request message containing
the recovery line.
4. Upon receiving this message, a process whose current state belongs to the
recovery line simply resumes execution; otherwise, it rolls back to an earlier
checkpoint as indicated by the recovery line.

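
The initiator's recovery-line calculation can be sketched as a fixed-point iteration over the collected dependency information. This is a minimal sketch under assumptions that are not in the notes: the names `Msg` and `recovery_line` are hypothetical, a process restarting from checkpoint c undoes every interval numbered c or higher, and the current state of a non-failed process is modeled as a virtual checkpoint one past its latest saved one.

```python
from typing import Dict, List, NamedTuple


class Msg(NamedTuple):
    """One dependency record (hypothetical layout, not from the notes)."""
    sender: str
    send_interval: int  # sent after the sender's checkpoint with this number
    receiver: str
    recv_interval: int  # received after the receiver's checkpoint with this number


def recovery_line(start: Dict[str, int], msgs: List[Msg]) -> Dict[str, int]:
    """Roll processes back until no orphan message remains.

    line[p] = c means p restarts from checkpoint c, so every interval
    numbered >= c at p is undone.
    """
    line = dict(start)
    changed = True
    while changed:
        changed = False
        for m in msgs:
            send_undone = m.send_interval >= line[m.sender]
            recv_kept = m.recv_interval < line[m.receiver]
            if send_undone and recv_kept:
                # Orphan message: the receive is recorded but the send is not.
                # The receiver must restart from the checkpoint before the receive.
                line[m.receiver] = m.recv_interval
                changed = True
    return line


# P failed and restarts from its checkpoint 1; Q did not fail, so its
# current state is modeled as virtual checkpoint 2.
msgs = [
    Msg("P", 1, "Q", 1),  # forces Q back to checkpoint 1
    Msg("Q", 1, "P", 0),  # then forces P back to checkpoint 0
]
line = recovery_line({"P": 1, "Q": 2}, msgs)
print(line)
```

In the example, rolling Q back in turn invalidates a message that P had already received, so P is pushed back one more checkpoint. That cascading is the domino effect listed among the drawbacks.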
Blocking Coordinated Checkpointing
Non-Blocking Coordinated Checkpointing
● In this approach the processes need not stop their execution
while taking checkpoints.
● A fundamental problem in coordinated checkpointing is to prevent
a process from receiving application messages that could make
the checkpoint inconsistent.
Solution 1
If channels are FIFO, this problem can be avoided by preceding the first
post-checkpoint message on each channel by a checkpoint request, forcing each
process to take a checkpoint before receiving the first post-checkpoint message.
Solution 2
● If the channels are non-FIFO, the following two approaches can be
used: first, the marker can be piggybacked on every post-
checkpoint message.
● When a process receives an application message with a marker, it
treats it as if it has received a marker message, followed by the
application message.
● Alternatively, checkpoint indices can serve the same role as
markers, where a checkpoint is triggered when the receiver’s local
checkpoint index is lower than the piggybacked checkpoint index.
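
The checkpoint-index alternative for non-FIFO channels can be sketched as follows. This is an illustrative sketch only: the `Process` class, its integer counter state, and the message layout `(index, payload)` are assumptions made for the example, not part of the notes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Process:
    name: str
    state: int = 0       # toy application state: a running sum
    ckpt_index: int = 0  # index of the latest local checkpoint
    checkpoints: List[Tuple[int, int]] = field(default_factory=list)

    def take_checkpoint(self, index: int) -> None:
        """Save (index, state) and adopt the new checkpoint index."""
        self.ckpt_index = index
        self.checkpoints.append((index, self.state))

    def send(self, payload: int) -> Tuple[int, int]:
        """Piggyback the sender's checkpoint index on the application message."""
        return (self.ckpt_index, payload)

    def receive(self, message: Tuple[int, int]) -> None:
        piggybacked_index, payload = message
        if self.ckpt_index < piggybacked_index:
            # Behave as if a marker arrived first: checkpoint before delivery.
            self.take_checkpoint(piggybacked_index)
        self.state += payload  # only now deliver the application message


p, q = Process("P"), Process("Q")
p.take_checkpoint(1)      # P starts a new checkpoint round
q.receive(p.send(5))      # Q sees index 1 > 0 and checkpoints before delivery
print(q.checkpoints, q.state)
```

Because Q saves its state before the payload is applied, its checkpoint never records the receipt of a message that P sent after P's own checkpoint.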
Impossibility of min-process non-blocking
checkpointing
● A min-process, non-blocking checkpointing algorithm is one that forces only a
minimum number of processes to take a new checkpoint, and at the same
time it does not force any process to suspend its computation.
● Clearly, such checkpointing algorithms will be very attractive. Cao and Singhal
[7] showed that it is impossible to design a min-process, non-blocking
checkpointing algorithm.
● Possible Algorithm:
● Phase 1:
○ The checkpoint initiator identifies all processes with which it has communicated since the last
checkpoint and sends them a request. Upon receiving the request, each process in turn identifies
all processes it has communicated with since the last checkpoint and sends them a request, and
so on, until no more processes can be identified.
● Phase 2:
○ All processes identified in the first phase take a checkpoint.
○ The result is a consistent checkpoint that involves only the participating processes.
○ In this protocol, after a process takes a checkpoint, it cannot send any message until the
second phase terminates successfully, although receiving a message after the checkpoint
has been taken is allowed.
● Based on a concept called “Z-dependency,” Cao and Singhal proved that there
does not exist a non-blocking algorithm that will allow a minimum number of
processes to take their checkpoints.
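
Phase 1 of the possible algorithm above amounts to a transitive closure over "has communicated with since the last checkpoint". A minimal sketch, assuming that relation is available as a hypothetical map `talked_to` (in the real protocol it is discovered by forwarding requests, not read from a global table):

```python
from collections import deque
from typing import Dict, Set


def participants(initiator: str, talked_to: Dict[str, Set[str]]) -> Set[str]:
    """Return every process reached by propagating checkpoint requests."""
    seen = {initiator}
    queue = deque([initiator])
    while queue:
        current = queue.popleft()
        # Each process forwards the request to the processes it has
        # communicated with since its last checkpoint.
        for peer in talked_to.get(current, set()):
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen


talked_to = {"P1": {"P2"}, "P2": {"P3"}, "P3": set(), "P4": {"P5"}}
print(sorted(participants("P1", talked_to)))  # P4 and P5 are not involved
```

Only the processes in the returned set take a checkpoint in Phase 2, which is what makes the scheme min-process. The blocking comes from the rule that they may not send until Phase 2 completes.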
Communication-Induced Checkpointing
Model-Based Checkpointing
Index-Based Checkpointing
● Index-based communication-induced checkpointing assigns
monotonically increasing indexes to checkpoints, such that the
checkpoints having the same index at different processes form a
consistent state.
● Inconsistency between checkpoints of the same index can be
avoided in a lazy fashion if indexes are piggybacked on
application messages to help receivers decide when they should
take a forced checkpoint.
Koo–Toueg coordinated checkpointing
algorithm
Objective:

● Takes a consistent set of checkpoints and avoids the domino effect and livelock problems during the
recovery.
● Processes coordinate their local checkpointing actions such that the set of all checkpoints in the
system is consistent.

Assumptions of Checkpointing Algorithm:

● Processes communicate by exchanging messages through communication channels. Communication
channels are FIFO.
● It is assumed that end-to-end protocols (such as the sliding window protocol) exist to cope with
message loss due to rollback recovery and communication failure.
● Communication failures do not partition the network.

Permanent vs Tentative:

1. A permanent checkpoint is a local checkpoint at a process and is a part of a consistent global
checkpoint.
2. A tentative checkpoint is a temporary checkpoint that is made a permanent checkpoint on the
successful termination of the checkpoint algorithm.
3. In case of a failure, processes roll back only to their permanent checkpoints for recovery.
Checkpoint - Phase 1
1. An initiating process Pi takes a tentative checkpoint and requests all other
processes to take tentative checkpoints.
2. Each process informs Pi whether it succeeded in taking a tentative checkpoint.
3. A process says “no” to a request if it fails to take a tentative checkpoint,
which could be due to several reasons, depending upon the underlying application.
4. If Pi learns that all the processes have successfully taken tentative
checkpoints, Pi decides that all tentative checkpoints should be made permanent;
otherwise, Pi decides that all the tentative checkpoints should be discarded.

Checkpoint - Phase 2
1. Pi informs all the processes of the decision it reached at the end of the
first phase.
2. A process, on receiving the message from Pi, will act accordingly.
3. Therefore, either all or none of the processes advance the checkpoint by
taking permanent checkpoints.
4. The algorithm requires that after a process has taken a tentative checkpoint,
it cannot send messages related to the underlying computation until it is
informed of Pi’s decision.
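
The two phases can be sketched as a small in-memory simulation. This is an illustrative sketch, not the full Koo–Toueg protocol: message passing is replaced by direct method calls, the names `Process`, `can_checkpoint`, and `run_checkpoint` are hypothetical, and a negative decision is modeled as dropping the tentative checkpoints.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Process:
    name: str
    state: int = 0
    can_checkpoint: bool = True      # hypothetical: would this process say "yes"?
    tentative: Optional[int] = None  # tentative checkpoint, if any
    permanent: Optional[int] = None  # latest permanent checkpoint
    may_send: bool = True            # False while waiting for Pi's decision

    def take_tentative(self) -> bool:
        """Phase 1: try to take a tentative checkpoint and answer yes/no."""
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        self.may_send = False        # no application messages until the decision
        return True

    def on_decision(self, commit: bool) -> None:
        """Phase 2: make the tentative checkpoint permanent, or drop it."""
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None
        self.may_send = True


def run_checkpoint(initiator: Process, others: List[Process]) -> bool:
    everyone = [initiator] + others
    replies = [p.take_tentative() for p in everyone]  # Phase 1
    commit = all(replies)                             # Pi's decision
    for p in everyone:                                # Phase 2
        p.on_decision(commit)
    return commit


pi, pj, pk = Process("Pi", 1), Process("Pj", 2), Process("Pk", 3)
print(run_checkpoint(pi, [pj, pk]), pj.permanent)   # all say yes
pj.state, pk.can_checkpoint = 20, False
print(run_checkpoint(pi, [pj, pk]), pj.permanent)   # one "no" aborts the round
```

After the aborted round, Pj's permanent checkpoint still holds the state saved in the first round, which illustrates the all-or-none property.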
Optimization
Correctness
● A set of permanent checkpoints taken by this algorithm is consistent because
of the following two reasons:
○ Either all or none of the processes take permanent checkpoints;
○ No process sends a message after taking a tentative checkpoint until the
receipt of the initiating process’s decision, as by then all processes would
have taken checkpoints.
● Thus, a situation will not arise where there is a record of a message being
received but there is no record of sending it.
Rollback Recovery Algorithm

Assumption
● A single process invokes the algorithm.
● It also assumes that the checkpoint and the rollback recovery algorithms are
not invoked concurrently.

Phase 1
● An initiating process Pi sends a message to all other processes to check if
they all are willing to restart from their previous checkpoints.
● A process may reply “no” to a restart request due to any reason (e.g., it is
already participating in a checkpoint or recovery process initiated by some
other process).
● If Pi learns that all processes are willing to restart from their previous
checkpoints, Pi decides that all processes should roll back to their previous
checkpoints. Otherwise, Pi aborts the rollback attempt and it may attempt a
recovery at a later time.

Phase 2
● Pi propagates its decision to all the processes.
● On receiving Pi’s decision, a process acts accordingly.
● During the execution of the recovery algorithm, a process cannot send
messages related to the underlying computation while it is waiting for Pi’s
decision.
Optimization
Correctness
● All processes restart from an appropriate state because, if they decide to
restart, they resume execution from a consistent state (the checkpointing
algorithm takes a consistent set of checkpoints).
Juang–Venkatesan algorithm for asynchronous
checkpointing and recovery
System Model and Assumptions:
● Communication channels are reliable, deliver messages in FIFO order, and have
infinite buffers.
● The message transmission delay is arbitrary, but finite.
● The processors directly connected to a processor via communication channels
are called its neighbors.
● The underlying computation or application is assumed to be event-driven: a processor P
waits until a message m is received, processes the message m, changes its state
from s to s′, and sends zero or more messages to some of its neighbors.
● The processor then remains idle until the receipt of the next message.
● The new state s′ and the contents of the messages sent to its neighbors depend on state s
and the contents of message m. The events at a processor are identified by unique,
monotonically increasing numbers ex0, ex1, ex2, ...
● Storage can be:
○ Volatile log: takes less time to access, but its contents are lost when power is lost.
○ Stable storage: takes more time to access, but its contents are not lost.
Asynchronous Checkpointing

● After executing an event, a processor records a triplet (s, m, msgs_sent) in
its volatile storage, where:
○ s is the state of the processor before the event;
○ m is the message (including the identity of the sender of m, denoted as
m.sender) whose arrival caused the event;
○ msgs_sent is the set of messages that were sent by the processor during the
event.
● Therefore, a local checkpoint at a processor consists of the record of
an event occurring at the processor, and it is taken without any
synchronization with other processors.
● Periodically, a processor independently saves the contents of the
volatile log in the stable storage and clears the volatile log (equivalent to
taking a local checkpoint).
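
The logging scheme described above can be sketched with two in-memory lists. This is a minimal sketch under stated assumptions: the `Processor` class and its method names are hypothetical, `stable_log` merely stands in for real stable storage, and `crash` only models the loss of the volatile log.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

Triplet = Tuple[Any, Any, List[Any]]  # (s, m, msgs_sent)


@dataclass
class Processor:
    state: Any
    volatile_log: List[Triplet] = field(default_factory=list)
    stable_log: List[Triplet] = field(default_factory=list)  # stand-in for stable storage

    def handle(self, m: Any, new_state: Any, msgs_sent: List[Any]) -> None:
        """Record (state before the event, triggering message, messages sent)."""
        self.volatile_log.append((self.state, m, list(msgs_sent)))
        self.state = new_state

    def flush(self) -> None:
        """Save the volatile log to stable storage, then clear it."""
        self.stable_log.extend(self.volatile_log)
        self.volatile_log.clear()

    def crash(self) -> None:
        """A failure wipes the volatile log; the stable log survives."""
        self.volatile_log.clear()


p = Processor(state=0)
p.handle("m1", new_state=1, msgs_sent=["a"])
p.flush()                                   # event for m1 is now on stable storage
p.handle("m2", new_state=2, msgs_sent=[])
p.crash()                                   # event for m2 was only in the volatile log
print(p.stable_log)
```

The trade-off is visible in the example: the event for m2 is lost because the processor failed before the next periodic flush.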
Recovery Algorithm
