Lecture 10

Parallel and Distributed Computing

Uploaded by

miraj gul

Fault Tolerance in Distributed Systems

Fault tolerance in a distributed system is the capability to continue operating smoothly
despite failures or errors in one or more of its components. This resilience is crucial for
maintaining system reliability, availability, and consistency. By implementing strategies
like redundancy, replication, and error detection, distributed systems can handle various
types of failures while keeping the service running and the data intact.

In distributed systems, three related types of problems occur:
• Fault: A fault is a weakness or shortcoming in the system or in any hardware or
software component. The presence of a fault can lead to an error and then to a failure.
• Error: An error is an incorrect result caused by the presence of a fault.
• Failure: A failure is the outcome in which the assigned goal is not achieved.

What is Fault Tolerance?

Fault tolerance is defined as the ability of the system to function properly even in the
presence of faults. Distributed systems consist of multiple components, so there is a
high risk of faults occurring. When faults are present, the overall performance may
degrade.

Types of Faults

• Transient Faults: Transient faults occur once and then disappear. They do not harm
the system to a great extent but are very difficult to find or locate. A momentary
processor fault is an example of a transient fault.

• Intermittent Faults: Intermittent faults occur repeatedly: the fault appears, vanishes
on its own, and then reappears. An example of an intermittent fault is a working
computer that hangs from time to time.

• Permanent Faults: Permanent faults remain in the system until the faulty component
is replaced. They can cause very severe damage to the system but are easy to
identify. A burnt-out chip is an example of a permanent fault.
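
The three fault types differ only in how the fault behaves over time, so they can be told apart from a history of probe results. The following is a minimal sketch, not a standard algorithm: the function name `classify_fault` and the probe-history format (a list of booleans, `True` meaning the fault was observed) are assumptions made for illustration. Within a finite observation window, "permanent" is only a best guess.

```python
from itertools import groupby

def classify_fault(observations):
    """Classify a probe history; True means the fault was observed at that probe."""
    # Collapse consecutive equal observations into runs,
    # e.g. [False, True, True, False] -> [False, True, False].
    runs = [state for state, _ in groupby(observations)]
    fault_runs = sum(1 for state in runs if state)
    if fault_runs == 0:
        return "none"
    if fault_runs >= 2:
        # The fault vanished and came back at least once.
        return "intermittent"
    if runs[-1]:
        # One run of faults that is still present at the end of the window.
        return "permanent"
    # One run of faults that disappeared on its own.
    return "transient"

print(classify_fault([False, True, False, False]))        # transient
print(classify_fault([False, True, False, True, False]))  # intermittent
print(classify_fault([False, True, True, True]))          # permanent
```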

Parallel & Distributed Computing 1 CS & IT, HU

Need for Fault Tolerance in Distributed Systems

Fault tolerance is required in order to provide the following four properties.

1. Availability: Availability is the property that the system is ready for use at any
time.

2. Reliability: Reliability is the property that the system can work continuously
without any failure.

3. Safety: Safety is the property that the system remains safe, with nothing
catastrophic happening and no unauthorized access becoming possible, even if a
failure occurs.

4. Maintainability: Maintainability is the property that states how easily and
quickly a failed node or system can be repaired.
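
These properties are often quantified. The lecture does not give a formula, so the following is an added illustration: a common steady-state measure of availability is MTTF / (MTTF + MTTR), where the mean time to failure (MTTF) reflects reliability and the mean time to repair (MTTR) reflects maintainability. The function name and the sample numbers are assumptions.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is usable."""
    return mttf_hours / (mttf_hours + mttr_hours)

# A node that runs 999 hours between failures and takes 1 hour to repair
# is available 99.9% of the time.
print(availability(999, 1))  # 0.999
```

Either a longer time between failures or a shorter repair time raises availability, which is why reliability and maintainability both matter.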

Fault Tolerance in Distributed Systems

In order to implement the techniques for fault tolerance in distributed systems, the
design, configuration and relevant applications need to be considered. Below are the
phases carried out for fault tolerance in a distributed system.

Phases of Fault Tolerance in Distributed Systems


1. Fault Detection
Fault detection is the first phase, in which the system is monitored continuously and
its outcomes are compared with the expected output. Any faults identified during
monitoring are reported. These faults can occur for various reasons, such as hardware
failure, network failure, and software issues. The main aim of the first phase is to
detect faults as soon as they occur so that the assigned work is not delayed.

2. Fault Diagnosis
Fault diagnosis is the process in which a fault identified in the first phase is
examined to find its root cause and its nature. Diagnosis can be done manually by the
administrator or with automated techniques, so that the fault can be resolved and the
given task performed.

3. Evidence Generation
Evidence generation is the process in which a report of the fault is prepared from the
diagnosis done in the earlier phase. The report covers the causes of the fault, its
nature, the solutions that can be used to fix it, and the alternatives and preventive
measures that need to be considered.

4. Assessment
Assessment is the process in which the damage caused by the fault is analyzed. The
damage can be determined from the messages passed by the component that encountered
the fault. Further decisions are made on the basis of this assessment.

5. Recovery
Recovery is the process whose aim is to make the system fault free and restore it to a
correct state, through either forward recovery or backward recovery. Common recovery
techniques such as reconfiguration and resynchronization can be used.
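
The five phases above can be sketched as a small pipeline. This is only an illustrative sketch under stated assumptions: all function names, the dictionary formats, and the choice of "restart the faulty component and its dependents" as the recovery action are hypothetical and do not come from the lecture.

```python
def detect(outputs, expected):
    """Phase 1: return the components whose output differs from the expected output."""
    return [name for name, value in outputs.items() if value != expected[name]]

def diagnose(component, error_log):
    """Phase 2: look up a root cause; 'unknown' if the log has no entry."""
    return error_log.get(component, "unknown")

def make_report(component, cause):
    """Phase 3: evidence generation, a plain record of what was found."""
    return {"component": component, "cause": cause}

def assess(report, dependents):
    """Phase 4: list the components affected by the faulty one."""
    return dependents.get(report["component"], [])

def recover(report, affected, restart):
    """Phase 5: restart the faulty component and everything that depends on it."""
    for name in [report["component"], *affected]:
        restart(name)

# Hypothetical sample data: worker-2 returns a wrong result.
outputs = {"worker-1": 42, "worker-2": -1}
expected = {"worker-1": 42, "worker-2": 42}
error_log = {"worker-2": "disk read error"}
dependents = {"worker-2": ["aggregator"]}

restarted = []
for component in detect(outputs, expected):
    report = make_report(component, diagnose(component, error_log))
    recover(report, assess(report, dependents), restarted.append)

print(restarted)  # ['worker-2', 'aggregator']
```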

Types of Fault Tolerance in Distributed Systems

1. Hardware Fault Tolerance: Hardware fault tolerance involves keeping a backup
plan for hardware devices such as memory, hard disks, CPUs, and other peripheral
devices. It does not examine faults and runtime errors; it only provides hardware
backup. The two approaches used in hardware fault tolerance are fault masking and
dynamic recovery.


2. Software Fault Tolerance: Software fault tolerance uses dedicated software to
detect invalid output, runtime errors, and programming errors. It makes use of
static and dynamic methods for detecting faults and providing a solution. It also
relies on additional mechanisms such as rollback recovery and checkpoints.

3. System Fault Tolerance: System fault tolerance covers the whole system. It has
the advantage that it stores not only checkpoints but also memory blocks and
program checkpoints, and it detects errors in applications automatically. If the
system encounters any type of fault or error, it provides the mechanism required
to resolve it. This makes system fault tolerance reliable and efficient.
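
Fault masking, mentioned under hardware fault tolerance, is commonly realized by running redundant units and voting on their results, as in triple modular redundancy. The lecture does not describe a voting scheme, so the following is a minimal sketch under that assumption; the function name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(results):
    """Return the value produced by a strict majority of the redundant units."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise ValueError("no majority: too many units disagree")
    return value

# Three redundant units compute the same result; one of them is faulty.
print(majority_vote([7, 7, 9]))  # 7
```

With three units, one faulty result is masked without the caller noticing. If all three disagree, the fault cannot be masked and the sketch raises an error instead of guessing.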

Fault Tolerance Strategies

Fault tolerance strategies are essential for ensuring that distributed systems continue to
operate smoothly even when components fail. Here are the key strategies commonly
used:
• Redundancy and Replication
  ◦ Data Replication: Data is duplicated across multiple nodes or locations to
    ensure availability and durability. If one node fails, the system can still
    access the data from another node.

  ◦ Component Redundancy: Critical system components are duplicated so
    that if one component fails, others can take over. This includes redundant
    servers, network paths, or services.

• Failover Mechanisms
  ◦ Active-Passive Failover: One component (active) handles the workload
    while another component (passive) remains on standby. If the active
    component fails, the passive component takes over.

  ◦ Active-Active Failover: Multiple components actively handle workloads
    and share the load. If one component fails, others continue to handle the
    workload.


• Error Detection Techniques
  ◦ Heartbeat Mechanisms: Regular signals (heartbeats) are sent between
    components to detect failures. If a component stops sending heartbeats,
    it is considered failed.

  ◦ Checkpointing: The system's state is saved periodically so that if a failure
    occurs, the system can be restored to the last saved state.

• Error Recovery Methods
  ◦ Rollback Recovery: The system reverts to a previous state after detecting
    an error, using saved checkpoints or logs.

  ◦ Forward Recovery: The system attempts to correct or compensate for the
    failure to continue operating. This may involve reprocessing or
    reconstructing data.
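
Several of the strategies above work together in practice. The following minimal sketch combines heartbeat detection, active-passive failover, and checkpoint-based rollback recovery. The class names `Node` and `Cluster`, the timeout value, and the use of explicit timestamps instead of a real clock are all assumptions made for illustration, not part of the lecture.

```python
import copy

class Node:
    def __init__(self, name):
        self.name = name
        self.state = {"processed": 0}
        self.last_heartbeat = 0.0

    def heartbeat(self, now):
        """Record that this node signalled it was alive at time `now`."""
        self.last_heartbeat = now

class Cluster:
    """Active-passive pair with heartbeat detection and checkpoint rollback."""

    def __init__(self, active, passive, timeout):
        self.active, self.passive, self.timeout = active, passive, timeout
        self.checkpoint = copy.deepcopy(active.state)

    def save_checkpoint(self):
        """Checkpointing: keep a copy of the active node's current state."""
        self.checkpoint = copy.deepcopy(self.active.state)

    def check(self, now):
        """Fail over if the active node has been silent longer than the timeout."""
        if now - self.active.last_heartbeat > self.timeout:
            # Rollback recovery: the standby resumes from the last checkpoint.
            self.passive.state = copy.deepcopy(self.checkpoint)
            self.active, self.passive = self.passive, self.active
            self.active.heartbeat(now)
            return True
        return False

cluster = Cluster(Node("a"), Node("b"), timeout=3.0)
cluster.active.state["processed"] = 10
cluster.save_checkpoint()
cluster.active.state["processed"] = 14   # work done after the checkpoint
cluster.active.heartbeat(now=1.0)

first = cluster.check(now=2.0)    # False: only 1 s since the last heartbeat
second = cluster.check(now=5.0)   # True: 4 s of silence exceeds the 3 s timeout
print(cluster.active.name, cluster.active.state)  # b {'processed': 10}
```

Note that the four items processed after the checkpoint are lost on failover. That loss is the cost of rollback recovery, and it is why checkpoint frequency is a trade-off between overhead and lost work.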
