
Lecture 10

Parallel and Distributed Computing

Uploaded by

miraj gul

Fault Tolerance in Distributed System

Fault tolerance is the capability of a distributed system to continue operating correctly
despite failures or errors in one or more of its components. This resilience is crucial for
maintaining system reliability, availability, and consistency. By implementing strategies
like redundancy, replication, and error detection, distributed systems can handle various
types of failures while preserving service continuity and data integrity.

Three related kinds of problems occur in distributed systems. They form a chain: a fault
can cause an error, and an error can lead to a failure.
• Fault: A fault is a defect or weakness in a hardware or software component of the
  system. The presence of a fault can lead to an error and then to a failure.
• Error: An error is an incorrect result or system state caused by a fault.
• Failure: A failure is the outcome in which the system does not achieve its assigned goal.

What is Fault Tolerance?

Fault tolerance is the ability of a system to keep functioning properly even when faults
are present. Distributed systems consist of many components, so the risk that some
component develops a fault is high. Left unhandled, these faults can degrade the overall
performance of the system.

Types of Faults

• Transient Faults: Transient faults occur once and then disappear. They usually do not
  harm the system to a great extent, but they are very difficult to find or locate. A
  momentary processor fault is an example of a transient fault.

• Intermittent Faults: Intermittent faults occur repeatedly: the fault appears, vanishes
  on its own, and then reappears. A working computer that hangs from time to time is an
  example of an intermittent fault.

• Permanent Faults: Permanent faults remain in the system until the faulty component is
  replaced. They can cause very severe damage to the system but are easy to identify. A
  burnt-out chip is an example of a permanent fault.

Parallel & Distributed Computing 1 CS & IT, HU


Need for Fault Tolerance in Distributed Systems

Fault tolerance is required in order to provide the following four properties.

1. Availability: Availability is the property that the system is ready for use at any
time.

2. Reliability: Reliability is the property that the system can work continuously
without failure.

3. Safety: Safety is the property that a failure does not lead to catastrophic
consequences; for example, the system remains protected from unauthorized access even
if a failure occurs.

4. Maintainability: Maintainability is the property that describes how easily and
quickly a failed node or system can be repaired.
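
Availability and maintainability can be tied together with a small calculation. The
sketch below uses the common steady-state formula availability = MTTF / (MTTF + MTTR),
where MTTF (mean time to failure) and MTTR (mean time to repair) are terms not used in
these notes and are introduced here as assumptions. The node figures are hypothetical,
and the replicated case assumes the two nodes fail independently.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is usable."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical node: fails on average every 1000 h and takes 2 h to repair.
single = availability(1000.0, 2.0)

# Two such nodes in parallel, assuming independent failures: the service is
# unavailable only when both nodes are down at the same time.
replicated = 1.0 - (1.0 - single) ** 2

print(f"single node: {single:.6f}")
print(f"two replicas: {replicated:.6f}")
```

Shortening the repair time (better maintainability) or adding a replica (redundancy)
both raise the computed availability, which is why the two properties are usually
discussed together.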

Fault Tolerance in Distributed Systems


In order to implement the techniques for fault tolerance in distributed systems, the
design, configuration and relevant applications need to be considered. Below are the
phases carried out for fault tolerance in a distributed system.

Phases of Fault Tolerance in Distributed Systems



1. Fault Detection
Fault detection is the first phase, in which the system is monitored continuously and its
outcomes are compared with the expected output. Any faults identified during monitoring
are reported. These faults can occur for various reasons, such as hardware failure,
network failure, and software issues. The main aim of this phase is to detect faults as
soon as they occur so that the assigned work is not delayed.
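
One common way to monitor components continuously is a heartbeat timeout. The following
is a minimal sketch, not a definitive implementation: the class name HeartbeatMonitor,
the node names, and the 3-second timeout are all hypothetical, and a fake clock is
injected so that the example runs deterministically.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Suspects a node once no heartbeat has arrived within `timeout` seconds."""

    def __init__(self, timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout
        self.clock = clock
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, node: str) -> None:
        # Record the time at which this node last reported in.
        self.last_seen[node] = self.clock()

    def suspected(self) -> List[str]:
        # A node is suspected when its last heartbeat is older than the timeout.
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)


# Usage with a fake clock (assumption: time is measured in seconds).
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout=3.0, clock=lambda: fake_now[0])
monitor.heartbeat("node-a")
monitor.heartbeat("node-b")
fake_now[0] = 2.0
monitor.heartbeat("node-a")   # node-a reports again, node-b stays silent
fake_now[0] = 4.0
print(monitor.suspected())    # ['node-b']
```

A timeout can only suspect a failure: a slow network may delay heartbeats from a node
that is still healthy, so the timeout value is a trade-off between detection speed and
false alarms.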

2. Fault Diagnosis
Fault diagnosis is the phase in which a fault identified in the first phase is analyzed
to determine its root cause and nature. Diagnosis can be done manually by the
administrator or with automated techniques, so that the fault can be resolved and the
given task completed.

3. Evidence Generation
Evidence generation is the phase in which a report of the fault is prepared based on the
diagnosis done in the earlier phase. The report includes the causes of the fault, the
nature of the fault, the solutions that can be used to fix it, and the alternatives and
preventive measures that need to be considered.

4. Assessment
Assessment is the phase in which the damage caused by the fault is analyzed. The extent
of the damage can be determined with the help of the messages passed from the component
that encountered the fault. Further decisions are made based on this assessment.

5. Recovery
Recovery is the phase in which the system is made fault free and restored to a correct
state, using either forward recovery or backward (rollback) recovery. Common recovery
techniques such as reconfiguration and resynchronization can be used.
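
Backward recovery can be illustrated with checkpoints. The sketch below is an
illustration under stated assumptions, not a prescribed design: the class
CheckpointedState and the "balance" key are hypothetical, the state lives in memory
only, and the fault is simulated by raising an exception in the middle of an update.

```python
import copy
from typing import Any, Dict, List


class CheckpointedState:
    """Keeps in-memory snapshots of a state dictionary for rollback."""

    def __init__(self) -> None:
        self.state: Dict[str, Any] = {}
        self._checkpoints: List[Dict[str, Any]] = []

    def checkpoint(self) -> None:
        # Deep copy so that later mutations cannot alter the saved snapshot.
        self._checkpoints.append(copy.deepcopy(self.state))

    def rollback(self) -> None:
        # Restore the most recent snapshot; it is kept for further rollbacks.
        if not self._checkpoints:
            raise RuntimeError("no checkpoint to roll back to")
        self.state = copy.deepcopy(self._checkpoints[-1])


store = CheckpointedState()
store.state["balance"] = 100
store.checkpoint()
try:
    store.state["balance"] -= 30
    raise OSError("simulated fault in the middle of an update")
except OSError:
    store.rollback()           # discard the partial update

print(store.state["balance"])  # 100
```

A real system would write checkpoints to stable storage so that they survive a crash
of the node itself; the in-memory list here only demonstrates the control flow.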

Types of Fault Tolerance in Distributed Systems

1. Hardware Fault Tolerance: Hardware fault tolerance involves keeping a backup plan
for hardware devices such as memory, hard disks, CPUs, and other peripheral devices.
It does not examine faults and runtime errors; it only provides hardware backup. The
two approaches used in hardware fault tolerance are fault masking and dynamic
recovery.



2. Software Fault Tolerance: Software fault tolerance is a type of fault tolerance in
which dedicated software is used to detect invalid output, runtime errors, and
programming errors. It makes use of static and dynamic methods for detecting faults
and providing a solution, and it relies on additional mechanisms such as checkpoints
and rollback recovery.

3. System Fault Tolerance: System fault tolerance covers the system as a whole. It has
the advantage that it stores not only checkpoints but also memory blocks and program
checkpoints, and it detects errors in applications automatically. If the system
encounters any type of fault or error, it provides the required mechanism for the
solution. This makes system fault tolerance reliable and efficient.
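
Fault masking, named above under hardware fault tolerance, is often illustrated with
majority voting over replicated units (triple modular redundancy). The notes do not
describe this technique, so the sketch below is an assumption chosen for illustration;
the functions healthy and faulty are hypothetical stand-ins for three hardware units
computing the same result.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(outputs: Sequence[int]) -> int:
    """Return the value produced by a strict majority of the replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: too many replicas disagree")
    return value


def run_replicated(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    # Every replica computes the result independently; the voter masks a
    # minority of wrong answers so that the caller never sees them.
    return majority_vote([replica(x) for replica in replicas])


def healthy(x: int) -> int:
    return x * x


def faulty(x: int) -> int:
    return x * x + 1   # simulated faulty unit producing a wrong result


print(run_replicated([healthy, faulty, healthy], 7))  # 49
```

With three replicas the voter masks one faulty unit; if two units fail in different
ways, no majority exists and the fault is reported instead of masked.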

Fault Tolerance Strategies

Fault tolerance strategies are essential for ensuring that distributed systems continue to
operate smoothly even when components fail. Here are the key strategies commonly
used:
• Redundancy and Replication
  ◦ Data Replication: Data is duplicated across multiple nodes or locations to
    ensure availability and durability. If one node fails, the system can still
    access the data from another node.

  ◦ Component Redundancy: Critical system components are duplicated so
    that if one component fails, others can take over. This includes redundant
    servers, network paths, or services.

• Failover Mechanisms
  ◦ Active-Passive Failover: One component (active) handles the workload
    while another component (passive) remains on standby. If the active
    component fails, the passive component takes over.

  ◦ Active-Active Failover: Multiple components actively handle workloads
    and share the load. If one component fails, others continue to handle the
    workload.



• Error Detection Techniques
  ◦ Heartbeat Mechanisms: Regular signals (heartbeats) are sent between
    components to detect failures. If a component stops sending heartbeats,
    it is considered failed.

  ◦ Checkpointing: The system's state is saved periodically so that if a failure
    occurs, the system can be restored to the last saved state.

• Error Recovery Methods
  ◦ Rollback Recovery: The system reverts to a previous state after detecting
    an error, using saved checkpoints or logs.

  ◦ Forward Recovery: The system attempts to correct or compensate for the
    failure to continue operating. This may involve reprocessing or
    reconstructing data.
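
The active-passive failover strategy listed above can be sketched in a few lines. This
is a simplified illustration under stated assumptions: the classes Node and
ActivePassiveCluster and the NodeDown exception are hypothetical, failure is simulated
with a flag, and the failed request is simply retried on the promoted standby.

```python
from typing import List


class NodeDown(Exception):
    """Raised when a request reaches a node that has failed."""


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def handle(self, request: str) -> str:
        if not self.alive:
            raise NodeDown(self.name)
        return f"{self.name} served {request}"


class ActivePassiveCluster:
    """nodes[0] is the active node; the remaining nodes are passive standbys."""

    def __init__(self, nodes: List[Node]) -> None:
        self.nodes = list(nodes)

    def handle(self, request: str) -> str:
        while self.nodes:
            try:
                return self.nodes[0].handle(request)
            except NodeDown:
                # Drop the failed active node and promote the next standby.
                self.nodes.pop(0)
        raise RuntimeError("all nodes have failed")


primary, standby = Node("primary"), Node("standby")
cluster = ActivePassiveCluster([primary, standby])
print(cluster.handle("r1"))   # primary served r1
primary.alive = False         # simulated failure of the active node
print(cluster.handle("r2"))   # standby served r2
```

An active-active variant would instead spread requests across all live nodes, for
example in round-robin order, and skip any node that has failed.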

