UNIT IV
Q.N                                                                                       CO’   Bloom’s
                                        Questions                                                Level
 o                                                                                         s
      What is checkpointing in distributed systems?                                       CO5     K1
      Its a technique used to save the system's state at specific intervals, allowing
 1.   recovery from failures. It helps ensure fault tolerance by enabling the system
      to roll back to the last checkpointed state after a failure, reducing the loss of
      computation and maintaining consistency across nodes.
      What are the types of checkpoint-based rollback recovery?                           CO5     K1
      The two types of checkpoints are:
         1. Coordinated Checkpoints: All processes take checkpoints
 2.         simultaneously, ensuring a consistent global state for recovery.
         2. Uncoordinated Checkpoints: Each process takes checkpoints
            independently, which can lead to inconsistencies and may require
            multiple rollbacks to achieve a consistent state.
      Why do we need checkpoints?                                                         CO5     K1
      Checkpoints are needed to ensure fault tolerance and efficient recovery.
 3.   They allow the system to save its state periodically, enabling it to roll back
      to a previous checkpoint in case of failures, minimizing data loss and
      reducing the need for restarting processes from the beginning.
      State the use of Rollback recovery.                                                 CO5     K1
      Rollback recovery is used to restore the system to a consistent state after
 4.   a failure. It helps in minimizing data loss and reducing the need to reprocess
      tasks by allowing the system to revert to a previously saved checkpoint,
      ensuring fault tolerance and continuity of operations.
      What is coordinated checkpointing?                                                  CO5     K1
      Coordinated checkpointing is a technique where all processes coordinate to
 5.   take checkpoints at the same time. This ensures a globally consistent state
      across the system, making recovery from failures straightforward without
      the need for complex rollbacks or dependency tracking between processes.
      What is the purpose of checkpointing?                                               CO5     K1
 6.
      The purpose of checkpointing is to ensure fault tolerance and efficient
      recovery. By periodically saving the system's state, checkpointing allows
     the system to recover from failures by rolling back to the last saved state,
     reducing data loss and avoiding the need to restart processes from the
     beginning.
     What are the advantages of checkpoint?                                            CO5   K1
     The advantages of checkpoints in distributed systems include:
7.       1.   Fault Tolerance
         2.   Reduced Downtime
         3.   Efficient Resource Utilization
         4.   Consistency
         5.   Scalability
         6.   Enhanced Reliability
     What are the issues in asynchronous checkpointing?                                CO5   K1
     The issues in asynchronous checkpointing in distributed systems include:
8.       1. Inconsistent States: Since processes checkpoint independently, it
            can result in inconsistent global states, complicating recovery.
         2. Domino Effect: A failure may force multiple processes to roll back
            to earlier checkpoints, leading to significant overhead and loss of
            progress.
     What is the difference between rollback and roll-forward CO5                            K1
     recovery?
     The difference between rollback and roll-forward recovery in distributed
     systems is:
9.           Rollback Recovery: This method involves reverting the system to a
              previous state (checkpoint) after a failure, restoring the system to a
              known good condition.
             Roll-Forward Recovery: This approach involves applying logged
              actions or operations that occurred after the last consistent state to
              recover the system to a later state, effectively moving forward from
              that point.
     Define the term rollback propagation.                                             CO5   K1
    Rollback propagation refers to the process by which a failure in one process
10. leads to the need for other dependent processes to also roll back to earlier
    checkpoints. This ensures a consistent global state across the system, but can
    result in significant overhead as multiple processes may need to revert to
    previous states to maintain consistency.
      What are the two types of checkpoints.                                         CO5   K1
      Local Checkpoint
            Definition: A local checkpoint is the state saved by an individual
             process in a distributed system, allowing it to recover from failures
             by restoring its own state.
            Key Point: Each process takes checkpoints independently, which can
             lead to inconsistent global states across the system.
11.
      Global Checkpoint
            Definition: A global checkpoint is a collection of local checkpoints
             from all processes in a distributed system, capturing a consistent
             global state.
            Key Point: Ensures a coordinated recovery, avoiding inconsistencies
             like the domino effect.
      Classify the faults based on failure of components and based CO5                     K1
      on faulty components
         1. Based on Failure of Components:
               o Crash Failures: Components become completely
                  unresponsive.
12.            o Omission Failures: Components fail to respond to messages
                  while remaining operational.
         2. Based on Faulty Components:
               o Byzantine Failures: Components behave arbitrarily,
                  providing incorrect or misleading information.
               o Transient Failures: Temporary faults that may resolve
                  themselves, such as brief network outages.
      Define fault and failure.                                                      CO5   K1
            Fault: A fault refers to an underlying condition or defect in a
             component that may lead to failure, such as hardware malfunctions,
13.          software bugs, or network issues.
            Failure: A failure occurs when a fault manifests, resulting in the
             component no longer performing its intended function, leading to
             incorrect behavior or a complete loss of service.
      State Byzantine agreement problem.                                             CO5   K1
14. The Byzantine Agreement Problem involves a group of nodes needing to
    agree on a single value despite some nodes acting arbitrarily or maliciously.
    The goal is for all non-faulty nodes to reach consensus on the same value,
     ensuring reliable decision-making in the presence of Byzantine faults.
                                                                                     CO5   K1
     What is a consensus algorithm?
    A consensus algorithm is a protocol used to achieve agreement among
15. multiple nodes on a single data value or state, even in the presence of faults
    or failures. It ensures that all non-faulty nodes reach the same decision,
    maintaining consistency and reliability across the system despite potential
    network partitions or malicious actors.
                                            PART - B
     Discuss the taxonomy of rollback recovery in distributed CO5
     system to ensure data consistency and system reliability?
     Show how the dominos effect could lead to a cascaded
     rollback.
     Rollback recovery in distributed systems is essential for ensuring data
     consistency and system reliability. It can be categorized into several types
     based on how the recovery process is managed. The primary categories are:
 1                                                                                         K2
         1. Coordinated Checkpointing:
               o All processes in the distributed system synchronize to take
                  checkpoints simultaneously.
               o This method ensures a consistent global state, making
                  recovery straightforward, as all processes can revert to their
                  corresponding checkpoints without worrying about
                  dependencies.
         2. Uncoordinated Checkpointing:
               o Each process independently takes checkpoints without
                  synchronization with other processes.
               o This method can lead to inconsistencies, as different
                  processes may have checkpoints that do not align, resulting in
                what is known as the "domino effect."
       3. Message Logging:
            o This involves logging messages exchanged between processes
                to ensure that, during recovery, processes can replay messages
                to restore their states.
            o There are two main types of message logging:
                     Pessimistic Logging: Messages are logged before
                        being sent, ensuring that processes can recover their
                        state completely.
                     Optimistic Logging: Messages are logged after they
                        are sent, which may lead to the need for rollback if
                        failures occur.
       4. Hybrid Approaches:
            o Some systems may use a combination of coordinated
                checkpointing and message logging to balance the trade-offs
                between performance and consistency.
    The Domino Effect and Cascaded Rollback
    The domino effect occurs in uncoordinated checkpointing when a failure in
    one process forces it to roll back to a previous checkpoint, which in turn
    requires other dependent processes to roll back as well. This can lead to a
    cascading rollback, where multiple processes are forced to revert to earlier
    states.
2
    Elaborate in detail about Log based Recovery system with its CO5               K2
    types.
Types:
    What are the different types of message may be generated on CO5
    the process of rollback. Explain each with space time diagram
    Rollback recovery treats a distributed system application as a collection of
    processes that communicate over a network. It achieves fault tolerance by
    periodically/ saving the state of a process during the failure-free execution,
    enabling it to restart from a saved state upon a failure to reduce the amount
3   of lost work. The saved state is called a checkpoint, and the procedure of       K2
    restarting from a previously checkpointed state is called rollback recovery
    A process failure and subsequent recovery may leave messages that were
    perfectly received (and processed) before the failure in abnormal states. This
    is because a rollback of processes for recovery may have to rollback the send
    and receive operations of several messages.
    Outline the direct dependency tracking technique .Give the                     CO5
    pros and cons of un-coordinated checkpointing.
    Direct Dependency Tracking (DDT) Technique
    The Direct Dependency Tracking (DDT) technique is used in distributed
    systems to detect causal dependencies between processes during message
    passing. The technique focuses on recording causal dependencies to aid in
    rollback recovery, ensuring that system states can be recovered consistently
    after a failure. The primary goal is to track dependencies and manage
    consistent global states without relying on synchronization.
4                                                                                        K2
    Steps in Direct Dependency Tracking:
       1. Message Passing: Each process in the distributed system sends and
      receives messages. These messages create causal dependencies
      between the processes.
   2. Dependency Recording: Each process maintains a log of the
      messages it sends and receives, which includes information on other
      processes it is causally dependent on.
   3. Tracking Dependencies: Instead of storing all previous states, only
      the causal dependencies (directly involved processes) are tracked.
      This reduces the storage overhead.
   4. Recovery: When a failure occurs, the system identifies the last
      consistent global state based on the dependency logs and rolls back to
      that point.
Benefits of DDT:
      Efficiency: Reduces the amount of state information that needs to be
       tracked compared to other methods.
      Scalability: Scales well with large distributed systems since only
       direct dependencies are tracked rather than maintaining a full global
       state.
      Faster Recovery: Recovery is often faster because only the
       processes that are causally dependent on the failed process need to be
       rolled back.
Limitations of DDT:
      Complexity in Failure Scenarios: Handling concurrent failures or
       complex communication patterns can be difficult.
      State Explosion: In some systems with high message passing rates,
       the number of dependencies may grow rapidly, increasing tracking
       overhead.
Uncoordinated Checkpointing
Uncoordinated checkpointing refers to a fault tolerance mechanism where
each process in a distributed system independently takes checkpoints at
different times without coordinating with other processes. If a process fails,
it can roll back to its last checkpoint. However, to ensure consistent global
recovery, all processes may need to roll back to a consistent set of
checkpoints across the system.
Pros of Uncoordinated Checkpointing:
   1. Low Overhead: Since there is no need for coordination, processes
      can take checkpoints based on their local state, reducing
      synchronization delays.
   2. Asynchronous Operation: Processes can continue to execute
      independently, which is beneficial for systems with high concurrency
      and where synchronization could introduce bottlenecks.
       3. Flexibility: Each process can determine its checkpointing frequency
          based on its workload and resource availability, allowing
          optimization for individual process requirements.
    Cons of Uncoordinated Checkpointing:
       1. Risk of Cascading Rollbacks (Domino Effect): One of the biggest
          drawbacks is the possibility of cascading rollbacks. If a failure
          occurs, the system may need to roll back multiple processes, leading
          to the "domino effect," where all processes might need to roll back to
          their initial state.
       2. Checkpointing Complexity: Determining a consistent global state
          after a failure becomes complex, as processes may have taken
          checkpoints at different times, leading to inconsistent snapshots.
       3. Message Log Overhead: To ensure consistency during recovery, a
          system must log all messages to reapply them after rolling back. This
          leads to additional storage and performance overhead.
       4. Increased Recovery Time: The process of determining a consistent
          recovery point can be time-consuming, especially when many
          processes are involved, increasing the overall recovery time.
    Discuss in detail about Consensus and Agreement Algorithms.
    Describe the Byzantine General Problem in detail.
5                                                                                  CO5   K2
Byzantine Generals Problem
   1. Scenario: Byzantine generals need to agree on a common plan
      (attack or retreat), but some generals may be traitors.
   2. Goal: Loyal generals must reach consensus on the same action,
      despite possible traitorous behavior.
   3. Challenge: Traitorous generals can send conflicting or false
      messages, making it difficult to agree.
   4. Communication: Generals can only communicate via messages, no
      direct knowledge of who is loyal or traitorous.
   5. Conditions:
          o All loyal generals must agree on the same action.
          o The system must tolerate some generals acting arbitrarily or
              maliciously.
Problem Requirements
   1. Consistency: All loyal nodes (generals) should agree on the same
          decision.
       2. Fault Tolerance: The system should work correctly even if some
          nodes (generals) act maliciously.
       3. Traitor Detection: It is difficult to distinguish between faulty and
          malicious nodes.
    Solution Concepts
       1. Byzantine Fault Tolerance (BFT): An agreement protocol designed
          to handle Byzantine failures.
       2. Majority Voting: Loyal generals forward messages to others and use
          majority voting to reach a consensus.
       3. Tolerance Limit: Byzantine consensus is possible if fewer than one-
          third of the nodes are faulty.
    Key Algorithms
       1. Lamport, Shostak, and Pease Algorithm:
             o Requires n > 3f generals (nodes) to tolerate f faults.
       2. Practical Byzantine Fault Tolerance (PBFT):
             o Requires n = 3f + 1 nodes to tolerate up to f faults.
             o Used in modern systems like blockchain (Hyperledger,
                 Tendermint).
    Challenges
       1. Communication Complexity: High number of messages exchanged
          for consensus.
       2. Fault Detection: Hard to differentiate between a faulty and malicious
          node.
       3. Scalability: BFT protocols become complex and resource-intensive
          as the number of nodes increases.
    Applications
          Blockchain Technology: BFT is critical in decentralized systems like
           Bitcoin and Ethereum to handle malicious actors.
          Distributed Systems: Ensures reliable consensus even in the
           presence of arbitrary faults or malicious behavior.
    Explain the two popular flavours of the Byzantine agreement
6   problem.Illustrate Lamport-Shostak-Pease Algorithm with a CO5                 K2
    neatdiagram,
                                          PART – C
    Show how would you utilize the Koo-Toueg coordinated CO5
    Checkpointing algorithm in maintaining system consistency
    during failures in a distributed environment?
    The Koo-Toueg Coordinated Checkpointing Algorithm ensures system
    consistency during failures in distributed systems by coordinating the
    checkpointing of processes without requiring all of them to checkpoint
    simultaneously. It handles in-transit messages and ensures a consistent
1   global state.                                                              K3
    Key Concepts:
          Coordinator Process: A process responsible for managing the
           checkpointing protocol.
          In-Transit Messages: Messages sent but not yet received, which
           need to be logged to ensure consistency.
          Checkpoint: A saved state of a process, allowing recovery after a
       failure.
The Koo-Toueg Algorithm operates as follows:
      Step 1: Checkpoint Initiation: One node, known as the
   coordinator, initiates the checkpointing process by broadcasting a
   checkpoint request to all other nodes in the distributed system.
      Step 2: Request Propagation: Upon receiving the
   checkpoint request, each node propagates it to all its neighbors
   to ensure that every node is aware of the checkpointing process.
      Step 3: Freezing Application Execution: Nodes halt their
   application processes temporarily to ensure that no new
   messages are processed during the checkpointing phase. This
   ensures consistency in the captured state.
      Step 4: State Recording: Each node records its local state,
   including memory contents, register values, and process states,
   to stable storage.
      Step 5: Message Logging: Nodes log all messages sent and
   received during the checkpointing period to maintain a
   comprehensive record of system communication.
      Step 6: Acknowledgment: After recording their states, nodes
   send acknowledgments back to the coordinator to indicate that
   they have completed their checkpointing process.
      Step 7: Coordinator Confirmation: The coordinator waits to
   receive acknowledgments from all nodes. Once all
   acknowledgments are received, the coordinator confirms that the
   checkpointing process is complete, and the system can resume
   normal operation.
Handling In-Transit Messages:
The key feature of the Koo-Toueg algorithm is its ability to handle in-transit
messages. Without accounting for these messages, checkpoints could be
inconsistent. The algorithm ensures that:
      Sent but unreceived messages are logged.
      These messages are replayed during recovery to ensure that no data
       or state is lost.
Advantages of Koo-Toueg Coordinated Checkpointing:
   1. Consistency: By coordinating checkpoints and handling in-transit
      messages, the system ensures that all processes recover to a globally
      consistent state.
   2. Minimal Overhead: The algorithm reduces the overhead of
      checkpointing by allowing processes to continue without halting until
      absolutely necessary.
   3. Avoids Domino Effect: Since the algorithm coordinates checkpoints,
      it avoids the need for cascading rollbacks (domino effect), ensuring
           efficient recovery.
    Applications use cases of koo-toueg algorithm
           Distributed Databases: Ensuring data consistency and
        reliability in distributed database systems.
           Scientific Computing: Facilitating fault-tolerant computations
        in distributed scientific applications.
           Real-time Systems: Supporting fault recovery and
        continuous operation in real-time distributed systems.
           High-Performance Computing: Enabling coordinated
        checkpointing in parallel and distributed computing
        environments.
    Identify whether the dominos effect is presented if the process CO5
    P3 fails as shown in the below given model of distributed
    computation. If so track the path of the dominos effect
2                                                                            K3