Unit 4
Failures and Their Classification:
Definition: A failure in a distributed database system refers to any event that disrupts the normal
operation of the system, resulting in the loss of data consistency, availability, or reliability.
Types of Failures:
 - Hardware Failures: Failures in physical components such as servers, disks, or network devices.
 - Software Failures: Errors or bugs in the software components of the database system, such as the
database management system (DBMS) or applications.
 - Network Failures: Communication failures or network outages that prevent data transmission
between distributed nodes.
  - Site Failures: Failures that affect an entire site or data center, resulting in the loss of access to all
resources hosted at that location.
 - Media Failures: Physical damage or corruption to storage media, such as disks or tapes, leading to
data loss or corruption.
Classification:
  - Transient Failures: Temporary failures that can be recovered from quickly, such as a network
glitch or a brief power outage.
  - Permanent Failures: Irreversible failures that require more extensive recovery procedures, such
as hardware failures or data corruption.
__________________________________________________________________________________
Checkpoints and Recovery:
1. Checkpoints:
  - Definition: Checkpoints are predefined moments in time when the state of a distributed database
system is saved to stable storage, allowing recovery to a consistent state after a failure.
 - Purpose: Checkpoints help reduce the amount of work needed during recovery by providing a
consistent starting point.
 - Types:
   - Periodic Checkpoints: Scheduled at regular intervals to save the current state of the system.
   - Forced Checkpoints: Triggered manually or automatically in response to specific events, such as
transaction commits or system checkpoints.
2. Recovery:
 - Definition: Recovery in a distributed database system involves restoring the system to a
consistent state after a failure occurs.
 - Phases:
  - Analysis: Identifying the transactions that were in progress at the time of failure and
determining the necessary actions for recovery.
   - Undo: Reverting the effects of incomplete transactions by rolling them back to their pre-failure
state.
  - Redo: Reapplying the effects of committed transactions that were lost due to the failure.
 - Techniques:
   - Backward Recovery: Reverting to a previous consistent state and replaying transactions from
that point forward.
   - Forward Recovery: Applying recovery actions directly to the current state of the system without
reverting to a previous state.
3. Recovery Protocols:
  - Two-Phase Commit (2PC): Ensures atomicity and durability of distributed transactions by
coordinating commit or rollback decisions among participating nodes.
 - Three-Phase Commit (3PC): Enhances the reliability of 2PC by introducing a prepare phase to
handle failure scenarios more robustly.
Process Resilience
Definition: Process resilience refers to the ability of a system or application to continue functioning
despite failures or disruptions.
Fault Tolerance:
 - Redundancy: Introducing duplicate processes or components to ensure continued operation if
one fails.
 - Failure Detection: Detecting failures quickly to initiate recovery processes.
  - Recovery Mechanisms: Implementing strategies such as checkpointing and rollback to recover
from failures.
Techniques -
Replication: Running multiple instances of a process on different nodes to tolerate failures.
  - Isolation: Isolating individual processes to prevent failures from propagating to other
components.
 - Graceful Degradation: Prioritizing essential functions to maintain basic functionality during failure
conditions.
Challenges:
 - Overhead: Replication and recovery mechanisms can introduce overhead in terms of resources
and performance.
 - Consistency: Ensuring consistency across replicated processes while maintaining performance.
  - Complexity: Designing and managing resilient systems can be complex and require careful
planning.
__________________________________________________________________________________
Reliable Client-Server Communication:
Definition: Reliable client-server communication ensures that data is transmitted accurately and in
the correct order between clients and servers, even in the presence of failures or network issues.
Techniques
  - Acknowledgments: Using acknowledgments to confirm successful receipt of data and
retransmitting if necessary.
 - Sequence Numbers: Assigning sequence numbers to data packets to ensure correct ordering.
 - Timeouts and Retransmissions: Setting timeouts to detect lost packets and retransmitting them if
no acknowledgment is received.
Protocols:
 - TCP (Transmission Control Protocol): Provides reliable, connection-oriented communication with
mechanisms such as acknowledgment, retransmission, and flow control.
 - HTTP (Hypertext Transfer Protocol): Built on top of TCP, it ensures reliable transfer of web data
between clients and servers.
 - RPC (Remote Procedure Call): Provides reliable communication between distributed systems by
abstracting procedure calls over the network.
4. Challenges:
 - Performance: Ensuring reliability without sacrificing performance can be challenging.
 - Overhead: Adding reliability mechanisms can increase network overhead and latency.
  - Scalability: Maintaining reliability in large-scale distributed systems with many clients and servers
can be complex.
_____________________________________________________________________
Reliable Group Communication:
Definition: Reliable group communication ensures that messages are delivered to all members of a
group in a consistent and ordered manner, even in the presence of failures or network partitions.
Techniques
 - Total Order: Ensuring that messages are delivered to all group members in the same order.
 - View Synchronization: Keeping group members synchronized to detect failures and maintain
consistency.
 - Membership Management: Handling dynamic changes in group membership due to joins, leaves,
or failures.
3. Protocols:
 - IP Multicast: Allows for one-to-many communication by sending packets to a group of destination
hosts.
  - Paxos: A consensus protocol used to ensure agreement among a group of nodes in a distributed
system.
 - Virtual Synchrony: Maintains a consistent view of the group by synchronizing membership
changes and message delivery.
4. Challenges:
 - Scalability: Ensuring reliable group communication in large-scale distributed systems with many
members.
 - Fault Tolerance: Handling failures and network partitions while maintaining consistency.
 - Complexity: Designing and implementing reliable group communication protocols can be complex
and require careful consideration of various factors.
Mechanism for commit and recovery in distributed Database system
Ans: In distributed database systems, the Two-Phase Commit (2PC) protocol is commonly used for
commit, and recovery is often facilitated by techniques such as logging and checkpoints.
Two-Phase Commit Protocol:
1. Prepare Phase:
  - The coordinator (typically the transaction manager) sends a prepare request to all participants
(resource managers) involved in the transaction.
 - Each participant responds with either a "yes" (vote to commit) or "no" (vote to abort).
 - If any participant votes "no" (indicating it cannot commit the transaction), the coordinator
proceeds to the abort phase.
2. Commit Phase:
 - If all participants vote "yes" in the prepare phase, the coordinator sends a commit request to all
participants.
  - Upon receiving the commit request, each participant performs the commit operation, making the
transaction's changes permanent.
 - After successfully committing, the participant acknowledges the coordinator.
3. Abort Phase:
  - If any participant votes "no" in the prepare phase or if the coordinator times out waiting for
responses, the coordinator sends an abort request to all participants.
 - Upon receiving the abort request, each participant rolls back the transaction, undoing any
changes made by the transaction.
 - After successfully aborting, the participant acknowledges the coordinator.
Recovery Mechanisms: Logging and Checkpoints
1. Logging:
  - Logging involves recording all changes made by transactions to a log file before they are applied
to the database.
  - During recovery, the log is replayed to redo committed transactions or undo aborted
transactions, bringing the system to a consistent state.
 - Write-Ahead Logging (WAL) is a common logging protocol where changes are written to the log
before being applied to the database to ensure durability.
2. Checkpoints:
 - Checkpoints involve periodically saving the system state to stable storage.
  - During recovery, the system can roll back to the last checkpoint and replay the log from that point
to recover transactions committed after the checkpoint.
  - Checkpoints help reduce the time and resources required for recovery by providing a consistent
starting point.