Distributed Systems
    ICS 230
    Prof. Nalini Venkatasubramanian
    (with slides/animations adapted
    from Prof. Ghosh, University of Iowa
    and Prof. Indranil Gupta, UIUC and
    Prof. Ken Birman, Cornell Univ.)
                Fundamentals
● What is a fault?
   ● A fault is a blemish, weakness, or shortcoming of a
     particular hardware or software component.
   ● Faults, errors, and failures
● Why fault tolerance?
  ● Availability, reliability, dependability, …
● Characterizing faults, fault tolerance and limits
● How to provide fault tolerance?
   ● Replication
   ● Checkpointing and message logging
   ● Hybrid
Reliability
    ● Reliability is an emerging and critical concern in
      traditional and new settings
      ● Transaction processing, mobile applications, cyberphysical
        systems
    ● Newer technology makes devices vulnerable to
      errors due to high complexity and high integration
      ● Technology scaling causes problems
         ● Exponential increase of the soft error rate
      ● Mobile/pervasive applications run close to humans
         ● E.g., failure of a healthcare device can have serious consequences
      ● Redundancy techniques incur high overheads of power and
        performance
         ● TMR (Triple Modular Redundancy) may exceed 200% overheads
           without optimization [Nieuwland, 06]
    ● Challenging to optimize multiple properties (e.g.,
      performance, QoS, and reliability)
Classification of failures
[Figure: failures classified by layer – application-level failures (bugs, packet loss) and hardware-level failures (soft errors).]
Hardware Errors and Error Control Schemes
Failures: Soft errors, hard failures, system crash
Causes: External radiation, thermal effects, power loss, poor design, aging
Metrics: FIT, MTTF, MTBF
Traditional approaches: Spatial redundancy (TMR, Duplex, RAID-1, etc.) and data redundancy (EDC, ECC, RAID-5, etc.)
◻ Hardware failures are increasing as technology scales
     (e.g.) SER increases by up to 1000 times [Mastipuram, 04]
◻ Redundancy techniques are expensive

•FIT: Failures in Time (errors per 10^9 hours)
•MTTF: Mean Time To Failure
•MTBF: Mean Time Between Failures
•TMR: Triple Modular Redundancy
•EDC: Error Detection Codes
•ECC: Error Correction Codes
•RAID: Redundant Array of Inexpensive Drives
[Figure: a particle strike flips a stored bit (1 → 0) in a transistor – a soft error; annotations contrast an MTTF of 1 month with an MTTF of 5 hours.]
     Soft errors
Configuration | SER (FIT) | MTTF | Reason
1 Mbit @ 0.13 µm | 1000 | 10^4 years |
64 MB @ 0.13 µm | 64x8x1000 | 81 days | High integration
128 MB @ 65 nm | 2x1000x64x8x1000 | 1 hour | Technology scaling and twice the integration
A system @ 65 nm | 2x2x1000x64x8x1000 | 30 minutes | Memory takes up ~50% of soft errors in a system
A system with voltage scaling @ 65 nm | 100x2x2x1000x64x8x1000 | 18 seconds | Exponential relationship b/w SER & supply voltage
A system with voltage scaling @ 65 nm in flight (35,000 ft) | 800x100x2x2x1000x64x8x1000 | 0.02 seconds | High intensity of neutron flux at high altitude

Soft Error Rate (SER) – FIT (Failures in Time) = number of errors in 10^9 hours
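To make the arithmetic behind the table concrete, here is a minimal sketch (not from the slides) that converts a failure rate in FIT to MTTF; the per-Mbit rate of 1000 FIT is the table's own assumption:

```python
# MTTF from SER: FIT = failures per 10^9 device-hours, so
# MTTF (hours) = 10^9 / FIT.

def mttf_hours(fit: float) -> float:
    """Mean time to failure, in hours, for a failure rate given in FIT."""
    return 1e9 / fit

# 64 MB = 64 x 8 Mbit at ~1000 FIT per Mbit (0.13 um technology):
fit_64mb = 64 * 8 * 1000
print(mttf_hours(fit_64mb) / 24)     # ~81 days, as in the table

# 128 MB @ 65 nm, with a ~1000x SER increase from technology scaling:
fit_128mb = 2 * 64 * 8 * 1000 * 1000
print(mttf_hours(fit_128mb))         # ~1 hour, as in the table
```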
Software Errors and Error Control Schemes
Failures: Wrong outputs, infinite loops, crash
Causes: Incomplete specification, poor software design, bugs, unhandled exceptions
Metrics: Number of bugs per kilo-line, QoS, MTTF, MTBF
Traditional approaches: Spatial redundancy (N-version programming, etc.) and temporal redundancy (checkpoints and backward recovery, etc.)
◻    Software errors become dominant as system complexity increases
       (e.g.) several bugs per thousand lines of code
◻    Hard to debug, and redundancy techniques are expensive
       (e.g.) Backward recovery with checkpoints is inappropriate for real-time applications
Memory leak
  Processes fail to entirely free up the physical memory that has
 been allocated to them. This effectively reduces the size of the
 available physical memory over time. When this becomes
 smaller than the minimum memory needed to support an
 application, it crashes.
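A hypothetical sketch of the pattern just described: an unbounded cache keeps references alive indefinitely, so the memory allocated for them is never freed (the names here are illustrative, not from the slides):

```python
_cache = {}  # grows forever: nothing ever evicts old entries

def handle_request(request_id: int, payload: bytes) -> bytes:
    # Bug: every request adds an entry that is never removed, so the
    # process's resident memory grows until the application crashes.
    _cache[request_id] = payload
    return payload
```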
Masking tolerance.
Application runs as it is. The failure does not have a visible impact.
All properties (both liveness & safety) continue to hold.
Non-masking tolerance.
Safety property is temporarily affected, but not liveness.
 Fail-safe tolerance
 Given safety predicate is preserved, but liveness may be affected
 Example. Due to failure, no process can enter its critical section for
 an indefinite period. In a traffic crossing, failure changes the traffic in
 both directions to red.
Graceful degradation
Application continues, but in a “degraded” mode. Much depends on
what kind of degradation is acceptable.
N-Version Programming
● Voter mechanism
● Tolerates some software bugs
[Figure: programs i and j, written independently by programmers K and L, run in parallel; the voter masks a fault in one version.]
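A minimal sketch of the voter mechanism, assuming the N versions have already produced their results (illustrative only; real N-version systems vote over richer outputs):

```python
from collections import Counter

def majority_vote(results):
    """Return the value produced by a strict majority of versions;
    masks a faulty minority (e.g., 1 buggy version out of 3)."""
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(results):
        raise RuntimeError("no majority: too many versions disagree")
    return value

# The version written by one programmer returns a wrong answer; it is masked.
print(majority_vote([42, 42, 41]))   # -> 42
```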
3) Error-Control Coding
● Replication is effective but expensive
● Error-Detection Coding and Error-Correction Coding
   ● (example) Parity Bit, Hamming Code, CRC
● Much less redundancy than replication
[Figure: producer A sends data to a consumer through an error-control layer that detects/corrects a fault in the data.]
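As a small illustration of error-detection coding, here is an even-parity sketch: a single redundant bit detects (but cannot correct or locate) any single-bit error:

```python
def add_parity(bits):
    """Append an even-parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]

def parity_ok(word):
    """True iff the word still has even parity."""
    return sum(word) % 2 == 0

word = add_parity([1, 0, 1, 1])
assert parity_ok(word)
word[2] ^= 1                 # a single bit flips in transit
assert not parity_ok(word)   # the flip is detected
```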
Concept: Consensus
Reaching Agreement is a fundamental problem in distributed
computing
●Mutual Exclusion
    ● processes agree on which process can enter the critical section
●Leader Election
    ● processes agree on which is the elected process
●Totally Ordered Multicast
    ● the processes agree on the order of message delivery
●Commit or Abort in distributed transactions
●Reaching agreement about which processes have failed
●Other examples
    ●   Air traffic control system: all aircraft must have the same view
    ●   Spaceship engine control – action from multiple control processes ("proceed" or "abort")
    ●   Two armies should decide consistently to attack or retreat.
                    Defining Consensus
N processes
    ● Every process contributes a value
    ● Goal: To have all processes decide on the same (some) value
        ● Once made, the decision cannot be changed.
Each process p has
   ●    input variable x_p: initially either 0 or 1
   ●    output variable y_p: initially b (b = undecided) – can be changed only once
●   Termination: every non-faulty process must eventually decide.
●   Agreement: the final decision of every non-faulty process must be identical.
●   Integrity: the decided value must have been proposed by some process.
●   Validity: if every non-faulty process proposes the same value v, then their final decision must be v.
●   Non-triviality: there is at least one initial system state that leads to each of the all-0's or all-1's outcomes.
Variant of Consensus Problem
Consensus in a Synchronous System
 ● Possible
    ● With one or more faulty processes
Consensus with at most f failures:
Synchronous Systems – Possible to achieve!

For a system with at most f processes crashing:
      - All processes are synchronized and operate in "rounds" of time
      - The algorithm proceeds in f+1 rounds (with timeout), using reliable communication to
          all members. Round length >> max transmission delay.
      - Values_i^r: the set of proposed values known to p_i at the beginning of round r.

After f+1 rounds, all non-faulty processes will have received the same set of values.
Proof by contradiction.
● Assume that two non-faulty processes, say p_i and p_j, differ in their final set of values (i.e., after
    f+1 rounds). Then in each of the f+1 rounds some process must have crashed while relaying the
    missing value, requiring f+1 crashes – contradicting the assumption of at most f failures.
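A sketch of the round structure, assuming a helper exchange() that reliably broadcasts a set to all members and gathers the sets received in that round (the helper and the min-based decision rule are illustrative assumptions, not part of the slides):

```python
def flooding_consensus(my_value, f, exchange):
    """Run f+1 synchronous rounds; in each round, broadcast the set of
    values known so far and union in the sets received. After f+1
    rounds all non-faulty processes hold the same set, so a
    deterministic rule (here: min) yields the same decision everywhere."""
    values = {my_value}
    for round_number in range(f + 1):
        received_sets = exchange(values)   # assumed reliable broadcast + gather
        for s in received_sets:
            values |= s
    return min(values)
```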
    Asynchronous Consensus
Sam → Jill: "Jill, the weather is beautiful! Let's meet at the sandwich stand outside."
Sam had better send an Ack
Sam → Jill: "Jill, the weather is beautiful! Let's meet at the sandwich stand outside."
New and improved protocol
How Sam and Jill's romance ended
"Yup…" "Got that…" … "Maybe tomorrow?"
No finite number of acknowledgments can make both of them certain – the same coordinated-attack problem faced by the two armies.
   Things we just can’t do
•   http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf
                MORE DETAILS LATER!!
The Saddest Moment
http://scholar.harvard.edu/files/mickens/files/thesaddestmoment.pdf
Failure detection
[Figure: a group of processes (0–7); when one fails, the others must detect it.]
● Completeness – guaranteed: every failure is eventually detected
● Accuracy – only partial/probabilistic guarantees: mistaken detections may occur
● Speed
  ● Time to first detection of a failure (time until some process detects the failure)
● Scale
  ● Equal load on each member
  ● Network message load
  ● No bottlenecks / single point of failure
Detection of crash failures
 Failures can be detected using heartbeat messages
 (a periodic "I am alive" broadcast) and timeouts:
     - if processor speed has a known lower bound, and
     - channel delays have a known upper bound.
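A minimal timeout-based detector sketch along these lines (the class and its API are illustrative, not a standard library):

```python
import time

class HeartbeatDetector:
    """Suspects a process if no heartbeat arrives within `timeout` seconds.
    Accurate only when processing speed and channel delay are bounded,
    as the conditions above require."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen = {}   # pid -> time of last heartbeat

    def on_heartbeat(self, pid: str) -> None:
        self.last_seen[pid] = time.monotonic()

    def suspected(self, pid: str) -> bool:
        last = self.last_seen.get(pid)
        return last is None or time.monotonic() - last > self.timeout
```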
Centralized Heartbeating
[Figure: every process p_j sends heartbeats with an incrementing sequence number to a single central process p_i.]
☹ Hotspot
☹ Unpredictable on simultaneous multiple failures
❖   Replication Transparency
    User/client need not know that multiple physical copies of data exist.
❖ Replication Consistency
    Data is consistent on all of the replicas (or is converging towards
     becoming consistent)
Replication Management
❖   Request Communication
    ❖ Requests can be made to a single RM or to multiple RMs
❖ Coordination: The RMs decide
    ❖ whether the request is to be applied
    ❖ the order of requests
      ❖ FIFO ordering: If a FE (front end) issues r then r', then any correct RM handles r and
            then r'.
        ❖ Causal ordering: If the issue of r “happened before” the issue of r’, then
            any correct RM handles r and then r’.
        ❖ Total ordering: If a correct RM handles r and then r’, then any correct RM
            handles r and then r’.
[Figure: group communication among RMs – membership management (join, leave, fail) combined with multicast communication (group send, group-address expansion).]
Defining Consensus (recap)
N processes
    ● Every process contributes a value
    ● Goal: To have all processes decide on the same (some) value
        ● Once made, the decision cannot be changed.
Each process p has
   ●    input variable x_p: initially either 0 or 1
   ●    output variable y_p: initially b (b = undecided) – can be changed only once
more variants: Stoppable Paxos, Vertical Paxos, Egalitarian Paxos, …

The PAXOS Algorithm
-- Towards a Practical Approach to Consensus
•   http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf
Paxos Strategy
Phase 1 – Election
•   A potential leader chooses a unique ballot id, higher than anything it has seen so far
•   It sends the ballot id to all processes
•   Processes wait, responding once to the highest ballot id
     –   If a potential leader sees a higher ballot id, it can't be a leader
     –   Paxos is tolerant to multiple leaders, but we'll only discuss the 1-leader case
     –   Processes also log the received ballot id on disk
•   If a process has in a previous round decided on a value v', it includes value v' in its response
•   If a majority (i.e., a quorum) respond OK, then you are the leader
     –   If no one has a majority, start a new round
•   (If things go right) A round cannot have two leaders (why?)
Paxos Strategy
Phase 2 – Proposal
•   The leader sends its proposed value v to all processes, tagged with the ballot id ("Value v ok?")
•   If some process reported a previously decided value v' in Phase 1, the leader must propose v' instead
•   Recipients log the proposal on disk and respond OK
Paxos Strategy
Phase 3 – Decision
  • If the leader hears a majority of OKs, it lets everyone know of the decision ("v!")
  • Recipients receive the decision and log it on disk
  • Consensus is reached
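A compressed, single-decree sketch of the three phases, with acceptors as in-process objects (an illustration under simplifying assumptions, not Lamport's full protocol: real Paxos exchanges messages and logs ballot ids and accepted values to disk before replying):

```python
class Acceptor:
    def __init__(self):
        self.promised = -1    # highest ballot id promised so far
        self.accepted = None  # (ballot, value) last accepted, or None

    def prepare(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", self.accepted)
        return None           # reject lower ballots

    def accept(self, ballot, value):
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return "accepted"
        return None

def propose(acceptors, ballot, value):
    # Phase 1 - election: need promises from a majority.
    promises = [a.prepare(ballot) for a in acceptors]
    oks = [p for p in promises if p is not None]
    if 2 * len(oks) <= len(acceptors):
        return None           # lost the election; retry with a higher ballot
    # If any acceptor already accepted a value, re-propose the one
    # with the highest ballot instead of our own.
    prior = [acc for _, acc in oks if acc is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2 - proposal: need accepts from a majority.
    acks = [a.accept(ballot, value) for a in acceptors]
    if 2 * sum(ack is not None for ack in acks) <= len(acceptors):
        return None
    # Phase 3 - decision: announce the chosen value to the learners.
    return value

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, ballot=1, value="A"))   # -> "A"
print(propose(acceptors, ballot=2, value="B"))   # -> "A" (already chosen)
```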
Paxos Protocol Implementation – Terms
●   Proposer
     ●   Suggests values for consideration by Acceptors.
     ●   Advocates for a client.
●   Acceptor
     ●   Considers the values proposed by proposers.
     ●   Renders an accept/reject decision.
●   Learner
     ●   Learns the chosen value.
●   Proposal
     ●   An alternative proposed by a proposer.
     ●   Consists of a unique number and a proposed value, e.g. (42, B).
●   We say a value is chosen when consensus is reached on that value.
●   In practice, each node will usually play all three roles.
Strong Majority
●   "Strong Majority" / "Quorum"
     ●   A set of acceptors consisting of more than half of all acceptors.
●   Any two quorums have a nonempty intersection.
●   Helps avoid the "split-brain" problem, in which acceptors' decisions are not in agreement; a common node acts as "tie-breaker."
●   In a system with 2F+1 acceptors, F acceptors can fail and we'll be OK.
[Figure: quorums in a system with seven acceptors (A1–A7).]
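A quick illustrative check of the quorum-intersection property for seven acceptors: any two quorums of size 4 must overlap, since 4 + 4 > 7:

```python
from itertools import combinations

acceptors = range(7)   # 2F+1 = 7, so F = 3 failures are tolerated
quorums = [set(q) for q in combinations(acceptors, 4)]

# Every pair of quorums shares at least one "tie-breaker" acceptor.
assert all(q1 & q2 for q1 in quorums for q2 in quorums)
```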
Consensus
[Figure: timeline – acceptor A1 accepts proposals (N1, V1) and (N5, V3); acceptor A4 accepts (N3, V3).]
●    P2c. For any v and n, if a proposal with value v and number n is issued, then there is a set
     S consisting of a majority of acceptors such that either:
      ●   no acceptor in S has accepted any proposal numbered less than n, or
      ●   v is the value of the highest-numbered proposal among all proposals numbered less
          than n accepted by the acceptors in S.
Basic Paxos Algorithm
Phase 1a: "Prepare"
Select a proposal number N and send a prepare(N) request to a
quorum of acceptors.

[Message trace: proposer P1, acceptors A1–A3]
P1 → acceptors: prepare(1)
acceptors → P1: promise(1, -)          (no proposal accepted yet)
P1 → acceptors: accept(1, A)
acceptors → P1: accepted(1, A)

[Continued: a second proposer P2 runs a higher-numbered round]
P2 → acceptors: prepare(2)
acceptors → P2: promise(2, -) or promise(2, (1, A))
P2 → acceptors: accept(2, A)           (P2 must re-propose A, the highest-numbered accepted value)
acceptors → P2: accepted(2, A)
Other Considerations
●     Liveness
       ●   Can't be guaranteed in general.
       ●   Distinguished Proposer: all proposals are funneled through one node.
       ●   Can re-elect on failure.
●     Learning the Chosen Value
       ●   Acceptors notify some set of learners upon acceptance.
       ●   Distinguished Learner
●     A node may play the role of both distinguished proposer and
      distinguished learner – we call such a node the master.
●   Byzantine Failures
●   Message Logging
●   Checkpointing
●   Backwards vs. Forwards Error Recovery
Backward vs. forward error recovery

Backward error recovery
When the safety property is violated, the computation rolls
back and resumes from a previous correct state.
[Figure: timeline with a rollback to an earlier correct state.]
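A minimal sketch of backward recovery via checkpoints (illustrative; real systems checkpoint to stable storage and must also coordinate snapshots across processes):

```python
import copy

class CheckpointedProcess:
    """Snapshot state at safe points; on a detected safety violation,
    roll back and resume from the last correct snapshot."""

    def __init__(self, state):
        self.state = state
        self._snapshot = copy.deepcopy(state)

    def checkpoint(self):
        self._snapshot = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._snapshot)

p = CheckpointedProcess({"balance": 100})
p.checkpoint()
p.state["balance"] = -50   # a fault violates the safety invariant
p.rollback()               # backward recovery: resume from correct state
assert p.state["balance"] == 100
```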
 Forward error recovery
 Computation does not care about getting the history right, but
 moves on, as long as eventually the safety property is restored.
 True for self-stabilizing systems.
Message Logging
Byzantine Generals Problem
[Figure: a commander orders "attack" while a traitor relays "retreat"; in both scenarios shown, the loyal lieutenants must agree on what the commander says.]
EXTRA SLIDES
Linearizability
❖ Let the sequence of read and update operations
   that client i performs in some execution be o_i1,
   o_i2, ….
   ❖ "Program order" for the client
❖ A replicated shared object service is linearizable if
   for any execution (real), there is some interleaving
   of operations (virtual) issued by all clients that:
   ❑ meets the specification of a single correct copy of objects
   ❑ is consistent with the real times at which each operation occurred during
      the execution
❑ Main goal: any client will see (at any point of time) a copy
   of the object that is correct and consistent
Sequential Consistency
❖ The real-time requirement of linearizability is hard, if not
    impossible, to achieve in real systems
❖     A less strict criterion is sequential consistency: A replicated
    shared object service is sequentially consistent if for any execution
    (real), there is some interleaving of clients' operations (virtual) that:
    ❑ meets the specification of a single correct copy of the objects
    ❑ is consistent with the program order in which each individual client executed its operations
  Proof Setup
● For the impossibility proof, it is OK to consider
1. a more restrictive system model, and
2. an easier problem
     • Why is this OK? If consensus is impossible even under these restrictions, it is impossible in any more general setting.
Network
[Figure: processes p and p' communicate through a global message buffer (the "Network"); send(p', m) places a message in the buffer, and receive(p') returns a message addressed to p' or may return null.]
 States
● State of a process
● Configuration = global state: a collection of states, one for each
  process, alongside the state of the global buffer.
● Each Event (different from Lamport events) is atomic and
  consists of three steps
    • receipt of a message by a process (say p)
    • processing of message (may change recipient’s state)
    • sending out of all necessary messages by p
● Schedule: sequence of events
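A small sketch of how these definitions might be modeled (the types and fields are illustrative assumptions, not part of the FLP paper):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Event:
    """Atomic step: process p receives message m, processes it
    (possibly changing p's state), and sends out resulting messages."""
    p: str
    m: object

@dataclass
class Configuration:
    """Global state: one local state per process, plus the message buffer."""
    states: dict
    buffer: set = field(default_factory=set)

# A schedule is simply a sequence of events applied to a configuration.
schedule = [Event("p1", "m1"), Event("p2", "m2")]
```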
Configuration
[Figure: event e' = (p', m') takes configuration C to C'; event e'' = (p'', m'') takes C' to C''. Applying the schedule s = (e', e'') to C is equivalent – it also yields C''.]
Lemma 1
If schedules s1 and s2 involve disjoint sets of receiving processes and are each applicable on configuration C, then they commute: applying s1 then s2 reaches the same configuration as applying s2 then s1.
[Figure: C → s1 → C' → s2 and C → s2 → C'' → s1, both ending in the same configuration.]
Easier Consensus Problem
Some process eventually sets y_p to be 0 or 1.
 What the FLP proof shows
 What we’ll show
Lemma 3
Start from a bivalent initial configuration. Let e = (p, m) be some event applicable to the initial configuration, and let C be the set of configurations reachable without applying e. The lemma shows that applying e to the configurations in C yields at least one bivalent configuration.