Coordination and Agreement
Introduction
Fundamental issue: for a set of processes, how to coordinate their actions
or agree on one or more values,
even when there is no fixed master-slave relationship between the components
Further issue: how to take failures into account when designing
algorithms
Topics covered
mutual exclusion
how to elect one of a collection of processes to perform a special role
multicast communication
agreement problems: consensus and Byzantine agreement
Failure Assumptions and Failure Detectors
Failure assumptions of this chapter
Reliable communication channels
Processes fail only by crashing, unless stated otherwise
Failure detector: object/code in a process that detects failures of other
processes
unreliable failure detector
One of two values: unsuspected or suspected
i.e., evidence of possible failures, not proof
Example: most practical systems
each process sends an "alive/I'm here" message to everyone else
if no "alive" message arrives from a process within a timeout, that process is suspected
a suspected process may in fact function correctly, e.g., when the network is partitioned
reliable failure detector
One of two accurate values: unsuspected or failed – few practical systems achieve this
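To make the heartbeat scheme concrete, here is a minimal Python sketch of an unreliable failure detector; the class and method names are illustrative, not from any particular library:

    import time

    class UnreliableFailureDetector:
        """Heartbeat-based detector; its output is a hint, never proof."""
        def __init__(self, timeout_secs):
            self.timeout = timeout_secs
            self.last_heard = {}              # peer id -> time of last heartbeat

        def on_heartbeat(self, peer):
            # Called whenever an "alive/I'm here" message arrives from peer.
            self.last_heard[peer] = time.monotonic()

        def status(self, peer):
            # "suspected" after a silent timeout; the peer may still be fine
            # (slow network, partition), hence only *suspected*.
            last = self.last_heard.get(peer)
            if last is None or time.monotonic() - last > self.timeout:
                return "suspected"
            return "unsuspected"

A peer behind a network partition looks exactly like a crashed one, which is why this detector can only be unreliable.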
12.2 Distributed Mutual Exclusion
Process coordination in a multitasking OS
Race condition: several processes access and manipulate the same data
concurrently, and the outcome of the execution depends on the particular
order in which the accesses take place
critical section: when one process is executing in a critical section, no other
process is to be allowed to execute in its critical section
Mutual exclusion: If a process is executing in its critical section, then no
other processes can be executing in their critical sections
Distributed mutual exclusion
Provide critical region in a distributed environment
message passing
for example, locking files via the lockd daemon in UNIX (NFS is stateless, with no
file locking at the NFS level)
Algorithms for mutual exclusion
Problem: an asynchronous system of N processes
processes don't fail
message delivery is reliable; processes do not share variables
only one critical region
application-level protocol: enter(), resourceAccesses(), exit()
Requirements for mutual exclusion
Essential
[ME1] safety: only one process at a time
[ME2] liveness: requests to enter and exit eventually succeed
Additional
[ME3] happened-before ordering: entry is granted in the happened-before
ordering of the enter() requests
Performance evaluation
overhead and bandwidth consumption: # of messages sent
client delay incurred by a process at entry and exit
throughput, measured by synchronization delay: the delay between one process's exit
and the next process's entry
A central server algorithm
server keeps track of a token---permission to enter critical region
a process requests the server for the token
the server grants the token if it has the token
a process can enter if it gets the token, otherwise waits
when done, a process sends release and exits
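A minimal single-machine sketch of the server's bookkeeping in Python (illustrative names; a real server would receive these calls as messages and send grants back):

    from collections import deque

    class CentralLockServer:
        """Grants one token; queues requests while the token is out."""
        def __init__(self):
            self.holder = None                # process holding the token
            self.queue = deque()              # waiting processes, FIFO

        def request(self, pid):
            """Returns True if the token is granted immediately."""
            if self.holder is None:
                self.holder = pid
                return True
            self.queue.append(pid)            # wait for a later grant
            return False

        def release(self, pid):
            """Holder returns the token; returns the next holder, if any."""
            assert self.holder == pid
            self.holder = self.queue.popleft() if self.queue else None
            return self.holder

Grants follow the server's arrival order rather than the happened-before order of the requests, which foreshadows why HB ordering fails below.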
A central server algorithm: discussion
Properties
safety, why?
liveness, why?
HB ordering not guaranteed, why?
Performance
enter overhead: two messages (request and grant)
enter delay: time between request and grant
exit overhead: one message (release)
exit delay: none
synchronization delay: between release and grant
centralized server is the bottleneck
A ring-based algorithm
Arrange processes in a logical ring to rotate a token
A process waits for the token if it wants to enter the critical section
The ring could be unrelated to the physical configuration
pi sends messages to p(i+1) mod N
when a process wants to enter the critical section, it waits for the token
when a process holds the token
if it wants to enter the critical section, it enters; on exit, it releases the token and sends it to its neighbour
if it doesn't, it immediately forwards the token to its neighbour
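A toy Python sketch of the circulating token, with message passing collapsed into a loop (the setup is hypothetical):

    # Four processes in a ring; the token visits them in ring order. A
    # process uses the token only if it currently wants the critical section.
    N = 4
    wants_cs = {0: False, 1: True, 2: False, 3: True}

    def circulate(token_at, rounds):
        for _ in range(rounds * N):
            if wants_cs[token_at]:
                print(f"p{token_at} enters, then exits, the critical section")
                wants_cs[token_at] = False    # done; release on exit
            token_at = (token_at + 1) % N     # forward to p_(i+1) mod N

    circulate(0, 1)                           # p1 and p3 each enter once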
An algorithm using multicast and logical clocks
Multicast a request message for entry (Ricart and Agrawala [1981])
enter only after all the other processes have replied
totally-ordered timestamps: <T, pi >
Each process keeps a state: RELEASED, HELD, WANTED
if all have state = RELEASED, all reply, and the requester can enter
if a process has state = HELD, it doesn't reply until it exits
if more than one process has state = WANTED, the process with the lowest
timestamp gets all N-1 replies first
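The core of the algorithm is the reply rule at each receiver. A Python sketch with illustrative names (send_reply is assumed to be supplied by the messaging layer):

    class RicartAgrawala:
        """Sketch of the reply rule at one process."""
        def __init__(self, pid, send_reply):
            self.pid = pid
            self.send_reply = send_reply   # callable(peer_id)
            self.state = "RELEASED"        # RELEASED, WANTED, or HELD
            self.request_ts = None         # Lamport timestamp of own request
            self.deferred = []             # requests to answer on exit

        def on_request(self, msg_ts, sender):
            # Defer while holding the CS, or while wanting it with an earlier
            # totally ordered timestamp <T, pid>; otherwise reply at once.
            if self.state == "HELD" or (
                self.state == "WANTED"
                and (self.request_ts, self.pid) < (msg_ts, sender)
            ):
                self.deferred.append(sender)
            else:
                self.send_reply(sender)

        def on_exit(self):
            # Leaving the CS: answer every deferred request.
            self.state = "RELEASED"
            for peer in self.deferred:
                self.send_reply(peer)
            self.deferred.clear()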
An algorithm using multicast: discussion
Properties
safety, why?
liveness, why?
HB ordering, why?
Performance
bandwidth consumption: no token circulating continuously
entry overhead: 2(N-1), why? [with multicast support: 1 + (N -1) = N]
entry delay: delay between request and getting all replies
exit overhead: 0 to N-1 messages
exit delay: none
synchronization delay: delay for 1 message (one last reply from the previous holder)
Maekawa's voting algorithm
•Observation: not all peers need to grant access
it suffices to obtain permission from subsets that overlap for any two processes
•Maekawa's approach
voting sets Vi, Vj for processes Pi, Pj
Pi ∈ Vi, Pj ∈ Vj
Vi ∩ Vj ≠ ∅: there is at least one common member
|Vi| = K for all i; to be fair, each process's voting set has the same size
Pi cannot enter the critical section until it has received all K reply messages
Choosing a subset
simple way (|Vi| ≈ 2√N): place the processes in a √N by √N matrix and let Vi be the
union of the row and the column containing Pi
Deadlock is possible: if P1, P2 and P3 concurrently request entry to the critical
section, each process may have received one reply (its own) out of two, and
none can proceed; the algorithm was adapted to be deadlock-free by [Sanders 1987]
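A Python sketch of the row-plus-column construction, assuming for simplicity that N is a perfect square:

    import math

    def voting_set(i, n):
        """V_i = union of p_i's row and column in a sqrt(N) x sqrt(N) grid,
        so |V_i| ~ 2*sqrt(N) and any two sets share at least one member."""
        k = math.isqrt(n)
        assert k * k == n, "sketch assumes N is a perfect square"
        row, col = divmod(i, k)
        members = {row * k + c for c in range(k)}     # p_i's row
        members |= {r * k + col for r in range(k)}    # p_i's column
        return members

    # Any two voting sets overlap: one set's row crosses the other's column.
    assert voting_set(0, 16) & voting_set(15, 16) == {3, 12}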
Elections
Election: choosing a unique process for a particular role
All the processes agree on the unique choice
For example, the server in distributed mutual exclusion
Assumptions
each process can call only one election at a time; multiple concurrent
elections can be called by different processes
Participant: a process that engages in an election
each process pi has a variable electedi = ⊥ ("don't know") initially
the process with the largest identifier wins; the (unique) identifier could be any useful value
Properties
[E1] safety: electedi of a participant process must be P (the elected process
with the largest id) or ⊥ (undefined)
[E2] liveness: all processes participate and eventually set electedi != ⊥ (or crash)
Performance
overhead (bandwidth consumption): # of messages
turnaround time: # of messages to complete an election
A ring-based election algorithm
Arrange processes in a logical ring
pi sends messages to p(i+1) mod N
the ring could be unrelated to the physical configuration
elect as coordinator the process with the largest id
assume no failures
Initially, every process is a non-participant. Any process can call an election:
it marks itself as participant
places its id in an election message
sends the message to its neighbour
Receiving an election message:
if id > myid, forward the msg, mark participant
if id < myid
non-participant: replace id with myid, forward the msg, mark participant
participant: stop forwarding (why? later, multiple elections)
if id = myid, coordinator found, mark non-participant, electedi := id, send
elected message with myid
Receiving an elected message:
if id != myid, mark non-participant, electedi := id, forward the msg
if id = myid, stop forwarding
Figure 12.7 A ring-based election in progress
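A toy Python sketch of one election circulating around the ring (no failures; the ids need not be sorted around the ring):

    # ids in ring order; the election message carries the largest id seen.
    ids = [3, 17, 4, 9]                    # p0 -> p1 -> p2 -> p3 -> p0
    N = len(ids)

    def run_election(starter):
        token = ids[starter]               # starter puts its id in the message
        at = (starter + 1) % N
        while token != ids[at]:            # until a process sees its own id
            token = max(token, ids[at])    # smaller ids are replaced
            at = (at + 1) % N
        return token                       # that process becomes coordinator

    assert run_election(0) == 17           # the elected message then circulates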
A ring-based election algorithm: discussion
•Properties
safety: only the process with the largest id can send an elected message
liveness: every process in the ring eventually participates in the election;
extra elections are stopped
Performance
one election, best case, when?
N election messages
N elected messages
turnaround: 2N messages
one election, worst case, when?
2N - 1 election messages
N elected messages
turnaround: 3N - 1 messages
can't tolerate failures, not very practical
The bully election algorithm
•Assumption
– Each process knows which processes have higher identifiers, and that it can
communicate with all such processes
•Compare with ring-based election
– Processes can crash and be detected by timeouts
• assumes a synchronous system
• timeout T = 2Ttransmitting (max transmission delay) + Tprocessing (max
processing delay)
•Three types of messages
– Election: announce an election
– Answer: in response to Election
– Coordinator: announce the identity of the elected process
The bully election algorithm: howto
• A process starts an election when it detects that the coordinator has failed, or
when it decides to replace a coordinator that has a lower identifier
– it sends an election message to all processes with higher ids and waits for answers
(except from the failed coordinator/process)
• If no answer arrives within time T
– it considers itself the coordinator
– it sends a coordinator message (with its id) to all processes with lower ids
• else
– it waits for a coordinator message, and starts a new election if a timeout T' expires first
– to become coordinator, a process has to start an election
• A higher-id process can replace the current coordinator (hence "bully")
– the highest one directly sends a coordinator message to all processes with lower
identifiers
• Receiving an election message
– sends an answer message back
– starts an election if it hasn't started one: it sends election messages to all
higher-id processes
(including the "failed" coordinator, which might be up again by now)
• Receiving a coordinator message
– set electedi to the new coordinator
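A Python sketch of one process's part in the bully algorithm; the send callback and the timer that triggers announce() after timeout T are assumed to be provided by the surrounding system:

    class BullyNode:
        """One process's bully-election bookkeeping (illustrative API)."""
        def __init__(self, pid, all_ids, send):
            self.pid = pid
            self.higher = [q for q in all_ids if q > pid]
            self.lower = [q for q in all_ids if q < pid]
            self.send = send               # callable(dest, msg_type, payload)
            self.electing = False
            self.elected = None

        def start_election(self):
            self.electing = True
            if not self.higher:
                self.announce()            # highest id bullies directly
            else:
                for q in self.higher:      # ask every higher-id process
                    self.send(q, "election", self.pid)
                # ...then wait time T; if no "answer" arrives, call announce()

        def announce(self):
            self.elected = self.pid
            for q in self.lower:
                self.send(q, "coordinator", self.pid)

        def on_election(self, sender):
            self.send(sender, "answer", self.pid)   # suppress the lower bid
            if not self.electing:
                self.start_election()               # and bid ourselves

        def on_coordinator(self, sender):
            self.electing = False
            self.elected = sender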
The bully election algorithm: discussion
Properties
safety: a lower-id process always yields to a higher-id process
however, safety is not guaranteed if processes that have crashed are replaced
by processes with the same identifier, since message delivery order might not
be guaranteed and failure detection might be unreliable
liveness: all processes participate and know the coordinator at the end
Performance
best case: when?
overhead: N-2 coordinator messages
turnaround delay: no election/answer messages
Multicast Communication
Group (multicast) communication: each of a group of processes receives
copies of the messages sent to the group, often with delivery guarantees
on the set of messages that every process of the group should receive
on the delivery ordering across the group members
Challenges
Efficiency concerns include minimizing overhead activities and increasing
throughput and bandwidth utilization
Delivery guarantees ensure that operations are completed
Types of group
Static or dynamic: whether joining or leaving is allowed
Closed or open: a group is closed if only members of the group can multicast to it
Reliable Multicast
Basic multicast (B-multicast): send a message to every process that is a
member of a defined group
B-multicast(g, m): for each process p ∈ group g, send(p, m)
On receive(m) at p: B-deliver(m) at p
Reliable multicast (R-multicast) requires these properties:
Integrity: a correct process delivers a message at most once, and only if it
was multicast to the group
Validity: if a correct process multicasts a message, it will eventually deliver it
Agreement: if a message is delivered to a correct process, all other correct
processes in the group will deliver it
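A Python sketch of the standard construction of R-multicast on top of B-multicast: each receiver re-multicasts a message the first time it sees it, so if any correct process delivers m, every correct process eventually does (b_multicast and deliver are assumed callbacks):

    class RMulticast:
        """Reliable multicast over best-effort B-multicast (sketch)."""
        def __init__(self, self_id, b_multicast, deliver):
            self.self_id = self_id
            self.b_multicast = b_multicast # callable(group, packet)
            self.deliver = deliver         # hand message to the application
            self.received = set()          # ids of messages already seen

        def r_multicast(self, group, msg, msg_id):
            self.b_multicast(group, (msg, msg_id, self.self_id))

        def on_b_deliver(self, group, packet):
            msg, msg_id, sender = packet
            if msg_id in self.received:
                return                     # integrity: deliver at most once
            self.received.add(msg_id)
            if sender != self.self_id:     # relay before delivering, so all
                self.b_multicast(group, packet)   # correct processes get it
            self.deliver(msg)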
Types of message ordering
Three types of message ordering
– FIFO (first-in, first-out) ordering: if a correct process sends one message
before another, every correct process that delivers both delivers the first
message before the other
– Causal ordering: any correct process that delivers the second of two causally
related messages delivers the first message first
– Total ordering: if a correct process delivers a message before another, any
other correct process that delivers the second message delivers the first
message first
•Note that
– FIFO ordering and causal ordering are only partial orders
– not all messages are sent by the same sending process
– some multicasts are concurrent, so they cannot be ordered by happened-before
– total order demands consistency, but not any particular order
Figure 12.12 Total, FIFO and causal ordering of multicast messages
Notice
the consistent ordering of totally ordered messages T1 and T2,
the FIFO-related messages F1 and F2 and
the causally related messages C1 and C3 and
the otherwise arbitrary delivery ordering of messages
Note that T1 and T2 are delivered in opposite order to the physical time of
message creation
Bulletin board example (FIFO ordering)
• A bulletin board such as Web Board at NJIT illustrates the desirability of
consistency and FIFO ordering. A user can best refer to preceding messages if
they are delivered in order. Message 25 in Figure 12.13 refers to message 24,
and message 27 refers to message 23.
• Note that Web Board additionally permits messages to begin threads by replying
to a particular message; thus messages do not have to be displayed in the same
order they were delivered
Implementing total ordering
• The normal approach to total ordering is to assign totally ordered identifiers to
multicast messages, using the identifiers to make ordering decisions.
• One possible implementation is to use a sequencer process to assign identifiers
(see Figure 12.14). A drawback of this is that the sequencer can become a bottleneck.
• An alternative is to have the processes collectively agree on identifiers. A
simple algorithm is shown in Figure 12.15.
Figure 12.15 The ISIS algorithm for total ordering
Each process q in group g keeps:
• Aq^g: the largest agreed sequence number it has observed so far for group g
• Pq^g: its own largest proposed sequence number
Algorithm for process p to multicast a message m to group g:
1. p B-multicasts <m, i> to g, where i is a unique identifier for m
2. Each process q replies to the sender p with a proposed agreed sequence
number Pq^g := Max(Aq^g, Pq^g) + 1
3. p collects all the proposed sequence numbers and selects the largest one, a, as
the next agreed sequence number. It then B-multicasts <i, a> to g
4. Each process q in g sets Aq^g := Max(Aq^g, a) and attaches a to the message
identified by i
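A Python sketch of the per-process bookkeeping for steps 2-4, including the hold-back queue that delays delivery until a message has its final agreed number and that number is the smallest outstanding (names are illustrative):

    class IsisProcess:
        """Sequence-number bookkeeping at one process q (sketch)."""
        def __init__(self):
            self.agreed = 0                # Aq^g: largest agreed number seen
            self.proposed = 0              # Pq^g: own largest proposal
            self.hold_back = {}            # msg id -> (msg, number, final?)

        def on_message(self, msg_id, msg):
            # Step 2: propose Max(A, P) + 1 and hold the message back.
            self.proposed = max(self.agreed, self.proposed) + 1
            self.hold_back[msg_id] = (msg, self.proposed, False)
            return self.proposed           # reply sent to the original sender

        def on_agreed(self, msg_id, a):
            # Step 4: adopt the agreed number a chosen by the sender.
            self.agreed = max(self.agreed, a)
            msg, _, _ = self.hold_back[msg_id]
            self.hold_back[msg_id] = (msg, a, True)
            return self.deliverable()

        def deliverable(self):
            # Deliver in agreed order: pop messages while the smallest-numbered
            # one is final (ties would be broken by sender id in a full version).
            out = []
            while self.hold_back:
                mid, (msg, num, final) = min(self.hold_back.items(),
                                             key=lambda kv: kv[1][1])
                if not final:
                    break
                out.append(msg)
                del self.hold_back[mid]
            return out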
Implementing causal ordering
• Causal ordering using vector timestamps (Figure 12.16)
– Only orders multicasts, and ignores one-to-one messages between processes
– Each process updates its vector timestamp before delivering a message, to
maintain the count of preceding messages
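A Python sketch of the hold-back rule for multicasts from other processes (a process delivers its own multicasts directly); names are illustrative and the deliver callback is assumed given:

    class CausalOrder:
        """Causally ordered multicast via vector timestamps (sketch)."""
        def __init__(self, self_id, n, deliver):
            self.i = self_id
            self.V = [0] * n               # V[j]: multicasts delivered from p_j
            self.deliver = deliver
            self.hold_back = []

        def next_timestamp(self):
            # Called when this process multicasts: count its own message.
            self.V[self.i] += 1
            return list(self.V)

        def on_receive(self, j, ts, msg):
            # j != self.i: own messages are delivered directly by the caller.
            self.hold_back.append((j, ts, msg))
            self._drain()

        def _drain(self):
            progressed = True
            while progressed:              # one delivery may unblock others
                progressed = False
                for entry in list(self.hold_back):
                    j, ts, msg = entry
                    ready = ts[j] == self.V[j] + 1 and all(
                        ts[k] <= self.V[k] for k in range(len(ts)) if k != j
                    )
                    if ready:              # all causal predecessors delivered
                        self.hold_back.remove(entry)
                        self.V[j] += 1
                        self.deliver(msg)
                        progressed = True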
Consensus and related problems
• Problems of agreement
– For processes to agree on a value (consensus) after one or more of the processes
has proposed what that value should be
– Covered topics: Byzantine generals, interactive consistency, totally ordered
multicast
• The Byzantine generals problem: deciding whether multiple armies should
attack or retreat, assuming that united action will be more successful than
some attacking and some retreating
• Another example: spaceship controllers deciding whether to proceed or
abort. Failure handling during consensus is a key concern
• Assumptions
– communication (by message passing) is reliable
– processes may fail
• Sometimes up to f of the N processes are faulty
Consensus Process
1. Each process pi begins in an undecided state and proposes a single value vi,
drawn from a set D (i=1…N)
2. Processes communicate with each other, exchanging values
3. Each process then sets the value of a decision variable di and enters the decided
state
Requirements for Consensus
• Three requirements of a consensus algorithm
– Termination: Eventually every correct process sets its decision variable
– Agreement: The decision value of all correct processes is the same: if pi and pj
are correct and have entered the decided state, then di=dj
(i,j=1,2, …, N)
– Integrity: If the correct processes all proposed the same value, then any correct
process in the decided state has chosen that value
The byzantine generals problem
• Problem description
– Three or more generals must agree to attack or to retreat
– One general, the commander, issues the order
– Other generals, the lieutenants, must decide to attack or retreat
– One or more generals may be treacherous
• A treacherous general tells one general to attack and another to retreat
• Difference from consensus is that a single process supplies the value to agree
on
• Requirements
– Termination: eventually each correct process sets its decision variable
– Agreement: the decision variable of all correct processes is the same
– Integrity: if the commander is correct, then all correct processes agree on the
value that the commander has proposed (but the commander need not be
correct)
The interactive consistency problem
• Interactive consistency: all correct processes agree on a vector of values, one
for each process. This is called the decision vector
– Another variant of consensus
• Requirements
– Termination: eventually each correct process sets its decision variable
– Agreement: the decision vector of all correct processes is the same
– Integrity: if pi is correct, then all correct processes decide on pi's value as
the i-th component of their vector
Relating consensus to other problems
• Consensus (C), Byzantine Generals (BG), and Interactive Consistency (IC)
are all problems concerned with making decisions in the context of arbitrary
or crash failures
• We can sometimes generate solutions for one problem in terms of another.
For example:
– we can derive IC from BG by running BG N times, once for each process,
with that process acting as commander
– we can derive C from IC by running IC to produce a vector of values at each
process, then applying a function to the vector's values to derive a single value
– we can derive BG from C:
• the commander sends its proposed value to itself and each remaining process
• all processes run C with the values they receive
• they derive the BG decision from the vector of C values
Consensus in a Synchronous System
• Up to f of the N processes may have crash failures; the algorithm proceeds in
f+1 rounds
• During each round, each of the correct processes multicasts its values to the
others
• The algorithm guarantees that all surviving correct processes are in a position to
agree
• Note: any algorithm that tolerates up to f crash failures requires at least f+1
rounds to guarantee agreement
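A toy Python simulation of the f+1-round exchange with no actual crashes; a real run must also handle processes that crash mid-multicast, which is exactly what the extra rounds protect against:

    # Each round, every process multicasts the values it knows; after f+1
    # rounds every correct process holds the same set, so a deterministic
    # rule such as min() yields the same decision everywhere.
    def run(initial, f):
        values = [{v} for v in initial]        # value set held by each process
        for _ in range(f + 1):
            union = set().union(*values)       # one all-to-all exchange round
            values = [v | union for v in values]
        return [min(v) for v in values]        # identical decisions

    assert run([3, 1, 2], f=1) == [1, 1, 1]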
Limits for solutions to Byzantine Generals
• Some cases of the Byzantine generals problem have no solution
– Lamport et al. showed that with only 3 processes, one of them faulty, there is
no solution
– Pease et al. generalized this: if the total number of processes is at most three
times the number of failures (N <= 3f), there is no solution
• Thus there is a solution with 4 processes and 1 failure, using two rounds
– in the first, the commander sends the values
– in the second, each lieutenant sends the values it received
Figure 12.20 Four Byzantine generals
Asynchronous Systems
• The preceding solutions to consensus and Byzantine generals problems apply
only to synchronous systems
• Fischer et al. showed that in an asynchronous system there is no guaranteed
solution, even with just one (crash) failure
• This impossibility is circumvented in practice by masking faults or by using
failure detectors
• There is also a partial solution, assuming an adversary process, based on
introducing random values into the processes to prevent an effective thwarting
strategy; it does not always reach consensus